1 Introduction

Fruits are essential to a healthy lifestyle due to their high concentration of vitamins [1] and essential fiber [2]. The nutrients in fruits offer various health benefits, such as disease prevention, which is why people consume them daily to promote well-being [3, 4]. The growing demand for fruits therefore necessitates continuous production and supply. Currently, most fruits are still harvested manually because this task requires the knowledge and experience of seasoned orchard workers. The manual harvesting process often results in mistakes, omissions, and damage to fruits. Consequently, this method of harvesting has increased costs throughout the agriculture industry [5] and has also aggravated yield losses [6]. In the agricultural sector, modern technology, such as harvesting robots, can help farmers overcome these challenges and increase productivity. In recent years, computer vision technology has been applied to fruit harvesting to detect and locate produce more efficiently [7,8,9,10].

The computer vision system first captures raw image data through sensors or cameras, then applies feature extraction [11], machine learning [12,13,14], or deep learning techniques [15,16,17] to segment and detect fruit in the images. Once the target has been detected, its location becomes the input to the control system. The manipulator then moves along the planned trajectory and ultimately grasps the target with the end-effector. The control system then receives the next input signal to guide the manipulator to the next target.
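As a conceptual illustration of this perceive-plan-act cycle, the following minimal sketch outlines the loop in Python-style pseudocode; every object and function name (camera, detector, planner, arm, gripper and their methods) is a hypothetical placeholder rather than the interface of any particular robot.

```python
# Minimal sketch of the perceive-plan-act harvesting cycle described above.
# Every name below (camera, detector, planner, arm, gripper) is a hypothetical
# placeholder, not the API of a specific harvesting robot.
def harvest_cycle(camera, detector, planner, arm, gripper):
    while True:
        image = camera.capture()                      # raw sensor data
        detections = detector.detect_fruits(image)    # segment / detect fruit
        if not detections:
            break                                     # no remaining targets
        target = detections[0]                        # next fruit to harvest
        pose = detector.localize(target, image)       # target position for control
        trajectory = planner.plan(arm.current_pose(), pose)
        arm.follow(trajectory)                        # move along planned path
        gripper.grasp_and_detach(target)              # end-effector separates fruit
```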

Researchers have developed various types of harvesting robots that offer a novel approach to intelligently harvesting fruits. However, the unpredictability of their performance, low efficiency, and high costs currently prevent large-scale adoption and the replacement of skilled orchard workers. Robotic grasping remains inaccurate and inefficient for several reasons: fruit obscured by branches and leaves in unstructured orchard environments [18]; environmental factors such as illumination changes, wind, rain, and contact with leaves that interfere with robot operation [19]; and poor color differentiation between the orchard background and the fruit [20, 21].

To overcome these externally imposed disturbances, researchers have increased the accuracy of fruit recognition and localization, as well as the performance of vision-based control techniques, to improve the overall performance of robotic harvesting [22]. In this paper, the techniques proposed by researchers are broadly classified into the following three categories:

(1) Traditional image processing techniques based on low-level features such as color, texture, and shape [23];

(2) Classification algorithms based on machine learning, such as the Bayesian classifier algorithm [24], Support Vector Machine (SVM) algorithm [25], K-Nearest Neighbors (KNN) clustering algorithms [26], and so on;

(3) Object detection algorithms based on deep learning, such as Convolutional Neural Networks (CNN) [27,28,29], Faster Regions with Convolutional Neural Networks (Faster R-CNN) [30,31,32], the You Only Look Once (YOLO) network [33,34,35,36], the Fully Convolutional Network (FCN) [37,38,39], the Single Shot Multi-Box Detector (SSD) network [40,41,42], etc.

It is worth noting that deep learning has become the favored technique for contemporary agricultural researchers to identify and detect fruit, replacing conventional machine learning algorithms. Conventional machine learning algorithms require a manual feature extractor to extract low-level features such as color, texture, and shape from image data, which is highly complex and time-consuming [43]. Data and algorithms are interdependent in computer vision, and deep neural networks require large amounts of high-quality data. Insufficient high-quality data makes it difficult to obtain an ideal model, even with advanced training algorithms. Various sensors are used for image data acquisition, such as black-and-white, RGB, hyperspectral, multispectral, and thermal cameras. Most researchers prefer RGB cameras for image acquisition, which provide only 2D location information without a mapping to real-world coordinates [44,45,46,47,48]. To obtain depth information for fruit localization, researchers measure depth indirectly through binocular stereo-vision methods or physical means [44]. Previous studies have reported the development of vision-based control technology for robotic harvesting; however, the low success rate of fruit recognition, inefficient localization, and inaccurate control limit the performance of harvesting robots. Therefore, a review of vision-based sensing and control technology is necessary to promote further development of harvesting robots.

This paper presents a methodical review of recent vision-based research on harvesting robots, intending to propose solutions for target recognition and hand–eye coordination control. To provide a comprehensive overview of the topic, the subsequent sections of this paper are organized as follows: Sect. 2 introduces the key components of harvesting robots, and Sect. 3 discusses fruit detection and identification techniques in detail. Furthermore, Sect. 4 presents localization methods for fruits and their associated sensors, while Sect. 5 focuses on vision control techniques for the harvesting robot. The challenges and future trends of harvesting robots are highlighted in Sect. 6. Finally, Sect. 7 provides conclusions.

2 Key Components of Harvesting Robots

Fruit detection, positioning, and separation are three fundamental tasks that harvesting robots must execute [45]. The robotic system employs sensors to collect environmental data, which is used to identify and locate the target fruit. The robot control scheme then utilizes this data to maneuver the manipulator to the cutting point of the fruit for harvesting [46,47,48]. In addition, the machine vision system recognizes and locates the fruit, enabling precise control of the movements. Ultimately, the manipulator and end-effector operate in tandem to detach and harvest the fruit [49]. This section describes these primary components of harvesting robots. Figures 1 and 2 highlight the three central technical components of harvesting robots in laboratory and natural environments, respectively.

Fig. 1
figure 1

Harvesting robots in the laboratory: A Tomato harvesting robot (adapted from [50]); B and C Apple harvesting robots (adapted from [51]); D Sweet-pepper harvesting robot (adapted from [52])

Fig. 2
figure 2

Harvesting robots in the external natural environment: A Strawberry harvesting robot (adapted from [53]); B Cherry-tomato harvesting robot (adapted from [54]); C Guava harvesting robot (adapted from [55]); D Apple harvesting robot (adapted from [56])

2.1 Machine Vision System

Identifying the target fruit is the primary task of a harvesting robot, and passing the position information of the detected fruit to the robot control system is a key step in robot movement. The unstructured ambient conditions of orchards, as well as the occlusion of fruit by tree canopies, pose major challenges for machine vision systems.

Over the past few decades, researchers have developed and deployed various machine vision-based methods for fruit detection [57,58,59]. Standard image identification systems are highly sensitive to light fluctuations and require costly sensors to produce high-quality images. As deep learning applications in agriculture continue to expand, researchers have focused on exploring and validating novel algorithms to tackle these issues [60,61,62]. Studies have utilized diverse sensing techniques, e.g., monocular cameras, binocular stereo vision cameras, RGB-D cameras, and ground laser scanners, to provide depth information and accurately locate the fruit. Refer to Sects. 3 and 4 for detailed insights into methods for fruit detection and localization.

2.2 Manipulators

Researchers have conducted extensive studies in the fields of manipulator path planning and motion planning. Maneuvering the manipulator toward the target fruit while finding the optimal cutting direction for the end-effector is a significant challenge because of the complexity and variety of the environment, including branches, leaves, and other obstructions surrounding the fruit, as well as strong winds that can arise at any time [63,64,65]. The number of degrees of freedom and the joint types (rotational or prismatic) of a manipulator significantly affect its kinematic flexibility, obstacle avoidance, and the workspace required to reach the desired position and orientation of the end-effector. Recently, researchers have utilized vision-based, servo-controlled manipulators for motion planning, enabling them to harvest fruits while avoiding damage to the surroundings and navigating around obstacles [50]. According to the design requirements of the robot arm, the harvesting manipulator can be decomposed into individual degrees of freedom, where the linkage parameters and joint position parameters determine six degrees of freedom (rotation, translation, and slewing) [66]. The trajectory of the manipulator arm is obtained by interpolating in joint space; by transforming the coordinate systems of the six joints, the motion is mapped into the Cartesian coordinate system of the workspace. The Lagrangian method is then used to solve for the forces borne by the six joints during motion, yielding the relationship between the forces exerted on the joints of the robotic arm and the joint parameters.
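The joint-space to workspace mapping mentioned above can be illustrated with standard Denavit-Hartenberg (DH) forward kinematics. The sketch below is generic: the DH parameters and joint angles are arbitrary placeholder values, not those of any harvesting arm discussed in the cited works.

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Homogeneous transform for one joint using standard DH parameters."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(joint_angles, dh_params):
    """Map joint-space angles to the end-effector pose in the base frame."""
    T = np.eye(4)
    for theta, (d, a, alpha) in zip(joint_angles, dh_params):
        T = T @ dh_transform(theta, d, a, alpha)
    return T  # 4x4 homogeneous pose of the end-effector

# Illustrative 6-DOF parameters (d, a, alpha) -- placeholders, not a real arm.
dh_params = [(0.3, 0.0, np.pi / 2), (0.0, 0.4, 0.0), (0.0, 0.35, 0.0),
             (0.25, 0.0, np.pi / 2), (0.0, 0.0, -np.pi / 2), (0.1, 0.0, 0.0)]
pose = forward_kinematics([0.1, -0.5, 0.8, 0.0, 0.3, 0.0], dh_params)
```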

2.3 End-Effectors

The last operation for a harvesting robotic system is fruit harvesting with the end-effector. To satisfy the requirements of actual applications, the end-effector is designed with the following aspects in mind:

(a) Reasonable gripping force: excessive force would damage the stem and degrade the orchard condition;

(b) Harvesting efficiency;

(c) Cycle time;

(d) A sensible structure that avoids damage to the fruit and canopy caused by bulky mechanical parts of the end-effector [56].

For fruit harvesting, researchers have developed end-effectors of various shapes and sizes. There are two main approaches to automatic detachment: (1) the end-effector applies mechanical force (twisting, stretching, or bending) to the fruit to separate it from the stem; (2) new cutting techniques cut the peduncle as soon as the end-effector grips the fruit, since certain fruits have hard peduncles that make detachment difficult [67]. Soft robotic end-effectors are increasingly replacing rigid end-effectors in robotic systems; they can bend to match the orientation of the fruit, helping to prevent mechanical collisions with the tree canopy and trellis wires, which makes them better suited to these cluttered environments [68].

3 Detection Approaches for Harvesting Robot

Although harvesting robots and deep learning techniques have advanced significantly in recent years, controlling robots to detect fruits in unstructured orchards still requires considerable effort [69]. To develop classifiers, researchers gather low-level features such as textures, colors, and shapes, and then use machine learning techniques such as the SVM algorithm, K-means clustering algorithm, and AdaBoost algorithm to detect and classify fruits. Unlike traditional machine learning, deep learning allows for the creation of more abstract, high-level features or attribute categories that can improve accuracy [70]. This section mainly introduces traditional machine learning-based image processing methods and deep learning-based image identification methods. Figure 3 demonstrates the use of computer vision technology in a strawberry-picking robot to identify strawberries at different levels of ripeness.

Fig. 3
figure 3

The use of computer vision technology in strawberry-picking robot (adapted from [71])

3.1 Image Processing Techniques Based on Machine Learning

Due to the constantly changing backdrop and illumination of fruits, researchers often use extracted low-level features to segment and detect target fruits. The flow and methods of traditional image recognition technology, illustrated in Fig. 4, involve image preprocessing that eliminates extraneous information, recovers useful information, improves detectability, and simplifies data to enhance the reliability of feature extraction, image segmentation, matching, and detection. Images are acquired using various types of sensors, including black-and-white cameras [72], RGB cameras [73,74,75], hyperspectral cameras [76,77,78], multispectral cameras [79, 80], and thermal cameras [81,82,83]. Image data is then preprocessed using color space transformation, histogram equalization, and noise reduction techniques [43, 84, 85]. The majority of images are captured using RGB cameras. Other types of sensors involved in image fusion methods will be discussed individually.
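A minimal preprocessing sketch of the steps just listed (color-space transformation, histogram equalization, and noise reduction) is shown below using OpenCV; the image path and filter parameters are placeholders.

```python
import cv2

# Illustrative preprocessing chain; "orchard.jpg" is a placeholder path.
bgr = cv2.imread("orchard.jpg")

# Color-space transformation: HSV separates chromatic content from brightness.
hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)

# Histogram equalization on the value (brightness) channel reduces the
# effect of uneven illumination.
h, s, v = cv2.split(hsv)
hsv_eq = cv2.merge((h, s, cv2.equalizeHist(v)))

# Edge-preserving noise reduction between fruit and background regions.
denoised = cv2.bilateralFilter(cv2.cvtColor(hsv_eq, cv2.COLOR_HSV2BGR),
                               d=9, sigmaColor=75, sigmaSpace=75)
```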

Fig. 4
figure 4

Traditional image recognition technology

3.1.1 Detection Algorithm Based on Color Features

Zhang et al. [86] proposed a color-based technique for detecting cucumber fruits in greenhouses, achieving a 76% detection rate for mature fruits. The identification rate was hindered by the misclassification of fruit with strong surface highlights as leaves, as well as the exclusion of partially occluded fruit, which was categorized as noise. To counteract the aforementioned issues with light and shadows, Fan et al. [87] presented a pixel block segmentation approach based on a gray-centered red-green-blue (RGB) color space, which effectively distinguishes apple-fruit pixels from other pixels, such as shadowed areas. Jidong et al. [88] developed a color feature-based recognition approach to address the issue of overlapping apples. However, the identification rate for occluded apple fruits was only 86%, highlighting the need for further improvement. Identifying unripe fruit is crucial for farmers to optimize fertilizer application during the ripening phase and to forecast yield before harvesting. Zhao et al. [89] presented an algorithm for detecting immature green oranges that employs color features and an absolute transformation difference. After classification and detection using a Support Vector Machine (SVM) classifier, the algorithm achieved an accuracy of over 83%. Tan et al. [90] utilized histograms of oriented gradients and color features to differentiate blueberry fruits of varying maturity. The authors compared the accuracy of the K-Nearest Neighbor (KNN) classifier to the newly developed Template Matching with Weighted Euclidean Distance (TMWE) classifier and determined that the TMWE classifier achieved higher accuracy at a lower computational cost.
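In the spirit of these color-feature methods, the sketch below thresholds an HSV image to obtain candidate fruit regions; the thresholds target red fruit and are purely illustrative, not the values used in the cited studies.

```python
import cv2
import numpy as np

bgr = cv2.imread("apple_tree.jpg")          # placeholder image path
hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)

# Illustrative thresholds for red fruit; real systems tune these per crop and
# lighting condition (red wraps around hue 0 in OpenCV's 0-179 hue range).
mask_low = cv2.inRange(hsv, (0, 80, 60), (10, 255, 255))
mask_high = cv2.inRange(hsv, (170, 80, 60), (179, 255, 255))
mask = cv2.bitwise_or(mask_low, mask_high)

# Morphological opening removes speckle; contours then give candidate fruit.
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
candidates = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 200]
```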

3.1.2 Detection Algorithm Based on Geometric Features

Geometric features based on shape and size can be utilized to detect apricot species using various machine-learning-based algorithms. Yang et al. [91] proposed an approach that utilized different algorithms, like decision trees, KNN, Naive Bayes, linear discriminant analysis, SVM, and artificial neural networks. The authors reported that SVM with a continuous projection algorithm led to the most accurate detection. Additionally, Lin et al. [92] introduced a novel approach for detecting apricot species by first generating a shape descriptor through contour information-based feature detection for partial shape matching. Then, the probabilistic Hough transform was used to locate candidate fruits, and lastly, the SVM classifier was used to identify all candidate fruits.
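As an illustration of shape-based candidate generation, the sketch below uses OpenCV's circular Hough transform to propose roughly round fruit regions that could then be passed to a classifier such as an SVM; the cited work used a probabilistic Hough transform, so this variant and its parameters are only an analogue.

```python
import cv2
import numpy as np

gray = cv2.imread("apricot_tree.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder path
blurred = cv2.medianBlur(gray, 5)

# Circular Hough transform proposes roughly round fruit regions; each
# candidate (x, y, r) would then be verified by a classifier such as an SVM.
circles = cv2.HoughCircles(blurred, cv2.HOUGH_GRADIENT, dp=1.2, minDist=30,
                           param1=100, param2=40, minRadius=10, maxRadius=80)
candidates = [] if circles is None else np.round(circles[0]).astype(int)
```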

3.1.3 Detection Algorithm Based on Texture Features

In terms of texture features, Yamamoto et al. [93] proposed a machine-learning algorithm that detects various types of tomato fruits, including ripe, immature, and young fruits, by fusing multiple features. By calculating the appropriate number of clusters with X-means clustering, the algorithm detected individual fruits. However, the recall rate for young fruits was just 78%, highlighting the difficulty of distinguishing them from the stems due to similarities in size and appearance. Li et al. [94] introduced a fast normalized cross-correlation (FNCC) machine vision-based algorithm to identify immature green citrus fruits while minimizing the impact of lighting fluctuations on RGB images. The algorithm combined color, shape, and texture features, with the KNN classifier achieving a detection accuracy of 84.4%. Additionally, the authors suggested manually adjusting the camera shutter settings to produce a more consistent image brightness for better fruit-to-leaf differentiation. Zhang et al. [95] investigated an approach based on fused color and texture features to improve apple image segmentation. The Random Forest classifier, with a 94% accuracy rate, outperformed the other eight machine learning algorithms tested for pixel classification.
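A minimal sketch of pairing a hand-crafted descriptor with a KNN classifier, in the spirit of the feature-fusion methods above, is given below; HOG is used here as a stand-in texture/shape descriptor, and the training data are random placeholders.

```python
import numpy as np
from skimage.feature import hog
from sklearn.neighbors import KNeighborsClassifier

def patch_features(patch_gray):
    """Hand-crafted descriptor for one candidate region (HOG as a stand-in)."""
    return hog(patch_gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

# Placeholder training data: 64x64 grayscale patches with fruit/background labels.
train_patches = [np.random.rand(64, 64) for _ in range(20)]
train_labels = np.random.randint(0, 2, size=20)

X = np.array([patch_features(p) for p in train_patches])
knn = KNeighborsClassifier(n_neighbors=3).fit(X, train_labels)

test_patch = np.random.rand(64, 64)
is_fruit = knn.predict([patch_features(test_patch)])[0]
```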

3.1.4 Multi-feature Fusion Method

Regarding multiple features, Lin and Zou [96] introduced a novel segmentation method that used an AdaBoost classifier and fused texture-color features (Leung-Malik texture and HSV color features) to detect citrus within fixed-size sub-windows. Nevertheless, the LM filter bank was affected by fluctuations in light, resulting in the over-segmentation of citrus images. Additionally, Wu et al. [97] introduced a method for detecting juicy peaches that makes use of color data and three-dimensional (3D) contour features. The study utilized a conditional Euclidean clustering approach to cluster preprocessed 3D point clouds of fruit trees, and 3D contour features were then used to locate and harvest the fruits. Unfortunately, when detecting unripe green fruits, the accuracy of the method was relatively low. To address this issue, Wu et al. [98] proposed a fruit point cloud segmentation approach that blends 3D geometric features with color features. This method demonstrates superior performance compared to the traditional color segmentation method, with an accuracy of 80.09%. Although the method is effective in detecting fruits with roughly round or spherical surfaces, the authors caution that it may not be reliable for image segmentation of fruits with irregular surfaces.

The above image processing algorithms are summarized in Table 1. It can further be concluded that color can serve as the main extracted feature when the fruit color is distinctive or clearly differentiated from the background, as with apricots, peaches, and citrus crops. However, color features rely heavily on ideal lighting conditions, so color extraction is usually performed under artificial illumination. When the color of the fruit is similar to its background, shape features can serve as the main extracted feature; for example, green fruits resemble the color of branches and leaves, and detecting their shapes can improve recognition accuracy. When branches or clusters heavily occlude the fruits, texture features can be used to recognize the target fruits more quickly and accurately. By extracting multiple features, the accuracy of target recognition and the adaptability to complex real-world environments can be significantly improved, and the constraints under non-artificial conditions can be relaxed.

Table 1 Machine learning based image processing algorithms

3.2 Image Recognition Technology Based on Deep Learning

Deep learning constitutes a subset of machine learning techniques [99, 100]. In traditional machine learning algorithms, the ability to learn is typically constrained, and larger amounts of data do not necessarily result in continuous improvement of the learned information. Conversely, deep learning systems, much as humans gain from more experience, can enhance performance by accessing vast amounts of data. To overcome the challenge of numerous parameters and lengthy optimization times, the advent of GPU parallel computing technology has triggered a global rise in deep learning research. Additionally, comprehensive and rigorous investigations into the application of deep learning to agricultural robots have been conducted.

3.2.1 Two-stage Object Detection Algorithm

Faster R-CNN is a typical two-stage object detection model, and its structural diagram is shown in Fig. 5. The RPN (Region Proposal Network) is a crucial innovation that connects region proposal generation and the convolutional network through an anchor mechanism, achieving an increased detection rate of 17 frames per second [30]. Building on Faster R-CNN, He et al. [101, 102] introduced Mask R-CNN, an innovative instance segmentation network. The key improvements can be summarized as follows:

(1) To resolve the accuracy loss caused by the ROI pooling rounding method, the RoI Align method was introduced as a replacement for the original ROI pooling method.

(2) A mask branch was incorporated into the image segmentation model to determine the class of each pixel.

Fig. 5
figure 5

Structure diagram of Faster R-CNN (adapted from [30])

Furthermore, new and improved algorithms based on the two-stage detection framework have been introduced to meet the requirements of various fruit detection tasks. Jia et al. [103] proposed a Mask Region Convolutional Neural Network (Mask R-CNN) based visual detector for harvesting robots. The authors tested the method on a random test set of 120 images and achieved 97.31% accuracy and 95.70% recall. Parvathi et al. [104] proposed an improved faster region-based convolutional neural network (Faster R-CNN) model for detecting two important ripening stages of coconut against complex backgrounds. The Faster R-CNN algorithm based on the ResNet-50 network improved the detection scores of nuts at the two major ripening stages. Table 2 provides an overview of the improved two-stage object detection algorithms.
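For reference, the sketch below runs an off-the-shelf two-stage detector (Faster R-CNN with a ResNet-50 FPN backbone from torchvision); it assumes a recent torchvision release, omits the fine-tuning on fruit images that the cited works perform, and uses a placeholder image path.

```python
import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

# Pretrained two-stage detector; in practice it would be fine-tuned on fruit data.
weights = torchvision.models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=weights).eval()

img = convert_image_dtype(read_image("coconut.jpg"), torch.float)  # placeholder path
with torch.no_grad():
    pred = model([img])[0]            # dict of boxes, labels, scores for one image

keep = pred["scores"] > 0.7           # simple confidence threshold
boxes = pred["boxes"][keep]           # candidate fruit bounding boxes
```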

Table 2 Improved two-stage object detection algorithm

3.2.2 One-stage Object Detection Algorithm

The two-stage object detection algorithm creates region proposals in the first stage. In the second stage, the contents of the region of interest are classified and regressed, but this results in the omission of spatial information for local objects within the entire image. To solve this problem, a one-stage object detection algorithm is proposed that omits the region proposal creation stage and can directly detect objects. One of the most representative and popular single-stage target detection algorithms is the YOLO series [113,114,115,116], whose structure is shown in Fig. 6. The YOLO series has a faster detection speed than the R-CNN series, but the detection accuracy of the YOLO series is usually inferior to that of the R-CNN series.

Fig. 6
figure 6

Structure diagram of YOLO-v3

To balance detection accuracy and speed for optimal gains, another one-stage object detection algorithm, SSD, was presented by Liu et al. [42]. During the last five years, several improved algorithms based on the YOLO or SSD framework have been developed. Tian et al. [117] introduced an improved YOLOv3 model for detecting apples at different growth stages in orchards with fluctuating light, complex backgrounds, overlapping apples, and complicated branches and foliage. The proposed YOLOv3-dense model outperforms the original YOLOv3 model and the Faster R-CNN based on the VGG16 network, with an average detection time of 0.304 s per frame, enabling real-time detection of apples in orchards. To automatically identify graspable and non-graspable apples in apple tree images, Yan et al. [118] proposed a lightweight target detection method for an apple-picking robot based on an improved YOLOv5s. The experimental results show that the improved network model can effectively identify graspable apples that are not occluded or are occluded only by leaves, as well as non-graspable apples that are occluded by branches or by other fruits. Specifically, the recognition recall, precision, mAP, and F1 score were 91.48%, 83.83%, 86.75%, and 87.49%, respectively, with an average recognition time of 0.015 s per image. Overall, the improved one-stage object detection algorithms used in fruit harvesting robots are outlined in Table 3.
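A minimal one-stage inference sketch is shown below using the public ultralytics/yolov5 hub model; the generic COCO-pretrained weights and image path are placeholders for the fruit-specific models trained in the cited works.

```python
import torch

# Generic pretrained YOLOv5s; a fruit-specific model would load custom weights,
# e.g. torch.hub.load('ultralytics/yolov5', 'custom', path='apple_best.pt').
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

results = model("orchard_row.jpg")        # placeholder image path
detections = results.pandas().xyxy[0]     # columns: xmin, ymin, xmax, ymax, confidence, class
fruit_boxes = detections[detections["confidence"] > 0.5]
```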

Table 3 Improved one-stage object detection algorithm

4 Localization Methods for Harvesting Robot

Harvesting robots require 3D spatial position information of the detected fruit to guide the end-effector through the harvesting procedure. However, the camera obtains only the 2D image-space position of the target, so a mapping relationship must be established between the target position in 2D image space and its 3D spatial position. Researchers have proposed successful solutions to address this issue, which are introduced in this section. The paper categorizes fruit localization methods based on camera data into 2D and 3D categories. A detailed comparison of 2D and 3D cameras is given in Table 4.

Table 4 The detailed comparison of 2D and 3D cameras

4.1 Localization Method Based on Two-dimensional Images

2D cameras containing charge-coupled device (CCD) or complementary metal-oxide-semiconductor (CMOS) sensors are prevalent for fruit localization and trajectory tracking in harvesting robots [146]. Mehta et al. [134] acquired 3D positioning information on citrus fruits using a monocular camera based on perspective transformation. After comparing test results, the authors demonstrated that this method was less time-consuming than depth estimation with a stereo vision technique. Xiong et al. [135] used a CCD camera with artificial lighting to detect green grapes and locate harvesting points at night, preventing missed detections and the inadvertent collection of immature grapes. They found that the highest accuracy in harvesting point detection was 92.5% at a depth of 500 mm. However, they also discovered that an increased shooting distance reduced light density, causing errors in point computation due to poor image quality.

Accurate fruit localization is crucial for effective robotic harvesting. Mehta et al. [136] developed a nonlinear estimator based on particle filters to predict fruit locations captured from multiple CMOS cameras. However, the approach has limited effectiveness when the view is obstructed. Unpruned buds can hinder accurate localization by producing vegetation that conceals new buds and affects the subsequent output. To address these issues, Díaz et al. [137] developed a grape bud detection and localization method based on structure from motion and 2D image classification. The approach captured 2D images to construct a 3D model of the scene, achieving a localization error of 259–554 pixels.

4.2 Localization Method Based on Three-dimensional Coordinates

Conventional RGB color cameras, known as 2D cameras, only capture objects in the sensor's field of view and cannot acquire depth information. RGB-D depth cameras, however, can acquire this depth information directly: the depth camera measures the distance from the camera to each point in the picture, and combining this distance with the pixel's x and y coordinates yields the 3D spatial coordinate of each point in the image. Researchers have used depth cameras in combination with robotics to create more efficient robotic harvesting systems. Depth cameras are categorized as structured light cameras, binocular cameras, and Time-of-Flight (ToF) cameras based on their operating principles.
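The mapping from a detected pixel and its depth reading to a 3D camera-frame coordinate follows the standard pinhole model; in the sketch below, the intrinsic parameters and measurements are placeholder values standing in for the depth camera's calibration.

```python
def pixel_to_camera_xyz(u, v, depth, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with measured depth into camera coordinates.

    fx, fy are focal lengths in pixels and (cx, cy) is the principal point,
    all taken from the depth camera's calibration.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return x, y, depth

# Placeholder intrinsics and measurement: a fruit centred at pixel (640, 360)
# with a measured depth of 0.85 m.
xyz = pixel_to_camera_xyz(640, 360, 0.85, fx=615.0, fy=615.0, cx=640.0, cy=360.0)
```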

4.2.1 Localization Based on Structured Light

A structured light system consists of a laser projector and one or more cameras. The projector actively emits infrared light onto the object's surface, which is then imaged by the camera(s). By calculating location and depth information based on the triangulation principle, 3D reconstruction is achieved [147]. Laser triangulation with a single spot is a fundamental structured light configuration, as shown in Fig. 7.

Fig. 7
figure 7

Triangulation with a single laser spot (adapted from [147])

Structured-light cameras are widely utilized in agricultural automation. Nguyen et al. [138] used an RGB-D structured light camera to acquire images and developed an algorithm based on color and shape features that detected and located red and bi-colored apples beneath an umbrella blocking direct sunlight. The positioning accuracy in all directions was less than 10 mm. Additionally, the authors suggested using additional sensors for more information on the 3D position of the fruit and the location of the stem to enhance the gripping and harvesting of individual fruits.

4.2.2 Localization Based on Binocular Stereo Vision

Binocular stereo vision involves taking two pictures of the object of interest with cameras placed at different locations and then determining the positional difference (disparity) between corresponding points in the two images to obtain 3D geometric information about the object [147]. Figure 8 shows its schematic diagram. Binocular stereo vision systems are often constructed from two conventional consumer-grade RGB cameras because of the inexpensive hardware requirements. Wang et al. [148] presented a technique for target localization using window scaling. The approach involves collecting images of produce and estimating the three-dimensional coordinates of the target using the triangulation principle to achieve complete target localization. In a natural environment, Liu et al. [149] implemented a binocular stereo-vision approach and an improved YOLOv3 model for pineapple identification and localization. At a range of 1.7–2.7 m, the absolute mean error was 24.414 mm, with an average relative error of 1.17%. Additionally, during robotic harvesting, the method took into account wind disturbance, mutual branch contact, and mechanical collisions. The visual localization of dynamic lychee clusters was explored by Xiong et al. [150]. The harvesting point was determined by computing the oscillation angle of lychee clusters in three states of disturbance: static, slight, and large. The maximum depth error was 5.8 cm, and the minimum depth error was 0.4 cm.

Fig. 8
figure 8

Schematic diagram of binocular stereo vision (adapted from [151])
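A minimal stereo-depth sketch is given below: block matching on a rectified image pair yields a disparity map, and depth follows from Z = fB/d; the calibration values and image paths are placeholders.

```python
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)     # placeholder rectified pair
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching yields a disparity map (in 1/16-pixel units).
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=7)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0

# Depth from triangulation: Z = f * B / d (placeholder calibration values).
focal_px, baseline_m = 700.0, 0.12
valid = disparity > 0
depth_m = np.zeros_like(disparity)
depth_m[valid] = focal_px * baseline_m / disparity[valid]
```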

In practice, binocular depth cameras have been widely adopted by researchers developing vision systems for harvesting robots. Hou et al. [139] reported a recent technique utilizing a modified YOLOv5 and binocular stereo vision for detecting and localizing ripe citrus. The average distance error between citrus fruit and the camera under non-uniform, low, and good lighting conditions was 3.98 mm. The authors concluded that the approach offers accurate and swift detection and localization of citrus fruits in intricate orchard landscapes. Occlusion by leaves, branches, and other fruits often leads to imprecise bounding boxes of detected fruits and erroneous depth measurements. To manage this problem, Li et al. [140] proposed a distinctive 3D fruit localization method based on a truncated-cone point cloud processing algorithm. According to the authors, this method decreased the median and average fruit localization errors by 59% and 43%, respectively, compared with the traditional approach. Wang et al. [152] presented a geometry-aware detection network designed for apple harvesting. The network utilized color and geometry sensory input from RGB-D cameras and performed end-to-end instance segmentation and grasp estimation with an average precision of 0.61 cm in center position and 4.8° in orientation.

4.2.3 Localization Based on the Principle of ToF

A time-of-flight camera consists of a light transmitter and a receiver. The receiver detects the light emitted by the transmitter after it reflects off the object in view, and the distance to the target object is determined by measuring the time taken for the signal to travel to and return from the object [147]. Wu et al. [141] developed a banana-harvesting robotic platform that utilizes stereo vision to improve 3D localization accuracy at the truncation point. The study found a median error of 8 mm and a median absolute deviation of 2 mm for the depth coordinates. Lin et al. [55] utilized Euclidean clustering guided by fruit binary maps and RGB-D depth images to separate point clouds into individual fruits to enable collision-free harvesting. By determining the center location of each fruit and its relation to the mother branch, it was possible to accurately estimate the 3D poses of the fruits. The study found that the 3D pose error, calculated using the spherical fitting method, was 23.43° ± 14.18° and that the method took 0.565 s to execute per fruit. Li et al. [143] employed principal component analysis (PCA) to estimate the pose for lychee harvesting, specifically targeting the random scattering and uneven appearance of lychee clusters. The study achieved a detection precision of 83.33% and a pose estimation precision of 17.29° ± 24.57°. Therefore, further improvements in accuracy were deemed necessary by the authors.
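The ToF ranging principle described at the start of this subsection reduces to distance = (speed of light × round-trip time) / 2, as the brief illustrative sketch below shows.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def tof_distance(round_trip_time_s):
    """Distance from a measured round-trip time of the emitted light pulse."""
    return SPEED_OF_LIGHT * round_trip_time_s / 2.0

# A round trip of ~6.67 ns corresponds to a target roughly 1 m away.
d = tof_distance(6.67e-9)
```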

5 Vision-Based Control for Harvesting Robot

Visual servo control is a widely used robotics technique that utilizes vision sensors to gather environmental data and translate it into appropriate kinematic commands for the robot controller. Early robot vision systems utilized open-loop vision control, employing a "look then move" approach rather than closed-loop control. Ongoing advancements in computer hardware and related algorithms have led to the recent and rapid development of this field, as shown by recent research [153,154,155]. This section presents an overview of the control methods utilized in vision-based harvesting robots. Table 5 details the various control methods and overall performance metrics of harvesting robots from prior years.

Table 5 Control mode and performance of harvesting robot

5.1 Open-loop Visual Control

Silwal et al. [158] employed RGB-D cameras to develop a robot for apple harvesting with open-loop vision control. The study found a harvesting success rate of 84.6% and attributed the partial failures to progressive position errors in open-loop vision control systems and to the difficulty of catching apples on long, thin, flexible branches. Ling et al. [156] developed a dual-arm cooperative technique utilizing binocular vision sensors to improve the efficiency of tomato harvesting robots, which involved tomato detection, target localization, trajectory planning, and real-time control of dual-arm motions. The study achieved a success rate of up to 87.5% using suction-cup grabbing and wide-range cutting during robotic harvesting. Yu et al. [156] introduced a humanoid robot designed for efficient and flexible apple harvesting, utilizing the scale-invariant feature transform (SIFT) feature point detection and matching mechanism to identify the pixel coordinates of the optimal apple contour and the target apple center. The authors reported several practical issues that adversely impact the robot's performance:

(1) All the electric motors and components of the robot operated on lithium-ion batteries that were inadequate to supply the necessary driving force.

(2) The research assumed uniform apple size, yet apples vary in size, requiring the claws to be more adaptable or the addition of a tactile sensor for better performance.

(3) The color segmentation-based identification accuracy of the binocular camera system was unsatisfactory; it could be improved by using RGB-D cameras and advanced identification algorithms.

5.2 Visual Servo Control

Errors in the vision input device and system, localization errors in vision-based recognition, coordinate transformation errors, and other factors affect the operational efficiency of harvesting robots in an open-loop system, and the accumulated errors cause certain fruits to be picked inaccurately or missed. To improve the accuracy of the robot, a visual feedback loop identifies deviations between the actual and intended positions of the manipulator [164]. Conventionally, visual servo control is classified as either position-based visual servoing (PBVS) or image-based visual servoing (IBVS) [154, 155], depending on whether the feedback signal is a 3D spatial coordinate value or an image feature value.

5.2.1 Position-based Visual Servo (PBVS)

The position-based visual servo (PBVS) system determines the intended pose via image analysis and the geometric model of the target, and the deviation between the current and goal poses informs the trajectory planning [165, 166]. PBVS control is built on precise measurement of the spatial coordinates of the target fruit by the visual sensor. First, the visual sensor must obtain accurate spatial position information of the target fruit; then, through an accurate hand-eye coordinate transformation model, the position information in the visual sensor coordinate system is converted into spatial coordinates in the robot coordinate system. Finally, the positional relationship between the target fruit and the robot end-effector is used for motion planning, which in turn controls the movement of the picking robot's end-effector toward the target fruit position. The schematic structure of the system is illustrated in Fig. 9.

Fig. 9
figure 9

Position-based visual servo
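A minimal PBVS sketch following the steps above is given below: the fruit position measured in the camera frame is mapped into the robot base frame through a hand-eye calibration transform, and the residual Cartesian error is scaled into a motion command; the transform, gain, and measurements are placeholder values.

```python
import numpy as np

def to_base_frame(T_base_cam, p_cam):
    """Map a fruit position from camera coordinates to robot base coordinates
    using the hand-eye calibration transform T_base_cam (4x4 homogeneous)."""
    p_h = np.append(p_cam, 1.0)
    return (T_base_cam @ p_h)[:3]

def pbvs_step(T_base_cam, p_cam, p_effector, gain=0.5):
    """One position-based servo step: the Cartesian error between the target
    fruit and the end-effector is scaled into a velocity command."""
    p_target = to_base_frame(T_base_cam, p_cam)
    error = p_target - p_effector
    return gain * error          # commanded end-effector velocity (m/s)

# Placeholder calibration and measurements.
T_base_cam = np.eye(4)
T_base_cam[:3, 3] = [0.1, 0.0, 0.5]
v_cmd = pbvs_step(T_base_cam, p_cam=np.array([0.0, 0.05, 0.7]),
                  p_effector=np.array([0.2, 0.1, 0.9]))
```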

5.2.2 Image-based Visual Servo (IBVS)

In image-based visual servoing, the control quantity is computed directly from the error signal in the image plane to drive the robot toward the target fruit and complete the picking task. The critical problem in image-based visual servo control is estimating the image Jacobian matrix, which forms the bridge between the image coordinate system and the robot coordinate system [165, 166]. Compared with position-based visual servo control, image-based visual servo control is relatively insensitive to robot kinematic calibration and camera model errors, and it does not require estimating the position of the target fruit in the robot coordinate system, thus reducing computational latency. Therefore, it has become one of the preferred solutions today. The schematic structure of the system is depicted in Fig. 10.

Fig. 10
figure 10

Image-based visual servo
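A minimal IBVS sketch for a single point feature is given below: the classical interaction (image Jacobian) matrix relates camera velocity to feature motion, and the control law v = -λL⁺(s - s*) drives the image error toward zero; the depth estimate and feature coordinates are placeholder values.

```python
import numpy as np

def interaction_matrix(x, y, Z):
    """Classical 2x6 image Jacobian for a point feature (x, y) in normalized
    image coordinates at estimated depth Z."""
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x**2), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y**2, -x * y, -x],
    ])

def ibvs_step(feature, feature_goal, Z, gain=0.5):
    """One image-based servo step: v = -lambda * L^+ * (s - s*)."""
    error = np.asarray(feature) - np.asarray(feature_goal)
    L = interaction_matrix(*feature, Z)
    return -gain * np.linalg.pinv(L) @ error   # 6-DOF camera velocity command

# Placeholder values: fruit centre currently at (0.10, -0.05), desired at image centre.
v_cam = ibvs_step(feature=(0.10, -0.05), feature_goal=(0.0, 0.0), Z=0.8)
```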

In practice, image-based vision servoing is considered less computationally challenging compared to position-based servoing, making it the preferred control mechanism. Mehta et al. [161, 162] proposed a cooperative vision servo controller that addressed external disruptions, such as mechanical contact between the robot and trees, by incorporating a feedback term that compensated for positioning flaws, allowing the end-effector to micro-adjust the position. However, when the robot interacted with dense crops, it potentially missed the target, leading to harvesting failure. Barth et al. [163] designed a servo control framework to address agricultural settings with dense plant cover. The servo control framework achieved motion control of a sweet pepper harvesting robot with visual information, successfully harvesting sweet peppers under laboratory conditions. Chen et al. [159] developed a vision-based servo control for a harvesting robot using an upgraded fuzzy neural network sliding mode algorithm. The enhanced algorithm significantly increased not only the design efficiency but also the success rate of the harvest. However, the procedure was only tested in a laboratory setting, and the authors acknowledged potential obstacles when harvesting in natural settings.

6 Challenges and Future Trends

The potential of harvesting robots to revolutionize smart agriculture is immeasurable. The advancements in machine vision and artificial intelligence technologies have significantly accelerated the transition of harvesting robots from laboratory settings to practical orchards. However, fruit-harvesting robots encounter various challenges in their current implementation. These include issues associated with energy consumption and unstructured orchard environments. Fruit blockage by branches and leaves, uncertainty caused by the similarity of the background color to the fruit's body color, direct contact with fruit by bulky and inflexible robots, and the need to consider the degree of fruit ripeness and defects during harvesting are just a few examples of such challenges. Researchers must explore and enhance fruit detection, localization, and control techniques to address these problems.

6.1 Building a Structured Environment Suitable for Harvesting Robots

The haphazard growth of fruit and leaves in natural environments poses a significant challenge to harvesting robots and is often the most difficult issue for fruit detection and handling. In recent years, modern orchard management techniques have been employed to create structured environments suitable for harvesting robots. For instance, apple trees have been pruned into flat crowns, leaves and branches are cleared manually or mechanically, and greenhouse (chamber) agricultural systems are used to cultivate fruit [167,168,169].

6.2 Designing the End Effector Suitable for Fruit Detachment

Robots used for picking fruits are often in motion during operation, and movement caused by wind and picking can cause the fruit to swing back and forth, damaging the fruit skin and affecting its quality. Harvesting usually involves grabbing the fruit and pulling it off the vine, which may cause mechanical damage. Therefore, designing high-precision end-effectors to grab fruits is a direction for future improvement. To design an end-effector appropriate for a picking robot, the following factors must be taken into account: its ability to adjust to various shapes and sizes of produce, its light weight and flexibility to enable swift robot motion, and its high precision and stability to ensure gentle and accurate picking. Moreover, the end-effector must be easy to maintain given the prolonged timeframe of operation; it is therefore advisable to use durable materials and a simple design [168, 170, 171].

6.3 Developing a More Accurate Fruit Detection and Localization Algorithm

Despite the emergence of several high-performing fruit recognition algorithms, image processing algorithms are continually being improved upon. In recent years, new vision sensors, such as light field cameras and chlorophyll fluorescence cameras, have garnered increased attention. Using these advanced sensors to acquire more detailed visual data would undoubtedly enhance recognition in complex environments. The accuracy and efficiency of vision control are other areas that require improvement. Although position-based visual servoing (PBVS) control has been applied in a variety of applications, most agricultural producers continue to employ image-based visual servoing (IBVS) control, given the economic costs and the current state of vision sensor technology. However, future efforts should be devoted to implementing PBVS control [172,173,174].

6.4 Training a Lightweight Model for Fruit Target Detection

To enhance the recognition performance of the vision system, it is common to improve recognition algorithms. However, such improvements typically result in a more complex algorithm and longer computing time, despite the gain in recognition accuracy. Meeting actual production requirements becomes challenging given the real-time demands of the vision systems employed by picking robots. Consequently, priority should be given to developing lightweight target detection models that support real-time fruit detection on edge devices and enhance the performance of visual recognition systems in embedded devices [175].

6.5 Other Feasible Directions

Using multiple robotic arms can enable the efficient grasping of multiple fruits simultaneously. A collaborative effort can reduce the risk of errors and failures while improving the accuracy of grasping [144, 176]. In specific scenarios, a single robotic arm may find it challenging to accomplish the task at hand, which makes having multiple robotic arms collaborate ideal. This gives the robotic arm system more flexibility and adaptability in its application range. In the field of agriculture, visual recognition and detection technology can be incorporated into intelligent agricultural systems to help farmers achieve automated management and production. For instance, the growth status, fruit maturity, and yield of fruit trees can be monitored in real-time in the orchard through corresponding sensors. Diseases and pests can also be detected at early stages by visual recognition and detection technology, leading to reduced fruit loss and pesticide use [177]. Additionally, image analysis algorithms can automatically grade fruits based on their size and color, resulting in increased efficiency and yield quality [178, 179].

7 Conclusions

This paper presents a comprehensive review of the recent progress in fruit-harvesting robots developed by researchers in the past five years. The study discusses the advancements in target recognition and detection techniques and the methods for achieving target localization in fruit-harvesting robots. The paper compares the vision-based control techniques for harvesting robots, and after conducting a thorough survey, visual servo control is identified as the most widely used control method. The primary contribution of this paper is to provide a comprehensive and in-depth analysis of the core technologies utilized in fruit-harvesting robots. Additionally, the paper highlights significant advancements achieved through multi-sensor fusion technology, deep learning-based target detection algorithms, novel end-effectors, and vision servo-based closed-loop control, which have the potential to further enhance the intelligence, accuracy, flexibility, and efficiency of fruit-harvesting robots.