Introduction

The evolution of fruit size through the season, and its prediction at harvest, provide important information for growers. Generally, larger fruits guarantee better prices; on the contrary, a high yield of small fruits may not be economically sustainable. As reported by Miranda et al. (2023), considering consumer willingness to buy, fruit size is a key factor in appearance, the most important quality parameter. Sorting equipment in packaging lines typically categorizes fruits by size, and this classification impacts the price of the fruit. Furthermore, knowing the final fruit size in advance, thanks to predictions based on fruit growth monitoring, is essential for management, post-harvest processing, and logistics (Chaves et al., 2017). The implementation of precision management strategies is common practice in modern orchards, with the aim of guaranteeing superior-quality crop production from mature, developed trees (Whiting, 2018). To facilitate prompt interventions in management operations, it is necessary to monitor and process several parameters, possibly in real time. Fruit size at defined timings and fruit growth along the season are two of these parameters, and they can lead to more precise orchard management. Different approaches to measuring fruit growth have been proposed over the years, such as manual measurement (Manfrini et al., 2015), strain sensors (Link et al., 1998), fruit gauges based on linear and angular potentiometers (Morandi et al., 2007; Peppi et al., 2021, 2023) and optoelectrical sensors (Thalheimer, 2016). The analysis of growth patterns can be useful to predict the fruit size class distribution at harvest or to give daily-scale information, as demonstrated by Boini et al. (2019), who proposed absolute growth rate thresholds as an indicator of tree water stress to consequently manage irrigation.

To enable large-scale data collection, non-destructive proximal sensors that do not require physical contact with the fruits are a potential solution. The appeal of this topic is confirmed by the many studies published over the last decade on computer vision systems (CVSs) and 3D reconstruction focusing on the detection and sizing of fruits before, during, and after picking, as reported by Miranda et al. (2023) and Neupane et al. (2023), who reviewed the results obtained in recent years. Moreover, the introduction of automation-friendly training systems such as the “planar cordon” (Tustin et al., 2018) facilitates the implementation of the aforementioned proximal sensing technologies (Bortolotti et al., 2021; Mengoli et al., 2023), further favoring robotization (Zhang, 2018). In this context, the real application of such proximal sensing solutions is becoming a reality, with several companies active in this area (Neupane et al., 2023), a few of them collaborating actively with research institutions (Islam et al., 2022).

The presented study focuses on assessing the performance of a consumer-grade CVS in sizing fruits directly on the tree, in real time and throughout the entire growing season. The CVS is part of a wider project that aims to assess the ability of a low-cost CVS to estimate fruit size variation during the season, thus extracting physiological traits and automating orchard management practices. The present study focuses on the sizing performance of the system and extends a preliminary work (Bortolotti et al., 2023) to the complete dataset.

Materials & methods

Data collection and preparation

The trial was carried out in the 2022 season at the experimental farm of the University of Bologna (Cadriano, IT, Lat: 44.548179, Lon: 11.414376). The data collection was done in a 3-year-old apple orchard (cv. Fuji - Fujiko), grown as a ‘planar cordon’, a highly productive 2D training system (Tustin et al., 2018, 2022). Thanks to its ‘standardized’ structure and thin (and sparse) canopy, it favors both management (e.g., managing crop load at the ‘upright’ level; Bortolotti, 2022; Bortolotti et al., 2022b) and computer vision (Bortolotti et al., 2021; Mengoli et al., 2023), as well as robotization (Zhang, 2018). The orchard was managed following standard farm practices.

The data collection campaign consisted of 17 collection times, from the 17th of June to the 12th of October (full bloom occurred on the 10th of April), in which the growth of 24 tagged, highly visible fruits (hanging on 2 trees) was monitored from an initial fruit size of approximately 45 mm. At every collection time, the tagged fruits were manually measured with a digital caliper at their maximum equatorial diameter. These manual measurements were used as the reference fruit size at each collection time.

Data collection (Fig. 1) consisted of a static video recording of each of the two trees, using a consumer-grade depth camera (RGB-D) (Intel RealSense D435i, Intel Corp., CA - USA) through its original software tool and a standard laptop. A tripod with bubble levels was used to hold the camera oriented parallel to the tree row and perpendicular to the ground. Recordings were taken at 1.0 and 1.5 m distance from the tree row. Camera settings were fixed according to the suggested resolutions for best results in terms of depth accuracy (i.e., 848x480) and RGB image definition (1920x1080). Recordings were taken at 30 frames per second (FPS) for approximately 2–3 s. For all collection times except the first two, at least 3 tennis balls (Tballs) were included in the scene as a dimension reference, to better evaluate the effect of fruit non-sphericity and dimension variability. From the recorded RGB-D videos, a pair of color (RGB) and depth frames was extracted using the official Intel RealSense SDK. On the extracted RGB frames, the tagged apples framed in the scene were accurately annotated and tagged manually using a web platform service (https://Supervisely.com). The label tagging in the image was used to later match the reference fruit sizes, measured with the manual caliper, with the digitally estimated ones; the same was done for the Tballs present in the RGB frames. The obtained dataset is open source and available at https://github.com/ECOPOM/OpenAcces_RGBD_apple_dataset (Bortolotti et al., 2024).

Fruit sizing algorithm

Initially, the authors intended to develop the following algorithm exclusively for manually annotated fruits in the images, aiming to support fruit size estimation. This would enable operators to increase their sample size without physically going to the field for sampling. However, advances in object detection algorithms led the authors to automate the whole process, from fruit detection to sizing.

To automate fruit sizing, the authors exploited an open-source YOLO object detection algorithm (i.e., ‘yolov5l6’ - https://github.com/ultralytics/yolov5), which was trained on a custom apple detection dataset of more than 100 apple tree images, collected during previous seasons on different apple cultivars and with different image collection devices. Images of the dataset were manually annotated only for highly visible fruits, in order to reduce errors in the later sizing phase caused by fruits that cannot be properly sized due to leaf occlusion or poor visibility. The dataset was split with a proportion of 70-20-10% for the training, validation, and test sets, respectively. Data augmentation was used to increase the train-set image number up to 6 times, and ‘yolov5l6’ was then trained for 500 epochs on two Nvidia RTX 2080super GPUs. The trained model presented test-set precision, recall and F1-score values of 0.87, 0.83 and 0.85, respectively. From now on, the authors will refer to this trained model as the ‘Yc’ detector, while ‘Ystd’ will refer to the ‘yolov5l6’ model not trained on the aforementioned custom dataset, but downloaded directly from the repository. Despite the current availability of newer YOLO versions (e.g., YOLOv6, YOLOv7, YOLOv8, YOLOv9), the authors decided to use YOLO version 5, since this paper extends a previous work based on that version (Bortolotti et al., 2023). Furthermore, the authors found literature showing that lower performances of newer YOLO versions, compared to YOLOv5, could be observed on specific training datasets (Gašparović et al., 2023). In a preliminary test carried out by the authors with the same training dataset and hyperparameters, YOLOv5 showed better performance than YOLOv8 (data not shown).

A Python script was developed to size each fruit detected on the RGB frame based on its bounding box (bbox) coordinates. These coordinates were then used to project the fruit bbox on the related depth map to get fruit-distance information. The steps of the sizing algorithm were the following (Fig. 1):

  i) RGB and aligned depth-map frames were cropped at fruit-bbox level, exploiting the coordinates given by the Yc model.

  ii) The mean fruit-to-camera distance (mFD) was computed on the depth bbox as the average of all non-zero values in an area centered on the bbox and half its size, since this method was shown to be more precise in a previous trial (Mengoli et al., 2022).

  iii) The possible scene dimensions (SD) framed at mFD (SD@mFD) and the related pixel dimension in millimeters (pixmm) were obtained similarly to Bortolotti et al. (2022); SD@mFD was computed by a trigonometric relation (Eq. 1) between the distance and the camera fields of view (FoVs) - Eq. 1 holds for either the vertical or the horizontal dimension. The pixmm was obtained as the square root of the ratio between the scene area in mm² (SD@mFD vertical × SD@mFD horizontal) and the RGB frame resolution in pixels.

    $$SD@mFD=2*\left(mFD *tan\left(\frac{1}{2}FoV\right)\right)$$
    (1)

  iv) A circle detection algorithm, based on the open-source OpenCV library (Bradski, 2000) function HoughCircles, was applied to the enlarged (i.e., + 10 pixels), greyscale bbox RGB matrix. The algorithm parameters were customized to search for the circle best fitting the detected fruit. The customization resulted from a test run on more than 500 apples, which yielded best-parameter settings for both green and red fruits. Both best-parameter settings are then exploited to find fruit circles in each bbox.

  v) Exploiting the fact that a well-performing object detection model should generate bboxes with the object at their center, a filtering step was applied: detected circles not centered (± 20%) within the bbox, or with dimensions inconsistent with the bbox dimensions (i.e., larger than the bbox or smaller than half of it), were discarded.

  vi) For each of the remaining filtered circles, the real dimension in millimeters (Dmm) was computed by multiplying the circle diameter in pixels by pixmm.

  vii) The mean fruit size in millimeters (mFS) was computed as the average of all Dmm values obtained for the analyzed bbox. Clearly, if no circle was detected, no fruit size estimation was possible.
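The steps above can be sketched in Python. This is a minimal illustration, not the authors' actual code: the FoV values are approximate figures for the D435i RGB stream, the Hough detection itself (step iv) is abstracted away as a list of candidate circles `(cx, cy, r)` in bbox pixel coordinates, and the size-consistency filter of step v) is implemented against the larger bbox side (one possible reading of the text).

```python
import numpy as np

# Assumed camera parameters (approximate D435i RGB FoVs; the real script
# would query the device intrinsics) and the RGB resolution used in the study.
FOV_H_DEG, FOV_V_DEG = 69.4, 42.5
FRAME_W, FRAME_H = 1920, 1080

def mean_fruit_distance(depth_bbox):
    """Step ii: average of non-zero depths in a centered, half-sized window."""
    h, w = depth_bbox.shape
    win = depth_bbox[h // 4: h - h // 4, w // 4: w - w // 4]
    vals = win[win > 0]
    return float(vals.mean()) if vals.size else None

def pix_mm(mfd_mm):
    """Step iii: Eq. 1 for both FoVs, then mm-per-pixel from the scene area."""
    sd_h = 2 * mfd_mm * np.tan(np.radians(FOV_H_DEG) / 2)
    sd_v = 2 * mfd_mm * np.tan(np.radians(FOV_V_DEG) / 2)
    return float(np.sqrt((sd_h * sd_v) / (FRAME_W * FRAME_H)))

def filter_circles(circles, bbox_w, bbox_h):
    """Step v: keep circles centered (+/-20%) with bbox-consistent diameter."""
    kept = []
    for cx, cy, r in circles:
        centered = (abs(cx - bbox_w / 2) <= 0.2 * bbox_w
                    and abs(cy - bbox_h / 2) <= 0.2 * bbox_h)
        max_dim = max(bbox_w, bbox_h)
        consistent = max_dim / 2 <= 2 * r <= max_dim
        if centered and consistent:
            kept.append((cx, cy, r))
    return kept

def mean_fruit_size(circles, mm_per_px):
    """Steps vi-vii: convert kept circle diameters to mm and average them."""
    if not circles:
        return None  # no circle detected -> no size estimate
    return float(np.mean([2 * r * mm_per_px for _, _, r in circles]))
```

With these FoVs, a fruit at mFD = 1000 mm gives roughly 0.7 mm per pixel, so a circle of 30 px radius maps to a fruit of roughly 43 mm.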

Fig. 1

Top: Data collection set-up (left) and YOLO inference results for apple and Tball detections (right). Bottom: process for fruit sizing, from left to right: enlarged cropped RGB bbox and related depth map [square = area used in mFD computation]; circles (green) remaining after step v) [square = area limit for center filtering; red dots = circle centers]; representation of the resulting mean circle obtained after the step vii) calculations

The same process was also applied to the Tballs to check the effect of irregular fruit shape. For Tballs, manual annotations were used as reference, and Ystd was used as an example of a possible detector for these objects, given its good performance in detecting them. It is worth mentioning that all available classes of the Ystd model were used in the detection, and only the detections with the highest IoU with the manual annotations were kept as valid Tball bboxes for later evaluations.

Data analysis

A brief comparison between the results obtained applying Yc and Ystd was carried out to check for performance alteration due to the chosen detector.

A ‘fruit detection rate’ was computed as the ratio between the number of fruits detected by the YOLO model and the number of target fruits (i.e., manual annotations).

Similarly, a ‘circle detection rate’ (CDr) was computed to evaluate the success rate of the system in sizing the detected fruits; in this case, the rate was computed as the ratio between the number of detected circles (maximum one per fruit) and the number of fruits detected by the considered detection model.

An overall parameter called ‘sizing rate’ (SR) was computed (Eq. 2) to account for the cumulative effect of fruit detection rate and circle detection rate.

$${\text{SR}_{\text{Y}}} = {\text{CircD}_{\text{Y}}}/{\text{Ref}_{\text{N}}}$$
(2)

Where SRY is the sizing rate using a detector ‘Y’, CircDY is the number of circle detections among the fruit detected by means of Y detector, and RefN is the number of target fruits manually annotated.
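The three rates can be written as one-line functions; note that, with these definitions, SR factorizes as the product of the fruit detection rate and the circle detection rate (detected/target × circles/detected = circles/target). This is an illustrative sketch with hypothetical function names, not the authors' code.

```python
def fruit_detection_rate(n_detected, n_target):
    """Fruits detected by the model over manually annotated target fruits."""
    return n_detected / n_target

def circle_detection_rate(n_circles, n_detected):
    """Sized fruits (max one circle each) over detected fruits."""
    return n_circles / n_detected

def sizing_rate(n_circles, n_target):
    """Eq. 2: SR_Y = CircD_Y / Ref_N (as a fraction; multiply by 100 for %)."""
    return n_circles / n_target
```

For example, with 24 target fruits, 23 detections and 22 sized fruits, SR = 22/24 ≈ 92%, which equals (23/24) × (22/23).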

Fruit sizes estimated with the presented algorithm were then compared with the reference (manually collected) fruit sizes. Pearson’s correlation, mean error (mE) and root mean squared error (RMSE), as well as their percentage values, were calculated between estimated and actual fruit size to evaluate the sizing performance of the developed CVS.
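These metrics can be computed in a few lines; the sign convention (estimated minus reference, so positive mE means oversizing) and the use of the mean reference size for the percentage values are assumptions of this sketch, not stated by the authors.

```python
import numpy as np
from scipy.stats import pearsonr

def sizing_metrics(estimated, reference):
    """Pearson r, mE and RMSE (with % values) between estimated and
    reference fruit sizes, both given in millimeters."""
    est = np.asarray(estimated, dtype=float)
    ref = np.asarray(reference, dtype=float)
    err = est - ref                      # positive values = oversizing
    me = err.mean()
    rmse = np.sqrt(np.mean(err ** 2))
    r = pearsonr(est, ref)[0]
    mean_ref = ref.mean()
    return {"r": r,
            "mE_mm": me, "mE_pct": 100 * me / mean_ref,
            "RMSE_mm": rmse, "RMSE_pct": 100 * rmse / mean_ref}
```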

Statistical analyses for each tested date and distance were carried out by applying a Student’s t-test or a Mann-Whitney test between predicted and actual fruit size, using the SciPy Python library (Virtanen et al., 2020). The test type was chosen based on whether the data fulfilled the test assumptions (i.e., normal or non-normal distribution of the data).
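A possible implementation of this per-date test selection is sketched below. The use of the Shapiro-Wilk test for the normality check is an assumption of this sketch; the text only states that the test type followed the normality of the data.

```python
from scipy import stats

def compare_sizes(predicted, reference, alpha=0.05):
    """Student's t-test when both samples pass a Shapiro-Wilk normality
    check (assumed here), Mann-Whitney U otherwise.
    Returns the chosen test name and its p-value."""
    normal = (stats.shapiro(predicted)[1] > alpha
              and stats.shapiro(reference)[1] > alpha)
    if normal:
        name, res = "t-test", stats.ttest_ind(predicted, reference)
    else:
        name, res = "Mann-Whitney", stats.mannwhitneyu(predicted, reference)
    return name, float(res.pvalue)
```

Dates with p > 0.05 would be reported as ‘ns’ (no significant difference between predicted and actual fruit size), as in Fig. 7.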

Results and discussion

Fruit detection performance

The Yc model used for fruit detection performed well, with an overall detection rate of 94% across all dates; detection performance appeared better at 1.0 m (97%) than at 1.5 m (91%). During the season, detection rates seemed to increase from one collection time to the next, a pattern more identifiable at 1.5 m than at 1.0 m. However, this was only an apparent trend, with no correlation between fruit size (i.e., a proxy of time passing) and fruit detections at either distance, nor across all data. It is worth remembering that these results were obtained considering only the targeted fruits in the image (i.e., highly visible, manually annotated, and sized), not all visible fruits; detection performance could therefore differ if computed over all visible fruits. The detector (Yc model) was trained to identify only well-exposed fruits, with a visibility level consistent with that of the manually annotated fruits in the training dataset. Hence, detection rates can also be altered by changing the sensitivity of the detector to the ‘fruit visibility’ feature.

It is worth mentioning that the authors exploited only the RGB data from the depth camera for fruit detection. As reported by Fu et al. (2020), different approaches exploiting depth information, alone or in combination with RGB, could be used to try to improve performance. The authors tested RGB segmentation based on depth-map distance thresholding, but the results were unusable owing to excessive RGB data reduction. The main reason was the irregular distance of the tree canopy (and the fruits in it) from the camera, which left many black pixels after thresholding, thus reducing fruit detection performance. In addition, detected fruits were altered along their perimeter by the thresholding, which was likewise affected by the irregular distance from the camera. Because of this, and considering the aforementioned results obtained using only RGB images, the authors decided not to exploit depth information in the detection process.

The fruit detection rate of Yc was compared with that of the Ystd model to investigate the results described above more deeply. Ystd performed similarly, with an average fruit detection rate of 97% (vs. 94% for Yc), while at single distances it performed better at 1.5 m (100%) than at 1.0 m (95%), contrary to what was observed for Yc. This may confirm the effect that the training set has on the model output, altering features such as visibility. The analysis of each timing identified the main driver of this opposite behaviour: the low detection performance of Ystd on the 30th of August and the 12th of October. On those days, Ystd detected fewer than 80% and 20% of the target fruits, respectively; no correlation with either fruit size or colour appeared to be present on these dates. Regarding Tballs, Ystd was used as detector and performed quite well, with a global detection rate of 99% (99% at 1.0 m and 98% at 1.5 m). Yc was tested as well, and rates of 91, 92 and 89% were obtained, showing the importance of model customization but also the effect of object features on detection performance; despite not having been trained to detect Tballs, in 90% of the cases Yc confused Tballs for apples, the main feature common to the two objects being their shape.

Circle detection performance

Considering the sizing algorithm described, a circle detection on the detected fruits is needed to identify the borders of the physical object and thus size the fruit. To evaluate only the performance of the tuned HoughCircles detection algorithm, a circle detection rate was computed on the reference fruits identified by the image annotations (i.e., manual reference), avoiding missed detections caused by Yc and Ystd. In this condition, the circle detection rate (CDr) was 97% at 1.5 m and 98% at 1.0 m. Doing the same for the Tballs resulted in a 100% rate at both distances, highlighting the effect of the object’s high contrast and bright colour in identifying a circle within the object’s bbox.

Well-performing object detection algorithms should create detection bboxes in a consistent manner, with the detected object properly centered in them. Considering this, the same CDr was computed using the bboxes generated by the Yc model instead, to investigate possible effects introduced by the detection model. The resulting CDrs were 96, 98 and 98%, respectively, at 1.5 m, at 1.0 m, and globally on the entire dataset (ED). These results align with those presented earlier, suggesting that automating fruit detection with the YOLO model neither enhances nor diminishes the circle detection rate, as hypothesized.

A comparison of CDrs with the Ystd model was carried out to investigate the above more deeply. As done for Yc, the CDr of Ystd against the Ystd fruit detections was computed to check for an effect of the bbox geometry on the later circle detection. Results marginally lower than both the customised model (Yc: 98, 96 and 98% for 1.0 m, 1.5 m and ED) and the manual reference (98, 97 and 98% for 1.0 m, 1.5 m and ED) were obtained, with values of 92, 98 and 95%, respectively, for 1.0 m, 1.5 m and ED. Compared to Yc, Ystd created bboxes with different geometries around the fruits, resulting in more bboxes at 1.0 m with aspect ratios not optimal for the later circle detection, as can be seen in Fig. 2. On the contrary, at 1.5 m the bboxes are similar to both Yc and the manual reference. It can be argued that this behaviour relates to the object dimensions in the datasets used to train the different models, but considering the high CDr shown by both models (> 90%), it may also be due to chance and to variable lighting conditions.

When testing CDr on Tballs, results were 100% for Tball bboxes created both manually and by the Yc model, while with the Ystd detector circle detection was slightly reduced (98% globally), supporting the hypothesis of an effect of the bbox dimension/aspect ratio on circle detection.

Fig. 2

Left: Fruit detection using the Yc model (cyan bboxes) and the Ystd model (pink bboxes). Ystd missed detections of target fruits (tagged) and false positive detections (untagged) can be seen; different bbox aspect ratios (ar) are clearly visible. Right: Violin plot of bbox aspect ratios (ar): Ystd shows higher variability at 1.0 m, reflecting the opposite results of Yc and manual in circle detection

Sizing rate

To evaluate the actual fruit sizing performance of the presented methods, Eq. 2 was applied. In general, the CVS achieved good results (Fig. 3), with an average SR across all dates of 92%. The best result, 96%, was obtained at 1.0 m, while 1.5 m showed a value of 88%. The mismatch in circle detection performance between the two distances is not clearly explained. A possible explanation relates to the object dimension, which is inversely related to the distance from the camera: objects further away (1.5 m) are described by fewer pixels than closer ones, and smaller objects could be more difficult to analyze since they carry, theoretically, less information. As a result, the cv2.HoughCircles algorithm could fail more easily if configured with predefined parameters that do not account for pixel resolution and object-camera distance. Consequently, the optimisation of the HoughCircles parameters should be based on the combination of fruit size and object-camera distance. In fact, the analysis of single dates seems to confirm this: the fewest circle detections in the first 5–8 dates, at both 1.0 m and 1.5 m, occurred exactly when the real fruit size was smaller than at later timings. Additionally, since the RGB information is fed to the HoughCircles algorithm as a single-channel 8-bit matrix, ambient lighting conditions could strongly affect circle detection: brighter spots in the image can be identified as circles, making it challenging to identify the circle describing the real physical object boundary.

Analyzing Tballs, which keep the same shape and dimensions along the season, no difference was found between 1.0 m (SR = 97%) and 1.5 m (SR = 100%), supporting the point above on the need to account for fruit dimension and camera distance when tuning the HoughCircles parameters.

Finally, repeating the comparison with Ystd, SR resulted in 87, 98 and 92%, respectively, for 1.0 m, 1.5 m and ED. Ystd as a detector demonstrated behaviour opposite to Yc (96, 88 and 92%, respectively, for 1.0 m, 1.5 m and ED), mainly due to the previously mentioned missed fruit detections, which occurred on the 7th of July and the 12th of October at 1.0 m only.

The fruit detection performance of the CVS is suitable for field applications (> 90% detection rate) and in line with other research focusing on fruit detection (Fu et al., 2020; Kuznetsova et al., 2021; Xiao et al., 2023). The comparison with Ystd showed similar performance, revealing the capability of open-source models to directly detect fruits, which favors counting applications with no need for customized detection models. Since this research focuses not on object counting but on dimensioning, model customization was needed to size the fruits with the HoughCircles circle detection algorithm. In fact, Ystd obtained only slightly lower circle detection rates than Yc and the manual reference, but its bbox dimensions proved unsuitable for the later circle detection and sizing (Fig. 2). To avoid this, and to avoid fruit occlusions, it was decided to focus the detectors only on highly visible fruits instead of dealing with occluded ones, as done by other researchers using complex and sometimes computationally intensive algorithms (Chen et al., 2021; Ferrer-Ferrer et al., 2023; Gené-Mola et al., 2023).

The overall sizing rate of the system was good, with 88 and 96% of the target fruits detected and sized at 1.5 and 1.0 m, respectively, highlighting a potentially good ability to detect and size highly visible fruits in the field. As mentioned above, these results are specific to the target fruits. However, when considering the entire orchard fruit population and sizing ten fruits per plant across the entire orchard, the manual sample size used for monitoring fruit size and growth (Manfrini et al., 2015) would scale up to thousands of fruits per hectare. This approach better represents the orchard status and its spatial variability, particularly when paired with GNSS data, thus favoring precision orchard management techniques (Longchamps et al., 2022; Manfrini et al., 2020).

Fig. 3

Sizing rates of the CVS per each date and distance

Sizing estimation

The Pearson correlation between the reference fruit size and the CVS-estimated one was investigated. Generally, the best results were found at 1.5 m, which always presented slightly higher r values in each tested condition. As a baseline, the results obtained on manually annotated fruits were good, with r values of 0.83 and 0.88 for 1.0 and 1.5 m, respectively. Using the Yc model for fruit detection slightly improved the correlation (0.84 and 0.89, respectively), while this was not the case for the Ystd model, which presented the same values as the manually labelled fruits. As the scatter plot of actual versus estimated fruit size in Fig. 4 shows, outliers are present and results could be improved by statistical cleaning, but the authors decided to keep them in order to evaluate the actual performance of the CVS, including the outlier values that can occur in real field conditions.

Fig. 4

Scatterplot of reference and CVS estimated fruit size per each tested distance

To evaluate CVS sizing performance, statistical metrics such as mE (Fig. 5) and RMSE (Fig. 6) were computed (Table 1). Error distributions were skewed towards oversizing, with an average mE of 5.70 mm (9%). The best results were always found at 1.0 m compared to 1.5 m, with seasonal mE values of 4.40 mm (7%) and 7.60 mm (11%), respectively.

This oversizing trend also holds for the manual and Ystd fruit detections (data not shown), with slightly higher values of 6.20 mm (1.0 m), 8.70 mm (1.5 m) and 7.30 mm (whole data), and 4.80 mm (1.0 m), 7.80 mm (1.5 m) and 6.10 mm (whole data), respectively. Differently from manual and Yc, Ystd showed negative mE on a few dates at 1.5 m only (data not shown), suggesting a potential underestimation by Ystd and its field application only at 1.0 m. Considering the consistency with which the Yc model overestimates fruit size, with a large enough dataset a regression model could be developed to fine-tune the CVS-estimated size against the actual one; this would be less useful for Ystd, whose mE values span from −15 to +10 mm, highlighting a lower accuracy.

Fig. 5

Mean error (mE) barplot per each collecting date, with dashed lines representing average mE per each of the distances tested (with respect to the color legend) and among all data (black)

Fig. 6

Root mean squared error (RMSE) barplot per each collecting date, with dashed lines representing the average RMSE per each of the distances tested (with respect to the color legend) and among all data (black)

Table 1 CVS sizing performances with errors and results from statistical analysis (T-student or Mann Whitney based on the data normal distribution)

Regarding the RMSE evaluation, results showed a bigger error than mE, with average RMSEs of 9.60 mm (14%), 10.50 mm (16%) and 10.00 mm (15%), respectively, for 1.0 m, 1.5 m and the whole dataset. The higher RMSE values are mainly due to the effect of outliers. Data cleaning would probably have improved the RMSE results but, as stated above, it would not reflect real-world testing scenarios. Again, Yc proved the best solution compared to manual (11.1, 11.5 and 11.3 mm, respectively, for 1.0 m, 1.5 m and the whole dataset) and Ystd (10.6, 11.2 and 10.9 mm, respectively), but with a smaller gap (around 1 mm) than observed for mE.

The analysis of both mE and RMSE showed no trend or correlation between fruit size and sizing errors at either distance or at any date. Regarding the presented sizing errors, the literature shows a wide range of performances depending on methodology and equipment, but the results are approximately in line with studies based both on 3D reconstruction (Gené-Mola et al., 2021; Tsoulias et al., 2020) and on image analysis (Ferrer-Ferrer et al., 2023; Lu et al., 2022). Nevertheless, researchers are active in this field, and better sizing results can be found in more recent literature (Miranda et al., 2023; Neupane et al., 2023). The developed CVS performance is currently not properly suitable for field application and is far from farmers’ high expectations (i.e., sizing errors < 1–2 mm). The obtained results are not in line with the authors’ expectations either, since previous results obtained on a subset of the here-analysed dataset showed an average mE of 2.7–3.3 mm and an average RMSE of 7.5–8.7 mm (Bortolotti et al., 2023). These discrepancies led the authors to speculate about potential factors that might increase sizing errors, including environmental conditions such as lighting or fruit colour, which could significantly affect detection and sizing accuracy with the proposed circle-detection-based method. To investigate this, a visual inspection of the analyzed images was carried out. It was observed that the pose of the fruit relative to the camera naturally changes throughout the season due to the increase in fruit weight. The manual references, against which sizing errors are calculated, are taken at the fruit’s maximum equatorial diameter, regardless of fruit pose and position. Conversely, the CVS-estimated fruit size depends primarily on the fruit’s perspective as seen from the camera, and is thus influenced by its pose (Mirbod et al., 2023).
This observation led to the inference that a primary cause of the system’s high error levels could be the differing methods of diameter measurement: human operators find the desired diameter, whereas the CVS is limited to the perspective presented to it. To investigate this, the Tballs were exploited, since their sphericity means their pose should not alter the size estimate. The results on manually annotated Tballs do not confirm fruit pose as the main source of error: also with Tballs, sizing errors remained high, with mE values of 8.7, 12.7 and 10.1 mm (for 1.0 m, 1.5 m and the whole data) and RMSE values of 13.5, 16.2 and 14.5 mm, together with a wide error range (minimum-to-maximum = −11.2 to +56.1 mm). Thus, the proposed methodology seems to have an intrinsic source of error of the magnitude exposed in this paragraph. That said, it is worth mentioning that the HoughCircles algorithm was not tuned for Tball detection, which could be one reason for the high error shown when testing them. A main source of the presented sizing errors probably lies in the HoughCircles algorithm itself, which needs its parameters adapted to the tested conditions (variable illumination, fruit colour and size) to properly identify the circle best describing the fruit perimeter; this was not the case here, since the same fixed parameters tuned in previous trials were used.

In addition to mE and RMSE, a statistical analysis between predicted and actual fruit size was undertaken to infer CVS performance when upscaling the sample size to the orchard level. The statistical analysis was run only for the Yc model, since it performed best, as presented above. Fig. 7 shows the actual and predicted mean fruit size per collection timing, with the related error bars. The CVS follows fruit growth along the season well, while always keeping a wide gap from the actual fruit size (as the mE values suggested). Nevertheless, at the 1.0 m distance only, the estimated fruit size is not statistically different from the reference in 5 timings (‘ns’ in Fig. 7). So, at the closest distance, the system estimates a mean fruit size close to the reference in about one third of the collection dates (N = 17, Table 1).

Fig. 7

Plot reporting actual (Ref) and CVS-predicted (Pred) fruit size for both the 1.0 m (blue) and 1.5 m (orange) distances. Statistical analysis results between actual and predicted fruit size are reported as ‘*’ (p < 0.05) or ‘ns’ (p > 0.05) for each date, with respect to the treatment color

To investigate and speculate about performance when modifying the sample size, random sampling with replacement (a strategy used in bootstrapping) was adopted. Resampling was done randomly so as to match the smaller sample between the reference and the prediction. Results proved highly variable: repeating the statistical analysis 100 times with the random sampling strategy, the number of dates on which an ‘ns’ value appeared varied from 2 to 11 (out of 17), and the dates presenting ‘ns’ changed from iteration to iteration. For the 1.5 m distance, 0 to 2 (minimum to maximum) dates resulted in ‘ns’, across 5 different timings (7th and 20th of July; 1st, 11th and 24th of August) among all 100 iterations; for the 1.0 m distance, 2 to 9 dates resulted in ‘ns’, spanning all dates except the 17th and 23rd of June. Considering this, it can be speculated that better results at 1.0 m could be obtained in a real field scenario, where a larger sample would probably present higher variability than the fruit sample of this trial.
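The resampling procedure described above can be sketched as follows. This is a minimal illustration, not the original analysis script: the use of Welch’s two-sample t-test is an assumption (the test actually used is not restated here), and the per-date loop over 100 iterations simply counts how often the comparison returns ‘ns’.

```python
import numpy as np
from scipy import stats

def compare_one_date(ref, pred, rng, alpha=0.05):
    """Down-sample the larger group (random sampling with replacement)
    to match the smaller one, then test for a difference in means.

    Welch's t-test is an illustrative choice, not necessarily the
    test used in the study. Returns True when the result is 'ns'.
    """
    n = min(len(ref), len(pred))
    r = rng.choice(ref, size=n, replace=True) if len(ref) > n else np.asarray(ref)
    p = rng.choice(pred, size=n, replace=True) if len(pred) > n else np.asarray(pred)
    _, pval = stats.ttest_ind(r, p, equal_var=False)
    return bool(pval > alpha)

def count_ns_dates(dates, n_iter=100, seed=0):
    """For each of n_iter iterations, count how many dates come out 'ns'.

    `dates` is a list of (ref_sizes, pred_sizes) pairs, one per
    collection date; returns the per-iteration 'ns' counts.
    """
    rng = np.random.default_rng(seed)
    return [
        sum(compare_one_date(ref, pred, rng) for ref, pred in dates)
        for _ in range(n_iter)
    ]
```

Because the down-sampling is random, the set of ‘ns’ dates changes between iterations, which is exactly the variability (2 to 11 ‘ns’ dates out of 17) reported above.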

Since in some cases the ground-truth sample was down-sampled to the prediction sample size, markedly modifying the results, the authors also questioned the validity of the manual ground truth, since operators varied along the season and human operators are known to have low consistency in repetitive work. Therefore, a small trial was conducted to test operator consistency in collecting ground-truth data. The 24 tagged fruits plus another 26 fruits (50 in total) were measured by three different operators at 2 timings on the same day (11th of August). Unfortunately, it was not possible to determine the actual fruit size by detaching the fruits and analyzing them in the lab. To overcome this, a sizing error representing the measurement uncertainty was computed as the difference between the maximum and minimum values measured on each fruit among all operators. Results (data not presented) showed an error variability between +0.7 and +9.2 mm, with an overall mE of +3.2 mm and an RMSE of 3.7 mm (i.e., 5–6%).
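The measurement-uncertainty metric just described can be computed with a short NumPy sketch. The array layout assumed here (one row per fruit, one column per reading, i.e., 3 operators × 2 timings in the trial above) is an illustrative assumption, not the original data format.

```python
import numpy as np

def operator_uncertainty(measurements):
    """Per-fruit measurement uncertainty across operators.

    `measurements` is an (n_fruits, n_readings) array collecting every
    reading of each fruit. The per-fruit uncertainty is the max-min
    spread among readings (hence always >= 0); mE and RMSE are then
    computed over those spreads, in the same units as the input (mm).
    """
    m = np.asarray(measurements, dtype=float)
    spread = m.max(axis=1) - m.min(axis=1)  # per-fruit error, >= 0
    me = spread.mean()
    rmse = np.sqrt(np.mean(spread ** 2))
    return spread, me, rmse
```

Because the spread is non-negative by construction, mE and RMSE computed this way are close to each other, consistent with the +3.2 mm mE and 3.7 mm RMSE reported above.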

This uncertainty should be taken into account and regarded as the target for a CVS aiming to substitute a human operator in fruit-sizing tasks. Comparing the reference uncertainty just presented with the sizing error of the CVS, the latter shows higher values: approx. +2–3 mm (3–4%) of mE and approx. +10 mm (8–9%) of RMSE globally (average mE = +5.7 mm / 9%; average RMSE = 10 mm / 15%). Interestingly, at the 1.0 m distance, where the best results were found, these error values are quite close to the reference uncertainty for mE (+4.4 mm; 7%), while further away for RMSE (9.6 mm; 14%). Considering this operator uncertainty, the work presented by Ferrer-Ferrer et al. (2023), a multitask neural network model for automated fruit sizing that works on both RGB and depth data, could be considered accurate enough to substitute human operators. Those authors did not carry out a statistical analysis between reference and estimated fruit size, but their results (the mE reported for each timing is approximately 1 mm) make it plausible that their system correctly estimates the mean fruit size at each timing.

Conclusions

The present study aimed to test, directly on the tree, the performance of a previously developed CVS exploiting a sizing algorithm based on circle detection (Bortolotti et al., 2023) for in-field, real-time fruit sizing. The CVS was tested on the complete dataset (Bortolotti et al., 2024), made up of RGB-D field images with fruit sizes ranging from approximately 45 to 85 mm, collected over 17 dates in the 2022 season.

The developed algorithm was tuned to identify only highly visible fruits, discarding occluded ones, so as to focus only on easy-to-size fruits. It showed good fruit detection (> 90%) and circle detection (98%) performance, with a sizing rate (SR) of 92% with respect to the target fruit number of this trial. Better SR results were generally found at the closer distance (1.0 m), but testing on spherical objects (Tballs) showed that distance did not affect SR. As a result, the parameters of the circle detection algorithm should be tuned to account for camera distance, object dimensions, and color. This tuning is essential for enhancing performance, avoiding the use of static parameters throughout the entire season, as done in this trial. A comparison with the open-source model Ystd showed the potential of a ready-to-use model for fruit detection and counting, but it was demonstrated that its bbox geometries differ from those of Yc, altering the downstream SR and sizing performance.

Good correlations (r > 0.8) were found between estimated and actual fruit size, despite a relatively high number of outliers, which were kept to stress performance in real conditions. Despite previous results, the sizing performance of the CVS was not in line with expectations, with an overall mE of +5.7 mm (9%) and an RMSE of 10 mm (15%). The authors hypothesized that camera perspective and fruit pose were the cause of this error, since these do not always match the manually measured equatorial diameter, but the test with Tballs ruled them out as error drivers. At the same time, it was pointed out that the manually collected fruit size reference presented an uncertainty of approximately 3 mm (5–6%), which should be considered in the performance evaluation of a system trying to reach human precision.

A brief evaluation of the possible effects of field sampling on the statistical ability of the system to estimate fruit size was carried out. Statistical analysis showed that, despite the errors, the algorithm found no significant difference (‘ns’) between estimated and actual fruit size in one third of the cases. Over 100 subsampling tests, the non-significant statistical results proved highly variable in terms of both the number of timings (0–9 times) and the dates on which they occurred (0–15 out of 17). This led the authors to speculate that system performance could be acceptable at the whole-orchard level.

In conclusion, the present study shows the potential and limitations of a fruit sizing algorithm based on a HoughCircle detector and distance information. The main outcomes for the algorithm are: the importance of customized fruit detectors able to create fruit bboxes with geometries suitable for the subsequent circle detection phase; the need to adapt the HoughCircle parameters to object size, camera distance, and color; and the problem of natural field illumination (which could be controlled artificially). Other noteworthy outcomes are the human operators’ uncertainty in this kind of “over-the-season” trial, the statistical effect of random subsampling, and the related possibility of acceptable system performance at the orchard level.

Future work will aim to improve and test the presented CVS on a large scale, as well as to investigate its suitability for fruit growth estimation (Manfrini et al., in press). The authors will also investigate other types of image analysis not relying solely on standard features such as color and shape derived from RGB information.