1 Introduction

The growing New Zealand kiwifruit industry has achieved export sales of $2.599 billion NZD during the 2020/21 season (Zespri 2020) and the revenue is expected to be $6 billion NZD by 2030 (NZKGI 2020). With this rapid growth, kiwifruit growers are experiencing challenges associated with labour shortages and rising labour costs. Labour shortages have become apparent during the 2018 harvesting season, as declared by the New Zealand Ministry of Social Development of an official seasonal labour shortage of 1200 people for six weeks (Msd 2018). This shortage is likely to worsen with a projected 7000 more seasonal workers required for the kiwifruit industry by 2027 (NZKGI 2018). The second problem is rising labour costs. Pruning, thinning, and picking wages make up approximately half of the total orchard working expenses based on a 2011/12 orchard model. The New Zealand minimum wage has significantly increased recently. The Government has largely increased the minimum wage by 2021 (Statistics New Zealand 2018). Of the 2021 labour expenses, pruning wages are the greatest at $9,700 NZD/ha, followed by picking wages at $3,922 NZD/ha (Mpi 2021).

Robotics and automation in agriculture can play a key role in the long-term solution to both the labour shortage and rising labour cost. As a result, an autonomous kiwifruit harvesting system is developed in-house as shown in Fig. 1. The configuration parameters of this harvesting system can be found in references (Williams et al. 2019, 2020). The system consists of four identical modules. Each module has a robot and a stereo vision system. The robot has a manipulator for moving the gripper, which grabs the kiwifruit and a corrugated tube for delivering the kiwifruit to the container. Although each module has a stereo vision system, it is not just dedicated to the robot of that module. The information gathered by the stereo vision system in each module will be merged together by a scheduler, which assigns the fruit to be harvested to each robot. This approach can fully utilize the resources to complete the task with the minimum time.

Fig. 1
figure 1

The kiwifruit harvesting system with four robot modules

This autonomous kiwifruit harvesting system has been employed for fruit harvesting in various kiwifruit orchards. The harvesting task by robot in an orchard is quite similar to the pick and place operation in the manufacturing environment, but it is more challenging due to uncontrollable dynamic conditions in orchards. For instance, no two canopies are the same, the robot base is not fixed as it has to navigate the orchards with many obstacles to delicately grab and place a fruit.

One of the significant factors that affects the economic viability of the system is the successful harvesting rate of the total detached fruit. All the fruit left in the canopy and dropped on the ground are considered unsuccessful harvesting, and thus, dropped kiwifruit during harvesting have a significant impact on the orchard profit. Fruit drop study is commissioned to investigate the number and causes of unsuccessful harvesting. A total of 1456 harvesting attempts under various operation conditions in ten hours of video footage is watched and recorded manually. The attempts are categorized into picked, dropped and unharvested as listed in Table 1.

Table 1 The categories for harvesting attempts in the previous video footages

The main reason for fruit left in the canopy and dropping on the ground is position error between the gripper and the fruit location. This is a combination of inaccuracies due to fruit localization by the stereo vision system, kinematics of the robot and the coordination between the vision system and the robot. Calibration is one of the approaches to remedy these inaccuracies.

This article presents an integrated approach to obtain the calibration data and perform these calibration processes for the autonomous kiwifruit harvesting system so that its positional errors are reduced. Field tests are carried out in the kiwifruit orchard. The performance of the calibrated harvesting system has shed light on future visual sensing system research for an autonomous fruit harvesting system.

2 Related work

Various robots are developed for harvesting different fruits such as tomatoes (Liu et al. 2015), strawberry (Xiong et al. 2020), apples (Silwal et al. 2017), kiwifruit (Mu et al. 2020) etc. These robots are demonstrated in the lab environment and they are prototypes for design concept proofs. Most of them are just one robot arm with a camera for sensing a target fruit. They are not fully autonomous to sense over hundreds of fruits in the canopy simultaneously and determine the harvesting order.

In order to develop an industrial scale autonomous harvesting system, harvesting speed and successful harvesting rate are two major factors to be considered. The system should have multiple robot arms to perform the harvesting task simultaneously so that the harvesting speed is comparable with the manual harvesting speed. This implies that the canopy view captured by the sensing system such as a set of stereo vision systems must be as much as possible, since an autonomous robotic system for kiwifruit harvesting must first perceive and then correctly position the gripper precisely to perform a task. Furthermore, a scheduler is required for efficient fruit assignment to each robot. The information flow for harvesting kiwifruit with four modules is outlined in Fig. 2.

Fig. 2
figure 2

Information flow in the kiwifruit harvesting system with four modules

The harvesting robotic system is positioned under the canopy to capture the kiwifruit distribution. The canopy is imaged by both camera L and R of the stereo vision system in each module. Since the images obtained from the cameras are distorted due to the lens geometrics, the images will be undistorted before the kiwifruit are detected based on the calyx recognition by using a fully convolution neural network (FCNN). The positions of the detected kiwifruit are calculated with respect to the world frame. These positions from each module are fed into a scheduler so that all the fruit positions are merged. Then the kiwifruit is assigned to each robot arm for harvesting based on the minimum distance between the fruit and the current gripper position. Once the kiwifruit picking order is scheduled, the corresponding fruit position with respect to the world frame is transformed to the robot base frame. For each fruit position, the joint parameters are computed by inverse kinematics. The arm attempts to drive the gripper to harvest the kiwifruit when it receives the joint parameters sequentially.

The high successful harvesting rate refers to positioning the gripper accurately to the target kiwifruit for detachment from the canopy. When there is a discrepancy between the gripper position and the actual kiwifruit position, the fruit may not be correctly grabbed. These errors in the system have a negative impact on the successful harvest rate which determine the economic viability of the robotic system. A common approach for mitigating the positional inaccuracies is through calibration so that positional difference between the commended position and the actual position is reduced. Stereo vision system calibration, kinematic calibration and world—robot base frame transformation estimation are the major calibrations for reducing the positional inaccuracies.

A stereo vision system consists of two cameras for sensing the depth information of a point. The computation is only related to the image locations in two cameras with their intrinsic and extrinsic parameters (Feng and Fang 2021). As a result, camera calibration is needed to obtain these parameters prior to the stereo vision system calibration. A pinhole camera model is used for modelling a camera (Brown 1971) which relates a 3D point with its image location on the image plane by a projection matrix. The pinhole camera model is not directly applicable when the camera lens causes a nonlinear distortion in the image. A simplified polynomial model (Tsai 1987; Zhang 2016) is used for image undistortion so that the computation efficiency can be raised. The parameters of this model have to be estimated together with the camera intrinsic and extrinsic parameters using Levenberg–Marquardt Algorithm (Zhang 2016; Moré 1987). The calibration process is equivalent to estimating the projection matrix using a set of 3D calibration points and their corresponding image locations. Various methods are employed to acquire the calibration points from a calibration tool (Tsai 1987; Zhang 2016).

Calibrating a stereo vision system refers to estimating the relative position and orientation of one camera with respect to the other camera. This can be determined from their orientations and positions obtained in the individual camera calibration procedures (Zhang et al. 2009; Cui et al. 2016). On the other hand, this information is also completely described by the epipolar geometry of the stereo vision system (Zhang 1998) which relates the image locations on the right and left image plane of a 3D point by a fundamental matrix. The orientation and position of one camera expressed in the other camera coordinate frame can be obtained directly from the fundamental matrix (Zhang 2016). Furthermore, stereo camera calibrations are also developed based on the neutral network relating the 3D point with its image location (Memon and Khan 2010). It is not necessary to model the distortion separately in this approach, but the noise and instabilities are also included in the network.

The major error in the position and orientation of the end–effector is due to the robot component tolerances and clearances which are termed geometric variation. The error can be expressed in terms of the geometric variation through the kinematic model of a robot. Kinematic calibration is a process of obtaining the geometric variation of a robot so that the error of the end–effector position and orientation is minimized. The kinematic model of a robot depends upon its mechanical structure. For instance, it can be obtained by using Denavit–Hartenberg parameters (Rocha et al. 2011). On the other hand, the non-geometric parameters such as the manipulator stiffness (Lightcap et al. 2008) and the compliance in the joints can also be included in the kinematic model (Nubiola and Bonev 2013). As a set of calibration points is needed for calibration, various methods are proposed to acquire these points. These include dial gauges (Xu and Mills 1999), laser trackers (Park et al. 2008) and 3-D camera (Motta et al. 2001) etc.

In summary, the modelling of various calibrations with the calibration data leads to a system of linear equations. In order to eliminate the noise influence, this system of linear equations is overdetermined. The research focuses on the autonomous robotic system accuracy including the coverage of the process and component models, computation efficiency of solving the system of equations and approaches of acquiring calibration data so that the error accumulation can be minimized.

3 Mathematical models for calibration

The mathematics for formalizing the calibration of the stereo vision system, kinematic and integration of both systems are briefed. Figure 3 depicts the configuration of a module which is part of the autonomous kiwifruit harvesting system. The robot has a manipulator and a gripper while the stereo vision system consists of two cameras R and L. There are four coordinate frames. \(\left\{{{\varvec{F}}}_{w}\right\}\) is the world frame, \(\left\{{{\varvec{F}}}_{b}\right\}\) is the robot base frame, \(\left\{{{\varvec{F}}}_{r}\right\}\) and \(\left\{{{\varvec{F}}}_{l}\right\}\) are the frame of camera R and L respectively. Furthermore, the frame \(\left\{{{\varvec{F}}}_{r}\right\}\) of camera R is also the stereo vision system frame. The relationships among these four frames \(\left\{{{\varvec{F}}}_{w}\right\}\), \(\left\{{{\varvec{F}}}_{b}\right\}\), \(\left\{{{\varvec{F}}}_{r}\right\}\) and \(\left\{{{\varvec{F}}}_{l}\right\}\) are:

$$\left\{{{\varvec{F}}}_{l}\right\}={}_{l}{}^{r}{\varvec{T}}\bullet \left\{{{\varvec{F}}}_{r}\right\}$$
$$\left\{{{\varvec{F}}}_{r}\right\}={}_{r}{}^{b}{\varvec{T}}\bullet \left\{{{\varvec{F}}}_{b}\right\}$$
$$\left\{{{\varvec{F}}}_{w}\right\}={}_{w}{}^{b}{\varvec{T}}\bullet \left\{{{\varvec{F}}}_{b}\right\}$$

where \({}_{j}{}^{i}{\varvec{T}}\) is a transformation matrix of frame \(\left\{{{\varvec{F}}}_{i}\right\}\) with respect to \(\left\{{{\varvec{F}}}_{j}\right\}\).

Fig. 3
figure 3

A module of the autonomous kiwifruit harvesting system

A point \({\varvec{Q}}\) located on the xy plane of world frame \(\left\{{{\varvec{F}}}_{w}\right\}\) has a position vector \({{\varvec{Q}}}_{w}={\left[\begin{array}{ccc}{x}_{w}& {y}_{w}& 0\end{array}\right]}^{T}\). Its image \({\varvec{q}}\) is on the image plane which is at the focal length f along the z axis of the camera frame \(\left\{{{\varvec{F}}}_{r}\right\}\). It has a position vector \({{\varvec{q}}}_{r}={\left[\begin{array}{cc}{u}_{r}& {v}_{r}\end{array}\right]}^{T}\) and \({{\varvec{q}}}_{l}={\left[\begin{array}{cc}{u}_{l}& {v}_{l}\end{array}\right]}^{T}\) on the image plane of camera R and L with point or and ol as the origin of the pixel frame respectively. The principal point, where the optical axis pierces the image plane, is at \(\left({o}_{x},{o}_{y}\right)\) from the origin o.

The stereo vision system calibration is split into two parts. The calibrations are performed on the cameras in the stereo vision system and the system itself.

3.1 Camera calibration

Camera calibration is to obtain the intrinsic and extrinsic parameters of the cameras. The intrinsic parameter matrix is

$${\varvec{K}}=\left[\begin{array}{c}{f}_{x}\\ 0\\ 0\end{array} \begin{array}{c}0\\ {f}_{y}\\ 0\end{array} \begin{array}{c}{o}_{x}\\ {o}_{y}\\ 1\end{array}\right]$$

which consists of the pixel focal lengths \({f}_{x}\) and \({f}_{y}\) of the camera (focal length f) in \(x\) and \(y\) direction of the image plane. The extrinsic parameters are the orientations and positions of the cameras.

Consider a point on the xy plane of the world frame \(\left\{{{\varvec{F}}}_{w}\right\}\) so that its z coordinate is zero. Camera R is at an orientation \({}_{r}{}^{w}{\varvec{R}}=\left[\begin{array}{ccc}{{}_{r}{}^{w}{\varvec{r}}}_{1}& {{}_{r}{}^{w}{\varvec{r}}}_{2}& {{}_{r}{}^{w}{\varvec{r}}}_{3}\end{array}\right]\) and a position \({}_{r}{}^{w}{\varvec{t}}\) expressed in the world frame \({F}_{w}\) such that \({}_{r}{}^{w}{\varvec{T}}=\left[\begin{array}{cc}{}_{r}{}^{w}{\varvec{R}}& {}_{r}{}^{w}{\varvec{t}}\\ \underline{0}& 1\end{array}\right]\). The point \({\varvec{Q}}\) at \(\left({x}_{w}, {y}_{w}, 0\right)\) in the world frame is related to its image \({\varvec{q}}\) at \(\left(u, v\right)\) in the pixel frame on the image plane as

$$s\left[\begin{array}{c}u\\ v\\ 1\end{array}\right]=\left[\begin{array}{c}{{\varvec{h}}}_{1}^{T}\\ {{\varvec{h}}}_{2}^{T}\\ {{\varvec{h}}}_{3}^{T}\end{array}\right] \left[\begin{array}{c}{x}_{w}\\ {y}_{w}\\ 1\end{array}\right]$$

where \(s\) is a scale factor and \({\varvec{H}}=\left[\begin{array}{ccc}{{\varvec{h}}}_{1}& {{\varvec{h}}}_{2}& {{\varvec{h}}}_{3}\end{array}\right]\) is the homography projection matrix and \({{\varvec{h}}}_{i}^{T}\) is the i–th row of homography \({\varvec{H}}\). The pixel coordinates \(\left(u, v\right)\) are un-distorted using Levenberg–Marquardt algorithm (Weng et al. 1992). The homography \({\varvec{H}}\) is computed by stacking Eq. (3) for each calibration point \({{\varvec{Q}}}_{w}^{i}\) (i = 1, 2, …, \(n\)) from a single view yields a system of homogenous equations

$${A}_{h}\bullet h=0$$

where \({{\varvec{A}}}_{h}\) is a \(n\times 9\) coefficient matrix and \({\varvec{h}}=\left[\begin{array}{c}{{\varvec{h}}}_{1}\\ {{\varvec{h}}}_{2}\\ {{\varvec{h}}}_{3}\end{array}\right]\). The homography \({\varvec{H}}\) can be estimated by solving Eq. (4). On the other hand, the homography \({\varvec{H}}\) can also be expressed as

$${\varvec{H}}={\varvec{K}}\bullet \left[\begin{array}{ccc}{{}_{r}{}^{w}{\varvec{r}}}_{1}& {{}_{r}{}^{w}{\varvec{r}}}_{2}& {}_{r}{}^{w}{\varvec{t}}\end{array}\right]$$

The intrinsic parameter matrix \({\varvec{K}}\) and extrinsic parameters \({}_{r}{}^{w}{\varvec{R}}\) and \({}_{r}{}^{w}{\varvec{t}}\) of a camera can be decomposed from homography \({\varvec{H}}\) using a number of \(m\) \(\left(m\ge 2\right)\) views. Each view j has a different homography \({{\varvec{H}}}^{j}\) (and so do \({{\varvec{R}}}^{j}\) and \({{\varvec{t}}}^{j}\)). However, K is the same for all views. The homography for each view, \({{\varvec{H}}}^{j}={\varvec{K}}\bullet \left[\begin{array}{ccc}{{}_{r}{}^{w}{\varvec{r}}}_{1}^{j}& {{}_{r}{}^{w}{\varvec{r}}}_{2}^{j}& {{}_{r}{}^{w}{\varvec{t}}}^{j}\end{array}\right]\) provides two linear equations in the 6 entries of the matrix \({\varvec{D}}={{\varvec{K}}}^{-T}\bullet {{\varvec{K}}}^{-1}=\left[\begin{array}{ccc}{{\varvec{d}}}_{1}& {{\varvec{d}}}_{2}& {{\varvec{d}}}_{3}\end{array}\right]\) (\(\because {\varvec{D}}\mathrm{ is a symmetric matrix})\).

Letting \({{\varvec{w}}}_{i}^{j}={\varvec{K}}\bullet {{}_{r}{}^{w}{\varvec{r}}}_{i}^{j}\), the rotation constraints \({\left({{}_{r}{}^{w}{\varvec{r}}}_{i}^{j}\right)}^{T}\bullet {{}_{r}{}^{w}{\varvec{r}}}_{i}^{j}=1\) (\(\forall i=1, 2)\) and \({\left({{}_{r}{}^{w}{\varvec{r}}}_{1}^{j}\right)}^{T}\bullet {{}_{r}{}^{w}{\varvec{r}}}_{2}^{j}=0\) implies

$${\left({{\varvec{w}}}_{1}^{j}\right)}^{T}\bullet {\varvec{D}}\bullet {{\varvec{w}}}_{1}^{j}-{\left({{\varvec{w}}}_{2}^{j}\right)}^{T}\bullet {\varvec{D}}\bullet {{\varvec{w}}}_{2}^{j}=0$$
$${\left({{\varvec{w}}}_{1}^{j}\right)}^{T}\bullet {\varvec{D}}\bullet {{\varvec{w}}}_{2}^{j}=0$$

Stacking 2 m equations from m views, a linear system of equations is yielded

$${{\varvec{A}}}_{z}\bullet {\varvec{d}}=0$$

where \({{\varvec{A}}}_{z}\) is a \(2m\times 6\) coefficient matrix and \({\varvec{d}}={\left[{d}_{11}\boldsymbol{ }\boldsymbol{ }\boldsymbol{ }{d}_{12} {d}_{22} {d}_{13} {d}_{23} {d}_{33}\right]}^{{\varvec{T}}}\) (which are the components of \({\varvec{D}}\)). The intrinsic parameter matrix \({\varvec{K}}\) can be obtained from \({\varvec{D}}\) by Cholesky decomposition.

$${\varvec{D}}={{\varvec{K}}}^{-T}\bullet {{\varvec{K}}}^{-1}$$

The extrinsic parameters for view j can be computed using K and \({{\varvec{H}}}^{j}=\left[\begin{array}{ccc}{{\varvec{h}}}_{1}^{j}& {{\varvec{h}}}_{2}^{j}& {{\varvec{h}}}_{3}^{j}\end{array}\right]\),

$${{}_{r}{}^{w}{\varvec{R}}}^{j}=\left[\begin{array}{ccc}{{\varvec{K}}}^{-1}\bullet {{\varvec{h}}}_{1}^{j}& {{\varvec{K}}}^{-1}\bullet {{\varvec{h}}}_{2}^{j}& {{\varvec{K}}}^{-1}\bullet {{\varvec{h}}}_{1}\times {{\varvec{K}}}^{-1}\bullet {{\varvec{h}}}_{2}\end{array}\right]$$
$${{}_{r}{}^{w}{\varvec{t}}}^{j}={{\varvec{K}}}^{-1}\bullet {{\varvec{h}}}_{3}^{j}$$

3.2 Stereo vision system calibration

The orientation and position of a camera in the stereo vision system relative to the other is estimated during the stereo vision system calibration process. As shown in Fig. 3, frame \(\left\{{{\varvec{F}}}_{r}\right\}\) of camera R is the stereo vision system frame, \({}_{l}{}^{r}{\varvec{R}}\) and \({}_{l}{}^{r}{\varvec{t}}\) are the orientation and position of camera L expressed in the frame \(\left\{{{\varvec{F}}}_{r}\right\}\) of camera R such that \({}_{l}{}^{r}{\varvec{T}}=\left[\begin{array}{cc}{}_{l}{}^{r}{\varvec{R}}& {}_{l}{}^{r}{\varvec{t}}\\ 0& 1\end{array}\right]\). \({{\varvec{K}}}_{r}\) and \({{\varvec{K}}}_{l}\) are the matrix \({\varvec{K}}\) (Eq. (2)) of camera L and R respectively, then

$$\left[{{\varvec{q}}}_{l}^{T} 1\right]\bullet {\varvec{F}}\bullet \left[\begin{array}{c}{{\varvec{q}}}_{r}\\ 1\end{array}\right]=0$$

where \({\varvec{F}}={\left({{\varvec{K}}}_{l}^{-1}\right)}^{T}\bullet {\varvec{E}}\bullet {{\varvec{K}}}_{r}^{-1}=\left[\begin{array}{ccc}{{\varvec{f}}}_{1}& {{\varvec{f}}}_{2}& {{\varvec{f}}}_{3}\end{array}\right]\) is the fundamental matrix. By substituting \(n\) pairs corresponding image points in Eq. (10), the components of fundamental matrix \({\varvec{F}}\) can be formulated a set of homogenous equations

$${{\varvec{A}}}_{f}\bullet {\varvec{f}}=0$$

where \({{\varvec{A}}}_{f}\) is a \(n\times 9\) coefficient matrix and \({\varvec{f}}=\left[\begin{array}{c}{{\varvec{f}}}_{1}\\ {{\varvec{f}}}_{2}\\ {{\varvec{f}}}_{3}\end{array}\right]\). The estimated fundamental matrix \({\varvec{F}}\) are obtained by solving Eq. (11) which also gives the essential matrix \({\varvec{E}}\). Consequently, the orientation \({}_{l}{}^{r}{\varvec{R}}\) and position \({}_{l}{}^{r}{\varvec{t}}\) of camera L expressed in the frame \(\left\{{{\varvec{F}}}_{r}\right\}\) of camera R is calculated from \({\varvec{E}}\) using single value decomposition.

$${\varvec{E}}={\left({{\varvec{K}}}_{l}\right)}^{T}\bullet {\varvec{F}}\bullet {{\varvec{K}}}_{r}$$
$$E={T}_{x}\bullet {}_{l}{}^{r}R$$

where \({{\varvec{T}}}_{x}\) is the skew-symmetric matrix of \({}_{l}{}^{r}{\varvec{t}}\).

3.3 Kinematic calibration

The end-effector location \({{\varvec{Q}}}_{b}\) with respect to the robot base frame \(\left\{{{\varvec{F}}}_{b}\right\}\) is expressed in terms of the robot geometric \({\varvec{a}}\) and joint parameters \({\varvec{\theta}}\),


The position error \(\Delta {{\varvec{Q}}}_{b}\) of the end-effector due to a set of identified geometric variation \(\Delta {\varvec{a}}\) is given by

$$\Delta {{\varvec{Q}}}_{b}={\varvec{J}}\bullet\Delta {\varvec{a}}$$

where \({\varvec{J}}\) is the Jacobian matrix of the kinematic model \(f\). The kinematic calibration is to determine the robot geometric variation \(\Delta {\varvec{a}}\) so that the position error \(\Delta {{\varvec{Q}}}_{b}\) is eliminated.

Using a set of \(n\) end-effector locations at various corresponding joint parameters, Eq. (14) can be expressed in form of a set of homogeneous equations with geometric variation \(\Delta \overline{{\varvec{a}} }={\left[\begin{array}{ccc}\Delta {{\varvec{a}}}_{1}& \cdots & \Delta {{\varvec{a}}}_{m}\end{array}\right]}^{T}\) (where \(m\) is the number of identified geometric variations)

$$\left[\begin{array}{c}{{\varvec{J}}}^{1}\\ \vdots \\ {{\varvec{J}}}^{n}\end{array}\right]\left[\begin{array}{c}\Delta {{\varvec{a}}}_{1}\\ \vdots \\ \Delta {{\varvec{a}}}_{m}\end{array}\right]=\left[\begin{array}{c}\Delta {{\varvec{Q}}}_{b}^{1}\\ \vdots \\ \Delta {{\varvec{Q}}}_{b}^{n}\end{array}\right] or{{\varvec{A}}}_{k}\bullet \Delta \overline{{\varvec{a}} }=\Delta {\overline{{\varvec{Q}}} }_{b}$$

where \({{\varvec{A}}}_{k}=\left[\begin{array}{c}{{\varvec{J}}}^{1}\\ \vdots \\ {{\varvec{J}}}^{n}\end{array}\right]\) and \({{\varvec{J}}}^{i}\) is the Jacobian matrix of kinematic model \(f\) with i-th configuration; \(\Delta {\overline{{\varvec{Q}}} }_{b}={\left[\begin{array}{ccc}\Delta {{\varvec{Q}}}_{b}^{1}& \cdots & \Delta {{\varvec{Q}}}_{b}^{n}\end{array}\right]}^{{\varvec{T}}}\).

Hence, the geometric variation \(\Delta \overline{{\varvec{a}} }\) can be estimated by solving Eq. (15).

3.4 Integration calibration

The stereo vision system in the autonomous robotic system returns the coordinates of a target location with respect to its frame \(\left\{{{\varvec{F}}}_{r}\right\}\). These coordinates are transformed into the world frame \(\left\{{{\varvec{F}}}_{w}\right\}\) by the transformation matrix \({}_{r}{}^{w}{\varvec{T}}\) (obtained from the camera calibration) so that locations from various stereo vision systems are merged together in the world frame for harvesting scheduling. However, the robot control modules orientates and moves the end–effector to a target location expressed in the robot base frame \(\left\{{{\varvec{F}}}_{b}\right\}\). The world–robot base frame transformation is a process that binds the domain of locations (expressed in the world frame) perceived by the stereo vision system with the domain of end-effector target locations in robot base frame. These two tasks need to be integrated by a transformation matrix \({}_{w}{}^{b}{\varvec{T}}\), which is estimated based on a set of \(n\) calibration points.

A set of n points \({{\varvec{Q}}}^{i}\) (\(\forall i=1,\cdots ,n\)) expressed in the world frame with position vector \(\left\{{{\varvec{Q}}}_{w}^{i}={\left[\begin{array}{ccc}{x}_{w}^{i}& {y}_{w}^{i}& {z}_{w}^{i}\end{array}\right]}^{T}, \forall i=1,\cdots ,n\right\}\) and the robot base frame with position vector \(\left\{{{\varvec{Q}}}_{b}^{i}={\left[\begin{array}{ccc}{x}_{b}^{i}& {y}_{b}^{i}& {z}_{b}^{i}\end{array}\right]}^{T}, \forall i=1,\cdots ,n\right\}\) are employed to establish the transformation matrix.

A rigid correspondence between these two sets of coordinate expression (i.e. point \({{\varvec{Q}}}^{i}\) is always expressed as \({{\varvec{Q}}}_{b}^{i}\) in robot base frame \(\left\{{{\varvec{F}}}_{b}\right\}\) and \({{\varvec{Q}}}_{w}^{i}\) in world frame \(\left\{{{\varvec{F}}}_{w}\right\}\) \(\forall i=1,\cdots ,n\)) is assumed so that

$$\left[\begin{array}{c}{{\varvec{Q}}}_{b}^{i}\\ 1\end{array}\right]={}_{w}{}^{b}{\varvec{T}}\bullet \left[\begin{array}{c}{{\varvec{Q}}}_{w}^{i}\\ 1\end{array}\right], \forall i\in I$$

where \({}_{w}{}^{b}{\varvec{T}}=\left[\begin{array}{cc}{}_{w}{}^{b}{\varvec{R}}& {}_{w}{}^{b}{\varvec{t}}\\ \underline{0}& 1\end{array}\right]\) is the transformation matrix between the robot base frame and the world frame such that \({}_{w}{}^{b}{\varvec{R}}=\left[\begin{array}{ccc}{}_{w}{}^{b}{{\varvec{r}}}_{1}& {}_{w}{}^{b}{{\varvec{r}}}_{2}& {}_{w}{}^{b}{{\varvec{r}}}_{3}\end{array}\right]\). Expanding Eq. (16) and stacking it for each n calibration point \({{\varvec{Q}}}^{i}\) (I = 1, 2, …, \(n\)) gives a system of linear equations

$${{\varvec{A}}}_{T}\bullet {\varvec{t}}=$$

where \({{\varvec{A}}}_{T}\) is a \(3n\times 12\) coefficient matrix and \({{\varvec{b}}}_{t}\) is a \(3n\times 1\) column vector; \({\varvec{t}}=\left[\begin{array}{c}\begin{array}{c}{}_{w}{}^{b}{{\varvec{r}}}_{1}\\ {}_{w}{}^{b}{{\varvec{r}}}_{2}\end{array}\\ \begin{array}{c}{}_{w}{}^{b}{{\varvec{r}}}_{3}\\ {}_{w}{}^{b}{\varvec{t}}\end{array}\end{array}\right]\). The transformation matrix \({}_{w}{}^{b}{\varvec{T}}\) can be estimated by solving Eq. (17). A summary of these calibrations are tabulated in Table 2. The column “Method” lists the methods to obtain the matrix in column “Outcome” using the equations listed in the “Equation” column.

Table 2 A summary of calibrations

4 Calibration

The autonomous kiwifruit harvesting system has a stereo vision system of two cameras as the sensor for localising kiwifruit, and a robotic arm for picking the kiwifruit as shown in Fig. 4.

Fig. 4
figure 4

The autonomous kiwifruit harvesting system

The manipulator consists of three major links and a wrist. These links are actuated by three servo motors installed at the base. At the HOME position of the robot, link 2 aligns with link 1 vertically. Link 3 is positioned horizontally, and the gripper is oriented vertically upward as shown in the figure. Figure 5 shows a schematic configuration of the kiwifruit harvesting robot. \({\theta }_{1}\), \({\theta }_{2}\) and \({\theta }_{3}\) are the joint parameters driving the link 1, 2 and 3 with link length are \({l}_{1}\), \({l}_{2}\) and \({l}_{3}\) respectively. Parameter \({a}_{j}\), \({b}_{j}\) and \({c}_{j}\) (j = 2, 3) are link lengths of the two four-bar linkages.

Fig. 5
figure 5

A kiwifruit robot arm manipulator with two four bar linkages

The orientation of the gripper \({\theta }_{4}=\pi -\left(\varphi -{\theta }_{3}^{^{\prime}}\right)\). The coordinates of pick point Q are expressed in the robot base frame \({F}_{b}\) as

$${{\varvec{Q}}}_{b}=f\left({\varvec{g}},{\varvec{\theta}}\right)=\left[\begin{array}{c}\left({l}_{2}\mathrm{sin}{\theta }_{2}+{l}_{3}\mathrm{cos}{\theta }_{3}^{^{\prime}}+{g}_{2}\right)\mathrm{sin}{\theta }_{1}+{g}_{1}\mathrm{cos}{\theta }_{1}\\ \left({l}_{2}\mathrm{sin}{\theta }_{2}+{l}_{3}\mathrm{cos}{\theta }_{3}^{^{\prime}}+{g}_{2}\right)\mathrm{cos}{\theta }_{1}+{g}_{1}\mathrm{sin}{\theta }_{1}\\ {{l}_{1}+l}_{2}\mathrm{cos}{\theta }_{2}+{l}_{3}\mathrm{sin}{\theta }_{3}^{^{\prime}}+{g}_{3}\end{array}\right]\in {R}^{3}$$

where \(f\left({\varvec{g}}, {\varvec{\theta}}\right)\) is the kinematic model of the robot with \({\varvec{\theta}}=\left[\begin{array}{c}{\theta }_{1}\\ {\theta }_{2}\\ {\theta }_{3}\end{array}\right]\in {R}^{3}\) and \({\varvec{g}}=\left[\begin{array}{c}{g}_{1}\\ {g}_{2}\\ {g}_{3}\end{array}\right]\in {R}^{3}\).

$${\theta }_{3}^{^{\prime}}={\frac{\pi }{2}-\theta }_{2}-\phi$$
$$\phi ={\mathrm{sin}}^{-1}\left(\frac{{a}_{2}\mathrm{sin}\left({\theta }_{2}-{\theta }_{3}\right)}{{e}_{2}}\right)+{\mathrm{cos}}^{-1}\left(\frac{{e}_{2}^{2}+{c}_{2}^{2}-{b}_{2}^{2}}{2{e}_{2}{c}_{2}}\right)$$
$${e}_{2}^{2}={a}_{2}^{2}+{l}_{2}^{2}-2{a}_{2}{l}_{2}\mathrm{cos}\left({\theta }_{2}-{\theta }_{3}\right)$$
$$\varphi ={\mathrm{sin}}^{-1}\left(\frac{{a}_{3}\mathrm{sin}\phi }{{e}_{3}}\right)+{\mathrm{cos}}^{-1}\left(\frac{{e}_{3}^{2}+{c}_{3}^{2}-{b}_{3}^{2}}{2{e}_{3}{c}_{3}}\right)$$

The angle \({\theta }_{3}^{^{\prime}}\), \(\phi\) and \(\varphi\) are shown in Fig. 5.

The stereo vision system has two Basler ac1920‐40uc USB 3.0 cameras with Kowa lens LM6HC F1.8 f6mm 1" format. The devices are installed in a way that two pairs of stereo cameras share two pairs of floodlight bars. Each consists of a 200 kW floodlight bar with 40 LEDs which illuminates the canopy with white light. A 3-D space measurement system is required to perform measurements that will quantify the magnitude of inaccuracies contributed by the stereo vision system and robot of the harvesting system.

Appropriate calibrations can be applied once these inaccuracies have been quantified. Figure 6a, b depict the setup of the measurement system. It consists of a frame holding a constraint plate which is above the robot and the stereo vision system. The constraint plate is connected to a sub-frame through a ball head so that its position and orientation are adjustable. Its height \(h\) is also adjustable using the adjusting unit. The constraint plate is made from an aluminium-acrylic composite sandwich material which has a flatness of 0.3 mm while suspended. A permanent marker and a dial gauge with resolution of 10 μm are installed at the end of the manipulator as shown in Fig. 6c. A small spring is placed under the marker so that it can be pressed against the constraint plate. The amount of compression is measured by the dial gauge. As the manipulator is moved to a measurement position, it leaves a mark indicating the marker position. The spring action of the marker is implemented so that the marker tip position can either be varied (higher or lower) across the constraint plate. Hence, the position of the marker can be measured with reference to the frame.

Fig. 6
figure 6

The measurement system

Various calibration processes can be performed by attaching different calibration tools such as a checkerboard or a board with a grid of dots to the constraint plate. For instance, a checkerboard is used for camera calibration as depicted in Fig. 7a, b shows a board with a grid of dots attached to the constraint plate for stereo vision system calibration, kinematic calibration and integration calibration. The world frame (in red) is set at the centre of the calibration tool as shown.

Fig. 7
figure 7

The calibration tools

As the robot moves back and forth, dots are drawn on the board fixed with the constraint plate. The board can then be removed from the frame and placed on a bench to measure the dot positions. Measurements are made by measuring the deviation in the x and y directions of the dots from their target positions. A 10 × etched optical magnifier with an etched in ruler with 100 μm increments as shown in Fig. 8a is used during the measurement process to get sub-millimetre resolution accuracy. An example measurement is shown in Fig. 8b (the optical magnifier is not quite aligned with the grid in this image as it is difficult to both align the magnifier and take an image simultaneously). The centre of the dot shown is estimated by measuring the position of the outside edges of the dot and using them to calculate a centre-point. In this case the centre of the dot would be measured as 1.5 mm away from the reference line. The system performs dot position measurements for verifying the kinematic calibration and integration calibration.

Fig. 8
figure 8

Dot position measurement

4.1 Camera calibration

The cameras in the stereo vision system are calibrated individually to obtain their intrinsic and extrinsic parameters. A checkerboard (A1—80 mm squares—9 × 6 vertices, 10 × 7 squares) is installed on the constraint plate. The cameras are calibrated individually using 50 images taken at multiple orientations, covering the entire field of view of the cameras. The matrix \({\varvec{H}}\) and \({\varvec{D}}\) are estimated using Eqs. (4) and (7). Moreover, the intrinsic and extrinsic parameters are computed by Eqs. (8) and (9).

4.2 Stereo vision system calibration

The checkerboard is replaced by a board with a grid of dots (13 × 17 dots is spaced 50 mm apart in the x range of ± 400 mm and y range of ± 300 mm) for stereo vision system calibration. Images of the dots are captured by both cameras. The dots are detected using a simple colour filter that generates a black and white mask for both images based on pixels that meet the criteria of the filter. From these masks, the pixel locations of the centroid of the distinguished dots are calculated and are stored as a list of key-points. These key-points are then processed through the undistortion matrix of the stereo vision system and then matched between the left and right images. The fundamental matrix \({\varvec{F}}\) and essential matrix \({\varvec{E}}\) are then estimated using Eqs. (11) and (12). Consequently, the transformation matrix \({}_{l}{}^{r}{\varvec{T}}\) of camera L with respect to the frame \(\left\{{{\varvec{F}}}_{r}\right\}\) of camera R is obtained from the essential matrix \({\varvec{E}}\) by single value decomposition.

4.3 Kinematic calibration

The board with a grid of dots is also used for kinematic calibration. The marker is driven by the manipulator so that it moves to each dot on the constraint plane sequentially. The marker leaves a dot at each actual position. The joint parameters \({\varvec{\theta}}\) are recorded at each dot. These dots are the actual positions of the marker. The joint parameters are substituted into the kinematic model (Eq. (18)) to obtain the calculated positions. The differences in each pair of calculated position and actual (marked) position of the marker define the position error \(\Delta {{\varvec{Q}}}_{b}\). The geometric variations are estimated iteratively using Eq. (15).

4.4 Integration calibration

An array of dots \({{\varvec{Q}}}_{w}^{i}\left({x}_{w}^{i},{ y}_{w}^{i}, { z}_{w}^{i}\right) (\forall i=1,\cdots ,n)\) expressed in the world frame \(\left\{{{\varvec{F}}}_{w}\right\}\) is used for calibrating the robot base and the stereo vision system integration. The end-effector with the marker is driven to each of these points and their corresponding joint parameters are recorded. The coordinates of this array of dots \({{\varvec{Q}}}_{b}^{i}\left({x}_{b}^{i},{ y}_{b}^{i}, { z}_{b}^{i}\right) (\forall i=1,\cdots ,n)\) expressed in the robot base frame \(\left\{{{\varvec{F}}}_{b}\right\}\) are obtained by substituting the joint parameters into the calibrated kinematic model as modelled by Eq. (18). The transformation matrix \({}_{w}{}^{b}{\varvec{T}}\) is computed by solving Eq. (17).

4.5 Solving a system of linear equations

All the calibrations result in solving a system of equations such as Eqs. (4), (7), (11), (15) and (17) in the forms of

$$A\bullet x=b$$

When \({\varvec{b}}=0\) such as Eqs. (4), (7) and (11), the solving of this system of homogeneous equation can be formulated as estimating \(\widehat{{\varvec{x}}}\) by

$$\underset{{\varvec{x}}}{\mathit{min}}\left({\widehat{{\varvec{x}}}}^{T}{{\varvec{A}}}^{T}\bullet {\varvec{A}}\widehat{{\varvec{x}}}\right){\text{ such that }}\boldsymbol{ }{\widehat{{\varvec{x}}}}^{T}\bullet \widehat{{\varvec{x}}}=1$$

A Loss function \(L\left(\widehat{{\varvec{x}}},\lambda \right)\) is defined such that

$$L\left(\widehat{{\varvec{x}}},\lambda \right)={\widehat{{\varvec{x}}}}^{T}{{\varvec{A}}}^{T}\bullet {\varvec{A}}\widehat{{\varvec{x}}}-\lambda \left({\widehat{{\varvec{x}}}}^{T}\bullet \widehat{{\varvec{x}}}-1\right)$$

Taking derivative of \(L\left(\widehat{{\varvec{x}}},\lambda \right)\) with respect to \(\widehat{{\varvec{x}}}\) yields an eigen value problem

$${A}^{T}\bullet A\widehat{x}=\lambda \widehat{x}$$

\(\widehat{{\varvec{x}}}\) is an eigen vector with smallest eigen value \(\lambda\) of matrix \({{\varvec{A}}}^{T}\bullet {\varvec{A}}\) minimizing the Loss function \(L\left(\widehat{{\varvec{x}}},\lambda \right)\). Once variable \(\widehat{{\varvec{x}}}\) is estimated, the corresponding matrix can be obtained by re-arranging the components of variable \(\widehat{{\varvec{x}}}\).

If \({\varvec{b}}\ne 0\), the equations are linear such as Eqs. (15) and (17), the least squares solution \(\widehat{{\varvec{x}}}\) is estimated


However, Eq. (15) \({{\varvec{A}}}_{k}\bullet \Delta \overline{{\varvec{a}} }=\Delta {\overline{{\varvec{Q}}} }_{b}\) in kinematic calibration is a non-linear system and \(\Delta \overline{{\varvec{a}} }\) is computed by iteration until convergence. At the j-th iteration,\(\Delta \overline{{\varvec{a}} }\), \(\Delta {\overline{{\varvec{Q}}} }_{b}\) and \({{\varvec{A}}}_{k}\) are denoted as \(\Delta {\overline{{\varvec{a}}} }^{j}\), \(\Delta {\overline{{\varvec{Q}}} }_{b}^{j}\) and \({{\varvec{A}}}_{k}^{j}\) (\({{\varvec{A}}}_{k}^{j}\) contains \({\overline{{\varvec{a}}} }^{j}\) and \({\varvec{\theta}}\), the joint parameters for n calibration points) respectively,

$${{\varvec{A}}}_{k}^{j}={{\varvec{A}}}_{k}^{j}\left( {\overline{{\varvec{a}}} }^{j},{\varvec{\theta}}\right)$$
$$\Delta {\overline{{\varvec{a}}} }^{j}={\left({\left({{\varvec{A}}}_{k}^{j}\right)}^{T}\bullet {{\varvec{A}}}_{k}^{j}\right)}^{-1}\bullet {\left({{\varvec{A}}}_{k}^{j}\right)}^{T}\bullet \Delta {\overline{{\varvec{Q}}} }_{b}^{j}$$
$${\overline{{\varvec{a}}} }^{j+1}={\overline{{\varvec{a}}} }^{j}+\Delta {\overline{{\varvec{a}}} }^{j}$$
$$\Delta {\overline{{\varvec{Q}}} }_{b}^{j+1}={{\varvec{A}}}_{k}^{j}\bullet \Delta {\overline{{\varvec{a}}} }^{j}$$

5 Verification of calibration processes

A constraint plate with a circular array of dots is used for accuracy verification. These dots are arranged in polar coordinates with the origin at the home position of the robot. The verifications are conducted at three height level h of 700 mm, 800 mm and 900 mm. The measured dots are plotted with the circular array of dots for comparison. The plot is shown as arrows pointing from the measured dot positions to the dot positions in the circular array. The length of the arrow is proportional to the error magnitude in radial and circumferential direction. The colour of the arrow is based on the measured z error, which is the depth of that data point, with purple being below the average depth and yellow being above.

In the stereo vision system verification, the sensed dot (measured dots) locations expressed in the world frame \(\left\{{{\varvec{F}}}_{w}\right\}\) are calculated. The discrepancies between the sensed dots and the dots in the circular array are plotted in Fig. 9. The summary of the plot is tabulated in Table 3.

Fig. 9
figure 9

Stereo vision system accuracy of the kiwifruit harvesting system after calibration

Table 3 Summary of stereo vision system accuracy before and after calibration

In the kinematic accuracy verification, the end-effector is driven to each point of the circular array. The joint parameters are recorded. The calculated end-effector location for each set of joint parameters is obtained from Eq. (18). The positional errors between these calculated locations and their corresponding dot locations in the circular array are plotted in Fig. 10. Figure 10a, b show the positional errors before and after the kinematic calibration. The summary of the plot is tabulated in Table 4.

Fig. 10
figure 10

Kinematic accuracies of the kiwifruit harvesting system before and after calibration

Table 4 Summary of kinematic accuracy before and after calibration

The accuracy of integration calibration is verified by considering the hand–eye coordination accuracy of the autonomous harvesting system. A circular array of dots is sensed by the stereo vision system. The coordinates of the dots \({{\varvec{Q}}}_{r}^{i}\) are computed and expressed in the stereo vision system coordinate frame \(\left\{{{\varvec{F}}}_{r}\right\}\). These coordinates are then transformed into the world frame \(\left\{{{\varvec{F}}}_{w}\right\}\) to yield coordinates \({{\varvec{Q}}}_{w}^{i}\). By using the transformation matrix \({}_{w}{}^{b}{\varvec{T}}\), these dot coordinates are transformed to \({{\varvec{Q}}}_{b}^{i}\) in the robot frame \(\left\{{{\varvec{F}}}_{b}\right\}\). The joint parameters \({{\varvec{\theta}}}^{i}\) corresponding to each dot coordinate \({{\varvec{Q}}}_{b}^{i}\) are obtained by inverse kinematics. Eventually, the end-effector is driven by these joint parameters \({{\varvec{\theta}}}^{i} (\forall i=1, \cdots ,n)\) to mark the dots on the constraint plate. The differences between the original circular array of dots and the marker positions are plotted in Fig. 11. The results are summarized in Table 5.

Fig. 11
figure 11

The accuracy of hand–eye coordination

Table 5 Summary of the hand–eye coordination accuracy

6 Lab tests

Several tests are set up to check the performance of the autonomous kiwifruit harvesting system after calibrations using the fake kiwifruit in the laboratory as shown in Fig. 12a. As the kiwifruit tends to grow in cluster, the fake kiwifruit is arranged as individual fruit, three fruit clusters, two fruit cluster and seven fruit cluster as shown in Fig. 12b–e.

Fig. 12
figure 12

Laboratory tests of autonomous kiwifruit harvesting system

The autonomous kiwifruit harvesting system is placed under the fruit which are hung on the lab ceiling. The fruit is localized by the stereo vision system and the corresponding joint parameters of the manipulator are then computed to drive the gripper in picking the fake kiwifruit. The outcomes are categorized into three categories of picked, dropped and unpicked as tabulated in Table 6.

Table 6 Results of laboratory tests of autonomous kiwifruit harvesting system

7 Orchard tests

Orchard tests are conducted in the several commercial Hayward orchards to determine the performance of the calibrated harvesting system as shown in Fig. 13.

Fig. 13
figure 13

Orchard test for autonomous kiwifruit harvesting system

Four bays of kiwifruit are harvested to determine the performance. Each bay is divided into several zones. At each zone, the harvesting system is instructed to pick the fruit that are visible by the stereo vision system. At the end of harvesting in a zone, fruit harvested and fruit dropped are manually picked and counted. Fruit left in the canopy is also manually harvested and counted. The harvesting processes are video recorded so that a more detailed analysis can be performed later. The results are summarized in Table 7 which are compared to the performance before calibration.

Table 7 Orchard test result comparison to before and after calibration

8 Discussion

The calibrations can be categorized into two parts. The stereo vision calibration and kinematic calibration. Stereo vision techniques use two cameras to view the same object. The orientations of both cameras installed in the stereo vision system are the same and the distance apart is known. However, these are hard to guarantee the accuracy due to manufacturing issues, especially the fixture holding the cameras. Such inaccuracies affect the extrinsic parameters of the cameras. The stereo vision system calibration is mainly to evaluate these parameters using a set of calibration points and to eliminate the unwanted noise. Since the cameras are operated under the optical principles, a linear model and a set of linear equations are employed to model the system except the lens distortion. The lens distortion is non-linear and this complicates the calibration process. The effects of lens distortion can be serious if it is not properly undistorted.

Different from stereo vision system calibration, kinematic calibration is performed to minimize the difference between the desired and actual gripper position. A kinematic model of a manipulator, which depends on its mechanical structure, is required for calibration. The kinematic model is usually a non-linear function due to the revolution joints. As a result, the kinematic calibration results in a system of non-linear equations and iterative processes are required to solve the equations.

In the kinematic calibration accuracy plot, those "crosses" around the plot centre at (0, 0) indicate that the r and θ coordinates of the commanded and actual positions are very close. In addition, the z coordinate differences are also close to zero. This shows that the assumption of approximating the gripper orientation to be vertical to simplify the inverse kinematic computation is feasible when the gripper is around its home position.

In fact, the plots in Figs. 9, 10, 11 show that the accuracy of the autonomous harvesting system is very high. This also reflects in the laboratory test. The successful picking rate for individual fake kiwifruit is 100%. However, this rate drops when the fruit cluster size increases. The average picking rate drops from 100% for individual fruit to 67% for seven fruit clusters (with a minimum at 57% for three fruit clusters). The actual successful harvesting rate in the orchard test is even lower than that of the laboratory test results. The calibration improves the positioning of the gripper so that the number of entry failure, grip failure, target and non-target knockoffs, and fruit throws and drops are reduced. As a result, the fruit dropped percentage decreases from 24.6 to 10.7% after calibration. However, the successful harvesting rate just rises from 50.8 to 55%. This indicates that those fruits remain in the canopy instead of harvested as the percentage of fruit left in the canopy increases from 24.6 to 34.3%. This is due to the “offline” stereo vision system for sensing the canopy information, which refers to capturing the canopy once at the beginning of a harvesting cycle.

The operation of the autonomous harvesting system at a specific zone under the canopy follows a harvesting cycle as shown in Fig. 14a. The stereo vision system on each module images the canopy. Then the fruits are detected and their positions in the world frame are located. The located fruit positions from each module are collated by a scheduler which assigns the target fruit for each robotic arm dynamically based on the principle of minimum distance with a set of rules for collision prevention. When the robot controller receives a target fruit location from the scheduler, the joint parameters are calculated and the manipulator drives the gripper to perform the harvesting. Then the next target fruit for the robot will be determined by the scheduler until all target fruit in the scheduler are assigned. Then the autonomous harvesting system is moved to another zone.

Fig. 14
figure 14

The harvesting cycle of the autonomous harvesting system at a specific location

Reviewing the harvesting videos shows that most of the kiwifruit grow in clusters. These fruits are not at the same level. Some are at a level higher than the others. Before the calibration, once a fruit is harvested, the gripper usually knocks off the neighbouring fruit because of the position inaccuracy. After calibration, the position error is reduced. The fruit is correctly placed into the gripper and the neighbouring fruit are not knocked off. However, these fruits in the cluster are obstructed by the stereo vision system before harvesting. They are not visible to the stereo vision system and are not scheduled for harvesting. Hence, they remain in the canopy even though they are not knocked off. As an alternative, the harvesting cycle is re-run as a “sub-cycle” again to pick those fruit left in the canopy at that zone. This sub-cycle repeats until all the fruit are clear from the canopy (Williams et al. 2019). The flow charts of these processes are outlined in Fig. 14b. Hence, the harvesting time is increased depending upon how the kiwifruit grows clustery.

In order to increase the economic viability of the harvesting system, a real-time stereo vision system (Jin and Tang 2009; Matsumoto and Zelinsky 2000) is proposed for the next development phase as depicted in Fig. 14c. With the real-time stereo vision system, the instantaneous canopy information is obtained, and the scheduler should update the harvesting order accordingly if an obstructed kiwifruit becomes visible. Furthermore, the target fruit allocation by the scheduler should also be based on evenly workload distribution to minimize the cycle harvesting time. This can be achieved by introducing competition among the robots while they are cooperating to complete the harvesting task. With the real-time stereo vision system, visual servoing control (Chaumette and Hutchinson 2006, 2007) can also be implemented into the manipulator controller to achieve a more accurate fruit harvesting.

9 Conclusion

Video analysis is performed to investigate the reasons for high fruit drop rate during harvesting using an autonomous kiwifruit harvesting system. It is revealed that the position error between the gripper and the fruit is the main cause of fruit drop during harvesting. In order to reduce these position errors, an integrated approach for stereo vision system calibration and kinematics calibration is proposed for the autonomous kiwifruit harvesting system.

A measurement system is set up to acquire the data for both calibrations and verifications. Laboratory tests and orchard tests are arranged to investigate the performance of the harvesting system in real application environments. Based on the measurements after these calibrations, it is found that the position error of the harvesting system is greatly reduced. This is actually reflected in the high successful picking rate (100%) in the laboratory test of picking individual fake kiwifruit. The fruit dropped rate is also greatly improved in the real orchard tests because of the position error reduction. However, the successful harvesting rate is not impressively improved as most of the fruit remain unharvested in the canopy even though they are not knocked off due to position error. This is mainly due to the fact that canopy information such as fruit distribution is obtained by the stereo vision system before harvesting starts. These fruits are not harvested because they are obstructed to the stereo vision system as they are hidden by their neighbouring fruit.

As a result, a real-time stereo vision system is required during harvesting to sense the dynamic environment of canopy information to increase the successful harvesting rate and economic viability. Nevertheless, this will complicate the harvesting scheduling and workload distribution among the robots. As the autonomous kiwifruit harvesting system consists of multiple robots, the harvesting task is cooperative and competitive. The robots cooperate to harvest the fruit and they compete with each other so that the harvesting is completed in a minimum of time.