
1 Introduction

Grape picking is one of the most important stages in grape production and directly affects the market value of the fruit. Picking is time-consuming and laborious: its labor input accounts for 50% to 70% of the labor required across the entire grape planting process. China's population is aging while the number of agricultural workers is shrinking, so inefficient manual picking will inevitably drive picking costs ever higher, and with the spread of large-scale and facility viticulture, traditional manual picking can no longer meet the needs of market development. The development of grape picking robots with intelligent recognition capabilities has therefore become an active research topic among scholars at home and abroad. One of the key problems in developing such robots is the recognition and positioning of the target fruit. Zhiyong Xie et al. used RGB channel recognition to extract strawberry fruit contours with an accuracy above 85%. Using the characteristic reflection spectrum of apples, Zhaoxiang Liu et al. applied PSD three-dimensional calibration to locate apple fruit, keeping the maximum deviation within 13 mm. Traditional optical recognition techniques offer fast recognition speed and low structural complexity, but they cope poorly with occluding branches and leaves and with overlapping fruit in complex environments, and are therefore difficult to apply in actual production.

In recent years, there has been related research on target positioning based on deep learning. Girshick et al. proposed R-CNN (Regions with Convolutional Neural Network features), a region-based convolutional neural network [1]. The network uses a selective search algorithm to propose about 2,000 candidate regions in the input image and applies a convolutional neural network to each candidate region for feature extraction and recognition. This method was the first to combine deep learning with object detection. Fast R-CNN and Faster R-CNN were proposed subsequently. Fast R-CNN eliminates R-CNN's repeated convolution over candidate regions and adds ROI pooling (region of interest pooling) after the last layer of the feature extraction network [2], which significantly speeds up recognition. Faster R-CNN builds an RPN (Region Proposal Network) on top of Fast R-CNN, generating candidate regions directly and achieving high-accuracy end-to-end detection [3,4,5]. Derivative network models in this line include SSD (Single Shot MultiBox Detector) and others.

Building on the SSD network model, this paper performs further transfer learning and modification, and uses multi-image combined analysis to study the localization of grapes grown under facility cultivation.

2 Materials and Methods

2.1 Image Acquisition

Images of grapes to be picked were collected as the training and test sets for transfer learning of the SSD MobileNet model. The training images directly affect the fine structure of the model and hence its final accuracy [6]. Therefore, the selected images needed to be representative and to cover a wide range of conditions, with attention paid to background complexity to avoid overfitting. The image acquisition device used a Sony IMX363 CMOS sensor with a resolution of 4032 × 3024 pixels and a lens with an equivalent focal length of 28 mm. To ensure the robustness of the trained model under various light sources, lighting was not strictly controlled and varied randomly during image acquisition. Thirty clusters of Pujiang grapes with different shapes were selected as the experimental objects, with cluster heights ranging from 17.3 cm to 31.1 cm. The grapes hung vertically downward, perpendicular to the crossbar of the cultivation facility. With the grape stem as the axis, the lens was positioned 50 cm from the axis, and an image was taken every 15°, yielding a total of 720 color images.

2.2 Image Preprocessing

The image analysis and processing platform was a computer running the Windows 10 operating system with an Intel i7-7700 CPU, 8 GB of RAM, and an NVIDIA Quadro P620 professional graphics card with 2 GB of VRAM.

The training mode adopted in this paper is supervised learning; that is, both the labels and the corresponding images must be fed into the SSD MobileNet model, which then constructs the mapping function for grape object detection. The collected images were annotated manually with the LabelImg tool: each grape cluster was enclosed in a bounding rectangle whose top, bottom, left, and right edges coincided with the cluster's extents. The position of the grape stem was marked with the same edge convention; stems occluded by fruit or leaves were left unmarked. Leaves, where present, were also annotated. In total, 720 cluster labels, 633 stem labels, and 201 leaf labels were marked (Fig. 1).

Fig. 1. Manual marking of fruit.

Before transfer learning, the images must be preprocessed to remove noise that could affect accuracy, or to raise the weight of under-represented training samples to prevent underfitting [7]. Because the 201 leaf labels were far fewer than the other two classes, and because leaves differ considerably across viewing angles, this paper oversampled the images containing leaves: each such image was rotated 10° clockwise and 10° counterclockwise, expanding the set of leaf-bearing images to 603. The new samples were then annotated manually with the LabelImg tool. Because the images had already been captured at many angles around the grape stem axis, no further geometric preprocessing was applied to the training set.
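As a concrete illustration of this oversampling step, the following minimal sketch rotates a leaf-bearing image by ±10° with OpenCV (the file names are illustrative, not those used in this study):

```python
import cv2

def rotate_image(image, angle_deg):
    """Rotate an image about its center; positive angles are counterclockwise."""
    h, w = image.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    return cv2.warpAffine(image, m, (w, h), borderMode=cv2.BORDER_REPLICATE)

# Expand each leaf-bearing image into three samples: original plus two tilts.
src = cv2.imread("leaf_sample_001.jpg")                             # illustrative path
cv2.imwrite("leaf_sample_001_cw10.jpg", rotate_image(src, -10.0))   # 10 deg clockwise
cv2.imwrite("leaf_sample_001_ccw10.jpg", rotate_image(src, 10.0))   # 10 deg counterclockwise
```

After rotation, the bounding boxes no longer align with the originals, which is why the paper re-annotates the new samples manually rather than transforming the labels automatically.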

2.3 Transfer Learning

In this paper, we built a new SSD MobileNet in a programming environment with TensorFlow 1.14.0 (GPU build) and cuDNN 7.6.0.

SSD MobileNet is a neural network model combining MobileNet with the SSD algorithm: MobileNet performs image classification at the front end of the model, and the SSD algorithm sits at the back end to perform object detection [8]. MobileNet is a lightweight convolutional neural network structure with relatively low complexity, able to achieve a good recognition rate on platforms with limited computing power, such as mobile processors or the embedded chips carried by agricultural machinery. The network is built on depthwise separable convolution [9]. In a conventional convolution, the number of parameters grows with the product of the kernel area and the numbers of input and output channels, and a mature network model often stacks dozens of convolution and pooling layers, so the parameter count is large and limits speed. Depthwise separable convolution splits the traditional convolution into two steps. First, a depthwise convolution produces a separate feature map for each channel; then a pointwise convolution with a 1 × 1 kernel weights these per-channel feature maps across the depth dimension, yielding the same number of output feature maps as a traditional convolution [10]. Because the channel-by-channel convolution involves far fewer parameters, this factorization significantly reduces the parameter count and improves the recognition rate, or allows a deeper, more accurate network for the same parameter budget.
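To make the parameter saving concrete, here is a minimal Keras sketch contrasting a standard 3 × 3 convolution with its depthwise separable factorization (the layer sizes are illustrative, not the exact MobileNet configuration):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Standard 3x3 convolution: parameters ~ 3*3*M*N (kernel area x in x out channels).
standard = models.Sequential([
    layers.Conv2D(64, kernel_size=3, padding="same", input_shape=(224, 224, 32)),
])

# Depthwise separable version: a 3x3 depthwise conv (one filter per channel)
# followed by a 1x1 pointwise conv that mixes channels; parameters ~ 3*3*M + M*N.
separable = models.Sequential([
    layers.DepthwiseConv2D(kernel_size=3, padding="same", input_shape=(224, 224, 32)),
    layers.Conv2D(64, kernel_size=1, padding="same"),
])

print(standard.count_params(), separable.count_params())  # 18496 vs 2432
```

For 32 input and 64 output channels, the standard layer holds 18,496 parameters against 2,432 for the separable pair, roughly an eightfold reduction.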

The MobileNet V1 network structure has 28 layers. The network uses only a single 7 × 7 × 1024 average pooling layer at the end, followed by a softmax classifier; the front of the network is a serial combination of standard convolution layers and depthwise separable convolution layers, which reduces the computing time that pooling would otherwise require. The model also introduces two hyperparameters: the width multiplier α and the resolution multiplier β. With the width multiplier, the computational cost of one depthwise separable layer becomes Dk × Dk × αM × Df × Df + αM × αN × Df × Df, where α ∈ (0, 1]; α = 1 gives the standard MobileNet, while α < 1 gives a reduced model. The width multiplier makes every layer in the network thinner, further accelerating training and recognition at some cost in accuracy. The resolution multiplier β scales down the input resolution, which shrinks the length and width of every output feature map in equal proportion [11].
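The cost formula above can be written out directly. The following sketch (with illustrative layer dimensions) shows how the two multipliers shrink the multiply-accumulate cost of one depthwise separable layer:

```python
def separable_conv_cost(dk, m, n, df, alpha=1.0, beta=1.0):
    """Multiply-accumulate cost of one depthwise separable layer with width
    multiplier alpha (scales channel counts M, N) and resolution multiplier
    beta (scales the feature map edge length Df)."""
    am, an, bdf = alpha * m, alpha * n, beta * df
    depthwise = dk * dk * am * bdf * bdf    # per-channel spatial filtering
    pointwise = am * an * bdf * bdf         # 1x1 channel mixing
    return depthwise + pointwise

full = separable_conv_cost(3, 256, 256, 14)               # alpha = beta = 1
reduced = separable_conv_cost(3, 256, 256, 14, 0.5, 0.5)  # halved width and resolution
print(reduced / full)  # ~0.065, i.e. roughly alpha^2 * beta^2 = 0.0625
```

Because both terms scale with α and β quadratically, halving both multipliers cuts the cost by roughly a factor of sixteen.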

The back-end SSD network model is a modification of the VGG16 network. SSD comprises 11 blocks: it converts the sixth and seventh (fully connected) layers of VGG16 into 3 × 3 convolution layers, removes the dropout layers and the eighth, fully connected layer, and adds new convolution layers to increase the number of feature maps. SSD detects on a combination of feature maps at multiple resolutions: small, low-resolution feature maps suit large-scale object detection, while correspondingly large feature maps are available to detect targets with fine texture. The network is end-to-end, no longer requires candidate regions, and is more efficient than Faster R-CNN.
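The multi-resolution idea can be illustrated with the default box scales of the standard SSD300 configuration: each detection feature map is assigned a box scale by the linear rule from the SSD paper, so coarse maps handle large objects and fine maps handle small ones. A minimal sketch:

```python
# Standard SSD300 detection feature map edge lengths and the linear scale rule
# s_k = s_min + (s_max - s_min) * k / (K - 1); values here follow the SSD paper.
feature_map_sizes = [38, 19, 10, 5, 3, 1]
s_min, s_max = 0.2, 0.9

k_total = len(feature_map_sizes)
scales = [s_min + (s_max - s_min) * k / (k_total - 1) for k in range(k_total)]

for size, scale in zip(feature_map_sizes, scales):
    print(f"{size:>2} x {size:<2} map -> default box scale {scale:.2f} "
          f"({size * size} locations)")
```

The 38 × 38 map with scale 0.20 covers small, finely textured targets at many locations, while the 1 × 1 map with scale 0.90 covers objects that fill most of the image.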

In transfer learning, the source domain is the built-in classification task of the pretrained classifier in the MobileNet part [12], while the target domain is the classification set containing fruit clusters, fruit stems, and leaves. First, the labeled XML annotation files must be converted into the TFRecord format that TensorFlow can read. This paper divided the sample data into 80% training set, 10% test set, and 10% validation set. When configuring the files and the pipeline profile, the number of samples per training step must be adjusted to the graphics card's video memory; the batch size used in this paper was 16. We used a fixed feature extractor for transfer learning: the mature convolution layers at the front end of the model were frozen and used as the feature extractor for the target domain, and only the classifier at the end of the network and the related structures were trained to construct the new classifier [13,14,15].
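The fixed-feature-extractor idea can be sketched with Keras as follows. Note that the paper's actual pipeline was the TensorFlow Object Detection API with an SSD head; this simplified sketch only shows the principle of freezing a pretrained MobileNet front end and training a new three-class back end (the head layers are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNet

# Source domain: MobileNet pretrained on ImageNet, classifier head removed.
base = MobileNet(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze ("solidify") the mature front-end convolution layers

# Target domain: train only a new head for cluster / stem / leaf.
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(3, activation="softmax"),  # 3 target-domain classes
])
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.003),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Because gradients flow only through the small new head, training converges with far fewer labeled samples than training the whole network from scratch.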

2.4 Position Calibration

Once the transfer-learned network model is obtained, it can identify the target-domain contents in an image and return the corner coordinates of each recognition box. During harvesting, the end effector cuts the fruit stem and catches the cluster at its bottom; therefore, this paper focuses on position calibration of the stem center and the bottom of the fruit cluster.

Depth distance was acquired with a miniature laser rangefinder. Its measurement accuracy is better than 1 mm, its measurement range is 0.03–80.00 m, and its spot diameter is less than 0.6 mm under normal working conditions. The rangefinder was mounted parallel to the camera on a 360° rotatable electronic pedestal, with the camera lens center offset horizontally by 2.5 cm from the center of the rangefinder's transmitter module.

When collecting the 3-D coordinates of the target, the fruit stem is located from the return values of the object detector. If the detected combination contains only fruit clusters and leaves, the system prompts for movement until a fruit stem appears. Once the detector identifies the stem, the target is centered in the image by rotating the support, and the horizontal and vertical rotation angles α and β of the support are recorded. The rangefinder is then swept left and right to obtain its characteristic return spectrum. The minimum value x of this spectrum is taken as the depth distance, and the three-dimensional coordinate of the target point is (x·cosβ·sinα, x·sinβ, x·cosβ·cosα) (Fig. 2).
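A direct transcription of this geometry, assuming α is the recorded horizontal (pan) angle and β the vertical (tilt) angle, is the following sketch:

```python
import math

def target_point(depth_x, alpha_deg, beta_deg):
    """Convert the pan angle alpha, tilt angle beta, and the minimum of the
    rangefinder's characteristic spectrum (depth x) into the camera-frame
    3-D coordinate (x cos(b) sin(a), x sin(b), x cos(b) cos(a)) from the text."""
    a, b = math.radians(alpha_deg), math.radians(beta_deg)
    return (depth_x * math.cos(b) * math.sin(a),   # horizontal offset
            depth_x * math.sin(b),                 # vertical offset
            depth_x * math.cos(b) * math.cos(a))   # depth along the optical axis

# Illustrative example: stem centered after panning 12 deg and tilting 5 deg,
# with a minimum measured depth of 0.62 m.
print(target_point(0.62, 12.0, 5.0))
```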

Fig. 2. Characteristic spectrum of distance.

3 Experimental Results and Analysis

3.1 Object Detection Results

The model was transferred using fixed feature extraction. The data comprised 576 training, 72 test, and 72 validation images of fruit clusters; 506 training, 63 test, and 63 validation images of fruit stems; and 482 training, 60 test, and 60 validation images of leaves. With the batch size set to 16, an initial learning rate of 0.003, and a maximum of 10,000 training iterations, the recognition accuracy peaked at 82.9% at 5,000 iterations. A detection was counted as correct when its IoU exceeded 0.85 (Table 1).

Table 1. Results of object detection.
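The IoU > 0.85 correctness criterion can be computed from the predicted and hand-labeled boxes as in this minimal sketch (the box coordinates are illustrative):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as correct when IoU with the labeled box exceeds 0.85.
print(iou((50, 40, 210, 300), (60, 45, 215, 310)) > 0.85)  # True (IoU ~ 0.86)
```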

When the batch size was reduced too far, the loss began to fluctuate strongly as the number of iterations increased, making good normalization difficult because the mean and variance of the data could not be estimated accurately; recognition accuracy also declined significantly [16]. As the batch size increased, parameters were updated less often and the gradient estimates became more accurate. However, too large a batch size halted training due to insufficient video memory, and it also weakened the stochastic behavior of the optimization [17, 18].

3.2 Comprehensive Test Results

Because the final three-dimensional coordinates combine the object detection output, the comprehensive image analysis, and the side-axis ranging data, their accuracy had to be verified. Twenty fruit clusters were measured with the camera and ranging sensor mounted on the rotatable support, with each target measured 10 times, for 200 measurements in total. The recognition model identified the stem and the bottom of the cluster; a recognition was counted as correct when IoU > 0.85, and a coordinate calculation was counted as correct when the error between the computed three-dimensional point and the actual measurement was less than 1.5 cm. Target recognition was correct in 159 cases, an accuracy of 79.5%; three-dimensional coordinate positioning was accurate in all 159 of these cases, meaning that every correctly identified target was localized within the allowable error, for an overall accuracy of 79.5%. The detection frame rate remained around 20 fps, a good recognition result.

4 Discussion

On the TensorFlow platform, SSD MobileNet V1 was used for transfer learning on the grape picking samples, and the recognition accuracy was close to that of the original model. Through the central deviation angle method and the rangefinder depth data, the three-dimensional picking coordinates were constructed.

Transfer learning significantly speeds up model construction and eliminates the processes of repeatedly adjusting the network structure, optimizing node parameters, and collecting and labeling large sample sets. In fixed feature extraction, the original mature network retains good generalization ability for feature extraction, which brings the recognition rate and accuracy on the target-domain task close to, or even beyond, those of the original model. It is therefore well suited to building target recognition models for grapes and other fruits and vegetables.

In this paper, the three-dimensional coordinate information obtained by combining object detection with the central deviation angle method is expressed in the frame of the image receiver. In future picking machinery, this coordinate information can be transformed by recalibration into the final coordinates required to position the end effector. Once object detection succeeds, the three-dimensional coordinate calculation is accurate in nearly 100% of cases, so further improvement of the overall accuracy depends on further modification and optimization of the object detection model.

5 Conclusion

Following the subdivided steps of the grape picking process, the SSD MobileNet V1 network model was transferred to grape picking samples using fixed feature extraction. Combined with the central deviation angle method, we achieved a comprehensive accuracy of 79.5% on 200 physical samples, close to the inherent accuracy of the original model before transfer learning. This shows that the method in this paper achieves the desired transfer effect in the target domain.