
1 Introduction

Grape picking is one of the most important stages in grape production and directly affects the market value of the fruit. Picking is time-consuming and laborious: its labor input accounts for 50% to 70% of the labor required across the entire grape planting process. China's population is aging while the number of agricultural workers is shrinking, so inefficient manual picking will inevitably drive picking costs ever higher, and with the spread of large-scale and facility viticulture, traditional manual picking can no longer meet the needs of market development. The development of grape picking robots with intelligent recognition capabilities has therefore become an active research topic among scholars at home and abroad. One of the key problems in developing such robots is the recognition and positioning of the target fruit. Zhiyong Xie et al. used RGB channel recognition to extract strawberry fruit contours with an accuracy above 85%. Using the characteristic reflection spectrum of apples, Zhaoxiang Liu et al. applied PSD three-dimensional calibration to locate apple fruit, keeping the maximum deviation within 13 mm. Traditional optical recognition techniques offer fast recognition speed and low structural complexity, but they cope poorly with occluding branches and leaves and with overlapping fruit in complex environments, and are therefore difficult to apply in actual production.

In recent years, there has been related research on target positioning based on deep learning. Girshick et al. proposed R-CNN (Regions with Convolutional Neural Network features), a region-based convolutional neural network [1]. The network uses a selective search algorithm to propose about 2,000 candidate regions in the input image and applies a convolutional neural network to each candidate region for feature extraction and recognition. This method was the first to combine deep learning with object detection. Fast R-CNN and Faster R-CNN were proposed subsequently. Fast R-CNN eliminates R-CNN's repeated convolution over candidate regions and adds ROI pooling (region of interest pooling) after the last layer of the feature extraction network [2], which significantly speeds up recognition. Faster R-CNN builds an RPN (Region Proposal Network) on top of Fast R-CNN, generating candidate regions directly and achieving high-accuracy end-to-end detection [3,4,5]. Derivative network models in this line include SSD (Single Shot MultiBox Detector) and others.

Building on the SSD network model, this paper performs further transfer learning and modification, and uses multi-image combined analysis to study the localization of grapes grown under facility cultivation.

2 Materials and Methods

2.1 Image Acquisition

Images of grapes to be picked were collected as the training and test sets for transfer learning of the SSD MobileNet model. The training images directly affect the fine structure of the model and hence its final accuracy [6]. Therefore, the selected images needed to be representative and to cover a wide range of conditions, with attention paid to background complexity to avoid overfitting. The image acquisition device used a Sony IMX363 CMOS sensor with a resolution of 4032 × 3024 pixels and a lens with an equivalent focal length of 28 mm. To ensure the robustness of the trained model under various light sources, lighting was not strictly controlled and varied randomly during image acquisition. Thirty clusters of Pujiang grapes with different shapes were selected as the experimental objects, with cluster heights ranging from 17.3 cm to 31.1 cm. The grapes hung vertically downward, perpendicular to the crossbar of the cultivation facility. With the grape stem as the axis, the lens was positioned 50 cm from the axis, and an image was taken every 15°, yielding a total of 720 color images.

2.2 Image Preprocessing

The image analysis and processing platform was a computer running the Windows 10 operating system with an Intel i7-7700 CPU, 8 GB of RAM, and an NVIDIA Quadro P620 professional graphics card with 2 GB of VRAM.

The training mode adopted in this paper is supervised learning; that is, both the labels and the corresponding images must be fed into the SSD MobileNet model, which then constructs the mapping function for grape object detection. The collected images were annotated manually with the LabelImg tool: each grape cluster was enclosed in a bounding rectangle whose top, bottom, left, and right edges coincided with the cluster's extents. The position of the grape stem was marked with the same edge convention; stems occluded by fruit or leaves were left unmarked. Leaves, where present, were also annotated. In total, 720 cluster labels, 633 stem labels, and 201 leaf labels were marked (Fig. 1).

Fig. 1. Manual marking of fruit.

Before transfer learning, the images must be preprocessed to remove noise that could affect accuracy, or to raise the weight of under-represented training samples to prevent underfitting [7]. Because the 201 leaf labels were far fewer than the other two classes, and because leaves differ considerably across viewing angles, this paper oversampled the images containing leaves: each such image was rotated 10° clockwise and 10° counterclockwise, expanding the set of leaf-bearing images to 603. The new samples were then annotated manually with the LabelImg tool. Because the images had already been captured at many angles around the grape stem axis, no further geometric preprocessing was applied to the training set.
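As a concrete illustration of this oversampling step, the following minimal sketch rotates a leaf-bearing image by ±10° with OpenCV (the file names are illustrative, not those used in this study):

```python
import cv2

def rotate_image(image, angle_deg):
    """Rotate an image about its center; positive angles are counterclockwise."""
    h, w = image.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    return cv2.warpAffine(image, m, (w, h), borderMode=cv2.BORDER_REPLICATE)

# Expand each leaf-bearing image into three samples: original plus two tilts.
src = cv2.imread("leaf_sample_001.jpg")                             # illustrative path
cv2.imwrite("leaf_sample_001_cw10.jpg", rotate_image(src, -10.0))   # 10 deg clockwise
cv2.imwrite("leaf_sample_001_ccw10.jpg", rotate_image(src, 10.0))   # 10 deg counterclockwise
```

After rotation, the bounding boxes no longer align with the originals, which is why the paper re-annotates the new samples manually rather than transforming the labels automatically.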

2.3 Transfer Learning

In this paper, we built a new SSD MobileNet in a programming environment with TensorFlow 1.14.0 (GPU build) and cuDNN 7.6.0.

SSD MobileNet is a neural network model combining MobileNet with the SSD algorithm: MobileNet performs image classification at the front end of the model, and the SSD algorithm sits at the back end to perform object detection [8]. MobileNet is a lightweight convolutional neural network structure with relatively low complexity, able to achieve a good recognition rate on platforms with limited computing power, such as mobile processors or the embedded chips carried by agricultural machinery. The network is built on depthwise separable convolution [9]. In a conventional convolution, the number of parameters grows with the product of the kernel area and the numbers of input and output channels, and a mature network model often stacks dozens of convolution and pooling layers, so the parameter count is large and limits speed. Depthwise separable convolution splits the traditional convolution into two steps. First, a depthwise convolution produces a separate feature map for each channel; then a pointwise convolution with a 1 × 1 kernel weights these per-channel feature maps across the depth dimension, yielding the same number of output feature maps as a traditional convolution [10]. Because the channel-by-channel convolution involves far fewer parameters, this factorization significantly reduces the parameter count and improves the recognition rate, or allows a deeper, more accurate network for the same parameter budget.
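To make the parameter saving concrete, here is a minimal Keras sketch contrasting a standard 3 × 3 convolution with its depthwise separable factorization (the layer sizes are illustrative, not the exact MobileNet configuration):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Standard 3x3 convolution: parameters ~ 3*3*M*N (kernel area x in x out channels).
standard = models.Sequential([
    layers.Conv2D(64, kernel_size=3, padding="same", input_shape=(224, 224, 32)),
])

# Depthwise separable version: a 3x3 depthwise conv (one filter per channel)
# followed by a 1x1 pointwise conv that mixes channels; parameters ~ 3*3*M + M*N.
separable = models.Sequential([
    layers.DepthwiseConv2D(kernel_size=3, padding="same", input_shape=(224, 224, 32)),
    layers.Conv2D(64, kernel_size=1, padding="same"),
])

print(standard.count_params(), separable.count_params())  # 18496 vs 2432
```

For 32 input and 64 output channels, the standard layer holds 18,496 parameters against 2,432 for the separable pair, roughly an eightfold reduction.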

The MobileNet V1 network structure has 28 layers. The network uses only a single 7 × 7 × 1024 average pooling layer at the end, followed by a softmax classifier; the front of the network is a serial combination of standard convolution layers and depthwise separable convolution layers, which reduces the computing time that pooling would otherwise require. The model also introduces two hyperparameters: the width multiplier α and the resolution multiplier β. With the width multiplier, the computational cost of one depthwise separable layer becomes Dk × Dk × αM × Df × Df + αM × αN × Df × Df, where α ∈ (0, 1]; α = 1 gives the standard MobileNet, while α < 1 gives a reduced model. The width multiplier makes every layer in the network thinner, further accelerating training and recognition at some cost in accuracy. The resolution multiplier β scales down the input resolution, which shrinks the length and width of every output feature map in equal proportion [11].
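The cost formula above can be written out directly. The following sketch (with illustrative layer dimensions) shows how the two multipliers shrink the multiply-accumulate cost of one depthwise separable layer:

```python
def separable_conv_cost(dk, m, n, df, alpha=1.0, beta=1.0):
    """Multiply-accumulate cost of one depthwise separable layer with width
    multiplier alpha (scales channel counts M, N) and resolution multiplier
    beta (scales the feature map edge length Df)."""
    am, an, bdf = alpha * m, alpha * n, beta * df
    depthwise = dk * dk * am * bdf * bdf    # per-channel spatial filtering
    pointwise = am * an * bdf * bdf         # 1x1 channel mixing
    return depthwise + pointwise

full = separable_conv_cost(3, 256, 256, 14)               # alpha = beta = 1
reduced = separable_conv_cost(3, 256, 256, 14, 0.5, 0.5)  # halved width and resolution
print(reduced / full)  # ~0.065, i.e. roughly alpha^2 * beta^2 = 0.0625
```

Because both terms scale with α and β quadratically, halving both multipliers cuts the cost by roughly a factor of sixteen.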

The back-end SSD network model is a modification of the VGG16 network. SSD comprises 11 blocks: it converts the sixth and seventh (fully connected) layers of VGG16 into 3 × 3 convolution layers, removes the dropout layers and the eighth, fully connected layer, and adds new convolution layers to increase the number of feature maps. SSD detects on a combination of feature maps at multiple resolutions: small, low-resolution feature maps suit large-scale object detection, while correspondingly large feature maps are available to detect targets with fine texture. The network is end-to-end, no longer requires candidate regions, and is more efficient than Faster R-CNN.
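The multi-resolution idea can be illustrated with the default box scales of the standard SSD300 configuration: each detection feature map is assigned a box scale by the linear rule from the SSD paper, so coarse maps handle large objects and fine maps handle small ones. A minimal sketch:

```python
# Standard SSD300 detection feature map edge lengths and the linear scale rule
# s_k = s_min + (s_max - s_min) * k / (K - 1); values here follow the SSD paper.
feature_map_sizes = [38, 19, 10, 5, 3, 1]
s_min, s_max = 0.2, 0.9

k_total = len(feature_map_sizes)
scales = [s_min + (s_max - s_min) * k / (k_total - 1) for k in range(k_total)]

for size, scale in zip(feature_map_sizes, scales):
    print(f"{size:>2} x {size:<2} map -> default box scale {scale:.2f} "
          f"({size * size} locations)")
```

The 38 × 38 map with scale 0.20 covers small, finely textured targets at many locations, while the 1 × 1 map with scale 0.90 covers objects that fill most of the image.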

In transfer learning, the source domain is the built-in classification task of the pretrained classifier in the MobileNet part [12], while the target domain is the classification set containing fruit clusters, fruit stems, and leaves. First, the labeled XML annotation files must be converted into the TFRecord format that TensorFlow can read. This paper divided the sample data into 80% training set, 10% test set, and 10% validation set. When configuring the files and the pipeline profile, the number of samples per training step must be adjusted to the graphics card's video memory; the batch size used in this paper was 16. We used a fixed feature extractor for transfer learning: the mature convolution layers at the front end of the model were frozen and used as the feature extractor for the target domain, and only the classifier at the end of the network and the related structures were trained to construct the new classifier [13,14,15].
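The fixed-feature-extractor idea can be sketched with Keras as follows. Note that the paper's actual pipeline was the TensorFlow Object Detection API with an SSD head; this simplified sketch only shows the principle of freezing a pretrained MobileNet front end and training a new three-class back end (the head layers are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNet

# Source domain: MobileNet pretrained on ImageNet, classifier head removed.
base = MobileNet(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze ("solidify") the mature front-end convolution layers

# Target domain: train only a new head for cluster / stem / leaf.
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(3, activation="softmax"),  # 3 target-domain classes
])
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.003),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Because gradients flow only through the small new head, training converges with far fewer labeled samples than training the whole network from scratch.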

2.4 Position Calibration

Once the transfer-learned network model is obtained, it can identify the target-domain contents in an image and return the corner coordinates of each recognition box. During harvesting, the end effector cuts the fruit stem and catches the cluster at its bottom; therefore, this paper focuses on position calibration of the stem center and the bottom of the fruit cluster.

Depth distance was acquired with a miniature laser rangefinder. Its measurement accuracy is better than 1 mm, its measurement range is 0.03–80.00 m, and its spot diameter is less than 0.6 mm under normal working conditions. The rangefinder was mounted parallel to the camera on a 360° rotatable electronic pedestal, with the camera lens center offset horizontally by 2.5 cm from the center of the rangefinder's transmitter module.

When collecting the 3-D coordinates of the target, the fruit stem is located from the return values of the object detector. If the detected combination contains only fruit clusters and leaves, the system prompts for movement until a fruit stem appears. Once the detector identifies the stem, the target is centered in the image by rotating the support, and the horizontal and vertical rotation angles α and β of the support are recorded. The rangefinder is then swept left and right to obtain its characteristic return spectrum. The minimum value x of this spectrum is taken as the depth distance, and the three-dimensional coordinate of the target point is (x·cosβ·sinα, x·sinβ, x·cosβ·cosα) (Fig. 2).
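A direct transcription of this geometry, assuming α is the recorded horizontal (pan) angle and β the vertical (tilt) angle, is the following sketch:

```python
import math

def target_point(depth_x, alpha_deg, beta_deg):
    """Convert the pan angle alpha, tilt angle beta, and the minimum of the
    rangefinder's characteristic spectrum (depth x) into the camera-frame
    3-D coordinate (x cos(b) sin(a), x sin(b), x cos(b) cos(a)) from the text."""
    a, b = math.radians(alpha_deg), math.radians(beta_deg)
    return (depth_x * math.cos(b) * math.sin(a),   # horizontal offset
            depth_x * math.sin(b),                 # vertical offset
            depth_x * math.cos(b) * math.cos(a))   # depth along the optical axis

# Illustrative example: stem centered after panning 12 deg and tilting 5 deg,
# with a minimum measured depth of 0.62 m.
print(target_point(0.62, 12.0, 5.0))
```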

Fig. 2. Characteristic spectrum of distance.

3 Experimental Results and Analysis

3.1 Object Detection Results

The model was transferred using fixed feature extraction. The data comprised 576 training, 72 test, and 72 validation images of fruit clusters; 506 training, 63 test, and 63 validation images of fruit stems; and 482 training, 60 test, and 60 validation images of leaves. With the batch size set to 16, an initial learning rate of 0.003, and a maximum of 10,000 training iterations, the recognition accuracy peaked at 82.9% at 5,000 iterations. A detection was counted as correct when its IoU exceeded 0.85 (Table 1).

Table 1. Results of object detection.
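The IoU > 0.85 correctness criterion can be computed from the predicted and hand-labeled boxes as in this minimal sketch (the box coordinates are illustrative):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as correct when IoU with the labeled box exceeds 0.85.
print(iou((50, 40, 210, 300), (60, 45, 215, 310)) > 0.85)  # True (IoU ~ 0.86)
```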

When the batch size was reduced too far, the loss began to fluctuate strongly as the number of iterations increased, making good normalization difficult because the mean and variance of the data could not be estimated accurately; recognition accuracy also declined significantly [16]. As the batch size increased, parameters were updated less often and the gradient estimates became more accurate. However, too large a batch size halted training due to insufficient video memory, and it also weakened the stochastic behavior of the optimization [17, 18].

3.2 Comprehensive Test Results

Because the final three-dimensional coordinates combine the object detection output, the comprehensive image analysis, and the side-axis ranging data, their accuracy had to be verified. Twenty fruit clusters were measured with the camera and ranging sensor mounted on the rotatable support, with each target measured 10 times, for 200 measurements in total. The recognition model identified the stem and the bottom of the cluster; a recognition was counted as correct when IoU > 0.85, and a coordinate calculation was counted as correct when the error between the computed three-dimensional point and the actual measurement was less than 1.5 cm. Target recognition was correct in 159 cases, an accuracy of 79.5%; three-dimensional coordinate positioning was accurate in all 159 of these cases, meaning that every correctly identified target was localized within the allowable error, for an overall accuracy of 79.5%. The detection frame rate remained around 20 fps, a good recognition result.

4 Discussion

On the TensorFlow platform, SSD MobileNet V1 was used for transfer learning on the grape picking samples, and the recognition accuracy was close to that of the original model. Through the central deviation angle method and the rangefinder depth data, the three-dimensional picking coordinates were constructed.

Transfer learning significantly speeds up model construction and eliminates the processes of repeatedly adjusting the network structure, optimizing node parameters, and collecting and labeling large sample sets. In fixed feature extraction, the original mature network retains good generalization ability for feature extraction, which brings the recognition rate and accuracy on the target-domain task close to, or even beyond, those of the original model. It is therefore well suited to building target recognition models for grapes and other fruits and vegetables.

In this paper, the three-dimensional coordinate information obtained by combining object detection with the central deviation angle method is expressed in the frame of the image receiver. In future picking machinery, this coordinate information can be transformed by recalibration into the final coordinates required to position the end effector. Once object detection succeeds, the three-dimensional coordinate calculation is accurate in nearly 100% of cases, so further improvement of the overall accuracy depends on further modification and optimization of the object detection model.

5 Conclusion

Following the subdivided steps of the grape picking process, the SSD MobileNet V1 network model was transferred to grape picking samples using fixed feature extraction. Combined with the central deviation angle method, we achieved a comprehensive accuracy of 79.5% on 200 physical samples, close to the inherent accuracy of the original model before transfer learning. This shows that the method in this paper achieves the desired transfer effect in the target domain.