
1 Introduction

As an important tool for unloading bulk goods, tipplers are widely used in the coal transportation industry. Their working principle is to clamp a train carriage and rotate it through a certain angle so that the coal is dumped out. Because adjacent carriages are connected by couplers, the couplers must be disconnected promptly and reliably before the tippler operates. The operating handles used to uncouple train couplers come in two types, top acting and down acting, and the two types require different uncoupling actions. Uncoupling was initially performed manually, but manual operation is inefficient and exposes workers to danger, which has motivated research on automatic uncoupling robots. Compared with manual labor, automatic uncoupling robots can tolerate harsher working environments and achieve higher efficiency. The ability to determine the position of the coupler quickly and accurately and to identify its type has a significant impact on both the success rate and the efficiency of the uncoupling robot. Therefore, research on target detection of the coupler operating handle is of great significance.

The development of target detection methods falls into two stages: traditional methods and deep learning methods. Traditional target detection methods exhaustively search candidate target locations, which is not only time-consuming but also prone to false detections in practice, so they have gradually been replaced by deep learning methods. Deep learning detectors are mainly divided into two types. One type first generates a series of candidate boxes for feature extraction and target localization; these are known as two-stage algorithms, such as R-CNN and Faster R-CNN. The other type does not generate candidate boxes separately but performs target localization and feature extraction jointly, achieving end-to-end detection; these are called one-stage algorithms, such as YOLO and SSD [1,2,3]. Both families can handle a variety of complex detection tasks and are widely used. He et al. [4] introduced an attention module, a balance module, and a context module into Mask R-CNN to build an intelligent detection model for welding quality inspection of subway car bodies; experiments showed its detection accuracy to be 4.5 percentage points higher than that of the traditional Mask R-CNN. Zhang et al. [5] applied Faster R-CNN to equipment recognition and status detection in power rooms, performing well in both image and video tests and demonstrating the effectiveness of the algorithm. Li et al. [6] combined deep learning with unmanned aerial vehicle (UAV) bridge crack detection, using a UAV to capture high-quality images of long-span bridges and a transfer-learning-based Faster R-CNN to identify cracks with high accuracy and efficiency. Dai et al. [7] proposed a multi-task detector based on Faster R-CNN that uses an improved ResNet-50 backbone to estimate distance and detect pedestrians for autonomous driving; combining infrared cameras and LiDAR with the detector, it achieved good results in real nighttime road scenes. Luo et al. [8] optimized Faster R-CNN and combined it with feature enhancement to detect different vehicles, effectively addressing vehicle detection in complex traffic environments. Li et al. [9] combined YOLOv4 with YOLO-GGCNN for object recognition, enabling a robotic arm to selectively grasp objects in unknown environments. Jiang et al. [10] used the YOLO model to extract features from videos and images captured by infrared cameras and achieved good results in complex environments, verifying the effectiveness of YOLO for drone target detection. Mushtaq et al. [11] applied deep learning to the identification of key aerospace components, using YOLOv5 to recognize and classify parts on an assembly line with high accuracy. Ji et al. [12] improved the traditional YOLOv4 by introducing an extended perception module and an attention mechanism module and by refining the CIoU loss; experiments on public datasets verified that it detects small targets better than other models.

The above methods achieve good detection results, but they do not take the complexity and timeliness of the model into account. Most existing detection models adopt complex structures to pursue high accuracy, which increases computation time and hardware requirements and lengthens inference time in practical applications. The detection objects studied in this article are only two types of coupler operating handles, a relatively simple detection task that does not require an extremely complex model structure. Using a heavyweight general-purpose model for such a task wastes computation time and burdens the hardware. In response, this article adopts YOLOv5n, the lightest model in the YOLO series, as the detection model; it maintains accuracy while offering fewer parameters and faster operation, giving it high engineering application value for simple detection tasks.

The rest of this article is organized as follows. Section 2 introduces the detection method used in this article. Section 3 presents the experimental process and analyzes the target recognition results for the coupler handle. Section 4 concludes the article.

2 YOLOv5 Model

The YOLO series is representative of one-stage target detection algorithms. Owing to its fast detection speed, strong generalization ability, and high robustness, it has been widely applied in various detection scenarios. YOLOv5 is one of the most widely used versions; compared with the earlier YOLOv4, it has fewer parameters, faster computation, and higher accuracy [13]. Its model structure is shown in Fig. 1. According to model complexity, the YOLOv5 series provides five official models: YOLOv5x, YOLOv5l, YOLOv5m, YOLOv5s, and YOLOv5n [14,15,16]. YOLOv5n has the fastest computation and the fewest parameters, making it the most suitable for simpler recognition applications.

Fig. 1
A structural flow diagram of the YOLOv5 model: the input image passes through the Backbone, Neck, and Head sections, which are built from CBS, C3_1, C3_2, concatenation, convolution, and upsampling layers.

YOLOv5 model structure diagram

The Input section is the input end of the model and receives the original images. It mainly performs mosaic data augmentation, adaptive anchor box calculation, and adaptive image scaling [17, 18]. The mosaic method randomly selects several images and stitches them into one large image, increasing the diversity of the training data and the robustness of the network. Adaptive anchor box calculation automatically computes the anchor box parameters best suited to the input images, improving detection accuracy. Adaptive image scaling standardizes the size of the original image, reducing information redundancy in the input and improving the inference speed of the algorithm.
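As an illustration of the adaptive scaling step, the following minimal sketch (our own simplification, not the YOLOv5 source code) computes the resize scale and the minimal rectangular padding needed to letterbox an arbitrary image toward a 640 × 640 input while preserving its aspect ratio; the target size and stride are assumed values typical of YOLOv5.

```python
def letterbox_params(h, w, target=640, stride=32):
    """Compute resize scale and padding for YOLOv5-style adaptive scaling.

    The image is scaled to fit inside a target x target square, then padded
    only up to the next multiple of the network stride rather than to a full
    square, which reduces redundant border pixels and speeds up inference.
    """
    scale = min(target / h, target / w)          # keep aspect ratio
    new_h, new_w = round(h * scale), round(w * scale)
    # Minimal rectangular padding: round each side up to a stride multiple.
    pad_h = (-new_h) % stride
    pad_w = (-new_w) % stride
    return (new_h, new_w), (pad_h, pad_w)

# A 1080 x 1920 frame is scaled to 360 x 640 and needs only 24 rows of
# padding instead of the 280 a full 640 x 640 letterbox would add.
print(letterbox_params(1080, 1920))  # ((360, 640), (24, 0))
```

A square input needs no padding at all, which is why this step mainly benefits wide images such as those taken along a train.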

The Backbone section consists of CBS, C3_1, and SPPF modules and is mainly used for feature extraction [19]. The CBS and C3_1 modules repeatedly process the input images and gradually extract useful information. The SPPF module fuses local features and alleviates the multi-scale problem of the target to some extent; compared with the original SPP module, it computes faster and more efficiently. In the overall Backbone framework, the first convolutional layer usually takes one of two forms: Focus or CBS. The Focus module has stronger feature extraction capability, but its structure is relatively complex to implement, which increases computational cost; the CBS module is simple and better suited to simple recognition tasks. Therefore, the model in this article uses a CBS module as the first convolutional layer, simplifying the model structure and improving computational efficiency.
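The speed advantage of SPPF over SPP comes from replacing three parallel max-pool layers (kernel sizes 5, 9, and 13) with three serial 5-wide pools that produce the same receptive fields. A minimal 1-D sketch (plain Python lists standing in for feature tensors; our own simplification of the real 2-D modules) illustrates the equivalence:

```python
def maxpool1d(x, k):
    """Stride-1 max pooling with same-size output (padded with -inf)."""
    pad = k // 2
    xp = [float("-inf")] * pad + list(x) + [float("-inf")] * pad
    return [max(xp[i:i + k]) for i in range(len(x))]

def sppf(x):
    """SPPF: three serial k=5 pools; concatenate input and all outputs."""
    y1 = maxpool1d(x, 5)
    y2 = maxpool1d(y1, 5)
    y3 = maxpool1d(y2, 5)
    return x + y1 + y2 + y3

def spp(x):
    """SPP: parallel pools with k = 5, 9, 13 over the same input."""
    return x + maxpool1d(x, 5) + maxpool1d(x, 9) + maxpool1d(x, 13)

# Two serial k=5 pools cover the same window as one k=9 pool, and three
# cover a k=13 window, so SPPF yields identical features at lower cost.
x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
assert sppf(x) == spp(x)
```

The serial form reuses each intermediate result, which is where the saving over recomputing the three large windows comes from.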

The Neck section continues to use the traditional FPN + PAN structure, as shown in Fig. 2. FPN can transmit deep semantic features to shallow layers, while PAN can conversely transmit shallow localization information to deep layers. The effective combination of the two can enhance the ability of network feature extraction and fusion. Additionally, the Neck section introduces C3_2 modules to further enhance the network's feature fusion capability [20].

Fig. 2
A schematic of the Neck architecture: downsampled backbone features pass through the top-down FPN path and the bottom-up PAN path before being output.

FPN + PAN structure
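The two-way flow described above can be sketched in a few lines. The following 1-D toy (plain lists stand in for feature tensors; the real network uses convolutions and channel concatenation, so this shows only the direction of information flow, not the actual operations):

```python
# Minimal 1-D sketch of FPN + PAN fusion.
def upsample(x):               # nearest-neighbour 2x upsampling
    return [v for v in x for _ in (0, 1)]

def downsample(x):             # stride-2 subsampling
    return x[::2]

def fuse(a, b):                # element-wise add in place of concat + conv
    return [p + q for p, q in zip(a, b)]

# Backbone outputs at three scales (shallow/large -> deep/small).
p3, p4, p5 = [1.0] * 8, [2.0] * 4, [3.0] * 2

# FPN top-down path: deep semantic features flow to the shallow layers.
f4 = fuse(p4, upsample(p5))    # length 4
f3 = fuse(p3, upsample(f4))    # length 8

# PAN bottom-up path: shallow localization features flow back down.
n3 = f3
n4 = fuse(f4, downsample(n3))  # length 4
n5 = fuse(p5, downsample(n4))  # length 2

print(len(n3), len(n4), len(n5))  # 8 4 2
```

After both passes, every output scale has mixed in information from every other scale, which is the point of combining the two paths.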

The Head section is the output end of the model, which is divided into three scales for prediction at different scales [21].

3 Experiments and Result Analysis

3.1 Experimental Introduction

Experimental Dataset

The automatic uncoupling robot needs to recognize and locate the coupler operating handle during uncoupling, which requires it to learn the appearance of the different handle types. The experimental dataset was collected at different connecting parts of train carriages and contains images of coupler operating handles of different models against different backgrounds. The operating handles fall into two types, top acting and down acting, shown schematically in Figs. 3 and 4. All images were labeled with the “LabelImg” tool, with the two coupler types labeled “Top acting handle” and “Down acting handle”, respectively. Finally, the annotated images were randomly divided into validation, test, and training sets in a ratio of 1:1:8, with no image appearing in more than one set [22, 23].
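The 8:1:1 split can be sketched as follows; the filenames and random seed are illustrative only, not the actual dataset:

```python
import random

def split_dataset(files, ratios=(0.8, 0.1, 0.1), seed=42):
    """Randomly split a file list into disjoint train/val/test sets."""
    files = list(files)
    random.Random(seed).shuffle(files)       # fixed seed for reproducibility
    n = len(files)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = files[:n_train]
    val = files[n_train:n_train + n_val]
    test = files[n_train + n_val:]           # remainder goes to test
    return train, val, test

# Illustrative use with 100 hypothetical annotated images.
images = [f"handle_{i:03d}.jpg" for i in range(100)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))  # 80 10 10
```

Shuffling before slicing is what guarantees the three sets share no image, as required for an honest evaluation.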

Fig. 3
A photo of a rusty connection mechanism between two train carriages on a track. The handle between them acts upward.

Top acting handle

Fig. 4
A photo of a connection mechanism between two train carriages on a track. The handle between them acts downward.

Down acting handle

Experimental Evaluation Indicators

To evaluate model effectiveness, this article selects P (precision), R (recall), and mAP (mean average precision) as indicators. Precision is the proportion of predicted positive samples that are actually positive, as shown in Eq. (1), where TP is the number of positive samples correctly predicted as positive and FP is the number of negative samples incorrectly predicted as positive [24]. Recall is the proportion of actual positive samples that are correctly predicted, as shown in Eq. (2), where FN is the number of positive samples incorrectly predicted as negative. Plotting P on the vertical axis against R on the horizontal axis yields the PR curve, and AP (average precision) is the area under this curve, as shown in Eq. (3); the larger the AP, the better the model's performance. At present, mAP is commonly used to measure overall detection performance: it is the average of the AP values over all categories [25, 26], as shown in Eq. (4), where nclass is the number of categories. The larger the mAP, the better the detection performance.

$$P = \frac{TP}{{TP + FP}}$$
(1)
$$R = \frac{TP}{{TP + FN}}$$
(2)
$$AP = \int\limits_{0}^{1} {P(R)dR}$$
(3)
$$mAP = \frac{1}{{n_{{{\text{class}}}} }}\sum\limits_{i = 1}^{{n_{{{\text{class}}}} }} {AP_{i} } .$$
(4)
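Equations (1)–(4) can be sketched directly in code; the AP integral is approximated here by trapezoidal integration over a discretized PR curve, a simplification of the interpolated schemes used in practice:

```python
def precision(tp, fp):
    """Eq. (1): fraction of predicted positives that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (2): fraction of actual positives that are found."""
    return tp / (tp + fn)

def average_precision(pr_points):
    """Approximate Eq. (3): area under the PR curve.

    pr_points: list of (recall, precision) pairs sorted by recall,
    integrated with the trapezoidal rule.
    """
    ap = 0.0
    for (r0, p0), (r1, p1) in zip(pr_points, pr_points[1:]):
        ap += (r1 - r0) * (p0 + p1) / 2
    return ap

def mean_average_precision(ap_per_class):
    """Eq. (4): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# A detector with 90 true positives, 10 false positives, 5 false negatives:
print(round(precision(90, 10), 3))   # 0.9
print(round(recall(90, 5), 3))       # 0.947
# A perfect PR curve (precision 1 at every recall) has AP = 1.
assert average_precision([(0.0, 1.0), (1.0, 1.0)]) == 1.0
```

For the two handle classes in this article, mAP is simply the mean of the two per-class AP values.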

3.2 Network Training

The dataset is fed into the network for training. The learning rate is set to 0.01, the batch size to 8, and the number of epochs to 200. The loss curves over the training iterations are shown in Fig. 5, where box_loss is the bounding box regression loss, obj_loss the objectness loss (whether the bounding box contains an object), cls_loss the classification loss, train the training set, and val the validation set. As the figure shows, the loss curves of both the training and validation sets converge effectively and approach 0, indicating that the method reaches its best performance within 200 epochs.

Fig. 5
Six line graphs plot loss curves for training and validation datasets against epochs. In all graphs, the line resembles a concave-up declining trend, with fluctuations.

Various loss curves

The precision, recall, and mAP values are calculated and their curves plotted, as shown in Fig. 6. After 200 epochs, the precision, recall, and mean average precision curves all converge and approach 1, indicating that the method of this article can classify, locate, and recognize different types of train coupler operating handles with excellent results.

Fig. 6
Three line graphs plot precision, recall, and mAP@0.5 versus epoch. In each graph, the fluctuating line rises sharply up to (30, 1.0) and stabilizes thereafter.

Change curves of the evaluation indicators. a Precision curve. b Recall curve. c mAP curve

Figure 7 shows the confusion matrix of the validation results. The recognition accuracy for both the top acting handle and down acting handle categories is 1, and no handle is confused with the background category. This indicates that the method can effectively recognize both types of coupler handle and localize them well, without confusing the handles with other background parts.

Fig. 7
A confusion matrix for a classification model with three labels namely top acting handle, down acting handle, and background. The x-axis represents the actual label and the y-axis represents the predicted label. The matrix has a higher value of 1 in the first two diagonal cells.

Confusion matrix

3.3 Visualization of Test Results

The trained model is used to detect and recognize the coupler operating handle images in the test set; the results are shown in Fig. 8. The figure shows that the method accurately boxes different styles of coupler operating handle with high confidence scores, which again demonstrates its good engineering application value.

Fig. 8
Two photos of coupler connections between train carriages: the first shows a top acting handle detected with a confidence of 0.96, the second a down acting handle detected with a confidence of 0.95.

Visualization of detection results. a Top acting handle detection results. b Down acting handle detection results

3.4 Comparison of Different Models

To verify the advantages of the YOLOv5n method for detecting coupler uncoupling handles, it is compared with SSD300 and Faster R-CNN in terms of detection accuracy, parameter count, computational complexity, prediction speed, and saved weight file size. For detection accuracy, the mAP curves of the three methods are plotted in Fig. 9. At a threshold of 0.5, the mAP values of YOLOv5n, SSD300, and Faster R-CNN all rise rapidly with iteration and eventually converge, reaching best values of 99.50%, 99.95%, and 96.88%, respectively. Although the mAP of YOLOv5n is not the highest of the three, it fully meets the accuracy requirements of the task.

The parameter count, computational complexity, and prediction speed of the three methods are listed in Table 1. Compared with SSD300 and Faster R-CNN, YOLOv5n has the fewest parameters and the lowest computational complexity, and therefore requires the least hardware computing power. In terms of prediction speed, the FPS of YOLOv5n is higher than that of the other two algorithms, indicating the fastest inference and the quickest target identification in practical applications. For model weight storage, a bar chart comparing the best-weight file sizes of the three methods is shown in Fig. 10: YOLOv5n produces the smallest weight file, making it easier to deploy on resource-constrained platforms such as embedded or mobile devices, and the smaller size also contributes to faster inference.

Based on the above analysis, compared with the other models the YOLOv5n model is more suitable for task scenarios with simple targets and high timeliness requirements, such as coupler handle detection.
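FPS figures such as those reported in Table 1 are typically obtained by averaging per-image inference latency over many runs, after a few warm-up passes. A minimal timing sketch, with a dummy callable standing in for a real detector's forward pass (the 2 ms latency is an arbitrary illustration):

```python
import time

def measure_fps(infer, n_images=50, warmup=5):
    """Average frames per second of an inference callable."""
    for _ in range(warmup):                 # warm-up runs are excluded
        infer()
    start = time.perf_counter()
    for _ in range(n_images):
        infer()
    elapsed = time.perf_counter() - start
    return n_images / elapsed

# Dummy "model" standing in for a real detector's forward pass.
def dummy_infer():
    time.sleep(0.002)                       # pretend 2 ms per image

fps = measure_fps(dummy_infer)
print(f"{fps:.0f} FPS")                     # roughly 500 FPS or below
```

Warm-up runs matter in practice because the first forward passes of a real model pay one-off costs (memory allocation, kernel compilation) that would otherwise bias the average.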

Fig. 9
A line graph plots mAP@0.5 versus epochs for YOLOv5n, SSD300, and Faster R-CNN. All three methods rise initially and then stabilize, with YOLOv5n and SSD300 stabilizing at the highest scores.

mAP@0.5 curves of the different methods

Table 1 Comparison of the effects of different methods
Fig. 10
A column chart plots best-weight file size for the three methods: YOLOv5n at 3.67 MB, SSD300 at 91.1 MB, and Faster R-CNN at 108 MB.

Comparison of best weight file sizes for different methods

4 Conclusions

To solve the problem of identifying and locating the coupler operating handle for the train automatic uncoupling robot during uncoupling, a target detection method for the coupler handle based on the YOLOv5 model is proposed. The method uses the relatively simple YOLOv5n model as the reference detection model, which maintains detection accuracy for simple objects while effectively reducing the number of parameters. The YOLOv5n model is applied to detect and analyze different types of coupler handle. The experimental results show that the method offers high detection accuracy, fast speed, small weight files, and low hardware requirements, and thus has high engineering application value.