
1 Introduction

The dumper (car tippler) is a widely used piece of equipment in China's coal transportation industry. It clamps a train carriage and rotates it to a set angle, dumping out all of the coal inside. Because the carriages of a train are connected by couplers, it is particularly important to remove the coupler promptly and reliably before the tippler operates. Coupler operating handles come in two types, upper-acting and lower-acting, and the uncoupling method differs between them. Uncoupling was initially carried out manually, but manual operation suffers from low efficiency and high risk, which motivated the development of automatic uncoupling robots. The ability of such a robot to locate the coupler operating lever quickly and accurately, and to correctly identify its type, directly determines uncoupling quality and operational efficiency. Compared with manual operation, the robot offers strong adaptability and high work efficiency. Therefore, developing target detection of the coupler operating lever for a new type of fully automatic uncoupling robot is of great significance.

With the advancement of technology, computer vision has been widely applied to moving object detection and tracking and has gradually touched many aspects of our lives [1,2,3]. Its importance is increasingly prominent, attracting more and more scholars and research institutions at home and abroad. Moving object detection and tracking based on computer vision has wide applications across many fields: by analyzing moving objects in videos or images, it supports public safety monitoring and management as well as accident prevention, detection, and handling. The technology has also been applied in virtual reality, human-computer interaction, planetary exploration, and behavior understanding, and it plays an important role in caring for the elderly, the young, the sick, and the disabled, as well as in autonomous navigation. Object detection methods fall into two stages: the traditional stage (HOG [4] and SIFT [5]) and the deep learning stage (Faster R-CNN [6], SSD [7], and YOLO [8]). Traditional methods determine target positions by exhaustive search, which is not only slow but also suffers a high false detection rate in practice; they have therefore gradually been replaced by deep learning methods. Deep learning detectors come in two main types. One first generates a series of candidate boxes for feature extraction and target localization, known as two-stage algorithms; the other skips separate candidate-box generation and performs target localization and feature extraction together, achieving end-to-end detection, known as one-stage algorithms.

YOLO is one of the most commonly used detectors of recent years and has evolved through continuous updates (YOLOv4 [9], YOLOv5 [10], and YOLOv7 [11]). Jun [9] proposed MCS-YOLOv4, a small object detection algorithm based on YOLOv4, multi-scale contextual information, and a Soft-CIOU loss function. The method adds a detection scale to the existing three to obtain richer location information, introduces an attention module into the neck of YOLOv4 to reduce the impact of irrelevant information on small objects, and proposes a Soft-CIOU loss to further improve small object detection accuracy; it achieved better detection performance on public small object datasets. Arifando [10] proposed an improved YOLOv5 that integrates GhostConv and C3Ghost modules into the network to reduce model parameters while maintaining detection accuracy, and replaces the SPPF in the YOLOv5 backbone with the SimSPPF module to improve computational efficiency. Their comparative analysis showed that the improved YOLOv5 has fewer FLOPs and parameters, uses less memory, and infers faster, achieving higher efficiency. Yanyun [11] proposed YOLOv7-sea, an improved detector that adds a prediction head so that the model can better detect small people and objects, and integrates a simple parameter-free attention module (SimAM) to search for regions of interest in the scene.
To further improve performance, the researchers adopted data augmentation, test-time augmentation (TTA), and weighted boxes fusion (WBF); their results show that the improved YOLOv7-sea outperforms the original model. Susu [12] proposed DF-SSD, an improved SSD detection algorithm based on DenseNet and feature fusion. It replaces the original VGG-16 backbone with DenseNet-S-32-1 to strengthen feature extraction, introduces a multi-scale feature-layer fusion mechanism that organically combines low-level visual features with high-level semantic features, and adds residual blocks before target prediction to further improve performance. Their experimental results indicate higher accuracy on small objects and on objects with specific relationships. Zhang [13] proposed MA ResNet, a detection model built on a multi-attention residual network. Specifically, spatial attention, channel attention, and self-attention mechanisms are introduced into the residual network, enabling it to focus more accurately on target regions and extract key features, and MA ResNet replaces the original VGG-16 feature extractor in Faster R-CNN. Experiments found that MA ResNet shows clear advantages in convergence speed, accuracy, and small target classification accuracy, demonstrating its strength in feature extraction and its benefit to detection performance. MA ResNet can therefore be regarded as an effective feature extraction model with superior performance in object detection tasks [14].

Object detection methods have been widely applied in many fields, but because datasets of coupler operating handles are difficult to capture, they have seen little application in train uncoupling. To address this, this article constructs a dataset of train coupler operating handles and expands it with data augmentation. The expanded dataset is used to simulate and analyze the complex visual conditions encountered during the uncoupling process. Detection with the YOLOv8n model achieved an mAP50 of 98.8%, demonstrating practical engineering value [15].

2 YOLOv8 Model

YOLOv8, as the latest model in the YOLO series, can be used for object detection, image classification, and instance segmentation tasks; this article applies it to object detection. Considering model size, the YOLOv8n network, which is small yet accurate, is selected. The YOLOv8n detection network mainly consists of three parts (as shown in Fig. 1): Backbone, Neck, and Head.

Fig. 1 YOLOv8 model structure diagram: Backbone, Neck, and Head, built from interconnected Conv, Upsample, Concat, and Detect blocks

The backbone is mainly used for feature extraction and includes the Conv, C2f, and SPPF modules. The Conv module applies convolution, batch normalization (BN), and the SiLU activation function to its input. YOLOv8n introduces a new C2f structure, the main module for learning residual features; its design gives YOLOv8n rich gradient flow information while remaining lightweight. The SPPF (fast spatial pyramid pooling) module converts feature maps of any size into fixed-size feature vectors.
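
As an illustration, the following minimal PyTorch sketch shows the Conv module described above (convolution, BN, SiLU); the channel and kernel settings are placeholders rather than the exact YOLOv8n configuration:

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Conv module as described above: convolution + batch norm + SiLU activation."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

x = torch.randn(1, 3, 640, 640)   # dummy input image tensor
y = ConvBNSiLU(3, 16)(x)          # -> shape (1, 16, 640, 640)
```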

The main function of the Neck network is to fuse multi-scale features and generate feature pyramids. The Neck adopts a PANet structure whose core consists of two parts: a feature pyramid network (FPN) and a path aggregation network (PAN). FPN first builds a feature pyramid from the backbone feature maps, then fuses features across levels top-down by upsampling coarser-grained maps and merging them with finer ones. However, FPN alone loses target position information. PAN complements FPN: a bottom-up path fuses feature maps from different levels through convolutional layers, accurately preserving spatial information. Together, FPN and PAN fully fuse the upstream and downstream information flows in the network and improve detection performance.
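
The following structural sketch (PyTorch, with hypothetical channel counts, and plain 1×1 convolutions standing in for YOLOv8's C2f fusion blocks) illustrates how the top-down FPN path and bottom-up PAN path exchange information; it is a conceptual sketch, not the exact YOLOv8n layer configuration:

```python
import torch
import torch.nn as nn

class Fuse(nn.Module):
    """Concatenate two maps and reduce channels (stand-in for YOLOv8's C2f block)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.reduce = nn.Conv2d(c_in, c_out, 1)

    def forward(self, a, b):
        return self.reduce(torch.cat([a, b], dim=1))

class FPNPAN(nn.Module):
    """Minimal top-down (FPN) + bottom-up (PAN) fusion over three backbone levels."""
    def __init__(self, c3=64, c4=128, c5=256):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse4 = Fuse(c5 + c4, c4)            # top-down: P5 into P4
        self.fuse3 = Fuse(c4 + c3, c3)            # top-down: P4 into P3
        self.down3 = nn.Conv2d(c3, c3, 3, 2, 1)   # bottom-up: P3 down to P4 scale
        self.down4 = nn.Conv2d(c4, c4, 3, 2, 1)   # bottom-up: P4 down to P5 scale
        self.fuse4u = Fuse(c3 + c4, c4)
        self.fuse5u = Fuse(c4 + c5, c5)

    def forward(self, p3, p4, p5):
        t4 = self.fuse4(self.up(p5), p4)          # semantic information flows down
        t3 = self.fuse3(self.up(t4), p3)          # finest fused map
        n4 = self.fuse4u(self.down3(t3), t4)      # positional information flows back up
        n5 = self.fuse5u(self.down4(n4), p5)
        return t3, n4, n5                         # heads predict on these three scales

p3, p4, p5 = (torch.randn(1, c, s, s) for c, s in [(64, 80), (128, 40), (256, 20)])
outs = FPNPAN()(p3, p4, p5)
```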

As the final prediction stage, the Head obtains the category and position information of targets of different sizes from feature maps of different sizes [16].

3 Basic Theory

3.1 Loss Function

YOLOv8 is an improved version of the YOLO series of object detection algorithms. In YOLOv8, the loss function consists of several parts, including a classification loss (VFL Loss) and a regression loss (CIOU Loss and DFL) [17]. The classification loss measures how accurately object categories are predicted for each box. VFL stands for Varifocal Loss, an improvement on Focal Loss: it addresses sample imbalance by adjusting the weights of positive and negative samples, making the model focus on hard-to-classify samples. VFL Loss thus measures the error of the classification predictions.
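
A minimal sketch of the Varifocal Loss idea, assuming sigmoid classification scores and an IoU-aware target score q (q > 0 for positives, 0 for negatives); alpha and gamma are illustrative defaults, not necessarily the values used in YOLOv8:

```python
import torch
import torch.nn.functional as F

def varifocal_loss(pred_logits, target_score, alpha=0.75, gamma=2.0):
    """Varifocal Loss sketch. target_score is an IoU-aware score:
    > 0 for positive samples (e.g., IoU with the ground truth), 0 for negatives."""
    p = pred_logits.sigmoid()
    pos = (target_score > 0).float()
    # Positives are weighted by the target score itself (hard positives count more);
    # negatives are down-weighted by alpha * p^gamma, as in Focal Loss.
    weight = target_score * pos + alpha * p.pow(gamma) * (1.0 - pos)
    bce = F.binary_cross_entropy_with_logits(pred_logits, target_score, reduction="none")
    return (weight * bce).sum()

# Example: 5 candidate boxes, 2 of them positive.
loss = varifocal_loss(torch.randn(5), torch.tensor([0.0, 0.83, 0.0, 0.0, 0.65]))
```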

The regression loss consists of two parts: CIOU Loss and DFL. CIOU stands for Complete Intersection over Union; CIOU Loss measures the accuracy of the predicted bounding boxes. It considers not only the position offset between the predicted box and the ground-truth box but also their shape similarity, so it better measures how well the boxes match, which helps improve detection accuracy.
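
For reference, a sketch of the CIoU computation for boxes in (x1, y1, x2, y2) format, following the published CIoU definition (the loss is then 1 - CIoU):

```python
import math
import torch

def ciou(box1, box2, eps=1e-7):
    """CIoU between boxes in (x1, y1, x2, y2) format."""
    # Intersection and union (the IoU term)
    iw = (torch.min(box1[..., 2], box2[..., 2]) - torch.max(box1[..., 0], box2[..., 0])).clamp(0)
    ih = (torch.min(box1[..., 3], box2[..., 3]) - torch.max(box1[..., 1], box2[..., 1])).clamp(0)
    inter = iw * ih
    w1, h1 = box1[..., 2] - box1[..., 0], box1[..., 3] - box1[..., 1]
    w2, h2 = box2[..., 2] - box2[..., 0], box2[..., 3] - box2[..., 1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)
    # Squared distance between box centers (the position-offset term)
    rho2 = ((box1[..., 0] + box1[..., 2] - box2[..., 0] - box2[..., 2]) ** 2 +
            (box1[..., 1] + box1[..., 3] - box2[..., 1] - box2[..., 3]) ** 2) / 4
    # Squared diagonal of the smallest enclosing box
    cw = torch.max(box1[..., 2], box2[..., 2]) - torch.min(box1[..., 0], box2[..., 0])
    ch = torch.max(box1[..., 3], box2[..., 3]) - torch.min(box1[..., 1], box2[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term (the shape-similarity term)
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v

b1 = torch.tensor([[10., 10., 50., 60.]])
b2 = torch.tensor([[12., 15., 48., 58.]])
loss = 1.0 - ciou(b1, b2)
```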

DFL stands for Distribution Focal Loss. Rather than regressing a single value for each box edge, DFL treats the edge position as a discrete probability distribution and focuses the network on values near the true location. The sharper, more precise boundaries this produces are especially helpful for small targets, so DFL helps improve the model's detection performance on them.
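
A minimal sketch of the DFL idea, assuming the network outputs logits over a fixed number of discrete offset bins per box edge (the bin count is illustrative):

```python
import torch
import torch.nn.functional as F

def dfl(pred_dist, target):
    """Distribution Focal Loss sketch.
    pred_dist: (N, n_bins) logits over discrete box-offset bins.
    target: (N,) continuous offsets in [0, n_bins - 1]."""
    n_bins = pred_dist.size(1)
    target = target.clamp(0, n_bins - 1 - 1e-3)  # keep the right-hand bin in range
    tl = target.long()          # left bin index
    tr = tl + 1                 # right bin index
    wl = tr.float() - target    # weight toward the left bin
    wr = target - tl.float()    # weight toward the right bin
    # Cross entropy on the two neighboring bins, weighted by distance to the target,
    # pushes probability mass toward the true (sub-bin) edge location.
    return (F.cross_entropy(pred_dist, tl, reduction="none") * wl +
            F.cross_entropy(pred_dist, tr, reduction="none") * wr).mean()

# Example: 4 box edges, 16 bins each, true offsets lying between bin centers.
loss = dfl(torch.randn(4, 16), torch.tensor([3.2, 7.9, 0.4, 14.6]))
```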

Overall, the loss function of YOLOv8 consists of classification loss (VFL Loss) and regression loss (CIOU Loss and DFL), which are used to measure the accuracy of classification prediction and bounding box regression, respectively. The introduction of these loss functions can help improve the performance of the YOLOv8 model in object detection tasks.

3.2 Evaluation Indicators

In evaluating deep learning models, accuracy (A), precision (P), recall (R), and the F1 score can be used [18].

  1. Accuracy is the proportion of correctly classified samples among all samples, calculated as Eq. (1):

     $$A = \frac{TP + TN}{TP + TN + FP + FN}$$
     (1)

  2. Precision represents the proportion of true positives among the samples predicted as positive, as shown in Eq. (2):

     $$P = \frac{TP}{TP + FP}$$
     (2)

  3. Recall represents the proportion of true positive samples that are correctly predicted as positive, as shown in Eq. (3):

     $$R = \frac{TP}{TP + FN}$$
     (3)

In these formulas, TP (true positive) means the predicted value is 1 and the actual value is 1, i.e., a correct prediction of a positive example; FP (false positive) means the predicted value is 1 but the actual value is 0, an incorrect prediction; FN (false negative) means the predicted value is 0 but the actual value is 1, an incorrect prediction; TN (true negative) means the predicted value is 0 and the actual value is 0 [19].

In summary, accuracy represents the percentage of correctly predicted samples in the total sample; precision represents the proportion of predicted positives that are truly positive, reflecting the level of false alarms; recall evaluates coverage, representing the proportion of true positives that are successfully recovered.
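
These definitions translate directly into code; a small helper with illustrative confusion-matrix counts (not results from this paper's experiments):

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)           # Eq. (1)
    precision = tp / (tp + fp)                           # Eq. (2)
    recall = tp / (tp + fn)                              # Eq. (3)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of P and R
    return accuracy, precision, recall, f1

print(metrics(tp=95, tn=90, fp=5, fn=10))
```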

The analysis of experimental results in this article will comprehensively consider the above indicators to evaluate the predicted results.

4 Experiment Validation

4.1 Experimental Datasets

There are very few public object detection datasets for coupler uncoupling operating handles, making verification on public data difficult. This experiment therefore used a self-built dataset of coupler operating handles taken from on-site photos of the automatic uncoupling robot. The handles mainly fall into two types, upper-acting and lower-acting, as shown in Fig. 2. However, because the variety of operating handles and working environments at the uncoupling site is limited, it is difficult to ensure rich image categories in the dataset. Before the experiment, the dataset was therefore expanded through a series of operations including rotation, translation, and brightness adjustment, as shown in Fig. 3. This expansion effectively increased the diversity of the experimental data and makes it possible to verify the algorithm's ability to recognize and localize targets under complex visual conditions. The expanded dataset was annotated, and the annotated images and label files were divided into training, validation, and test sets at a ratio of 8:1:1. Results obtained during training and validation are shown in Fig. 4.
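
A sketch of the geometric and photometric expansion operations using OpenCV (angles, shifts, and brightness offsets are illustrative; when rotation or translation is applied, the box annotations must be transformed accordingly):

```python
import cv2
import numpy as np

def augment(img, angle=10.0, shift=(20, 10), brightness=30):
    """Rotation, translation, and brightness expansion (parameters are illustrative).
    Note: geometric transforms must also be applied to the box labels."""
    h, w = img.shape[:2]
    # Rotation about the image center
    M_rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(img, M_rot, (w, h))
    # Translation by (dx, dy) pixels
    M_shift = np.float32([[1, 0, shift[0]], [0, 1, shift[1]]])
    shifted = cv2.warpAffine(img, M_shift, (w, h))
    # Brightness adjustment, clipped to the valid 0-255 range
    brightened = cv2.convertScaleAbs(img, alpha=1.0, beta=brightness)
    return rotated, shifted, brightened

rotated, shifted, brightened = augment(cv2.imread("handle_001.jpg"))  # hypothetical file
```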

Fig. 2 Original coupler unhook operating handles: a top-acting handle; b down-acting handle

Fig. 3 Expansion of handle image data: a rotation; b translation; c brightness adjustment; d blurring; e signal-to-noise ratio (noise) processing

Fig. 4 Training and verification process images: a training process; b verification process

4.2 Practice and Result Analysis

The training results are shown in Fig. 5, where the abscissa is the number of training epochs. The training and validation loss curves decrease as training proceeds, falling fastest in the first 50 epochs and leveling off as the model stabilizes. The four metrics (precision, recall, mAP50, and mAP50-95) behave oppositely, rising as training proceeds and eventually stabilizing, which indicates that training is stable and normal [20].
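
For reproducibility, training with the Ultralytics API might look like the following sketch; "coupler.yaml" is a hypothetical dataset configuration listing the 8:1:1 splits and the two handle classes, and the hyperparameter values are illustrative rather than the exact settings used here:

```python
from ultralytics import YOLO

# Load pretrained YOLOv8n weights and fine-tune on the coupler-handle dataset.
model = YOLO("yolov8n.pt")
results = model.train(data="coupler.yaml", epochs=300, imgsz=640, batch=16)
```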

Fig. 5 Training and validation results: a training loss curves (box, cls, dfl); b validation loss curves (box, cls, dfl); c, d metric curves (precision, recall, mAP50, mAP50-95)

After training, the full set of indicators is shown in Table 1. The average precision of the model is 0.996, the average recall is 0.986, the average mAP50 is 0.988, and the average mAP50-95 is 0.943. The precision, recall, mAP50, and mAP50-95 of both the down-acting and top-acting handle categories reach good levels, so the model can be applied to an actual coupler handle detection system. To visualize these results, Table 1 is plotted as Fig. 6, in which 1 denotes the overall model, 2 the down-acting handle, and 3 the top-acting handle.

Table 1 Model evaluation indicators
Fig. 6 Model evaluation indicators (horizontal bar chart of Table 1: precision, recall, mAP50, and mAP50-95 per class)
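
Continuing the hypothetical training snippet above, the validation metrics and per-image predictions could be obtained as follows ("handle.jpg" and the weights path are placeholders; the path shown is the Ultralytics default output location):

```python
from ultralytics import YOLO

# Load the best weights saved by training (adjust to the actual run directory).
model = YOLO("runs/detect/train/weights/best.pt")

# Evaluate on the validation split; map50 and map correspond to mAP50 and mAP50-95.
metrics = model.val()
print(metrics.box.map50, metrics.box.map)

# Inference on a new image.
results = model.predict("handle.jpg", conf=0.5)
```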

4.3 Compare with Other Mainstream Algorithms

Using the expanded dataset as samples, the algorithm adopted in this paper was compared experimentally with SSD300 and YOLOv4-Tiny. The evaluation indicators were model size (MB), number of parameters Params (M), computation (GFLOPs), and mAP (%). The results are shown in Table 2.

According to Table 2, the YOLOv8n algorithm has the fewest parameters, the smallest size, and the lowest computational cost. Data augmentation was used to simulate the images the uncoupling robot encounters in complex environments, and YOLOv8n achieved excellent detection results on the coupler handle under these conditions, fully meeting the accuracy requirements of object detection. In addition, its saved optimal-weight file is the smallest, making it easier to deploy on resource-constrained platforms such as embedded or mobile devices; the smaller size also enables faster inference, so prediction tasks complete sooner. Therefore, compared with the other models, YOLOv8n is better suited to time-critical, complex task scenarios such as coupler handle detection.

Table 2 Comparative experiments with other algorithms
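
As a deployment sketch continuing the same hypothetical snippet, the trained weights could be exported to a portable format such as ONNX for embedded or mobile targets (the format choice is illustrative, not the paper's stated pipeline):

```python
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # hypothetical weights path
model.export(format="onnx")  # writes an .onnx file alongside the weights
```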

5 Conclusions

To accurately identify and locate the operating handle for a train automatic uncoupling robot, an object detection method for the coupler handle based on the YOLOv8n model is proposed. The method uses YOLOv8n as the baseline detector, maintaining accuracy on simple targets while effectively keeping the parameter count low. Because the variety of on-site coupler operating handles and working environments is limited, it is difficult to ensure rich image categories in the dataset; before the experiment, the dataset was therefore expanded through a series of operations such as rotation, translation, and brightness adjustment to simulate complex visual environments. The YOLOv8n model was then used to detect and analyze the different types of coupler handles. The experimental results show that the method achieves excellent detection accuracy, reaching an mAP50 of 98.8%. In addition, its saved optimal-weight file is the smallest among the compared models, making it easier to deploy on resource-constrained platforms such as embedded or mobile devices, and the smaller size also enables faster inference, so prediction tasks complete sooner. The method therefore has high engineering application value.