
1 Introduction

Traffic sign detection and recognition is a research hotspot in environment perception, one of the three major modules of unmanned driving, and plays an important role in it [1]. In foggy weather, however, traffic sign detection faces problems such as small and unclear targets. The algorithm therefore needs to combine high precision with real-time performance. At the same time, the training image data must be rich enough for the neural network model to learn the characteristics of traffic signs in different complex environments [2].

Yu fused the dark channel prior algorithm with MSR for defogging and used the two-stage Faster R-CNN detector to detect traffic signs in foggy environments; compared with one-stage detectors, this approach is slower and computationally heavier [3]. Xu used image enhancement for defogging and proposed an improved convolutional neural network design to recognize traffic signs. Image enhancement, however, does not remove the fog but merely sharpens the image, so this method works only under light and medium fog and performs poorly in dense fog. Chen et al. first removed the haze with the deep learning algorithm IRCNN and then proposed a multi-channel convolutional neural network (Multi-channel CNN) model to recognize the dehazed images [4]. However, deep-learning-based defogging requires a large number of images in the data set and is relatively slow. Moreover, none of the above works constructed a traffic sign data set for foggy environments.

2 Image Defogging Preprocessing

2.1 Data Set Construction

In research on traffic sign detection and recognition, performance is mostly tested on data sets such as the American traffic sign data set (LISA). However, most samples in these data sets were collected under good lighting conditions, and no domestic researcher has constructed and published a rich, comprehensive data set of Chinese traffic signs in foggy environments. To detect traffic signs with YOLOv3 in foggy conditions, this article therefore requires such a Chinese traffic sign data set [5].

Based on this, images for traffic sign detection in foggy environments were gathered in two ways: some clear traffic sign pictures were downloaded from the Internet, and others were photographed on the spot in heavy fog. The images are divided into a training set and a test set at a ratio of 8:2, for a total of 3415 images, of which 2390 are in the training set and 1025 in the test set. Each image is labeled with the LabelImg software. The label information includes the category attribute of the traffic sign, the illumination of the image, and the upper-left and lower-right coordinates of the sign bounding box (in pixels), saved in XML format. The data is divided into 3 categories: indication signs, prohibition signs, and warning signs.
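To make the annotation format concrete, the following is a minimal sketch of parsing one LabelImg-style Pascal VOC XML file and performing the 8:2 split. The class names and directory layout are illustrative assumptions, not artifacts published with the paper.

```python
# Minimal sketch: parse a LabelImg-style Pascal VOC XML annotation and
# split the annotated images 8:2 into training and test sets.
import random
import xml.etree.ElementTree as ET
from pathlib import Path

# The three categories described above; exact label strings are assumed.
CLASSES = ["indication", "prohibition", "warning"]

def parse_annotation(xml_path):
    """Return (filename, [(class, xmin, ymin, xmax, ymax), ...])."""
    root = ET.parse(xml_path).getroot()
    filename = root.findtext("filename")
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bb = obj.find("bndbox")
        boxes.append((name,
                      int(bb.findtext("xmin")), int(bb.findtext("ymin")),
                      int(bb.findtext("xmax")), int(bb.findtext("ymax"))))
    return filename, boxes

def split_dataset(xml_dir, train_ratio=0.8, seed=0):
    """Shuffle annotation files reproducibly and split 8:2."""
    files = sorted(Path(xml_dir).glob("*.xml"))
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * train_ratio)
    return files[:n_train], files[n_train:]
```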

2.2 Dehazing Algorithm

Existing defogging algorithms fall into three main categories: the first is defogging based on image enhancement, the second is defogging based on image restoration, and the third is defogging based on deep learning [6].

This paper compares several algorithms; the dehazing results are shown in Fig. 2. The best visual effect is obtained by the DehazeNet algorithm, but its drawback is a long running time, 1.14 s on average. Therefore, considering the traffic sign detection scenario, this paper adopts the dark channel prior algorithm with guided filtering for image restoration [7]. The dark channel prior holds that in most non-sky local regions, at least one of the three RGB color channels of each image has a very low gray value, almost tending to zero. Following this principle, the dark channel map is obtained first; the atmospheric light value and the transmittance are then estimated from the dark channel map, and the transmission map is refined and optimized with a guided filter. Finally, substituting the result into the atmospheric scattering model yields the restored image. The steps of the algorithm are shown in Fig. 1, and an implementation sketch follows Fig. 2.

Fig. 1. Flow chart of dark channel restoration algorithm

Fig. 2. Comparison of dehazing effects
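The following is a minimal sketch of the pipeline in Fig. 1, assuming OpenCV with the contrib modules (for cv2.ximgproc.guidedFilter) is available. The patch size, the top-0.1% atmospheric light estimate, and the omega and t0 constants are common choices from the dark channel literature, not parameters reported in this paper.

```python
# Sketch of dark channel prior dehazing with guided-filter refinement.
import cv2
import numpy as np

def dark_channel(img, patch=15):
    """Per-pixel min over the color channels, then a local min filter."""
    min_rgb = img.min(axis=2)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (patch, patch))
    return cv2.erode(min_rgb, kernel)

def estimate_atmospheric_light(img, dark):
    """Average the input pixels at the brightest 0.1% of the dark channel."""
    n = max(1, int(dark.size * 0.001))
    idx = np.argsort(dark.ravel())[-n:]
    return img.reshape(-1, 3)[idx].mean(axis=0)

def dehaze(img_bgr, omega=0.95, t0=0.1, patch=15, radius=60, eps=1e-3):
    img = img_bgr.astype(np.float64) / 255.0
    A = estimate_atmospheric_light(img, dark_channel(img, patch))
    # Coarse transmission from the dark channel of the normalized image.
    t = 1.0 - omega * dark_channel(img / A, patch)
    # Refine the transmission map with a guided filter (guide = gray input).
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    t = cv2.ximgproc.guidedFilter(gray, t.astype(np.float32), radius, eps)
    t = np.clip(t, t0, 1.0)[..., None]
    # Invert the atmospheric scattering model I = J*t + A*(1 - t).
    J = (img - A) / t + A
    return np.clip(J * 255, 0, 255).astype(np.uint8)
```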

3 YOLOv3 Algorithm and Improvement

This article chooses the YOLOv3 model for this research because YOLOv3 improves on category prediction, bounding box prediction, multi-scale fusion prediction, and feature extraction [8]. Its mAP is comparable to RetinaNet's while being roughly 4 times faster, and it detects small objects significantly better than its predecessors. It is therefore well suited to detecting and recognizing traffic signs in complex environments [9].

3.1 YOLOv3 Detection Network

As shown by the dotted lines in Fig. 3, to improve accuracy on small targets, YOLOv3 downsamples the input image 5 times and predicts targets from the last 3 downsampling stages, producing feature maps at 3 different scales (outputs 1, 2 and 3). The side lengths of these output maps are 13, 26 and 52 respectively, each with a depth of 255 (in the standard COCO configuration, 3 anchors × (80 classes + 4 box offsets + 1 objectness score) per cell). The upsample-and-fuse approach of FPN (feature pyramid networks) is adopted; the advantage of upsampling within the network is that representational quality improves as the network deepens, so deeper object features can be used directly for prediction [10]. A minimal fusion sketch follows Fig. 3.

Fig. 3. Improved multi-scale prediction structure
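As a concrete illustration of the FPN-style fusion, the following PyTorch sketch upsamples a deep feature map 2× and concatenates it with a shallower one, in the manner of YOLOv3's route layers. The channel counts are illustrative assumptions, not values taken from the paper.

```python
# Minimal PyTorch sketch of YOLOv3-style multi-scale fusion.
import torch
import torch.nn as nn

class FuseBranch(nn.Module):
    """Upsample the deep branch 2x and concatenate with a shallow feature."""
    def __init__(self, deep_ch, shallow_ch, out_ch):
        super().__init__()
        self.reduce = nn.Conv2d(deep_ch, deep_ch // 2, 1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.merge = nn.Conv2d(deep_ch // 2 + shallow_ch, out_ch, 3, padding=1)

    def forward(self, deep, shallow):
        x = self.up(self.reduce(deep))      # e.g. 13x13 -> 26x26
        x = torch.cat([x, shallow], dim=1)  # route-layer style concatenation
        return self.merge(x)

# e.g. fusing a 13x13x1024 deep map with a 26x26x512 shallow map:
fuse = FuseBranch(1024, 512, 256)
out = fuse(torch.randn(1, 1024, 13, 13), torch.randn(1, 512, 26, 26))
print(out.shape)  # torch.Size([1, 256, 26, 26])
```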

3.2 YOLOv3 Network Optimization

Improved Multi-scale Prediction YOLOv3 Model.

YOLOv3 uses only three feature scales, so the shallow information it exploits is not sufficient [11]. Aiming at the problems that traffic sign detection and classification in complex environments is affected by varying conditions and that the targets are small, an improved YOLOv3 deep neural network is designed, adding a fourth feature scale of 104 × 104. The thick lines in Fig. 3 show the improved multi-scale network structure.

The specific method is as follows: in the YOLOv3 network, after the feature layer with a detection scale of 13 × 13 has been upsampled twice, the original finest scale of 52 × 52 can be extended to 104 × 104. To make full use of both deep and shallow features, the 109th layer and the 11th layer of the feature extraction network are fused through a route layer. The remaining fusions are: the 85th and 97th layers, each output after 2× upsampling, are merged through route layers with the feature maps of the 61st and 36th layers respectively. The details of each feature layer are shown in Table 1, and a sketch of the added scale follows the table.

Table 1. YOLOv3 feature map
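Continuing the FuseBranch sketch above, the added fourth scale can be illustrated by upsampling the 52 × 52 branch once more and fusing it with a shallow 104 × 104 backbone feature (the 11th-layer output mentioned in the text). The channel sizes are assumptions for illustration.

```python
# Continues the FuseBranch sketch: one extra upsample-and-route step
# yields the new 104x104 detection feature of the improved network.
p52 = torch.randn(1, 256, 52, 52)        # existing finest-scale branch
shallow = torch.randn(1, 128, 104, 104)  # e.g. the 11th-layer backbone output
fuse4 = FuseBranch(256, 128, 128)
p104 = fuse4(p52, shallow)               # new fourth-scale feature map
print(p104.shape)  # torch.Size([1, 128, 104, 104])
```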

Mosaic Image Enhancement.

Traditional data augmentation methods enrich the data set only by changing the characteristics of individual images [12]. Mosaic image enhancement combines 4 random images into one new image for training the network, which increases the diversity of the data and the number of targets and provides a more complex and effective training background, while the original annotation information is preserved, as shown in Fig. 4. This further improves precision and recall. At the same time, because several images are fed to the network at once, the effective batch size of the input is increased: one image stitched from four images is equivalent to four original images input in parallel (batch size = 4), which lowers the hardware requirements for training and improves the efficiency of the mean and variance statistics in the BN (Batch Normalization) layer. A minimal sketch follows Fig. 4.

Fig. 4. Effect diagram of mosaic image enhancement algorithm
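The following is a minimal sketch of the Mosaic operation: four images are resized into the quadrants of one canvas and their boxes are shifted with them. A fixed 2 × 2 grid and a 416 output size are simplifying assumptions; many implementations jitter the center point instead.

```python
# Minimal Mosaic augmentation sketch: 4 images -> 1 stitched image.
import cv2
import numpy as np

def mosaic(images, boxes_per_image, out_size=416):
    """images: list of 4 BGR arrays; boxes: lists of [x1, y1, x2, y2, cls]."""
    half = out_size // 2
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)
    offsets = [(0, 0), (half, 0), (0, half), (half, half)]
    merged = []
    for img, boxes, (ox, oy) in zip(images, boxes_per_image, offsets):
        h, w = img.shape[:2]
        canvas[oy:oy + half, ox:ox + half] = cv2.resize(img, (half, half))
        sx, sy = half / w, half / h  # scale factors into the quadrant
        for x1, y1, x2, y2, cls in boxes:
            # Shift each annotation along with its quadrant.
            merged.append([x1 * sx + ox, y1 * sy + oy,
                           x2 * sx + ox, y2 * sy + oy, cls])
    return canvas, merged
```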

Loss Function.

The YOLOv3 loss is divided into three parts: the localization loss Lloc(l, g), the confidence loss Lconf(o, c), and the classification loss Lcla(O, C), as shown in formula (1):

$$ \begin{aligned} & L(o,c,O,C,l,g) \\ & = \,\lambda_{1} L_{conf} (o,c) + \lambda_{2} L_{cla} (O,C) + \lambda_{3} L_{loc} (l,g) \\ \end{aligned} $$
(1)

Among them, λ1, λ2, and λ3 are balance coefficients.

When performing bounding box regression with the intersection-over-union ratio (IoU), if the predicted box and the target box do not intersect, the IoU is zero by definition and no loss can be propagated. To remedy this defect, this paper introduces the CIoU loss function for bounding box regression. A good regression localization loss should consider three geometric factors: overlap area, center point distance, and aspect ratio. The calculation is shown in formula (2):

$$ {\text{CIoU}} = {\text{IoU}} - \left( {\frac{{\rho^{2} ({\text{b}},{\text{b}}^{gt} )}}{{c^{2} }} + \alpha v} \right) $$
(2)
$$ L_{CIoU} = 1 - CIoU $$
(3)

Among them, ρ is the Euclidean distance between the center points of the two boxes, c is the diagonal length of the smallest box enclosing both, α is a weight function, and ν measures the similarity of the aspect ratios; ν and α are defined in formulas (4) and (5).

$$ v = \frac{4}{{\pi^{2} }}(\arctan \frac{{w^{gt} }}{{h^{gt} }} - \arctan \frac{w}{h})^{2} $$
(4)
$$ \alpha = \frac{v}{(1 - IoU) + v} $$
(5)

Even when the predicted box does not overlap with the target box, CIoU still provides a moving direction for the bounding box, and directly minimizing the distance between the two boxes makes convergence much faster. Adding the aspect ratio term speeds up convergence further and improves performance.
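The following sketch transcribes formulas (2)-(5) directly for a single pair of axis-aligned boxes in (x1, y1, x2, y2) form; a production implementation would vectorize this over whole batches.

```python
# CIoU loss for one predicted box and one ground-truth box.
import math

def ciou_loss(box, gt):
    # Overlap area and IoU.
    ix1, iy1 = max(box[0], gt[0]), max(box[1], gt[1])
    ix2, iy2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_b = (box[2] - box[0]) * (box[3] - box[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_b + area_g - inter + 1e-9)
    # rho^2: squared distance between the two box centers.
    rho2 = ((box[0] + box[2]) / 2 - (gt[0] + gt[2]) / 2) ** 2 \
         + ((box[1] + box[3]) / 2 - (gt[1] + gt[3]) / 2) ** 2
    # c^2: squared diagonal of the smallest enclosing box.
    cx1, cy1 = min(box[0], gt[0]), min(box[1], gt[1])
    cx2, cy2 = max(box[2], gt[2]), max(box[3], gt[3])
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2 + 1e-9
    # Aspect-ratio term v (formula 4) and weight alpha (formula 5).
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)
    ciou = iou - (rho2 / c2 + alpha * v)  # formula (2)
    return 1 - ciou                       # formula (3)
```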

Retraining Based on Transfer Learning.

The experiment adopts the idea of model transfer in transfer learning. Training the network model requires a large number of traffic sign images, but the database in this experiment contains only 3,415 images; insufficient image data would leave the network model under-fitted and ultimately reduce detection accuracy. This article therefore first initializes the network with a pre-trained model (trained on the COCO data set, from the YOLO official website) and then retrains it on the data set in this article. This greatly reduces training time and also reduces the probability of model divergence and over-fitting. A pre-trained model contains a large amount of weight information and feature data that can usually be shared across different tasks [13]; transfer learning transfers this specific and common feature information so it does not have to be relearned, achieving rapid learning.
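A hedged sketch of this weight transfer is shown below: COCO-pretrained tensors are copied wherever names and shapes match, so the backbone is initialized while the resized detection heads (3 traffic sign classes instead of 80) start fresh. The model object and checkpoint path are placeholders, not artifacts from the paper.

```python
# Transfer-learning initialization: copy matching pretrained tensors.
import torch

def load_pretrained(model, ckpt_path="yolov3_coco.pt"):
    # Assumes the checkpoint is a plain state_dict of named tensors.
    pretrained = torch.load(ckpt_path, map_location="cpu")
    own = model.state_dict()
    kept = {k: v for k, v in pretrained.items()
            if k in own and v.shape == own[k].shape}
    own.update(kept)
    model.load_state_dict(own)
    print(f"transferred {len(kept)}/{len(own)} tensors")
    return model
```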

4 Evaluation of Training Results

4.1 Experimental Environment and Data

See Tables 2 and 3.

Table 2. Experimental environment configuration
Table 3. Configuration file parameters

4.2 Evaluation Indicators

The evaluation indicators are the mean Average Precision (mAP) over all traffic sign classes in a complex environment and the time required for each picture, t = 1/N, in ms. First, the confusion matrix must be understood, as shown in Table 4 [14]:

Table 4. Confusion matrix

Calculate precision and recall:

$$ precision = \frac{TP}{{TP + FP}} $$
(6)
$$ recall = \frac{TP}{{TP + FN}} $$
(7)

In the formulas, TP, FN, FP and TN denote, respectively, positive samples that are correctly detected, positive samples that are missed (incorrectly classified as negative), negative samples that are incorrectly detected as positive, and negative samples that are correctly rejected.

mAP: The calculation of mAP has two steps. First, the average precision AP (Average Precision) of each category is calculated; second, the per-category AP values are averaged. The definitions are as follows:

$$ AP_{i} = \sum\limits_{k = 1}^{N} {P(k)\Delta r(k)} $$
(8)
$$ mAP = \frac{1}{m}\sum\limits_{i = 1}^{m} {AP_{i} } $$
(9)

where m is the number of categories. The evaluation indices are mAP and the time required to detect one picture: a higher mAP indicates a better detection effect, and a shorter detection time indicates a faster detector.
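A worked sketch of formulas (6)-(9): precision and recall from the confusion counts, AP as the sum of P(k)·Δr(k) over a confidence-ranked detection list, and mAP as the mean over classes.

```python
# Evaluation metrics transcribed from formulas (6)-(9).
def precision_recall(tp, fp, fn):
    """Formulas (6) and (7)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(ranked_hits, n_gt):
    """ranked_hits: detections sorted by confidence, True if matched (formula 8)."""
    ap, tp, prev_recall = 0.0, 0, 0.0
    for k, hit in enumerate(ranked_hits, start=1):
        tp += int(hit)
        recall = tp / n_gt
        ap += (tp / k) * (recall - prev_recall)  # P(k) * delta r(k)
        prev_recall = recall
    return ap

def mean_ap(aps):
    """Formula (9): average the per-category APs."""
    return sum(aps) / len(aps)
```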

4.3 Improved YOLOv3 Algorithm Test

To compare the detection effect of the improved network, the collected Chinese traffic sign detection data set was used to train and test the improved YOLOv3 network model and an SSD model. The precision/recall curves of the three categories are shown in Fig. 5. The precision and recall of the improved network are better than those of the original YOLOv3 model, and the SSD model has the lowest precision. The average precisions of the improved network on the three categories are 85.82%, 80.56% and 80.12%, all higher than the YOLOv3 results. In terms of real-time performance, on 416 × 416 images the standard YOLOv3 and the improved YOLOv3 of this article require 31.4 ms and 34.2 ms per image respectively, which meets the real-time requirement (Table 5).

Fig. 5. Precision-recall curves

Table 5. Comparison of AP value, mAP and running time of the three categories

4.4 Experiment to Improve the Detection Ability Under Foggy Conditions

The experiment is divided into 3 groups, as shown in Table 6. In the first group, the training set and test set both consist of the original pictures, serving as the baseline for the other models. In the second group, the training set consists of foggy images restored by the guided-filtering dark channel algorithm, while the test set remains unchanged. In the third group, both the training set and the test set use the restored images.

Table 6. Data set classification
Table 7. Comparison of AP value and mAP value

Table 7 shows that the AP and mAP values of the first group are slightly better than those of the second group, but the overall difference is small. Compared with the first two groups, the mAP of the third group is about 2.5% higher, so applying restoration-based dehazing to both the training set and the test set gives the best detection effect.

5 Conclusion

This paper constructs a training data set for traffic sign detection in foggy environments. The dark channel prior algorithm with guided filtering adds an image restoration step that strengthens detection in heavy fog. On top of the YOLOv3 network, a Mosaic image enhancement training method is introduced to address the small and insufficient data set, improving training efficiency and model accuracy. To address YOLOv3's poor detection in complex environments, an improved YOLOv3 algorithm with an additional feature scale is proposed. To handle the small, blurred and poorly localized targets in foggy conditions, the loss function of the detector is redesigned with the CIoU loss, further improving its detection accuracy for traffic signs in fog. Finally, given the limited number of samples and the resulting low accuracy, transfer learning is adopted for training. The detection effect is thereby greatly improved.