1 Introduction

Agglomerate fog generally forms in low-lying areas with high air humidity and is closely related to the local microclimate. It not only poses a great safety hazard to vehicle travel, but also makes the rapid emergency response required by such disasters extremely difficult. With the development of computer vision technology, how to use existing roadside surveillance video to detect agglomerate fog hazards has received extensive attention from scholars. Presently, many agglomerate fog detection methods are constructed by analyzing the characteristics of images [10, 20, 25, 27], while others are developed through machine learning [23]. The former approaches detect agglomerate fog by analyzing its influence on objects in the scene; the latter train classifiers on hand-crafted features to identify agglomerate fog. The key issues of these two kinds of methods lie in the construction of features and the design of classifiers, and both kinds have been greatly improved in terms of accuracy.

In recent years, deep learning theory and methods have developed rapidly. Convolutional neural networks (CNNs) have been widely used in image processing and analysis [35], including image classification, object detection [36], and image enhancement. Image de-fogging algorithms based on convolutional neural networks are now common. Cai proposed a trainable end-to-end system called DehazeNet to estimate the transmittance of image blocks; the system uses a deep convolutional network with image blocks as input and their transmittance as output [2]. Ren proposed a multi-scale convolutional neural network (MSCNN) consisting of a coarse-scale network, used for a rough estimate of the transmittance, and a fine-scale network, used to refine that estimate [24]. Li introduced the network model AOD-Net for image de-fogging; unlike other methods, AOD-Net does not separately estimate the transmittance or the atmospheric light but obtains the fog-free image directly from the network output [12].

In the study of image de-fogging, the first step is to detect whether an image contains agglomerate fog: applying a de-fogging algorithm to a fog-free image degrades it and wastes computing resources. This paper proposes a shallow convolutional neural network model for image-based agglomerate fog detection. Compared with existing methods, the network designed in this paper is shallow, which saves computing power. At the same time, owing to the block strategy and the large-to-small design of the convolution kernels, which together capture both global and local characteristics, it achieves good detection results.

2 Related works

Fog detection can be regarded as a binary classification problem, for which traditional machine learning algorithms have notable shortcomings. For instance, they require hand-designed features such as local binary patterns (LBP) [21], histograms of oriented gradients (HOG) [4], and speeded-up robust features (SURF) [1].

Image classification based on deep learning is an end-to-end training process. The input of the neural network is an image, and the output is the probability of each category; the category with the highest probability is taken as the prediction. Representation features can be extracted from the hidden layers: from shallow to deep, the network progresses from common low-level features to abstract ones, and the output features represent the categories.

Theoretically, as long as the network is wide and deep enough, a neural network can fit any function. However, when dealing with complex tasks, fully connected traditional neural networks become complicated, and the number of parameters grows with the number of neurons, which leads to overfitting. A convolutional neural network instead traverses all positions with a convolution operation whose kernel is, in essence, a filtering kernel, so the convolution operation is equivalent to image filtering. The patterns learned by convolutional neural networks are translation-invariant: once a pattern has been learned, the network can identify it anywhere in an image. Moreover, convolutional neural networks learn patterns with a spatial hierarchy: shallow convolution layers learn edge features, deeper layers learn contour features by combining edges, and the last layers learn more essential features by combining the previous ones.

As early as 1998, LeCun proposed the convolutional neural network LeNet [11] for handwritten character recognition. A schematic diagram of the network is shown in Fig. 1; it includes two convolution layers, two downsampling layers, and two fully connected layers. This convolution–downsampling–fully-connected pattern, or variants of it, is still used in convolutional neural networks proposed in recent years, such as VGGNet [22, 28] and Inception [29, 30, 31]. LeNet was applied to bank check recognition and achieved good results.

Fig. 1 LeNet network structure

In 2012, Krizhevsky won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) with AlexNet [7]. The AlexNet network includes five convolution layers, three maximum pooling layers, and two local response normalization layers. AlexNet's design ideas are similar to LeNet's; the differences are that AlexNet uses ReLU instead of Sigmoid as the activation function and employs several methods to prevent overfitting, such as data augmentation and dropout.

In 2014, researchers from the Visual Geometry Group (VGG) at the University of Oxford achieved top results in the ILSVRC with VGGNet. This network also used the convolution–pooling–fully-connected pattern. Unlike AlexNet, VGGNet explored the depth of the network, proposing a range of networks from 11 to 19 layers and achieving good results with the 16- and 19-layer configurations. Many networks now use VGGNet as a benchmark for comparison.

In recent years, structures such as Inception [29] and ResNet [5] have achieved better performance in object recognition. The Inception structure combines features from multi-scale receptive fields, so the extracted features are richer. The residual connection in ResNet combines shallow features with deep features, making gradients easier to propagate backwards, which is beneficial for designing deeper networks. Subsequent improvements have been made to Inception [30, 31], to ResNet [34], and to the combination of the two [32]. The Inception and Residual structures are shown in Fig. 2.

Fig. 2 Inception and Residual structures

With the rapid development of CNNs, many studies have been carried out in fields such as video object segmentation [15, 17, 18], object tracking [14, 16], and so on.

Focusing on fog detection from images, Bin et al. proposed a CNN-RNN based multi-label classification method [38]. A convolutional neural network (CNN) extended with a channel-wise attention model first extracts the most correlated visual features, and a recurrent neural network (RNN) then processes the features and captures the dependencies among weather classes. For weather condition recognition, and based on the importance of regional cues, a deep learning framework named the region selection and concurrency model (RSCM) was presented to discover regional properties and concurrency [13]. For traffic management and control applications, Yu proposed a Global Similarity Local-Salience Network (GSLSNet) for traffic weather recognition, together with a strategy that restricts the network to focus on road weather details [37].

The above works achieve better detection results than non-deep-learning methods, but they were designed for a variety of weather conditions and are not specifically tailored to fog detection, so there is still room to improve fog detection accuracy. Considering the requirements of low time consumption and high effectiveness for a fog detection algorithm, we design a shallow neural network that takes into account the global, local, and multi-scale features of the image, so as to achieve fast and high-precision fog detection.

3 Method

3.1 Overall framework of agglomerate fog-containing image detection algorithm

The detection of agglomerate fog-containing images can be viewed as classifying input images into fog-containing and fog-free categories. Fog rarely covers all areas of an image, and fog-free areas of the scene may affect the judgment of the whole image. Hence, a strategy of dividing the image into blocks is adopted to determine whether the whole image contains fog. The framework of the entire algorithm is shown in Fig. 3.

Fig. 3 Comprehensive determination based on sub-area-based learning

To divide an image, for the sake of simplicity, this paper adopts the equal splitting method shown in Fig. 4. The proposed convolutional neural network determines whether each sub-area contains fog. Finally, the identification results of all sub-areas are combined by voting, and the voting result determines whether the image belongs to the fog-containing or fog-free category.

Fig. 4 Diagram of dividing an image into sub-areas

The number of sub-images an image is divided into depends on the size of the field of view: the larger the field of view, the more sub-images are needed, and vice versa. For traffic monitoring scenes, the field of view is generally not very large, so dividing the image into nine sub-images is sufficient; a test of this choice is given in the experimental section.
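As an illustration, the following is a minimal Python sketch of the block-and-vote strategy described above. The `classify` function is a hypothetical stand-in for the per-block prediction of the proposed network, not part of the paper's implementation:

```python
import numpy as np

def split_into_blocks(image, grid=3):
    # Evenly split an H x W (x C) image into grid*grid sub-areas.
    h, w = image.shape[0], image.shape[1]
    bh, bw = h // grid, w // grid
    return [image[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            for r in range(grid) for c in range(grid)]

def detect_fog(image, classify, grid=3):
    # `classify` returns True/1 if a sub-area is judged fog-containing.
    votes = [classify(block) for block in split_into_blocks(image, grid)]
    return sum(votes) > len(votes) / 2  # majority vote over sub-areas
```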

3.2 Network structure of agglomerate fog detection

The networks mentioned above, such as VGGNet, InceptionNet, and ResNet, are mostly applied to classification over thousands of categories, so very deep networks are needed to accurately extract the features of different objects. Classifying images as fog-containing or fog-free is a relatively simple task that does not require a very deep network; in fact, using a deep network for this task may lead to over-fitting. Based on this understanding, this paper proposes the shallow fog-detection network shown in Fig. 5.

Fig. 5 Network structure of fog detection

As shown in Fig. 5, the proposed network consists of the following parts:

  1. Convolution layer: the convolution kernel size is 7*7, the number of convolution kernels is 96, and the step size is 1.
  2. Maximum pooling layer: the pooling range is 2*2, and the step size is 2.
  3. Convolution layer: the convolution kernel size is 5*5, the number of convolution kernels is 256, and the step size is 1.
  4. Maximum pooling layer: the pooling range is 2*2, and the step size is 2.
  5. Convolution layer: the convolution kernel size is 3*3, the number of convolution kernels is 384, and the step size is 1.
  6. Maximum pooling layer: the pooling range is 2*2, and the step size is 2.
  7. Fully connected layer: the number of neurons is 512.
  8. Fully connected layer: the number of neurons is 512.
  9. Softmax layer: the number of categories is 2.

The convolution kernel size decreases gradually from 7*7 to 3*3, which lets the shallow layers extract information over a large field of view and the deeper layers extract finer information. The first six convolution and pooling layers are used for feature extraction, and the fully connected layers and the Softmax layer can be regarded as the classifier. The multi-scale features extracted by the first six layers are sent to this classifier to detect foggy images. The network designed here has few layers while still accounting for features at different scales, so it can detect fog images quickly and accurately; this is precisely the advantage of the proposed method.
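As a concrete illustration, here is a minimal tf.keras sketch of the layers listed above. The input size and "same" padding are assumptions (the paper scales full images to 256*256 and splits them into a 3*3 grid, giving roughly 85*85 sub-images), the activations follow Sect. 3.4, and the dropout placement follows Sect. 4.3; this is a sketch, not the authors' exact implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_fog_net(input_shape=(85, 85, 3)):
    # Shallow fog-detection network following the nine layers listed above.
    return models.Sequential([
        layers.Conv2D(96, 7, strides=1, padding="same", activation="relu",
                      input_shape=input_shape),   # 7*7 kernels, 96 filters
        layers.MaxPooling2D(2, strides=2),        # 2*2 pooling, stride 2
        layers.Conv2D(256, 5, strides=1, padding="same", activation="relu"),
        layers.MaxPooling2D(2, strides=2),
        layers.Conv2D(384, 3, strides=1, padding="same", activation="relu"),
        layers.MaxPooling2D(2, strides=2),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.5),                      # dropout rate from Sect. 4.3
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(2, activation="softmax"),    # fog-containing / fog-free
    ])
```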

3.3 Feature extraction

The network performs feature extraction through convolution and pooling operations. The convolution operation learns different patterns, and the pooling operation performs down-sampling and extracts salient responses.

The convolution operation traverses the image with the filter kernel in a sliding window, applying the kernel to the neighboring image block around each point. Taking a two-dimensional image I as input, for a two-dimensional convolution kernel K, the convolution result is given in Eq. (1).

$$S(i,j) = (I*K)(i,j) = \sum\limits_{m} {\sum\limits_{n} {I(m,n)K(i - m,j - n)} }$$
(1)

Since the convolution is commutative, Eq. (1) can be rewritten as Eq. (2).

$$S(i,j) = (K*I)(i,j) = \sum\limits_{m} {\sum\limits_{n} {I(i - m,j - n)K(m,n)} }$$
(2)

The cross-correlation operation (denoted here by \(\star\)) is almost the same as the convolution operation, but the kernel is not flipped, as shown in Eq. (3).

$$S(i,j) = (I \star K)(i,j) = \sum\limits_{m} {\sum\limits_{n} {I(i + m,j + n)K(m,n)} }$$
(3)

It should be noted that in many machine learning libraries, the convolution operation is implemented as cross-correlation, which is nevertheless called convolution. The convolution layers mentioned in the previous section are all cross-correlation operations. The reason is that the two operations achieve equivalent results: even if true convolution is used, the trained weights can be flipped to obtain the same weights as cross-correlation. Using cross-correlation simply eliminates the kernel flip.
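The following minimal NumPy sketch makes the distinction concrete: cross-correlation slides the kernel as-is (Eq. (3)), while true convolution first flips the kernel (Eq. (2)):

```python
import numpy as np

def cross_correlate2d(image, kernel):
    # Valid-mode cross-correlation: slide the kernel without flipping it.
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d(image, kernel):
    # Flipping the kernel turns cross-correlation into true convolution.
    return cross_correlate2d(image, kernel[::-1, ::-1])
```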

There are many types of pooling, including average pooling and maximum pooling, which correspond to mean filtering and maximum filtering, respectively. The maximum pooling operation takes the largest response in a neighborhood and can also down-sample by setting the step size: if the step size is two, the width and height are sampled to half of the original. When the input is translated by a small amount, pooling keeps the representation approximately unchanged.
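For clarity, a minimal sketch of 2*2 maximum pooling with stride 2, as used in the proposed network (an illustrative helper, not the paper's code):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    # 2*2 maximum pooling with stride 2 halves the width and height.
    h, w = x.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out
```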

3.4 Activation function

In order to make a deep neural network nonlinear, a nonlinear activation function needs to be applied to the output of each neuron. Without a nonlinear activation function (only linear combinations), a deep network is equivalent to a single-layer perceptron and has no deep characteristics.

The study of activation functions is an active field. Sigmoid and Tanh were the earliest activation functions; later, ReLU alleviated the vanishing-gradient problem in deep learning. Recently, various activation functions have been developed, such as Leaky-ReLU [19], SELU [9], ELU [3], and Swish [26]. ReLU is an excellent default choice, and it is also the activation function of the network proposed here. Several common activation functions and their derivatives are shown in Fig. 6.
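For reference, the three classical activation functions discussed here are defined as follows:

$$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}},\qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}},\qquad \mathrm{ReLU}(x) = \max(0, x)$$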

Fig. 6 Several common activation functions (blue represents the original function, and orange represents the corresponding derivative)

Since both Sigmoid and Tanh involve exponential calculations, they are slow to compute. Additionally, their gradients are less than one, so the gradient tends to vanish as it propagates through layers. ReLU is a piecewise linear function: each segment is linear, but the overall function is nonlinear, and both forward propagation and backward propagation are fast. Its gradient equals one wherever the input is greater than zero, so it avoids problems such as vanishing or exploding gradients.

3.5 Dropout

Bagging is a technique for reducing generalization error by combining multiple models. The main idea is to train several different models separately and then have all models vote on the output for each sample.

In general, combining multiple models is more effective for the following reasons:

  1. Statistically, the hypothesis space of a learning task is large, and multiple hypotheses may perform equally well on the training set. Using a single model may result in poor generalization due to a wrong choice.
  2. A single model may fall into a local optimum, while an ensemble reduces this risk.
  3. The true hypothesis may not be contained in the hypothesis space of the learning algorithm; ensembling expands the hypothesis space and can achieve a better approximation.

As a bagging-like technique, dropout [7] randomly deactivates some neurons during training: the deactivated neurons take no part in the computation, and a different random subset is dropped at each step. As shown in Fig. 7, after the x1 and x2 neurons in hidden layer 1 are dropped, hidden layer 2 receives no input from these two neurons, and the next step may drop other neurons.

Fig. 7 Dropout schematic

Dropout can be thought of as training multiple smaller, simpler models; prediction is then equivalent to combining these simple models.
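As an illustration, here is a minimal NumPy sketch of the commonly used "inverted dropout" variant (an assumption on our part; frameworks such as TensorFlow implement dropout in this form), which rescales at training time so that inference needs no adjustment:

```python
import numpy as np

def dropout(x, rate=0.5, training=True, rng=np.random.default_rng(0)):
    # During training, each activation is kept with probability (1 - rate)
    # and rescaled so its expected value is unchanged at test time.
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)
```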

3.6 Loss function

The Softmax layer maps the network output for each class to a probability between zero and one, with the outputs over all classes summing to one. The expression is as follows:

$$\sigma (z)_{j} = \frac{{e^{{z_{j} }} }}{{\sum\nolimits_{{k = 1}}^{K} {e^{{z_{k} }} } }}$$
(4)

In classification, the last layer generally uses Softmax to obtain the probability of each class, and the loss function is the cross entropy, which reflects the similarity of two distributions. For two distributions \(p\) and \(q\), the cross entropy is

$$H(p,q) = - \sum {p\ln q}$$
(5)
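To make Eqs. (4) and (5) concrete, a minimal NumPy sketch for the two-category (fog / fog-free) case follows; the example logits are illustrative only:

```python
import numpy as np

def softmax(z):
    # Eq. (4); subtracting the maximum improves numerical stability.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    # Eq. (5); p: true distribution (e.g., one-hot label),
    # q: predicted probabilities.
    return -np.sum(p * np.log(q + eps))

probs = softmax(np.array([2.0, -1.0]))            # hypothetical scores
loss = cross_entropy(np.array([1.0, 0.0]), probs)  # label: fog-containing
```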

4 Results and discussion

4.1 Experimental environment

The experiments were run on the Ubuntu 16.04 operating system with an Intel Core i7-3770 CPU and 12 GB of memory. The graphics card is an NVIDIA GeForce GTX 1080 with 8 GB of memory. The experiments are based on the open-source deep learning framework TensorFlow, version 1.5, with CUDA 9.0 and cuDNN 7.0.

4.2 Experimental data set

To use neural networks for fog identification, a large number of fog-containing and fog-free images must be collected; if the data set is too small, over-fitting is likely to occur. Therefore, to meet the training requirements, a large number of natural scene images were collected from ImageNet. Some samples are shown in Fig. 8. ImageNet is an image database organized according to the WordNet hierarchy (currently only nouns), in which each node of the hierarchy is represented by hundreds and thousands of images, currently averaging more than 500 images per node.

Fig. 8 Examples of natural scene images

In order to simulate a fog-containing image, an imaging model of agglomerate fog is needed. The information acquired by the sensor includes two components: the scene radiance transmitted along the original optical path after attenuation by fog particles, and the influence of atmospheric light, as shown in Eq. (6) [6]:

$$I(x) = J(x)t(x) + A(1 - t(x))$$
(6)

where \(I\) represents a fog-containing image, \(J\) is the information reflected by objects, i.e., a fog-free image, \(A\) is the atmospheric light, and \(t\) represents the attenuation effect of fog, i.e., the transmittance.

Using Eq. (6), a fog-containing image can be generated from a fog-free image. Assume that \(I\) and \(J\) are in the range \([0,1]\), and let the atmospheric light be \(A = 1.0\) when generating data. A random global transmittance \(t \in [0,thres]\) is drawn, where \(thres\) is the upper limit of \(t\). Figure 9 shows the corresponding fog-containing image when \(thres = 0.7\).
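A minimal sketch of this data generation step, directly applying Eq. (6) with the parameter values stated above:

```python
import numpy as np

def add_fog(J, thres=0.7, A=1.0, rng=np.random.default_rng(0)):
    # J: fog-free image with values scaled to [0, 1].
    # Draw a random global transmittance t in [0, thres], then apply
    # Eq. (6): I = J * t + A * (1 - t).
    t = rng.uniform(0.0, thres)
    return J * t + A * (1.0 - t)
```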

Fig. 9 Generated fog-containing image

Two sets of test data were collected. The first data set (hereinafter, test data set 1) was extracted from the above data and has a consistent distribution with it; the second data set (hereinafter, test data set 2) is the data set from [33]. Test data set 2, obtained from actual outdoor scenes, contains 263 fog-containing images and 289 fog-free images. Some fog-containing images from test data set 2 are shown in Fig. 10.

Fig. 10 Examples of the test data

4.3 Training methods

A corresponding fog-containing image was generated for each input fog-free image. The batch size was 64, consisting of 32 fog-free images and their 32 corresponding fog-containing images. All images were scaled to 256*256.

We used the Adam optimizer [8] with a learning rate of 0.0001, trained for 50 epochs. Adam (adaptive moment estimation) is a learning-rate-adaptive algorithm: it directly incorporates an exponentially weighted estimate of the first-order moment of the gradient (momentum) and includes bias correction for both the first-order moment (momentum term) and the (uncentered) second-order moment estimates, which are initialized at the origin. Adam is robust to hyperparameters, and usually only the learning rate needs adjustment.

A dropout layer was added after each of the last two fully connected layers, with the dropout rate set to 0.5.
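Assembled as a sketch, the training setup of this section might look as follows; this uses the modern tf.keras API rather than the paper's TensorFlow 1.5 code, and `build_fog_net` is the hypothetical constructor from the sketch in Sect. 3.2:

```python
import tensorflow as tf

model = build_fog_net()  # hypothetical constructor from Sect. 3.2 sketch
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, batch_size=64, epochs=50)
```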

4.4 Results and discussion

To assess the performance of the proposed method, the metrics precision (P) and recall (R) are employed. They are defined as follows:

$$P = TP/(TP + FP)$$
$$R = TP/(TP + FN)$$
  • TP (true positive): labeled as positive and predicted as positive.
  • TN (true negative): labeled as negative and predicted as negative.
  • FP (false positive): labeled as negative and predicted as positive.
  • FN (false negative): labeled as positive and predicted as negative.
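For clarity, a minimal sketch computing these two metrics from binary labels and predictions (a hypothetical helper, not from the paper):

```python
def precision_recall(y_true, y_pred):
    # y_true / y_pred: iterables of 0 (fog-free) and 1 (fog-containing).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)
```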

Experiments were performed with the parameter settings in Sect. 4.3. The accuracies on the training set and the validation set are shown in Fig. 11. The curves fluctuate slightly, which is normal because a single batch of data was used for each evaluation.

Fig. 11 Test results (the abscissa is the number of training steps, and the ordinate is the accuracy)

To test the number of blocks, we divide the image evenly into 4, 9, and 16 sub-images and use data sets 1 and 2 to test the overall detection accuracy on foggy images. The statistical results are shown in Table 1.

Table 1 Detection results with different blocks

As shown in Table 1, detection performance is lowest when the image is divided into 4 blocks. For data set 1, detection is best with 9 blocks; for data set 2, detection is best with 16 blocks, but the difference from 9 blocks is slight. Therefore, the number of sub-images is set to 9 in this paper.

To verify the effectiveness of image blocking, we added 40 additional test images in which only part of each image is covered by fog; some of them are shown in Fig. 12. Two strategies, blocking (9 blocks) and no blocking, were tested. The experimental results are shown in Table 2.

Fig. 12 Some test images for blocking strategy

Table 2 Detection results with and without blocking

In Table 2, the number "1" means that the entire image is used as input. As can be seen from Table 2, the blocking strategy significantly improves both precision and recall compared with no blocking, which verifies the effectiveness of the image blocking proposed in this paper.

Using dropout should allow the model to avoid overfitting and achieve better performance. A performance comparison is shown in Table 3 (tested on data set 1). As Table 3 shows, the dropout operation did achieve higher accuracy.

Table 3 Comparison of test results with and without dropout (data 1)

For simple tasks, a simple network often obtains higher accuracy; a complex network may over-fit, and gradients may fail to propagate back. To test this, the network was deepened, and the network structure comparison is shown in Table 4. The experimental parameters of the two networks were the same. Net1 achieved a precision of 97.6% and a recall of 94.4%, as mentioned earlier, while Net2 achieved only a precision of 96.1% and a recall of 93.8%. These results indicate that very deep networks are not well suited to this two-category classification task.

Table 4 Comparison of network structures

For the real-scene data (test data set 2), the test results are shown in Table 5. Because the training data and the test data are not from the same distribution, the results are lower than those on test data set 1. However, for both fog-containing and fog-free images, the precision was greater than 90%.

Table 5 Test results of test data set 2

To further verify the effectiveness of the proposed method, several representative algorithms [13, 37, 38] mentioned in the introduction were selected for comparison. The performance of the different methods on test data sets 1 and 2 is shown in Table 6.

Table 6 Comparison of different methods

The methods in [13, 37, 38] output many weather types, including fog; for comparison, only the detection accuracy on fog images is given here. From Table 6, all four methods achieve high detection accuracy, but for both data set 1 and data set 2 the proposed method has an obvious advantage over the others. At the same time, the performance on data set 1 is higher than on data set 2 for all methods, which shows that real data are more complex than simulated data.

5 Conclusion

The image-based detection of fog hazards uses roadside video surveillance to realize prediction and early warning. With its fast and accurate emergency response, the algorithm can play a role in disaster prevention and mitigation and provide safety services for vehicle travel, which has important practical value. This paper described an agglomerate fog detection method based on a convolutional neural network, including an analysis of each network layer and the selection and analysis of the activation function and the loss function. After an image is divided into sub-areas (blocks), the proposed network performs fog detection on each block, and the results of all sub-areas are combined to obtain the final result. Tests on simulated and real data show that the proposed method achieves high detection accuracy.