1 Introduction

Early detection of fire or smoke is critical for timely intervention and for avoiding large-scale damage. Several methods and tools exist for identifying fire or smoke in confined spaces. In aircraft cargo holds, for example, traditional fire and smoke detection relies mostly on sensor-based tools, whose main disadvantages are a limited detection range and a high false alarm rate, according to the U.S. Federal Aviation Administration (FAA) Technology Center’s report [1]. The basic framework of the traditional method is shown in Fig. 1. In addition, such sensors cannot provide sufficient information about the smoke and fire. When a fire occurs, infrastructure or materials usually block people’s view, while the initial smoke appears in the surveillance cameras. Detecting smoke from video surveillance can therefore provide early warning of fire hazards.

In the early detection stage of smoke, many features are extracted from the motion areas of images. Current video-based fire smoke detection algorithms are usually based on one or more characteristics of smoke and make decisions through direct or indirect methods. Toreyin et al. [2] mainly use motion and edge features for smoke detection; because the background of the whole scene must be analyzed, the scope of application of this algorithm is limited. Chenebert et al. [3] proposed a non-temporal image fire detection method, whose advantage is that it does not require any temporal information. Fujiwara et al. [4] proposed a smoke extraction technique that uses the concept of fractal coding in the image area; this method is not suitable for low-contrast or blurred smoke images. Yuan [5] proposed a video smoke detection method that uses a cumulative motion model and a block-based algorithm to improve detection efficiency. Toreyin [6] proposed a one-dimensional and two-dimensional wavelet transform method based on the typical characteristics of smoke, with a support vector machine as the smoke classifier. This technique can only distinguish between smoke and non-smoke and cannot locate the smoke. Tung et al. [7] proposed performing smoke detection by segmenting and classifying the motion area. Xu [8] proposed a BP neural network training method based on the static and dynamic characteristics of smoke, which gives the detection system better accuracy and anti-interference ability. Frizzi et al. [9] proposed a fire video smoke detection model based on a convolutional neural network, in which the model is trained directly on the original frames instead of on separately extracted features. Wang HB [10] improved the photoelectric smoke detection algorithm and used a dual-wavelength mechanism to reduce the false alarm rate of fire detection in confined spaces. Cheoi [11] also used convolutional neural networks to process fire images and extract feature data of suspected fire areas. Maksymiv [12] first used AdaBoost and the LBP algorithm to make a preliminary judgment on the image and then carried out feature extraction, which improved the recognition accuracy. Feiniu Yuan [13] used convolutional neural networks to classify smoke images and compared them with other neural networks. Chang J et al. [14] proposed a new type of infrared detection system based on the inclined porthole and the elliptical bow under the nose of the aircraft. This system meets the requirements of fire warning detection, but it cannot detect smoke and is not suitable for aircraft cargo holds. Hackner A et al. [15] realized fire detection based on flame image information and smoke gas concentration, which is more effective than differential image (DI) mode fire warning technology. Emily J et al. [16] simulated indoor building fires and used image processing and smoke characteristic detection techniques to determine whether anything was burning in the target environment.

In summary, the accuracy of fire detection in confined spaces depends on both the detection algorithm and the characteristic parameters it uses. However, the above algorithms do not address the problem that video signals are weak and difficult to process when visible light is insufficient. In confined spaces such as aircraft cargo holds, warehouses and hangars, the changes in smoke color, texture, height and other characteristics are more complicated, and lighting may be insufficient or absent, which makes traditional video smoke detection more difficult to use. Therefore, we consider using infrared frames to detect possible fires in such spaces. To detect smoke in infrared images, the key problem the detection algorithm must solve is the extraction and recognition of effective smoke feature parameters.

Fig. 1 Traditional detection methods

2 Source of the Data

This section describes in detail how the data source was constructed and compares the video and infrared sources.

2.1 Image Data Collection

Image recognition tasks using deep learning methods usually involve a large number of parameters and therefore need a large amount of training data. As far as we know, there is no publicly available infrared data set containing fire smoke information. Therefore, we created a data set of 2000 high-resolution infrared images of indoor confined spaces, called Train2. Figure 2 shows some sample images from Train2. Train2 includes infrared images of the confined space of a fire detection laboratory warehouse and of a simulated aircraft cargo hold. So far, the data set covers infrared smoke frames in different scenes, including confined space fires and moving objects. The images were captured with a professional uncooled infrared camera.

Fig. 2 Example images from the Train2 image dataset

2.2 Visual and Infrared Source Comparison

Since the infrared image and the visible image of an object are similar to some extent in shape, size, edges and motion characteristics, we used an infrared camera and a visible-light camera to shoot the same smoke scene for comparison, as shown in Fig. 3. We also selected fire smoke images of indoor and outdoor spaces from online databases as comparative experimental data.

Fig. 3 Comparison of infrared and visual video source

Figure 3 compares the same smoke scene captured by an infrared camera and by a video camera. The trucks in the infrared image and in the visual image are very similar in shape, size and structural features, and these features are handled consistently in information processing. However, the smoke characteristics in the infrared image are not obvious, and it is difficult to obtain smoke characteristic information directly from a single infrared image. We found that when the picture changes continuously, the position of the smoke can be identified from its movement characteristics, so motion characteristics can be learned effectively from consecutive frames.

3 Deep Convolution Dual-Network for Smoke Detection

Although deep learning methods, especially convolutional neural networks (CNNs), have achieved good results in visual recognition problems, few studies have applied them to smoke recognition in infrared images. To detect smoke in infrared images, this paper proposes a virtual fire detection system built on a dual convolutional network with motion and texture feature extraction mechanisms, and verifies it in a confined space environment. We found that, unlike in visible-light video, smoke and flame have very similar characteristics in infrared images, such as their movement and diffusion characteristics. In addition, both flame and smoke generate a lot of heat, although the heat of the flame is clearly higher than that of the smoke. Because of these characteristics, smoke in infrared images is easily filtered out as noise by the recognition algorithm, and flame and smoke are difficult to tell apart. In this paper, the proposed dual deep CNN model is applied to the classification of fire and smoke images in confined spaces to improve detection efficiency. The method classifies smoke and fire images at the same time and offers advantages over existing CNN models based on visible-light images for smoke and fire identification in confined spaces. In our model, texture and motion features are learned from source infrared frames containing smoke by a dual network consisting of a texture network and a motion network. The texture network, CNN1, has 8 layers, and its depth differs from that of the original VGG network [17]. The motion network, CNN2, has a different structure from CNN1: three convolutional layers, pooling for motion extraction, and a fully connected layer for classification [18].

3.1 CNN for Texture Features

The dual CNN proposed in this paper consists of a motion network and a texture network, with the spatial representation of the fire learned from source video frames serving as an auxiliary. Both networks are CNNs, that is, multilayer neural networks composed of convolutional layers and pooling layers. The basic structure of the texture network of the proposed smoke detection algorithm is shown in Fig. 4.

Fig. 4 Architecture of the CNN to detect smoke texture

As depicted in [19], the texture detection network is composed of multiple layers with different functions, which are used for deep learning of smoke features. The network contains five convolutional layers, marked convX, and two fully connected layers. The first two convolutional layers use 5 × 5 filters with 32 feature maps. In the third convolutional layer the number of filters is doubled, so the number of feature maps is multiplied by 2 and 4 in the following convolutional layers. The output layer has 2 nodes and is fully connected to the 2176 neurons of the last fully connected layer. From the point of view of the classifier in the output layer, the result of the fully connected layer can be regarded as the high-level feature information of the smoke image, extracted layer by layer by the preceding convolution and down-sampling layers and mapped to a 2176-dimensional feature vector. A binary classification is then performed on this feature vector.

To reduce the amount of calculation and speed up the training of the entire network, this paper adopts the rectified linear unit (ReLU) function [20] as the activation function of the smoke detection deep convolutional neural network model, as shown in formula 1. This function is well suited to deep convolutional network models trained on large volumes of data.

$$\mathrm{ReLU}(a)=\max(0,a)$$
(1)

When the parameter a is positive, the function returns a unchanged, with a gradient of 1; otherwise it returns 0. The output of one neuron in the convolution layer can thus be expressed by formula 2, where M denotes the depth of the filter, and v and n are the weight vector and the bias term, respectively.

$$a_{i}=\mathrm{ReLU}\left(\sum a_{i-1}v_{i}+n_{i}\right),\quad i=1,2,\ldots,M$$
(2)
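As a minimal sketch of formulas 1 and 2, the snippet below evaluates the ReLU activation and the output of a single neuron in plain Python. The function names and the numeric values are illustrative assumptions, not part of the original model code.

```python
def relu(a: float) -> float:
    """Formula 1: pass positive values through, clamp the rest to zero."""
    return max(0.0, a)


def neuron_output(inputs, weights, bias):
    """Formula 2 for one neuron: ReLU of the weighted input sum plus a bias.

    `inputs` plays the role of a_{i-1}, `weights` of v_i and `bias` of n_i.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return relu(weighted_sum + bias)


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    print(relu(-1.5))
    print(neuron_output([1.0, 2.0], [0.5, -0.25], 0.1))
```

Because ReLU needs only a comparison, it is cheaper to evaluate than sigmoid-type activations, which matches the motivation given for choosing it.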

3.2 CNN for Motion Features

The motion feature network has a structure similar to that of the previous network; the differences lie in the input data set and the pooling layers. The movement characteristics of smoke are closely related to the continuity of the smoke video, so the previous frame and the next frame must be compared and analyzed to obtain the motion area of the image. In most fires, the smoke produced in the initial stage of combustion is heated and floats upward, so the most obvious feature of smoke is upward movement. In the network initialization stage, early smoke video of the fire is selected and fed to the motion feature network in chronological order. Since visible-light smoke images and infrared smoke images have similar motion characteristics, video and infrared images are combined to improve processing efficiency and, at the same time, to compensate for the shortage of infrared video images. Figure 5 shows part of the smoke images from fire videos used for learning enhancement.

Fig. 5 Video smoke images for learning enhancement
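The comparison of the previous and the next frame described in this section can be illustrated with simple frame differencing. The sketch below is not the network itself: it only shows, on grayscale frames stored as nested lists, how a motion area could be obtained. The threshold value and the function names are assumptions made for illustration.

```python
def motion_mask(prev_frame, next_frame, threshold=15):
    """Mark pixels whose grey level changed by more than `threshold`."""
    if len(prev_frame) != len(next_frame):
        raise ValueError("frames must have the same height")
    mask = []
    for row_prev, row_next in zip(prev_frame, next_frame):
        if len(row_prev) != len(row_next):
            raise ValueError("frames must have the same width")
        mask.append([abs(b - a) > threshold for a, b in zip(row_prev, row_next)])
    return mask


def motion_bounding_box(mask):
    """Return (top, left, bottom, right) of the moving region, or None."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, moved in enumerate(row) if moved]
    if not rows:
        return None
    return min(rows), min(cols), max(rows), max(cols)


if __name__ == "__main__":
    # Two hypothetical 4 x 4 grayscale frames; a warm plume appears in the second.
    previous = [[10] * 4 for _ in range(4)]
    following = [row[:] for row in previous]
    following[1][1], following[1][2], following[0][2] = 100, 120, 90
    print(motion_bounding_box(motion_mask(previous, following)))
```

In practice the frames would be the grayscale infrared or video images, and the resulting region would only be a candidate area to pass on to the motion network.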

When the amount of smoke video data is large enough, we add a pooling layer to the second CNN network to avoid overfitting. Figure 6 shows the details of the CNN used for motion learning from video and infrared frames. It has 3 convolutional layers and 2 pooling layers, and each convolutional layer is followed by a nonlinear ReLU layer. The middle part consists of a max pooling layer followed by a convolutional layer, repeated twice, and the network ends with a fully connected layer. The stride in all middle layers is limited to 2. The original input frames of the network are infrared thermal images after grayscale conversion.

Fig. 6 CNN network for motion characteristic extraction
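To make the layer ordering of the motion network more concrete, the following sketch traces how the spatial size of a 238 × 238 input shrinks through a convolution followed by two max pooling plus convolution pairs. The text does not state the kernel sizes, the padding, or the stride of the first layer, so the values used here (3 × 3 convolutions, 2 × 2 pooling, no padding, first-layer stride 1) are purely hypothetical; only the ordering of the layers and the stride of 2 in the middle layers come from the description above.

```python
def output_size(size, kernel, stride):
    """Spatial size after an unpadded convolution or pooling layer."""
    return (size - kernel) // stride + 1


# (name, kernel, stride): kernel sizes and the conv1 stride are assumptions.
LAYERS = [
    ("conv1", 3, 1),
    ("pool1", 2, 2),
    ("conv2", 3, 2),
    ("pool2", 2, 2),
    ("conv3", 3, 2),
]


def trace_sizes(input_size=238, layers=LAYERS):
    """Return the feature-map side length after each layer."""
    sizes = []
    size = input_size
    for name, kernel, stride in layers:
        size = output_size(size, kernel, stride)
        sizes.append((name, size))
    return sizes


if __name__ == "__main__":
    for name, size in trace_sizes():
        print(f"{name}: {size} x {size}")
```

With other kernel sizes or padding the numbers change, but the same formula applies to each layer.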

3.3 Dual CNN Model for Smoke Detection

The block diagram of the dual CNN model for smoke detection is shown in Fig. 7. First, the video and infrared smoke images are tagged manually, and the motion network is trained separately for motion feature extraction. The texture network is trained on the infrared frames alone and outputs the texture features of the smoke. After that, the outputs of CNN1 and CNN2 are concatenated to form the joint features by different combination methods.

Fig. 7 Block diagram of the dual CNN model

When the output feature maps have the same resolution, the two can be combined through a simple superposition operation. To better preserve the two smoke features, we propose a superposition method to describe the joint features, as shown in formula 3, where α and β denote the outputs of the two networks and γ denotes the joint feature.

$$\begin{cases}\gamma_{m,n,2p}=\alpha_{m,n,p}\\ \gamma_{m,n,2p+1}=\beta_{m,n,p}\end{cases}\quad p=0,1,2,\ldots,N-1$$
(3)
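Formula 3 interleaves the two feature sets channel by channel. A minimal sketch for a single position (m, n), with the two feature vectors given as plain lists, could look as follows; the function name and the sample values are assumptions for illustration.

```python
def stack_features(alpha, beta):
    """Interleave two equally long feature vectors as in formula 3.

    Even positions of the result hold alpha, odd positions hold beta.
    """
    if len(alpha) != len(beta):
        raise ValueError("both feature vectors must have the same length N")
    gamma = [0.0] * (2 * len(alpha))
    for p in range(len(alpha)):
        gamma[2 * p] = alpha[p]
        gamma[2 * p + 1] = beta[p]
    return gamma


if __name__ == "__main__":
    # Hypothetical feature values for the two networks.
    print(stack_features([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

Interleaving keeps the p-th feature of each network next to its counterpart, so neither feature set is averaged away, which is consistent with the stated goal of preserving both smoke features.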

The stacking operation is applied after the fully connected layer to limit the complexity of the network structure. The effect of the concatenation is demonstrated in our experiments. Training amounts to finding the parameter θ that minimizes the loss function [21]. The loss function L(θ) is computed at the end of the network from the estimated outputs and the ground-truth labels.

$$L(\theta)=-\frac{1}{N}\sum_{i}\log\left[\mathrm{Softmax}(\alpha_{k})\right],\quad i=0,1,\ldots,N-1$$
(4)

where θ denotes the weight vector of the current network. The aim of training is to find the θ that minimizes L(θ). The minimization is carried out by stochastic gradient descent with backpropagation [22]. At the beginning of training, the network model is loaded and the model parameters are initialized. The momentum factor is set to 0.935, the weight decay coefficient to 0.0005, the initial learning rate to 0.01, the number of training epochs to 300, and the batch size to 32, according to the available CPU and video memory.
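The loss of formula 4 and one parameter update with the settings listed above can be sketched in plain Python. Two assumptions are made here: formula 4 is read as the mean negative log-probability that the softmax assigns to the true class of each sample, and the update uses one common form of momentum with weight decay. The exact update rule of the original implementation is not given in the text.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one sample's class scores."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_log_loss(batch_logits, labels):
    """Formula 4, read as the mean negative log-probability of the true class."""
    if not batch_logits or len(batch_logits) != len(labels):
        raise ValueError("need one label per sample and at least one sample")
    total = 0.0
    for logits, label in zip(batch_logits, labels):
        total -= math.log(softmax(logits)[label])
    return total / len(batch_logits)


def sgd_momentum_step(theta, grad, velocity,
                      lr=0.01, momentum=0.935, weight_decay=0.0005):
    """One update: v <- momentum * v - lr * (grad + weight_decay * theta)."""
    new_velocity = [momentum * v - lr * (g + weight_decay * t)
                    for t, g, v in zip(theta, grad, velocity)]
    new_theta = [t + v for t, v in zip(theta, new_velocity)]
    return new_theta, new_velocity


if __name__ == "__main__":
    # Hypothetical two-class scores (smoke / non-smoke) for two samples.
    print(mean_log_loss([[2.0, 0.5], [0.1, 1.2]], [0, 1]))
    print(sgd_momentum_step([1.0], [0.5], [0.0]))
```

The default arguments mirror the momentum factor, weight decay coefficient and initial learning rate stated in the text.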

4 Experiments

4.1 Experiment Data

The data set used for motion feature learning in CNN2 comes partly from Toreyin et al. [6]; the rest comes from our experiments and from online collections. The data set for CNN1 comes entirely from our own experiments. The input infrared and video frames are resized to 238 × 238. Considering that CNN1 and CNN2 learn different kinds of features, we set their learning rates to 0.01 and 0.02, respectively. Train1 contains 6000 images taken from the smoke videos on the website of the Key Laboratory of Fire Science, University of Science and Technology of China. The data are split into training and testing frames, named TrainX and TestX, for CNN1 and CNN2. The details of the data sets used in this paper are shown in Table 1.

Table 1 Experimental data sets

4.2 Evaluation Protocol

The algorithms in this paper are evaluated with standard image-level criteria. First, we use the true positive rate (TPR) and the true negative rate (TNR) to evaluate our method, defined as follows.

$$TPR=\frac{TP}{TP+FN}$$
(5)
$$TNR=\frac{TN}{TN+FP}$$
(6)

where TP is the number of images that contain smoke and are correctly identified, TN is the number of images that do not contain smoke and are correctly identified, FN is the number of images that contain smoke but are misidentified, and FP is the number of images that do not contain smoke but are misidentified.

The false alarm rate can be computed as follows.

$$\text{false alarm rate}=\frac{FP}{TN+FP}$$
(7)
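Equations (5) to (7) can be computed directly from the four image counts. The sketch below uses hypothetical counts; the function name is an assumption for illustration.

```python
def detection_rates(tp, fn, tn, fp):
    """Return TPR, TNR and false alarm rate as defined in Eqs. (5) to (7)."""
    positives = tp + fn  # images that really contain smoke
    negatives = tn + fp  # images that really contain no smoke
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one smoke and one non-smoke image")
    return {
        "TPR": tp / positives,
        "TNR": tn / negatives,
        "false_alarm_rate": fp / negatives,
    }


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    print(detection_rates(tp=90, fn=10, tn=80, fp=20))
```

Since TNR and the false alarm rate share the denominator TN + FP, the false alarm rate always equals 1 − TNR.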

In addition, as in [23], we use the receiver operating characteristic (ROC) curve to evaluate the performance of smoke detection. The ROC curves are obtained from the experiments on our data sets.
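For readers unfamiliar with ROC curves, the following sketch shows one way to build the curve from per-image smoke scores by sweeping the decision threshold, and to summarise it by the area under the curve. The scores and labels are hypothetical, and the helper names are assumptions; this is not the evaluation code used for the experiments.

```python
def roc_curve(scores, labels):
    """Return (false positive rate, true positive rate) points.

    `labels` holds 1 for smoke images and 0 for non-smoke images.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one smoke and one non-smoke image")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for index, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all samples sharing this score are counted.
        is_last = index == len(ranked) - 1
        if is_last or ranked[index + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Hypothetical classifier scores for four images.
    curve = roc_curve([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
    print(curve)
    print(area_under_curve(curve))
```

An area close to 1 means smoke images are almost always scored higher than non-smoke images, while 0.5 corresponds to random ranking.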

4.3 Results and Analysis

The dual CNN model is tested under two conditions: training with both video and infrared frames, and training with infrared frames only. The TPR and TNR are computed according to Eqs. (5) and (6), as shown in Fig. 8. The accuracy is higher when video frames are used as an aid for training.

Fig. 8 Comparison of TPR and TNR under different conditions

The relationship between the number of iterations and the loss value during training, and the ROC curves of the final classification output under these conditions, are shown in Fig. 9. The motion network is trained under two conditions: with infrared images only, and with both video and infrared images. We can see that training this network for motion characteristic extraction is useful. The detection results obtained by our algorithm are shown in Fig. 10. The proposed method clearly works well in smoke detection and performs well in various real scenes.

Fig. 9 Loss value and ROC curves under different conditions

Fig. 10 Detection results in real scenes with infrared frames

We use ablation experiments to verify the role of the characteristic parameters selected in this paper. Detection accuracy and detection speed serve as the performance indexes of the detection model: the accuracy indexes include precision, recall, the F1 curve, the PR curve and mAP (mean average precision); the speed indexes include frames per second (FPS) and model floating-point operations (FLOPs). A higher mAP and a larger FPS indicate stronger detection performance. The network model using smoke motion and texture features is trained and compared with the network model using motion, texture, diffusion and edge features. The results are shown in Table 2.

Table 2 Results of the ablation experiments

In Table 2, Ex1 represents the model without the edge feature, Ex2 the model without the diffusion feature, Ex3 the model without the texture feature and Ex4 the model without the motion feature. From Table 2, we can see that the three weight files give similar results in precision, recall and mAP_0.5, but differ considerably in mAP_0.5:0.95. Here mAP_0.5:0.95 means calculating the mAP under 10 IoU thresholds and then averaging the values; this index better reflects the accuracy of the model. Through comparative analysis, we find that Ex3 and Ex4 perform better, because the motion and texture features of infrared images are easier to capture than edge and diffusion features, and better indicate whether a fire has occurred.
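The averaging behind mAP_0.5:0.95 can be sketched as follows. The snippet computes the intersection over union (IoU) of two boxes and averages per-threshold mAP values over ten thresholds. The step of 0.05 between thresholds follows the usual convention and is an assumption here, as are the function names and the sample values.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def iou_thresholds():
    """Ten thresholds from 0.50 to 0.95 (assumed step of 0.05)."""
    return [round(0.5 + 0.05 * step, 2) for step in range(10)]


def map_50_95(map_per_threshold):
    """Average the mAP values measured at each of the ten IoU thresholds."""
    thresholds = iou_thresholds()
    missing = [t for t in thresholds if t not in map_per_threshold]
    if missing:
        raise ValueError(f"missing mAP values for thresholds: {missing}")
    return sum(map_per_threshold[t] for t in thresholds) / len(thresholds)


if __name__ == "__main__":
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
    # Hypothetical per-threshold mAP values that fall as the threshold rises.
    values = {t: 0.9 - 0.05 * k for k, t in enumerate(iou_thresholds())}
    print(map_50_95(values))
```

Because stricter thresholds demand tighter localisation, two models with similar mAP_0.5 can still differ noticeably in mAP_0.5:0.95.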

We compare the proposed algorithm with other advanced deep learning algorithms on two performance parameters of fire image detection, recall and precision, as shown in Fig. 11. As can be seen from the figure, the FCN-8s algorithm has the highest accuracy and is better than the other three algorithms. The performance of the proposed algorithm is second only to FCN-8s and better than GoogLeNet and VGG-16. However, the computation of FCN-8s is much higher than that of the proposed algorithm, and its training time is the longest. The proposed algorithm gradually converges after the number of iterations exceeds 8000, while FCN-8s needs more than 10,000. Taking convergence speed and accuracy together, the comprehensive performance of the proposed algorithm is the best. Its computation is larger than that of VGG-16 and lower than that of GoogLeNet. This is because the proposed algorithm has two convolutional network channels, each trained with different feature maps, which improves calculation speed and reduces network complexity. At the same time, the dual-channel structure separates the training of video images and infrared images, avoiding mutual interference of their features and thereby helping to ensure the accuracy and reliability of the network.

Fig. 11 Performance comparison

5 Conclusion

Smoke varies greatly in many features and is vulnerable to interference, so accurately identifying smoke at an early stage in real scenes remains a challenging task. In this paper, a dual convolutional neural network model is proposed and successfully applied to fire smoke detection with infrared frames. Some video frames were added to expand the data sets. Learning is conducted with both visual and infrared frames, which overcomes the limitation of infrared frames for motion feature extraction of smoke and improves the timeliness of detection. Experimental results show that the dual CNN is effective for smoke detection where data is scarce while maintaining classification accuracy. Our algorithm has good application prospects where video detection is limited, such as fire detection in dark, closed environments. In future research, the model can be further adjusted and improved according to the specific application environment.