Fire and smoke precise detection method based on the attention mechanism and anchor-free mechanism

Fires cause substantial environmental damage and economic losses, which calls for automatic fire-smoke detection and identification. Vision-based fire-smoke detection methods still struggle to balance model complexity and accuracy. We propose an improved YOLOv3 fire-smoke detection and identification method to address these problems and introduce a new fire and smoke dataset. The method (1) adds an attention mechanism to the neck module to enhance feature extraction from images, (2) replaces the anchor-box mechanism with an anchor-free mechanism to handle the large variance of smoke texture, shape, and color in real applications, and (3) uses a lightweight backbone to reduce model complexity. The proposed dataset follows the VOC format and contains images of complex, highly diverse scenes, including pictures that (1) combine fire with smoke, (2) contain only smoke or only fire objects, and (3) contain a single cloud object. The experimental results demonstrate that the method achieves 50.8 AP, outperforming the suboptimal method by 3.8. Moreover, the inference speed of our method on the GPU is 13% faster than that of the suboptimal method.


Introduction
Approximately 250 thousand fires occur in China every year, resulting in considerable economic losses and environmental damage: approximately 2 thousand people suffer fire-related harm annually, and the resulting economic losses are approximately 40 billion. It is well known that the early stages of a fire produce smoke and flame. The flame is difficult to detect because early fires burn slowly, so smoke detection is more important for extinguishing fires. To prevent a fire from spreading and causing larger losses, rapid and accurate fire detection is crucial.
In the past few years, numerous studies have been conducted on contact sensors for fire-smoke detection, including smoke sensors, temperature sensors, and particle sensors. However, sensor-based fire-smoke detection has apparent disadvantages: it works well in indoor spaces but is unsuitable for larger open scenes. Vision-based methods are faster, more robust, and better suited to vast spaces than sensor-based methods. Therefore, vision-based methods have drawn increasing attention.
Traditional vision-based fire-smoke detection methods [5,14,32] usually rely on manual feature extraction from images [29]. However, designing an appropriate feature is very difficult, because (1) designing an effective feature requires specialized knowledge, (2) the features of flame and smoke are notably unstable, and (3) different inflammables produce different flames and smoke. For traditional vision-based methods, it is therefore difficult to obtain high accuracy and robustness.
In recent years, CNN-based methods, including image classification [13], object detection [11], and image enhancement, have achieved success in computer vision. To overcome the difficulty of extracting flame and smoke features, several CNN-based methods [22,23] have recently been proposed. Compared with traditional vision-based methods, CNN-based methods can automatically extract features from images. However, these methods still have problems in special scenes, as shown in Fig. 1, and their inference speed cannot satisfy the real-time requirement.
In this paper, we propose an outdoor fire-smoke detection method based on a channel attention mechanism and an anchor-free mechanism. In particular, to enhance the discriminating power for fire-like and smoke-like objects, we use the channel attention mechanism [35] to selectively emphasize the contributions of different feature maps. Then, we replace the anchor-box mechanism with an anchor-free mechanism to improve the model's generalization to the unstable shapes of flame and smoke. Moreover, the anchor-free design dramatically decreases the detection-head complexity.
The main contributions of this paper can be summarized as follows. (1) We propose an outdoor fire-smoke detection method based on an attention mechanism and an anchor-free mechanism to overcome the problem of similar-object detection, together with a lightweight backbone to reduce model complexity. The method clearly improves on the original method and has advantages over state-of-the-art methods in detection performance: it achieves 50.8 AP, outperforming the suboptimal method by 3.8; its parameter count is 48.8 M, 12.7 M fewer than the baseline; and, for a given detection performance level, its inference speed of 73.0 FPS is the best among these methods. The optimization strategies in this paper also have great reference value for similar detection tasks. (2) We establish a fire-smoke detection dataset of 16,714 images, all labeled using an annotation platform. The dataset comprises fire-smoke images in both normal and difficult scenes, as well as many fire-like and smoke-like images, such as the sun, lighting, clouds, and steam. The field of fire-smoke detection lacks an effective evaluation standard comparable to VOC and COCO in general object detection, which makes it difficult to compare methods; our dataset will contribute to vision-based fire-smoke detection research. https://drive.google.com/file/d/1kWjy1msWm3DiHMs_MGE1KFU0lMmQ0Tsl/view?usp=sharing.
The remainder of this paper is organized as follows. Section "Related works" reviews related work. Section "Proposed methods" describes the methodology of this study. Section "Experiment and discussion" analyzes the experimental results. Section "Conclusion" summarizes the paper and discusses future work.

Related works
Existing fire detection methods can be categorized into conventional and deep-learning-based methods. Traditional fire detection methods rely on handcrafted feature extraction, such as color, shape, texture, and motion features. The method in [5] utilized different color models to discriminate between fire and nonfire pixels. Celik et al. [4] proposed a generic fire detection model based on fire color in the YCbCr color space. A method using color information and wavelet transform coefficients to detect smoke in open space was proposed in [3]. To exploit shape information, [7] proposed an improved smoke detection method with RGB contrast images and shape constraints. Foggia et al. [9] presented an approach using the YUV color space instead of RGB and obtained better results.
However, using a single feature results in a high false alarm rate. To avoid this problem, [24] explored both static characteristics (color and shape) and dynamic characteristics (such as the disorder of smoke and fire motion). However, the false alarm rate remains high when fire-like objects appear in the images. Töreyin et al. [34] utilized hidden Markov models to discriminate between real fires and fire-like objects. A spatiotemporal wavelet transform was used to analyze the dynamic behavior of fire in [33]. Nevertheless, too many thresholds reduce the generalization of such models.
Heuristic rule-based methods still have high false alarm rates, so some researchers have used statistical learning methods to avoid the influence of personal experience. For example, Zhao et al. [42] proposed a method based on texture analysis and used an SVM to classify the texture. Moreover, [39] proposed a novel dynamic texture descriptor based on a surfacelet transform and a hidden Markov tree model, then used the texture to train an SVM classifier. In [15], an SVM was used to classify covariance-based features extracted from spatial-temporal blocks. Chen et al. [6] and Han et al. [12] investigated the Gaussian mixture model (GMM) to detect fire. Ref. [1] proposed local binary co-occurrence patterns (RGB-LBCoP), utilized fuzzy C-means (FCM) to extract features from smoke regions, and used a support vector machine (SVM) to classify these features. To distinguish fire from fire-like objects, [40] utilized a BP network based on the dynamic characteristics of fire.
The conventional methods rely on handcrafted features and require rich experience in this field. Moreover, these methods have difficulty maintaining a low false alarm rate. With the rapid development of deep learning, CNN-based methods have been explored for fire and smoke detection. For example, in [10] and [20], LeNet-5 was used to classify images into fire and nonfire; these methods achieved higher accuracy than traditional methods. Muhammad et al. [21] proposed an early fire detection method using a CNN structure, modifying the AlexNet classifier from 1000 categories to two. However, the model is comparatively heavy, which restricts its application. Thus, they introduced GoogLeNet [22], SqueezeNet [21], and MobileNet [23] to detect fire with lower complexity and better performance. Sharma et al. compared ResNet50 and VGG16 and found that ResNet50 performed better in fire detection; however, the dataset in that paper is too small (containing 651 images). Normal convolution structures perform well in normal scenes, but these methods cannot effectively distinguish between real fires and fire-like objects. To address this problem, [18] built a complicated large-scale dataset containing 25,000 fire images and 25,000 fire-like images and proposed a model named EFD-Net.
(Fig. 1 caption: Fire and smoke detection aims to determine whether fire and smoke are present; some special scenes make this more difficult. The top row shows fire-like objects, such as the sun, glare light, and red objects. The middle row shows smoke-like objects, such as clouds, dust, and fog. The bottom row shows fire scenes that produce heavy smoke.)
Moreover, some researchers have expanded smoke and fire detection into object detection. Object detection methods distinguish between fire and nonfire scenes and can locate the fire object and nonfire object in the image. In Ref. [37], Wu et al. used YOLO, R-CNN, and SSD for forest fire detection and built a forest fire benchmark. Reference [17] used YOLO for fire detection, and Ref. [28] used YOLO for flame detection.
However, these methods can only detect fire. Reference [16] first used the mean square error (MSE) to obtain candidate flame and smoke regions from camera frames; the flame and smoke areas were then extracted using Faster R-CNN. However, the method does not satisfy the real-time requirement. To solve this problem, Saponara et al. [27] used YOLOv2 [25] as a real-time fire and smoke detection pipeline; however, the dataset was too small, and the model failed to detect fire-like objects. Reference [38] built a dataset containing 10,581 images and presented a detection system based on ensemble learning, but the method cannot distinguish between smoke and smoke-like objects.

Proposed methods
YOLOv3 has good performance in general object detection [43] and is easily modified to meet the requirements of other application scenarios [36]. However, the original model cannot efficiently distinguish real objects from similar-looking objects, which leads to a high number of false alarms. To solve this problem, we use an attention mechanism to enhance the feature extraction capacity. In addition, the original method uses an anchor mechanism that fixes a set of anchor sizes before training, which harms the model's generalization, and the anchor mechanism requires a more complicated detection head than the anchor-free mechanism. Therefore, we switch from the anchor mechanism to an anchor-free mechanism.

Overview
As shown in Fig. 2, the fire and smoke detection network comprises a backbone network, a multi-scale feature extraction neck with an attention mechanism module, and a detection head. The input is a 416×416×3 RGB image. In the first step, we feed the image into the backbone to generate detailed features. To retain more information, the model divides the output into three branches, whose output feature shapes are 13 × 13 × 1024, 26 × 26 × 512, and 52 × 52 × 256. In the second step, the three features flow through three detection necks to generate contextual information, as shown in the neck part of Fig. 2; a feature pyramid structure fuses the output features at different scales, and at the start of every branch, a CBAM module selectively enhances the features. In the final step, the detection head, composed of convolutional layers, batch normalization layers, and ReLU activations, generates three feature maps containing the classification predictions, bounding box predictions, and confidence information. The spatial sizes of the three output feature maps equal the input size divided by 32, 16, and 8.

Attention mechanism
Detecting fire-like and smoke-like objects is a difficult task in fire and smoke detection. Since such objects share similar shape and color features with real fire and smoke, the extracted features are similar, which results in recognition errors between the two. To solve this problem, we utilize channel attention and spatial attention modules, as shown in Fig. 5, in the branches rather than in the backbone. First, channel attention enhances the channels that represent the targets and suppresses the other channels. Then, spatial attention emphasizes the position of the object. The network therefore focuses on the area where the target is present and obtains more detailed information, which helps distinguish similar objects.

Channel attention module
As shown in Fig. 3, both average pooling and max pooling are applied to generate two descriptors, F^c_avg and F^c_max, which represent the average-pooled and max-pooled features, respectively. Both descriptors are then fed into a shared network, a multi-layer perceptron (MLP) with one hidden layer, and the two outputs are summed to produce the channel attention map M_c ∈ R^{C×1×1}:

M_c(F) = σ(W_1 δ(W_0 F^c_avg) + W_1 δ(W_0 F^c_max)),    (1)

where σ denotes the sigmoid activation function, δ denotes the ReLU activation function, and W_0 and W_1 are the weight matrices of the shared MLP.
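The channel attention computation of Eq. (1) can be sketched in PyTorch as follows. This is a minimal sketch: the hidden-layer reduction ratio of 16 is an assumption following the common CBAM configuration, not a value stated in the text.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention sketch: M_c(F) = sigma(MLP(AvgPool(F)) + MLP(MaxPool(F)))."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP with one hidden layer (W_0, W_1 in Eq. (1)).
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),  # delta in Eq. (1)
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # F^c_avg descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))    # F^c_max descriptor
        scale = torch.sigmoid(avg + mx).view(b, c, 1, 1)  # M_c in R^{C x 1 x 1}
        return x * scale                     # reweight channels


m = ChannelAttention(32)
y = m(torch.randn(2, 32, 8, 8))  # output keeps the input shape
```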

Spatial attention module
As illustrated in Fig. 4, two pooling operations along the channel dimension generate two 2D maps, F^s_avg ∈ R^{1×H×W} and F^s_max ∈ R^{1×H×W}, which are avg-pooled and max-pooled across the channel, respectively. The two maps are concatenated and convolved by a convolution layer to produce a 2D spatial attention map:

M_s(F) = σ(f^{7×7}([F^s_avg; F^s_max])),    (2)

where σ denotes the sigmoid activation function, f^{7×7} denotes a convolution layer with filter size 7×7, and [·;·] denotes the concatenation operation. As illustrated in Fig. 5, the spatial attention module is connected behind the channel attention module. Let a feature map F ∈ R^{C×H×W} be the input; the channel attention module infers a 1D weight matrix M_c ∈ R^{C×1×1}, and the spatial attention module infers a 2D weight matrix M_s ∈ R^{1×H×W}. The overall computation can be expressed as

F' = M_c(F) ⊗ F,   F_o = M_s(F') ⊗ F',    (3)

where ⊗ denotes element-wise multiplication: the channel attention values are broadcast over the spatial dimensions, and the spatial attention values over the channel dimension. F_o is the final refined output.
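The spatial attention map described above can likewise be sketched in PyTorch; the 7×7 convolution over the concatenated avg- and max-pooled maps follows the description in the text. In the full CBAM, this module is applied after channel attention, as noted above.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention sketch: M_s(F) = sigma(f7x7([F^s_avg; F^s_max]))."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # 2 input channels (avg map + max map) -> 1 attention map.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)  # F^s_avg in R^{1 x H x W}
        mx = x.amax(dim=1, keepdim=True)   # F^s_max in R^{1 x H x W}
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale                   # reweight spatial positions


m = SpatialAttention()
y = m(torch.randn(2, 32, 8, 8))  # output keeps the input shape
```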

Anchor-free mechanism
The anchor mechanism has some problems in fire detection. First, to optimize detection performance, clustering analysis is used to determine a set of optimal anchors before training. However, the shapes of smoke and flame are highly variable, so the clustered anchors generalize poorly. Second, the anchor mechanism increases the complexity of the detection head, whereas fire detection requires high real-time performance. For these reasons, we use an anchor-free mechanism instead of the anchor mechanism.
We replace the anchor mechanism with an anchor-free mechanism to decrease the model complexity and improve the generalization ability. We reduce the number of predictions per grid cell from 3 to 1, as shown in Fig. 6. The prediction directly outputs four values: the predicted box's height and width and the offsets from the top-left corner of the grid cell. Then, we assign the prediction boxes whose centers lie inside the ground truth as positive samples.
(Fig. 6 caption: The yellow rectangle is a grid cell. With the anchor mechanism, every grid cell produces three prediction boxes; with the anchor-free mechanism, every grid cell produces one prediction box.)
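The per-cell prediction above can be illustrated with a small decoding sketch. An assumption here: the four raw values are taken directly as grid-unit offsets and sizes, whereas a real detection head would typically apply sigmoid/exponential transforms first.

```python
import torch

def decode_anchor_free(pred: torch.Tensor, stride: int) -> torch.Tensor:
    """Decode raw head outputs (one box per grid cell) into image-space boxes.

    pred: (H, W, 4) tensor of (dx, dy, w, h) per cell, where (dx, dy) are
    offsets from the cell's top-left corner and (w, h) are sizes in grid units.
    Returns an (H*W, 4) tensor of (cx, cy, w, h) boxes in pixels.
    """
    h, w, _ = pred.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    cx = (xs + pred[..., 0]) * stride   # center x in pixels
    cy = (ys + pred[..., 1]) * stride   # center y in pixels
    bw = pred[..., 2] * stride          # box width in pixels
    bh = pred[..., 3] * stride          # box height in pixels
    return torch.stack([cx, cy, bw, bh], dim=-1).view(-1, 4)


# A cell predicting offset (0.5, 0.5) at stride 32 yields a box centered
# in the middle of that cell.
pred = torch.zeros(2, 2, 4)
pred[..., :2] = 0.5
pred[..., 2:] = 1.0
boxes = decode_anchor_free(pred, stride=32)  # first box: (16, 16, 32, 32)
```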
To maintain the assignment rule of the baseline method, the above anchor-free strategy selects only one positive sample for each object, which ignores some high-quality predictions. However, high-quality predictions may also provide useful gradients and mitigate the extreme imbalance between positive and negative samples during training. Therefore, using the ground-truth center point as a reference, a 5 × 5 square is set up, and the prediction boxes whose centers fall inside the square are assigned as positive samples. As shown in Fig. 7, using a sun picture as an example, the sun is the ground truth and the blue rectangles are prediction boxes. A high-quality label assignment strategy is important for training models to achieve better performance; it is applied to the positive samples selected by the above rule. First, the strategy calculates the cost between each positive sample s_j and each ground truth gt_i:

cost_ij = L^cls_ij + λ L^reg_ij,

where λ is a weight, and L^cls_ij and L^reg_ij are the classification and regression losses, respectively, between ground truth gt_i and sample s_j. Second, for each gt_i, the top k_i samples with the lowest cost are chosen as its positive samples, where k_i is calculated dynamically as

k_i = ⌊Σ_j iou_ij⌋,

where iou_ij is the IoU between gt_i and sample s_j, summed over the candidate samples with the highest IoU. Finally, those samples are assigned as positive, and the remaining samples are negative.
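The cost-based dynamic-k selection can be sketched as follows. An assumption here: k_i is computed from the top q = 10 candidate IoUs, borrowed from the common SimOTA-style assignment, since the exact candidate count is not stated in the text.

```python
import torch

def dynamic_k_assign(cost: torch.Tensor, ious: torch.Tensor, q: int = 10) -> torch.Tensor:
    """Dynamic-k label assignment sketch.

    cost: (num_gt, num_samples), cost_ij = L_cls + lambda * L_reg.
    ious: (num_gt, num_samples), IoU between gt_i and sample s_j.
    For each gt, k_i = floor of the sum of its top-q IoUs (at least 1);
    the k_i lowest-cost samples become its positives.
    Returns a boolean (num_gt, num_samples) positive-sample mask.
    """
    num_gt, num_s = cost.shape
    mask = torch.zeros_like(cost, dtype=torch.bool)
    topq = ious.topk(min(q, num_s), dim=1).values
    ks = topq.sum(dim=1).int().clamp(min=1)       # dynamic k_i per ground truth
    for i in range(num_gt):
        idx = cost[i].topk(int(ks[i]), largest=False).indices
        mask[i, idx] = True
    return mask


# One gt, three candidates: IoUs sum to ~1.8 -> k = 1 -> the cheapest
# candidate (index 0) is the only positive.
cost = torch.tensor([[0.1, 0.5, 0.2]])
ious = torch.tensor([[0.9, 0.1, 0.8]])
mask = dynamic_k_assign(cost, ious)
```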

Lightweight backbone
Fire detection is a real-time visual task. However, the baseline backbone "DarkNet-53" is too heavy for real-time tasks. To further reduce the complexity, we propose a new backbone to replace the original one, built from a lightweight cross-stage module, as shown in Fig. 8. First, we remove the bottleneck structure; as shown in Table 1, the new structure has fewer parameters than the bottleneck structure. Then, a 1×1 convolution layer is used to reduce the duplication of gradient information.
The backbone structure contains 51 layers, as described in Table 2. We use 3 × 3 convolutions with a stride of 2 to increase the number of channels and decrease the spatial size h × w. To remain compatible with the original head structure, the backbone still outputs three feature maps with 256, 512, and 1024 channels.
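A cross-stage module of the kind described, without a bottleneck and with 1×1 convolutions for splitting and fusing the two paths, might look like the sketch below. This is an illustration under assumptions: the even channel split and the exact layer arrangement are not specified in the text (the real per-stage configuration is given in Table 2).

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in: int, c_out: int, k: int = 3, s: int = 1) -> nn.Sequential:
    """Conv + batch norm + ReLU, the basic unit used throughout the backbone."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class CrossStageModule(nn.Module):
    """Cross-stage block sketch without a bottleneck: the input is split into
    two half-width paths by 1x1 convolutions; one path passes through n plain
    3x3 blocks, the other acts as a shortcut; a final 1x1 convolution fuses
    them, reducing duplicated gradient information."""

    def __init__(self, channels: int, n: int = 1):
        super().__init__()
        half = channels // 2
        self.split_a = conv_bn_act(channels, half, k=1)
        self.split_b = conv_bn_act(channels, half, k=1)
        self.blocks = nn.Sequential(*[conv_bn_act(half, half, k=3) for _ in range(n)])
        self.fuse = conv_bn_act(channels, channels, k=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.blocks(self.split_a(x))           # processed path
        b = self.split_b(x)                        # shortcut path
        return self.fuse(torch.cat([a, b], dim=1)) # 1x1 fusion


m = CrossStageModule(64, n=2)
y = m(torch.randn(1, 64, 16, 16))  # channel count and spatial size preserved
```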

Experiment and discussion
In this section, we first describe the dataset details, implementation, and evaluation metric. Then, we perform various ablation experiments to verify the performance. Finally, the proposed method is compared with state-of-the-art methods. (In Table 2, n indicates the number of residual blocks in the cross-stage module.)

Datasets
To evaluate the proposed method, we create novel fire and smoke datasets, referring to the dataset of [8], which has only 226 images with and without fire. To enrich the data, we collected images with and without fire smoke from the Internet and established a fire and smoke dataset of 10,029 images. The fire and smoke images cover many scenes, such as cars, forests, buildings, and grasslands; in the remainder of the article, this dataset is named Dataset 1. Moreover, we added images of fire-like and smoke-like objects, such as glare lights, burning clouds, sunsets, water, steam, and sand dust; this extended dataset is named Dataset 2. As shown in Fig. 9, our self-built dataset contains three types of images: fire, fire-like, and nonfire. In Fig. 9, picture (a) shows common fire scenarios, including house fires, wildfires, and smoke at the scene of a fire. Picture (b) shows fire-like scenarios, including clouds, dust droplets, lighting, and fog. Detailed statistics of our dataset are shown in Table 3.

Implementation details
We use PyTorch to implement our method. We choose the SGD optimizer with an initial learning rate of 1e−3 and a momentum of 0.9. The batch size is set to 16, and training runs for 300 epochs. We use the Mosaic [2] and MixUp [41] data augmentation strategies. After every epoch, we validate the model on a validation set. The hardware platform is an NVIDIA RTX 3060 graphics card with 12 GB of memory; the GPU's single-precision floating-point capability is 12.74 TFLOPS. We run our method on Windows 10.
(Fig. 9 caption: Characteristic images in the dataset. (a) The fire-smoke images include fire images, smoke images, and images of fire mixed with smoke. (b) The fire-like and smoke-like images include clouds, dust droplets, and fog.)
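The optimizer settings above translate directly into PyTorch; the model below is a stand-in module rather than the actual detector.

```python
import torch

# Stand-in for the improved YOLOv3 detector (hypothetical placeholder).
model = torch.nn.Conv2d(3, 16, kernel_size=3)

# SGD with the paper's stated hyperparameters.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
batch_size = 16
num_epochs = 300
```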

Evaluation metrics
The VOC metric is a significant standard for assessing the detection method performance. In object detection, the average precision (AP) is a common evaluation metric. The specific description is shown as follows.
For a given threshold, a true positive is a detection box with IoU > threshold, a false positive is a detection box with IoU ≤ threshold, and a false negative is a ground truth with no matching detection box.
Precision is the ratio of true positives to all detected positives, and recall is the ratio of true positives to the number of ground-truth objects:

Precision = TP / (TP + FP),   Recall = TP / (TP + FN).

Using precision as the y-axis and recall as the x-axis, we build the precision-recall curve. The AP (average precision) is then computed as the average of the interpolated precision at 11 equally spaced recall values:

AP = (1/11) Σ_{r ∈ {0, 0.1, ..., 1}} max_{r̃ ≥ r} p(r̃),

where r is the recall and p(r̃) is the precision at recall r̃. Assuming there are n classes, the mAP is the mean AP over all classes:

mAP = (1/n) Σ_{i=1}^{n} AP_i.
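The 11-point interpolated AP described above can be computed as:

```python
def eleven_point_ap(precisions, recalls):
    """11-point interpolated AP (VOC style): average, over the recall
    thresholds {0.0, 0.1, ..., 1.0}, of the maximum precision attained
    at recall >= that threshold."""
    ap = 0.0
    for t in [i / 10 for i in range(11)]:
        ps = [p for p, r in zip(precisions, recalls) if r >= t]
        ap += max(ps) if ps else 0.0  # interpolated precision at threshold t
    return ap / 11.0


# A perfect detector (precision 1.0 at every recall level) scores AP = 1.0.
recalls = [i / 10 for i in range(11)]
precisions = [1.0] * 11
ap = eleven_point_ap(precisions, recalls)  # -> 1.0
```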

Ablation study
In this section, we compare the improved method with the baseline. Our method is based on YOLOv3 and introduces two mechanisms to enhance its performance. To confirm their effectiveness, we run experiments with different settings: (1) the YOLOv3 baseline network only, (2) applying the CBAM (convolution block attention module) in the three branches at different scales, (3) switching the anchor mechanism to the anchor-free mechanism, (4) adding the CBAM and switching to the anchor-free mechanism simultaneously, and (5) replacing the original backbone with the lightweight backbone. The results of the ablation study are presented in Table 4, where "CBAM" denotes the convolution block attention module, "anchor-free" denotes the anchor-free mechanism, and "lightweight backbone" denotes the proposed backbone; all models are tested at 416×416 resolution, and bold indicates the best performance. Compared with the YOLOv3 baseline, the structure that applies the CBAM (channel and spatial attention modules) achieves better performance in AP 50 and AP 70, but its AP is slightly lower than the baseline. We conclude that CBAM enhances the classification ability but slightly weakens the localization ability of the model. In addition, the inference speed is reduced from 62.9 FPS to 58.8 FPS due to the increase in model complexity. Applying the anchor-free mechanism alone reaches 69.3 AP 50, outperforming the YOLOv3 baseline by 0.4, and achieves significant improvements in AP, AP 60, and AP 70. We conclude that removing the anchor mechanism improves the positioning ability of the model. Additionally, the speed of the model increases from 58.8 FPS to 83.3 FPS owing to the decrease in the number of detection boxes. When we apply both mechanisms simultaneously, the AP further improves to 43.7.
Finally, replacing "DarkNet-53" with the lightweight backbone decreases the model parameters from 61.8 M to 48.8 M. This structure achieves 43.8 AP and 53.3 AP 70, outperforming the "DarkNet-53" structure on these metrics, although "DarkNet-53" still outperforms the lightweight backbone on others. The reason may be that the classification ability becomes weaker while the localization ability becomes stronger. In addition, the model maintains a high FPS, which increases from 62.9 (baseline) to 73.0.
To further evaluate the effectiveness of the improved method, we conduct a comparison in Dataset 2, and the results are shown in Table 5. Our method obviously improves the AP, AP 50 , AP 60 , and AP 70 . In particular, our method achieves 50.8 AP, which outperforms the baseline by 4.7.

Comparison with CNN-based detection methods
In this section, we compare the performance of our method and existing CNN-based methods. Traditional fire disaster detection methods are mainly based on manual features, which are easily influenced by complex environments and yield lower detection performance than network-extracted features; a comparison between the proposed method and traditional methods would therefore be unfair. First, we conduct a comparison experiment on Dataset 1. Then, we conduct a series of experiments on Dataset 2 to further analyze the performance of our method. Finally, based on the AP indicator, we present a broad comparison of model size and inference speed.

AP comparison using dataset 1
In this part, experiments are performed on the benchmark Dataset 1 to evaluate the performance of the proposed method against representative CNN-based (convolutional neural network) detection methods, including [19,26,30,44]. The comparison indices are AP (AP 50:95), AP 50, AP 60, and AP 70.
We reimplement all of the given methods on Dataset 1 and use the same testing set to evaluate their performance. The results are shown in Table 6. Our method achieves 43.8 AP, outperforming the second-best method [26] by 3.9, which indicates good positioning ability. Method [30] has the worst AP on Dataset 1, at 25.4. Our method has suboptimal performance in terms of AP 50, at 70.1, indicating good classification performance. Moreover, our method achieves the best AP 70 of 53.3, which indicates better positioning ability than the other methods. [26] achieves 39.9 AP on Dataset 1, the second-best result. [2] achieves 39.2 AP, slightly lower than [26], but performs better than [26] in terms of AP 50, AP 60, and AP 70; this is because [26] has better precise-positioning ability than [2]. From the perspective of comprehensive detection performance, our method achieves the best AP, AP 60, and AP 70.

Detection performance in dataset 2
Considering that fire-like and smoke-like objects will affect fire disaster management, detection systems must be robust against fire-like and smoke-like object attacks. To analyze the detection performance of different methods and enhance this robustness, we build a dataset that includes fire-like and smoke-like objects based on Dataset 1. All methods are trained on Dataset 2, and the experimental results are shown in Table 7. From the results, our model has the best detection performance, and [30] has the worst: the AP of method [30] only reaches 26.5 on Dataset 2, which is 24.3 lower than ours. Beyond AP, our method shows the best performance on the other evaluation metrics, particularly AP 70, where it achieves 60.4, outperforming the worst and suboptimal methods by 29.1 and 2.3, respectively. This is because removing the anchor boxes enhances the positioning ability of the model. In terms of AP 60, our method achieves 69.4, outperforming the suboptimal method by 1.0. Method [2] achieves the best AP 50, at 75.2, which is 0.6 higher than ours; the reason may be that method [2] has better classification ability.
Furthermore, to explore the effect of similar targets on detection, we validated the performance of models trained on both datasets. As shown in the first row of Fig. 10, some nonfire scenes are recognized as fire accidents; when the model is trained on Dataset 2, it can correctly identify similar targets. The proposed method is also tested on images including fire and smoke, with the results shown in Fig. 11. The first row includes images containing fire mixed with smoke: the model can basically identify fire and smoke, but in some areas it is difficult to distinguish fire from smoke. In the second and bottom rows, the model clearly recognizes the fire and smoke. Some small objects remain difficult to detect, as shown in the last images of Fig. 11. For fairness, none of these test images are included in Datasets 1 and 2. We can see that training on a dataset containing similar targets can effectively reduce the false alarm rate.

Comparison of the experimental results of fire and smoke
The AP of fire and smoke is an important indicator of detection performance. In this subsection, we present more detailed comparative results for each category in Dataset 2, reporting AP 50 for fire and smoke under similar overall detection performance. The experimental results are shown in Table 8. Table 8 shows that AP_smoke is higher than AP_fire for all methods, because smoke targets are usually larger than flames. For smoke detection, [19,26,44] have very similar AP_smoke values, all exceeding 70; this is also due to the relatively large area of smoke. However, [19] achieves the worst AP_fire, at only 58.6. The reason may be that VGG does not contain a residual structure, making fire harder to detect. A low AP_fire implies that the model has a high fire false alarm rate. For AP_fire, the suboptimal method achieves 62.5. Our method obtains the best results on both indices, outperforming the suboptimal methods by 0.2 and 1.1, respectively. From these results, we conclude that our method achieves the lowest false alarm rate.

Comparison of the model size, inference speed, and AP
Smaller fire detection methods are more feasible for deployment on hardware with limited computing resources. However, inference speed varies with software and hardware and is often difficult to control. Thus, in this section, we provide a detailed experimental analysis of the feasibility of our method and the state-of-the-art CNN-based methods. Experiments are performed in two settings: (1) an NVIDIA RTX 3060 supporting deep learning acceleration, with 12 GB of on-board memory and a floating-point capability of 12.74 TFLOPS, and (2) a CPU setup with 16 GB of RAM and a 64-bit 4.0-GHz Intel Core i5-12400.
AP is the fundamental property of an object detection method, so the comparison is discussed with respect to AP. The parameters of the methods under similar AP are compared in Table 9. First, in terms of model parameters, [44] appears to be the best method; however, it has a lower AP than our method and reaches only 51.5 FPS, because high-resolution images lead to increased inference time. Our method has 48.8 M parameters, similar to [25], but method [25] reaches only 61.8 FPS. In terms of inference speed, the suboptimal method achieves 62.9 FPS, which is 13% lower than ours, and Ref. [19] is the worst among these methods; our method achieves the best inference speed on the GPU. Furthermore, Refs. [44] and [19] have fewer parameters than the other methods but achieve lower speeds, because it takes time to process the predicted data. According to the above experimental results, our method reaches a good balance between detection performance and inference speed.

Conclusion
In this paper, we discuss the successful application of current CNN-based methods in computer vision tasks and the possibility of enhancing the performance of vision-based fire detection. Nevertheless, in some complicated scenes, the performance of current CNN-based methods remains limited: most existing methods have difficulty recognizing similar objects and lack generalization ability. To address these limitations, an attention mechanism module combining channel attention and spatial attention is proposed to improve the discrimination of smoke, fire, and similar objects. Then, an anchor-free mechanism is proposed to address the model's lack of generalization ability caused by the variable shapes of fire and smoke. Moreover, we use a lightweight backbone to reduce the complexity of our method. Compared with existing CNN-based methods on two datasets, the results show that our method achieves a better tradeoff among AP, model size, and FPS. Future work will focus on expanding the detection dataset and using semantic segmentation [31] to precisely mark the fire area in images.