Introduction

As one of the most common disasters, fire can cause severe damage to natural ecosystems and threaten human life and property. How to effectively prevent fires and accurately monitor their spread has therefore become a research hotspot. At the same time, accurate fire detection methods help firefighters and researchers develop optimal fire management strategies. To locate a fire scene precisely and determine the extent of fire damage, early researchers mostly relied on traditional machine learning methods to detect and segment the fire area, for example, evaluating the fire extent and orientation through color information, shape changes, and optical flow analysis [1]. Similarly, simple image processing [2] and edge detection methods [3] were used to determine the flame position and extent from image color, and a Gaussian mixture model built on the three-dimensional point cloud of flame pixels [4] was used to segment candidate flame areas in a single image and to compute changes in color, texture, roundness, area, and outline to detect forest fires. Although these simple image processing and machine learning methods can improve the efficiency and accuracy of fire detection, they require professionals to participate in feature design and screening, which is time-consuming and expensive, and the final segmentation and detection results depend heavily on expert experience. Moreover, they often use a single pixel as the sensing unit, which leads to low detection or segmentation accuracy.

With the continuous development of deep learning and its successful application in many fields [5,6,7,8,9,10], new ideas have emerged for accurate fire localization and detection. In recent years, target detection methods based on convolutional neural networks, which rely on the state changes of neurons in the convolution layers to obtain target details and accurately locate and segment fire targets, have been widely used, such as YOLO [11], U-Net [12], and deep CNNs [13,14,15]. These methods obtain deep discriminative information about fire targets by increasing the network depth, avoiding the errors caused by simple image processing methods that rely entirely on expert experience. Still, the scale variation of the convolution kernel is small, and the limited receptive field makes it difficult to fully capture the detailed information of fire targets. At the same time, the ability to detect small-scale fire changes is limited, and there is still considerable room for improvement in detection and segmentation accuracy. Although the multi-scale strategy [16, 17] enlarges the receptive field of the network by changing the convolution kernel, it can easily cause the network to fall into a local optimum. The attention mechanism [18,19,20] focuses on salient areas by assigning weights, which can compensate for the detailed information ignored by the convolutional network; still, this approach only enhances the utilization of deep features, while the rich semantics contained in low-level features cannot be effectively exploited. The transformer [21, 22] relies on powerful feature extraction capabilities to accurately locate the fire area and is widely used in fire detection and segmentation tasks. Compared with simple image processing and related methods, these deep learning methods rely on the powerful self-learning ability of neurons to model the fire area in the image and learn the differences between the characteristics of the fire area and the surrounding structures for fire area detection or segmentation. Although they reduce the errors caused by manual feature design and screening, when external conditions such as weather and lighting change and the semantic content of fire images is relatively complex, these methods are limited by the receptive field of the convolution kernel, and it is difficult to fully capture how the details of the fire area change as the fire intensity changes. At the same time, the fire information in the image cannot be modeled over long distances, and redundant information is reused during feature transmission, resulting in the loss of detailed semantics and reduced detection or segmentation accuracy. In addition, while modeling the fire area, these detection or segmentation methods are interfered with by objects whose colors resemble the fire area and may misjudge fire-like regions, providing false information to firefighters or researchers and leading to incorrect fire management strategies and greater losses.

In conclusion, existing fire detection methods repeatedly introduce irrelevant noise during feature extraction, which reduces the detailed representation ability of salient features and ignores the role of global features and contextual semantic information in distinguishing the fire area from the surrounding structures. Therefore, to overcome the above limitations and provide more accurate fire information to firefighters and researchers, a novel dynamic adaptive assigned transformer detection framework (DATFNets) is designed, which learns and fuses local information in subspaces, adopts multiple strategies to reduce the use of redundant information, and at the same time establishes complementary and interactive relationships between local, global, and contextual semantics to enrich the representation of fire areas with spatial details. The main contributions of this paper are as follows.

  1.

    We developed a context aggregation module with a masking strategy (Mask-CAM), embedded Mask-CAM into the patch merging layer of each stage of the feature extractor, and divided the patch merging layer into multiple subspaces that learn representations independently to refine local features and focus on the contextual details and global semantics of the fire area. Furthermore, to prevent problems such as vanishing and exploding gradients during feature extraction, we design an independent auxiliary weighted loss that improves semantic representation by independently supervising each stage of the feature extractor.

  2.

    We propose a new dynamic adaptive point representation and discrete point space constraint strategy (DAPSC), which uses an adaptive direction conversion function to locate fire targets, helps screen out representative fire samples from complex backgrounds, and improves the network's ability to perceive subtle changes in the fire by penalizing discrete points. Finally, extensive experiments are conducted on three baseline datasets to demonstrate the effectiveness and robustness of the framework.

The remainder of this paper is organized as follows. Related work is presented in “Related work” section. The DATFNets framework for fire detection will be elaborated in “Proposed DATFNets method” section, and each critical module will be introduced. Detailed experimental results and discussion are given in “Experimental results and discussion” section. Conclusions and future research plans are presented in “Conclusions” section.

Related work

Early flame detection methods primarily focused on simple machine learning methods with manual participation in feature design and screening. For example, Sharma et al. [23] combined sensors, drone technology, and image processing to design a fire detection system that avoids damage caused by fire events. Similarly, Sungheetha et al. [24] combined cloud computing, IoT sensors, wireless technology, and drones for fire detection and, to improve the accuracy of the system, integrated image processing technology to further avoid the loss of large amounts of resources. Such image processing methods often require expert experience to judge the fire area, which is time-consuming, labor-intensive, and expensive. To reduce errors caused by manual participation in feature design and screening, Mahmoud et al. [25] first used a background subtraction algorithm to detect moving areas and color space transformation to determine candidate flame areas. Secondly, considering that candidate areas may contain moving fire-like objects, they used wavelet analysis to differentiate between actual fire and fire-like objects and a support vector machine to classify regions of interest as real fires or non-fires. Dampage et al. [26] proposed a machine learning regression model to improve the performance of fire detection systems. To evaluate the fire area, Alves et al. [27] applied descriptors to a logistic regression classifier in the classification of fire images; in the post-analysis stage, image processing techniques were applied to analyze the images at different color levels to evaluate the fire area. These fire detection and assessment methods based on simple machine learning help improve fire detection accuracy, but their low degree of automation makes it difficult to meet the growing demand.

To avoid the errors caused by manual participation in feature design and screening in traditional fire detection methods, researchers have widely applied deep learning technology to fire detection tasks. For example, Kim et al. [28] used a fast region-based convolutional neural network (R-CNN) to extract the spatial features of fire and non-fire areas and detect suspected fire regions. Secondly, the bounding box features of consecutive frames were summarized through a long short-term memory network and classified as to whether a fire occurs in the short term. Finally, decisions from successive short-term periods were consolidated into a long-term decision through a majority voting mechanism. Pincott et al. [29] explored and applied the Faster R-CNN model and conducted experiments on indoor-specific fire images of different resolutions. Wang et al. [30] proposed a lightweight two-stage detection framework (TSSD) based on image analysis technology to monitor in real time whether a fire occurs in a factory and threatens property safety. The framework embeds prior knowledge and contextual information into a relationship guidance module, reducing the search space for fire smoke, and can effectively capture multi-level, multi-scale fire features with rich semantic information while maintaining the accuracy of single feature map prediction using two robust encoders. Lestari et al. [31] proposed using the Faster Region-Based Convolutional Neural Network (Faster R-CNN) method to detect fire hot spots. Venâncio et al. [32] proposed an automatic fire detection method based on spatial (visual) and temporal patterns; this hybrid method searches for possible fire events (spatial processing) through a CNN detector based on visual patterns and then analyzes these candidate areas. The above fire detection methods based on existing two-stage detectors can effectively extract significant information about the fire area, but their network structures are complex and inefficient, and they cannot accurately locate fires in small areas of the fire scene.

Many researchers have designed single-stage fire detection methods to improve fire monitoring efficiency. For example, Seydi et al. [33] proposed a deep learning framework called Fire-Net based on Landsat-8 images, aiming to detect active fire areas and burning materials; it uses a combination of red, green, blue, and thermal infrared bands to obtain a more efficient representation and utilizes residual convolution and separable convolution blocks to extract deeper features from coarse data. Avazov et al. [34], to prevent ship fires from causing severe impacts on crew, cargo, environment, finance, reputation, and safety, proposed a fire detection system based on YOLOv7 (You Only Look Once version 7) with an improved E-ELAN (Extended Efficient Layer Aggregation Network) as the backbone. Talaat et al. [35] proposed an intelligent fire detection system (SFDS) that improves YOLOv8, using the advantages of deep learning to detect specific fire characteristics in real time; compared with traditional fire detection methods, SFDS effectively improves fire detection accuracy, reduces false positives, and improves cost-effectiveness. Chen et al. [36] proposed a fire detection method based on improved PP-YOLO, introducing a feature fusion network that fuses two adjacent output feature maps of the backbone network so that high-level features can better integrate the details of low-level features; an attention module is then added to the intermediate fused feature maps of two adjacent outputs, allowing the network to selectively fuse valuable information in the feature maps through self-learning and alleviate the dilution and aliasing of information during feature fusion. Jia et al. [37] used pre-processed and pre-trained datasets to construct a forest fire detection strategy and compared the YOLOv8, YOLOv7, and YOLOv5 models from two perspectives: fire detection accuracy and model training speed. Wang et al. [38] proposed a lightweight detection model, MA-YOLO, based on YOLOv5, which designs a multi-directional weighted pyramid structure (MiFPN) to fuse information at different scales and introduces a decoupled head (AD-head) with learning capabilities that can capture small fire targets in complex environments. Jamal et al. [39] combined motion detection with flicker detection: the former uses a background subtraction algorithm to extract moving objects in the video, and the latter uses a flicker detection algorithm to verify flicker in suspicious areas; a color model is then used to verify that the color of the flickering area matches the characteristics of a flame. Furthermore, a YOLOv7 model trained on a large fire image dataset is used to validate the results and issue an alarm warning of the presence of fire. These single-stage fire detection methods have good real-time performance but use irrelevant information multiple times during feature extraction, which reduces the representation performance of salient features.

Proposed DATFNets method

This section first gives the basic process of the proposed DATFNets fire detection framework. Secondly, each important module involved is elaborated. Finally, the training process of the proposed algorithm is given, as well as the spatially constrained weighted loss function.

Overview

The overall structure of the proposed DATFNets fire detection framework is shown in Fig. 1. The framework mainly consists of a feature extractor, dynamic adaptive point learning, and sample screening with spatial constraints. The feature extractor consists of three components: the backbone network of the SWin-Transformer, the context aggregation module with a masking strategy (Mask-CAM), and bidirectional feature fusion. The input feature information is divided into four subspaces. Each subspace obtains the local details of the fire area through the SWin-Transformer backbone and uses Mask-CAM to embed and fuse this local information, improving the representation of local features and reducing the transmission and aggregation of redundant information. The bidirectional feature fusion component then models features at different levels from both the forward and reverse directions to better understand the global and contextual semantics of the fire area and to strengthen the complementary relationships between local, global, and contextual semantics. The dynamic adaptive point learning module obtains each sampling point through DCNv2 + learning and assigns a corresponding weight to each sampling point to reduce the interference of irrelevant contextual information on the target area, enhancing the ability of the deformable convolution to represent the salient information of the target area while learning angle information through a directional transformation function. The sample screening and spatial constraint module (DASCM) divides the learned sample points into positive samples, negative samples, and abnormal discrete points, then assigns these abnormal discrete points to the positive or negative samples through similarity distance calculation. At the same time, penalty constraints are imposed on these discrete points, which helps the detection framework obtain non-axisymmetric detail features from adjacent targets or complex backgrounds. It is worth noting that each module of the DATFNets fire detection framework has a separate loss function for supervised adjustment and optimization; in particular, there is a classification loss for supervised training at each extraction stage of the feature extractor.

Fig. 1
figure 1

a Represents the proposed DATFNets fire detection framework. b Represents the initial feature extractor we designed. c Represents the designed dynamic allocation and spatial constraint module (DASCM)

In Fig. 1, \(\{ Stages_1 ,Stages_2 ,Stages_3 ,Stages_4 \}\) represent the four stages of the feature extractor. \(\{ \vec{P}_{1,2,3,4} ,\mathop{P}\nolimits^{\leftarrow} _{1,2,3,4} \}\) represent the forward and reverse pyramid convolution layers in the cross bidirectional pyramid convolution module (CBPCM). \(F_b\) represents the multi-scale fusion features of the four stages. \(F_c\) represents the feature information used by the classifier. \(F_b^{\prime}\) represents the initial features generated by \(DCNv2 +\), mainly used to filter samples and improve positioning accuracy. \(F_d\) represents the refined feature information used for anchor box regression. \(DCNv2 +\) stands for the deformable convolutional network. \(\{ L_s ,L_r ,L_k \}\) represent the classification loss, regression loss, and dynamically allocated spatial constraint loss, respectively. \(\{ \chi_1 ,\chi_2 ,\chi_3 ,\chi_4 \}\) represent the four divided subspaces. \(B(\theta )\) represents the collection of all sample points to be screened. \(\{ \chi_p ,\chi_l ,\chi_d \}\) represent the filtered positive samples (the dark red points in Fig. 1c), negative samples (the bottom cyan points in Fig. 1c), and discrete points (the purple points in Fig. 1c), respectively. \(\lambda\) represents the penalty factor of the spatial constraints. \(SWinTB\) stands for the SWin-Transformer block. 'Mask-CAM' stands for the contextual attention aggregation module with a masking strategy.

Feature extractor

TMask-CAM

Multi-scale local and global contextual semantics help improve the detection accuracy of fire areas, yet traditional transformers tend to ignore local details in the initial feature extraction stage and, at the same time, introduce a large amount of irrelevant noise in the patch merging layer, reducing the representation performance of salient features. To improve this situation, we designed a context aggregation module with a masking strategy (Mask-CAM) to replace the patch merging layer. In addition, to further improve the representation of local details, the merged feature information is divided into four subspaces and refined to improve the representation of local and global semantics. The structure diagram is shown in Fig. 2b. The steps are as follows:

Fig. 2
figure 2

Heat map demonstration of different fire detection methods on the FireDets datasets. Where (1) represents the input FireDets fire detection image, and (2)–(9) represent different detection methods. (10) represents our proposed DATFNets fire detection framework

Step 1. Assume that the feature map input to the first Mask-CAM is \(X_F^0 \in R^{\frac{H}{4} \times \frac{W}{4} \times C}\) and that \(X_F^0\) contains \(4 \times 4\) feature blocks. We use the traditional attention mechanism [40] to assign an attention weight [41] to each feature block, sort the attention weight values from large to small, and perform a refinement operation on the feature block with the smallest attention weight value: each pixel in that block is subjected to attention processing, and the \(25\%\) of pixels with the smallest attention weight values are replaced with \(0\), since these pixels cannot improve the representation of local details. The output feature \(\chi_1^F\) of the first Mask-CAM is shown in Eq. (1).

$$ \left\{ {\begin{array}{*{20}l} {\chi_1^F = Concat\left( {\varpi^0 \cdot \chi^0 , \ldots ,\varpi^N \cdot \chi^N } \right),\quad N \le 4 \times 4} \hfill \\ {\varpi = Att\left( {\chi^0 , \ldots ,\chi^N } \right)} \hfill \\ \end{array} } \right. $$
(1)

where \(\varpi\) represents the attention weight value. \(Att( \cdot )\) indicates the attention operation. \(Concat( \cdot )\) represents the splicing operation. \(N\) represents the number of feature blocks. \(\{ \chi^0 , \ldots ,\chi^N \}\) represents the feature blocks. The minimum attention weight value \(\varpi^i\) of the \(i^{th}\) feature block and the masked feature block \(\chi^i\) are shown in Eq. (2).

$$ \left\{ {\begin{array}{*{20}l} {\varpi^i = Min\left( {\chi^0 , \ldots ,\chi^N } \right)} \hfill \\ {\chi^i = Mask\left( {\varpi_p^0 p_0 , \ldots ,\varpi_p^\nu p_\nu , \ldots ,\varpi_p^\mu p_\mu , \ldots ,\varpi_p^K p_K } \right),\quad \nu < \mu < K} \hfill \\ \end{array} } \right. $$
(2)

where \(Mask( \cdot )\) represents the mask operation. \(p_\cdot\) represents a pixel in the \(i^{th}\) feature block. \(\varpi_p\) represents the attention weight value of a pixel in the \(i^{th}\) feature block. \(Min( \cdot )\) indicates the minimization operation. \(\varpi_p\) and \(p\) are shown in Eq. (3).

$$ \left\{ {\begin{array}{*{20}l} {\{ \varpi_p^\nu , \ldots ,\varpi_p^\mu \} = Min\left( {\varpi_p^0 , \ldots ,\varpi_p^\nu , \ldots ,\varpi_p^\mu , \ldots ,\varpi_p^K } \right)} \hfill \\ {\{ p_\nu , \ldots ,p_\mu \} = 0} \hfill \\ \end{array} } \right. $$
(3)

where \(Min( \cdot )\) indicates the minimization operation. \(p_\mu\) represents the pixel at the \(\mu^{th}\) position and \(p_\nu\) the pixel at the \(\nu^{th}\) position of the \(i^{th}\) feature block.
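To make Step 1 concrete, the following is a minimal PyTorch sketch of the block-wise masking, assuming a simple channel-pooled softmax as a stand-in for the attention operator \(Att( \cdot )\) (the paper follows [40, 41] for the exact mechanism) and a 25% masking ratio; the shapes and the stand-in attention are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mask_cam_step1(x, grid=4, mask_ratio=0.25):
    """Sketch of the Mask-CAM masking of Step 1 (Eqs. 1-3).

    x: (B, C, H, W) feature map; it is split into grid x grid blocks,
    each block gets an attention weight, and inside the block with the
    smallest weight the lowest-scoring 25% of pixels are set to 0.
    A channel-pooled softmax stands in for Att(.) from [40, 41].
    """
    B, C, H, W = x.shape
    bh, bw = H // grid, W // grid
    # (B, N, C, bh, bw) with N = grid*grid feature blocks.
    blocks = (x.unfold(2, bh, bh).unfold(3, bw, bw)
                .permute(0, 2, 3, 1, 4, 5)
                .reshape(B, grid * grid, C, bh, bw))

    # Block-level attention weights (Eq. 1): softmax over pooled energy.
    energy = blocks.mean(dim=(2, 3, 4))                  # (B, N)
    w_block = F.softmax(energy, dim=1)

    # Index of the block with the minimum attention weight (Eq. 2).
    i_min = w_block.argmin(dim=1)                        # (B,)

    # Pixel-level attention inside that block, mask the lowest 25% (Eq. 3).
    sel = blocks[torch.arange(B), i_min]                 # (B, C, bh, bw)
    pix = sel.mean(dim=1)                                # (B, bh, bw) pixel scores
    k = max(1, int(mask_ratio * bh * bw))
    thresh = pix.flatten(1).kthvalue(k, dim=1).values    # (B,)
    keep = (pix > thresh[:, None, None]).unsqueeze(1)    # broadcast over channels
    blocks[torch.arange(B), i_min] = sel * keep

    # Re-weight every block by its attention weight and stitch back (Eq. 1).
    out = blocks * w_block[:, :, None, None, None]
    out = (out.reshape(B, grid, grid, C, bh, bw)
              .permute(0, 3, 1, 4, 2, 5)
              .reshape(B, C, H, W))
    return out
```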

Step 2. To obtain better local details, we divide the output feature \(\chi_1^F\) of the first Mask-CAM into four subspaces, namely \(\{ \chi_1 ,\chi_2 ,\chi_3 ,\chi_4 \}\), and input the subspaces \(\{ \chi_2 ,\chi_3 ,\chi_4 \}\) into SWinTB for refinement, so that \(\chi_1\) retains prior knowledge to the maximum extent and enriches the representation of the underlying semantic information. The output features \(\{ f_{\chi_2 } ,f_{\chi_3 } ,f_{\chi_4 } \}\) of the subspaces \(\{ \chi_2 ,\chi_3 ,\chi_4 \}\) are shown in Eq. (4).

$$ \left\{ {\begin{array}{*{20}l} {f_{\chi_2 } = \delta_{SWinTB} \left( {\chi_2 } \right)} \hfill \\ {f_{\chi_3 } = \delta_{SWinTB} \left( {ConCat\left( {f_{\chi_2 } ,\chi_3 } \right)} \right)} \hfill \\ {f_{\chi_4 } = \delta_{SWinTB} \left( {ConCat\left( {f_{\chi_3 } ,\chi_4 } \right)} \right)} \hfill \\ \end{array} } \right. $$
(4)

where \(\delta_{SWinTB} ( \cdot )\) indicates the SWinTB operation. \(ConCat( \cdot )\) indicates the feature splicing operation. \(\{ \chi_1 ,\chi_2 ,\chi_3 ,\chi_4 \}\) indicates the input features of the four subspaces. \(\{ f_{\chi_2 } ,f_{\chi_3 } ,f_{\chi_4 } \}\) indicates the output features of the refined subspaces.

In short, this form of subspace division can help model local semantics, and at the same time, the introduction of prior knowledge enhances the representation of the underlying semantics.
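The cascaded refinement of Eq. (4) can be sketched as below. The split into four subspaces is assumed here to be channel-wise, and a plain convolution stands in for the SWinTB block; both are simplifications of the design described above rather than the authors' code.

```python
import torch
import torch.nn as nn

class SubspaceRefine(nn.Module):
    """Sketch of the subspace split and cascaded refinement of Eq. (4).

    The Mask-CAM output is split channel-wise into {x1, x2, x3, x4};
    x1 is passed through untouched to keep prior low-level semantics,
    while x2-x4 are refined by SWinTB blocks (here replaced by plain
    convolutions as placeholders).
    """
    def __init__(self, channels):
        super().__init__()
        c = channels // 4
        self.swin2 = nn.Conv2d(c, c, 3, padding=1)        # placeholder SWinTB
        self.swin3 = nn.Conv2d(2 * c, c, 3, padding=1)    # takes concat(f2, x3)
        self.swin4 = nn.Conv2d(2 * c, c, 3, padding=1)    # takes concat(f3, x4)

    def forward(self, x):
        x1, x2, x3, x4 = torch.chunk(x, 4, dim=1)
        f2 = self.swin2(x2)                               # f_{x2} = SWinTB(x2)
        f3 = self.swin3(torch.cat([f2, x3], dim=1))       # f_{x3} = SWinTB([f2, x3])
        f4 = self.swin4(torch.cat([f3, x4], dim=1))       # f_{x4} = SWinTB([f3, x4])
        return torch.cat([x1, f2, f3, f4], dim=1)         # x1 carries the prior knowledge
```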

CBPCM

In fire target detection, using features at different levels for prediction is a standard operation, and there are many methods for fusing different feature layers, but basically they all first rescale the features to the same resolution and then add or splice them. Although this fusion approach can improve feature representation performance, there are semantic differences between fire target features at different levels, and at the same time there is a strong correlation between these feature layers: the closer two feature layers are, the greater their correlation. Therefore, if only corresponding element values are added without interaction between feature maps, a large amount of detailed information is lost and the detection accuracy is reduced. The bidirectional cross-layer pyramid convolution [42] module we designed transfers the information flow in both the forward and reverse directions and realizes the interaction between different feature layers. The specific steps are as follows:

Step 3. Assuming that the feature input to CBPCM is \(X_F\), the operation of the \(l^{th}\) layer in the standard pyramid convolution (PConv) is shown in Eq. (5).

$$ \begin{aligned} y^{(l)} & = W_1 \ast_{ST_{0.5} } X_F^{(l + 1)} + W_0 \ast X_F^{(l)} + W_{ - 1} \ast_{ST_2 } X_F^{(l - 1)} \\ & = Up\left( {W_1 \ast X_F^{(l + 1)} } \right) + W_0 \ast X_F^{(l)} + W_{ - 1} \ast_{ST_2 } X_F^{(l - 1)} \\ \end{aligned} $$
(5)

where \(l\) indicates the layer of the standard pyramid convolution. \(\{ W_0 ,W_1 ,W_{ - 1} \}\) indicates three independent convolutional kernels. \(\ast_{ST_2 }\) indicates a convolution with stride \(2\). \(\ast_{ST_{0.5} }\) indicates a convolution with stride \(0.5\), which is implemented as a normal convolution with stride 1 followed by a bilinear up-sampling layer. \(Up( \cdot )\) indicates the up-sampling operation.
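A minimal sketch of one standard PConv layer of Eq. (5) follows, assuming equal channel counts across levels; as described above, the stride-0.5 branch is realized as a stride-1 convolution followed by bilinear upsampling.

```python
import torch.nn as nn
import torch.nn.functional as F

class PConvLayer(nn.Module):
    """Sketch of one standard pyramid-convolution layer, Eq. (5).

    Three independent 3x3 kernels {W1, W0, W-1} act on the coarser (l+1),
    current (l), and finer (l-1) feature maps. Channel counts are illustrative.
    """
    def __init__(self, c):
        super().__init__()
        self.w1 = nn.Conv2d(c, c, 3, stride=1, padding=1)    # on level l+1, then upsample
        self.w0 = nn.Conv2d(c, c, 3, stride=1, padding=1)    # on level l
        self.w_1 = nn.Conv2d(c, c, 3, stride=2, padding=1)   # on level l-1, stride 2

    def forward(self, x_coarse, x_cur, x_fine):
        # x_coarse: level l+1 (half resolution), x_fine: level l-1 (double resolution).
        up = F.interpolate(self.w1(x_coarse), size=x_cur.shape[-2:],
                           mode='bilinear', align_corners=False)
        return up + self.w0(x_cur) + self.w_1(x_fine)
```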

Step 4. Unlike the standard PConv, the CBPCM we designed gathers detailed information on fire targets across layers from both the forward and reverse directions simultaneously, further improving the representation of the contextual semantics of salient features and enhancing the interaction between features at different levels. The forward and reverse input features at the \(l^{th}\) layer are shown in Eqs. (6) and (7).

$$ \overrightarrow {X_F^{(l)} } = LAEM\left( {\overrightarrow {X_F^{(l - 1)} } ,\overleftarrow {X_F^{(l - 2)} } ,X_F^s } \right),s,l = 3,4 $$
(6)
$$ \overleftarrow {X_F^{(l)} } = LAEM\left( {\overleftarrow {X_F^{(l + 1)} } ,\overrightarrow {X_F^{(l - 2)} } ,\overrightarrow {X_F^{(l)} } } \right),l = 3,4 $$
(7)

where \(\vec{ \cdot }\) represents forward input features and \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\leftarrow}$}}{ \cdot }\) represents reverse input features. \(s\) represents the feature information generated by the \(s^{th}\) stage of the backbone network with Mask-CAM. \(LAEM( \cdot )\) stands for the local attention embedding operation. \(\overrightarrow {X_F^{(l)} }\) indicates the forward input features at the \(l^{th}\) layer and \(\overleftarrow {X_F^{(l)} }\) the reverse input features at the \(l^{th}\) layer. \(LAEM( \cdot )\) is shown in Eq. (8).

$$ LAEM\left( {\overrightarrow {X_F^{(l - 1)} } ,\overleftarrow {X_F^{(l - 2)} } ,X_F^s } \right) = \frac{1}{3}\left[ {\overrightarrow {X_F^{(l - 1)} } + \overleftarrow {X_F^{(l - 2)} } + X_F^s + \alpha \left( {\overrightarrow {X_F^{(l - 1)} } + \overleftarrow {X_F^{(l - 2)} } + X_F^s } \right)} \right] $$
(8)

where \(\alpha\) indicates the attention map generated from the input features \(\overrightarrow {X_F^{(l - 1)} } ,\overleftarrow {X_F^{(l - 2)} }\), and \(X_F^s\). \(\alpha\) is shown in Eq. (9).

$$ \alpha = Up\left( {\sigma_1 \left( {X_F^s \cdot \sigma_2 \left( {Conv_{1 \times 1,r}^\downarrow \cdot \left( {\overrightarrow {X_F^{(l - 1)} } + \overleftarrow {X_F^{(l - 2)} } } \right)} \right)} \right)} \right) $$
(9)

where \(\sigma_1\) and \(\sigma_2\) represent the sigmoid and \(RReLu\) activation functions, respectively. \(Conv_{1 \times 1,r}^\downarrow\) represents a \(1 \times 1\) dimensionality-reduction convolution with reduction rate \(r\). \(Up( \cdot )\) indicates the up-sampling operation.
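The LAEM fusion of Eqs. (8)-(9) can be sketched as follows. A 1×1 expansion convolution is added here so that the channel-reduced features can be multiplied with the stage feature; this channel bookkeeping, along with reading \(\sigma_1\) as a sigmoid, is our assumption and may differ from the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LAEM(nn.Module):
    """Sketch of the local attention embedding of Eqs. (8)-(9).

    Fuses a forward feature, a reverse feature, and a backbone stage feature:
    an attention map alpha is built from a 1x1 reduction conv (ratio r), RReLU,
    multiplication with the stage feature, and a sigmoid, then the three inputs
    are averaged with and without this re-weighting. All inputs are assumed to
    share the same shape.
    """
    def __init__(self, c, r=4):
        super().__init__()
        self.reduce = nn.Conv2d(c, c // r, kernel_size=1)     # Conv_{1x1, r}
        self.expand = nn.Conv2d(c // r, c, kernel_size=1)     # assumed: back to c channels
        self.act = nn.RReLU()

    def forward(self, x_fwd, x_rev, x_stage):
        # alpha = Up( sigmoid( x_stage * RReLU( Conv_1x1( x_fwd + x_rev ) ) ) ), Eq. (9)
        a = self.expand(self.act(self.reduce(x_fwd + x_rev)))
        alpha = torch.sigmoid(x_stage * a)
        if alpha.shape[-2:] != x_fwd.shape[-2:]:               # Up(.) only if sizes differ
            alpha = F.interpolate(alpha, size=x_fwd.shape[-2:],
                                  mode='bilinear', align_corners=False)
        s = x_fwd + x_rev + x_stage
        return (s + alpha * s) / 3.0                           # Eq. (8)
```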

Therefore, the output of the \(l^{th}\) layer of CBPCM is shown in Eq. (10).

$$ \begin{aligned} y^{(l)} & = Up\left( {W_1 \ast \left( {\overrightarrow {X_F } + \overleftarrow {X_F } } \right)^{(l + 1)} } \right) + W_0 \ast \left( {\overrightarrow {X_F } + \overleftarrow {X_F } } \right)^{(l)} \\ & \quad + W_{ - 1} \ast_{ST_2 } \left( {\overrightarrow {X_F } + \overleftarrow {X_F } } \right)^{(l - 1)} + X_F \\ \end{aligned} $$
(10)

where \(Up( \cdot )\) indicates the up-sampling operation. \(l\) indicates the layer of CBPCM. \(\{ W_0 ,W_1 ,W_{ - 1} \}\) indicates three independent convolutional kernels. \(y^{(l)}\) indicates the output of the \(l^{th}\) layer of CBPCM.

To compensate for the shortcomings of single-scale features, we up-sample and splice the features at different levels to obtain the final input feature \(F_b\) for the classification and regression operations. \(F_b\) is shown in Eq. (11).

$$ F_b = \sum_{l = 1}^L {Up\left( {y^{(l + 1)} } \right)} + y^{(1)} ,\quad L = 4 $$
(11)

where \(Up( \cdot )\) indicates the up-sampling operation. \(L\) indicates the number of layers of CBPCM, and \(l \le L\).
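A small sketch of the multi-level fusion of Eq. (11) is given below, reading the sum as upsampling each coarser CBPCM output to the finest resolution before adding it to \(y^{(1)}\); equal channel counts across levels are assumed.

```python
import torch.nn.functional as F

def fuse_levels(ys):
    """Sketch of Eq. (11): upsample the coarser CBPCM outputs to the finest
    resolution and sum them with y^(1).

    ys: list [y1, y2, y3, y4] ordered fine -> coarse, all with the same
    number of channels.
    """
    target = ys[0].shape[-2:]
    fb = ys[0]                                                  # y^(1)
    for y in ys[1:]:
        fb = fb + F.interpolate(y, size=target, mode='bilinear',
                                align_corners=False)            # Up(y^(l+1))
    return fb
```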

Object-oriented detector

The detection module consists of a classifier and a regressor: the classifier is designed to classify fire targets accurately, and the regressor is designed to locate fire targets accurately, further improving the overall performance of the fire detection framework. The steps are as follows:

Step 5. We input the fused feature \(F_b\) obtained by the feature extractor into the detector and use an improved version of deformable convolution \(\left( {DCNv2 + } \right)\) to capture the geometric information of fire targets in any direction. Unlike traditional DCNv2 [43], where all \(3 \times 3\) convolutional layers in \(Conv3\), \(Conv4\), and \(Conv5\) are replaced, we use deformable convolution only to replace the \(3 \times 3\) convolutional layers in the \(Conv3\) and \(Conv5\) modules. On the one hand, this replacement improves the network's ability to detect geometric information of arbitrarily oriented fire targets by increasing the receptive field; on the other hand, the local details of the fire target in specific directions are retained to the maximum extent, which helps the network locate the fire target and its regional scope. In addition, to further enhance the control of the improved deformable convolution \(\left( {DCNv2 + } \right)\) over the spatial support area, and considering that we only use deformable convolution in the low layer (\(Conv3\)) and the high layer (\(Conv5\)), each sampling point is given a set of offsets and weight coefficients. The weight and offset added to each sampling point are shown in Eq. (12).

$$ Z(\rho ) = \sum_{k = 1}^K {(\omega + W_k )} \cdot F_b (\rho + \rho_k + \vartriangle \rho_k ) \cdot (\omega^{\prime} + \vartriangle {\rm{\mathcal{M}}}_k ) $$
(12)

where \(\vartriangle \rho_k\) represents the learned offset, \(\vartriangle {\rm{\mathcal{M}}}_k\) represents the learned modulation scalar, and its range is [0, 1]. \(\omega\) and \(\omega^{\prime}\) represent balancing factors. \(W_k\) represents the sampling point weight.
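The modulated sampling of Eq. (12) can be approximated with torchvision's deformable convolution, as in the sketch below: a plain 3×3 convolution predicts the offsets and modulation, and the balancing factors \(\omega\) and \(\omega'\) are simply added to the kernel weights and modulation. This requires a torchvision version that supports the mask argument (>= 0.9) and is only one interpretation of Eq. (12), not the authors' code.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DCNv2Plus(nn.Module):
    """Sketch of the DCNv2+ sampling of Eq. (12).

    The per-location offsets and modulation scalars are predicted by a plain
    3x3 convolution; omega / omega_prime are added to the kernel weights and
    the modulation before the deformable convolution is applied.
    """
    def __init__(self, c, k=3, omega=0.1, omega_prime=0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(c, c, k, k) * 0.01)   # W_k
        self.pred = nn.Conv2d(c, 3 * k * k, k, padding=k // 2)       # offsets + modulation
        self.omega, self.omega_prime, self.k = omega, omega_prime, k

    def forward(self, fb):
        out = self.pred(fb)
        n = 2 * self.k * self.k
        offset = out[:, :n]                                          # learned offsets
        mod = torch.sigmoid(out[:, n:])                              # modulation in [0, 1]
        return deform_conv2d(fb, offset,
                             self.weight + self.omega,               # (omega + W_k)
                             mask=(self.omega_prime + mod).clamp(max=1.0),
                             padding=self.k // 2)
```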

Step 6. Due to inadequate supervision, screening high-quality positive sample points from the adaptive point set helps capture the physical appearance and geometric characteristics of dense and arbitrarily oriented fire areas. Therefore, we design an effective dynamic adaptive point screening and allocation strategy to select high-quality salient points from these point sets as positive samples and use the worst-quality points as negative samples. This allocation strategy helps distinguish fire targets from the surrounding background.

Assuming that the adaptive point set is \(B(\theta )\), we use density-based spatial clustering of applications with noise (DBSCAN) to divide these adaptive points into \(4\) groups: the group \(Q_p\) with the highest average classification confidence is used as positive samples, the group \(Q_l\) with the lowest classification confidence is used as negative samples, and the group \(Q_d\) that is farthest from the other three groups and contains the fewest adaptive points is treated as discrete outlier points subject to spatial constraints. The overall quality \(Q_a\) of these collection points is shown in Eq. (13).

$$ Q_a = Q_p (\theta ) + \zeta_1 \cdot Q_l (\theta ) + \zeta_2 \cdot Q_d (\theta ) + \zeta_3 \cdot Q_m (\theta ) $$
(13)

where \(\zeta_1 = 0.1\) and \(\zeta_3 = 0.3\); \(\zeta_2\) is given by Eq. (14).

Step 7. The positive sample points are \(B_p (\theta )\), with average classification confidence \(\eta_h\) and cluster center \(C_B^p\). The negative sample points are \(B_l (\theta )\), with average classification confidence \(\eta_l\) and cluster center \(C_B^l\). \(Q_m\) represents the remaining cluster of sample points, with average classification confidence \(\eta_m\) and cluster center \(C_B^m\). The spatial constraint for the discrete outlier points is shown in Eq. (14).

$$ \zeta_2 = \left\{ {\begin{array}{*{20}l} {\zeta_1 ,} \hfill & {\xi_D = Max\left( {|x_{C_B^l } - x_d |,|y_{C_B^l } - y_d |} \right)} \hfill \\ {\zeta_3 ,} \hfill & {\xi_D = Max\left( {|x_{C_B^m } - x_d |,|y_{C_B^m } - y_d |} \right)} \hfill \\ {1,} \hfill & {\xi_D = Max\left( {|x_{C_B^p } - x_d |,|y_{C_B^p } - y_d |} \right)} \hfill \\ {0,} \hfill & {otherwise} \hfill \\ \end{array} } \right. $$
(14)

where \(\xi_D\) represents the minimum distance. \(Max(| \cdot |,| \cdot |)\) represents the Chebyshev distance. \(x_\cdot\) represents the abscissa of the discrete point. \(y_\cdot\) represents the ordinate.

In summary, our constraint method can maximize the quality of adaptive points, thereby improving detection and classification accuracy.
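A rough sketch of the screening and constraint assignment of Steps 6-7 and Eq. (14) is shown below. It uses scikit-learn's DBSCAN and, as a simplification, treats DBSCAN noise points as the discrete outlier set \(Q_d\); the eps and min_samples values are illustrative, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def assign_discrete_points(points, scores, zeta1=0.1, zeta3=0.3):
    """Sketch of the sample screening of Steps 6-7 and Eq. (14).

    points: (M, 2) adaptive point coordinates, scores: (M,) classification
    confidences. DBSCAN groups the points; the highest-confidence group is
    treated as positive (Q_p), the lowest as negative (Q_l), DBSCAN noise as
    the discrete outliers (Q_d), and each outlier receives a penalty weight
    zeta_2 according to the Chebyshev-nearest cluster centre.
    """
    labels = DBSCAN(eps=8.0, min_samples=3).fit_predict(points)
    groups = [g for g in np.unique(labels) if g != -1]
    conf = {g: scores[labels == g].mean() for g in groups}
    g_pos = max(conf, key=conf.get)                     # Q_p: highest mean confidence
    g_neg = min(conf, key=conf.get)                     # Q_l: lowest mean confidence
    centres = {g: points[labels == g].mean(axis=0) for g in groups}
    outliers = points[labels == -1]                     # treated as Q_d

    weights = []
    for p in outliers:
        # Chebyshev distance to each cluster centre, Eq. (14).
        d = {g: np.max(np.abs(p - c)) for g, c in centres.items()}
        g_near = min(d, key=d.get)
        if g_near == g_neg:
            weights.append(zeta1)                       # nearest to the negative centre
        elif g_near == g_pos:
            weights.append(1.0)                         # nearest to the positive centre
        else:
            weights.append(zeta3)                       # nearest to a remaining cluster
    return g_pos, g_neg, outliers, np.array(weights)
```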

Reconstructed loss function

To ensure that the proposed DATFNets fire detection framework converges smoothly and maintains optimal performance, we designed a weighted loss function with spatial constraints. The classification loss, as the main loss, plays a vital role in ensuring that the network generates effective adaptive point sets, while the regression loss and spatial constraint loss refine the adaptive point set to ensure positioning accuracy and obtain precise adjustment. The total loss \(L_T\) is shown in Eq. (15).

$$ L_T = L_c + \beta_1 L_c^s + \left( {\beta_2 + \beta_2^{\prime} } \right)\left( {L_r + L_k } \right),\quad s = \{ 1,2,3,4\} $$
(15)

where \(L_c\) represents the classification loss. \(L_c^s\) represents the classification loss at the four stages of the feature extractor. \(L_r\) represents the regression loss. \(L_k\) represents the spatial constraint loss of set point screening. \(\{ \beta_1 ,\beta_2 ,\beta_2^{\prime} \}\) represent the balance factors, namely \(\beta_1 = 0.2,\beta_2 = 0.7,\beta_2^{\prime} = 0.5\). \(L_c\) and \(L_c^s\) are shown in Eqs. (16) and (17).

$$ L_c = \frac{1}{M}\sum_i {{\rm{\mathcal{F}}}^c } \left( {{\rm{\mathcal{R}}}_i^c \left( \theta \right),{\rm{\mathcal{T}}}_j^c } \right) $$
(16)
$$ L_c^s = \sum_i {{\rm{\mathbb{F}}}^c } ({\rm{\mathcal{R}}}_i^c (\theta ),{\rm{\mathcal{T}}}_j^c ) $$
(17)

where \(M\) represents the total number of adaptive point sets. \({\rm{\mathcal{R}}}_i^c (\theta )\) represents the confidence of the predicted category. \({\rm{\mathcal{T}}}_j^c\) represents the ground-truth class. \({\rm{\mathcal{F}}}^c ( \cdot )\) stands for the Quality Focal Loss. \({\rm{\mathbb{F}}}^c ( \cdot )\) stands for the Focal Loss. \(L_r\) is shown in Eq. (18).

$$ L_r = \frac{1}{M}\sum_i {[{\rm{\mathcal{T}}}_j^c \ge 1]} {\rm{\mathbb{F}}}^g ({\rm{\mathcal{O}}}_i^r (\theta ),{\rm{\mathcal{T}}}_j^r ) + \frac{\gamma }{M}\sum_i {[{\rm{\mathcal{T}}}_j^c \ge 1]{\rm{\mathbb{F}}}^d ({\rm{\mathcal{O}}}_i^r (\theta ),{\rm{\mathcal{T}}}_j^r )} $$
(18)

where \({\rm{\mathcal{T}}}_j^r\) indicates the location of the ground-truth box. \({\rm{\mathcal{O}}}_i^r (\theta )\) indicates the location of the predicted box. \({\rm{\mathbb{F}}}^g ( \cdot )\) indicates the GIoU loss. \({\rm{\mathbb{F}}}^d ( \cdot )\) indicates the DIoU loss. \(\gamma\) indicates the balance factor. \(L_k\) is shown in Eq. (19).

$$ L_k = \frac{1}{(M_p + M_l ) \cdot M_d }\sum_{i = 1} {\sum_j {\varphi_{i,j} } } = \frac{1}{(M_p + M_l ) \cdot M_d }\sum_{i = 1} {\sum_j | } \varphi^{\prime} (\chi_p ,\chi_l ) - \chi_d | $$
(19)

where \(\varphi ( \cdot )\) indicates the penalty function for the spatial constraints. \(\varphi^{\prime} ( \cdot )\) indicates the function that returns the center points of the positive and negative samples obtained by density-based spatial clustering of applications with noise. \(\chi_d\) indicates a discrete outlier point, namely a point outside the box. \(M_p ,M_l ,M_d\) indicate the number of positive sample points, negative sample points, and points outside the box, respectively. \(\left| \cdot \right|\) indicates the modulus operation, namely the distance calculation.
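Under the balance factors given above, the total loss of Eq. (15) can be assembled as in the following sketch, assuming the per-stage classification losses \(L_c^s\) are summed before weighting; this is our reading of the equation rather than the authors' implementation.

```python
def total_loss(l_c, l_c_stages, l_r, l_k,
               beta1=0.2, beta2=0.7, beta2_prime=0.5):
    """Sketch of the reconstructed weighted total loss of Eq. (15).

    l_c: main classification loss, l_c_stages: list of per-stage auxiliary
    classification losses (one per feature-extractor stage), l_r: regression
    loss, l_k: spatial-constraint loss. Balance factors follow the text.
    """
    return l_c + beta1 * sum(l_c_stages) + (beta2 + beta2_prime) * (l_r + l_k)
```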

The training process of the proposed DATFNets fire detection framework is shown in Algorithm 1.

Algorithm 1. The training process of our proposed DATFNets fire detection framework

 

Input: Given a fire image of \(X \in R^{C \times H \times W}\), \(C,H\) and \(W\) represent the channel, height, and width, respectively. The maximum number of iterations is \(MaxEpochs\), and the reconstructed weighted total loss function is \(L_T\). The features generated by TMask-CAM are \(X_F^s ,s = \{ 1,2,3,4\}\). The fusion feature of CBPCM is \(F_b\). The generated adaptive point set is \(B(\theta )\)

 

For \(epochs = 0\) to \(epochs = MaxEpochs\) do:

 

   1. \(\{ \chi_2^F ,\chi_3^F ,\chi_4^F \} \xleftarrow[{TMask - CAM}]{Eq.(1)}X \in R^{C \times H \times W}\)

 

   2. \(\{ X_F^s ,s = \{ 1,2,3,4\} \} \xleftarrow[{TMask - CAM}]{Eq.(2),Eq.(3),Eq.(4)}\{ \chi_2^F ,\chi_3^F ,\chi_4^F \}\)

 

   3. \(F_b \xleftarrow[{CBPCM}]{Eq.(5)\ Eq.(11)}\{ X_F^s ,s = \{ 1,2,3,4\} \}\)

 

   4. \(F_d \xleftarrow{DCNv2 + }\{ F_c ,F_b^{\prime} \} \xleftarrow{DCNv2 + }F_b\)

 

   5. \(\{ Q_a \} \xleftarrow{Eq.(12)\ Eq.(14)}\{ F_b^{\prime} \}\)

 

   6. \(\{ \zeta_2 \} \xleftarrow{Eq.(15)}\{ B_l (\theta ),B_p (\theta ),B_m (\theta )\} \xleftarrow{DBSC}B(\theta )\)

 

   7. Optimize the training process with \(AdamW\) and the total loss function \(L_T\)

 

Output: Finally, the type and detection box of the target in the fire image are given.

 

Experimental results and discussion

This section provides the data sources required for the experiments, the corresponding evaluation indicators, and the network parameter initialization. Secondly, ablation experiments on the internal modules of the proposed DATFNets fire detection framework are provided, along with comparison results against current advanced detection methods. Finally, a visual demonstration is provided and discussed in detail.

Datasets and metrics

Data preparation

FireDets

This dataset mainly includes 6675 images of high-risk fire scenes of different sizes, such as residences, gas stations, highways, and forests. These data are used to train the network, which helps automatically detect smoke and fire in the monitored area and helps firefighters or researchers make timely fire management strategies to minimize casualties and property losses. To ensure the correctness and effectiveness of the experiment, we randomly divided the data into training, validation, and test samples at a ratio of 6:1:3; that is, the number of test samples is 2002.

WildFurgFires

This dataset contains 25 videos in total. We selected 17 videos containing fire scenes, deleted the video frames without fire scenes, and generated 9968 fire images with a size of \(1280 \times 720\). In addition, to ensure the experiment's smooth progress, we randomly selected an additional 600 wild smoke images with a size of \(640 \times 480\) from the Wildfire dataset to form the data sample. We divided the dataset into training, validation, and test samples at a ratio of 6:1:3, of which 245 wild smoke images are in the test sample.

FireAndSmokes

This dataset consists of early fire and smoke images captured in real-life scenarios using mobile phones. The images were captured under various lighting conditions (indoor and outdoor scenes) and weather and include typical household scenes such as garbage incineration, paper and plastic incineration, field crop burning, and home cooking. We divide the 1247 images of different sizes into training, validation, and testing samples. In addition, to ensure the experiment's consistency and effectiveness, we performed a series of data augmentations on the above three baseline datasets before training to avoid errors caused by sample quality, and the images were cropped to \(512 \times 512\).

Evaluation metrics

To ensure the fairness and consistency of the experiments, we use recall (R), average precision (AP), and mean average precision (mAP) as the evaluation indicators for all methods. These indicators are calculated as shown in Eqs. (20) and (21).

$$ R = \frac{TP}{{TP + FN}} $$
(20)
$$ mAP = \frac{1}{N} \cdot \sum_{i = 1}^N {\int_0^1 {P_i } } (R_i )d(R_i ) $$
(21)

where \(TP,TN,FP\), and \(FN\) indicate true positives, true negatives, false positives, and false negatives, respectively. \(N\) is the total number of target categories. 'AR' indicates the average recall.
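For reference, Eqs. (20)-(21) can be computed as in the sketch below, where the per-class AP integral is approximated by trapezoidal integration over a precision-recall curve supplied by the evaluation pipeline; the helper names are ours, not part of any library.

```python
import numpy as np

def recall(tp, fn):
    """Recall of Eq. (20)."""
    return tp / (tp + fn)

def mean_average_precision(per_class_pr):
    """Sketch of Eq. (21): mAP as the mean per-class area under the
    precision-recall curve, approximated by trapezoidal integration.

    per_class_pr: list of (precision, recall) array pairs, one per class,
    with recall sorted in ascending order.
    """
    aps = [np.trapz(p, r) for p, r in per_class_pr]
    return float(np.mean(aps))
```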

Parameter settings

In the training phase of the proposed DATFNets fire detection framework, the initial learning rate (lr) is 0.00025, the batch size per GPU is 16, and the number of training epochs is 36. To prevent the network from falling into a local optimum, we control the learning rate dynamically; that is, the learning rate becomes \(lr \times 0.5\) every 6 training epochs. Finally, AdamW is used as the optimizer to adjust and optimize the overall network, with a decay rate of 0.095 and a T_max of \(18\). All experiments were developed with Python 3.7.6 on Ubuntu 20, and the deep learning libraries include torch 1.7.0 + cuda 11.0 and NumPy 1.19.5. Training and testing were conducted on four RTX A6000 GPUs.
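A sketch of this training configuration in PyTorch is given below; the model variable is a placeholder, the decay rate is read as AdamW's weight decay, and since the text mentions both the halve-every-6-epochs rule and a T_max of 18, the cosine alternative is shown as a comment.

```python
import torch

# Placeholder for the assembled DATFNets network.
model = torch.nn.Conv2d(3, 16, 3)
optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-4, weight_decay=0.095)
# Halve the learning rate every 6 epochs, as stated in the text.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=6, gamma=0.5)
# Alternative implied by T_max = 18:
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=18)

for epoch in range(36):                      # 36 training epochs
    # ... one pass over the training set, computing L_T and calling
    # optimizer.step() per batch ...
    scheduler.step()                         # update the learning rate once per epoch
```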

Comparison with state-of-the-art methods

To demonstrate the effectiveness of the proposed DATFNets fire detection framework, we compare it with current advanced detection methods on the three baseline datasets FireDets, WildFurgFires, and FireAndSmokes and provide the corresponding experimental results and analysis. At the same time, the detection efficiency of the different methods is given.

The experimental results on the FireDets datasets

Table 1 shows that the DATFNets fire detection framework we proposed achieves the best performance; for example, its mAP is increased by 0.025 and 0.051 compared with mask2former and dino, respectively. There are two possible reasons. On the one hand, we embedded the context aggregation module with a masking strategy (Mask-CAM) into the backbone network, which can obtain the detailed information and contextual global semantics of fire targets better than Swin-L, and dividing the features into subspaces helps model local semantics. On the other hand, the designed dynamic adaptive point representation and allocation strategy provides the detector with more efficient and higher-quality samples. In addition, SASM showed the worst performance. This may be because many everyday objects resemble flames and are difficult for this method to distinguish; for example, clouds and red lights are similar in color to fire targets, which can easily cause false detections. The heat maps of different methods on the FireDets dataset are shown in Fig. 2.

Table 1 Experimental results of different methods on the FireDets datasets

The experimental results on the WildFurgFires datasets

According to Table 2, we can draw the following conclusions:

  1.

    Our proposed DATFNets fire detection method still performs best on the WildFurgFires baseline dataset, with an AR and mAP of 0.977 and 0.909, respectively. The possible reason is that the fire targets in this dataset are more pronounced, there are fewer fire-like targets, and the differences between different targets in the images are significant. At the same time, the reconstructed loss function used in our framework performs constraint adjustments at each stage, prompting the network to obtain better deep discriminative features. In addition, compared to ResNet50 and ConvNeXt-V2 as the backbone network, using Swin-L as the feature extractor may help the network obtain more effective fire target details. For example, the mAP of mask2former is 0.074 and 0.095 higher than that of s2net and DiffusionDet, respectively, and the AR and mAP of dino are improved by 0.068 and 0.102, respectively, compared with mask-rcnn.

  2.

    Using ResNet50 as the backbone network, the mAP of SASM is 0.117 and 0.096 lower than that of s2net and DiffusionDet, respectively. The DiffusionDet method may provide effective target candidate areas for detection by generating random boxes, thereby improving detection accuracy. The s2net method effectively solves the mismatch between anchor boxes and oriented features through aligned feature maps and alleviates the inconsistency between classification scores and positioning accuracy. SASM still achieves the worst performance among all detection methods. The heat maps of different methods on the WildFurgFires dataset are shown in Fig. 3.

Table 2 Experimental results of different methods on the WildFurgFires datasets
Fig. 3
figure 3

Heat map demonstration of different fire detection methods on the WildFurgFires datasets. Where (1) represents the input WildFurgFires fire detection image, and (2)–(9) represent different detection methods. (10) represents our proposed DATFNets fire detection framework

The experimental results on the FireAndSmokes datasets

From Table 3, we can see that the proposed DATFNets fire detection framework still performs best on this dataset. For example, its mAP is improved by 0.043 and 0.103 compared with mask2former and LSKNet, respectively, and its AR is improved by 0.119 and 0.221 compared with DiffusionDet and yolov8, respectively. This also proves that our proposed method has good robustness and generalizability. Similarly, mask2former, dino, and LSKNet are highly competitive on this dataset compared with the other fire detection methods, and the mAP of LSKNet is 0.054 higher than that of s2net. The LSKNet detection method effectively handles the different complex backgrounds of fire targets by dynamically adjusting the receptive field of the backbone network and uses a spatial selection mechanism to filter redundant information and improve detection performance. The heat maps of different methods on the FireAndSmokes dataset are shown in Fig. 4.

Table 3 Experimental results of different methods on the FireAndSmokes datasets
Fig. 4
figure 4

Heat map demonstration of different fire detection methods on the FireAndSmokes datasets. Where (1) represents the input FireAndSmokes fire detection image, and (2)–(9) represent different detection methods. (10) represents our proposed DATFNets fire detection framework

According to Figs. 2, 3 and 4, the DATFNets fire detection framework we proposed can effectively locate the fire area and is also more effective at delineating the scope of the fire area. As shown in Fig. 3, DiffusionDet and mask2former can also effectively locate the fire area compared with the heat maps of other methods, but their determination of the fire range is poorer. The DATFNets method achieves such effective results because screening high-quality adaptive point fire samples and imposing spatial constraints and penalties on discrete points help determine the scope of the fire.

Fig. 5
figure 5

Detection efficiency of different fire detection methods. FLOPs represents the number of floating-point operations, and Params represents the overall parameter size of each detection method

Efficiency of different fire detection methods

We can get the detection efficiency of different fire detection methods in Fig. 5.

Figure 5 shows that although mask2former and dino achieve competitive detection accuracy on the three datasets, their detection efficiency is low; that is, their FLOPs are 7.1G and 8.2G, respectively, and the reason may be that their model parameters are large. The detection efficiency of SASM and yolov8 is better, with FLOPs of 18.7G and 21.8G, respectively, but their detection accuracy for fire areas in images is poor, which can lead to misjudged fire targets during monitoring, providing incorrect information to firefighters and researchers and resulting in unsatisfactory emergency responses or management strategies. The proposed DATFNets fire detection framework achieves better detection accuracy than the other methods while maintaining good detection efficiency, with FLOPs of 13.9G. It can therefore meet the need for real-time monitoring of fire scenes and help firefighters and researchers formulate optimal management strategies to prevent severe damage to life and property.

Ablation study

To demonstrate that each module in the proposed DATFNets fire detection framework plays a positive role in the overall performance, we separately evaluated and verified the feature extractor, the positioning and detection module, the classification module, and the loss function on the three baseline datasets FireDets, WildFurgFires, and FireAndSmokes.

Experiment results of different feature extractors

According to Tables 4, 5 and 6, we can draw the following conclusions:

  1.

    On the three baseline datasets, the CBPCM feature aggregation component we designed plays a positive role in the overall fire detection framework and can effectively improve the detection performance for fire targets. For example, the mAP of Swin-L + CBPCM is 0.007, 0.067, and 0.042 higher than that of Swin-L + FPN, respectively. The AR of ResNet50 + CBPCM is improved by 0.056, 0.046, and 0.058 compared with ResNet50 + FPN. The mAP of ReResNet + CBPCM is improved by 0.028, 0.053, and 0.033 compared with ReResNet + ReFPN, which proves the effectiveness of CBPCM. On the one hand, CBPCM uses bidirectional cross-layer aggregation to gather fire details from both the forward and reverse directions, improving the representation of contextual global semantics; at the same time, the cross-layer information flow reduces the use of irrelevant information and establishes effective long-range relationships between features at different scales. On the other hand, pyramid convolution enlarges the receptive field by increasing the convolution kernel size, which helps represent local features.

  2.

    On the FireAndSmokes dataset, the mAP of TMask-CAM + FPN is 0.04 and 0.038 higher than that of Swin-L + FPN and Swin-S + FPN, respectively. On the WildFurgFires dataset, the mAP of TMask-CAM + ReFPN is 0.095 and 0.113 higher than that of ReResNet + ReFPN and ResNet50 + FPN, respectively. This shows that TMask-CAM plays a positive role in the overall detection framework. The possible reason is that replacing the patch merging layer with Mask-CAM reduces the transmission and aggregation of irrelevant information; at the same time, dividing the features into four subspaces further improves the representation of local features and highlights the detailed information of the fire area.

Table 4 Experimental results of different feature extractors on the FireDets datasets. 'CBPCM' indicates our proposed cross-layer bidirectional aggregation pyramid convolution module
Table 5 Experimental results of different feature extractors on the WildFurgFires datasets
Table 6 Experimental results of different feature extractors on the FireAndSmokes datasets

Dynamic adaptive point representation and discrete point space constraint strategy (DAPSC)

The experimental results of our proposed DATFNets using dynamic adaptive point representation and discrete point space constraint strategy (DAPSC) on three baseline datasets are shown in Table 7.

Table 7 Experimental results of the DAPSC strategy on the three baseline datasets

Table 7 shows the following. In the FireDets dataset, the AR and mAP of TMask-CAM + CBPCM + DPSC are lower than those of TMask-CAM + CBPCM + DAP by 0.013 and 0.007, respectively, and higher than those of Swin-L + CBPCM + DPSC by 0.013 and 0.023, respectively. In the WildFurgFires dataset, the mAP of TMask-CAM + CBPCM + DPSC is 0.003 higher than that of Swin-L + CBPCM + DAP but 0.02 lower than that of TMask-CAM + CBPCM + DAP. In the FireAndSmokes dataset, the AR and mAP of TMask-CAM + CBPCM + DAP are improved by 0.015 and 0.012 compared with TMask-CAM + CBPCM + DPSC. This indicates two points: first, the DPSC and DAP modules both play a positive role in the overall performance of the proposed framework and enable more precise location and scope assessment of fire areas; second, the positive role played by DAP is greater than that of DPSC, thanks to the dynamic adaptive point allocation strategy that provides the detector with higher-quality fire samples, which helps locate the fire area.

Reconstructed loss function

To verify the effectiveness of the designed loss function, we evaluated it on the three baseline datasets. The experimental results are shown in Table 8.

In Table 8, \(L_c\) means using QualityFocalLoss as the classification loss. \(L_r\) indicates using GIoULoss as the regression loss. \(L_c^s\) represents the classification loss for each feature extraction stage of our design. \(L_{rc}\) represents the spatially constrained classification loss we designed. \(L_{rr}\) represents the regression loss of our design.

Table 8 Experimental results for different loss functions

From Table 8, we can draw the following conclusions:

  1.

    On the three baseline datasets, the mAP of \(L_c^s + L_c + L_r\) is higher than that of \(L_c + L_r\) by 0.008, 0.006, and 0.007, respectively. This shows that introducing a loss function at each feature extraction stage helps improve the overall performance of the fire detection framework: the per-stage loss effectively prevents the network from falling into a local optimum and overfitting, thereby improving the representation ability of fire target features.

  2.

    In the WildFurgFires dataset, the mAP of \(L_c^s + L_{rc} + L_r\) is 0.003 and 0.005 higher than that of \(L_c^s + L_c + L_r\) and \(L_{rc} + L_r\), respectively. In the FireAndSmokes dataset, the AR of \(L_c^s + L_{rc} + L_r\) is 0.003 and 0.002 higher than that of \(L_c^s + L_c + L_r\) and \(L_{rc} + L_r\), respectively. This illustrates that the spatially constrained classification loss we designed can improve fire detection accuracy; it improves the transmission and convergence of detailed information in the network through spatial weighting.

  3.

    In the FireDets dataset, the AR and mAP of \(L_c^s + L_c + L_{rr}\) are improved by 0.006 and 0.009 compared with \(L_c + L_{rr}\). On the WildFurgFires dataset, the mAP of \(L_c^s + L_c + L_{rr}\) is improved by 0.002 compared with \(L_c^s + L_c + L_r\). This illustrates that our reconstructed regression loss function plays a positive role in the overall performance of the fire detection framework. In addition, on the FireAndSmokes dataset, the mAP of \(L_c^s + L_c + L_{rr}\) is 0.001 lower than that of \(L_c^s + L_{rc} + L_r\), which shows that the spatially constrained classification loss has a greater impact on the overall detection performance than the reconstructed regression loss. The visual demonstration of the different loss functions over 10 iterations is shown in Figs. 6, 7 and 8.

Fig. 6
figure 6

Visual demonstration of different loss functions on the FireDets dataset. Where, \(L_c^s\) represents the classification loss for each feature extraction stage of our design. \(L_{rc}\) represents the spatially constrained classification loss we designed. \(L_{rr}\) represents the regression loss of our design

Fig. 7
figure 7

Visual demonstration of different loss functions on the WildFurgFires dataset. Where, \(L_c^s\) represents the classification loss for each feature extraction stage of our design. \(L_{rc}\) represents the spatially constrained classification loss we designed. \(L_{rr}\) represents the regression loss of our design

Fig. 8
figure 8

Visual demonstration of different loss functions on the FireAndSmokes dataset. Where, \(L_c^s\) represents the classification loss for each feature extraction stage of our design. \(L_{rc}\) represents the spatially constrained classification loss we designed. \(L_{rr}\) represents the regression loss of our design

From Figs. 6, 7, and 8, we see that all loss functions show a decreasing trend as the number of iterations increases. It is worth noting that on the FireDets dataset, \(L_{rc}\) and \(L_{rr}\) increase when training reaches nine rounds. This may be because the FireDets dataset contains complex fire-like targets, which interfere with the extraction of fire targets.

Examples and discussion

We visually demonstrate fire target detection and evaluate the damaged area to further demonstrate the effectiveness of the proposed DATFNets fire detection framework. The visual demonstration on the different datasets is shown in Fig. 9.

Fig. 9
figure 9

Visual demonstration on different datasets with our proposed DATFNets method. Among them, FireDets, WildFurgFires and FireAndSmokes represent the different baseline datasets

As can be seen from Fig. 9, the DATFNets fire detection framework we proposed represents large-scale fire targets well and also has good detection performance on small-scale fire targets. On the one hand, we use masked context aggregation (Mask-CAM) in the feature extractor to model the global semantics of the context and refine local features by dividing subspaces, reducing the transmission of irrelevant information. On the other hand, in the detection phase, the dynamic adaptive point representation and allocation method is used to screen high-quality fire samples and impose penalty constraints on discrete points, further improving the representation of the salient features of the fire area and promoting the network's ability to distinguish between fire areas and background information. In addition, the designed spatially constrained loss function is beneficial for detecting fire targets.

Conclusions

To help firefighters and researchers formulate the best fire management strategies, reduce property losses, and ensure personnel safety when a fire occurs, this study develops a dynamic adaptive distribution transformer detection framework (DATFNets). In the feature extraction stage, the local details and contextual global semantics of the fire target are obtained through masked context aggregation and subspace division. In the detection stage, the dynamic adaptive point representation and allocation learning strategy is used to improve the representation of fire samples and constrain discrete point samples, while the designed spatial constraint loss function effectively optimizes and adjusts the overall framework. Finally, the proposed fire detection framework achieved the best performance in evaluations on three baseline datasets: FireDets, WildFurgFires, and FireAndSmokes.
