1 Introduction

Deep convolutional neural networks are extensively used in fire-related tasks [1,2,3] and other vision intelligence tasks [4,5,6,7], but their performance is contingent on the size and quality of the training dataset. Making fire detection algorithms more effective and versatile requires large amounts of data from diverse environments, such as factories, warehouses, and wildland areas. Because deliberately setting and spreading fires is impractical and dangerous, expanding existing fire datasets with simulated images remains a critical concern. However, using computer graphics software to create high-quality fire images is costly, time-consuming, and difficult to scale to real-world applications. Thus, generating fire images with deep learning networks is necessary to address the sampling imbalance between fire and non-fire image categories. Additionally, it is crucial to identify a fire outbreak at its inception; this early-detection goal typically drives research in related anomaly-detection fields. Accordingly, there is clear demand for systems that can detect a fire as soon as it occurs, when the visual cues are at their earliest stage, such as when the fire is still at its smallest size.

One simple solution is to combine fire images obtained in laboratories with the above-mentioned typical environments. Several traditional methods [8,9,10] are based on extracting a fire kernel from the original image (cutting) and pasting it into a background (patching). Levin et al. [8] introduced cost functions in the gradient domain to evaluate stitching quality, focusing on image similarity and seam visibility and optimizing for photometric and geometric consistency. RICAP [9] is a data augmentation method that constructs training images by randomly cropping and patching four images to enhance dataset variety in deep CNNs while mixing class labels for benefits akin to label smoothing. Walawalkar et al. [10] proposed Attentive CutMix, an augmentation strategy for CNNs that uses attention maps to selectively occlude the most discriminative parts of images, enhancing generalization over random occlusion methods. However, these approaches face the following challenges:

  • The areas between a fire kernel and the background scene cannot transition smoothly using only the cutting-patching process. Cutting out the entire fire region is challenging, as a blurry halo typically surrounds the fire and no clear segmentation boundary exists.

  • Reflections of neighboring objects might result in an unrealistic merged image. Several methods based on Poisson equations have been proposed to make the cropping border appear seamless and to further refine the texture and style of the blended part [11]. However, the intensity, color, and texture of the fire no longer match reality when the foreground is modified to be compatible with the background.

Many deep learning-based approaches have been proposed for synthetic fire image generation. These can be divided into two main categories: generative adversarial network (GAN)-based methods [12] and variational auto-encoder (VAE)-based methods [13]. GAN-based methods frequently encounter issues with training stability and uncontrollable synthetic images. Image-to-image translation [14] is a GAN application that has recently been widely used to convert non-fire images into fire images while preserving the background content. However, background distortions and discolorations are two typical challenges in such image translations, leading to unexpected and uncontrollable results. In contrast, the main problem with VAE-based methods is that they often produce blurry, unrealistic outcomes.

Despite their success in generating fire images, previously proposed approaches [15,16,17] have additional constraints. First, manually choosing a suitable position and reasonable scale requires a significant amount of time and reduces the variety of the synthetic fire images. Second, restrictions on the resolution of the generated fire kernel can lead to an incompatibility between the background and foreground in the synthesized image. Accordingly, this study presents a new fully automated approach for generating fire images, named "SynFAGnet", inspired by the various related methods and designed to be flexible under diverse conditions. The main contributions of the approach are as follows:

  • The structure of SynFAGnet includes two main parts: an object-scene placement network (OSPNet) and a local–global context-based GAN (LGC-GAN). The OSPNet automatically learns reasonable scales and positions for the fire kernel on a given background. The LGC-GAN generates more realistic synthetic fire images using styling information such as halos and reflections. In addition, the generator takes a mask as input, allowing it to produce the fire image while preserving the shape of the input object.

  • The feasibility of using SynFAGnet as a data augmentation technique for handling dataset imbalances and improving object detection and segmentation performance is verified.

The remainder of this paper is organized as follows. Section 2 presents a summary of the related research on the technologies employed in this study. Section 3 describes the data processing and the proposed framework for generating synthetic fire images at suitable positions and scales. Sections 4 and 5 analyze the experiments and the influence of the augmented data on fire detection and segmentation. Sections 6 and 7 conclude the paper with a summary and a discussion of future work.

2 Related Work

The study mainly focuses on finding suitable positions and scales for a fire kernel and synthesizing realistic-looking images. Therefore, this section primarily presents object-scene placement and image synthesis research.

2.1 Object-Scene Placement

Several studies have attempted to paste a foreground object into a given background scene at a suitable position and scale. Remez et al. [18] assumed that points along the same horizontal scanline have comparable depth; thus, the actual scale of the foreground object could be preserved when moving it along a scanline on the background. Georgakis et al. [19] proposed combining support-surface detection and semantic segmentation to find a suitable position for an object, whose size was decided based on the depth at that position and its original scale. In another study, a Gaussian mixture [20] was used to explicitly model the conditional probability distribution of bounding-box information given the background image and foreground type. With the increasing popularity of deep neural networks, some approaches have used deep learning to automatically predict object-scene placement. Li et al. [21] proposed a VAE for predicting the distribution of base classes and locations for humans in 3D indoor spaces while simultaneously reconstructing the pelvic joint coordinates and depths of the predicted positions. In another study, spatial pyramid pooling was incorporated into GANs [22] to inpaint pedestrians of diverse sizes at specified locations in a scene. Volokitin et al. [23] investigated using masked convolutions to aggregate the contextual information in a background semantic map as input for predicting a plausible bounding-box location.

2.2 Image Synthesis

Facebook AI extended the instance-conditioned GAN [24] to class-conditional image generation. This method enabled the proper integration of instances and class labels, thereby regulating the semantic content of the synthetic images. BlockGAN [25] was influenced by the computer graphics process; it trained object-aware 3D scene models from unlabeled 2D images and offered additional control over the 3D pose and identity of each object while preserving image realism. CSGNet and CSGNet-v2 [26] altered the latent regulations on smoke elements to constrain the smoke in generated images and create different smoke images or sequences. CycleGAN [27] introduced the cycle-consistency loss and identity mapping loss for converting non-wildfire images into wildfire images. In general, the synthetic images of CycleGAN are relatively blurry. Yang et al. [15] integrated an attention mechanism into the generator to improve the quality of the synthetic images and controlled the size of the fire using a conditional code. The global-local mask GAN (FGL-GAN) [16] used a fire mask as the generator input to preserve the fire information in composite images.

However, in the above approaches, the time required to manually select suitable positions and scales for the fire objects remains a significant concern. Additionally, previous experiments have not considered generating fire images at night; thus, their comparative results are less comprehensive. Therefore, SynFAGnet is proposed to address both issues concurrently.

2.3 Fire Detection and Segmentation on Augmented Datasets

Various image augmentation techniques have been proposed to enlarge datasets for specific tasks cost-effectively, and some of them have been investigated to enhance the efficiency of fire detection. Sousa et al. [28] proposed a transfer learning approach combined with data augmentation, evaluated using a tenfold cross-validation scheme. Their model employs color segmentation and feature description algorithms to partition the original image into new images of size 299 × 299. The robust data augmentation GAN (RDAGAN) [29] generates training data for object detection from a small dataset; training the object generation and image translation networks separately mitigates instability during training. The CycleGAN model has also been adapted to translate non-fire images into wildfire images [27]; a significant limitation is its uncontrollable fire placement, which makes annotating the augmented images for fire detection tasks challenging.

In fire segmentation tasks, Yang et al. [15] proposed a model using GANs and the attention mechanism to generate fire images within a warehouse. However, a notable limitation is that the model focuses on translating only the square region surrounding the inserted fire, resulting in a distinct boundary between this area and the overall background. Qin et al. [16] introduced a model to generate realistic-looking fire images encompassing flame effects. Utilizing a cut-and-paste method, their model superimposes flames onto images, then employs image translation to infuse halo and reflection effects. In their approach, the global coordination module enhances the integration of the fire and background, yielding a seamlessly natural composite image. Both studies primarily focus on indoor images in low-light conditions, which is a limitation. Additionally, the manual selection of fire locations is time-consuming.

3 Proposed Model

This paper proposes a data augmentation (DA) method that learns appropriate placements of fires in given background scenes without heavy human labeling. This study introduces the OSPNet, inspired by EfficientDet, for automatic fire-object placement in scenes, using a fire mask and a background image as inputs. The network comprises a Swin Transformer backbone, a BiFPN feature network, and a prediction network. Using the results obtained from the OSPNet, the generation network, LGC-GAN, performs local generation for realistic fire effects and a global combination to ensure seamless integration with the background. The details of the OSPNet and the LGC-GAN are discussed below.
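As an orientation aid, the following sketch chains the two stages at inference time. The class and function names (`DummyOSPNet`, `DummyGenerator`, `paste`) are hypothetical stand-ins for the trained components described above, not a released implementation; in the actual model, the generator operates on the local crop and applies the global combination described in Sect. 3.3.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the two trained networks described in the text.
class DummyOSPNet(nn.Module):
    """Predicts a bounding box (x, y, w, h) in normalized coordinates."""
    def forward(self, fire_mask, background):
        return torch.tensor([0.4, 0.6, 0.2, 0.2])  # fixed box, for illustration only

class DummyGenerator(nn.Module):
    """Would add halo/reflection to the composite (identity here)."""
    def forward(self, composite):
        return composite

def paste(fire_rgb, fire_mask, background, box):
    """Cut-paste the masked fire into the background at the predicted box."""
    _, H, W = background.shape
    x, y, w, h = (box * torch.tensor([W, H, W, H])).int().tolist()
    fire = torch.nn.functional.interpolate(
        (fire_rgb * fire_mask).unsqueeze(0), size=(h, w)).squeeze(0)
    mask = torch.nn.functional.interpolate(
        fire_mask.unsqueeze(0), size=(h, w)).squeeze(0)
    out = background.clone()
    region = out[:, y:y + h, x:x + w]
    out[:, y:y + h, x:x + w] = mask * fire + (1 - mask) * region
    return out

# Inference-time pipeline: placement -> cut-paste -> generation.
ospnet, generator = DummyOSPNet(), DummyGenerator()
background = torch.rand(3, 256, 256)
fire_rgb, fire_mask = torch.rand(3, 64, 64), torch.ones(1, 64, 64)
box = ospnet(fire_mask, background)
composite = paste(fire_rgb, fire_mask, background, box)
synthetic = generator(composite)   # halo/reflection rendering happens here in the real model
print(synthetic.shape)             # torch.Size([3, 256, 256])
```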

3.1 Data Acquisition

In the training phase, the datasets must include paired samples consisting of a fire kernel (foreground), a background scene without any fire, and designated locations for proper fire placement. Currently, no public dataset meets this requirement. Accordingly, this study used 25 fire videos of different scenes recorded in experiments, converted them into 50,000 images (the Visionin dataset), and divided them into training, validation, and test sets at a ratio of 8:1:1. Furthermore, several public datasets were collected to be used as part of the training sets: the AIHub dataset [30], the VOC2020 dataset [31], and the Dunning dataset [32]. The ImageHash library was then used to filter out structurally similar images from each dataset to enhance the generalizability of the model.
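For reference, near-duplicate filtering with the ImageHash library can be performed roughly as sketched below; the perceptual-hash variant, Hamming-distance threshold, and directory layout are assumptions rather than values reported here.

```python
from pathlib import Path

import imagehash
from PIL import Image

def filter_near_duplicates(image_dir: str, max_distance: int = 5):
    """Keep only images whose perceptual hash differs from every kept image
    by more than `max_distance` bits (a hypothetical threshold)."""
    kept_paths, kept_hashes = [], []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))
        if all(h - other > max_distance for other in kept_hashes):
            kept_paths.append(path)
            kept_hashes.append(h)
    return kept_paths

# Example: unique = filter_near_duplicates("visionin_frames/")
```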

The overall process is described in Fig. 1. The details are as follows.

  1.

    This study used a segmented fire mask to cut out the object area and retained the original bounding boxes as the ground truth. The fire segmentation masks were obtained from the labeled AI Hub dataset, and a mask region-based convolutional neural network (Mask R-CNN)-based segmentation model was trained on this dataset.

  2.

    A pre-trained inpainting network named LaMa was deployed in this study [33]; it utilized the original image and corresponding bounding box as inputs to fill the holes of the occluded region and generate a clean background without the fire.

  3.

    The fire segmentation mask was pasted directly onto the background image at its original scale and location to create a pair that matches the original image (a minimal compositing sketch follows this list).
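The sketch below illustrates step 3 with NumPy, assuming the segmentation and inpainting outputs from steps 1 and 2 are already available; the function name and array conventions are illustrative only.

```python
import numpy as np

def compose_training_pair(original, fire_mask, clean_background):
    """Assemble one (input, target) pair for training.

    original:         H x W x 3 uint8 image containing a real fire
    fire_mask:        H x W binary mask from the segmentation step
    clean_background: H x W x 3 output of the inpainting step (fire removed)
    """
    mask3 = np.repeat(fire_mask[..., None], 3, axis=2).astype(np.float32)
    # Paste the segmented fire back onto the clean background at the same
    # location and scale; this composite is the generator input, while the
    # original image serves as the real target.
    composite = mask3 * original + (1.0 - mask3) * clean_background
    return composite.astype(np.uint8), original

# Example with random data standing in for the real pipeline outputs:
H, W = 256, 256
original = np.random.randint(0, 256, (H, W, 3), dtype=np.uint8)
mask = np.zeros((H, W), dtype=np.uint8); mask[100:150, 120:180] = 1
background = np.random.randint(0, 256, (H, W, 3), dtype=np.uint8)
inp, target = compose_training_pair(original, mask, background)
```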

Figure 1: Data acquisition pipeline

Samples of the datasets utilized to train the OSPNet and LGC-GAN are displayed in Table 1.

Table 1 Summary of Training Datasets for SynFAGnet

3.2 Object-Scene Placement Network

In general, this study proposes a deep learning network for automatic object-scene placement, predicting proper positions in a given scene and suitable scales for fire objects. The proposed network is inspired by EfficientDet [34]; however, instead of using a single image as input, the network takes two inputs: a fire mask and a background scene image. The placement network consists of three main parts: a Swin Transformer-based backbone [35], a bidirectional feature pyramid network (BiFPN)-based feature network [34], and a prediction network. The detailed architecture of the OSPNet is shown in Fig. 2.

Figure 2: Architecture of the object-scene placement network (OSPNet)

3.2.1 Swin Transformer

The OSPNet uses the Swin Transformer (a hierarchical transformer whose representation is computed with shifted windows) as a general-purpose backbone. The shifted windowing scheme achieves high efficiency by restricting self-attention computation to non-overlapping local windows while still allowing cross-window connections. This hierarchical architecture can model at various scales and has linear computational complexity with respect to image size. The Swin Transformer has demonstrated strong performance in object detection and semantic segmentation models. In the proposed OSPNet, the backbone extracts valuable features from the given fire and background scene images.

3.2.2 BiFPN Layer

The next part of the proposed network is the BiFPN-based feature network; it takes multiple levels of combined object-background features from the backbone as input and produces a list of fused features. The BiFPN improves cross-scale connections by removing nodes with only one input edge, adding an edge from the original input to the output node when they are at the same level, and treating each bidirectional (top-down and bottom-up) path as a single feature network layer. It is well suited for simple and fast multi-scale feature fusion.
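For illustration, the sketch below shows the fast normalized (weighted) feature fusion used at each BiFPN node in EfficientDet; whether OSPNet uses exactly this fusion variant is not stated here, so the code should be read as an EfficientDet-style example rather than the network's actual layer.

```python
import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    """Weighted feature fusion at a BiFPN node (EfficientDet-style).

    Each input feature map gets a learnable non-negative weight; the weights
    are normalized to sum to one before the maps are combined.
    """
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, features):
        w = torch.relu(self.weights)      # keep weights non-negative
        w = w / (w.sum() + self.eps)      # fast normalized fusion
        return sum(wi * f for wi, f in zip(w, features))

# Example: fuse two same-resolution feature maps from different BiFPN paths.
fuse = FastNormalizedFusion(num_inputs=2)
p_td = torch.rand(1, 64, 32, 32)   # top-down path feature
p_in = torch.rand(1, 64, 32, 32)   # original input feature
out = fuse([p_td, p_in])           # shape (1, 64, 32, 32)
```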

3.2.3 Prediction Network

The final prediction network uses the fused features to predict the relevant positions and scales of the fire on the given background. Before producing the final outputs, it employs two fully connected layers that operate on the outputs of the last BiFPN layer: one is used to predict the bounding-box coordinates, whereas the other predicts the class and its confidence score. The bounding boxes are then computed by applying one convolution at each position of the feature map, while another convolution predicts the class probabilities at each location on the grid.
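A minimal sketch of such a two-branch prediction head is given below; the channel width, anchor count, and single fire class are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PredictionHeads(nn.Module):
    """Illustrative box-regression and classification heads over fused features.

    One 3x3 convolution predicts four box offsets per location; another predicts
    per-class confidence scores. Channel counts are assumptions.
    """
    def __init__(self, in_channels: int = 64, num_classes: int = 1, num_anchors: int = 1):
        super().__init__()
        self.box_head = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=3, padding=1)
        self.cls_head = nn.Conv2d(in_channels, num_anchors * num_classes, kernel_size=3, padding=1)

    def forward(self, fused):
        boxes = self.box_head(fused)                  # (N, 4, H, W) box offsets
        scores = torch.sigmoid(self.cls_head(fused))  # (N, num_classes, H, W) confidences
        return boxes, scores

heads = PredictionHeads()
fused = torch.rand(2, 64, 32, 32)   # fused features from the BiFPN
boxes, scores = heads(fused)
```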

3.3 Local–Global Context-Based Generative Adversarial Network (GAN)

3.3.1 Model Architecture

This study proposes the LGC-GAN architecture, inspired by Johnson et al. [36], for generating fire images. The structure is shown in Fig. 3. The proposed model contains a generator and a discriminator. The generator uses nine residual blocks for training images with a resolution of 256 × 256 or higher. For the discriminator, 70 × 70 PatchGANs [7] are used to identify whether the 70 × 70 overlapping image patches are real or fake.

Figure 3: Architecture of the local–global context-based generative adversarial network (LGC-GAN)
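For concreteness, the following sketch shows the two standard building blocks referenced above: a Johnson-style residual block and a 70 × 70 PatchGAN discriminator. The layer widths and the six-channel conditional input follow common pix2pix/CycleGAN defaults and are assumptions rather than the exact LGC-GAN configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block of a Johnson-style generator (instance norm + ReLU)."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(channels, channels, 3),
            nn.InstanceNorm2d(channels), nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1), nn.Conv2d(channels, channels, 3),
            nn.InstanceNorm2d(channels))

    def forward(self, x):
        return x + self.block(x)

class PatchDiscriminator(nn.Module):
    """70x70 PatchGAN: classifies overlapping patches as real or fake."""
    def __init__(self, in_channels: int = 6):  # condition and image concatenated (assumption)
        super().__init__()
        def layer(cin, cout, norm=True):
            mods = [nn.Conv2d(cin, cout, 4, stride=2, padding=1)]
            if norm:
                mods.append(nn.InstanceNorm2d(cout))
            mods.append(nn.LeakyReLU(0.2, inplace=True))
            return mods
        self.net = nn.Sequential(
            *layer(in_channels, 64, norm=False), *layer(64, 128), *layer(128, 256),
            nn.Conv2d(256, 512, 4, stride=1, padding=1), nn.InstanceNorm2d(512),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(512, 1, 4, stride=1, padding=1))  # one realism score per patch

    def forward(self, x):
        return self.net(x)

disc = PatchDiscriminator()
pair = torch.rand(1, 6, 256, 256)   # e.g. generator input and output concatenated
patch_scores = disc(pair)           # (1, 1, 30, 30) grid of patch realism scores
```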

The proposed synthesis network comprises two generation stages: local generation and global combination. By concentrating on local information, the local generation stage aims to render more realistic reflections and halos surrounding the fires. However, if only this stage is used, the local part and the overall scene will have inconsistent boundaries, resulting in an unrealistic synthetic image. To ensure that the synthetic image is realistic and natural-looking, the global combination stage smooths the boundaries between the local part and the entire background.

The steps in the fire image synthesis inside the generator are as follows:

  1.

    After the OSPNet determines a suitable position and scale for the fire to match the background scene, the local image Ii obtained using the cut-paste algorithm is input into the local generator.

  2.

    The local generation creates a synthetic fire image with a halo and reflection Ilocal_G. At this stage, using both Ilocal_G and the actual local image Ilocal_R as inputs, the discriminator evaluates the realism of the image.

  3.

    Ilocal_G combined with the background image Ibg is input into the combination module, which maps the local image back to the overall scene Iglobal_G. The discriminator then penalizes any inconsistencies between Iglobal_G and the real global image Iglobal_R, thereby encouraging a seamless transition in the overall scene (a minimal sketch of this two-stage flow follows the list).
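The listed steps can be summarized by the following minimal sketch; the stand-in modules (`local_generator`, `combine_module`) are placeholders for the trained networks, and the box coordinates are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: in the real model these are the trained local
# generator and the combination module described above.
local_generator = nn.Identity()   # renders halo/reflection on the local crop
combine_module = nn.Identity()    # blends the local result back into the scene

def synthesize(i_local, i_bg, box):
    """Two-stage synthesis: local generation, then global combination."""
    x, y, w, h = box
    i_local_g = local_generator(i_local)              # step 2: local fire + halo
    i_global_g = i_bg.clone()
    i_global_g[:, :, y:y + h, x:x + w] = i_local_g    # step 3: map back to the scene
    return combine_module(i_global_g)                 # smooth the boundary

i_local = torch.rand(1, 3, 64, 64)    # cut-paste crop around the placed fire
i_bg = torch.rand(1, 3, 256, 256)     # clean background
synthetic = synthesize(i_local, i_bg, box=(96, 128, 64, 64))
```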

3.3.2 Loss Function

The combined loss deployed in this study contains an adversarial loss, local loss, and global loss. The overall loss is estimated as follows:

$$\mathcal{L}_{G}=\mathcal{L}_{adv_{G}}+\lambda \cdot \mathcal{L}_{local}$$
(1)
$$\mathcal{L}_{D}=\mathcal{L}_{adv_{D}}+\lambda \cdot \mathcal{L}_{global}$$
(2)

The objective of the adversarial loss is to make the discriminator and generator learn in opposition to one another. While the discriminator attempts to distinguish between the generated Ilocal_G and actual local images Ilocal_R, the generator attempts to make the synthetic fire image look as real as possible. The adversarial loss functions of the final generator and discriminator are as follows:

$$\mathcal{L}_{adv_{G}}=\mathbb{E}\left[\left(D\left(I_{i},G\left(I_{i}\right)\right)-1\right)^{2}\right]$$
(3)
$$\mathcal{L}_{adv_{D}}=\mathbb{E}\left[\left(D\left(I_{i},I_{local_{R}}\right)-1\right)^{2}\right]+\mathbb{E}\left[D\left(I_{i},G\left(I_{i}\right)\right)^{2}\right]$$
(4)

where Ilocal_R is the real local image, and Ii is the generator input.

An L1 loss is used to measure the pixel-by-pixel difference between the generated and real images. The local L1 loss between Ilocal_G and the target Ilocal_R is calculated as follows:

$$\mathcal{L}_{local}=\left\|I_{local_{G}}-I_{local_{R}}\right\|_{1}$$
(5)

To achieve a seamless transition in the composed image, the global discriminator tries to distinguish between the generated image Iglobal_G and the real global image Iglobal_R. The mean squared error (MSE) is used as the objective function of the global adversarial loss:

$$\mathcal{L}_{global}=\mathbb{E}\left[D\left(I_{global_{R}},I_{global_{G}}\right)^{2}\right]$$
(6)
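A PyTorch sketch of Eqs. (1)-(6) is given below. How the discriminators consume their two arguments (here, channel-wise concatenation) and the λ values are assumptions, since they are not fixed by the equations above.

```python
import torch
import torch.nn.functional as F

def generator_loss(d_local, i_in, i_local_g, i_local_r, lam: float = 100.0):
    """L_G = L_adv_G + lambda * L_local (Eqs. 1, 3, 5); lambda = 100 is an assumed value."""
    pred_fake = d_local(torch.cat([i_in, i_local_g], dim=1))     # D(I_i, G(I_i))
    adv_g = F.mse_loss(pred_fake, torch.ones_like(pred_fake))    # E[(D(I_i, G(I_i)) - 1)^2]
    l_local = F.l1_loss(i_local_g, i_local_r)                    # ||I_local_G - I_local_R||_1
    return adv_g + lam * l_local

def discriminator_loss(d_local, d_global, i_in, i_local_g, i_local_r,
                       i_global_g, i_global_r, lam: float = 1.0):
    """L_D = L_adv_D + lambda * L_global (Eqs. 2, 4, 6)."""
    pred_real = d_local(torch.cat([i_in, i_local_r], dim=1))
    pred_fake = d_local(torch.cat([i_in, i_local_g.detach()], dim=1))
    adv_d = (F.mse_loss(pred_real, torch.ones_like(pred_real)) +   # (D - 1)^2 term
             F.mse_loss(pred_fake, torch.zeros_like(pred_fake)))   # D^2 term
    pred_global = d_global(torch.cat([i_global_r, i_global_g.detach()], dim=1))
    l_global = (pred_global ** 2).mean()                           # Eq. (6)
    return adv_d + lam * l_global
```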

4 Experiment Results

4.1 Quantitative Evaluation Metrics

The Fréchet inception distance (FID) [37] and learned perceptual image patch similarity (LPIPS) [38] are two evaluation criteria that are close to human perception and are important for evaluating image quality. The FID computes the Fréchet distance between Gaussian distributions fitted to the Inception network activations of the synthesized and real images. The FID can be expressed as shown in Eq. (7):

$$FID_{r,s}=\left\|\mu_{r}-\mu_{s}\right\|_{2}^{2}+\mathrm{Tr}\left(\Sigma_{r}+\Sigma_{s}-2\sqrt{\Sigma_{r}\Sigma_{s}}\right)$$
(7)

In the above, (µr, Σr) and (µs, Σs) denote the means and covariances of the real and synthetic feature distributions, respectively. The lower the FID value, the better the quality of the synthetic image.
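A minimal NumPy/SciPy sketch of Eq. (7) is shown below, assuming the Inception activations for the real and synthetic images have already been extracted; the feature dimensionality in the example is reduced for speed.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(feats_real: np.ndarray, feats_syn: np.ndarray) -> float:
    """Eq. (7) computed from pre-extracted Inception activations of shape (N, D)."""
    mu_r, mu_s = feats_real.mean(axis=0), feats_syn.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_s = np.cov(feats_syn, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_s)
    if np.iscomplexobj(covmean):          # drop tiny imaginary parts introduced by sqrtm
        covmean = covmean.real
    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(sigma_r + sigma_s - 2.0 * covmean))

# Example with random 64-dimensional features standing in for Inception activations:
fid = frechet_inception_distance(np.random.randn(200, 64), np.random.randn(200, 64))
print(round(fid, 3))
```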

The L2 distance between the features extracted from a pre-trained AlexNet is used to measure the similarity between the synthetic and real images. The average LPIPS distance is calculated in Eq. (8) as follows:

$$d\left(x,x_{0}\right)=\sum_{l}\frac{1}{H_{l}W_{l}}\sum_{h,w}\left\|w_{l}\odot\left(\hat{y}_{hw}^{l}-\hat{y}_{0hw}^{l}\right)\right\|_{2}^{2},\qquad LPIPS=\frac{1}{N}\sum d\left(real,syn\right)$$
(8)
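In practice, the AlexNet-based LPIPS of Eq. (8) can be computed with the publicly available lpips package, roughly as follows; the random tensors merely stand in for real and synthetic images scaled to [-1, 1].

```python
import torch
import lpips  # pip install lpips

# AlexNet-based LPIPS as in Eq. (8); inputs are expected to lie in [-1, 1].
loss_fn = lpips.LPIPS(net='alex')

real = torch.rand(1, 3, 256, 256) * 2 - 1    # stand-in for a real image
syn = torch.rand(1, 3, 256, 256) * 2 - 1     # stand-in for a synthetic image
distance = loss_fn(real, syn)                # one pairwise LPIPS distance
print(distance.item())
```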

In addition, a visual assessment was conducted in which participants were asked to identify the real image from two options (real and generated) with identical size, background, and fire placement. Fifteen subjects participated in this study to assess how visually realistic the synthetic results are. This assessment was designed to internally validate the images generated by our framework against the real images in terms of subjective perception. For the global images, the participants mainly assessed whether the composition and blending of the fire and background matched reality; for the local images, they mainly compared the generated fire halos and reflections.

$$Realistic=\frac{Incorrect\ predictions}{Total\ cases}$$
(9)

4.2 Quantitative Evaluation

Table 2 shows the FID and LPIPS values between the real images and the images generated by the flow-based-importance GAN, FGL-GAN, and our model. Compared with the other networks, the images generated by SynFAGnet achieve the best FID and LPIPS values, at 17.232 and 0.077, respectively. This indicates that the halos and reflections rendered by the proposed model are the most realistic and distinct. With the local-global structure, the generator can limit background color changes around the fire, thereby producing more realistic results. Expanded datasets with diverse background scenes are expected to improve this capability further.

Table 2 Evaluation of the Generated Fire Kernel Quality Using the Proposed Method and Other Methods

Additionally, when considering the realistic scores, participants consistently rated images generated by SynFAGnet as more in line with human visual perception than those from the other networks. This assessment encompasses both the overall context of the images and the detailed rendering of specific elements such as fire halos and reflections. The same evaluation approach has been used in several state-of-the-art studies [15, 16] to assess the realism of their results, and it was therefore adopted here with the same motivation. In summary, SynFAGnet stands out for its ability to capture nuanced visual details, contributing to its perceived realism and aesthetic appeal compared to images generated by the alternative networks.

4.3 Qualitative Evaluation

The proposed model was trained on an NVIDIA GTX 1050 Ti GPU in this experiment. The loss curves obtained while training the SynFAGnet model are depicted in Fig. 4. Both the generator and discriminator losses fluctuate but decrease significantly over the 1000 training epochs, indicating successful model training. The results from training epochs 0, 200, 400, 600, 800, and 1000 are shown in Fig. 5. As the model performance improves over time, the quality of the generated fire gradually improves and the distortion decreases. The transition between the fire object and the background image becomes smoother.

Figure 4: The loss curves for the SynFAGnet model after completing 1000 training epochs

Figure 5: Results of the training phase

The qualitative evaluation is performed on unseen test sets. The backgrounds of the unseen sets differ from those of the training sets, and the fire masks are chosen randomly. The OSPNet finds the appropriate positions and scales for a given fire mask. One example is shown in Fig. 6. The proposed SynFAGnet produces convincing synthetic fire images in both daytime and nighttime environments. Based on the given mask, Fig. 6c shows the proper positions and scales of the given fire on each corresponding background. The model selects the most suitable position and scale to generate a realistic-looking fire image. Among the images, Fig. 6b and d show the scene before and after the insertion of a fire kernel using the cut-paste algorithm, respectively. The cut-paste algorithm cannot reproduce the halo and reflection on the new background, resulting in unrealistic synthetic images. In general, two factors affect the realism of a synthetic fire image: the light intensity and the distance. Depending on the time of day, the reflection and halo of the fire change. The sun serves as the primary source of illumination during the day, making the reflection difficult to observe; in contrast, as night falls, the fire gradually takes over as the primary source of illumination, making the reflection increasingly evident. In addition, the rendered reflection of the fire gradually attenuates with increasing distance from the fire. Using the local loss, the fire halo and reflection rendered by SynFAGnet are more plausible. With the global loss, the SynFAGnet-synthesized images have a smooth transition between the background and the fire. This makes the synthetic images appear more realistic and natural, as shown in Fig. 6f.

Figure 6: Qualitative results of image synthesis

In more detail, the synthetic results obtained using only the local loss and those combining the local-global loss are compared in Fig. 7. Without the global loss, the synthetic images have a clear boundary between the fire reflection and the background and are relatively blurry. In contrast, applying the global loss ensures seamless blending between the local image and the overall scene, though it creates a slight sharpness discrepancy between them. Figure 7 demonstrates that the proposed method can maintain the background image quality while enhancing the realism of the fire.

Figure 7: Comparison results of local-patching and local–global images

5 Ablation Study

An ablation study was also conducted to assess the contribution of each module in the proposed method and the influence of the augmented data on fire detection/segmentation. The study comprised five cases: the real dataset only, OSPNet only (patching), LGC-GAN only (random placement), no global loss (local + patching), and the full SynFAGnet.

  • Patching DA: Once the appropriate scale and location were determined using the object placement model, the fire kernel was inserted into the background through a cut-and-paste algorithm.

  • Random Placement DA: The location and scale of the fire were randomly selected, with the random scale being bounded by the extreme fire size in the dataset.

  • Local + Patching DA: After selecting the suitable scale and location using the object placement model, only the local fire image Ilocal_G was generated; it was then placed in the background at the specified location without the global loss.

In the present study, a total of 4000 images were randomly chosen from the AIHub and Visionin datasets. These were partitioned into a training set of 3500 images and a test set of 500 images. The baseline model was trained exclusively on this real training set. To evaluate the efficacy of SynFAGnet, an additional 1500 images were randomly sampled from the aforementioned datasets to serve as augmented data. Of these, 1000 augmented images were incorporated into the training set, while the remaining 500 were allocated to the test set. Samples of each DA dataset are shown in Fig. 8. Comparing the OSPNet-only variant with SynFAGnet, relying solely on the OSPNet fails to generate the halo and reflection around the fire in the new background, resulting in an unrealistic image; in contrast, SynFAGnet creates a seamless transition between the fire and the background by producing the halo and reflection effects. Comparing the LGC-GAN-only variant with SynFAGnet, the fire has a halo and reflection effect, but the background is distorted owing to improper placement. This highlights the importance of finding a suitable location, as it preserves the quality of the background image. Comparing the variant without the global loss with SynFAGnet, the synthetic images without the global loss show a noticeable boundary between the local image and the background. By incorporating the global loss into the network, the blending of the fire with the background becomes smoother.

Figure 8: Samples of augmented datasets

5.1 Contribution of Augmented Data on Fire Detection

“You Only Look Once” V4 [39] is utilized in this study to verify the benefits of our proposed method in improving the performance of fire detection networks, with the mean average precision (mAP@0.5) serving as the metric for evaluation.

As seen in the last row of Table 3, the SynFAGnet-augmented images enhance fire detection relative to the baseline. By generating rare object-and-context scenes to balance the bias in the initial dataset, our proposed augmentation method can improve performance on rare and difficult detection cases, such as high-transparency fires. The model trained on patching-augmented images is limited to identifying high-intensity pixels and disregards the halo surrounding the fire. This indicates that the accuracy of the detection model could be compromised if the fire has a high level of translucency. The random placement baseline shows how the contextual relationship might impact the DA for object detection: the quality of the augmented images could be significantly enhanced by locating the object in the appropriate area. Comparing the local loss alone with the combined local-global loss shows that making the transition between the fire and the overall scene as seamless as possible further improves the quality of the augmented dataset and thereby the performance of fire detection. From the visual comparison in Fig. 9, the models trained on either the real images or the random placement-augmented images misclassify the reflection as fire pixels, resulting in a bounding box that covers an area too large for the actual fire scale. Using the local and patching-augmented images makes it challenging for the model to accurately identify high-transparency fire pixels.

Table 3 Detection and Segmentation Results With/Without Data Augmentation (DA)
Figure 9: Visual comparison showing the results from the ground truth (white box), the baseline (blue box), patching (black box), random placement (yellow box), local and patching (green box), and the proposed approach (red box) using "You Only Look Once" V4. The confidence score and the IoU with respect to the ground truth are noted in each image

5.2 Contribution of Augmented Data on Instance Segmentation

In this experiment, the instance segmentation algorithm Mask R-CNN [40] was trained on all the augmented datasets with mAP@0.5 as the evaluation metric. Table 3 compares all the augmentation methods and indicates that our proposed approach is the strongest, producing the best mAP@0.5. The quantitative results for both tasks suggest that finding suitable positions and scales for given fires according to the background context may reduce dataset bias and enhance accuracy in special cases.

The segmentation results for the ground reflections can be observed in the visual comparison in Fig. 10. Owing to training on patch-augmented images, the model experiences challenges in detecting pixels with low intensity, so the segmented mask does not fully capture the extent of the actual fire. Moreover, the model may produce incorrect segmentations if the region surrounding the fire-reflected light is excessively bright. The first two rows demonstrate that, by training on images augmented with the SynFAGnet method, the Mask R-CNN model can properly classify the parts originally belonging to the fire as fire pixels, resulting in a more completely retained fire region. The remaining rows show that training on the real dataset, or combining it with the random placement method, fails to distinguish the fire reflection effectively. Moreover, training on augmented images without the global loss leads to incorrectly classifying parts of the original fire as non-fire pixels, in addition to misclassifying non-fire pixels owing to noise artifacts between the local image and the overall scene in the training set. In conclusion, the accuracy of the fire recognition network can be significantly improved by training it with augmented images that are as similar to the real world as possible.

Figure 10: Segmentation results of the mask region-based convolutional neural network (Mask R-CNN). The confidence score and the IoU with respect to the ground truth are noted in each image

6 Limitations and Future Works

  1.

    Dataset exploration: The experiments on synthesizing realistic fire images were conducted primarily using the public datasets (AI Hub, VOC2020, Dunning) and the private Visionin dataset. These datasets predominantly contain images of outdoor fires. For a more comprehensive study, data from varied environments, such as indoor scenes and warehouses, must be collected and evaluated.

  2.

    Robustness towards low-light conditions: Efforts will be made to enhance the performance of SynFAGnet in generating fire images under challenging conditions, especially in low-light situations where the current results are unsatisfactory. This will be achieved by expanding the training dataset to include a broader range of scenarios in terms of quantity and diversity to optimize the model performance.

  3.

    Fire detection under diverse scenarios: The scope of SynFAGnet will be expanded beyond fire generation. Smoke detection is just as important as fire detection for fire safety, as smoke is often the initial sign of a fire. To address this, SynFAGnet will be extended to generate synthetic smoke data and to integrate the fire and smoke data, improving its capability to support fire detection in different scenarios.

  4.

    Distributed endpoint: Currently, the proposed system is a proprietary offline fire generation system that runs on on-premises servers. Future work will focus on providing a distributed endpoint so that the system can be used ubiquitously online.

7 Conclusion

This study proposed a fully automated synthetic fire generation network that produces realistic fire-burning images in new scenes as a DA method. Based on two separate networks, the proposed approach first automatically finds suitable positions and scales for a given fire mask and then generates realistic-looking fire images with high-quality halos and reflections. Compared with existing methods, the proposed method addresses two major challenges in fire image generation. First, the OSPNet learns the distribution of diverse and plausible locations and scales at which a given fire can be placed in the background. Second, the LGC-GAN module takes the fire mask as input and uses a local-global structure to enhance the controllability of the fire shape and generate a more realistic synthetic image while retaining high-quality background details. Compared to previous methods, SynFAGnet can automatically generate realistic, high-quality flame images from both global and local perspectives in various environments. Qualitative evaluations reveal that SynFAGnet outperforms existing methods in generating realistic-looking images under varied conditions. Quantitatively, SynFAGnet achieves scores of 17.232 on FID, 0.077 on LPIPS, and 0.67 in user evaluations, surpassing other methods on all metrics. Furthermore, networks trained with the SynFAGnet-augmented dataset show improvements of 32.18% and 33.11% (mAP@0.5) in detection and segmentation, respectively, compared to training on real datasets only.