1 Introduction

Deep convolutional neural networks are extensively used in fire-related tasks [1,2,3] and other vision intelligence tasks [4,5,6,7], but their performance is contingent on the size and quality of the training dataset. Making fire detection algorithms more effective and versatile requires large amounts of data from diverse environments, such as factories, warehouses, and wildland areas. Because deliberately setting and spreading fires is impractical and dangerous, expanding existing fire datasets with simulated images remains a critical concern. However, using computer graphics software to create high-quality fire images is costly, time-consuming, and difficult to scale to real-world applications. Thus, generating fire images with deep learning networks is necessary to address the sampling imbalance between fire and non-fire image categories. Additionally, it is crucial to identify a fire outbreak at its inception; this early-detection goal typically drives research in related anomaly-detection fields. Accordingly, there is clear demand for systems that can detect a fire as soon as it occurs, when the visual cues are at their earliest stage, such as when the fire is still at its smallest size.

One simple solution is to combine fire images obtained in laboratories with the above-mentioned typical environments. Several traditional methods [8,9,10] are based on extracting a fire kernel from the original image (cutting) and pasting it into a background (patching). Levin et al. [8] introduced cost functions in the gradient domain to evaluate stitching quality, focusing on image similarity and seam visibility and optimizing for photometric and geometric consistency. RICAP [9] is a data augmentation method that constructs training images by randomly cropping and patching four images to enhance dataset variety in deep CNNs while mixing class labels for benefits akin to label smoothing. Walawalkar et al. [10] proposed Attentive CutMix, an augmentation strategy for CNNs that uses attention maps to selectively occlude the most discriminative parts of images, enhancing generalization over random occlusion methods. However, these approaches face the following challenges:

  • The areas between a fire kernel and the background scene cannot transition smoothly using only the cutting-patching process. Cutting out the entire fire region is challenging, as a blurry halo typically surrounds the fire and no clear segmentation boundary exists.

  • Reflections of neighboring objects might result in an unrealistic merged image. Several methods based on Poisson equations have been proposed to make the cropping border appear seamless and to further refine the texture and style of the blended part [11]. However, the intensity, color, and texture of the fire no longer match reality when the foreground is modified to be compatible with the background.

Many deep learning-based approaches have been proposed for synthetic fire image generation. These can be divided into two main categories: generative adversarial network (GAN)-based methods [12] and variational auto-encoder (VAE)-based methods [13]. GAN-based methods frequently encounter issues with training stability and uncontrollable synthetic images. Image-to-image translation [14] is a GAN application that has recently been widely used to convert non-fire images into fire images while preserving the background content. However, background distortions and discolorations are two typical challenges in such image translations, leading to unexpected and uncontrollable results. In contrast, the main problem with VAE-based methods is that they often produce blurry, unrealistic outcomes.

Despite their success in generating fire images, previously proposed approaches [15,16,17] have additional constraints. First, manually choosing a suitable position and reasonable scale requires a significant amount of time and reduces the variety of the synthetic fire images. Second, restrictions on the resolution of the generated fire kernel can lead to an incompatibility between the background and foreground in the synthesized image. Accordingly, this study presents a new fully automated approach for generating fire images, named "SynFAGnet", inspired by the various related methods and designed to be flexible under diverse conditions. The main contributions of the approach are as follows:

  • The structure of SynFAGnet includes two main parts: an object-scene placement network (OSPNet) and a local–global context-based GAN (LGC-GAN). The OSPNet automatically learns reasonable scales and positions for the fire kernel on a given background. The LGC-GAN generates more realistic synthetic fire images using styling information such as halos and reflections. In addition, the generator takes a mask as input, allowing it to produce the fire image while preserving the shape of the input object.

  • The feasibility of using SynFAGnet as a data augmentation technique for handling dataset imbalances and improving object detection and segmentation performance is verified.

The remainder of this paper is organized as follows. Section 2 presents a summary of the related research on the technologies employed in this study. Section 3 describes the data processing and the proposed framework for generating synthetic fire images at suitable positions and scales. Sections 4 and 5 analyze the experiments and the influence of the augmented data on fire detection and segmentation. Sections 6 and 7 conclude the paper with a summary and a discussion of future work.

2 Related Work

The study mainly focuses on finding suitable positions and scales for a fire kernel and synthesizing realistic-looking images. Therefore, this section primarily presents object-scene placement and image synthesis research.

2.1 Object-Scene Placement

Several studies have attempted to paste a foreground object into a given background scene at a suitable position and scale. Remez et al. [18] assumed that points along the same horizontal scanline have comparable depth; thus, the actual scale of the foreground object could be preserved when moving it along a scanline on the background. Georgakis et al. [19] proposed combining support-surface detection and semantic segmentation to find a suitable position for an object, whose size was decided based on the depth at that position and its original scale. In another study, a Gaussian mixture [20] was used to explicitly model the conditional probability distribution of bounding-box information given the background image and foreground type. With the increasing popularity of deep neural networks, some approaches have used deep learning to automatically predict object-scene placement. Li et al. [21] proposed a VAE for predicting the distribution of base classes and locations for humans in 3D indoor spaces while simultaneously reconstructing the pelvic joint coordinates and depths of the predicted positions. In another study, spatial pyramid pooling was incorporated into GANs [22] to inpaint pedestrians of diverse sizes at specified locations in a scene. Volokitin et al. [23] investigated using masked convolutions to aggregate the contextual information in a background semantic map as input for predicting a plausible bounding-box location.

2.2 Image Synthesis

Facebook AI extended the instance-conditioned GAN [24] to class-conditional image generation. This method enabled the proper integration of instances and class labels, thereby regulating the semantic content of the synthetic images. BlockGAN [25] was influenced by the computer graphics process; it trained object-aware 3D scene models from unlabeled 2D images and offered additional control over the 3D pose and identity of each object while preserving image realism. CSGNet and CSGNet-v2 [26] altered the latent regulations on smoke elements to constrain the smoke in generated images and create different smoke images or sequences. CycleGAN [27] introduced the cycle-consistency loss and identity mapping loss for converting non-wildfire images into wildfire images. In general, the synthetic images of CycleGAN are relatively blurry. Yang et al. [15] integrated an attention mechanism into the generator to improve the quality of the synthetic images and controlled the size of the fire using a conditional code. The global-local mask GAN (FGL-GAN) [16] used a fire mask as the generator input to preserve the fire information in composite images.

However, in the above approaches, the time required to manually select suitable positions and scales for the fire objects remains a significant concern. Additionally, previous experiments have not considered generating fire images at night; thus, their comparative results are less comprehensive. Therefore, SynFAGnet is proposed to address both issues concurrently.

2.3 Fire Detection and Segmentation on Augmented Datasets

Various image augmentation techniques have been proposed to enlarge datasets for specific tasks cost-effectively, and some of them have been investigated to enhance the efficiency of fire detection. Sousa et al. [28] proposed a transfer learning approach combined with data augmentation, evaluated using a tenfold cross-validation scheme. Their model employs color segmentation and feature description algorithms to partition the original image into new images of size 299 × 299. The robust data augmentation GAN (RDAGAN) [29] generates training data for object detection from a small dataset; training the object generation and image translation networks separately mitigates instability during training. The CycleGAN model has also been adapted to translate non-fire images into wildfire images [27]; a significant limitation is its uncontrollable fire placement, which makes annotating the augmented images for fire detection tasks challenging.

In fire segmentation tasks, Yang et al. [15] proposed a model using GANs and the attention mechanism to generate fire images within a warehouse. However, a notable limitation is that the model focuses on translating only the square region surrounding the inserted fire, resulting in a distinct boundary between this area and the overall background. Qin et al. [16] introduced a model to generate realistic-looking fire images encompassing flame effects. Utilizing a cut-and-paste method, their model superimposes flames onto images, then employs image translation to infuse halo and reflection effects. In their approach, the global coordination module enhances the integration of the fire and background, yielding a seamlessly natural composite image. Both studies primarily focus on indoor images in low-light conditions, which is a limitation. Additionally, the manual selection of fire locations is time-consuming.

3 Proposed Model

This paper proposes a data augmentation (DA) method that learns appropriate placements of fires in given background scenes without heavy human labeling. This study introduces the OSPNet, inspired by EfficientDet, for automatic fire-object placement in scenes, using a fire mask and a background image as inputs. The network comprises a Swin Transformer backbone, a BiFPN feature network, and a prediction network. Using the results obtained from the OSPNet, the generation network, LGC-GAN, performs local generation for realistic fire effects and a global combination to ensure seamless integration with the background. The details of the OSPNet and the LGC-GAN are discussed below.
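As an orientation aid, the following sketch chains the two stages at inference time. The class and function names (`DummyOSPNet`, `DummyGenerator`, `paste`) are hypothetical stand-ins for the trained components described above, not a released implementation; in the actual model, the generator operates on the local crop and applies the global combination described in Sect. 3.3.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the two trained networks described in the text.
class DummyOSPNet(nn.Module):
    """Predicts a bounding box (x, y, w, h) in normalized coordinates."""
    def forward(self, fire_mask, background):
        return torch.tensor([0.4, 0.6, 0.2, 0.2])  # fixed box, for illustration only

class DummyGenerator(nn.Module):
    """Would add halo/reflection to the composite (identity here)."""
    def forward(self, composite):
        return composite

def paste(fire_rgb, fire_mask, background, box):
    """Cut-paste the masked fire into the background at the predicted box."""
    _, H, W = background.shape
    x, y, w, h = (box * torch.tensor([W, H, W, H])).int().tolist()
    fire = torch.nn.functional.interpolate(
        (fire_rgb * fire_mask).unsqueeze(0), size=(h, w)).squeeze(0)
    mask = torch.nn.functional.interpolate(
        fire_mask.unsqueeze(0), size=(h, w)).squeeze(0)
    out = background.clone()
    region = out[:, y:y + h, x:x + w]
    out[:, y:y + h, x:x + w] = mask * fire + (1 - mask) * region
    return out

# Inference-time pipeline: placement -> cut-paste -> generation.
ospnet, generator = DummyOSPNet(), DummyGenerator()
background = torch.rand(3, 256, 256)
fire_rgb, fire_mask = torch.rand(3, 64, 64), torch.ones(1, 64, 64)
box = ospnet(fire_mask, background)
composite = paste(fire_rgb, fire_mask, background, box)
synthetic = generator(composite)   # halo/reflection rendering happens here in the real model
print(synthetic.shape)             # torch.Size([3, 256, 256])
```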

3.1 Data Acquisition

In the training phase, the datasets must include paired samples consisting of a fire kernel (foreground), a background scene without any fire, and designated locations for proper fire placement. Currently, no public dataset meets this requirement. Accordingly, this study used 25 fire videos of different scenes recorded in experiments, converted them into 50,000 images (the Visionin dataset), and divided them into training, validation, and test sets at a ratio of 8:1:1. Furthermore, several public datasets were collected to be used as part of the training sets: the AIHub dataset [30], the VOC2020 dataset [31], and the Dunning dataset [32]. The ImageHash library was then used to filter out structurally similar images from each dataset to enhance the generalizability of the model.
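For reference, near-duplicate filtering with the ImageHash library can be performed roughly as sketched below; the perceptual-hash variant, Hamming-distance threshold, and directory layout are assumptions rather than values reported here.

```python
from pathlib import Path

import imagehash
from PIL import Image

def filter_near_duplicates(image_dir: str, max_distance: int = 5):
    """Keep only images whose perceptual hash differs from every kept image
    by more than `max_distance` bits (a hypothetical threshold)."""
    kept_paths, kept_hashes = [], []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))
        if all(h - other > max_distance for other in kept_hashes):
            kept_paths.append(path)
            kept_hashes.append(h)
    return kept_paths

# Example: unique = filter_near_duplicates("visionin_frames/")
```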

The overall process is described in Fig. 1. The details are as follows.

  1.

    This study used a segmented fire mask to cut out the object area and retained the original bounding boxes as the ground truth. The fire segmentation masks were obtained from the labeled AI Hub dataset, and a mask region-based convolutional neural network (Mask R-CNN)-based segmentation model was trained on this dataset.

  2.

    A pre-trained inpainting network named LaMa was deployed in this study [33]; it utilized the original image and corresponding bounding box as inputs to fill the holes of the occluded region and generate a clean background without the fire.

  3.

    The fire segmentation mask was pasted directly onto the background image at its original scale and location to create a pair that matches the original image (a minimal compositing sketch follows this list).
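The sketch below illustrates step 3 with NumPy, assuming the segmentation and inpainting outputs from steps 1 and 2 are already available; the function name and array conventions are illustrative only.

```python
import numpy as np

def compose_training_pair(original, fire_mask, clean_background):
    """Assemble one (input, target) pair for training.

    original:         H x W x 3 uint8 image containing a real fire
    fire_mask:        H x W binary mask from the segmentation step
    clean_background: H x W x 3 output of the inpainting step (fire removed)
    """
    mask3 = np.repeat(fire_mask[..., None], 3, axis=2).astype(np.float32)
    # Paste the segmented fire back onto the clean background at the same
    # location and scale; this composite is the generator input, while the
    # original image serves as the real target.
    composite = mask3 * original + (1.0 - mask3) * clean_background
    return composite.astype(np.uint8), original

# Example with random data standing in for the real pipeline outputs:
H, W = 256, 256
original = np.random.randint(0, 256, (H, W, 3), dtype=np.uint8)
mask = np.zeros((H, W), dtype=np.uint8); mask[100:150, 120:180] = 1
background = np.random.randint(0, 256, (H, W, 3), dtype=np.uint8)
inp, target = compose_training_pair(original, mask, background)
```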

Figure 1: Data acquisition pipeline

Samples of the datasets utilized to train the OSPNet and LGC-GAN are displayed in Table 1.

Table 1 Summary of Training Datasets for SynFAGnet

3.2 Object-Scene Placement Network

In general, this study proposes a deep learning network for automatic object-scene placement, predicting proper positions in a given scene and suitable scales for fire objects. The proposed network is inspired by EfficientDet [34]; however, instead of using a single image as input, the network takes two inputs: a fire mask and a background scene image. The placement network consists of three main parts: a Swin Transformer-based backbone [35], a bidirectional feature pyramid network (BiFPN)-based feature network [34], and a prediction network. The detailed architecture of the OSPNet is shown in Fig. 2.

Figure 2: Architecture of the object-scene placement network (OSPNet)

3.2.1 Swin Transformer

The OSPNet uses the Swin Transformer (a hierarchical transformer whose representation is computed with shifted windows) as a general-purpose backbone. The shifted windowing scheme achieves high efficiency by restricting self-attention computation to non-overlapping local windows while still allowing cross-window connections. This hierarchical architecture can model at various scales and has linear computational complexity with respect to image size. The Swin Transformer has demonstrated strong performance in object detection and semantic segmentation models. In the proposed OSPNet, the backbone extracts valuable features from the given fire and background scene images.

3.2.2 BiFPN Layer

The next part of the proposed network is the BiFPN-based feature network; it takes multiple levels of combined object-background features from the backbone as input and produces a list of fused features. The BiFPN improves cross-scale connections by removing nodes with only one input edge, adding an edge from the original input to the output node when they are at the same level, and treating each bidirectional (top-down and bottom-up) path as a single feature network layer. It is well suited for simple and fast multi-scale feature fusion.
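For illustration, the sketch below shows the fast normalized (weighted) feature fusion used at each BiFPN node in EfficientDet; whether OSPNet uses exactly this fusion variant is not stated here, so the code should be read as an EfficientDet-style example rather than the network's actual layer.

```python
import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    """Weighted feature fusion at a BiFPN node (EfficientDet-style).

    Each input feature map gets a learnable non-negative weight; the weights
    are normalized to sum to one before the maps are combined.
    """
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, features):
        w = torch.relu(self.weights)      # keep weights non-negative
        w = w / (w.sum() + self.eps)      # fast normalized fusion
        return sum(wi * f for wi, f in zip(w, features))

# Example: fuse two same-resolution feature maps from different BiFPN paths.
fuse = FastNormalizedFusion(num_inputs=2)
p_td = torch.rand(1, 64, 32, 32)   # top-down path feature
p_in = torch.rand(1, 64, 32, 32)   # original input feature
out = fuse([p_td, p_in])           # shape (1, 64, 32, 32)
```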

3.2.3 Prediction Network

The final prediction network uses the fused features to predict the relevant positions and scales of the fire on the given background. Before producing the final outputs, it employs two fully connected layers that operate on the outputs of the last BiFPN layer: one is used to predict the bounding-box coordinates, whereas the other predicts the class and its confidence score. The bounding boxes are then computed by applying one convolution at each position of the feature map, while another convolution predicts the class probabilities at each location on the grid.
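A minimal sketch of such a two-branch prediction head is given below; the channel width, anchor count, and single fire class are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PredictionHeads(nn.Module):
    """Illustrative box-regression and classification heads over fused features.

    One 3x3 convolution predicts four box offsets per location; another predicts
    per-class confidence scores. Channel counts are assumptions.
    """
    def __init__(self, in_channels: int = 64, num_classes: int = 1, num_anchors: int = 1):
        super().__init__()
        self.box_head = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=3, padding=1)
        self.cls_head = nn.Conv2d(in_channels, num_anchors * num_classes, kernel_size=3, padding=1)

    def forward(self, fused):
        boxes = self.box_head(fused)                  # (N, 4, H, W) box offsets
        scores = torch.sigmoid(self.cls_head(fused))  # (N, num_classes, H, W) confidences
        return boxes, scores

heads = PredictionHeads()
fused = torch.rand(2, 64, 32, 32)   # fused features from the BiFPN
boxes, scores = heads(fused)
```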

3.3 Local–Global Context-Based Generative Adversarial Network (GAN)

3.3.1 Model Architecture

This study proposes the LGC-GAN architecture, inspired by Johnson et al. [36], for generating fire images. The structure is shown in Fig. 3. The proposed model contains a generator and a discriminator. The generator uses nine residual blocks for training images with a resolution of 256 × 256 or higher. For the discriminator, 70 × 70 PatchGANs [7] are used to identify whether the 70 × 70 overlapping image patches are real or fake.

Figure 3: Architecture of the local–global context-based generative adversarial network (LGC-GAN)
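For concreteness, the following sketch shows the two standard building blocks referenced above: a Johnson-style residual block and a 70 × 70 PatchGAN discriminator. The layer widths and the six-channel conditional input follow common pix2pix/CycleGAN defaults and are assumptions rather than the exact LGC-GAN configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block of a Johnson-style generator (instance norm + ReLU)."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(channels, channels, 3),
            nn.InstanceNorm2d(channels), nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1), nn.Conv2d(channels, channels, 3),
            nn.InstanceNorm2d(channels))

    def forward(self, x):
        return x + self.block(x)

class PatchDiscriminator(nn.Module):
    """70x70 PatchGAN: classifies overlapping patches as real or fake."""
    def __init__(self, in_channels: int = 6):  # condition and image concatenated (assumption)
        super().__init__()
        def layer(cin, cout, norm=True):
            mods = [nn.Conv2d(cin, cout, 4, stride=2, padding=1)]
            if norm:
                mods.append(nn.InstanceNorm2d(cout))
            mods.append(nn.LeakyReLU(0.2, inplace=True))
            return mods
        self.net = nn.Sequential(
            *layer(in_channels, 64, norm=False), *layer(64, 128), *layer(128, 256),
            nn.Conv2d(256, 512, 4, stride=1, padding=1), nn.InstanceNorm2d(512),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(512, 1, 4, stride=1, padding=1))  # one realism score per patch

    def forward(self, x):
        return self.net(x)

disc = PatchDiscriminator()
pair = torch.rand(1, 6, 256, 256)   # e.g. generator input and output concatenated
patch_scores = disc(pair)           # (1, 1, 30, 30) grid of patch realism scores
```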

The proposed synthesis network comprises two generation stages: local generation and global combination. By concentrating on local information, the local generation stage aims to render more realistic reflections and halos surrounding the fires. However, if only this stage is used, the local part and the overall scene will have inconsistent boundaries, resulting in an unrealistic synthetic image. To ensure that the synthetic image is realistic and natural-looking, the global combination stage smooths the boundaries between the local part and the entire background.

The steps in the fire image synthesis inside the generator are as follows:

  1.

    After the OSPNet determines a suitable position and scale for the fire to match the background scene, the local image Ii obtained using the cut-paste algorithm is input into the local generator.

  2.

    The local generation creates a synthetic fire image with a halo and reflection Ilocal_G. At this stage, using both Ilocal_G and the actual local image Ilocal_R as inputs, the discriminator evaluates the realism of the image.

  3.

    Ilocal_G combined with the background image Ibg is input into the combination module, which maps the local image back to the overall scene Iglobal_G. The discriminator then penalizes any inconsistencies between Iglobal_G and the real global image Iglobal_R, thereby encouraging a seamless transition in the overall scene (a minimal sketch of this two-stage flow follows the list).
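The listed steps can be summarized by the following minimal sketch; the stand-in modules (`local_generator`, `combine_module`) are placeholders for the trained networks, and the box coordinates are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: in the real model these are the trained local
# generator and the combination module described above.
local_generator = nn.Identity()   # renders halo/reflection on the local crop
combine_module = nn.Identity()    # blends the local result back into the scene

def synthesize(i_local, i_bg, box):
    """Two-stage synthesis: local generation, then global combination."""
    x, y, w, h = box
    i_local_g = local_generator(i_local)              # step 2: local fire + halo
    i_global_g = i_bg.clone()
    i_global_g[:, :, y:y + h, x:x + w] = i_local_g    # step 3: map back to the scene
    return combine_module(i_global_g)                 # smooth the boundary

i_local = torch.rand(1, 3, 64, 64)    # cut-paste crop around the placed fire
i_bg = torch.rand(1, 3, 256, 256)     # clean background
synthetic = synthesize(i_local, i_bg, box=(96, 128, 64, 64))
```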

3.3.2 Loss Function

The combined loss deployed in this study contains an adversarial loss, local loss, and global loss. The overall loss is estimated as follows:

$$\mathcal{L}_{G}=\mathcal{L}_{adv_{G}}+\lambda \cdot \mathcal{L}_{local}$$
(1)
$$\mathcal{L}_{D}=\mathcal{L}_{adv_{D}}+\lambda \cdot \mathcal{L}_{global}$$
(2)

The objective of the adversarial loss is to make the discriminator and generator learn in opposition to one another. While the discriminator attempts to distinguish between the generated Ilocal_G and actual local images Ilocal_R, the generator attempts to make the synthetic fire image look as real as possible. The adversarial loss functions of the final generator and discriminator are as follows:

$$\mathcal{L}_{adv_{G}}=\mathbb{E}\left[\left(D\left(I_{i},G\left(I_{i}\right)\right)-1\right)^{2}\right]$$
(3)
$$\mathcal{L}_{adv_{D}}=\mathbb{E}\left[\left(D\left(I_{i},I_{local_{R}}\right)-1\right)^{2}\right]+\mathbb{E}\left[D\left(I_{i},G\left(I_{i}\right)\right)^{2}\right]$$
(4)

where Ilocal_R is the real local image, and Ii is the generator input.

An L1 loss is used to measure the pixel-by-pixel difference between the generated and real images. The local L1 loss between Ilocal_G and the target Ilocal_R is calculated as follows:

$$\mathcal{L}_{local}=\left\|I_{local_{G}}-I_{local_{R}}\right\|_{1}$$
(5)

To achieve a seamless transition in the composed image, the global discriminator tries to distinguish between the generated image Iglobal_G and the real global image Iglobal_R. The mean squared error (MSE) is used as the objective function of the global adversarial loss:

$$\mathcal{L}_{global}=\mathbb{E}\left[D\left(I_{global_{R}},I_{global_{G}}\right)^{2}\right]$$
(6)
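A PyTorch sketch of Eqs. (1)-(6) is given below. How the discriminators consume their two arguments (here, channel-wise concatenation) and the λ values are assumptions, since they are not fixed by the equations above.

```python
import torch
import torch.nn.functional as F

def generator_loss(d_local, i_in, i_local_g, i_local_r, lam: float = 100.0):
    """L_G = L_adv_G + lambda * L_local (Eqs. 1, 3, 5); lambda = 100 is an assumed value."""
    pred_fake = d_local(torch.cat([i_in, i_local_g], dim=1))     # D(I_i, G(I_i))
    adv_g = F.mse_loss(pred_fake, torch.ones_like(pred_fake))    # E[(D(I_i, G(I_i)) - 1)^2]
    l_local = F.l1_loss(i_local_g, i_local_r)                    # ||I_local_G - I_local_R||_1
    return adv_g + lam * l_local

def discriminator_loss(d_local, d_global, i_in, i_local_g, i_local_r,
                       i_global_g, i_global_r, lam: float = 1.0):
    """L_D = L_adv_D + lambda * L_global (Eqs. 2, 4, 6)."""
    pred_real = d_local(torch.cat([i_in, i_local_r], dim=1))
    pred_fake = d_local(torch.cat([i_in, i_local_g.detach()], dim=1))
    adv_d = (F.mse_loss(pred_real, torch.ones_like(pred_real)) +   # (D - 1)^2 term
             F.mse_loss(pred_fake, torch.zeros_like(pred_fake)))   # D^2 term
    pred_global = d_global(torch.cat([i_global_r, i_global_g.detach()], dim=1))
    l_global = (pred_global ** 2).mean()                           # Eq. (6)
    return adv_d + lam * l_global
```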

4 Experiment Results

4.1 Quantitative Evaluation Metrics

The Fréchet inception distance (FID) [37] and learned perceptual image patch similarity (LPIPS) [38] are two evaluation criteria that are close to human perception and are important for evaluating image quality. The FID computes the Fréchet distance between Gaussian distributions fitted to the Inception network activations of the synthesized and real images. The FID can be expressed as shown in Eq. (7):

$$FID_{r,s}=\left\|\mu_{r}-\mu_{s}\right\|_{2}^{2}+\mathrm{Tr}\left(\Sigma_{r}+\Sigma_{s}-2\sqrt{\Sigma_{r}\Sigma_{s}}\right)$$
(7)

In the above, (µr, Σr) and (µs, Σs) denote the means and covariances of the real and synthetic feature distributions, respectively. The lower the FID value, the better the quality of the synthetic image.
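A minimal NumPy/SciPy sketch of Eq. (7) is shown below, assuming the Inception activations for the real and synthetic images have already been extracted; the feature dimensionality in the example is reduced for speed.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(feats_real: np.ndarray, feats_syn: np.ndarray) -> float:
    """Eq. (7) computed from pre-extracted Inception activations of shape (N, D)."""
    mu_r, mu_s = feats_real.mean(axis=0), feats_syn.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_s = np.cov(feats_syn, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_s)
    if np.iscomplexobj(covmean):          # drop tiny imaginary parts introduced by sqrtm
        covmean = covmean.real
    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(sigma_r + sigma_s - 2.0 * covmean))

# Example with random 64-dimensional features standing in for Inception activations:
fid = frechet_inception_distance(np.random.randn(200, 64), np.random.randn(200, 64))
print(round(fid, 3))
```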

The L2 distance between the features extracted from a pre-trained AlexNet is used to measure the similarity between the synthetic and real images. The average LPIPS distance is calculated in Eq. (8) as follows:

$$d\left(x,x_{0}\right)=\sum_{l}\frac{1}{H_{l}W_{l}}\sum_{h,w}\left\|w_{l}\odot\left(\hat{y}_{hw}^{l}-\hat{y}_{0hw}^{l}\right)\right\|_{2}^{2},\qquad LPIPS=\frac{1}{N}\sum d\left(real,syn\right)$$
(8)
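In practice, the AlexNet-based LPIPS of Eq. (8) can be computed with the publicly available lpips package, roughly as follows; the random tensors merely stand in for real and synthetic images scaled to [-1, 1].

```python
import torch
import lpips  # pip install lpips

# AlexNet-based LPIPS as in Eq. (8); inputs are expected to lie in [-1, 1].
loss_fn = lpips.LPIPS(net='alex')

real = torch.rand(1, 3, 256, 256) * 2 - 1    # stand-in for a real image
syn = torch.rand(1, 3, 256, 256) * 2 - 1     # stand-in for a synthetic image
distance = loss_fn(real, syn)                # one pairwise LPIPS distance
print(distance.item())
```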

In addition, a visual assessment was conducted in which participants were asked to identify the real image from two options (real and generated) with identical size, background, and fire placement. Fifteen subjects participated in this study to assess how visually realistic the synthetic results are. This assessment was designed to internally validate the images generated by our framework against the real images in terms of subjective perception. For the global images, the participants mainly assessed whether the composition and blending of the fire and background matched reality; for the local images, they mainly compared the generated fire halos and reflections.

$$Realistic=\frac{Incorrect\ predictions}{Total\ cases}$$
(9)

4.2 Quantitative Evaluation

Table 2 shows the FID and LPIPS values between the real images and the images generated by the flow-based-importance GAN, FGL-GAN, and our model. Compared with the other networks, the images generated by SynFAGnet achieve the best FID and LPIPS values, at 17.232 and 0.077, respectively. This indicates that the halos and reflections rendered by the proposed model are the most realistic and distinct. With the local-global structure, the generator can limit background color changes around the fire, thereby producing more realistic results. Expanded datasets with diverse background scenes are expected to improve this capability further.

Table 2 Evaluation of the Generated Fire Kernel Quality Using the Proposed Method and Other Methods

Additionally, when considering the realistic scores, participants consistently rated images generated by SynFAGnet as more in line with human visual perception than those from the other networks. This assessment encompasses both the overall context of the images and the detailed rendering of specific elements such as fire halos and reflections. The same evaluation approach has been used in several state-of-the-art studies [15, 16] to assess the realism of their results, and it was therefore adopted here with the same motivation. In summary, SynFAGnet stands out for its ability to capture nuanced visual details, contributing to its perceived realism and aesthetic appeal compared to images generated by the alternative networks.

4.3 Qualitative Evaluation

The proposed model was trained on an NVIDIA GTX 1050 Ti GPU in this experiment. The loss curves obtained while training the SynFAGnet model are depicted in Fig. 4. Both the generator and discriminator losses fluctuate but decrease significantly over the 1000 training epochs, indicating successful model training. The results from training epochs 0, 200, 400, 600, 800, and 1000 are shown in Fig. 5. As the model performance improves over time, the quality of the generated fire gradually improves and the distortion decreases. The transition between the fire object and the background image becomes smoother.

Figure 4: The loss curves for the SynFAGnet model after completing 1000 training epochs

Figure 5: Results of the training phase

The qualitative evaluation is performed on unseen test sets. The backgrounds of the unseen sets differ from those of the training sets, and the fire masks are chosen randomly. The OSPNet finds the appropriate positions and scales for a given fire mask. One example is shown in Fig. 6. The proposed SynFAGnet produces convincing synthetic fire images in both daytime and nighttime environments. Based on the given mask, Fig. 6c shows the proper positions and scales of the given fire on each corresponding background. The model selects the most suitable position and scale to generate a realistic-looking fire image. Among the images, Fig. 6b and d show the scene before and after the insertion of a fire kernel using the cut-paste algorithm, respectively. The cut-paste algorithm cannot reproduce the halo and reflection on the new background, resulting in unrealistic synthetic images. In general, two factors affect the realism of a synthetic fire image: the light intensity and the distance. Depending on the time of day, the reflection and halo of the fire change. The sun serves as the primary source of illumination during the day, making the reflection difficult to observe; in contrast, as night falls, the fire gradually takes over as the primary source of illumination, making the reflection increasingly evident. In addition, the rendered reflection of the fire gradually attenuates with increasing distance from the fire. Using the local loss, the fire halo and reflection rendered by SynFAGnet are more plausible. With the global loss, the SynFAGnet-synthesized images have a smooth transition between the background and the fire. This makes the synthetic images appear more realistic and natural, as shown in Fig. 6f.

Figure 6: Qualitative results of image synthesis

In more detail, the synthetic results obtained using only the local loss and those combining the local-global loss are compared in Fig. 7. Without the global loss, the synthetic images have a clear boundary between the fire reflection and the background and are relatively blurry. In contrast, applying the global loss ensures seamless blending between the local image and the overall scene, though it creates a slight sharpness discrepancy between them. Figure 7 demonstrates that the proposed method can maintain the background image quality while enhancing the realism of the fire.

Figure 7: Comparison results of local-patching and local–global images

5 Ablation Study

An ablation study was also conducted to assess the contribution of each module in the proposed method and the influence of the augmented data on fire detection/segmentation. The study comprised five cases: the real dataset only, OSPNet only (patching), LGC-GAN only (random placement), no global loss (local + patching), and the full SynFAGnet.

  • Patching DA: Once the appropriate scale and location were determined using the object placement model, the fire kernel was inserted into the background through a cut-and-paste algorithm.

  • Random Placement DA: The location and scale of the fire were randomly selected, with the random scale being bounded by the extreme fire size in the dataset.

  • Local + Patching DA: After selecting the suitable scale and location using the object placement model, only the local fire image Ilocal_G was generated; it was then placed in the background at the specified location without the global loss.

In the present study, a total of 4000 images were randomly chosen from the AIHub and Visionin datasets. These were partitioned into a training set of 3500 images and a test set of 500 images. The baseline model was trained exclusively on this real training set. To evaluate the efficacy of SynFAGnet, an additional 1500 images were randomly sampled from the aforementioned datasets to serve as augmented data. Of these, 1000 augmented images were incorporated into the training set, while the remaining 500 were allocated to the test set. Samples of each DA dataset are shown in Fig. 8. Comparing the OSPNet-only variant with SynFAGnet, relying solely on the OSPNet fails to generate the halo and reflection around the fire in the new background, resulting in an unrealistic image; in contrast, SynFAGnet creates a seamless transition between the fire and the background by producing the halo and reflection effects. Comparing the LGC-GAN-only variant with SynFAGnet, the fire has a halo and reflection effect, but the background is distorted owing to improper placement. This highlights the importance of finding a suitable location, as it preserves the quality of the background image. Comparing the variant without the global loss with SynFAGnet, the synthetic images without the global loss show a noticeable boundary between the local image and the background. By incorporating the global loss into the network, the blending of the fire with the background becomes smoother.

Figure 8: Samples of augmented datasets

5.1 Contribution of Augmented Data on Fire Detection

“You Only Look Once” V4 [39] is utilized in this study to verify the benefits of our proposed method in improving the performance of fire detection networks, with the mean average precision (mAP@0.5) serving as the metric for evaluation.

As seen in the last row of Table 3, the SynFAGnet-augmented images enhance fire detection relative to the baseline. By generating rare object-and-context scenes to balance the bias in the initial dataset, our proposed augmentation method can improve performance on rare and difficult detection cases, such as high-transparency fires. The model trained on patching-augmented images is limited to identifying high-intensity pixels and disregards the halo surrounding the fire. This indicates that the accuracy of the detection model could be compromised if the fire has a high level of translucency. The random placement baseline shows how the contextual relationship might impact the DA for object detection: the quality of the augmented images could be significantly enhanced by locating the object in the appropriate area. Comparing the local loss alone with the combined local-global loss shows that making the transition between the fire and the overall scene as seamless as possible further improves the quality of the augmented dataset and thereby the performance of fire detection. From the visual comparison in Fig. 9, the models trained on either the real images or the random placement-augmented images misclassify the reflection as fire pixels, resulting in a bounding box that covers an area too large for the actual fire scale. Using the local and patching-augmented images makes it challenging for the model to accurately identify high-transparency fire pixels.

Table 3 Detection and Segmentation Results With/Without Data Augmentation (DA)
Figure 9: Visual comparison showing the results from the ground truth (white box), the baseline (blue box), patching (black box), random placement (yellow box), local and patching (green box), and the proposed approach (red box) using "You Only Look Once" V4. The confidence score and the IoU with respect to the ground truth are noted in each image

5.2 Contribution of Augmented Data on Instance Segmentation

In this experiment, the instance segmentation algorithm Mask R-CNN [40] was trained on all the augmented datasets with mAP@0.5 as the evaluation metric. Table 3 compares all the augmentation methods and indicates that our proposed approach is the strongest, producing the best mAP@0.5. The quantitative results for both tasks suggest that finding suitable positions and scales for given fires according to the background context may reduce dataset bias and enhance accuracy in special cases.

The segmentation results for the ground reflections can be observed in the visual comparison in Fig. 10. Owing to training on patch-augmented images, the model experiences challenges in detecting pixels with low intensity, so the segmented mask does not fully capture the extent of the actual fire. Moreover, the model may produce incorrect segmentations if the region surrounding the fire-reflected light is excessively bright. The first two rows demonstrate that, by training on images augmented with the SynFAGnet method, the Mask R-CNN model can properly classify the parts originally belonging to the fire as fire pixels, resulting in a more completely retained fire region. The remaining rows show that training on the real dataset, or combining it with the random placement method, fails to distinguish the fire reflection effectively. Moreover, training on augmented images without the global loss leads to incorrectly classifying parts of the original fire as non-fire pixels, in addition to misclassifying non-fire pixels owing to noise artifacts between the local image and the overall scene in the training set. In conclusion, the accuracy of the fire recognition network can be significantly improved by training it with augmented images that are as similar to the real world as possible.

Figure 10: Segmentation results of the mask region-based convolutional neural network (Mask R-CNN). The confidence score and the IoU with respect to the ground truth are noted in each image

6 Limitations and Future Works

  1.

    Dataset exploration: The experiments on synthesizing realistic fire images were conducted primarily using the public datasets (AI Hub, VOC2020, Dunning) and the private Visionin dataset. These datasets predominantly contain images of outdoor fires. For a more comprehensive study, data from varied environments, such as indoor scenes and warehouses, must be collected and evaluated.

  2.

    Robustness towards low-light conditions: Efforts will be made to enhance the performance of SynFAGnet in generating fire images under challenging conditions, especially in low-light situations where the current results are unsatisfactory. This will be achieved by expanding the training dataset to include a broader range of scenarios in terms of quantity and diversity to optimize the model performance.

  3.

    Fire detection under diverse scenarios: The scope of SynFAGnet will be expanded beyond fire generation. Smoke detection is just as important as fire detection for fire safety, as smoke is often the initial sign of a fire. To address this, SynFAGnet will be extended to generate synthetic smoke data and to integrate the fire and smoke data, improving its capability to support fire detection in different scenarios.

  4.

    Distributed endpoint: Currently, the proposed system is a proprietary offline fire generation system that runs on on-premises servers. Future work will focus on providing a distributed endpoint so that the system can be used ubiquitously online.

7 Conclusion

This study proposed a fully automated synthetic fire generation network that produces realistic fire-burning images in new scenes as a DA method. Based on two separate networks, the proposed approach first automatically finds suitable positions and scales for a given fire mask and then generates realistic-looking fire images with high-quality halos and reflections. Compared with existing methods, the proposed method addresses two major challenges in fire image generation. First, the OSPNet learns the distribution of diverse and plausible locations and scales at which a given fire can be placed in the background. Second, the LGC-GAN module takes the fire mask as input and uses a local-global structure to enhance the controllability of the fire shape and generate a more realistic synthetic image while retaining high-quality background details. Compared to previous methods, SynFAGnet can automatically generate realistic, high-quality flame images from both global and local perspectives in various environments. Qualitative evaluations reveal that SynFAGnet outperforms existing methods in generating realistic-looking images under varied conditions. Quantitatively, SynFAGnet achieves scores of 17.232 on FID, 0.077 on LPIPS, and 0.67 in user evaluations, surpassing other methods on all metrics. Furthermore, networks trained with the SynFAGnet-augmented dataset show improvements of 32.18% and 33.11% (mAP@0.5) in detection and segmentation, respectively, compared to training on real datasets only.