1 Introduction

Global warming and climate change are issues with multiple negative repercussions. Wildfires are one of the many consequences of the increased pressure on the planet caused by rising temperatures. In recent years, we have witnessed some of the deadliest forest fires on record. Compared to 1984, annual fire incidents have doubled and have become a cause of concern for forest departments, governments, and the public [1]. These fires not only engulf huge masses of flora and fauna but also render the land unusable for decades to come. The loss of vegetation and changes in soil properties result in flash floods if the fire is immediately followed by heavy rain. The frequency of extreme fires followed by heavy rainfall is projected to increase by 100% in the western USA by the end of the century [2]. This will put the forests of California, Colorado, and the Pacific Northwest at high risk.

The WHO states that wildfires have affected over 6.2 million people in the last two decades. In California alone, a five-fold increase in burnt area has been observed compared to 1972 [3]. Californian summers have become warmer by 1.4 degrees, contributing to an increased frequency of summer wildfires, while windy conditions and delayed winters have caused significant fall fires. Wildfires naturally raise air pollution around the burnt area, and farmers working in close proximity to these areas experience severe health issues [4]. This polluted air has also been shown to significantly affect Covid-19 death counts [5]. The infertile land left after a fire is extinguished is a considerable cause of concern: farmers who cultivate crops near these areas face great losses, and governments and economies are adversely affected.

Many researchers have worked on techniques to detect and map wildfires. However, more research is needed on mapping the burnt area after the fire ceases. Tracing the burnt area has several advantages: it assists the forest department in the restoration process, helps prepare for the aftermath and the extreme climatic conditions that follow massive fires, enables efficient evacuation plans for neighboring villages, and helps the government aid the affected people and farms. Our study aims to map the burnt area by applying image processing followed by deep learning techniques to images of Californian forest land extracted from the Landsat-8 satellite.

The first Landsat satellite was launched in 1972 [6], and eight successors have been launched since. As of March 2022, Landsat 7, 8, and 9 are in orbit, sending images of forests, farms, settlements, and freshwater bodies. Deep learning techniques [7] are widely used for fire detection with terrestrial, aerial, and satellite-based systems. However, there is a lack of systems that map the burnt area once the fire ceases. We implement our model to map the burnt area using infrared satellite images.

This research work provides various insights that could be helpful in the process of tracking forest fires and wildfires. Contributions of this paper are:

  • We have conducted an extensive literature review, presenting insights on the topic, segregated into sections covering machine learning, deep learning, image processing, and hybrid models.

  • We created a bespoke dataset derived from the Landsat 8 satellite database, leveraging the advanced capabilities of the Google Earth Engine API.

  • Engineered ground truth data, including segmentation masks, boundary-boxed images, and segmented images, meticulously crafted from the images extracted through the Google Earth Engine API.

  • Harnessed Convolutional Neural Networks and strategically used transfer learning with the AlexNet architecture to proficiently classify the extracted images into fire and no-fire classes.

  • Applied segmentation algorithms of UNet and SegNet to intricately delineate and precisely map the regions affected by fire in the classified images, yielding a nuanced understanding of the spatial impact.

We present a way of tracking forest fires and wildfires using deep learning and machine learning classification and segmentation techniques, and we evaluate the models on various evaluation parameters. The subsequent sections of the paper are arranged as follows: Section 2 reviews the recent literature, segregating the papers into deep learning, machine learning, image processing, and hybrid models; Section 3 presents the data collection and processing along with the details of the algorithms used; Section 4 describes the evaluation parameters and the results of the models developed; Section 5 discusses the findings; Section 6 outlines application areas; and Section 7 concludes the work and presents future directions.

2 Literature review

With the advent of automation techniques, forest fire identification and segmentation are now performed using various machine learning, deep learning, and image processing tools. Twenty papers published over six years (2017–2022) are reviewed in this section. These papers address a common topic, wildfire, but employ different techniques for detecting it. While some papers compared commonly used methods, others proposed new ones.

2.1 Deep learning based approaches

Deep learning is a subdomain of machine learning that imitates the behavior of human neurons to approximate various functions; the neuron acts as the building block of a neural network [5]. Deep learning has been applied in many domains and is popular for satellite imagery classification, and based on our survey, we identified that it is also a popular tool for early fire detection.

In [7], different types of convolutional neural networks were compared. The authors proposed the use of an Unmanned Aerial Vehicle (UAV) to capture videos of forests for wildfire detection. Images extracted from these videos were passed to pre-trained neural networks. The authors compared AlexNet, GoogLeNet, Modified GoogLeNet, Modified VGG13, and VGG13 on a dataset of 10,985 aerial fire images and 12,068 nonfire images. Based on the classification results, it was observed that nonfire images captured at low altitudes were misclassified as fire images. The best accuracy, 99%, was produced by GoogLeNet; however, it took the longest prediction time, 11.657 s, and almost 3 h to train. A close second was Modified VGG13, which gave an accuracy of 96.2% and took 7.951 s to classify an image, the shortest time of any model. In terms of training time, it was interesting to note that the Modified GoogLeNet took only 1.5 h while still giving a satisfactory testing accuracy of 96.0%.

The researchers in [8] proposed a novel spatio-spectral neural network to classify satellite images collected from various sources. The raw images were cleaned using streaming data processing methods that converted each image to vectors. A five-layer 3D convolutional neural network was applied to these images, and the model was evaluated using precision, recall, and weighted F1 score. After training, the model's output is streamed to a dashboard that provides the user with a focused monitoring view. The proposed system gave a weighted F1 score of 93.89%, a recall of 91.87%, and a precision of 96.05%, and was faster than the baseline models by a factor of 1.5.

In [9], U-ResNet34 was used for wildfire segmentation after being trained on two datasets of colored images with various spatial resolutions; the datasets consisted of 1457 and 393 images, respectively. To enlarge the datasets, the authors applied the image augmentation techniques of flipping, shifting, and rotation. After training on both datasets, the authors obtained dice scores of 0.812 and 0.508, respectively, and F1 scores of 0.465 and 0.321. The low scores for the second dataset might be due to its small size.

The study in [10] uses the popular convolutional neural network Inception V3 for feature extraction, followed by classification into two classes of images: Fire and No Fire. The dataset was gathered through NASA Worldview (EOSDIS) using the MODIS instruments aboard the Aqua and Terra satellites, and consisted of 239 fire images and 295 nonfire images. To further improve the results, local binary patterns were used to trace the exact location of the fire when an image was classified under the ‘Fire’ class. The authors obtained an accuracy of 98%, comparatively higher than the three models used for comparison in their study. The future scope of this study is to identify fires from videos and make the model more dynamic.

The authors of [11] used the InceptionV3 model to create a smoke detection system that utilizes a cloud-based workflow and can scan hundreds of cameras every minute. The authors claim their system can detect fire within fifteen minutes of ignition with fewer than one false positive per day. Notably, the proposed system was tested in Southern California and detected fires better than the methods already in use in the region. To build the system, the authors used HPWREN cameras, capturing 184,160 images, of which 103 were noted. From these 184,160 images, 250 were taken as the test dataset, consisting of 100 smoke images and 150 nonsmoke images. The proposed model gave an accuracy of 0.91 and an F1 score of 0.89. Future work on the system will use more training data and try other models for classification.

To research a large custom dataset, the authors of [12] used 150,000 ten-channel images produced by Landsat-8 to detect active wildfires and proposed deep convolutional neural networks for the task. To generate annotated masks for the training data, the authors used three well-established techniques proposed by [a], [b], and [c]. To validate the automatic mask generation, they manually generated masks for around 9000 images. Five sets of masks were produced: one from each of the three techniques, one from their intersection, and one selected by voting. Three versions of U-Net were trained: U-Net (10c), which uses all ten channels of the image; U-Net (3c), which only uses channels 7, 6, and 2; and U-Net-Light (3c), a compressed version of U-Net (3c). The authors combined the different masks and U-Net architectures and compared the results. The dataset was divided randomly into three sets: 40% training, 10% validation, and 50% testing data. Precision, recall, F1-score, and Intersection over Union (IoU) were used to evaluate the models. While the best results were obtained with U-Net (10c), the other two architectures also produced acceptable results, showing that three channels can be enough. When manually annotated data was used to train the models, the results showed more tolerance towards noisy data. The study concluded that the best way to detect fire was to combine the results of different techniques.

The authors of [13] proposed a Long-term Recurrent Convolutional Network (LRCN), an ensemble of two neural networks: an RNN (Recurrent Neural Network) and a CNN (Convolutional Neural Network). The model has three parts: a CNN for feature extraction, an LSTM that uses timestamps to model how the fire spreads, and a final part that predicts the terrain fire likelihood (TFL). Once the TFL is calculated, each pixel is passed through a Markov decision process to classify it as wildfire or no wildfire: a burnt pixel is set to (0,0,0) (black) and an unburnt pixel to (1,1,1) (white). The output is thus an image with burnt areas in black and unburnt areas in white. The LRCN-LSTM method was compared with Naive Bayes, Logistic Regression, and Decision Tree classifiers. LRCN-LSTM achieved a Burn Area Ratio (BAR) of 0.86 and a Burn Boundary Similarity (BBS) of 0.78, against a BAR/BBS of 0.71/0.62 for Naive Bayes, 0.75/0.63 for Logistic Regression, and 0.35/0.56 for the Decision Tree. Based on these values, it was concluded that the LRCN-LSTM model was better than all the compared models. It was also observed that, since the Decision Tree does not give a probabilistic output, it is unsuitable for the application.

[14] made use of SqueezeNet to predict wildfires, bringing novelty to the standard SqueezeNet model by introducing dilated convolutions: the stride layer was replaced by a dilation layer, the pooling layer was eliminated (as it reduced accuracy), and the padding was removed from intermediate feature maps. Finally, a multi-scale context model is used to segment the fire region in the image. The dataset consisted of five forest fire monitoring videos collected from various web pages. The proposed novel CNN architecture, with SqueezeNet as its backbone, gave an accuracy of 94.2%, higher than the other models compared in the paper.

2.2 Machine learning based approaches

Machine learning is the practice of applying algorithms to data gathered from various sources to generate valuable insights [5]. In this section, we describe some recent publications that utilize machine learning for detecting forest fires.

The authors of [15] used burn probability, i.e., the probability of an ignition evolving into a wildfire of a given intensity, and grid-based mapping to model fire spread from cell to cell using shortest-path algorithms. The study treats a forest region as a two-dimensional grid and uses Bayesian networks to predict the behavior of the fire based on geospatial data such as terrain, wind direction, wind speed, and available fuel sources. The authors compared the performance of their model with existing algorithms such as FARSITE and BEHAVE and claim that their proposed system is faster. They also simulated an industrial fire, where the fire spread among fuel chambers, to test their algorithm.

In [16], the authors gathered data from various sources and mined and cleaned data of different types to maintain uniformity in the dataset, then applied machine learning models to classify fire and no-fire zones. They collected various geographical and meteorological parameters for the dataset, covering the British Columbia and Quebec regions of Canada; since the data was raw, it had to be processed in multiple steps, including cleaning, interpolation, and extrapolation. The collected data was used to compute the Normalized Difference Vegetation Index (NDVI), Land Surface Temperature (LST), and thermal anomalies. The authors used Databricks to analyze the dataset and facilitate big data operations. After processing, the dataset was split into training and testing sets, and a Multilayer Perceptron and a Support Vector Machine were trained. On the testing data, the authors achieved an accuracy of 97.48% with the support vector machine and 98.32% with the multilayer perceptron. Apart from accuracy, the models were also evaluated on true positive rate, F1-score, recall, false positive rate, and precision.

In [17], the authors used a Multilayer Perceptron to predict the rate of spread (RoS) of an unburnt region by considering the RoS of the burnt region and environmental parameters. The experiment was performed under five environmental conditions: wind influence, slope influence, wind and slope influence at different time intervals, no wind, and two fire fronts merging. The simulation treated the region as shrubland, with the spread direction normal to the fire front. The authors employed the narrow-band level-set method to predict the fire front evolution. The proposed method was validated with numerically generated data and was found to be efficient in predicting short-term fire spread.

The study in [18] assesses the potential of combining supervised and unsupervised machine learning algorithms such as CART, random forest, and support vector machine on Landsat-8 and Sentinel-2 imagery in GEE to identify fire patches. The authors used multiple datasets, including a Land Use Land Cover map (with six forest classes merged into a single forest class) and the FireCCI51 dataset. They obtained an accuracy of 99% for all three classifiers on both satellites. However, in terms of the Kappa coefficient, the SVM model did not perform as well on the Landsat-8 satellite images.

The authors of [19] used Fuzzy C-Means to detect and segment wildfires. They modified the fuzzy c-means algorithm and compared it with k-means clustering, evaluating both models using image quality parameters. The models were trained on RGB and CIELab images and evaluated using the structural similarity index, mean squared error, normalized mean square error, root mean square error, Laplacian mean square error, peak signal-to-noise ratio, mean absolute error, normalized cross-correlation, normalized absolute error, average difference, maximum difference, and structural content. On these assessments, modified fuzzy c-means using CIELab images gave a mean square error (MSE) of 908.67, modified fuzzy c-means gave an MSE of 2592.6, k-means gave an MSE of 5527.7, and k-means using CIELab images gave an MSE of 1848. Based on the comparison, it was concluded that modified fuzzy c-means was the best model. The authors propose using genetic algorithms along with modified fuzzy c-means to further improve the system's performance.

The authors of [20] used Random Forests to predict forest fires. They collected forest fire data from the global fire emission database (GFED) and meteorological data from the European Centre for Medium-Range Weather Forecasts. The study focused on the Bruno Islands using data from 1998 to 2015: data from 1998 to 2013 was used to train the random forest model, and data from 2014 to 2015 to test it. The model was used as a regressor to predict the location of fire on the GFED map and was evaluated using mean absolute error and relative mean square error.

Another study [21] used random forests to classify satellite images into five classes based on burn severity: unburnt, crown consumption, crown unburnt, partial crown scorch, and crown scorch. Apart from the images, the normalized burn ratio, burn area index, normalized difference vegetation index, visible atmospherically resistant index, and normalized difference water index were used to aid the classification task. After training, the model achieved an accuracy of 90% on test data. The images, 10,855 in total, were collected using Google Earth Engine and covered sixteen large wildfires in the south-eastern regions of Australia.

The authors of [22] also used Random Forest, along with other conventional machine learning algorithms, to classify satellite images into four classes based on fire severity: high severity, low severity, moderate-low severity, and moderate-high severity. The study was carried out in the Zagros Mountains of western Iran, with data collected from the Sentinel-2 satellite via Google Earth Engine for the period 2012–2019, covering 1840 fires. The collected data was used to calculate the difference in the Normalized Burn Ratio, which served as an essential factor in deciding the classes. The models were evaluated using ROC AUC values: after training and testing, logistic regression gave an AUC of 0.875, and the fuzzy multi-criteria evaluation model gave an AUC of 0.585.

2.3 Image processing based approaches

Image processing encompasses various ways to process, create, display, and communicate images; it helps extract insights from images and detect various features [23]. In our literature review, we identified multiple image processing techniques for identifying burnt and burning areas in an image, which are explained briefly in this section.

The authors in [24] used histogram averaging and image entropy to detect smoke. Based on the spread and dispersion of the histogram, the presence of smoke (or clouds) could be inferred. However, this method could not detect small amounts of smoke, i.e., when a fire has just started. The authors therefore proposed a new method: the image was divided into smaller regions, converted to grayscale, and binarized, and a histogram was drawn for each sub-region. The sub-region with the highest density of white pixels indicated smoke in that area. One significant limitation of this approach is that the algorithm cannot differentiate between smoke and clouds, often resulting in false positives.

The study in [25] models the spread of forest wildfires. By combining a novel method of fire heat conduction with image processing for estimating vegetation density, the authors aim to reduce wildfire spread. The study was conducted in the region of the 2013 California Rim Fire by querying Google Earth Engine and extracting RGB images, which were then converted to HSI and frequency spectra. Apart from the images, the authors also profile the vegetation in the region and incorporate it into their probabilistic modeling. Since the system relies on uncrewed aerial vehicles, it overcomes the limitations of crewed systems in flexibility, safety, convenience, and affordability. The authors compared and evaluated the model's performance using existing data on the forest fires that occurred in 2013 around the California Rim Fire.

[26] examines the feasibility of using data generated by the Himawari-8 geostationary satellite for the 2015 Esperance, Western Australia wildfires, thereby using this data to identify the rate and direction of fire spread. The authors used the MODIS algorithm, applying water and cloud masking to remove water bodies and clouds from the satellite images. The infrared band of the Advanced Himawari Imager (AHI) has a spatial resolution of 2 km.

2.4 Hybrid approaches

Hybrid models use more than one technique for the proposed task; the techniques could be machine learning, deep learning, or image processing. The advantage of hybrid models is that they incorporate the strengths of all the techniques involved while reducing their combined weaknesses [27]. In our study, we reviewed some hybrid models, which are described in this section.

In [28], the authors presented an extensive review of machine learning and deep learning models for detecting and countering wildfires, including a comparative study of fire detection and smoke detection methods. The authors applied models such as YOLO, Haar cascade, SSD, LBP cascade, and Faster R-CNN to multiple datasets. The study utilized three datasets: a real smoke dataset with 12,000 images, a simulated smoke dataset with 12,000 images, and a new dataset of wildfire video recordings captured by uncrewed aerial vehicles. The cascade-based machine learning methods performed worse than the deep learning models: LBP gave an accuracy of 0.813, Haar 0.874, YOLOv2 0.983, Faster R-CNN 0.959, and SSD 0.811 on the testing data. After comparing the models on accuracy, precision, recall, and FPS, Faster R-CNN and YOLO were found to be the best-performing models.

2.5 Literature review summary

Table 1 recapitulates the literature review, identifying the datasets and algorithms used and the results obtained. Our literature review was divided into four parts, covering papers that used machine learning-based techniques, deep learning-based techniques, image processing-based techniques, and hybrid techniques that combine multiple approaches. Based on our review, we can conclude that:

  • The datasets used by most researchers are image datasets of aerial images of wildfires and forest fires. These images are generally extracted from satellites like Landsat 8 and Sentinel 2.

  • Random Forest was found to be the first choice for classification problems, followed by other machine learning models like Support Vector Machine, Decision Tree, and Clustering algorithms like K means.

  • In the case of deep learning, the literature is dominated by transfer learning models like AlexNet, GoogLeNet, VGG13, U-ResNet34, InceptionV3, and SqueezeNet. We also see ensembles of deep learning models like convolutional neural networks, recurrent neural networks, and Long Short-Term Memory networks. The literature also highlights the use of spatio-spectral neural networks for forest fire classification.

  • The literature review also shows the use of various versions of U-Net for the segmentation of burnt areas and the use of YOLO for making boundary boxes around the burnt regions.

  • Image processing techniques included MODIS, Haar and LBP cascades, and SSD to assist other models. The review also highlights the use of histogram averaging and image entropy for smoke detection.

Table 1 Summary of Research Works Reviewed

Based on our in-depth literature review, we have identified the following research gaps:

  • During our review, we found extensive use of machine learning, deep learning, and image processing techniques individually, but little integration of multiple techniques together.

  • Many studies have used offline datasets and have yet to address the challenges posed by real-time data.

  • The literature rarely reports the response time of the proposed systems; given the time-critical nature of fires, detection and response times should be as low as possible.

  • Researchers have used various techniques, datasets, and evaluation parameters, which makes it challenging to identify the state-of-the-art (SOTA) technique and hampers comparison between methods due to inconsistent evaluation and unavailable datasets.

3 Methodology

3.1 Geographical area of study

The area of study is confined to the state of California in the United States of America. The region has seen many wildfires over the years due to its vast forest belt. We cover over 1600 fires from 2013 to 2020, which burnt 7,494,562 acres, caused 181 fatalities, and injured 420 people. The data used in this study was adopted from the California Wildfire dataset available on Kaggle [29]. The dataset consists of 1600 instances with a feature set of 40 columns, stating the area burnt, active dates, extinguishing date, longitude, latitude, locality of the region, and the number of casualties. Figure 1 helps visualize the extent of the fires.

Fig. 1 A visual summary of fires in California from 2013 to 2020

3.2 Data collection and preprocessing

The 1600 location points (latitude and longitude) of the dataset mentioned in the previous sub-section were used to extract burnt regions from the data collected by Landsat 8. Choosing Landsat 8 for our wildfire classification and segmentation project was justified by its continuous data availability from 2013 to the present; this extensive temporal coverage gave us access to a consistent and comprehensive dataset spanning several years. With the help of Google Earth Engine, we created a custom dataset by querying the longitude, latitude, start date of the fire, end date of the fire, the box size of the image (set as ± 0.05 and ± 0.1), and the scaling factor for the satellite images (set to 20, 30, 40, and 50). This resulted in a total of 12,800 samples to start with; the count decreases after the preprocessing steps. Figure 2 shows the process of data collection and cleaning, and Algorithm 1 provides the pseudo-code for the image extraction process. After the dataset is loaded, the images of the given locations are extracted from Google Earth Engine. Each image is acquired in 7 bands: bands 2, 3, and 4 are combined to form an infrared image, while bands 5 and 6 are combined to compute the Normalized Burn Ratio (NBR). The infrared images had to be cleaned before further processing. While most images were clean, some contained noise in the form of black pixelated regions, while others had a cloudy or smoky view, which was also treated as noise, as seen in Fig. 3. In such cases, data from the same location in a different time frame was collected using the same process. Some locations, however, had noisy data in all time frames; hence, we had to drop these images. After the data was cleaned, we used image processing techniques to form ground truths in the form of boundary boxes and segmented regions of the burnt/burning areas in the collected images.
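To make the extraction step concrete, the following is a minimal Python sketch of the query described above, assuming the earthengine-api package and an authenticated Google Earth Engine account; the collection ID, visualization ranges, and the example coordinates and dates are illustrative assumptions rather than our exact settings.

```python
import ee

ee.Initialize()  # assumes prior ee.Authenticate()

def extract_fire_image(lat, lon, start_date, end_date, box=0.05):
    """Build a Landsat 8 composite around a fire location, as in Fig. 2."""
    region = ee.Geometry.Rectangle([lon - box, lat - box, lon + box, lat + box])
    composite = (ee.ImageCollection('LANDSAT/LC08/C02/T1_TOA')  # assumed collection ID
                 .filterBounds(region)
                 .filterDate(start_date, end_date)
                 .median()
                 .clip(region))
    # Bands 2, 3, and 4 are combined into the image used downstream.
    thumb_url = composite.getThumbURL({'bands': ['B4', 'B3', 'B2'],
                                       'region': region, 'dimensions': 256,
                                       'min': 0.0, 'max': 0.4, 'format': 'png'})
    # Bands 5 and 6 give the Normalized Burn Ratio, per the text above.
    nbr = composite.normalizedDifference(['B5', 'B6'])
    return thumb_url, nbr

# Hypothetical query for one record of the Kaggle dataset.
url, nbr = extract_fire_image(37.85, -120.08, '2013-08-17', '2013-10-24')
```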

Fig. 2 Data extraction and cleaning

Fig. 3 Different types of raw images acquired from the dataset

For the purposes of this paper, we define burning regions/areas as those that are currently burning, with fire and smoke visible in the current time frame; regions that appear as black patches with no signs of smoke or fire in the current time frame are called burnt regions/areas. Following this definition, images labeled as having burning or burnt regions contain at least one such region, while these regions are absent from images labeled as having none.

The original image generated at the end of the pipeline in Fig. 2 is 256 × 256 pixels with a resolution of 72 pixels/inch. The same specifications apply in Fig. 4.

Fig. 4 Data preprocessing

Figure 4 shows the flow of the preprocessing steps. The extracted image from Fig. 3 is passed through an anisotropic filter, which preserves edge information while effectively reducing noise, improving image clarity and detail retention. The filtered image is resized as a precautionary step to maintain a uniform size of 256 × 256. The images are then converted to grayscale, and binary thresholding is performed to obtain the high-density area, i.e., the region where the pixel intensity is minimal (in the reference image in Fig. 3, the black region signifying the burnt area is the high-density area); this becomes the segmented mask. It should be noted that binary thresholding was performed manually for each image, since no single threshold generalized across the dataset; the threshold was therefore selected per image. Furthermore, some images with multiple high-density areas or with water bodies required binary thresholding to be performed multiple times, motivating the need for an automated deep learning process. The high-density region is then used to draw a bounding box and to create a segmented image, as indicated by the reference images in the subsequent steps.

Algorithm 1 Pseudo code for image preprocessing
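As a rough Python rendering of the preprocessing flow in Fig. 4 and the pseudo-code above, the sketch below uses OpenCV's contrib module; the diffusion and threshold parameters are placeholders, since, as noted, thresholds were selected per image.

```python
import cv2
import numpy as np

def preprocess(path, thresh=60):  # thresh is a per-image placeholder value
    img = cv2.resize(cv2.imread(path), (256, 256))    # precautionary resize
    # Anisotropic diffusion: reduces noise while preserving edges.
    filtered = cv2.ximgproc.anisotropicDiffusion(img, alpha=0.1, K=0.02, niters=10)
    gray = cv2.cvtColor(filtered, cv2.COLOR_BGR2GRAY)
    # Burnt regions are dark, so inverse thresholding keeps low-intensity pixels.
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:                                   # no dark region found
        return mask, img, np.zeros_like(img)
    # Bounding box around the largest high-density (dark) region.
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    boxed = cv2.rectangle(img.copy(), (x, y), (x + w, y + h), (0, 0, 255), 2)
    segmented = cv2.bitwise_and(img, img, mask=mask)   # segmented image
    return mask, boxed, segmented
```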

3.3 Proposed models

This study performs a two-step classification process. We first use classifiers to perform image-wise classification, determining whether the input image contains a burning/burnt area. We then perform pixel-wise classification on the images that do, to obtain the segmented region of the burnt/burning area. This methodology aligns with the approach in [30], which also utilized two-stage classification for enhanced model building in identifying and delineating fire-affected zones in UAV-based imagery. The dataset had 12,800 images, which we deemed sufficient for this stage of research. The dataset was imbalanced between images with and without burnt/burning regions, but since this imbalance is natural, wildfires being rare events, we decided to proceed with the unbalanced dataset. Based on the above steps, we used the following classifiers and compared their performance with each other.

  • Image-wise classification

    o Convolutional Neural Network (CNN)

    o AlexNet

Utilizing CNN and AlexNet for wildfire detection in satellite images offers several advantages. These architectures enable automatic feature extraction, facilitating the identification of complex patterns indicative of wildfires. With their deep neural network structures, they can recognize subtle variations in image features, enhancing detection accuracy. Additionally, AlexNet, as a transfer learning model, leverages pre-trained weights, accelerating deployment and adaptation to new tasks with minimal data requirements. Their scalability supports real-time processing of large datasets, crucial for timely wildfire monitoring and management. Overall, these methods provide a comprehensive approach to wildfire detection, combining advanced machine learning techniques with efficient processing capabilities.

  • Pixel-wise classification for segmentation

    o UNet

    o SegNet

We emphasize the strategic choice of UNet and SegNet architectures for wildfire segmentation in satellite imagery due to their distinct advantages. UNet's symmetric encoder-decoder architecture facilitates precise delineation of wildfire boundaries, crucial for accurate detection and monitoring. Meanwhile, SegNet's encoder-decoder design, supplemented by pooling indices, ensures robust performance with large datasets and class imbalances, essential for processing satellite imagery. These architectures offer efficient resource utilization, promising superior accuracy and efficiency in wildfire detection and management, thereby contributing significantly to proactive wildfire prevention and control efforts.

3.3.1 Convolutional neural network

A convolutional neural network is a deep neural network commonly used to process image inputs. The basic purpose of a CNN is to reduce the number of trainable parameters while building a network complex enough to process massive amounts of data [31]. A custom CNN model is designed to classify the images into two classes: burnt area (a burnt area is present in the image) and no burnt area (no area in the image appears burnt). Fig. 5 shows the detailed architecture of the custom CNN model. The input images were converted to binary and reshaped to 256x256. The architecture has three convolutional layers, each followed by a max-pooling layer, followed by a flatten layer and two dense layers. The convolutional layers also act as the backbone network for the encoder of the proposed U-Net architecture.
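A minimal Keras sketch of this architecture follows; the filter counts and dense-layer width are assumptions, as the text above fixes only the layer types and input size.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_custom_cnn(input_shape=(256, 256, 1)):
    """Three conv + max-pool blocks, flatten, and two dense layers."""
    model = keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation='relu', padding='same'),   # assumed filters
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation='relu', padding='same'),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, activation='relu', padding='same'),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dense(1, activation='sigmoid'),  # burnt area vs. no burnt area
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
```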

Fig. 5 Proposed custom CNN architecture

3.3.2 AlexNet

AlexNet is a deep convolutional neural network with 650,000 neurons and 60 million parameters. It consists of five convolutional layers, some followed by max-pooling layers, three fully connected layers, and finally an output layer with 1000 neurons and a SoftMax activation function. Dropout was used in the model to prevent overfitting. AlexNet was the winner of ILSVRC-2010, where it gave a 37.5% top-1 error and a 17.0% top-5 error on the ImageNet test data. A modified version of the network entered ILSVRC-2012, where it achieved a 15.3% top-5 error [32].

To compare with the results obtained by the custom CNN, the AlexNet architecture was used to train a model. The input size and other parameters, such as the number of epochs, batch size, and number of training and testing samples, were kept unchanged. Fig. 6 shows the AlexNet model used.
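Since AlexNet is not bundled with keras.applications, a common approach is to rebuild its layer stack (optionally loading ported ImageNet weights) and swap the 1000-way output for a binary head. The sketch below follows the original layer counts; the two-class sigmoid output is our adaptation.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_alexnet(input_shape=(227, 227, 3)):
    """AlexNet-style stack with a binary head for fire vs. no fire."""
    model = keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(96, 11, strides=4, activation='relu'),
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(256, 5, padding='same', activation='relu'),
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(384, 3, padding='same', activation='relu'),
        layers.Conv2D(384, 3, padding='same', activation='relu'),
        layers.Conv2D(256, 3, padding='same', activation='relu'),
        layers.MaxPooling2D(3, strides=2),
        layers.Flatten(),
        layers.Dense(4096, activation='relu'),
        layers.Dropout(0.5),                 # dropout against overfitting
        layers.Dense(4096, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(1, activation='sigmoid'),  # replaces the 1000-way SoftMax
    ])
    return model
```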

Fig. 6 AlexNet architecture

3.3.3 U-Net

U-Net was introduced by Olaf Ronneberger et al. [33] to tackle the ISBI challenge of segmenting microscopic images of size 512x512. The network consists of an encoder and a decoder. The encoder of our U-Net uses the convolutional layers of the proposed CNN model as the backbone network. The encoder is built from units consisting of two convolutional layers with a 3x3 filter size and ReLU activation, followed by a max-pooling layer with a 2x2 filter size and stride 2, whereas the decoder is built from units consisting of a 2x2 up-convolutional layer followed by two 3x3 convolutional layers. The basic concept is to classify each pixel as 0 or 1, i.e., whether or not the pixel is part of the mask.

Fig. 7 shows the U-Net architecture used for segmenting the burnt area. The network takes two sets of data as input: the original infrared image and the corresponding mask acquired during preprocessing, and it is trained to predict the mask for unseen images. The input is of size 224x224, and the output is a binary image mask of size 224x224. The hyperparameters were fine-tuned to suit the training process. The network consists of four encoder and four decoder units: each encoder unit comprises two convolutional layers with a 3x3 filter size and a max-pooling layer with a 2x2 pool size, and each decoder unit comprises a transpose layer to upsample and concatenate the inputs, followed by two convolutional layers with a 3x3 filter size. For training, we used the dice score as the evaluation metric and the BCE Dice loss (combining Binary Cross-Entropy and Dice) as the loss function; both are explained in detail in Section 4.2.2.
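The following is a condensed Keras sketch of the four-level U-Net just described (224x224 input, 3x3 convolutions, 2x2 pooling, and 2x2 transposed convolutions with skip connections); the filter counts are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_unet(input_shape=(224, 224, 1)):
    inputs = keras.Input(shape=input_shape)
    skips, x = [], inputs
    for filters in [64, 128, 256, 512]:                # 4 encoder units
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
        skips.append(x)                                # saved for skip connection
        x = layers.MaxPooling2D(2, strides=2)(x)
    x = layers.Conv2D(1024, 3, padding='same', activation='relu')(x)  # bottleneck
    for filters, skip in zip([512, 256, 128, 64], reversed(skips)):   # 4 decoder units
        x = layers.Conv2DTranspose(filters, 2, strides=2, padding='same')(x)
        x = layers.Concatenate()([x, skip])            # skip connection
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    outputs = layers.Conv2D(1, 1, activation='sigmoid')(x)  # pixel-wise mask
    return keras.Model(inputs, outputs)
```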

Fig. 7 U-Net architecture used

3.3.4 SegNet

SegNet was introduced by Vijay Badrinarayanan et al. [34]. Like UNet, SegNet consists of an encoder and a decoder. The model is based on the VGG16 network design, discarding the fully connected layers; hence, the 13 convolutional layers in the encoder correspond to 13 convolutional layers in the decoder. ReLU is used as the activation function for all layers. Like UNet, SegNet performs pixel-wise classification. The significant difference between them is that SegNet passes pooling indices from the encoder to the decoder, whereas UNet passes the entire feature map [35].

To compare with the results obtained by our primary algorithm, U-Net, we used SegNet, trained on the same input. Fig. 8 shows the architecture of the layers used in the SegNet model. The model consists of five encoder and five decoder units: each encoder unit comprises two convolutional layers, two batch normalization layers, and a max-pooling layer, and each decoder unit comprises a transpose layer, three batch normalization layers, and two convolutional layers. For training, the model used the BCE Dice loss as the loss function, explained in detail in Section 4.2.2. The model architecture is inspired by [36], with an additional bottleneck layer added to the original design.
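A Keras sketch of the SegNet-style units described above follows. Note that, matching the description in the text, this variant upsamples with transposed convolutions rather than reusing true max-pooling indices; the filter counts are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def encoder_unit(x, filters):
    # Two conv + batch-norm layers, then max pooling.
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
    return layers.MaxPooling2D(2)(x)

def decoder_unit(x, filters):
    # Transposed conv with batch norm, then two conv + batch-norm layers.
    x = layers.Conv2DTranspose(filters, 3, strides=2, padding='same')(x)
    x = layers.BatchNormalization()(x)
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
    return x

def build_segnet(input_shape=(224, 224, 1)):
    inputs = keras.Input(shape=input_shape)
    x = inputs
    for f in [64, 128, 256, 512, 512]:     # 5 encoder units
        x = encoder_unit(x, f)
    for f in [512, 512, 256, 128, 64]:     # 5 decoder units
        x = decoder_unit(x, f)
    outputs = layers.Conv2D(1, 1, activation='sigmoid')(x)  # pixel-wise mask
    return keras.Model(inputs, outputs)
```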

Fig. 8 SegNet architecture used

4 Experimental analysis

4.1 Model hyperparameters

To optimize the performance of the models, we fine-tune the hyperparameters. Using the search spaces shown in Table 2, we find the best hyperparameter setting for each network. The search space defined in Table 2 was used for all models, and the best parameters were selected by comparing the evaluation parameters defined in Section 4.2.

Table 2 Hyperparameter Search Space
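Conceptually, the search reduces to an exhaustive loop over Table 2's grid, as in the sketch below; since the exact values live in Table 2, the candidates shown are placeholders, and x_train, y_train, x_val, and y_val are assumed to be the prepared data splits.

```python
import itertools
from tensorflow import keras

search_space = {                          # placeholder candidates; see Table 2
    'learning_rate': [1e-2, 1e-3, 1e-4],
    'batch_size': [16, 32, 64],
    'epochs': [40, 60, 80],
}

best_score, best_params = -1.0, None
for lr, bs, ep in itertools.product(*search_space.values()):
    model = build_custom_cnn()            # from the sketch in Section 3.3.1
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss='binary_crossentropy', metrics=['accuracy'])
    history = model.fit(x_train, y_train, batch_size=bs, epochs=ep,
                        validation_data=(x_val, y_val), verbose=0)
    score = max(history.history['val_accuracy'])
    if score > best_score:                # keep the best validation setting
        best_score, best_params = score, (lr, bs, ep)
```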

4.2 Evaluation parameters

4.2.1 Classification parameters

We have used various evaluation parameters for our classifiers. The first is accuracy, the proportion of correctly classified samples out of the total number of samples [37]. Accuracy can be found using equation (1), where TP, FN, TN, and FP are true positives, false negatives, true negatives, and false positives, respectively, obtained from the confusion matrix.

$$accuracy= \frac{TP+TN}{TP+FP+TN+FN}$$
(1)

Precision is the proportion of true positives among the total predicted positives. It indicates how reliably the system predicts positives [38]. Precision can be found using equation (2).

$$precision= \frac{TP}{TP+FP}$$
(2)

Recall is the proportion of true positives among the actual positives. This parameter measures the model's ability to identify positive samples [38]. Equation (3) helps calculate recall.

$$recall= \frac{TP}{TP+FN}$$
(3)

The F1 score is the harmonic mean of the model's precision and recall. Because it combines precision and recall, it eases the process of monitoring the model [39]. Equation (4) shows how to find the F1 score.

$$F1-score= \frac{2\times precision\times recall}{precision+recall}$$
(4)

The ROC curve, or receiver operating characteristic curve, is a plot of TPR against FPR, and the ROC-AUC score is the area under this curve. This score measures a classifier's ability to differentiate between classes [39]. The ROC-AUC score can be found using equation (5).

$$ROC\text{-}AUC\;score={\int }_{0}^{1}TPR\left(FPR\right)\,dFPR$$
(5)

TPR is also called sensitivity or recall. TNR, or specificity, is the proportion of true negatives among total negatives. FNR is the proportion of false negatives among total positives, whereas FPR is the ratio of false positives to total negatives [40]. Equations (6), (7), (8), and (9) show how to calculate TPR, FPR, TNR, and FNR.

$$TPR= \frac{TP}{TP+FN}$$
(6)
$$FPR= \frac{FP}{TN+FP}$$
(7)
$$TNR= \frac{TN}{TN+FP}$$
(8)
$$FNR= \frac{FN}{TP+FN}$$
(9)
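All of equations (1) through (9) can be computed directly with scikit-learn, as the sketch below illustrates; y_true and y_score are assumed to be NumPy arrays holding the ground-truth labels and the predicted fire probabilities.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_pred = (y_score >= 0.5).astype(int)           # assumed decision threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

metrics = {
    'accuracy':  accuracy_score(y_true, y_pred),   # eq. (1)
    'precision': precision_score(y_true, y_pred),  # eq. (2)
    'recall':    recall_score(y_true, y_pred),     # eq. (3)
    'f1':        f1_score(y_true, y_pred),         # eq. (4)
    'roc_auc':   roc_auc_score(y_true, y_score),   # eq. (5)
    'tpr': tp / (tp + fn),                         # eq. (6)
    'fpr': fp / (tn + fp),                         # eq. (7)
    'tnr': tn / (tn + fp),                         # eq. (8)
    'fnr': fn / (tp + fn),                         # eq. (9)
}
```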

4.2.2 Segmentation parameters

The Dice score is the ratio of twice the overlap area between the real and the predicted masks to the total number of pixels in both masks [41]. Equation (10) assists in determining the Dice score.

$$Dice\;Score= \frac{2*|A\cap B|}{\left|A\right|+|B|}$$
(10)

To calculate binary cross-entropy, each predicted probability is compared to the actual class output, which can be either 1 or 0, i.e., a binary output. The measure penalizes probabilities according to their deviation from the expected value, reflecting the distance between the estimate and the actual value [42]. Binary cross-entropy is mathematically represented by equation (11).

$$Binary\;Crossentropy= -\frac{1}{N}\sum\nolimits_{i=1}^{N}\sum\nolimits_{j=1}^{M}{y}_{ij}\,\text{log}\left({\widehat{y}}_{ij}\right)$$
(11)

where, N is the number of images and M is the number of classes.

BCE Dice loss is a proposed parameter incorporating the properties of both the dice score and binary cross-entropy. It can be found using equation (12).

$$BCE\;Dice\;Loss=0.5*Binary\;Crossentropy-Dice\;Score$$
(12)
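A TensorFlow sketch of equations (10) through (12) as used for training is given below; the smoothing constant is an assumption added to avoid division by zero on empty masks.

```python
import tensorflow as tf

def dice_score(y_true, y_pred, smooth=1.0):
    # eq. (10): 2|A ∩ B| / (|A| + |B|), with an assumed smoothing constant.
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)

def bce_dice_loss(y_true, y_pred):
    # eq. (12): 0.5 * Binary Crossentropy - Dice Score.
    bce = tf.keras.losses.binary_crossentropy(y_true, y_pred)  # eq. (11)
    return 0.5 * tf.reduce_mean(bce) - dice_score(y_true, y_pred)

# Usage: model.compile(optimizer='adam', loss=bce_dice_loss, metrics=[dice_score])
```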

4.3 Results

The experiments were performed in the GPU environment of a Dell G3 15 3500 with a 10th-generation Intel Core i5 processor, Windows 11, 16 GB of RAM, and an NVIDIA GeForce 1650 Ti GPU with 4 GB of memory.

4.3.1 Classification results

As stated earlier, our dataset consists of infrared images: 4112 in total, of which 3077 belong to class 1, i.e., have burning or burnt regions (for simplicity, images with fire), and the remaining 1035 belong to class 0, i.e., have no burnt or burning regions (images with no fire). The dataset was split 80:20 for training and testing the models. Table 3 summarizes the classification statistics.

Table 3 Classification Summary

Based on the results in Table 3, the proposed CNN performs better than AlexNet during testing and validation, whereas AlexNet performs better during the training phase. In terms of class-wise performance, CNN has the upper hand in precision for class 0 (no fire) and in recall and F1-score for class 1 (fire). In terms of class-wise accuracy, CNN achieves 0.9635 for class 0 and 0.9634 for class 1, whereas AlexNet achieves 0.9566 for class 0 and 0.9566 for class 1, indicating that, overall, the CNN model developed by us performs better than AlexNet.

Since the dataset used in this study is new, we also implemented some of the satellite image classification models identified during the literature review. As Table 3 shows, the proposed CNN outperforms GoogLeNet and VGG13 used in [7] and BushFireNet proposed in [43], with average performance improvements of 18.77%, 24.31%, and 48.62%, respectively. The proposed CNN not only outperforms GoogLeNet, VGG13, and BushFireNet but also exhibits faster inference, making it a more suitable choice for real-time satellite image classification applications.

4.3.2 Segmentation results

After classification, the images containing fire, i.e., burnt or burning regions, were sent on for segmentation. Segmentation lets us identify the size of the burning or burnt areas and the degree of burn via the Normalized Burn Ratio. This data helps forest and wildfire departments take action: extinguishing active fires or, in the case of burnt areas, restoring the ecological diversity of the region. As stated earlier, we used UNet and SegNet for segmentation; Table 4 summarizes the results of both models.

Table 4 Segmentation Summary

For the testing data, the dice loss, binary cross-entropy, and dice score are all marginally better for UNet than for SegNet. On the training data, UNet performs better in terms of dice loss, while SegNet performs better in terms of dice score and binary cross-entropy; on the validation data, UNet performs better in terms of dice score, while SegNet performs better in terms of binary cross-entropy and dice loss. The training time for UNet was 1 h 4 min versus 1 h 25 min for SegNet, indicating that UNet requires less time to train.

The proposed UNet model (dice score of 0.6869) improves performance by 2.33% compared to the UNet model in [33] (dice score of 0.6712); similarly, the proposed SegNet model (dice score of 0.6672) improves performance by 3.71% compared to the SegNet model in [36] (dice score of 0.6433).

We also compared our proposed models with existing models in the literature by implementing the UResNet34 proposed in [9]. As Table 4 shows, the proposed UNet model performs better than UResNet34 in terms of BCE Dice loss, dice score, and binary cross-entropy, with an average performance improvement of 11.54%. Not only does UNet perform better than the proposed SegNet and UResNet34, but it also has a lower inference time of 3.24 s, compared to 5.65 s for SegNet and 8 s for UResNet34, making it more suitable for real-time fire detection applications.

The training and validation curves for UNet and SegNet, illustrated in Figs. 9 through 14, provide a comprehensive view of the models' performance across epochs. For UNet, Fig. 9 reveals a declining trend in both training and validation BCE Dice loss, which levels off after 40 epochs, suggesting that the model converges with a reduced risk of overfitting. This is mirrored in the Dice score in Fig. 11, where a steady increase is observed until stability is reached, indicative of a model that learns effectively without memorizing the training data. The Binary Cross-Entropy in Fig. 13 shows some volatility, yet its downward trajectory indicates ongoing model optimization.

Fig. 9 Training and validation BCE Dice loss for UNet

Fig. 10 Training and validation BCE Dice loss for SegNet

Fig. 11 Training and validation Dice score for UNet

Fig. 12 Training and validation Dice score for SegNet

Fig. 13 Training and validation binary cross-entropy for UNet

Similarly, SegNet's performance, captured in Figs. 10, 12, and 14, echoes this pattern of initial fluctuation followed by stabilization. The BCE Dice Loss in Fig. 10 and the Binary Cross Entropy in Fig. 14 converge after approximately 40 epochs, paralleling the findings for UNet. The Dice Score in Fig. 12, however, demonstrates a more gradual ascent towards an asymptote, suggesting a more conservative learning rate but eventual stable performance.

Fig. 14 Training and validation binary cross-entropy for SegNet

Both models demonstrate that after the initial learning phase, the rate of improvement in loss reduction and accuracy gain diminishes, which is characteristic of the convergence behavior of deep learning models. These visualizations not only confirm the models' capability to generalize but also help in identifying the epochs at which the loss functions begin to plateau, thus informing the potential stopping point for training to prevent overfitting and to conserve computational resources.

Figs. 15 and 16 show the results obtained by both models; both segment the test images correctly. An interesting observation in Fig. 15 is that, in some images, the model has segmented two burnt regions even though the testing masks only ever contained a single burnt/burning region, which makes this result intriguing and makes us appreciate the power of artificial intelligence. The images also show that the model works well for images with a low intensity range, many shades, and water bodies.

Fig. 15 Results for UNet

Fig. 16 Results for SegNet

5 Discussion

The study's focus revolves around analyzing the outcomes of employing deep learning techniques to classify and segment infrared images containing fire and burnt regions. The dataset encompasses a total of 4112 infrared images, divided into two classes: 3077 images represent class 1, corresponding to burning or burnt areas (referred to as images with fire), while 1035 images belong to class 0, depicting images without burning or burnt regions (denoted as images with no fire).

For the classification task, two deep learning models are utilized: AlexNet and a custom Convolutional Neural Network (CNN). The results, summarized in Table 3, reveal that although AlexNet exhibits superior performance during the training phase, the custom CNN slightly surpasses it in the testing and validation phases. An exploration of class-wise metrics shows that the custom CNN excels in precision for class 0 (no fire) and in recall and F1-score for class 1 (fire).

Delving further into accuracy, the custom CNN achieves higher accuracy for both classes than AlexNet, signifying better overall performance. Following classification, images containing fire are subjected to segmentation using the UNet and SegNet models. The outcomes, outlined in Table 4, show that UNet marginally outperforms SegNet in dice loss, binary cross-entropy, and dice score on the testing data, whereas during training, UNet leads in dice loss while SegNet leads in dice score and binary cross-entropy.

Additionally, UNet demonstrates shorter training and inference times than SegNet, highlighting its swifter performance. The evolution of the losses and scores during training is illustrated in Figs. 9, 10, 11, 12, 13 and 14; both UNet and SegNet exhibit stabilized curves after approximately 40 epochs, suggesting a reduced risk of overfitting.

Visual results of the segmentation process are displayed in Figs. 15 and 16, showcasing the accurate segmentation of test images by both models. An intriguing finding surfaces: the models are capable of identifying multiple burnt regions in particular images, even when ground truth masks exclusively denote a single burnt region. Furthermore, the models are adept at handling challenging scenarios, such as images with low-intensity ranges, intricate shades, and water bodies, underscoring their proficiency in complex environments.

In essence, the study furnishes a comprehensive analysis of the application of deep learning models to categorize and segment infrared images with fire and burnt regions. The custom CNN emerges as the stronger performer in testing, while UNet demonstrates a slight edge in segmentation over SegNet. The stability in training trends and adeptness in diverse conditions underline the models' robust capabilities. When compared to existing methods in the literature, our approach introduces several enhancements and novel contributions. Our model's initial classification stage leverages a custom-designed convolutional neural network (CNN) that outperforms traditional architectures like AlexNet, providing improved accuracy in detecting wildfire presence in satellite images. The subsequent segmentation stage, utilizing UNet and SegNet, allows for precise localization and delineation of fire-affected areas, an improvement over previous methods that employed less sophisticated segmentation techniques.

Recent literature shows the use of transformers [44], attention gates [45], and residual connections [46] to improve the performance of UNet models; the issue with these networks is that they have larger inference times and require more computational power. We therefore opted for a simple CNN-based UNet, which has a small inference time, is more interpretable, and requires less computational power, making it more suitable for real-time deployment. Using CNNs in a UNet helps detect edges, textures, and more complex patterns by stacking convolutional layers, making it suitable for wildfire detection in satellite images [33]. Convolutional layers also share weights across different parts of the input image, providing translation invariance: the same feature can be detected regardless of its position in the image, which helps identify burning/burnt regions in a satellite image wherever they appear [47].

In terms of the dataset, we use a truly representative sample of the data; that is, we train our model on an imbalanced dataset. Moreover, the integration of manual and automated processes in data curation helps reduce the noise and inconsistencies often present in large-scale satellite image datasets. This results in higher data quality and more reliable model training outcomes, positioning our approach as an advanced solution in the field of wildfire detection and segmentation.

6 Application areas

The proposed system has applications in various sectors. The data collected and inferred could be stored in a data warehouse for historical analytics of the region. Governments can identify fire-prone areas and take appropriate actions, like keeping a rapid action force ready in the region, building water reservoirs for extinguishing fires, and training locals to respond in case of a fire. Insurance agencies could use the system to identify fire-prone regions and charge higher premiums in those areas. In deploying the proposed wildfire detection system, several challenges and ethical considerations must be addressed. Privacy concerns may arise regarding the collection and utilization of satellite imagery, particularly for the monitoring of sensitive or private areas. Moreover, rapid action measures based on automated detection could inadvertently lead to unintended consequences, such as false alarms or unnecessary interventions, impacting both human resources and the environment.

7 Conclusion and future scope

Forest fires have caused immense loss of life and property over the years and pose a continuing threat to nature. Machine learning and deep learning have been used to track fires, predict the direction of spread, and detect fires in aerial and satellite imagery. In this paper, we detect burnt/burning regions in satellite images captured by the Landsat 8 satellite and then pass the images containing such regions on to segmentation. For our experiments, we queried the longitude, latitude, start date, and end date of each wildfire to Google Earth Engine; after extracting the seven bands, we formed an infrared image. Using this infrared image, we created the segmentation mask, segmented image, and bounding-box image with the algorithm we curated. The extracted images were then divided into training and testing sets and passed to a custom CNN model and an AlexNet model for classification. The custom CNN gave an accuracy of 88.19%, slightly better than AlexNet's 88.08%. The images with burnt/burning regions were then passed to the segmentation algorithms, U-Net and SegNet, for localization; U-Net performed slightly better than SegNet, with a dice score of 0.6869. Hence, we propose the combination of the custom CNN and U-Net for detecting burnt/burning regions in infrared images.

The proposed future scope of this study encompasses several key enhancements to the wildfire detection system. One significant aspect involves developing a user-friendly GUI tailored for forest officials, facilitating seamless interaction with the system and enabling efficient monitoring of and response to wildfire incidents. To foster indigenous technologies and the system's reliance on domestic sources, we plan to integrate information obtained through ISRO satellites and help provide responsive actions. Furthermore, convolutional LSTMs hold promise for enhancing predictive capabilities by forecasting the direction of fire spread and analyzing fire behavior over time. Similarly, leveraging YOLOv3 to automate the generation of bounding boxes in satellite images can streamline the data analysis workflow and improve the efficiency of wildfire detection algorithms. However, potential limitations must be acknowledged, including the feasibility and technical challenges of integrating ISRO satellite data into the system. Moreover, adopting convolutional LSTMs and YOLOv3 may introduce additional computational complexity, requiring careful consideration of resource constraints and optimization strategies to ensure practical implementation in real-world scenarios.