1 Introduction

Wildfire, whether natural or human-induced, is a complex phenomenon with a large number of uncertain and unpredictable driving factors (Jaafari, Mafi-Gholami et al. 2019). Wildfire plays a profound role in structuring ecosystems and change processes, including vegetation succession and landscape transformation (Liu et al. 2019; Bui et al. 2019). However, it also has a wide range of negative impacts on the environment and ecological processes, such as deforestation and habitat destruction, and can threaten the diversity and stability of ecosystems as well as human life and property (Kang et al. 2020; Quan, Xie et al. 2021). Therefore, wildfire probability modeling is critical for fire prevention and management.

Weather variables play a driving role in wildfire occurrence and affect fire ignition and spread by altering fuel availability and soil moisture (Semeraro et al. 2016; Jaafari, Termeh et al. 2019b; Jaafari, Zenner et al. 2019). Many studies and fire danger rating systems, therefore, rely on meteorological data for wildfire probability warning, with the air temperature (AT), relative humidity (RH), wind speed (WS), and accumulated precipitation (AP) used most often (Jain et al. 2020). Since wildfire occurrence has a complicated mechanism of multifactor interactions among fuel, topography, and weather (Pyne et al. 1996), as well as human activities (Jaafari et al. 2017; Quan, Xie et al. 2021), considering only weather variables is not conducive to accurate and robust wildfire probability prediction.

Fuel is another critical element for fire ignition, spread, and combustion. The fuel variables, such as land cover or land use, live fuel moisture content (LFMC) (Quan et al. 2015; Quan et al. 2017) and dead fuel moisture content (DFMC) (Resco de Dios et al. 2015), fuel load (FL) (Quan, Li et al. 2021), forestry type (FT), and tree species (TS) play important roles in wildfire probability prediction, since they have an important influence on fire behaviors and regimes (Cao et al. 2017; Jaafari et al. 2017; Luo et al. 2019; Carrasco et al. 2021; Gale et al. 2021; Quan, Xie et al. 2021). Among them, the live and dead fuel moisture contents are most commonly considered in wildfire probability modeling at present (Yebra et al. 2013; Nolan, Boer et al. 2020; Quan, Xie et al. 2021) because the large-scale and dynamic monitoring of these variables is available, either from remote sensing data or meteorological data (Yebra et al. 2008; Nolan et al. 2016; Fan and He 2021; Quan, Yebra et al. 2021).

Compared with fuel variables, topography and infrastructure variables are static and remain unchanged for long periods. Topography alters fire regimes, especially in the mountainous and steep gorge areas. Slope, elevation, and aspect are the most important topographic variables broadly reported in wildfire probability modeling (Jaafari, Termeh et al. 2019; Janiec and Gadal 2020). Infrastructure variables are related to human activities, which can also alter fire regimes and spatial patterns of wildfire danger for decades (Syphard et al. 2008). The spatial frequency of human activities, such as the distance to roads, residential areas, rivers, and railways, is generally used to account for the infrastructure condition when modeling wildfire danger (Jaafari, Termeh et al. 2019; Mallinis et al. 2019; Janiec and Gadal 2020; Carrasco et al. 2021).

Wildfire probability is not only related to the current weather variables but also affected by the long-term and short-term changes in these variables (Luo et al. 2019; Stefanidou et al. 2019; Nolan, Blackman et al. 2020; Zong et al. 2020). The temporal characteristics of weather variables within a fixed time span (fixed-step weather variables, for example, the 10-day mean of the daily average RH) are generally used to model wildfire probability (Cao et al. 2017; Malik et al. 2021; Trang et al. 2022). However, rainfall has a large influence on the fixed-step weather variables, introducing additional uncertainties in wildfire probability modeling. This is because the temporal trends of weather variables may not be robust if rainfall occurs within a given fixed period. The number of continuous nonprecipitation days (CNPD) before a fire occurs can be used as a dynamic time span in place of the fixed one, which addresses this problem (Zheng et al. 2020). The dynamic-step weather variables (DSw), defined as the corresponding weather variables within CNPD (for example, the minimum RH within CNPD), can be used to model wildfire probability to reduce the uncertainty caused by rainfall. However, no existing study has focused on the effect of dynamic-step weather variables on wildfire occurrence.

To this end, our study aimed to assess wildfire probability for northwestern Sichuan Province in China, using weather, fuel, topography, and infrastructure data, as well as derived variables. In particular, the present study integrated the fuel and dynamic-step weather variables (fuel&DSw) to improve wildfire probability modeling. The random forest (RF) model and the extreme gradient boosting (XGBoost) model were selected to explore the relationships between these explanatory variables and wildfire occurrence and to compare their performance for wildfire probability modeling. The main objective of this study is to determine the main drivers of wildfire occurrence and to improve the accuracy of wildfire probability prediction by considering fuel&DSw.

2 Materials and Methods

This section first provides a brief introduction of the study area, and then elucidates the procedure of wildfire probability model development through a series of processes including wildfire database construction, model design and parameter optimization, and model performance evaluation.

2.1 The Study Area

The study area is located in the northwest of Sichuan Province in southwestern China (27° 57′ 00″–34° 18′ 00″ N, 97° 20′ 00″–104° 26′ 00″ E). The area covers most parts of Aba and Ganzi Autonomous Prefectures of Sichuan Province, with a total area of about 230,000 km², comprising 47.3% of Sichuan’s land area (Fig. 1a). Belonging to the eastern edge of the Qinghai-Tibet Plateau, this area is characterized by a complex topography, which is dominated by a mixture of large, high mountains and deep canyons, and has a mean altitude of 4000–5500 m above sea level (a.s.l.).

Fig. 1 Location of the study area in northwestern Sichuan (a) and yearly (b) and monthly (c) distribution of historical fire points (2003–2018) in forests, shrublands, and grasslands. The background map in (a) is from the online map World Imagery (ESRI 2009)

The alpine climate in our study area varies with altitude, but the region is dominated by the continental plateau monsoon climate. This region faces water scarcity (with an annual total precipitation of 400–900 mm but annual potential evapotranspiration of more than 1000 mm) despite the cold winters and cool summers (with average yearly air temperature of 0–16 °C). The fire risk of this region increases from January as relative humidity decreases, peaks in February with a high frequency of wildfire occurrence, and declines in June with the beginning of the rainy season.

Vegetation varies with altitude; the southeast of the study area is covered by forests, but the northern area is mainly occupied by meadows, as well as alpine shrublands. The forests are predominantly evergreen, with broadleaf species (mainly Quercus semecarpifolia Smith and Betula delavayi Franch.) dominating at moderate altitudes (2000–4000 m), conifers (mainly Picea asperata Mast., Abies fabri (Mast.) Craib, and Pinus densata Mast.) at higher altitudes (3000–4500 m), and mixed forests in between. Compared with forests, the alpine shrublands are distributed at higher altitudes, with dwarf species (mainly Rhododendron simsii Planch. and Salix cupularis Rehd.) mixed with forests or grasslands, according to the forest resource survey data of Sichuan Province (Sichuan Forestry and Grassland Administration 2017).

In the past two decades, this area has suffered some catastrophic fires (Fig. 1b), with a total of more than 2000 wildfire points, according to the statistics of historical wildfires from the MODIS Burned Area (BA) product (MCD64A1, collection 6) (see Sect. 2.2.1). The fire season lasts from winter to spring (mainly January to May), when about 93% of the total yearly fires occur. Forest fires occurred over a more extended period, from December to the following June, but were fewer in number than shrubland and grassland fires (Fig. 1c). A significant difference in the spatial distribution of fires can be observed, with increased fire occurrence from northwest to southeast (Fig. 1a).

2.2 Wildfire Database

The probability of fuel ignition was related to the drivers including weather, fuel, and topography variables (Verdu et al. 2012), which were used together with infrastructure variables to quantify human activities for modeling wildfire probability in areas prone to human-caused fires (Jaafari et al. 2017; Quan, Xie et al. 2021). The multisource data for the wildfire probability model include hourly weather variables (at 0.1°), daily to yearly fuel variables (at 500 m), topography variables (at 7.5 arcseconds), infrastructure variables (in vector format), and monthly BA data (at 500 m). To derive the multisource data with a consistent spatial resolution, weather and topography variables were resampled to 500 m using the bilinear interpolation method with the Geospatial Data Abstraction Library (GDAL) tool, and daily weather variables were composited from hourly data. Infrastructure variables were converted into raster format, with a spatial resolution of 500 m. The wildfire database was built based on the historical wildfire points from the MCD64A1 product, and the driving factors identified above were extracted for each fire point.
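The resampling step can be illustrated with a minimal sketch of bilinear interpolation in plain Python. This is not the GDAL implementation used in the study; the function name `bilinear` and the use of fractional row/column indices are assumptions made for illustration only.

```python
def bilinear(grid, row, col):
    """Interpolate a 2-D grid (list of rows, at least 2 x 2) at fractional
    (row, col) index coordinates, with 0 <= row <= n_rows - 1 and
    0 <= col <= n_cols - 1."""
    # Index of the upper-left cell of the 2 x 2 neighbourhood, clamped so that
    # the neighbourhood never runs past the last row or column.
    r0 = min(int(row), len(grid) - 2)
    c0 = min(int(col), len(grid[0]) - 2)
    dr, dc = row - r0, col - c0

    # Interpolate along columns on the two rows, then between the rows.
    top = grid[r0][c0] * (1 - dc) + grid[r0][c0 + 1] * dc
    bottom = grid[r0 + 1][c0] * (1 - dc) + grid[r0 + 1][c0 + 1] * dc
    return top * (1 - dr) + bottom * dr


# Hypothetical coarse grid of a weather variable (values are made up).
coarse = [[0.0, 10.0],
          [20.0, 30.0]]
center_value = bilinear(coarse, 0.5, 0.5)
```

Each 500 m target pixel would be assigned the value interpolated at its position within the coarse 0.1° grid; mapping geographic coordinates to fractional indices is left out of this sketch.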

Detailed information on the variables used to model wildfire probability in this study is shown in Table 1. Four essential weather variables (AP, RH, AT, and WS) were used to derive the dynamic-step, fixed-step, and daily variables. Here, the dynamic-step weather variables include the minimal, average, and maximal RH/WS/AT within CNPD (represented by min/ave/max-RH/WS/AT-CNPD). In contrast, the fixed-step weather variables include the average RH/WS/AT within 16 days (represented by ave-RH/WS/AT-16days), and the daily weather variables include the daily average RH/WS/AT (represented by ave-RH/WS/AT-daily) or daily accumulated precipitation (represented by AP-daily). The fuel variables include leaf area index (LAI), forestry type (FT), tree species (TS), and fuel moisture content (FMC). The variables derived from FMC are the FMC on the 8th/16th day before the fire (represented by FMC-8th/16th) and the average FMC within 16 days (represented by ave-FMC-16days). The topography variables consist of elevation, slope, and aspect, together with the derived variables: topographic wetness index (TWI), topographic position index (TPI), plan curvature (PC), and the sine/cosine value of aspect (sin/cos-aspect). The infrastructure variables include the distance to roads and residential areas.

Table 1 Source and significance of the variables used in this study

2.2.1 Wildfire Occurrence Dataset

The historical wildfires (a total of 2094 fire points from 2003 to 2018) were extracted from MODIS Burned Area (BA) product MCD64A1, Collection 6 (Giglio et al. 2016; Jiao et al. 2022). We extracted the wildfire locations and dates (that is, day-of-burn and the uncertainty of day-of-burn). The isolated hot pixels were excluded since they may not be natural fires but artificial agricultural burns or high-temperature soils (Jurdao et al. 2012).

To explore the relationship between the explanatory variables and wildfire occurrence, we built a complementary database by integrating unburned pixel areas as negative samples. Thus, based on burned pixels from the MODIS BA product, we used a geostatistical technique, the semi-variogram (see Jurdao et al. (2012) and Yebra et al. (2018) for full details), to generate unburned pixels. These unburned pixels were selected within a time interval of 16 days and at some distance from each fire point, to ensure that they were not affected by the fire. The final dataset is imbalanced (burned:unburned = 1:10) and consists of 23,034 points, where each point was marked using binary encoding (“1” as burned and “0” as unburned).

2.2.2 Dynamic-Step Weather Variables

We selected four broadly used weather variables, namely accumulated precipitation (AP), relative humidity (RH), air temperature (AT), and wind speed (WS), to characterize the prefire weather conditions. The basic weather data were generated from the ERA5-Land dataset, produced by the Copernicus Climate Change Service (C3S) at the European Centre for Medium-Range Weather Forecasts (ECMWF). The dataset is an atmospheric reanalysis of the global climate, provided at a spatial resolution of 0.1° and a temporal resolution of 1 hour for all variables.

Five parameters of the ERA5-Land dataset were used to generate the basic weather variables: 2 m dewpoint temperature, 2 m temperature, total precipitation, and the 10 m eastward and northward wind components. Hourly AP and AT were extracted directly from the total precipitation and 2 m temperature of the dataset, respectively, whereas hourly WS was calculated as the magnitude of the vector sum of the mutually perpendicular eastward and northward wind components. Hourly RH was derived from the ratio of the vapor pressure e to the saturated vapor pressure es (Henderson‐Sellers 1984), using 2 m dewpoint temperature and 2 m temperature. Daily WS, RH, and AT were each calculated as the average of the four hourly values at 00:00, 06:00, 12:00, and 18:00 (UTC+8), whereas daily AP was the sum of the 24 hourly AP values within a day. All daily weather variables within 16 days before the fire were extracted to model wildfire probability, since the temporal changes in weather conditions at a short-term scale were also related to wildfire occurrence (Cao et al. 2017; Malik et al. 2021).
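The derivation of the basic weather variables can be sketched as follows. This is a minimal illustration, not the study's code: the saturation vapor pressure uses the Magnus approximation with commonly quoted coefficients, which is an assumption and may differ from the Henderson‐Sellers (1984) formulation cited above, and all function names are hypothetical.

```python
import math


def wind_speed(u10, v10):
    """Wind speed (m/s) as the magnitude of the 10 m wind vector built from
    its eastward (u10) and northward (v10) components."""
    return math.hypot(u10, v10)


def saturation_vapor_pressure(temp_c):
    """Saturation vapor pressure (hPa) over water at temp_c (deg C).
    Magnus approximation; the coefficients are an assumption of this sketch."""
    return 6.112 * math.exp(17.62 * temp_c / (243.12 + temp_c))


def relative_humidity(t2m_k, d2m_k):
    """RH (%) = 100 * e / es, where e is evaluated at the 2 m dewpoint and
    es at the 2 m air temperature (both given in kelvin)."""
    e = saturation_vapor_pressure(d2m_k - 273.15)
    es = saturation_vapor_pressure(t2m_k - 273.15)
    return min(100.0, 100.0 * e / es)


def daily_mean(hourly_values, hours=(0, 6, 12, 18)):
    """Daily WS, RH, or AT as the mean of the four synoptic-hour values.
    hourly_values holds 24 values indexed by local hour."""
    return sum(hourly_values[h] for h in hours) / len(hours)


def daily_precipitation(hourly_ap):
    """Daily AP as the sum of the 24 hourly precipitation values."""
    return sum(hourly_ap)
```

For example, `relative_humidity(293.15, 283.15)` evaluates to roughly 53% under this approximation, and `wind_speed(3.0, 4.0)` returns 5.0.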

Rainfall can alter the other weather variables (air temperature, relative humidity, and wind speed) (Miao et al. 2016). To account for this, we additionally considered the dynamic-step weather variables during continuous nonprecipitation days (CNPD, that is, consecutive days with daily AP ≤ 0.5 mm), where the minimal, average, and maximal weather conditions (RH, WS, and AT) within the CNPD (represented by min/ave/max-RH/WS/AT-CNPD) were derived to model wildfire probability.
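A minimal sketch of how the dynamic-step variables could be derived from daily series is given below. It assumes that the series are ordered from oldest to newest and end on the last day before the fire, and that the CNPD is capped at the 16-day extraction window; these conventions and the function names are assumptions of this sketch, not details confirmed by the study.

```python
def cnpd_length(daily_ap, threshold=0.5, max_span=16):
    """Number of consecutive days with daily AP <= threshold (mm), counted
    backwards from the last element and capped at max_span days."""
    count = 0
    for ap in reversed(daily_ap[-max_span:]):
        if ap > threshold:
            break  # effective rainfall ends the dry spell
        count += 1
    return count


def dynamic_step_stats(daily_values, daily_ap, threshold=0.5, max_span=16):
    """Minimum, average, and maximum of a weather variable within the CNPD.
    Returns None when the day before the fire already had effective rainfall."""
    n = cnpd_length(daily_ap, threshold, max_span)
    if n == 0:
        return None
    window = daily_values[-n:]
    return {"min": min(window), "ave": sum(window) / n, "max": max(window)}


# Hypothetical five-day example: rain (2.0 mm) fell four days before the fire.
ap = [0.0, 2.0, 0.1, 0.0, 0.3]
rh = [60.0, 80.0, 50.0, 40.0, 30.0]
stats = dynamic_step_stats(rh, ap)  # uses only the last three dry days
```

In this example the CNPD is three days, so the statistics are computed over the RH values 50, 40, and 30 only, whereas a fixed-step average would also include the rainy day.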

2.2.3 Fuel Variables

This study integrated multiple fuel variables including fuel moisture content (FMC), leaf area index (LAI), forestry type (FT), and tree species (TS) into predicting wildfire probability. These variables were found to be strongly related to fire ignition, spread, and intensity (Yebra et al. 2018; Quan, Xie et al. 2021). The 8-day composite LAI was extracted from the MODIS MCD15A2H.006 dataset, and the daily FMC was obtained from the global FMC product at 500 m pixel resolution produced by Quan, Yebra et al. (2021). This FMC dataset was retrieved from the MODIS MCD43A4.006 dataset with radiative transfer models (RTMs) inversion, using the look-up table (LUT) algorithm (see Quan, Yebra et al. (2021) for full details). Forestry type and TS represent fuel type, which was derived from the forestry survey (2010–2020) conducted by the Sichuan Forestry and Grassland Administration (2017). Forestry type includes forest and shrubland types, and TS includes the type of forest and shrubland species, according to the classification system of the national forestry investigation of China (National Forestry and Grassland Administration 2011). Neither FT nor TS is found in grassland areas. Three fuel classes were identified in the study area, including forests, shrublands, and grasslands (Fig. 1a). For each fuel class, the wildfire probability models were established using the wildfire sub-datasets split by FT and TS, respectively.

2.2.4 Topography and Infrastructure Variables

Three basic topography variables (elevation, slope, and aspect) were retrieved from the Global Multi-resolution Terrain Elevation Data 2010 (GMTED2010), at 7.5 arcseconds spatial resolution (USGS 2010; Carabajal et al. 2011; Athmania and Achour 2014). Because the absolute aspect is discontinuous where 0° and 360° meet, we converted it from angles to sine and cosine values (that is, sin/cos-aspect), which vary continuously between −1 and 1. The other three derived variables (TWI, TPI, and PC) were also regarded as driving factors for predicting wildfire danger (Pourtaghi et al. 2015). These variables describe the impact of topography on wetness (Eskandari et al. 2020), as it affects the spatial distribution of soil water content and the greenness of the vegetation (Jaafari et al. 2017; Al-Fugara et al. 2021).
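The aspect conversion can be sketched in a few lines. The function name `encode_aspect` is hypothetical, and the aspect is assumed to be given in degrees clockwise from north.

```python
import math


def encode_aspect(aspect_deg):
    """Return (sin-aspect, cos-aspect) for an aspect given in degrees.
    Both components lie in [-1, 1] and are continuous across 0/360 degrees."""
    rad = math.radians(aspect_deg)
    return math.sin(rad), math.cos(rad)


# Aspects of 1 and 359 degrees differ by 358 as raw angles, yet describe
# almost the same slope orientation; the encoded values are close together.
north_east_side = encode_aspect(1.0)
north_west_side = encode_aspect(359.0)
```

With this encoding, two nearly north-facing slopes receive nearly identical cos-aspect values and small sin-aspect values of opposite sign, so a tree-based model no longer sees them as lying at opposite ends of the variable's range.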

We calculated the Euclidean distances to roads and residential areas provided by the National Catalogue Service for Geographic Information (NCSFGI), as the infrastructure variables (that is, distance to roads and residential areas). These variables are well-known factors to characterize human activities (Jaafari et al. 2017; Jaafari, Zenner et al. 2019).

2.3 Wildfire Probability Modeling

The main methodology is illustrated in Fig. 2. We selected the random forest (RF) and extreme gradient boosting (XGBoost) models to explore and establish the relationships between historical wildfire occurrence and weather, fuel, and topography, as well as infrastructure variables. These two tree-based models both have a high tolerance for the multicollinearity of the input variables in the wildfire dataset (Piramuthu 2008; Cao et al. 2017; Ma, Ding et al. 2020; Wang and Wang 2020). The wildfire dataset was randomly divided into a training (70% of the data) and a validation subdataset (the remaining 30%), where the former was used to set up the model and the latter to evaluate model performance. Before training the model, we applied the Mann–Whitney U-test to examine the relevance between the variables and wildfire occurrence, further ensuring the stable performance of the machine-learning models.
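The screening step can be illustrated with a minimal Mann–Whitney U-test in plain Python. This sketch uses the normal approximation for the two-sided p value and omits the tie correction of the variance, both of which are simplifying assumptions; the study does not state which implementation was used, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(burned, unburned):
    """Return (U statistic of the burned group, two-sided p value)."""
    n1, n2 = len(burned), len(unburned)
    ranks = average_ranks(list(burned) + list(unburned))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u1 - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
    return u1, p_value


# Hypothetical RH values (%) for burned and unburned samples.
u_stat, p_val = mann_whitney_u([20, 25, 30, 35, 40], [45, 50, 55, 60, 65, 70])
```

A variable would be retained when its p value falls below the chosen significance level (0.05 in Sect. 3.1); for samples as large as the wildfire dataset, the normal approximation is generally adequate.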

Fig. 2 Workflow of the methodology adopted in this study. AP accumulated precipitation; AT air temperature; CNPD continuous nonprecipitation days; FMC fuel moisture content; PC plan curvature; RF random forest; RH relative humidity; TPI topographic position index; TWI topographic wetness index; WS wind speed; XGBoost extreme gradient boosting model

To explore the effect of fuel and dynamic-step weather variables (fuel&DSw) on improving the wildfire probability models in two machine learning methods, we compared four wildfire probability models, including RF-based and XGBoost-based models considering these variables (that is, RF-with_fuel&DSw and XGBoost-with_fuel&DSw), and the models without considering these variables (that is, RF-without_fuel&DSw and XGBoost-without_fuel&DSw). Separate models for forests, shrublands, grasslands, and all fuel classes combined were established to further assess the improvement of wildfire probability prediction.

The wildfire probability, which is the output of the wildfire probability models, was defined as the probability (ranging from 0 to 1) of wildfire occurrence (namely, the wildfire probability is marked with “1” when the wildfire occurs).

2.3.1 Random Forest and Extreme Gradient Boosting Models

The RF model is broadly used to explore the nonlinear relationship between the explanatory and dependent variables. It is a nonparametric and ensemble learning approach, which is formed from multiple decision (or regression) trees to train and predict samples (Breiman 2001). Each tree is created when the adopted features and number of layers are determined using the bootstrap aggregation algorithm (Guo et al. 2016). We tested multiple groups of optimized parameters, that is, the number of decision trees (n-tree) and the maximum depth of the decision trees (max-depth), for better performance of the wildfire probability model. Evaluations in our study showed that n-tree = 300 and max-depth = 9 were the best parameters through grid search and cross-validation.

The XGBoost model is based on gradient-boosted decision trees and connects the regression trees in series, where each tree is trained to fit the error of the previous tree (Chen and Guestrin 2016). It is a scalable tree-boosting system, with the advantage of high computational efficiency through automatic parallel computing (Chen and Guestrin 2016; Mitchell and Frank 2017). The XGBoost model applies a second-order Taylor expansion to the loss function, which reduces the loss of precision. It also adds a regularization term to the loss function to avoid overfitting when optimizing the solution. Similarly, we implemented grid search and cross-validation to determine the optimal combination of model parameters (that is, n-tree = 200, max-depth = 6, and learning rate = 0.3).
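For reference, the second-order approximation and the regularization term mentioned above are commonly written as follows, in the notation of Chen and Guestrin (2016). The symbols are taken from that paper rather than from this study and are given here only as an illustration.

$$\mathcal{L}^{(t)} \approx \sum\limits_{i = 1}^{n} \left[ l\left( y_{i} ,\hat{y}_{i}^{(t - 1)} \right) + g_{i} f_{t} (x_{i} ) + \frac{1}{2}h_{i} f_{t}^{2} (x_{i} ) \right] + \Omega (f_{t} ),\quad \Omega (f) = \gamma T + \frac{1}{2}\lambda \left\| w \right\|^{2}$$

Here $f_t$ is the tree added at iteration $t$, $g_i$ and $h_i$ are the first and second derivatives of the loss $l$ with respect to the previous prediction $\hat{y}_i^{(t-1)}$, $T$ is the number of leaves of the tree, $w$ is the vector of leaf weights, and $\gamma$ and $\lambda$ are regularization coefficients.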

2.3.2 Model Performance Evaluations with Multiple Metrics

Qualitative model evaluation was implemented by a receiver operating characteristic (ROC) curve analysis, where the curve plots the values of true positive rate (TPR, sensitivity or the proportion of correct predictions in positive samples) versus false positive rate (FPR, 1 − specificity, where specificity is the proportion of correct predictions in negative samples) for a binary classifier system at different values of the discrimination threshold. From there, the area under the receiver operating characteristic curve (AUC) was used as a quantitative metric to evaluate the performance of the wildfire probability model (Hanley and McNeil 1982). A higher AUC indicates that the model has better performance, with higher accuracy. We also quantified the difference between the predicted and true values using two metrics, the mean absolute error (MAE) and the logarithmic score (LS), which are defined in Eqs. 1 and 2. A smaller MAE or LS value indicates better performance of the wildfire probability model.

$${\text{MAE}} = \frac{1}{n}\sum\limits_{k = 1}^{n} {\left| {y_{k} - p_{k} } \right|}$$
(1)
$${\text{LS}} = - \frac{1}{n}\sum\limits_{k = 1}^{n} {\left[ {y_{k} \log \left( {p_{k} } \right) + \left( {1 - y_{k} } \right)\log \left( {1 - p_{k} } \right)} \right]} ,$$
(2)

where $n$ is the number of observations, $k$ is the index of observations, $p_k$ is the modeled probability of wildfire occurrence, and $y_k$ is the historical wildfire occurrence (0 or 1).
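The evaluation metrics can be sketched in plain Python as follows. The LS is taken here as the negative mean log-likelihood, so that smaller values indicate better performance, as the text states. The AUC is computed through its rank interpretation (the probability that a burned sample receives a higher predicted probability than an unburned one), which is equivalent to the area under the ROC curve. The clipping constant `eps` and all function names are assumptions of this sketch.

```python
import math


def mae(y, p):
    """Mean absolute error between observed occurrence y (0/1) and probability p."""
    return sum(abs(yk - pk) for yk, pk in zip(y, p)) / len(y)


def log_score(y, p, eps=1e-15):
    """Negative mean log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for yk, pk in zip(y, p):
        pk = min(max(pk, eps), 1 - eps)
        total += yk * math.log(pk) + (1 - yk) * math.log(1 - pk)
    return -total / len(y)


def roc_point(y, p, threshold):
    """(TPR, FPR) at one discrimination threshold; FPR equals 1 - specificity."""
    tp = sum(1 for yk, pk in zip(y, p) if yk == 1 and pk >= threshold)
    fp = sum(1 for yk, pk in zip(y, p) if yk == 0 and pk >= threshold)
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    return tp / n_pos, fp / n_neg


def auc(y, p):
    """AUC as the fraction of (burned, unburned) pairs ranked correctly,
    counting ties as one half. Quadratic in sample size; fine for a sketch."""
    pos = [pk for yk, pk in zip(y, p) if yk == 1]
    neg = [pk for yk, pk in zip(y, p) if yk == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical observations and predicted probabilities.
y_obs = [1, 0, 1, 0]
p_mod = [0.9, 0.1, 0.8, 0.3]
scores = (auc(y_obs, p_mod), mae(y_obs, p_mod), log_score(y_obs, p_mod))
```

For a dataset of the size used in this study, a sort-based AUC computation would be preferable to the pairwise loop shown here.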

3 Results

This section first shows the results of variable importance analysis for validating the key role of the introduced factors in the wildfire probability model. Then, the quantitative metrics (AUC, MAE, and LS) of four kinds of wildfire probability models for different fuel classes are tracked and compared. Finally, the historical map of monthly average wildfire probability is presented for qualitative model evaluation.

3.1 Variable Importance of Driving Factors

All explanatory variables passed the Mann–Whitney U-test (p < 0.05) and were used to model wildfire probability. We selected the mean decrease Gini index in RF to explore the variable importance of driving factors (that is, weather, fuel, topography, and infrastructure variables) in forests, shrublands, and grasslands. The variable importance analysis indicated that the weather variables within CNPD played a more crucial role in wildfire probability modeling, especially for shrublands and grasslands (Fig. 3). Relative humidity (RH) was particularly important in forest fire probability prediction, ranking in the top five in variable importance. The variable importance of tree species (TS) in forests was moderate and dropped to a low level in shrublands. In contrast, forestry type (FT) ranked extremely low in importance in both forests and shrublands, but ranked higher when all fuel classes were combined. The remotely sensed fuel variables, that is, FMC-8th and FMC-16th, were of moderate importance, ranking between 19th and 23rd. The topography variables showed moderate to low importance and ranked between 20th and 30th, while the derived topography variables (that is, TWI, TPI, and PC) were of minor importance. The distance to residential areas was more important (ranked between 11th and 16th) than the distance to roads in most cases, but the distance to roads showed higher importance (ranked 9th) for shrublands.

Fig. 3 Variable importance analysis for forest (a), shrubland (b), grassland (c), and all fuel classes (d), according to the mean decrease Gini index in random forest (RF) using the wildfire database. Note: The mean decrease Gini index was normalized to range from 0 to 1. The top 30 variables are shown. AP accumulated precipitation; AT air temperature; CNPD continuous nonprecipitation days; FMC fuel moisture content; FT forestry type; LAI leaf area index; RH relative humidity; TPI topographic position index; TS tree species; TWI topographic wetness index; WS wind speed

3.2 Model Evaluations

The quantitative metrics of AUC, MAE, and LS for each case are presented in Table 2. The AUC values based on the RF and XGBoost models indicated an excellent performance for wildfire probability modeling (RF: AUC = 0.99; XGBoost: AUC = 0.99) when all fuel classes were combined. The fuel and dynamic-step weather variables increased the goodness-of-fit of the wildfire probability model, particularly in the RF model (AUC increased from 0.93 to 0.99, and MAE decreased from 0.104 to 0.066), compared to the cases when these variables were not considered. Table 2 also indicates that the wildfire probability models for shrublands outperformed those for the other individual fuel classes, based on the two machine learning models. The improvement of the RF-based model for grasslands was greater than that for forests and shrublands, and the improvement was greatest for all fuel classes combined. In addition, the XGBoost-based model for all fuel classes achieved the best AUC, MAE, and LS among all cases in this study.

Table 2 Quantitative metrics of area under the receiver operating characteristic curve (AUC), mean absolute error (MAE), and logarithmic score (LS) for wildfire probability modeling of forest, shrubland, grassland, and all fuel classes using the random forest (RF) and extreme gradient boosting (XGBoost) model

3.3 Wildfire Probability Mapping

Figure 4 shows the monthly average historical wildfire probability of the study area in the fire seasons (February–March 2018) and nonfire seasons (June–July 2018). The historical wildfire probability was calculated from the better-performing models (that is, RF-with_fuel&DSw and XGBoost-with_fuel&DSw for all fuel classes) considering fuel&DSw, and the general models (that is, RF-without_fuel&DSw and XGBoost-without_fuel&DSw for all fuel classes) without consideration of fuel&DSw. The wildfire probability was divided into five levels (very low, low, moderate, high, and very high) using the natural breaks (Jenks) method (Jenks and Caspall 1971).
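The idea behind natural breaks classification can be sketched as a dynamic program that splits sorted values into a given number of classes while minimizing the within-class sum of squared deviations. This is one common formulation (often called Fisher–Jenks) and is offered only as an illustration; the GIS implementation used for Fig. 4 may differ, and the function names are hypothetical.

```python
def jenks_upper_bounds(values, n_classes):
    """Return the upper bound of each class for an optimal 1-D partition of
    values into n_classes contiguous groups (requires len(values) >= n_classes)."""
    data = sorted(values)
    n = len(data)

    # Prefix sums allow the squared deviation of any slice in constant time.
    s1, s2 = [0.0], [0.0]
    for v in data:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def ssd(i, j):
        """Sum of squared deviations from the mean for data[i:j], with j > i."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[c][j]: best cost of splitting the first j values into c classes.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    split = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for c in range(1, n_classes + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                candidate = cost[c - 1][i] + ssd(i, j)
                if candidate < cost[c][j]:
                    cost[c][j] = candidate
                    split[c][j] = i

    # Walk back through the stored split points to recover the class bounds.
    bounds, j = [], n
    for c in range(n_classes, 0, -1):
        bounds.append(data[j - 1])
        j = split[c][j]
    return bounds[::-1]


def classify(value, upper_bounds, labels):
    """Assign the label of the first class whose upper bound covers the value."""
    for bound, label in zip(upper_bounds, labels):
        if value <= bound:
            return label
    return labels[-1]


# Hypothetical example with three well-separated groups of values.
example_bounds = jenks_upper_bounds([1, 2, 3, 10, 11, 12, 50, 51], 3)
```

The running time grows with the square of the number of values, so applying this sketch to a full 500 m raster would require sampling the pixel values first.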

Fig. 4 Monthly average wildfire probability level distributions (a–h) and statistical coverage (i, j) in the fire seasons (a, b, e, f, i, February–March 2018) and nonfire seasons (c, d, g, h, j, June–July 2018) for the study area based on the four models. The historical wildfire points shown on wildfire probability maps, occurring in February or March, were selected from the MODIS wildfire database

Historical wildfires were concentrated in the south and east of the study area, and few wildfires occurred in the northwest area during the fire season (Fig. 4a and b versus Fig. 4e and f). Most of these fires were located in areas with a high level of wildfire probability calculated from the four models, particularly from RF-with_fuel&DSw. The spatial distribution pattern of historical wildfire probability was also similar between the models with and without consideration of fuel&DSw. However, the general models without these variables identified fewer areas of high wildfire probability and resulted in more areas with a low or moderate probability of wildfire occurrence during the fire season (Fig. 4i).

During the nonfire seasons, the wildfire probability of most of the study area was ranked as “very low” for all models, especially for XGBoost-without_fuel&DSw. Compared with the models considering fuel&DSw, the models without considering fuel&DSw resulted in significantly more moderate-probability areas, and these areas were linked with a high level of wildfire probability during the fire seasons. Among these four models, RF-with_fuel&DSw showed the fewest areas with moderate or low wildfire probability during the nonfire seasons (Fig. 4j).

4 Discussion

This study improved wildfire probability predictions for the northwest of Sichuan Province, China, based on multisource data on weather, fuel, topography, and infrastructure variables. Particular emphasis was placed on evaluating the role of dynamic-step weather variables in the prediction.

Inspired by the fire environment triangle developed by Pyne and colleagues in 1996, we selected weather, fuel, topography, infrastructure, and their derived explanatory variables to model wildfire probability with high AUC values (Table 2). Some studies in the wildfire literature also demonstrated that the long-term or short-term changes in fuel and weather variables were related to wildfire occurrence, and used the temporal characteristics of fuel and weather to model wildfire probability (Luo et al. 2019; Stefanidou et al. 2019; Nolan, Blackman et al. 2020; Zong et al. 2020). The weather variables within CNPD were a key concern in this study. Notably, the temporal characteristics of weather variables (for example, the mean change rate of RH within CNPD) were not considered, since the time span of these characteristics was inconsistent between different wildfire points. Therefore, we chose the statistical characteristics of time series weather data, including minimal, average, and maximal RH/WS/AT within CNPD (Table 1), as dynamic-step weather variables for wildfire probability modeling. All these derived variables improved the performance of the wildfire probability model, producing higher AUC and lower MAE and LS values (Table 2).

The average RH/WS/AT within a fixed time span (that is, fixed-step weather variables) was also included in the wildfire dataset. In this study, CNPD is dynamic with an upper limit that is equal to the fixed time span. This time span (unit: day) should not be too small. Otherwise, the dynamic-step variables would be similar to the fixed-step weather variables, introducing additional data redundancy in the wildfire dataset. This is because effective rainfall may not occur within a relatively short time span during the fire season, and thus has no influence on the characteristics of weather variables within this time span. Here, the fixed time span was set as 16 days, following Quan, Xie et al. (2021). However, other studies adopted different fixed time spans, such as 10 days (Cao et al. 2017) and 1 month (Guo et al. 2017; Su et al. 2018; Shabbir et al. 2021), in which the optimal value might be determined by enumeration methods. We did not focus on the determination of a fixed time span, since it has little negative effect on the result of wildfire probability modeling when the time span is large enough.

Variable importance analysis showed that the dynamic-step weather variables played a more crucial role in wildfire probability modeling, especially for shrublands and grasslands (Fig. 3). Compared with the fixed-step weather variables, these variables exclude the influence of rainfall and can therefore be used to improve the wildfire probability model. Although the established models produced satisfactory results, only a minor improvement was observed, which may be due to two reasons: (1) Daily AP includes liquid and frozen water (rain and snow) that falls to the Earth’s surface, with some uncertainties. This study therefore assumed that effective rainfall occurs when daily AP is greater than 0.5 mm. However, this threshold might not be appropriate, and its impact on the wildfire probability model is unclear. (2) There had been no effective rainfall within the fixed time span (16 days) before the occurrence of some historical wildfires, and these wildfire points were still included in the wildfire dataset. In this situation, the corresponding dynamic-step and fixed-step weather variables are the same, which may limit the improvement of the wildfire probability model.

Machine learning techniques have been widely used to predict wildfire probability because they integrate multisource data conveniently and achieve higher accuracy than traditional statistical methods and approaches based on fire weather or drought indices (Goldarag et al. 2016; Cao et al. 2017; Jaafari, Mafi-Gholami et al. 2019; Zhang et al. 2019; Phelps and Woolford 2021). One of the critical problems in wildfire probability modeling is the elimination of multicollinearity among the input variables. Explanatory variables can be checked for multicollinearity using the variance inflation factor (VIF) and Pearson correlation analysis (Arndt et al. 2013; Pourtaghi et al. 2015; Kang et al. 2020; Ma, Ding et al. 2020; Ma, Feng et al. 2020; Milanovic et al. 2021). Nevertheless, this study did not carry out such a check, since the two selected tree-based models (RF and XGBoost) have a high tolerance to multicollinearity among the input variables (Piramuthu 2008; Cao et al. 2017; Ma, Ding et al. 2020; Wang and Wang 2020). Any multicollinearity present may affect only the computational and storage efficiency of the models. In addition, this study used the two models for comparison to learn whether the improvement obtained with dynamic-step predictors is robust and independent of the model choice.
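For readers who wish to run the multicollinearity check that was skipped here, the following is a minimal pure-Python sketch. It relies on the standard identity that the VIF of predictor j equals the j-th diagonal element of the inverse of the Pearson correlation matrix of the predictors. The function names and the toy data are hypothetical, and the sketch is not part of the workflow of this study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Pairwise Pearson correlations; each element of columns is one predictor."""
    return [[pearson(a, b) for b in columns] for a in columns]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular correlation matrix: VIF is unbounded")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


# Toy predictors (hypothetical values, not data from this study).
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
print(correlation_matrix([x1, x2]))
print(vif([x1, x2]))
```

For two predictors the result reduces to 1 / (1 - r^2), where r is their Pearson correlation, which provides a quick sanity check of the sketch.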

Wildfire management requires high accuracy in mapping the spatial distribution of wildfire probability, where different probability levels could be used for wildfire danger assessment (Arndt et al. 2013). A noticeable difference was found between the wildfire probability level distributions generated from the RF and XGBoost models. This difference stems from the different principles of the two models: the maximum value of predicted wildfire probability from the RF-based model is usually smaller than that from the XGBoost-based model (Fig. 4). The RF method is based on bootstrap aggregating, and its output is the average result of all decision trees, while the XGBoost model is a scalable tree-boosting system that can achieve better accuracy (Chen and Guestrin 2016; Guo et al. 2016). In general, relying on an RF-based model with bootstrap aggregating, rather than on a model with polarized wildfire probability levels, resembles a multi-expert decision-making system, which may improve the guidance available to governmental administrations for wildfire prevention and control.
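The contrast between the two aggregation principles can be illustrated with a toy Python sketch. It is a simplification under stated assumptions, not a reproduction of either library: the RF output is taken as the plain mean of per-tree probabilities, the boosted output is taken as a sigmoid applied to the sum of per-tree scores in log-odds space (as with a logistic objective), and all numbers are hypothetical.

```python
from math import exp


def rf_probability(tree_probs):
    """Bagging: average the probabilities returned by the individual trees."""
    return sum(tree_probs) / len(tree_probs)


def boosted_probability(tree_scores, base_score=0.0):
    """Boosting (logistic objective): sum per-tree scores in log-odds space,
    then map the total to a probability with the sigmoid function."""
    logit = base_score + sum(tree_scores)
    return 1.0 / (1.0 + exp(-logit))


# Hypothetical ensemble of 10 trees for one high-risk grid cell.
tree_probs = [0.9] * 8 + [0.3] * 2   # most trees agree, two dissent
tree_scores = [0.5] * 10             # each boosted tree adds to the log-odds

print(rf_probability(tree_probs))        # bounded by the largest tree output
print(boosted_probability(tree_scores))  # additive scores push toward 1
```

Because the bagged average can never exceed the largest single-tree output, dissenting trees pull it away from the extremes, whereas additive log-odds contributions can accumulate toward 0 or 1. This is consistent with the smaller maximum probabilities of the RF-based model noted above, although the actual magnitudes depend on the trained models.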

5 Conclusion

This study demonstrated the ability of dynamic-step weather variables to improve wildfire probability modeling in the northwest of Sichuan Province, China. The use of weather variables within CNPD improved the wildfire probability model in terms of AUC, MAE, and LS values. This study also provided insight into eliminating the influence of rainfall on weather characteristics when modeling wildfire probability. The established RF-based model is more suitable for guiding wildfire early prescription, suppression, and response, as it produces more reasonable spatial patterns of historical wildfire probability. Despite the advantages of the dynamic-step weather variables in improving wildfire probability prediction, deriving the optimal effective rainfall threshold and fixed time span remains challenging. In addition, the need to quantify the performance of different spatial and temporal characteristics of the dynamic-step weather variables in estimating wildfire occurrence risk has been only partially addressed.