1 Introduction

Natural hazards and extreme financial loss can be seen as extreme events (i.e., events that have low probabilities of occurring under normal circumstances). Because of their imbalanced number of occurrences, one usually has more data on common, non-extreme events than on extreme events. However, it can be shown that fitting a distribution to all the data via classical statistical methods leads to a reasonable fit to the bulk of the data at the expense of a poor fit to the tails in which the extremes lie (Ribatet 2016). Extreme value theory (EVT) and its corresponding models seek to remedy this disparity by offering guidance on when an extreme regime kicks in and how it can be modelled. For example, under known conditions, the distribution of excesses above a fixed threshold can be shown to converge to a generalised Pareto distribution (GPD) as the threshold increases (Embrechts et al. 2013; Beirlant et al. 2004). By knowing this asymptotic distribution, one can conduct model checks and generate sensible estimates of extremal behaviour.

EVT is increasingly used in financial applications and has been shown to give more accurate tail-risk predictions (Danielsson and De Vries 1997; Longin 2000). To address the fact that there is time dependence in financial returns which goes against the traditional assumption of independence (Diebold et al. 1998), Bee et al. (2019) propose a dynamic extreme value model which uses high-frequency realised measures of the daily asset price variation as covariates to model the probability of exceeding a high threshold and the size of the excesses. Since the realised variation is time-varying, the estimates of threshold exceedance and excesses are also time-varying.

The extreme value model developed by Bee et al. (2019) can be adapted to wider settings. In this paper, this is demonstrated by adapting the model for natural hazard forecasting, specifically for volcanic eruptions. Just as extreme loss can be defined as an exceedance of some financial threshold, extreme volcanic activity can be defined as the threshold exceedance of some energy index.

The contributions of this model to the eruption forecasting literature are manifold. In addition to the short overview of existing methods provided in the Supplementary Information, the key differences and contributions of the proposed model are highlighted here. The dynamic extreme value model draws on techniques from time series analysis, EVT and machine learning, whose full and varied potential for short-term eruption forecasting has yet to be realised (Malfante et al. 2018; Carniel and Guzmán 2020; Whitehead and Bebbington 2021). To adapt the model from its original financial context to wider settings, choices need to be made on the following:

  (i) a suitable index to compute threshold exceedances;

  (ii) the look-ahead window or forecast horizon;

  (iii) the auxiliary information or covariates used to inform future behaviour; and

  (iv) the time periods to compute these from (i.e., the covariate window).

In the seismic context, threshold exceedance for event detection is synonymous with first-arrival picking algorithms such as the classical short-time average over long-time average (STA/LTA) method and a recently introduced method using trace envelopes (Withers et al. 1998; Trnkoczy 1999; Al-Mashhor et al. 2019). The trace envelope can be seen as the instantaneous amplitude of a seismic trace and is related to the amount of energy in the signal (Ktonas and Papp 1980; Taner 2001). It is computed by taking the square root of the sum of the squared real and imaginary parts of the complex (analytic) trace derived from the seismic trace (see Sect. 3.2.1). In this paper, trace envelopes will be used as eruption indices from which to take threshold exceedances. The hope is that these exceedances relate to extreme regimes leading up to volcanic eruptions and can be forecasted using covariates. Note that the exceedance of an eruption index is modelled rather than a physical monitoring signal because the latter does not necessarily have a monotonic relationship with the hazard. In addition, since different frequency bands within the seismic signal represent different physical phenomena (Bormann et al. 2013; Salvage et al. 2019), envelopes of frequency-filtered data will be considered.

While existing techniques like the failure forecast method estimate the eruption onset time directly, others including event trees define a look-forward window within which the probability of an eruption occurring is estimated. The latter approach is taken here. In particular, for illustration, 1-h-ahead forecasts are produced to complement other longer-term forecasts. Although a longer lead time allows for more time to make emergency management decisions, notify the public and implement evacuations (Wild et al. 2021), forecasts are typically more accurate when made closer to the actual eruption time due to temporal divergence at larger lags (see for example, Sugihara and May 1990).

Eruption forecasting methods such as event trees, belief networks and process/source models presuppose precursors or associations between source mechanisms and time series signals (Brenguier et al. 2008). In contrast, the proposed methodology selects combinations of covariates that could represent precursors for any volcano and type of eruption if the model is trained on corresponding data. Specifically, covariates inspired by machine learning classification algorithms for seismic signals (Malfante et al. 2018) are tested. By combining these covariates, different aspects of the seismic data and relationships across different frequency bands are represented. An objective stepwise selection procedure is then used to determine which covariates are more informative for forecasting eruptions.

The rest of this paper is organised as follows. The key components of the dynamic extreme value model introduced by Bee et al. (2019) are outlined in Sect. 2. In Sect. 3, the case study, the Piton de la Fournaise volcano, is introduced to illustrate how the model can be adapted for eruption forecasting. By comparing the effect of different choices of the threshold on the model fit and training performance, the value of using EVT to guide threshold choice is highlighted in Sect. 4. In Sect. 5, the broader applicability of the method is evaluated by refitting the model using three training event sets and testing the calibrated model on both event and non-event sets. In Sect. 6, the results and future areas for research are discussed. The code used for the analysis is publicly available at https://github.com/ntu-dasl-sg/dynamic-EV-forecasting.

2 The Dynamic Extreme Value Model

Following Bee et al. (2019), let \(\{Y_{t}\}_{t = 1, \dots , T}\) denote a time series of an index where higher values are associated with extreme events. Given a selected threshold \(u \in \mathbb {R}\) for the index, exceedances are defined as the binary indicators of whether the index is higher than the threshold, and excesses are defined as the amounts by which the index exceeds the threshold. The conditional probability that the index at time t, \(Y_{t}\), exceeds u by some excess \(z>0\) given prior information available at time t, \(\mathscr {F}_{t-1}\), can be written as

$$\begin{aligned} P(Y_{t}> u + z | \mathscr {F}_{t-1})&\ = P(Y_{t}> u|\mathscr {F}_{t-1})P(Y_{t}-u>z | Y_{t}>u, \mathscr {F}_{t-1}) \nonumber \\&= \phi _{t}|\mathscr {F}_{t-1} \times GPD(\xi _{t}, \nu _{t})|\mathscr {F}_{t-1}. \end{aligned}$$
(1)

Here, \(\phi _{t}|\mathscr {F}_{t-1} = P(Y_{t} > u|\mathscr {F}_{t-1})\) represents a time-varying binomial exceedance probability, which can be modelled via a logistic function

$$\begin{aligned} \phi _{t}|\mathscr {F}_{t-1}&= \frac{\exp (\psi _{0} + \sum _{i = 1}^{p}\psi _{i}x^{(i)}_{t-1})}{1 + \exp (\psi _{0} + \sum _{i = 1}^{p}\psi _{i}x^{(i)}_{t-1})}, \end{aligned}$$
(2)

where \(\textbf{x}_{t-1} = (x_{t-1}^{(1)}, \dots , x_{t-1}^{(p)})\) denotes a vector of p covariates from the previous time step and is used to project the future probability. The parameters \(\{\psi _{i}\}_{i = 0, \dots , p}\) can be estimated by maximising the likelihood function

$$\begin{aligned} \mathscr {L}(\varvec{\psi }; I_{t}, \textbf{x}_{t}) =&\prod _{t = l+1}^{T} \left( \exp (\psi _{0} + \sum _{i = 1}^{p}\psi _{i}x^{(i)}_{t-1})\right) ^{I_{t}} \times \frac{1}{1 + \exp (\psi _{0} + \sum _{i = 1}^{p}\psi _{i}x^{(i)}_{t-1})}, \end{aligned}$$
(3)

where l is the lag at which the covariates \(\textbf{x}_{t}\) become available and \(I_{t}\) is the indicator of an exceedance at time t (it is equal to 1 if there is an exceedance and 0 otherwise). This is equivalent to using logistic regression to model the probability of threshold exceedance.
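Since maximising (3) is equivalent to fitting an ordinary logistic regression, the exceedance component can be estimated with standard software. The following minimal R sketch illustrates this under assumed, hypothetical names: a data frame dat holding the index y and the already-lagged covariates x1, x2, x3.

```r
# Minimal sketch: fitting the exceedance component of (2)-(3) in R.
# `dat` is a hypothetical data frame with the envelope index `y` at time t
# and covariates `x1`, ..., `x3` computed from the previous time window.
u <- 85                                  # threshold in dB (see Sect. 3.2.3)
dat$I <- as.integer(dat$y > u)           # exceedance indicator I_t

# Maximising (3) is equivalent to ordinary logistic regression:
exc_fit <- glm(I ~ x1 + x2 + x3, data = dat, family = binomial(link = "logit"))

# One-step-ahead exceedance probability phi_t given new covariate values:
phi_hat <- predict(exc_fit, newdata = new_covariates, type = "response")
```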

In (1), the model for excesses of the threshold is given by a generalised Pareto distribution (GPD)

$$\begin{aligned} P(Y_{t}-u>z | Y_{t}>u, \mathscr {F}_{t-1}) = GPD(\xi _{t}, \nu _{t})|\mathscr {F}_{t-1}. \end{aligned}$$
(4)

The shape parameter \(\xi _{t} = \xi \) is estimated by assuming a non-time-varying GPD for the excesses and is kept constant for stability. To account for the time-varying nature of the excess distribution, the scale parameter is modelled as

$$\begin{aligned} \nu _{t} = \exp \left( \kappa _{0} + \sum _{i = 1}^{q}\kappa _{i}x^{(i)}_{t-1}\right) . \end{aligned}$$
(5)

This involves q covariates, \((x_{t-1}^{(1)}, \dots , x_{t-1}^{(q)})\), in a log-linear function.

When \(\kappa _{i} = 0\) for all \(i>0\), it is sufficient to model the distribution of the excesses statically. From the definition of a GPD with a non-zero shape parameter,

$$\begin{aligned} P(Y_{t}-u>z | Y_{t}>u, \mathscr {F}_{t-1}) = \left( 1 + \frac{\xi z}{\exp (\kappa _{0} + \sum _{i = 1}^{q}\kappa _{i}x^{(i)}_{t-1})} \right) ^{-1/\xi }, \end{aligned}$$
(6)

where \(z = y_{t} - u\) denotes the excess. If \(\xi > 0\), \(z\ge 0\) and if \(\xi <0\), \(0\le z \le -\exp (\kappa _{0} + \sum _{i = 1}^{q}\kappa _{i}x^{(i)}_{t-1})/\xi \) (i.e., the excesses have an upper bound). The parameters \(\{\kappa _{i}\}_{i = 0, \dots , q}\) can be estimated by maximising the likelihood function

$$\begin{aligned}&\mathscr {L}(\varvec{\kappa }, \xi ; \textbf{z}_{t}, \textbf{x}_{t}) \nonumber \\ =&\prod _{t = l + 1}^{T} \left( \frac{1}{\exp (\kappa _{0} + \sum _{i = 1}^{q}\kappa _{i}x^{(i)}_{t-1})} \times \left[ \left( 1 + \frac{\xi z_{t}}{\exp (\kappa _{0} + \sum _{i = 1}^{q}\kappa _{i}x^{(i)}_{t-1})} \right) ^{-1/\xi -1}\right] _{+}\right) ^{I_{t}}, \end{aligned}$$
(7)

where \([x]_{+} = \max (0, x)\), and the time subscript is added to the excesses, \(\textbf{z}\), to denote their temporal indices. Henceforth, the maximisation of (7) to estimate \(\{\kappa _{i}\}_{i = 0, \dots , q}\) will be referred to as GPD regression. When the estimated shape parameter \(\hat{\xi }\) is not significantly different from zero, it is set to zero and an exponential regression is used instead.
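Because (7) is not a standard regression likelihood, it is maximised numerically. Below is a minimal R sketch of such a GPD regression, assuming hypothetical inputs z (the excesses at exceedance times) and X (the matching covariate matrix with a leading intercept column); the shape parameter is held fixed at a constant-GPD estimate, as in the text.

```r
# Sketch of GPD regression: minimising the negative log-likelihood implied
# by (7) with `optim`, with the shape fixed at the constant-GPD estimate.
# `z` holds the excesses z_t = y_t - u at exceedance times (I_t = 1) and
# `X` the matching covariate matrix with an intercept column (hypothetical).
gpd_nll <- function(kappa, xi, z, X) {
  nu  <- exp(X %*% kappa)              # time-varying scale, Eq. (5)
  arg <- 1 + xi * z / nu
  if (any(arg <= 0)) return(Inf)       # excess outside the GPD support
  sum(log(nu) + (1 / xi + 1) * log(arg))
}

xi_hat <- -0.125                       # e.g., a constant-GPD estimate
fit <- optim(par = c(log(mean(z)), rep(0, ncol(X) - 1)),
             fn = gpd_nll, xi = xi_hat, z = z, X = X, method = "BFGS")
kappa_hat <- fit$par                   # (kappa_0, ..., kappa_q)
```

If \(\hat{\xi }\) is not significantly different from zero, the exponential case (the \(\xi \rightarrow 0\) limit) would be coded separately, as noted above.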

3 Case Study: Piton de la Fournaise Volcano

3.1 Data

The dynamic extreme value model is used to forecast eruptions at the Piton de la Fournaise volcano. Situated on La Réunion Island, Piton de la Fournaise is one of the most active basaltic volcanoes, with an average of one eruption every 10 months (Roult et al. 2012). In addition to the existing seismic monitoring stations, 15 broadband stations were installed on the volcano as part of the Understanding Volcanic Processes (UnderVolc) project in 2009–2010 (Taisne et al. 2011). Figure 1 shows the locations of the UnderVolc stations together with that of Station Sismologique de Riviere de l'Est (RER) of the Geoscope seismic network. The data collected are available at https://www.fdsn.org/networks/detail/YA_2009/ and cover the years 2009 to 2011, during which five eruptions were recorded at Piton de la Fournaise.

To avoid issues arising from varying energy index magnitudes at different recording stations due to their respective distances from the eruption centres, data from one station were used to train the model. This is the UV05 station, which was the closest available station to the January 2010 eruptive centre (Journeau et al. 2020) and which is approximately 571 m from the recorded starting point of the first fissure (Roult et al. 2012).

Data from four out of the five recorded eruptions were used in the analysis. Data from the December 9, 2010 eruption were not used for training or testing because the location of the eruption (Flank N) is relatively far away from the rest of the event locations and hence could skew the results.

Table 1 shows the key characteristics of the four eruption events as outlined by Roult et al. (2012). To evaluate the forecast performance in Sect. 5, three of the events will be used for training the model, and the other event will be used for testing. Three non-events are used for further validation. The non-event dates were chosen to be roughly halfway between the selected four events and represent quiet periods for which one should not expect threshold exceedances of the eruption indices.

Fig. 1

Map of the Piton de la Fournaise volcano, corresponding to Fig. 2 of Seydoux et al. (2016). The locations of the Understanding Volcanic Processes stations are represented by red triangles, and that of Station Sismologique de Riviere de l’Est (RER) of the Geoscope seismic network is represented by the blue triangle

Table 1 Events and non-events used to train and test the forecasting model

3.2 Illustration of Method

Fig. 2

a Seismic signal for the 1–5 Hz frequency band for the January 2010 eruption event with b a close-up look at the period near the time of eruption on January 2. c The corresponding envelope index in decibels (dB), with d a zoom into the same time period on January 2. The dotted, dashed and bold blue vertical lines denote the start of the seismic crisis, the start and end of the seismic swarm and the eruption onset respectively. Note that the time series are downsampled such that every 5,183rd reading and every 402nd reading are shown for the top and bottom plots, respectively

To illustrate the use of the dynamic extreme value model for eruption forecasting, the method is first applied to the January 2010 event data (training event 3), since it is the best documented of the chosen events. Figure 2a shows the raw seismic signal for the 1–5 Hz frequency-filtered data. This is given in counts, which correspond to the numbers read off the seismometer (i.e., the voltage read from the sensor). To scale it to physical units (e.g., m/s), a number of factors including the frequency of motion being measured and the calibrated zero-point of the instrument would need to be considered. Since only a scaling factor is involved, the same modelling results are obtained whether the raw seismic signal is expressed in counts or in physical units.

Figure 2b focuses on January 2 when the eruption occurred; the dotted, dashed and bold blue vertical lines denote the start of the seismic crisis, the start and end of the seismic swarm, and the eruption onset, respectively. As documented by Roult et al. (2012), a seismic crisis took place from 07:50 local time, that is, the rate of seismic events increased above background seismicity. Between 08:10 and 09:02, a seismic swarm was recorded. This corresponds to when seismic events overlap and become hard to differentiate (McNutt and Roman 2015) and is reflected in the seismic signal as larger fluctuations in the readings between the dashed blue vertical lines. The swarm was followed by a relatively quiet phase that directly preceded the onset of the eruption at about 10:20 (the eruption is indicated by the continuous seismic tremor in the figure). The whole eruption was estimated to have lasted 9.6 days, ending at 00:05 on January 12, 2010.

3.2.1 Trace Envelope as an Eruption Index

Before fitting the dynamic extreme value model for eruption forecasting, the eruption index from which to consider threshold exceedances needs to be chosen. Here, the trace envelope \(e_{t}\) for \(t = 1, \dots , T\) is used. This can be seen as the instantaneous amplitude of the seismic trace \(\textbf{s} = (s_{1}, \dots , s_{T})\), and can be computed as follows:

  1. First, the discrete Fourier transform (DFT) of \(\textbf{s}\) is computed for \(t \in \{1, \dots , T\}\) (this is implemented by the R function ‘fft’)

    $$\begin{aligned} f_{t} = \textrm{DFT}(\textbf{s})_{t} = \sum _{k=1}^{T} s_{k}\exp (-2\pi i(k-1)(t-1)/T), \end{aligned}$$
    (8)

    where T is the length of the seismic trace. Set \(\textbf{f} = (f_{1}, \dots , f_{T})\).

  2. Next, the complex Hilbert transform of \(\textbf{s}\) is computed via the fast Fourier transform (FFT) as

    $$\begin{aligned} H_{t} = \textrm{IFT}(\textbf{f}\textbf{h})_{t}/T, \end{aligned}$$
    (9)

    where \(\textbf{f}\textbf{h}\) denotes the elementwise product and \(\textbf{h}\) is the length-T weight series with \(h_{1} = 1\), \(h_{k} = 2\) for \(k = 2, \dots , T/2\), \(h_{T/2+1} = 1\) and \(h_{k} = 0\) otherwise when T is even, and with \(h_{1} = 1\), \(h_{k} = 2\) for \(k = 2, \dots , (T+1)/2\) and \(h_{k} = 0\) otherwise when T is odd. The inverse Fourier transform (IFT) is defined as

    $$\begin{aligned} \textrm{IFT}(\textbf{f}\textbf{h})_{t} = \sum _{k=1}^{T} f_{k}h_{k}\exp (2\pi i(k-1)(t-1)/T), \end{aligned}$$
    (10)

    for \(t \in \{1, \dots , T\}\).

  3. For \(t \in \{1, \dots , T\}\), the trace envelope of the seismic trace \(\textbf{s}\) is defined as

    $$\begin{aligned} e_{t} = \sqrt{\textrm{Re}^{2}(H_{t}) + \textrm{Im}^{2}(H_{t})} = \textrm{Mod}(H_{t}). \end{aligned}$$
    (11)
The above steps are in line with those used within the R function ‘envelope’ in the IRISSeismic package.
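For concreteness, a simplified R reimplementation of steps 1–3 is sketched below; it mirrors the logic of the IRISSeismic functions but is not the package code itself.

```r
# Sketch of the envelope computation (steps 1-3), mirroring the logic of
# the 'envelope'/'hilbertFFT' functions in the IRISSeismic package.
# Assumes a seismic trace `s` much longer than a few samples.
trace_envelope <- function(s) {
  T_len <- length(s)
  f <- fft(s)                               # step 1: DFT of the trace
  h <- rep(0, T_len)                        # step 2: analytic-signal weights
  if (T_len %% 2 == 0) {
    h[1] <- 1; h[2:(T_len / 2)] <- 2; h[T_len / 2 + 1] <- 1
  } else {
    h[1] <- 1; h[2:((T_len + 1) / 2)] <- 2
  }
  H <- fft(f * h, inverse = TRUE) / T_len   # inverse FFT, Eqs. (9)-(10)
  Mod(H)                                    # step 3: envelope e_t, Eq. (11)
}

# Eruption index in decibels, as used in Sect. 3.2.2:
# Y <- 20 * log10(trace_envelope(s))
```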

In Al-Mashhor et al. (2019), the first arrival travel time picking algorithm was based on the envelope in decibels. Similarly, the trace envelope \(e_{t}\) is converted to decibels via \(Y_{t} = 20\log _{10}(e_{t})\) (Moore 1995, p. 11) and \(Y_{t}\) is used as the eruption index. Figure 2c shows the trace envelope time series computed from the 1–5 Hz frequency-filtered data for the January 2010 event. The 1–5 Hz frequency band is used because it is strongly associated with volcanic/magmatic activity and fluid resonances (Salvage et al. 2019). Results for the 5–15 Hz frequency band are given in the Supplementary Information. In Fig. 2c, it is seen that although there are some spikes at the start of January 1, the envelope index remains relatively low, around 50 dB, until the recorded seismic crisis on January 2 (represented by the dotted blue vertical line in the bottom plot). In Fig. 2d, the index increases from the time of the seismic crisis to a first peak during the seismic swarm before waning slightly. From about 10:00, the index increases steadily to a plateau slightly above 80 dB, the timing of which coincides with the recorded eruption onset (represented by the bold blue vertical line in the bottom plot). The relationship between the increases in the index and the seismic events enables its threshold exceedances to be used to forecast eruptions.

3.2.2 Forecast Horizon and Covariate Window

In addition to computing the index from which to take exceedances, choices need to be made on the following:

  (i) the forecast horizon \(\delta _{t}\): the time period between the time window where the covariates are computed and the forecast time; and

  (ii) the covariate window W: the time window within which past observations contribute to the covariates.

For illustration, 1-h-ahead forecasts are made with 1 h of past data to inform the covariates in the model (\(\delta _{t} = W = 1\) h). This mimics the settings used by Brenguier et al. (2008), Malfante et al. (2018) and Ren et al. (2020). Their analyses involved the classification of 1-h-long signals or the generation of the covariates by scanning a moving window of length 1 h across the seismic signals. This framework is pictured in Fig. 3a, where the time period between \(t-1\) and t is \(\delta _{t} = 1\) h, and the covariate window \(W = 1\) h.

Although the covariates were computed from the original high-frequency (100 Hz) data, forecasts were only produced every 10 s to reduce unnecessary computational burden. For a practically useful workflow, this forecast interval should be longer than the time required to compute the covariates from the past hour and to generate the forecast from the fitted model.
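A sketch of this rolling workflow is given below, in which compute_covariates is a hypothetical helper returning a one-row data frame of (transformed) covariates from the past hour of signal s, and exc_fit is a fitted exceedance model as in the sketch of Sect. 2.

```r
# Sketch of the rolling forecast loop: covariates from a 1-h window ending
# at the forecast origin, a 1-h horizon, and forecasts issued every 10 s.
fs      <- 100                 # sampling rate (Hz)
W       <- 60 * 60 * fs        # covariate window: 1 h of samples
horizon <- 60 * 60 * fs        # forecast horizon: 1 h ahead
step    <- 10 * fs             # issue a forecast every 10 s

origins <- seq(from = W, to = length(s) - horizon, by = step)
forecasts <- vapply(origins, function(t0) {
  covs <- compute_covariates(s[(t0 - W + 1):t0])     # past hour only
  # forecasted exceedance probability for time t0 + horizon:
  as.numeric(predict(exc_fit, newdata = covs, type = "response"))
}, numeric(1))
```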

Fig. 3

a Model conceptual diagram for forecasting exceedances of the 1–5 Hz envelope index. At time \(t-1\), the covariates are computed from signals across multiple frequency bands within a past time window (step 1). These are then used to forecast threshold exceedances at time t (step 2). b p values from the Anderson–Darling (AD) and Cramér–von Mises (CVM) tests for goodness of fit of the excesses to the GPD distribution. The lowest threshold for which the p value exceeds the 10% significance level is chosen

3.2.3 Threshold Selection

Next, the threshold to define the extreme regime is selected. This will be associated with the covariates to forecast extreme behaviour (i.e., exceedance). Based on the Anderson–Darling (AD) and Cramér–von Mises (CVM) tests for goodness of fit of the excesses to the GPD distribution (Bader et al. 2018), Fig. 3b suggests that under a significance level of 10%, a threshold of 85 dB is reasonable, since this is the lowest threshold for which the p-value exceeds the significance level. At this threshold, the null hypothesis of the excesses coming from a GPD distribution is not rejected.
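A sketch of this selection in R is shown below, assuming the 'eva' package of Bader and Yan, whose 'gpdAd' function implements the Anderson–Darling test ('gpdCvm' gives the Cramér–von Mises analogue); the candidate grid and the simple selection rule are illustrative.

```r
# Sketch of EVT-guided threshold selection: scan candidate thresholds and
# keep the lowest one whose excesses pass a GPD goodness-of-fit test.
library(eva)

candidates <- seq(60, 95, by = 1)    # candidate thresholds in dB (illustrative)
p_vals <- sapply(candidates, function(u) gpdAd(Y[Y > u] - u)$p.value)

# Lowest threshold whose excesses pass the AD test at the 10% level:
u_star <- min(candidates[p_vals > 0.10])
```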

3.2.4 Covariates

To forecast threshold exceedances of the eruption index, the covariates suggested by Malfante et al. (2018) are computed for the frequency-filtered data (0.1–1 Hz, 1–5 Hz, 5–15 Hz, 0.1–20 Hz and high pass 0.01 Hz) and their associated trace envelopes. Malfante et al. (2018) introduce three domains of representation of the seismic time series which could be useful: the original temporal domain, the frequency domain where spectral content is obtained via a Fourier transform, and the cepstral domain where the Fourier transform is computed twice to highlight the harmonic properties of a given signal.

For each of these three representations of the seismic traces and their trace envelopes, statistical, entropy and shape descriptor features are computed. These are listed together with their definitions in Table 2. Each covariate or feature tries to capture a particular aspect of the signal within the covariate window. For example, while kurtosis captures the transition between two signals, Shannon entropy describes the distribution of the amplitude levels of a given signal.

A covariate can also take on different meanings depending on the domain it is computed on. For example, the feature ‘i of Central Energy’, which is the time around which the signal energy is centred (the time centroid) in the temporal domain, can be interpreted as the fundamental frequency in the frequency domain and the harmonic frequency in the cepstral domain. In addition, while the ratio of the maximum value to the mean value can describe the contrast and relate to the cause of the event in the original temporal domain, it describes the spectral richness of the signature in the frequency domain and the harmonic content of an observation in the cepstral domain.

To ensure that the model is not too sensitive towards extreme covariate values, the covariates were transformed before use. Box–Cox analyses were used to select which power or log-transformation was required to make their distributions more similar to Gaussian distributions. After transformation, the covariates were standardised using their mean and standard deviations to be on similar scales. To account for multicollinearity, the covariates were ordered according to the increasing Akaike information criterion (AIC) of their univariate models (for threshold exceedance and excesses separately). Then, the pool of covariates was reduced by removing covariates that had more than 0.6 in absolute correlation to covariates that were deemed more informative than themselves.
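The following R sketch illustrates this preprocessing on a hypothetical covariate matrix X whose columns are already ordered by increasing univariate AIC; the Box–Cox step uses MASS::boxcox and is one of several reasonable implementations.

```r
# Sketch of the covariate preprocessing: Box-Cox-style transformation,
# standardisation, then correlation-based pruning.
library(MASS)

transform_covariate <- function(x) {
  x <- x - min(x) + 1e-6                    # shift to positive support
  bc <- boxcox(x ~ 1, plotit = FALSE)       # profile likelihood over lambda
  lambda <- bc$x[which.max(bc$y)]
  if (abs(lambda) < 0.1) log(x) else (x^lambda - 1) / lambda
}

X_std <- scale(apply(X, 2, transform_covariate))   # mean 0, sd 1 per column

# Drop covariates with |correlation| > 0.6 to a better-ranked (earlier) one:
keep <- rep(TRUE, ncol(X_std))
C <- abs(cor(X_std))
for (j in seq_len(ncol(X_std))[-1]) {
  if (any(C[j, which(keep[seq_len(j - 1)])] > 0.6)) keep[j] <- FALSE
}
X_reduced <- X_std[, keep]
```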

Table 2 Covariates considered for the dynamic extreme value models

3.2.5 Regression Models

As outlined in Sect. 2, a logistic regression and a GPD regression were fitted for threshold exceedances and threshold excesses, respectively. The shape parameter of the GPD is fixed to the estimate obtained by maximum likelihood for a constant GPD. For the January 2010 training event, this has an asymptotically normal 95% confidence interval of \((-\,0.300, -\,0.160)\), which does not include 0. The negative shape parameter implies that the distribution of the excesses lies within the Weibull domain of attraction, which contains distributions with short tails (i.e., finite endpoints).

The covariates in the models were chosen by stepwise selection based on AIC. There are three options for stepwise variable selection: forward, backward and bidirectional. The default configuration of the ‘stepAIC’ function in the R package ‘MASS’ is bidirectional (Venables and Ripley 2002). For the logistic regression, a backward search is conducted before considering forward selection; the backward direction is often preferred over the forward direction because the full model and the effect of all candidate variables are considered (Steyerberg 2009; Harrell 2015; Chowdhury and Turin 2020). For the GPD regression, a forward search is used before backward selection to avoid the singularity issues that arise from having a large number of covariates, many of them uninformative, in the model.
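For the logistic regression, this backward-then-forward search can be carried out with ‘stepAIC’, as sketched below on a hypothetical data frame dat_reduced containing the exceedance indicator I and the pruned covariates; for the GPD regression, an analogous search has to be coded around the likelihood (7).

```r
# Sketch of the stepwise covariate selection with MASS::stepAIC.
library(MASS)

full <- glm(I ~ ., data = dat_reduced, family = binomial)
back <- stepAIC(full, direction = "backward", trace = FALSE)

# Forward pass over the remaining candidates, starting from the backward fit:
final <- stepAIC(back,
                 scope = list(lower = ~1, upper = formula(full)),
                 direction = "forward", trace = FALSE)
```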

The five steps outlined in Sects. 3.2.1–3.2.5 can be repeated to model threshold exceedances and excesses for the other frequency-filtered envelopes.

3.3 Results

The black line in Fig. 4a shows the 1-h-ahead probabilistic forecasts for threshold exceedances of the 1–5 Hz envelope. The forecasts are highest for January 2, the day the eruption started. Focusing on January 2 in Fig. 4b, the exceedance probability jumps about 1 h before the recorded eruption onset at 10:20. This means that the fitted forecast model is able to give a warning about 1 h ahead of the eruption.

Similar to Bee et al. (2019), the goodness of fit of the logistic regression is checked using a deviance chi-squared test. The p-value was \(e^{-1,113.53}\), indicating that the fitted model is significantly different from a null model (a logistic regression with an intercept term but no covariates). The usefulness of the covariates for explaining the temporal dependence in the occurrence of the threshold exceedances can also be seen through the reduction in the autocorrelation of the Pearson residuals in Fig. S1a of the Supplementary Information. In contrast, little temporal dependence was observed for the excess residuals in Fig. S1b. Hence, there was no real benefit of using covariates to inform a dynamic GPD and a constant GPD would have sufficed. As will be seen later, this is threshold-specific: when multiple events are used to train the model in Sect. 5 and the lowest EVT-informed threshold is selected among the training events, there will be autocorrelation in the excess residuals and hence benefits of modelling with a dynamic GPD.
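A sketch of this deviance check in R is given below, with names carried over from the earlier sketches; working on the log scale avoids numerical underflow when the p-value is astronomically small.

```r
# Sketch of the deviance chi-squared test against an intercept-only null:
null_fit <- glm(I ~ 1, data = dat_reduced, family = binomial)
dev_diff <- null_fit$deviance - final$deviance
df_diff  <- null_fit$df.residual - final$df.residual

# Log of the p-value (stable even when the p-value underflows to zero):
log_p <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE, log.p = TRUE)
```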

Fig. 4

a Training 1-h-ahead exceedance forecasts based on the logistic regression for the 1–5 Hz envelope index and the January 2010 eruption; b zoomed into the period of significant volcanic activity (the dotted, dashed and bold blue vertical lines denote the start and end of the seismic crisis, swarm and eruption onset respectively). Here, the value of the black line at 09:00, for example, indicates the forecasted probability of exceedance for 10:00

4 Value of Extreme Value Theory

Fig. 5

a Improvement in goodness of fit when the threshold used increases from 50% (top) to 100% (bottom) of the value chosen by extreme value theory; b improvement in the training performance in terms of area under the curve (AUC) for the 1–5 Hz, 0.1–20 Hz and high pass 0.01 Hz index exceedance models when the threshold increases

EVT was used to select the threshold which defined the exceedances and excesses being forecasted by the dynamic extreme value model. Figure 5a shows the goodness-of-fit plots comparing the modelled and empirical probabilities and return levels when 50% and 100% of the threshold informed by EVT were used. The latter provided a better fit to the model assumptions.

The threshold choice is also important for determining what kind of phenomenon is being modelled and which covariates are chosen to best explain it. Figure 5b shows that for the exceedance forecasting of the 1–5 Hz, 0.1–20 Hz and high pass 0.01 Hz envelope indices, the forecast performance, as measured via the area under the curve (AUC), generally improves as the threshold is increased towards that informed by EVT. Hence, EVT has benefits for modelling in terms of both goodness of fit to the data and forecast performance.
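As an illustration of how such a comparison could be set up, the sketch below refits the exceedance model at fractions of the EVT-informed threshold u_star and computes the training AUC; the 'pROC' package and the covariate data frame covariates are assumptions, not part of the original analysis.

```r
# Sketch of the training AUC comparison across thresholds.
library(pROC)

auc_for_threshold <- function(u) {
  I_u <- as.integer(Y > u)                          # exceedances at threshold u
  fit <- glm(I_u ~ ., data = covariates, family = binomial)
  as.numeric(auc(I_u, fitted(fit)))                 # area under the ROC curve
}

sapply(seq(0.5, 1, by = 0.1) * u_star, auc_for_threshold)  # 50% to 100%
```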

5 Evaluating Forecast Performance

5.1 Using Multiple Training Events

To assess the forecast capability of the dynamic extreme value model more formally, the model will be fitted to all three training events, and tested on the remaining test event and the three non-events.

Fig. 6

With multiple training events: comparison of goodness of fit in terms of the empirical and theoretical quantiles of the standardised excesses when a the GPD regression is used for the excess distributions instead of b treating the excess distribution as static through a constant GPD. Since the extreme quantiles lie closer to the one-to-one diagonal line for the GPD regression, it fits the data better

There are a few ways to determine a suitable threshold based on data from multiple events. An initial approach might be to simply combine the data across events and use all exceedances to inform the threshold. However, this led to a relatively high threshold estimate of 95 dB, with training event 2 dominating the model fit because there were comparatively more exceedances from event 2 than from events 1 and 3 (see Sect. 4 of the Supplementary Information).

An alternative approach, which will be used, is to estimate thresholds for the training events separately and use the lowest estimate across events. This ensures sufficient exceedances to represent each event. For the 1–5 Hz envelope index, the lowest threshold among the three training events was 85 dB, and the estimated GPD shape parameter was \(\hat{\xi } = -0.125\) with 95% confidence interval \((-\,0.146, -\,0.104)\). Tables 1 and 2 of the Supplementary Information show the chosen covariates with their transformations and parameter estimates for the fitted logistic and GPD regressions, respectively.

As will be seen in the next section, this threshold choice leads to 1-h-ahead forecast probabilities which increase before the time of eruption for all three training events and the test event while remaining low for the three test non-events. With the threshold lowered to identify extremes across all three training events, there is more benefit in modelling the threshold excesses dynamically because there is autocorrelation in the excess residuals, particularly for training event 2 (see Sect. 2 of the Supplementary Information). In contrast to the initial illustration with just training event 3 in Sect. 3, modelling the excess distribution dynamically leads to better estimation of extreme quantiles, though there is still room for improvement. This is illustrated in Fig. 6.

5.2 Training and Test Performance

Fig. 7

Training event 1: 1-h-ahead threshold exceedance forecasts using the lowest threshold estimated across events. The dotted, dashed and bold blue vertical lines denote the times of the seismic crises, start and end of the seismic swarms and the eruption onset respectively

Fig. 8

Training event 2: 1-h-ahead threshold exceedance forecasts using the lowest threshold estimated across events. The dotted, dashed and bold blue vertical lines denote the times of the seismic crises, start and end of the seismic swarms and the eruption onset respectively

Fig. 9

Training event 3: 1-h-ahead threshold exceedance forecasts using the lowest threshold estimated across events. The dotted, dashed and bold blue vertical lines denote the times of the seismic crises, start and end of the seismic swarms and the eruption onset respectively

Fig. 10

Test event: 1-h-ahead threshold exceedance forecasts using the lowest threshold estimated across events. The dotted, dashed and bold blue vertical lines denote the times of the seismic crisis, seismic swarm and the eruption onset

After fixing the threshold for which to model exceedances, the dynamic extreme value model is fitted to data from the three training events. Figures 7, 8 and 9 show that, apart from an outlier on the first day of training event 3, the forecast probabilities remain low (e.g., below 0.3) for all three events until the time of their recorded seismic events. For training event 1 (referring to Fig. 7b), sustained high forecast probabilities start during the seismic swarm (between the dashed vertical lines), slightly more than 1 h before the recorded eruption at 17:00. For event 2, 1-h-ahead eruption warnings can also be made as the forecast probabilities begin to take high values during the seismic swarm, about 1 h before the recorded eruption at 14:40 (see Fig. 8b). For event 3, the forecast probabilities gradually increase from the time of the seismic swarm around 08:30 before jumping up to a higher plateau about an hour before the recorded eruption onset at 10:20 as shown in Fig. 9b.

The features of the forecast probabilities, namely the sharp jumps from near-zero for training events 1 and 2, the gradual increase for training event 3, the presence of outliers and the tendency to increase 1 h before the recorded eruption onsets, stem from the chosen covariates of the logistic regression. As can be inferred from their high coefficient estimates in Table 1 of the Supplementary Information, the logistic regression for threshold exceedance has three covariates that contribute to forecast probabilities more than the others: 0.1–20 Hz cepstral kurtosis, 0.1–20 Hz cepstral skewness and high pass 0.01 Hz energy.

Figures S12 to S15 in the Supplementary Information show the time series of these covariates for the training and test events. Unlike the 0.1–20 Hz cepstral kurtosis and skewness, the high pass 0.01 Hz energy has more block-like features in its time series. This drives the sharp jumps from near zero for training events 1 and 2 during their seismic swarms. In contrast, the change in high pass 0.01 Hz energy during the period of the seismic events on January 2 of training event 3 was smoother, resulting in a smoother increase in forecast probabilities. The block-like features result from one extreme value in the high pass 0.01 Hz signal which causes high energy values for the length of the moving covariate window (1 h).

Since energy was defined as the sum of the squared signal, this covariate is also very sensitive to outliers. Future work could explore making the covariates more robust to outliers. For example, instead of using the mean or average operation \(\frac{1}{n}\sum _{i = 1}^{n}f_{i}\) on any sequence \(\{f_{i}: i = 1, \dots , n\}\), the median, which is less sensitive to outliers, can be used. For covariates involving minimum or maximum values, the 0.1 or 0.9 quantiles can be considered.
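For instance, robust variants of the summary operations could be defined as follows (a sketch; the original covariate definitions follow Table 2).

```r
# Sketch of robust variants of the summary operations used in the covariates:
energy      <- function(f) sum(f^2)          # current definition: outlier-sensitive
robust_mean <- function(f) median(f)         # median in place of the mean
robust_max  <- function(f) quantile(f, 0.9)  # 0.9 quantile in place of the maximum
robust_min  <- function(f) quantile(f, 0.1)  # 0.1 quantile in place of the minimum
```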

Still focusing on high pass 0.01 Hz energy, Figs. S12 to S15 show that the covariate values tend to increase during the seismic swarms. Since the seismic swarms for training events 1, 2 and 3 occur slightly more than 1 h prior to their recorded eruptions, monitoring the energy values seems to provide good 1-h-ahead forecasts for the eruptions. It seems likely that the length of the covariate window and the forecast horizons can be optimised depending on the expected seismic crisis and swarm durations at a volcano. If the seismic swarms precede the eruption by a longer time period, as in the test event, the 1-h-ahead forecast probabilities may not be so temporally accurate. This is observed in Fig. 10 for the 1-h-ahead forecast probabilities for the test event. Here, the seismic crisis lasts 5 h 30 min instead of 1–2.5 h for the training events. Referring to Fig. S15 of the Supplementary Information, the sole forecast probability outlier on October 13 and the spikes in forecast probabilities in the earlier part of October 14 are in line with spikes in the high pass 0.01 Hz energy covariate. However, while the largest energy value was recorded during the seismic swarm (see Fig. S15f), the forecast probability was higher nearer to the actual eruption onset. This shows the effect of the other covariates, including the 0.1–20 Hz cepstral kurtosis and skewness, which work together to moderate the forecast probabilities. By nature of the covariates involved, the forecast probabilities are sensitive to different aspects of seismicity.

In addition to the training and test events, the potential for the dynamic extreme value model for eruption forecasting is examined by looking at its performance for the three non-events. In line with expectations, the corresponding threshold exceedance forecasts remain very low, compared to the magnitudes during the training and test events (see Fig. S16 of the Supplementary Information). Similar training and test results were obtained for the 5–15 Hz frequency-filtered data. The corresponding plots are given in Sect. 8 of the Supplementary Information.

6 Discussion and Outlook

In Sect. 3, the dynamic extreme value model was fitted to training event 3, the seismic time series for the January 2010 eruption at Piton de la Fournaise. Promising results were obtained with spikes in the probabilistic forecasts about an hour prior to the eruption onset. This means that in this case, 1-h-ahead warnings can be made with the chosen set-up: using the 1–5 Hz trace envelope as an eruption index with a 1-h covariate window and 1-h forecast horizon. Similar performance was also seen when the model was fitted to more data in Sect. 5.

In addition to the good forecast results, it was seen that EVT is useful for choosing the threshold from which to model exceedances. An appropriate threshold is important because it determines the balance between the number of exceedances used to inform covariate selection and the adherence of the threshold excesses to the asymptotic theory. In general, with a higher threshold, there are fewer exceedances to inform the model which could mean higher estimation uncertainty but the excess distribution becomes closer to a GPD.

By examining the training performance of the logistic regression for different threshold values, the analysis in Sect. 4 showed that using EVT to choose the threshold can improve forecast performance. When the threshold was increased towards its EVT-informed value, the AUC, a measure of how well the model can distinguish between exceedances and non-exceedances, generally increased for the 1–5 Hz, 0.1–20 Hz and high pass 0.01 Hz envelope indices. Intuitively, the threshold determines what it means to be extreme; hence it affects the covariates selected and thus the forecast performance.

A related point is that what it means to be extreme, and hence the chosen threshold and forecast probabilities, are volcano-specific. In this paper, the dynamic extreme value model was fitted for one volcano, Piton de la Fournaise, based on its seismic data; the chosen thresholds and forecast probabilities are therefore specific to this volcano. If similar models were fitted for other volcanoes separately, their forecast probabilities would not be directly comparable since they reflect each volcano's natural characteristics, such as eruption frequency.

In practice, one would aim to train the forecast model with as much relevant data as possible. Section 5 presents some considerations to make when incorporating data from different events. For example, the p-values of GPD goodness-of-fit tests identify the threshold beyond which the excesses can be seen to follow a GPD. This procedure assumes that the same constant GPD applies to all the training events. However, as observed, different events can suggest different thresholds. Specifically, for training event 2, a higher threshold was inferred for the energy-related envelope index. This difference in envelope index values between events could be because some eruptions occur closer to the measurement station or because the eruption itself has a higher flux.

The proposed strategy to deal with the different threshold choices is to take the lowest identified threshold. The GPD regression component of the dynamic extreme value model would then model the non-stationarity of the GPD with appropriate covariates. In fact, modelling the excess distribution dynamically was seen to be more useful when the model was trained using multiple events and the lower threshold as compared to previously with just a single training event in Sect. 3. If instead a higher threshold was used, as suggested when all the training data were combined, only large events would be forecasted because training event 2 would dominate the model fit, resulting in the inability to forecast training events 1 and 3 well.

The proposed modelling framework is still far from operational use and can be extended in various ways. In the analysis, 1-h covariate and forecast windows were used. More work can be done to optimise these durations, which are likely to be related to the seismic crisis and swarm durations. While the training events had seismic crises which lasted about 1–2 h, the test event had a much longer seismic crisis duration of 5 h 30 min. This could explain why the training forecasts were more temporally precise than the test forecasts. One could also experiment with computing the trace envelope in different ways. For example, different sliding windows to compute the Hilbert transform can be tested. Non-Hilbert-transform methods of estimating amplitudes can also be used (Rosenblum et al. 2021). In addition, one might investigate the effect of different significance levels in the goodness-of-fit tests.

The analysis focused on seismic signals in the 1–5 Hz frequency range. As noted in the literature (Bormann et al. 2013) and observed from the AUC comparison in Fig. 5b, some frequency bands can be more useful for eruption forecasting than others. By combining information across useful frequency ranges through a joint, multivariate model, a more comprehensive eruption forecast can be provided. Similarly, one can incorporate different monitoring signals such as gas emissions and ground deformation in the model. So far, only indices and covariates based on seismic signals have been used due to the prevalence of seismometers as volcano monitoring tools. Future extensions can include different monitoring signals and account for their interdependence and shared covariates via joint models.

Data from one seismic station (UV05) were used in the analysis. The promising results indicate that such methods can be useful for volcanoes where there is only one measuring station. To check that the performance of the modelling approach does not depend on the choice of UV05 (the closest available station to the January 2010 eruptive centre, a choice only possible in hindsight), additional analyses were conducted for two other stations (UV08 and UV11). The results are provided in Sect. 9 of the Supplementary Information.

The training and test event results for the UV08- and UV11-fitted models indicate that a model fitted to another station can be used to forecast eruptions that occur closer to other stations; this can be useful for mitigating the issue of a damaged seismometer. Nevertheless, it is preferable to train the model on data from the station closest to the eruptions in the training set and to use that model to forecast at the same station. This is because, when the station is further away from the eruptive centre, the threshold exceedances of the envelope may be linked with other mechanisms which precede or follow the eruption.

With more data, each eruption event can be matched to the nearest station, and this station-specific training data can be used to build station-specific models. Future work can look into linking these station-specific models in a multivariate framework, taking into account their spatiotemporal relations. Since the detected energy depends on the distance from the eruption location, an additional benefit of extending the framework to model signals from the monitoring network as a whole may be the ability to inform not just the timing, but also the location of the eruption through the varying forecast probabilities at different stations. In line with the extreme value approach, a multivariate generalised Pareto framework, similar to that of Rootzén and Tajvidi (2006), could be useful for extensions to multivariate and spatial models.

Given that the dynamic extreme value model has worked well for financial forecasting (Bee et al. 2019) and can be adapted for volcanic eruption forecasting, it is postulated to have high potential for wider applications. In particular, high-sampling-rate data were used in both the financial and volcanic contexts. For the former, they were used to compute realised variations over relatively short horizons, while in the latter, they helped to separate different frequency bands of interest. High-sampling-rate data relevant to other natural hazards and crises are also becoming available; however, what is deemed a high sampling rate is highly context-specific. For example, sea-level data were previously publicly available only at monthly or annual scales, so a 1–15-min resolution is deemed a high sampling rate; such data are increasingly sought after to study extreme sea levels and coastal flooding (Woodworth et al. 2016; Ozsoy et al. 2016; Zemunik et al. 2021). Using such data, the dynamic extreme value model can be adapted to forecast extreme sea levels and their impact on coastal communities. While high sampling rates are good to have, the framework itself does not depend on them; in their absence, one can still forecast, albeit on a coarser scale.

To adapt the dynamic extreme value model to wider settings, there are several general considerations, namely: what is a suitable index to compute threshold exceedances? What are reasonable forecast horizons and covariate windows? What covariates can be used to inform future behaviour? One might also consider using algorithms for selecting the threshold automatically (see, for example, Bader et al. 2018).

A general practical consideration that is shared across all contexts, be it finance, volcanoes or other hazards and crises, is the translation of the forecast probabilities into decisive action. What forecast probability warrants a warning or more drastic measures such as evacuation? The optimal strategy may not be straightforward but may involve many competing priorities and constraints, and should be determined on a case-by-case basis with multiple stakeholders.