1 Introduction

Time series forecasting in the presence of extreme events is a critical tool for resource allocation and resilience planning (Jing et al. 2021; Santos-Burgoa et al. 2018; Khan et al. 2021). Extreme events such as natural disasters now cause more than four times the economic damage in the U.S. compared to the 1990s (Smith 2022). This calls for highly accurate forecasts with low uncertainty that can uncover the influence of external events on large-scale time series data (Adilova et al. 2021). Moreover, it is crucial to understand how different industries are influenced by and recover from such extreme events over time (Rolnick et al. 2019). Yet, developing such reliable and accurate forecasting models remains a challenge, as real-world datasets often contain anomalies that are by nature rare and random. It is therefore important to develop a forecast model that can leverage previously seen extreme events and anomalies in its forecasts.

Although there have been considerable achievements in machine learning-based models, existing methods tend to overlook the special effects anomalies have on real-world time series data. For instance, LSTMs (Hochreiter and Schmidhuber 1997) are widely used because their gate mechanism addresses the vanishing gradient problem and captures complex temporal dependencies (Zhu and Laptev 2017; Laptev et al. 2017). However, Khandelwal et al. (2018) show that even LSTMs have a limited ability to capture long-term dependencies, and their awareness of context degrades as the length of the input sequence increases, making them inefficient at capturing and learning from rare occurrences or extreme events.

As an alternative, Li et al. (2019) considered transformer models for time series forecasting. Transformers benefit from the self-attention mechanism, which allows each observation in the input sequence to attend independently to every other observation. However, their computational and memory requirements grow quadratically with sequence length, making them expensive to train on large-scale data (Li et al. 2019). This deficiency makes them computationally unsuitable for extreme events, which often appear across longer spans than the transformer's input window. Moreover, it was not clear from the design itself that transformers would be as effective as RNNs; Zaheer et al. (2020) reported that the attention mechanism in transformers does not obey the sequence order of time steps, which is essential in the time series domain. Furthermore, non-transformer architectures (e.g., MLPs) have been shown to perform competitively with transformers when designed and trained properly (Tolstikhin et al. 2021).

The lack of a systematic strategy for handling anomalies, together with forecasts whose uncertainty levels are not transparent, makes current forecast models unreliable in the presence of extreme events. As a result, a key aspect of developing time series models for the critical moments of extreme events will remain a puzzle unless the long-term effects of anomalies are well captured and utilized.

Contribution. This work proposes a novel and generalized anomaly-aware prediction framework, AA-Forecast, which automatically extracts and uses anomalies to optimize its probabilistic forecasting. Specifically,

  • AA-Forecast extracts anomalies through a novel decomposition method and leverages them through an attention mechanism designed to optimize its probabilistic forecasting during extreme events. AA-Forecast can also perform zero-shot prediction for unseen time series and does not suffer from the quadratic computational time and memory complexity of transformers.

  • An online optimization procedure is proposed to minimize the prediction uncertainty of the AA-Forecast framework by applying the optimal dropout probability at each time step during testing.

  • Extensive experimental studies are conducted on three real-world datasets that are prone to extreme events and anomalies. Comparisons with state-of-the-art models demonstrate the higher accuracy and lower uncertainty of AA-Forecast's predictions.

2 Problem formulation

In this study, we are interested in the task of time series forecasting under the influence of extreme events and anomalies. Given a dataset \({\textbf{D}} = \{{\textbf{x}}^{(1)},{\textbf{x}}^{(2)},\ldots ,{\textbf{x}}^{(K)}\}\) with K univariate time series, \({\textbf{x}}^{(k)}=\{x^{(k)}_{1}, x^{(k)}_{2},\ldots , x^{(k)}_{T}\}\) denotes a time series instance of length T, where \({\textbf{x}}^{(k)} \in {\mathbb {R}}^{T}\). For every time step, the corresponding extreme events are aligned and labeled as covariates \({\textbf{e}}^{(k)} = \{e^{(k)}_1, e^{(k)}_2,\ldots , e^{(k)}_{T}\}\). Extreme events are understood as the influence of external events that induce a dynamic occurrence within a limited number of time steps (Broska et al. 2020). Specifically, \(e^{(k)}_{t} \in {\mathbb {R}}\) indicates the level of the extreme event (e.g., hurricane category) at time t, while \(e^{(k)}_{t} = 0\) indicates a non-extreme condition for periods outside of an event. To this end, we denote the data with extreme events as a series of tuples \(\widehat{{\textbf{x}}}^{(k)} \overset{\varDelta }{=} \{(x^{(k)}_1,e^{(k)}_1),(x^{(k)}_2,e^{(k)}_2),\ldots , (x^{(k)}_{T},e^{(k)}_{T})\}\). Particularly, given the previous \(\tau \) observations \(\widehat{{\textbf{x}}}_{t-\tau +1:t}^{(k)} = \{({x}^{(k)}_{t-\tau +1},e^{(k)}_{t-\tau +1}), ({x}^{(k)}_{t-\tau +2},e^{(k)}_{t-\tau +2}),\ldots , ({x}^{(k)}_{t},e^{(k)}_{t})\}\), we aim to model the conditional distribution of the next observation:

$$\begin{aligned} {p}(x^{(k)}_{t+1}\ | \widehat{{\textbf{x}}}^{(k)}_{t-\tau +1:t} ; {\varvec{\Phi }}), \end{aligned}$$
(1)

where \({\varvec{\Phi }}\) denotes the parameters of a nonlinear prediction model. We are also interested in reducing the uncertainty of predictions in an online setting, where the uncertainty of a prediction can be viewed as the variability of the distribution. Therefore, the optimization problem in the online setting is defined as follows:

$$\begin{aligned} {\varvec{\Phi }}^{*}_{\text {on}} = \text {argmin}_{{\varvec{\Phi }}} \, {\mathcal {V}} \left( {p}(x^{(k)}_{t+1}\ | \ \widehat{{\textbf{x}}}^{(k)}_{t-\tau +1:t} ; {\varvec{\Phi }})\right) , \end{aligned}$$
(2)

where \({\mathcal {V}}\left( \cdot \right) \) represents the variability of the probability distribution and \({\varvec{\Phi }}^{*}_{\text {on}}\) denotes the optimal online parameters of the nonlinear prediction model that produce the least amount of uncertainty at each time step.
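To make the input construction concrete, the following is a minimal sketch (in NumPy; the function name and shapes are illustrative, not part of the framework) of slicing the tuple series \(\widehat{{\textbf{x}}}^{(k)}\) into windows of length \(\tau \) with next-step targets:

```python
import numpy as np

def make_windows(x, e, tau):
    """Slice a univariate series x and its event covariates e into
    (window, next-step target) pairs, following Eq. (1)."""
    pairs = np.stack([x, e], axis=-1)            # series of (x_t, e_t) tuples
    windows = np.stack([pairs[t:t + tau] for t in range(len(x) - tau)])
    targets = x[tau:]                            # ground-truth x_{t+1} per window
    return windows, targets

# toy usage: 12-step windows, one labeled category-4 event at t = 40
x = np.random.rand(100)
e = np.zeros(100); e[40] = 4.0
X, y = make_windows(x, e, tau=12)                # X: (88, 12, 2), y: (88,)
```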

3 AA-Forecast framework

The proposed AA-Forecast framework consists of three main components. Section 3.1 proposes a novel anomaly decomposition method that automatically extracts the anomalies and essential features of the time series data. Then, the extracted anomalies are fed into an anomaly-aware model detailed in Sect. 3.2. Specifically, it leverages an attention mechanism on anomalies and extreme events to produce the distribution of the forecasts. To further reduce the forecast uncertainty in an online manner, Sect. 3.3 proposes a dynamic uncertainty optimization algorithm.

Fig. 1: Main components of AA-Forecast: (i) STAR Decomposition to automatically extract essential features such as anomalies, (ii) an Anomaly-Aware Model to leverage such extracted features, and (iii) a Dynamic Uncertainty Optimization to reduce the uncertainty of the network. The final predicted series contains confidence intervals with the least uncertainty

3.1 STAR decomposition

STAR decomposition is used as a strategy not only to extract the anomalies and sudden changes in the data but also to decompose a complex time series into its essential components. Unfortunately, widely popular decomposition methods such as STL (Cleveland et al. 1990) do not extract anomalies. Although recent works such as STR (Dokumentov and Hyndman 2020) and RobustSTL (Wen et al. 2019) are designed to be robust to the extreme effects of anomalies in their decomposition, they do not explicitly extract anomalies from the residual component.

To alleviate these issues, we propose STAR decomposition, which decomposes the original time series \({\textbf{x}}^{(k)}\) in a multiplicative manner into its seasonal (\({\textbf{s}}^{(k)}\)), trend (\({\textbf{t}}^{(k)}\)), anomaly (\({\textbf{a}}^{(k)}\)), and residual (\({\textbf{r}}^{(k)}\)) components:

$$\begin{aligned} {\textbf{x}}^{(k)} = {\textbf{s}}^{(k)} \times {\textbf{t}}^{(k)} \times {\textbf{a}}^{(k)} \times {\textbf{r}}^{(k)} \end{aligned}$$
(3)

Such decomposition is important because it increases the dimensionality of the original data and provides the model with automatically extracted anomalies. As shown in Fig. 1, we begin the decomposition by approximating the trend line \({\textbf{t}}^{(k)}\) with locally weighted scatterplot smoothing (LOESS; Cleveland 1979). Then, we divide the original data \({\textbf{x}}^{(k)}\) by the approximated trend line to derive the detrended time series.Footnote 1

We then partition the detrended time series into cyclic sub-series, where the cycle size is determined by the time interval of the dataset; for a monthly dataset, the cycle size would be 12 (one year per cycle). We obtain the seasonal component (\({\textbf{s}}^{(k)}\)) by grouping the detrended series by cycle position and averaging each position across the time series. Subsequently, the residual component (\({\textbf{r}}^{(k)}\)) is derived by dividing the original series by the seasonal and trend components.
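As a rough illustration, the trend/seasonal/residual steps above can be sketched as follows. This is a minimal sketch, assuming a positive-valued series, with LOESS from statsmodels and an arbitrarily chosen smoothing fraction (`frac=0.3` is an assumption, not a value from the paper):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def star_base_decompose(x, cycle=12, frac=0.3):
    """Multiplicative split of x into trend, seasonal, and residual components;
    the anomaly component is split off from the residual afterwards (Eqs. 4-5)."""
    t_idx = np.arange(len(x))
    trend = lowess(x, t_idx, frac=frac, return_sorted=False)  # LOESS trend line
    detrended = x / trend                                     # multiplicative detrend
    # seasonal component: average of the detrended series per cycle position
    season = np.array([detrended[i::cycle].mean() for i in range(cycle)])
    season = np.tile(season, len(x) // cycle + 1)[:len(x)]
    residual = x / (trend * season)                           # x = t * s * residual
    return trend, season, residual
```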

Note that the anomaly component (\({\textbf{a}}^{(k)}\)) can be considered as the oddities of the dataset that follow neither the extracted trend nor the seasonal component. Intuitively, the anomalies are spread through the residual component, which also contains noise and other real-world effects. To distinguish anomalies from the residuals, statistical measures such as the mean and variance are inappropriate, as they are highly sensitive to the severity of the anomaly values: severe anomalies would shift the mean and variance themselves, which is undesirable. To resolve this issue, we leverage the median of the residuals, which is robust to the severity of outliers in the residual component. We then define the robustness score \(\rho ^{(k)}_t\) for each observation at time t as:

$$\begin{aligned} \rho _t^{(k)} = \frac{|r_t^{(k)} - \dot{r}^{(k)}|}{\sqrt{\frac{\sum _{t=1}^{T}|r^{(k)}_t-\dot{r}^{(k)}|}{T-1}}} \end{aligned}$$
(4)

where \(\rho _t^{(k)}\) measures the strength of the anomaly, \(r_t^{(k)}\) is the residual at time step t, and \(\dot{r}^{(k)}\) is the median of the residuals.

Note that a larger \(\rho _t^{(k)}\) indicates that a drastic change has occurred relative to the trend and seasonal components. We then extract the anomalies from the residuals as follows:

$$\begin{aligned} {a}_t^{(k)}=\left\{ \begin{array}{ll} 1, &{} {\rho }_t^{(k)} < \rho _{c}^{(k)} \\ r_t^{(k)}, &{} \rho _t^{(k)} \ge \rho _{c}^{(k)} \end{array}\right. \end{aligned}$$
(5)

where \(\rho _{c}^{(k)}\) is a constant threshold given by the robustness score at the 0.05 quantile cutoffFootnote 2 when the elements of \(\rho ^{(k)}\) are ranked in descending order (i.e., the top 5% of scores are treated as anomalous).
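A minimal sketch of Eqs. (4)-(5), applied to the residual component from the decomposition above (the handling of ties at exactly \(\rho _{c}^{(k)}\) and the returned "cleaned" residual are assumptions):

```python
import numpy as np

def extract_anomalies(residual, p=0.05):
    """Split the residual into anomaly and cleaned residual via the robustness
    score, so that residual == anomaly * cleaned holds elementwise."""
    med = np.median(residual)                        # robust center (Eq. 4)
    dev = np.abs(residual - med)
    rho = dev / np.sqrt(dev.sum() / (len(residual) - 1))
    rho_c = np.quantile(rho, 1 - p)                  # top p fraction is anomalous
    anomaly = np.where(rho >= rho_c, residual, 1.0)  # Eq. (5): 1 means "normal"
    cleaned = np.where(rho >= rho_c, 1.0, residual)
    return anomaly, cleaned
```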

Notably, when the value of the anomaly component (\({\textbf{a}}^{(k)}\)) deviates further from 1, it indicates an abrupt change relative to the trend and seasonal components. On the contrary, when both the anomaly and residual values equal 1 (\({\textbf{r}}_{t}^{(k)}=1\) and \({\textbf{a}}_{t}^{(k)}=1\)), the observed signal at time t follows the trend and seasonal components exactly (no sign of anomalies). Note that such information might not be automatically inferred when additive decomposition methods are used, since the values of the residual component can differ from one dataset to another, requiring manual effort in detection.

A sample result of the decomposition is shown in Fig. 4, where the observed time series is decomposed into its seasonal, trend, anomaly, and residual components, respectively. Each of these components holds essential information about the characteristics of the time series and is leveraged to train the forecast model. To this end, we concatenate the decomposed components with the input, which includes the observed time series and its labeled extreme events. Specifically, STAR decomposition augments the original time series into \(\widetilde{{\textbf{x}}}^{{(k)}} = ({\textbf{x}}^{(k)},{\textbf{e}}^{(k)},{\textbf{s}}^{(k)},{\textbf{t}}^{(k)},{\textbf{a}}^{(k)},{\textbf{r}}^{(k)})\), which is leveraged by the anomaly-aware model described in the next section.

3.2 Anomaly-aware model

The anomaly-aware model is designed to explicitly incorporate the extracted anomalies \({\textbf{a}}^{(k)}\) and extreme event covariates \({\textbf{e}}^{(k)}\) into the prediction. As these features are rare across the whole time series, feeding them directly into a regular RNN such as an LSTM (Hochreiter and Schmidhuber 1997) risks their effect being ignored during training. Note that the extracted anomalies and previously experienced external events hold valuable information about the effect of extreme events and should be handled carefully.

Recent robust prediction models rely on LSTM or transformer architectures to provide robustness in their predictions. Although LSTMs are designed to capture long-term dependencies, their ability to pay different degrees of attention to sub-window features across long horizons is inadequate (Zaheer et al. 2020). For example, Khandelwal et al. (2018) showed that even though an LSTM can have an effective sequence size of 200 observations, it is only able to sharply distinguish the 50 closest observations. This indicates that even LSTMs struggle to capture long-term dependencies. On the other hand, conventional transformers suffer from quadratic computation and memory requirements, which limits their ability to process long input sequences.

Even though such memory bottlenecks have been reduced by sparse-attention algorithms (Li et al. 2019), their performance improvement over a full-attention mechanism is not significant on real-world datasets (Lim et al. 2019). Given that extreme events and anomalies are rare and can occur very far apart, it is computationally infeasible to increase the input sequence enough to attend to all previously seen anomalies and extreme events.

To address these problems, one must pay attention to all anomalies and extreme events throughout the dataset, no matter how long ago they occurred. Intuitively, due to their rare nature, they carry greater weight in learning, given that trend and seasonal patterns are often easy to predict with statistical or deep learning models. We therefore developed a novel attention mechanism explicitly for extreme events and anomalies, which constitute the crucial time steps of time series data and often cause the largest prediction errors.

Architecture design of the AA-model. LSTMs and GRUs are suitable for predicting recurring patterns with fairly low computational time and memory complexity, avoiding the quadratic complexity of full-attention transformers. We enhance the long-term dependencies of these models with an attention mechanism that retains the effect of anomalies and extreme events for future predictions. This architectural choice keeps the model computationally feasible for large-scale datasets while taking the critical moments of extreme events and anomalies into consideration.

Given the past \(\tau \) time steps of observations \(\tilde{{\textbf{x}}}_{{t-\tau +1:t}}\)Footnote 3, we derive the hidden states of an RNN that addresses the vanishing gradient problem (e.g., LSTM or GRU) as:

$$\begin{aligned} {\textbf{h}}_{{t-\tau +1:t}} =\text {RNN}\left( \tilde{{\textbf{x}}}_{{t-\tau +1:t}}\right) , \end{aligned}$$
(6)

where \({\textbf{h}}_{t}\) is the hidden state of the \(\text {RNN}\) at time step t. Note that we pay attention only to anomalies and extreme events, which are naturally rare and belong to a small subset of observations. Moreover, both can impact the prediction differently and, depending on the dataset, can be challenging to model. Hence, we design the attention mechanism to automatically incorporate extreme events and anomalies at the time of their occurrence:

$$\begin{aligned} J = \{ t \in {\mathbb {Z}}^{+} | e_{t} \ne 0 \vee a_{t}\ne 1\}, \end{aligned}$$
(7)

where J is the set of time steps covering two possible circumstances: the presence of extreme event covariates (\(e_{t} \ne 0\)) or anomalies (\(a_{t} \ne 1\)). We then gather the previous RNN hidden states at all critical time steps in J and score them through the attention layer to obtain \(v_{t}\):

$$\begin{aligned} v_{t} =\text {tanh}({\textbf{w}}_{\alpha }^{\top } {\textbf{h}}_{t}+b_{\alpha }), \quad \forall t \in J \end{aligned}$$
(8)

where \({\textbf{w}}_{\alpha }\) and \(b_{\alpha }\) are the attention layer's weight and bias. We then derive the attention weights over all critical time steps as:

$$\begin{aligned} {\alpha }_{t}={\text {Softmax}}\left( v_{1}, v_{2}, \ldots , v_{t}\right) , \quad \forall t \in J \end{aligned}$$
(9)

where \(\alpha _{t}\) is the attention weight at the critical time steps. The generated attention weights are then used in the AA-Forecast layer as:

$$\begin{aligned} {\mathcal {A}}_{t}=\left\{ \begin{array}{ll} {\textbf{h}}_{t}, &{}\forall t \not \in J\\ \sum _{j \in J} \alpha _{j} \cdot {\textbf{h}}_{j}, &{} \forall t \in J \end{array}\right. \end{aligned}$$
(10)

where the attention values are only computed in the presence of anomalies and extreme events, as shown in Fig. 2. The value of the next time step is then computed through a dense layer:

$$\begin{aligned} y_{t+1} = {\textbf{w}}_d ({\mathcal {A}}_{t-\tau +1:t})+b_d, \end{aligned}$$
(11)

where \({\textbf{w}}_d\) and \(b_d\) are the weights and biases of the dense layer. To train the network, we minimize the prediction loss \({\mathcal {L}}\) which is defined as follows:

$$\begin{aligned} {\varvec{\Phi }}^{*}_{\text {off}} = \text {argmin}_{{\varvec{\Phi }}} \, {\mathcal {L}}\left( {\mathcal {F}}_{\varvec{\Phi }}(\widetilde{{\textbf{x}}}),y\right) , \end{aligned}$$
(12)

where \({\mathcal {F}}_{\varvec{\Phi }}\) is the anomaly-aware model and y is the training label, i.e., the ground truth of the next time step. Note that \({\varvec{\Phi }}^{*}_{\text {off}}\) denotes the optimal model parameters after the offline training phase.
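A condensed PyTorch sketch of Eqs. (6)-(11) follows; the GRU cell, the hidden size, the batched tensor shapes, and the fallback for windows without any critical step are assumptions made to keep the example self-contained, not details fixed by the paper:

```python
import torch
import torch.nn as nn

class AAForecastLayer(nn.Module):
    """Anomaly-aware layer: RNN hidden states plus attention over critical steps."""

    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)  # Eq. (6)
        self.attn = nn.Linear(hidden, 1)   # w_alpha, b_alpha of Eq. (8)
        self.out = nn.Linear(hidden, 1)    # dense layer w_d, b_d of Eq. (11)

    def forward(self, x, e, a):
        # x: (B, tau, F) feature windows; e, a: (B, tau) event/anomaly channels
        h, _ = self.rnn(x)                             # hidden states h_t
        J = (e != 0) | (a != 1)                        # critical steps, Eq. (7)
        v = torch.tanh(self.attn(h)).squeeze(-1)       # scores v_t, Eq. (8)
        alpha = torch.softmax(v.masked_fill(~J, float("-inf")), dim=1)  # Eq. (9)
        context = (alpha.unsqueeze(-1) * h).nan_to_num().sum(dim=1)     # Eq. (10)
        # windows with no critical step fall back to the last hidden state
        A = torch.where(J.any(dim=1, keepdim=True), context, h[:, -1])
        return self.out(A)                             # next-step value, Eq. (11)
```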

Fig. 2: Left: AA-model architecture. Right: dynamic dropout \(\mu _{t}\) determines the optimal dropout probability at each time step in the online setting (i.e., inference). The output \({\hat{y}}\) consists of a distribution of predicted test values. The dropout optimization improves the certainty and accuracy at each time step t by determining how relevant the previous hidden state is for the next time step's prediction

3.3 Dynamic uncertainty optimization

Although the Monte Carlo (MC) dropout (Gal and Ghahramani 2016) probability is treated as a static hyperparameter in previous studies (Salinas et al. 2020; Laptev et al. 2017), it plays an important role in the prediction outcome and can be leveraged to reduce prediction uncertainty during the testing phase (Wahab et al. 2020). We therefore rely on an automatic mechanism that selects the optimal dropout probability in the online setting, based on the uncertainty of the predictions produced during testing (Fig. 2).

Note that the model's uncertainty should be as low and as stable as possible in real-world settings. It is therefore essential to further optimize the uncertainty of the model's predictions during both the offline training and the online testing phase. Specifically, we apply a dropout operation after every AA-Forecast layer with a specific probability (p).

AA-Forecast reports not only the prediction distribution, but also the point prediction (the mean of the distribution) and the prediction uncertainty (the variability of the distribution). Specifically, by producing M forecasts for every time step in an online manner (on test data \(\widetilde{{\textbf{x}}}^{*}\)) from the previously trained model \({\mathcal {F}}_{\varvec{\Phi }}(\widetilde{{\textbf{x}}})\), we obtain M samples \(\left\{ {y}_{(1)}^{*}, \ldots , {y}_{(M)}^{*}\right\} \) from the prediction distribution. The mean of the distribution is then calculated as \( \bar{{y}}^{*}=\frac{1}{M} \sum _{m=1}^{M} {y}^{*}_{(m)} \).

We represent uncertainty as the variability of the prediction distribution: the standard deviation (SD) of the probability distribution of future observations conditional on the information available at the time of forecasting. We further optimize the uncertainty of the framework by deriving the optimal dropout probability p at each time step, evaluating probabilities between 0 and 1 in increments of 0.1. Notably, without dropout (i.e., \(p=0\)) the model deviates from probabilistic forecasting and provides no level of uncertainty for its prediction at each time step. The optimal dropout probability \(\mu _{t}\) is the one that results in the minimal variability (i.e., SD) of the predicted values, thereby reducing the prediction uncertainty to its minimum during the testing phase. To this end, the prediction uncertainty is formulated as:

$$\begin{aligned} \sigma \left( {\mathcal {F}}_{\varvec{\Phi }}(\widetilde{{\textbf{x}}}^{*})\right) =\sqrt{\frac{1}{M} \sum _{m=1}^{M}\left( {y}_{(m)}^{*}-\bar{{y}}^{*}\right) ^{2}}. \end{aligned}$$
(13)
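A minimal sketch of this test-time selection follows; the grid of probabilities, the sample count M, and the `predict` callable are assumptions. `predict` stands for any forward pass of the trained model with dropout kept active at inference time (i.e., MC dropout):

```python
import numpy as np

def dynamic_dropout_step(predict, window, M=100):
    """Choose the dropout probability that minimizes the prediction SD (Eq. 13).

    predict : callable(window, p) -> one stochastic forecast sampled with
              dropout probability p active at inference time
    returns : (point forecast, uncertainty, selected p) for this time step
    """
    best = None
    for p in np.arange(0.1, 1.0, 0.1):                # candidate probabilities
        samples = np.array([predict(window, p) for _ in range(M)])
        mean, sd = samples.mean(), samples.std()      # distribution mean and SD
        if best is None or sd < best[1]:              # keep the least-uncertain p
            best = (mean, sd, p)
    return best
```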
Algorithm 1: Pseudocode for AA-Forecast

Algorithm 1 presents the pseudocode for AA-Forecast. Specifically, we sample (\(\tilde{{\textbf{x}}}^{k},y^{k}\)) as a driving example, which includes the extracted anomalies \({\textbf{a}}^{(k)}\) and extreme events \({\textbf{e}}^{(k)}\). Next, we train the model by maximizing the overall prediction accuracy. At test time, the network leverages dynamic uncertainty optimization to further reduce the prediction uncertainty automatically in the online setting, without requiring any further training.

Note that the network's predictions during the testing phase cannot benefit from supervised training. However, controlling the variability is still possible and ensures that the prediction uncertainty is minimal at each step of future predictions, regardless of whether labels are provided. Additionally, since dynamic uncertainty optimization is used solely during the test phase, the algorithm's testing time complexity is similar to that of other RNN-based models. This allows the model to provide the least amount of uncertainty in the presence of anomalies or extreme events, where critical online decisions are made.

Fig. 3: Effects of dynamic uncertainty optimization on prediction error and uncertainty during the occurrence of an anomaly. The method automatically selects the optimal probability that yields the lowest uncertainty

As an example, Fig. 3 shows that the optimal result occurs where the standard deviation is lowest. Intuitively, at \(p=0.5\) the network shows the highest confidence in its prediction (i.e., the lowest uncertainty), as unnecessary neurons are dropped out of the network. The network therefore automatically selects and reports \(p=0.5\) as the best choice for this time step in the testing phase.

4 Experiments

This section reports multiple experiments comparing the proposed AA-Forecast framework with baseline models using different types of large-scale time series datasets.

4.1 Dataset and experimental settings

Three real-world time series datasets with diverse structures and domains are gathered (Fig. 4).Footnote 4 Table 1 provides descriptive statistics, and detailed descriptions follow:

  • We gathered a new spatio-temporal benchmark dataset (Hurricane) suited for forecasting during extreme events and anomalies. The dataset is provided by the Florida Department of Revenue and contains the monthly sales revenue (2003-2020) of the tourism industry for all 67 counties of Florida, which are prone to annual hurricanes. We aligned and joined the raw time series with the history of hurricane categories by time for each county. More precisely, the hurricane category indicates the maximum sustained wind speed, which can result in catastrophic damage (Oceanic 2022).

  • The second dataset (COVID-19) tracks the changes in the number of employees, based on one million employees active in the US during the COVID-19 pandemic, and is gathered from Homebase (Bartik et al. 2020). We further enriched the data with state-level policies as indicators of extreme events (e.g., a state's business closure order).

  • The third dataset (Electricity) is a publicly available benchmark containing the hourly electricity consumption of 370 consumers from 2011 to 2014. Note that this benchmark is anonymized and does not contain extreme event labels, yet AA-Forecast is able to automatically extract the anomalies indicating abrupt changes in trend and seasonality.

Fig. 4: The results of STAR Decomposition for three samples. The Hurricane dataset sample is taken from Collier County, Florida, USA, and the values are normalized in USD. The COVID-19 sample is taken from the state of Florida, and the values show the percentage change from the beginning of the pandemic. The Electricity sample is from MT-200, and values are in kW

Table 1 Descriptive statistics of the datasets

We propose two sets of experiments for all baseline models. The first experiment follows a standard 80-20 split of the dataset into training and testing sets, with window length \(\tau =12\). The second experiment evaluates the zero-shot prediction capability of the models over a window-length search range of {3, 6, 12, 24}; it is more applicable to real-world settings in which the model cannot be retrained on newly added time series. Hence, the second experiment evaluates the prediction accuracy of all models on a set of completely unseen time series.

The models are implemented in Python 3.7 and tested on a cloud workstation with two Intel Xeon 2.3 GHz CPUs, 64 GB RAM, and one Nvidia Tesla A100 GPU. We conduct a grid search over all tunable hyperparameters on a held-out validation set for both the baseline methods and our framework. The hyperparameters for each dataset are shown in Table 2. To provide a fair evaluation, all baseline models benefit from the essential features extracted by AA-Forecast, except the ARIMA model, which cannot use multidimensional features. Moreover, future known information is not included in any of the models.

The training times of AA-Forecast on all three datasets are reported in Table 3. We limited training to 40 iterations for all experiments. The reported values are errors averaged over five runs of the test stage. The hyperparameters of all baseline methods are tuned via grid search.

Table 2 Hyperparameters of AA-Forecast used for each dataset

4.2 Methods for comparison

The baseline methods for comparison include:

  • ARIMA (Box and Pierce 1970): A traditional autoregressive integrated moving average method for time series prediction, often used as a baseline.

  • AE-LSTM (Sagheer and Kotb 2019): An LSTM network that uses an autoencoder for deep feature extraction and provides a deterministic prediction.

  • SARIMAX (Tarsitano and Amerise 2017): An autoregressive model that can handle seasonality and exogenous features of time series.

  • UberNN (Zhu and Laptev 2017): An LSTM-based model that uses Monte Carlo dropout to provide uncertainty and is able to extract deep features of time series through autoencoders.

  • TSE-SC (Cai et al. 2020): A recently proposed transformer-based deep learning model that can forecast abrupt changes accurately.

  • AA-Forecast (LSTM) is our proposed model with LSTM cells.

  • AA-Forecast (GRU) is our proposed model with GRU cells.

Table 3 Runtime of the methods used in the study

4.3 Metrics

For a comprehensive evaluation, we adopt three metrics. The first is the Continuous Ranked Probability Score (CRPS), which evaluates probabilistic forecasts and is formally defined as \(\textbf{CRPS} = \int _{-\infty }^{\infty }(F(y)-\mathbb {1}(y-{\hat{y}}))^{2} \mathrm {~d}y\), where F is the cumulative distribution function of the forecast distribution and \(\mathbb {1}\) is the Heaviside step function. We also report the root mean square error, \(\textbf{RMSE}=\sqrt{\frac{1}{N} \sum _{i=1}^{N}\left( y_{t,(i)}-{\hat{y}}_{t,(i)}\right) ^{2}}\), where \(y_{t}\) is the mean of the predicted distribution at time t and \({\hat{y}}_{t}\) is the observed value at time t. The third metric is the standard deviation, \(\textbf{SD} = \sqrt{\frac{1}{N} \sum _{i=1}^{N}\left( y_{t,(i)}-\bar{y}_{t}\right) ^{2}}\), where \({\bar{y}}_{t}\) is the mean of the predicted distribution; the SD correlates with the uncertainty of the prediction.
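For an empirical forecast distribution of M samples per time step, these metrics can be computed as in the following sketch; the CRPS uses the standard sample-based estimator \(\textrm{E}|Y-{\hat{y}}| - \frac{1}{2}\textrm{E}|Y-Y'|\), which matches the integral definition when F is the empirical distribution of the samples:

```python
import numpy as np

def crps(samples, y_obs):
    """Sample-based CRPS for one time step."""
    s = np.asarray(samples, dtype=float)
    return np.abs(s - y_obs).mean() - 0.5 * np.abs(s[:, None] - s[None, :]).mean()

def rmse(pred_means, observed):
    """RMSE between predicted distribution means and observed values."""
    d = np.asarray(pred_means) - np.asarray(observed)
    return np.sqrt(np.mean(d ** 2))

def sd(samples):
    """Uncertainty: standard deviation of the predicted distribution."""
    return np.asarray(samples, dtype=float).std()
```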

4.4 Experimental results

We provide two comprehensive comparisons and evaluations of the proposed AA-Forecast framework: the aforementioned 80-20 testing, where 20% of the data is unseen, and zero-shot prediction testing, where whole time series are unseen. In both cases, we report CRPS, RMSE, and SD. Lastly, we provide an ablation study to discuss the effectiveness of the different AA-Forecast components.

The \({\varvec{80-20}}\) testing. We first used the chronologically earlier ('older') 80% of each time series for training and tested the prediction accuracy on the remaining 20%. Table 4 reports the losses of the networks under this \(80-20\) testing, where the SD of the AA-Forecast (GRU) method is lower than that of all baseline methods, showing the model's high confidence in its forecasts.

Table 4 Performance comparison of our proposed framework and baseline models under \(80-20\) testing

Among the baseline methods, UberNN and TSE-SC show good accuracy but suffer from higher SD (uncertainty) compared to the AA-Forecast (LSTM/GRU) models. Given that the extracted features are available to all baseline methods, we attribute their higher uncertainty to the static dropout probability that remains constant across all time steps. The two proposed models, AA-Forecast (LSTM/GRU), consistently outperform the state-of-the-art methods. Considering all three evaluation metrics, AA-Forecast (GRU) is the best-suited framework for our datasets, as it provides higher accuracy and confidence.

Zero-shot prediction: Table 5 demonstrates the zero-shot prediction abilities of the selected models. Both AA-Forecast (LSTM/GRU) predictions generally follow the observed time series. The prediction errors are comparably low in the presence of extreme events (i.e., hurricanes), mainly due to the anomaly attention mechanism developed to further reduce prediction error during extreme events. Moreover, the anomalies extracted by STAR decomposition allow the model to recall hurricane effects on previously seen regions, yielding lower-error predictions for unseen time series in the presence of anomalies. Figure 5 showcases a sample of these predictions for each model, where the prediction uncertainty is minimal for every time step.

Table 5 Performance comparisons of zero-shot prediction abilities of models using ten randomly selected counties’ sales tax data where they have not been used in training entirely

Even though the network did not train on the selected time series directly, it is able to transfer its knowledge from previously seen extreme events (e.g., the effect of Category 4 hurricanes) and provide more accurate predictions than models without this ability.

4.5 Ablation study

In this section, we provide an extensive analysis of the performance of AA-Forecast and the impact of each of its components. The results are shown in Table 6, where we remove each component and report the changes in accuracy and uncertainty.

Fig. 5: Zero-shot prediction for hotel tax sales of Collier County, Florida, U.S. Both variations of AA-Forecast are concatenated for demonstration

Table 6 Ablation study on the AA-Forecast (GRU) model using the sales tax dataset to show the effectiveness of its components

Influence of anomaly-aware decomposition: To demonstrate that the anomaly-aware decomposition aids time series prediction, we fed the input series into the prediction model directly. This modification resulted in the worst performance in our ablation study. Note that AA-Forecast (GRU) still benefits from dynamic dropout optimization and extreme event labels, and the predicted uncertainty is still optimized. However, the accuracy of the AA-Forecast (GRU) prediction drops because of the limited number of features, indicating that the neural network alone does not have a strong ability to capture complex and nonlinear information. This highlights the role of auxiliary features such as decomposed anomalies and extreme events in forecasting.

Influence of uncertainty optimization: We also used a static dropout at every time step throughout the experiments, which caused a substantial increase in SD. Uncertainty optimization of the dropout plays a critical role in reducing the uncertainty of the forecast intervals. This modification also caused a higher forecast error, reflecting the model's inability to forecast with higher confidence.

Influence of anomaly attention: We conducted experiments to demonstrate the effectiveness of anomaly awareness through the network's attention mechanism. Specifically, we fed the extreme events and anomalies directly, without the anomaly-attention mechanism described in Sect. 3.2. This change limits AA-Forecast's knowledge about hurricanes and the severity of their effects. As shown in Fig. 6 (right), the network's performance weakens at harder-to-predict time points (anomalies and extreme events).

Thus, removing the attention mechanism for the anomalous and extreme-event points of the dataset reduces the performance of the model during the critical months of extreme events such as hurricanes. Simply relying on the previously seen dataset does not allow the network to handle external events and sudden changes effectively.

4.6 Discussion

Fig. 6: Influence of anomaly attention on hurricanes. Two Category 4 hurricanes (Wilma and Irma) caused similar annual sales losses. Anomaly-attention activation occurs during the presence of extreme events, which makes it computationally efficient compared to the full-attention mechanism in transformers

Interpretation: The benefits of providing optimal uncertainty in prediction are twofold: first, it offers a systematic way to aid resource allocation; second, it better prepares the domain for interventions. For example, if one region receives more catastrophic extreme events, resources can be transferred to that region, and governments and industries can make better-informed interventions and decisions (e.g., financial aid relief during COVID-19). As shown in the ablation study, including additional features such as extreme events and anomalous points improves accuracy and better primes the model to handle predictions that deviate from trend or seasonality. Moreover, as shown in Fig. 6, without proper attention these points produce large forecasting errors. Given that such critical moments matter most during extreme events such as natural disasters, attending to them improves the model's performance at critical time steps. Hence, it is essential to design the learning objective of time series models not only to improve overall performance but also to take critical moments into greater consideration. Furthermore, allowing the model to report its level of uncertainty establishes transparency and builds trust with users. Table 3 also showcases the runtimes of the methods used in the study. Although the traditional methods are less accurate than the deep learning methods, they still have better runtime efficiency. However, they contain few learnable parameters, which limits their capacity, and they cannot share information across regions for various time series.

Anomalies: As shown in Fig. 4, for the Hurricane case study, the anomalies begin with the losses of the early-2000s Atlantic hurricane seasons. Interestingly, Hurricane Irma in 2017 caused catastrophic damage ($77.16 billion) similar to the early-2000s seasons (Oceanic 2022), which allows the model to predict with higher accuracy when trained on the previously seen effects of these anomalies. Similarly, for the COVID-19 dataset, the anomalies begin with the drastic changes at the lockdown orders, which caused a great loss in the percentage of employment. These anomalies for each state play a critical role in preparing for future pandemics, so that enough resources can be allocated to combat the losses (Selerio and Maglasang 2021). In the Electricity case study, the larger anomaly values need to be handled carefully, given that such points of high electricity load can lead to unplanned generation plant outages (Grace and Christiansen 2013).

Limitations & future directions: Although the dynamic dropout mechanism guarantees the least uncertainty in predictions, it cannot guarantee the same for prediction accuracy. This is due to the random nature of dropout; we leave for future work a variant in which dropout is applied over a predetermined distribution of neurons. In general, maximizing the useful information available to the multidimensional model serves time series prediction during extreme events; when such information is not available, it is more reasonable to use methods that extract potential critical time steps such as anomalous points (e.g., STAR decomposition).

5 Related works

Anomalies in time series data often produce a high variance of prediction uncertainty that is difficult to predict, thus posing a challenge for reliable model design (Zhu and Laptev 2017; Pang et al. 2017). To provide a more reliable forecast in the presence of anomalies, probabilistic forecasting methods, which can report a level of uncertainty, are often studied (Li et al. 2019).

Most Bayesian neural networks for probabilistic forecasting require specific training and optimization methods as well as additional model parameters, resulting in a larger amount of computation. Hence, MC dropout is preferred for its practicality as an out-of-the-box solution (Zhu and Laptev 2017). Applying standard dropout to Bayesian neural networks often results in poor performance because dropout noise prevents the network from maintaining long-term memory (Labach et al. 2019). Gal and Ghahramani (2016) proposed MC dropout, in which dropout can be interpreted as a sampling method equivalent to a variational approximation of a deep Gaussian process. MC dropout applied to recurrent layers has proved successful and is commonly used in practice by applying dropout to recurrent connections in a way that preserves long-term memory (Labach et al. 2019). Previous studies used a static MC dropout throughout their experiments, which limits the model's robustness to the effects of anomalies. Given that probabilistic models still require high overall forecast accuracy, optimizing the uncertainty of prediction intervals remains a challenging question.

6 Conclusion

We propose an anomaly-aware time series prediction framework, AA-Forecast, to capture and leverage the effect of extreme events and anomalies in the time series prediction task. It features a novel anomaly decomposition method that also extracts the essential features of the data. We also propose an anomaly-aware model that leverages the extracted anomalies through an attention mechanism. Moreover, we reduce the uncertainty of the network without any further training, so that the prediction uncertainty is minimal throughout the testing stage. We compare our framework with several statistical and deep learning models on three real-world time series datasets. The results show that AA-Forecast outperforms these models in both prediction error and uncertainty. For future work, prediction performance could be further improved by targeting specific groups of neurons (e.g., those containing unnecessary details of the time series dynamics) for dynamic dropout optimization.