Introduction

The COVID-19 pandemic has caused unprecedented volatility in oil prices, making energy risk management one of the primary concerns for regulators and market participants. Oil price fluctuations have a significant impact not only on the real economy (e.g., Hamilton 1983, 2003), but also on financial markets such as equity (e.g., Liu et al. 2020; Hashmi et al. 2021) and currency (e.g., Dai et al. 2020; Ding et al. 2021) with a strong tail contagion effect (Le et al. 2021). The growing integration of the oil and financial markets, particularly the large fluctuations in crude oil prices during a crisis, has necessitated the development of robust oil price risk forecasting.

Value-at-risk (VaR) has been one of the most widely used risk measures in the financial industry since its adoption by the Basel Committee on Banking Supervision in 1996. VaR is defined as the portfolio’s expected maximum loss over a given time horizon and with a given level of confidence. Despite its simplicity, VaR is a useful tool for monitoring risk exposure, adjusting positions to avoid large losses, and allocating capital to minimize income volatility while maximizing return on investment. Therefore, accurate VaR forecasting is critical for internal risk management and financial stability. However, during the early 2020 financial market turmoil, many financial institutions recorded excessive exceptions, i.e., the trading book losses exceeded the VaR forecasts (Risk.net 2020). This raises questions about the accuracy of the VaR models.

VaR was introduced for oil risk quantification in the 2000s amid a volatile oil price environment (Sadeghi and Shavvalpour 2006). Oil price returns exhibit characteristics similar to those of other financial assets, such as skewness, leptokurtosis, and volatility clustering (Giot and Laurent 2003). This has motivated prior studies to forecast oil VaR using a two-step volatility filtering process. To capture the volatility dynamics, returns are first filtered by the generalized autoregressive conditional heteroskedasticity (GARCH) model introduced by Engle (1982) and Bollerslev (1986); the standardized residuals, i.e., the innovations, are then modeled by a specified distribution for quantile estimation. GARCH-type models with normal innovations can generate data with unconditionally fat tails, but these tails are not fat enough to account for all of the unconditional leptokurtosis observed in return series (Kuang 2021). Alternative innovation distributions have been explored in oil VaR forecasts, including historical simulation (e.g., Costello et al. 2008) and parametric distributions such as generalized error (Fan et al. 2008), heavy-tailed (Hung et al. 2008), skewed Student’s t (Giot and Laurent 2003) and other flexible skewed distributions (Lyu et al. 2017), generalized Pareto (Marimoutou et al. 2009), and Pearson’s type IV and Johnson’s \(S_U\) distributions (Patra 2021). According to these studies, choosing the appropriate distribution for the innovation process is critical because it affects the accuracy of the required quantile estimates. The best-performing models differ depending on the models compared and the time periods investigated. Overall, flexible distributions are preferred, with those that capture both skewness and fat tails outperforming normal and non-skewed distributions.

Building on the research on crude oil volatility modeling and forecasting, another strand of literature investigates variants of GARCH-type models to address stylized facts of oil volatility for more accurate VaR forecasts. For example, volatility persistence is observed in the crude oil price (e.g., Choi and Hammoudeh 2009; Kang et al. 2009). This implies that price fluctuations have long-term effects on volatility. Moreover, crude oil futures have a leverage effect (e.g., Nomikos and Andriosopoulos 2012; Kristoufek 2014). This suggests that downward movements (shocks) in the crude oil market are accompanied by greater volatility than similar upward movements. Wei et al. (2010) show that non-linear specifications of GARCH models, allowing for long memory as well as an asymmetric leverage effect in volatility, produce greater forecasting accuracy than the linear ones, especially for long-run volatility forecasting. Cheong (2009), on the other hand, reveals that while both the estimation and diagnostic evaluations favor an asymmetric long-memory autoregressive conditional heteroskedasticity (ARCH) model, the simplest parsimonious GARCH can produce better out-of-sample forecasting performance. Despite this, models that account for long-range memory and asymmetry in the volatility process, as well as fat tails, have been shown to outperform others in predicting oil VaR (e.g., Aloui and Mabrouk 2010; Chkili et al. 2014; Youssef et al. 2015). GARCH-type models also suffer from some limitations. For example, Lux et al. (2016) find that Markov-switching multifractal (MSM) models that capture the multiscaling, long memory, and structural breaks of oil price volatility outperform a battery of GARCH-type models in VaR forecasts. Moreover, GARCH-type models commonly use the squares of past daily returns, which are unbiased but inefficient estimators of volatility because they use only one price measure per period. Liu and Wan (2012) find that the intraday returns-based realized volatility model outperforms the GARCH-type models in forecasting oil price volatility. However, the MSM model is more complex than GARCH-type models, and managing high-frequency data requires more time and computational resources.

Prior studies have mainly focused on VaR forecasts. However, VaR provides no information about potential losses beyond the quantile, which is problematic when exceptions occur during a crisis. Nor does it satisfy the “subadditivity” requirement, i.e., that the total risk of a portfolio should be less than or equal to the sum of the risks of the individual portfolio assets. The expected shortfall (ES) has recently received growing attention as an alternative measure due to the upcoming implementation of the “Fundamental review of the trading book,” a revised market risk framework proposed by the Basel Committee on Banking Supervision (BCBS 2016). ES is the conditional expectation of losses beyond a given VaR threshold. Although ES is a coherent risk measure (Artzner et al. 1999), it is not elicitable, and elicitability is a desirable property for evaluating point forecasts (Gneiting 2011). This has prompted research into the development of ES backtesting (e.g., McNeil and Frey 2000; Fissler and Ziegel 2016; Bayer and Dimitriadis 2020).

The challenge of oil price risk management, combined with impending regulatory changes, prompted this study to look into alternative approaches for more accurate oil tail-risk forecasting without increasing model complexity. Specifically, this paper extends previous literature on oil VaR by including ES analysis for both long and short oil positions over the last decade, which covers the 2008 global financial crisis, the 2014 oil crash, and the recent COVID-19 pandemic. It considers three GARCH-type models that were previously used for oil volatility forecasting: the standard GARCH (Bollerslev 1986), the Glosten-Jagannathan-Runkle GARCH (GJRGARCH) model of Glosten et al. (1993), and the fractionally integrated GARCH model (FIGARCH) of Baillie et al. (1996) with four innovation distributions: normal, skewed t, historical simulation, and the Cornish–Fisher expansion (CF). The VaR and ES forecasts are compared, and their performance is evaluated using traditional risk management criteria and the recently developed ES backtesting techniques. In particular, the paper contributes to the existing literature in three dimensions.

First, this research sheds light on oil ES forecasts in addition to VaR. It demonstrates the importance of using the ES forecast as a risk-control supplement to VaR. Tail-risk management is critical, as market participants are concerned not only about the loss associated with a certain level of confidence but also about the magnitude of the loss that could exceed the threshold. Moreover, it illustrates the impact of shifting from VaR to ES as a risk measure, which provides insights into the performance of various models as part of the transition mandated by the financial regulators.

Second, this paper extends the GARCH model combined with the Cornish–Fisher expansion (GARCH-CF) for forecasting ES. This approach was proposed by Alexander et al. (2013) to forecast the VaR of equity, foreign exchange, and interest rates, and has recently been explored to forecast the VaR of oil (Kuang 2022). It is based on the modified VaR of Favre and Galeano (2002) but incorporates the GARCH model to capture volatility dynamics, in addition to using Cornish–Fisher to address the skewness and excess kurtosis of innovation. To the best of our knowledge, no research has been conducted on this approach for ES forecasting.

Third, to evaluate the oil ES forecast, this paper employs the novel regression-based ES test of Bayer and Dimitriadis (2020), which requires only the ES forecast as an input variable. This is in contrast to Youssef et al. (2015), who use the exceedance residuals test of McNeil and Frey (2000), which depends on both VaR and ES forecasts. Furthermore, we use the joint loss function for VaR and ES proposed by Fissler and Ziegel (2016) to compare the forecast accuracy of different models.

We find that the proposed GARCH-CF model is superior in forecasting oil ES for long positions. The GJRGARCH model with skewed t provides the most accurate joint VaR and ES forecasts. The relative magnitude of the ES to VaR forecast varies across models and over time. Switching from VaR to ES as a risk measure increases risk prediction to varying degrees, with the GARCH-CF model being the least impacted.

The remainder of the article is organized as follows. The “Methodology” section introduces the models and evaluation methods. The “Empirical analysis” section describes the data and discusses the empirical results. The “Conclusion” section concludes.

Methodology

VaR has been the standard method for quantifying and managing market risk since its adoption by the Basel Committee on Banking Supervision in 1996. VaR is defined as the maximum loss that may be incurred by a portfolio over a given time horizon h at a certain level of confidence p, i.e., \(\text{ Pr }\left( r_{t+h} \le \text{ VaR}_{t+h}^{\alpha }\right) =\alpha,\) where \(\alpha =1-p\) and \(r_{t+h}\) is the asset’s return over the period h. ES is defined as the expected loss when losses have exceeded VaR. Following the 2008 financial crisis, the Basel Committee on Banking Supervision issued the “Fundamental review of the trading book” (FRTB) (BCBS 2016) to revise the approach to calculate risk-based capital requirements for trading activities. One of the significant changes is the transition from the VaR with a confidence level of \(p=99\%\) to the ES with a confidence level of \(p=97.5\%\) as an alternative measure to quantify market risk.

Suppose the daily log returns of an asset follow the process:

$$\begin{aligned} r_t = \mu _{t} + \varepsilon _{t} = \mu _{t} + \sigma _{t}z_{t}, \qquad \text{ with } \quad z_t \sim i.i.d(0,1) \end{aligned},$$
(1)

where \(r_t = p(t)-p(t-1)\) and p(t) is the logarithmic asset price observed at day t. The day-ahead VaR and ES are defined as follows (Brio et al. 2020):

$$\begin{aligned} \text{ VaR}_{t}^{\alpha }= & {} \hat{\mu }_{t} + \hat{\sigma }_{t}\hat{F}^{-1}_z(\alpha ) \end{aligned}$$
(2)
$$\begin{aligned} \text{ ES}_{t}^{\alpha }= & {} \hat{\mu }_{t} + \hat{\sigma }_{t}\hat{S}_z(\alpha ) \end{aligned},$$
(3)

where \(\hat{\mu }_{t}\) and \(\hat{\sigma }_{t}\) are the one-day-ahead conditional mean and conditional volatility forecasts, respectively; \(\hat{F}^{-1}_z(\alpha )\) is the \(\alpha\) quantile of the innovation, i.e., \(z_t = \left( r_t-\mu _t\right) /\sigma _t\); and \(\hat{S}_z(\alpha ) = \mathbb {E}\left[ z|z< \hat{F}^{-1}_z(\alpha )\right]\). The ES is generally computed by numerical integration, although closed-form solutions exist in some cases.

Forecasting VaR and ES requires the conditional mean, the conditional volatility, and the quantile estimate of the innovation distribution. Conditional mean returns are difficult to forecast using available information and forecasting techniques, as daily returns are heavily influenced by specific news and events that a model often does not take into account (Christoffersen and Diebold 2006). The conditional volatility, on the other hand, is time-varying and highly predictable (Andersen and Bollerslev 1998). Empirical research on volatility modeling has exploded since the introduction of the ARCH model by Engle (1982) to capture the volatility clustering effect, i.e., the tendency for large changes in asset prices to follow large changes and small changes to follow small changes. The ARCH model was generalized into the GARCH model by Bollerslev (1986), and various extensions have been proposed to capture other stylized facts of financial asset returns. In this study, we use the standard GARCH model and two of its extensions, the GJRGARCH and FIGARCH models, to model and forecast the volatility of oil markets based on prior literature (e.g., Kang et al. 2009; Charles and Darné 2017). The details of the volatility models are discussed in the “Volatility models” section.

Although GARCH-type models with normal innovations can generate data with unconditionally fat tails, they are insufficient to account for all of the unconditional leptokurtosis and skewness observed in oil returns series (e.g., Lux et al. 2016; Patra 2021). As a result, in addition to the normal distribution, this study compares the non-parametric historical simulation (e.g., Westgaard et al. 2019), parametric skewed t distribution (e.g., Giot and Laurent 2003), and semi-parametric Cornish–Fisher expansion (Cornish and Fisher 1937) approaches to address the non-normality of return innovation for both VaR and ES forecasts. The details of innovation distributions are covered in “Innovation distributions” section.

Volatility models

Standard GARCH

Building on the ARCH model introduced by Engle (1982), Bollerslev (1986) proposed the GARCH model to address volatility clustering. The conditional volatility is expressed as a function of past returns and past conditional volatility forecasts. The GARCH(1,1) specification is defined as follows:

$$\begin{aligned} \sigma _t^2 = \omega + \alpha \varepsilon _{t-1}^2 + \beta \sigma ^2_{t-1} \end{aligned},$$
(4)

where \(\omega , \alpha , \beta > 0\) and \(\alpha + \beta <1\) ensure that the conditional variance \(\sigma _t^2\) is positive and stationary. The decay rate of the positive autocorrelation in the volatility process is governed by \(\alpha +\beta\): the closer \(\alpha +\beta\) is to 1, the slower the decay of the autocorrelation of \(\sigma _t\).
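The recursion in Eq. (4) can be sketched in a few lines of Python. This is only an illustration, not the estimation code used in the paper: the residuals and parameter values are hypothetical, and starting the recursion at the unconditional variance \(\omega /(1-\alpha -\beta )\) is an assumption made here for convenience.

```python
def garch11_variance(eps, omega, alpha, beta):
    """Conditional variances from the GARCH(1,1) recursion in Eq. (4).

    Returns len(eps) + 1 values; the last one is the one-step-ahead forecast.
    """
    if not (omega > 0 and alpha > 0 and beta > 0 and alpha + beta < 1):
        raise ValueError("need omega, alpha, beta > 0 and alpha + beta < 1")
    # Assumption: initialise at the unconditional variance omega / (1 - alpha - beta).
    sigma2 = [omega / (1.0 - alpha - beta)]
    for e in eps:
        sigma2.append(omega + alpha * e * e + beta * sigma2[-1])
    return sigma2


# Hypothetical residuals (in percent) and parameters, for illustration only.
eps = [0.5, -2.0, 0.3, 1.1]
sigma2 = garch11_variance(eps, omega=0.05, alpha=0.08, beta=0.90)
forecast_vol = sigma2[-1] ** 0.5  # one-day-ahead conditional volatility
```

With \(\alpha +\beta =0.98\), the variance reverts only slowly to its long-run level of 2.5, which is the persistence described above.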

GJRGARCH

Glosten et al. (1993) developed the GJRGARCH model, which uses the indicator function I to let positive and negative shocks affect the conditional variance asymmetrically. The GJRGARCH has been used to address the asymmetric leverage effect in oil volatility (e.g., Charles and Darné 2017; Rizvi and Itani 2021). The model is defined as follows (Wei et al. 2010):

$$\begin{aligned} \sigma _{t}^{2}=\omega +\left[ \alpha +\gamma I\left( \varepsilon _{t-1}<0\right) \right] \varepsilon _{t-1}^{2}+\beta \sigma _{t-1}^{2} \end{aligned},$$
(5)

where \(I\left( \cdot \right)\) takes the value of one if the condition in the parentheses is satisfied and zero otherwise; \(\gamma\) represents the “leverage” term, i.e., for \(\gamma >0\) the impact of past negative returns on conditional variance is greater than the impact of past positive returns.
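A minimal sketch of the recursion in Eq. (5) makes the asymmetry concrete. The parameter values and the starting variance `sigma2_0` are hypothetical and chosen only for illustration.

```python
def gjr_garch11_variance(eps, omega, alpha, gamma, beta, sigma2_0):
    """Conditional variances from the GJR-GARCH(1,1) recursion in Eq. (5)."""
    sigma2 = [sigma2_0]
    for e in eps:
        indicator = 1.0 if e < 0 else 0.0  # I(eps_{t-1} < 0)
        sigma2.append(omega + (alpha + gamma * indicator) * e * e + beta * sigma2[-1])
    return sigma2


# Hypothetical parameters: the same-sized shock with opposite signs.
params = dict(omega=0.05, alpha=0.03, gamma=0.10, beta=0.90, sigma2_0=2.0)
after_loss = gjr_garch11_variance([-2.0], **params)[-1]
after_gain = gjr_garch11_variance([2.0], **params)[-1]
```

Here the next-day variance after the negative shock exceeds the one after the positive shock by \(\gamma \varepsilon _{t-1}^{2}=0.4\).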

FIGARCH

Baillie et al. (1996) introduced a FIGARCH model that allows for the long-memory feature of volatility. In the FIGARCH model, shocks decay at a slower hyperbolic rate, as opposed to the GARCH model, where shocks decay at an exponential rate, or the integrated GARCH model, where shocks persist forever. Kang et al. (2009) demonstrate that the FIGARCH model is useful for modeling persistence in crude oil price volatility and outperforms the GARCH and the integrated GARCH models in volatility forecasting. The model is defined as follows (Kang et al. 2009):

$$\begin{aligned} \phi (L)(1-L)^{d} \varepsilon _{t}^{2}=\omega +\left[ 1-\beta (L)\right] \left( \varepsilon _{t}^{2}-\sigma ^2_{t}\right) \end{aligned}$$
(6)

or

$$\begin{aligned} \sigma _{t}^{2}=\omega [1-\beta (L)]^{-1}+\left\{ 1-[1-\beta (L)]^{-1} \phi (L)(1-L)^{d}\right\} \varepsilon _{t}^{2} \end{aligned},$$
(7)

where L denotes the lag operator; \(\phi (L)=[1-\alpha (L)-\beta (L)](1-L)^{-1}\); \(0\le d \le 1\) is the fractional difference parameter that controls the degree of long memory. The FIGARCH model is flexible to allow for a wide range of persistence via the parameter d, and accommodates the standard GARCH model with \(d=0\) and the integrated GARCH model with \(d=1\) as special cases (Kang et al. 2009). The estimation of the FIGARCH model requires a minimum number of observations. According to the standard procedure in the literature (e.g., Beine et al. 2002; Beine and Laurent 2003), the truncation order of the infinite \((1-L)^{d}\) is set to 1000 lags:

$$\begin{aligned} (1-L)^{d}=\sum _{k=0}^{1000} \frac{\Gamma (k-d)}{\Gamma (k+1) \Gamma (-d)} L^{k} \end{aligned}$$
(8)
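The truncated expansion of \((1-L)^{d}=\sum _{k}\pi _{k}L^{k}\) can be computed without evaluating Gamma functions directly, which matters because \(\Gamma (k+1)\) overflows double precision for k beyond roughly 170. The sketch below uses the ratio \(\pi _{k}/\pi _{k-1}=(k-1-d)/k\), which follows from \(\Gamma (x+1)=x\Gamma (x)\). The function name and the value of d are illustrative assumptions.

```python
def frac_diff_weights(d, n_lags=1000):
    """Coefficients pi_k of (1 - L)^d = sum_k pi_k L^k, truncated at n_lags."""
    weights = [1.0]  # pi_0 = 1
    for k in range(1, n_lags + 1):
        # pi_k = Gamma(k - d) / (Gamma(k + 1) Gamma(-d)) = pi_{k-1} * (k - 1 - d) / k
        weights.append(weights[-1] * (k - 1 - d) / k)
    return weights


w = frac_diff_weights(d=0.4)        # hypothetical long-memory parameter
w_integrated = frac_diff_weights(d=1.0, n_lags=5)
```

For \(d=1\) the weights collapse to \(1-L\), while for fractional d they decay hyperbolically, which is why a long truncation lag is needed.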

Innovation distributions

Normal distribution

The naive assumption is that the standardized residual follows a standard normal distribution, i.e., \(z_{t} \sim iid N(0,1)\) with the cdf defined as

$$\begin{aligned} \Phi (x) = \frac{1}{\sqrt{2\pi }} \int _{-\infty }^x e^{-t^2/2}\,dt \end{aligned}$$
(9)

The day-ahead VaR and ES forecasts are given by (Brio et al. 2020):

$$\begin{aligned} \text{ VaR}_{t, N}^{\alpha }= & {} \hat{\mu }_{t} + \hat{\sigma }_{t}\Phi ^{-1}(\alpha ) \end{aligned}$$
(10)
$$\begin{aligned} \text{ ES}_{t, N}^{\alpha }= & {} \hat{\mu }_{t} - \hat{\sigma }_{t} \frac{1}{\alpha }\phi \left[ \Phi ^{-1}(\alpha )\right] \end{aligned},$$
(11)

where \(\phi\) is the pdf of standard normal distribution and \(\Phi ^{-1}(\alpha )\) is the corresponding \(\alpha\) quantile.
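As a quick numerical check, the normal-innovation forecasts can be computed with Python's standard library. The sketch takes the tail expectation in Eq. (3) literally, so that \(\mathbb {E}\left[ z|z<\Phi ^{-1}(\alpha )\right] =-\phi \left[ \Phi ^{-1}(\alpha )\right] /\alpha\) and both VaR and ES are negative returns for a long position. The values of the mean and volatility forecasts are hypothetical.

```python
from statistics import NormalDist


def normal_var_es(mu, sigma, alpha):
    """Lower-tail VaR and ES of returns under standard normal innovations."""
    std = NormalDist()
    q = std.inv_cdf(alpha)           # alpha quantile of z
    tail_mean = -std.pdf(q) / alpha  # E[z | z < q] for a standard normal
    return mu + sigma * q, mu + sigma * tail_mean


# Hypothetical forecasts: zero mean and 2% daily volatility.
var_99, es_99 = normal_var_es(0.0, 2.0, 0.01)
var_975, es_975 = normal_var_es(0.0, 2.0, 0.025)
```

Under normality, the 97.5% ES (about -4.68) is nearly the same as the 99% VaR (about -4.65), so any gap between the two measures in practice reflects non-normal tails.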

Skewed student distribution

Fernández and Steel (1998) proposed a four-parameter skewed student distribution to account for the asymmetry and kurtosis of the returns process, and Lambert and Laurent (2001) derive its quantile function as follows:

$$\begin{aligned} c_{\alpha ,\nu ,\xi }^{\rm skst} = \left\{ \begin{array}{ll} \left\{ \frac{1}{\xi }c_{\alpha ,\nu }^{\rm st}\left[ \frac{\alpha }{2}\left( 1+\xi ^2\right) \right] -m\right\} /s, & \quad \text{ if } \; \alpha < \frac{1}{1+\xi ^2}\\ \left\{ -\xi c_{\alpha ,\nu }^{\rm st}\left[ \frac{1-\alpha }{2}\left( 1+\xi ^{-2}\right) \right] -m\right\} /s, & \quad \text{ if } \; \alpha \ge \frac{1}{1+\xi ^2} \end{array} \right. \end{aligned}$$
(12)

where \(c_{\alpha ,\nu ,\xi }^{\rm skst}\) is the \(\alpha\)th quantile of the unit variance skewed student distribution with \(\nu >2\) degrees of freedom and asymmetry parameter \(\xi >0\); \(c_{\alpha ,\nu }^{\rm st}\) denotes the quantile function of the standardized Student-t distribution; \(m = \frac{\Gamma \left( \frac{\nu -1}{2}\right) \sqrt{\nu -2}}{\sqrt{\pi }\Gamma \left( \frac{\nu }{2}\right) }\left( \xi -\frac{1}{\xi }\right)\) and \(s = \sqrt{\left( \xi ^2+\frac{1}{\xi ^2}-1\right) -m^2}\) are the mean and standard deviation of the non-standardized skewed student distribution, respectively. The day-ahead VaR and ES forecasts are then given by

$$\begin{aligned} \text{ VaR}_{t, skst}^{\alpha }= & {} \hat{\mu }_{t} + \hat{\sigma }_{t}c_{\alpha ,\nu ,\xi }^{\rm skst} \end{aligned}$$
(13)
$$\begin{aligned} \text{ ES}_{t, skst}^{\alpha }= & {} \hat{\mu }_{t} + \hat{\sigma }_{t}\mathbb {E}\left[ z|z< c_{\alpha ,\nu ,\xi }^{\rm skst}\right] \end{aligned}$$
(14)

Filtered historical simulation

The filtered historical simulation (FHS) introduced by Barone-Adesi et al. (1999) is a semi-parametric approach that employs parametric models for the mean and volatility dynamics and a non-parametric estimator for the residual distribution. The parameters of the GARCH model defined in Eq. (4) are estimated by quasi-maximum likelihood, i.e., under the working assumption that the standardized residual is normally distributed, and the day-ahead VaR and ES forecasts are then given by

$$\begin{aligned} \text{VaR}_{t, {HS}}^{\alpha } &= \hat{\mu }_{t} + \hat{\sigma }_{t}\hat{G}^{-1}(\alpha ) \end{aligned}$$
(15)
$$\begin{aligned} \text{ ES}_{t, HS}^{\alpha }&=\hat{\mu }_{t} + \hat{\sigma }_{t}\mathbb {E}\left[ z|z< \hat{G}^{-1}(\alpha )\right] \end{aligned},$$
(16)

where \(\hat{G}^{-1}(\alpha )\) is the \(\alpha\) quantile of the empirical distribution of the standardized residual \(z_{t}\).
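The empirical quantile and tail mean in Eqs. (15) and (16) can be sketched as follows. The order-statistic convention (the k-th smallest residual with \(k=\lceil \alpha n\rceil\)) and the decision to include the quantile observation itself in the tail average are assumptions made here; other empirical quantile definitions are equally defensible. The residuals and forecasts are hypothetical.

```python
import math


def empirical_quantile_and_tail_mean(z, alpha):
    """Empirical alpha quantile of residuals z and the mean of the tail up to it."""
    zs = sorted(z)
    # Assumption: take the k-th smallest residual, k = ceil(alpha * n), as the quantile.
    k = max(1, math.ceil(alpha * len(zs)))
    tail = zs[:k]  # assumption: the quantile observation is counted in the tail
    return zs[k - 1], sum(tail) / len(tail)


# Hypothetical standardized residuals and day-ahead mean/volatility forecasts.
z = [-3.0, -2.0, -1.0, -0.5, 0.0, 0.2, 0.5, 1.0, 1.5, 2.0]
q, tail_mean = empirical_quantile_and_tail_mean(z, alpha=0.2)
mu_hat, sigma_hat = 0.0, 1.5
var_hs = mu_hat + sigma_hat * q
es_hs = mu_hat + sigma_hat * tail_mean
```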

Cornish–Fisher expansion

Favre and Galeano (2002) proposed a four-moment “modified VaR” based on the Cornish–Fisher expansion (Cornish and Fisher 1937) for estimating the quantiles of non-normal distributions as a function of standard normal quantiles and the sample skewness and excess kurtosis. The \(\alpha\) quantile of a distribution is approximated as follows:

$$\begin{aligned} \delta _{\mathrm{CF}} (\alpha )&= Z(\alpha )+\frac{1}{6}\left( Z(\alpha )^2-1\right) S + \frac{1}{24}\left( Z(\alpha )^3-3Z(\alpha )\right) K- \frac{1}{36}\left( 2Z(\alpha )^3-5Z(\alpha )\right) S^2 \end{aligned},$$
(17)

where \(Z(\alpha )\) is the \(\alpha\) quantile of the standard normal distribution, and S and K are the skewness and excess kurtosis of the standardized residual series. Alexander et al. (2013) suggest using a GARCH-type model to capture volatility clustering and then modeling the standardized residual by the Cornish–Fisher expansion for VaR forecasting. In this paper, we extend this approach to ES forecasting. The one-day-ahead VaR and ES forecasts are given by

$$\begin{aligned} \text{ VaR}_{t, CF}^{\alpha }&= \hat{\mu }_{t} + \hat{\sigma }_{t} \delta _{\text{ CF }} (\alpha ) \end{aligned}$$
(18)
$$\begin{aligned} \text{ ES}_{t, CF}^{\alpha }= & {} \hat{\mu }_{t} + \hat{\sigma }_{t}\mathbb {E}\left[ z|z< \delta _{\text{ CF }} (\alpha ) \right] \end{aligned}$$
(19)
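The Cornish–Fisher quantile in Eq. (17) is straightforward to code. How the tail expectation in Eq. (19) is evaluated is not spelled out above, so the sketch below makes an explicit assumption: it averages the Cornish–Fisher quantile function over the levels \((0,\alpha )\) with a midpoint rule, i.e., it uses \(\frac{1}{\alpha }\int _{0}^{\alpha }\delta _{\mathrm{CF}}(u)\,du\). This is only meaningful when the expansion is monotone in the tail, which holds for moderate skewness and excess kurtosis. The moment values are hypothetical.

```python
from statistics import NormalDist


def cf_quantile(alpha, skew, ex_kurt):
    """Cornish-Fisher approximation of the alpha quantile, Eq. (17)."""
    z = NormalDist().inv_cdf(alpha)
    return (z
            + (z ** 2 - 1) * skew / 6
            + (z ** 3 - 3 * z) * ex_kurt / 24
            - (2 * z ** 3 - 5 * z) * skew ** 2 / 36)


def cf_tail_mean(alpha, skew, ex_kurt, n_grid=2000):
    """Assumed ES rule: average the CF quantile over (0, alpha) by the midpoint rule."""
    levels = [(i + 0.5) * alpha / n_grid for i in range(n_grid)]
    return sum(cf_quantile(u, skew, ex_kurt) for u in levels) / n_grid


# Hypothetical moments of the standardized residuals: negative skew, fat tails.
q_cf = cf_quantile(0.01, skew=-0.5, ex_kurt=2.0)
es_cf = cf_tail_mean(0.01, skew=-0.5, ex_kurt=2.0)
```

With zero skewness and zero excess kurtosis both functions reduce to the normal case, which gives a simple sanity check on the implementation.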

Evaluation criteria

Value-at-risk tests

To assess the accuracy of a VaR model, a simple approach is to use the Actual over Expected (AE) exceedance ratio, i.e., the ratio of the actual number of times that returns exceeded the forecast VaR to the expected number of VaR exceptions (Laporta et al. 2018). The closer the AE ratio is to one, the more accurately the model predicts VaR. More rigorous approaches for backtesting include unconditional coverage (Kupiec 1995), conditional coverage (Christoffersen 1998), and the dynamic quantile test (Engle and Manganelli 2004), which have been used to backtest oil VaR forecasts (e.g., Hung et al. 2008; Marimoutou et al. 2009; Aloui and Mabrouk 2010; Youssef et al. 2015; Lyu et al. 2017; Westgaard et al. 2019).

The unconditional coverage test \(\text{ LR}_\mathrm{uc}\) of Kupiec (1995) tests the null hypothesis that the exception rate is statistically equal to the expected value given a confidence level. The \(\text{ LR}_\mathrm{uc}\) test, however, is unable to determine whether the exceptions are distributed randomly. Christoffersen (1998) proposes a more comprehensive conditional coverage test \(\text{ LR}_\mathrm{cc}\) which jointly examines whether total exceptions are equal to the expected and whether exceptions are distributed independently. Similarly, Engle and Manganelli (2004) introduce a regression-based dynamic quantile (DQ) test to examine whether exceptions can be predicted by a set of explanatory variables such as the first four lags of the Hit series.
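The AE ratio and the Kupiec statistic depend only on the number of exceptions x among n forecasts. A minimal sketch, assuming the standard binomial likelihood-ratio form \(\text{ LR}_\mathrm{uc}=-2\ln \left[ (1-\alpha )^{n-x}\alpha ^{x}\right] +2\ln \left[ (1-x/n)^{n-x}(x/n)^{x}\right]\), which is asymptotically chi-square with one degree of freedom, is given below. The exception counts are hypothetical.

```python
import math


def ae_ratio(n_exceptions, n_obs, alpha):
    """Actual over expected number of VaR exceptions."""
    return n_exceptions / (alpha * n_obs)


def kupiec_lr_uc(n_exceptions, n_obs, alpha):
    """Kupiec unconditional coverage statistic and its chi-square(1) p-value."""
    x, n = n_exceptions, n_obs

    def loglik(p):
        # Binomial log-likelihood, treating 0 * log(0) as 0.
        ll = 0.0
        if x > 0:
            ll += x * math.log(p)
        if n - x > 0:
            ll += (n - x) * math.log(1.0 - p)
        return ll

    lr = max(-2.0 * (loglik(alpha) - loglik(x / n)), 0.0)
    p_value = math.erfc(math.sqrt(lr / 2.0))  # survival function of chi-square(1)
    return lr, p_value


# Hypothetical backtest: 50 exceptions in 3300 forecasts of the 1% VaR (33 expected).
ae = ae_ratio(50, 3300, 0.01)
lr, p_value = kupiec_lr_uc(50, 3300, 0.01)
lr_exact, p_exact = kupiec_lr_uc(33, 3300, 0.01)
```

In this hypothetical case the AE ratio is about 1.5 and the null of correct coverage is rejected at the 5% level, whereas exactly 33 exceptions give a statistic of zero.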

Expected shortfall tests

The exceedance residuals (ER) test proposed by McNeil and Frey (2000) was one of the first and most widely used ES tests. Under the null hypothesis of a correctly specified risk model, the ER, defined as the difference between the realized return and the ES forecast in the event of a VaR exception, i.e., \(er_{t} = \left( r_{t}-\text{ ES}_{t}\right) I_{r_{t}<{\text{ VaR}_{t}}}\), should have zero mean. The bootstrap method of Efron and Tibshirani (1993) is used to determine whether the expected value of the ER is zero. A one-sided test is performed against the alternative hypothesis that the mean is negative, i.e., the expected shortfall forecast is systematically underestimated. The ER test essentially compares the empirical average of returns truncated at the VaR forecast to the average ES forecast whenever there is a VaR violation, so it is a joint backtest for the pair of VaR and ES forecasts (Bayer and Dimitriadis 2020).

Bayer and Dimitriadis (2020) introduce an ES regression (ESR) test that models the conditional ES as a linear function, with returns as the response variable and ES forecasts as the explanatory variable including an intercept term. The intercept and slope parameters should be equal to zero and one, respectively, for correctly specified ES forecasts. To determine whether the ES forecast is systematically underestimated, Bayer and Dimitriadis (2020) further suggest an Intercept ESR test that limits the slope parameter to one and only estimates and tests the intercept term. The Intercept ESR test is defined as

$$\begin{aligned} r_{t}-\hat{e}_{t}=\gamma _{1}+u_{t}^{e} \end{aligned},$$
(20)

where \(r_t\) denotes the daily returns; \(\hat{e}_{t}\) is the ES forecast; the conditional \(\mathrm {ES}\) at \(\alpha\) for \(u_{t}^{e}\) given the past information \(\mathcal {F}_{t-1}\) is zero, i.e., \(\mathrm {ES}^{\alpha }\left( u_{t}^{e} \mid \mathcal {F}_{t-1}\right) =0\). The one-sided hypothesis test is given by

$$\begin{aligned} \mathbb {H}_{0}^{1 s}: \gamma _{1} \ge 0 \quad \text{ against } \quad \mathbb {H}_{A}^{1 s}: \gamma _{1}<0 \end{aligned}$$
(21)

which is tested based on a one-sided t test using estimated asymptotic covariance.

Loss functions

In most cases, using the above statistical tests to evaluate the performance of individual models may not be sufficient to determine the best model. In this case, the model confidence set (MCS) procedure of Hansen et al. (2011) can be used to determine a set of “superior” models based on loss functions at a given confidence level. A model that is not in the “superior” set is considered less likely to be the best model than those in the set. We follow Hansen et al. (2011) by implementing MCS testing at 90% confidence on various loss functions for evaluating volatility, VaR, and the joint VaR and ES forecasts, respectively. The models with lower average loss values over the forecasting period perform better.

Volatility forecast is evaluated using the Quasi-Likelihood (QLIKE) loss function where daily squared returns are used as a proxy for realized volatility (RV). The QLIKE is an asymmetric loss function with a heavier penalty for under-forecast. According to Patton (2011), both the QLIKE and Mean Squared Error (MSE) functions are robust to the presence of noise in the volatility proxy for ranking rival volatility forecasts, but QLIKE is less sensitive to outliers than MSE.

$$\begin{aligned} \text{ QLIKE}_{t}=\frac{\text{ RV}_{t}}{\hat{\sigma }_{t}^{2}}-\ln \left( \frac{\text{ RV}_{t}}{\hat{\sigma }_{t}^{2}}\right) -1 \end{aligned}$$
(22)
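A short sketch of Eq. (22) illustrates the asymmetry. The inputs are hypothetical, and the function assumes a strictly positive volatility proxy, since the logarithm is undefined when the squared return is exactly zero.

```python
import math


def qlike(rv, sigma2_hat):
    """QLIKE loss of a variance forecast sigma2_hat against the proxy rv, Eq. (22)."""
    if rv <= 0 or sigma2_hat <= 0:
        raise ValueError("rv and sigma2_hat must be strictly positive")
    ratio = rv / sigma2_hat
    return ratio - math.log(ratio) - 1.0


# Hypothetical proxy of 4 with forecasts that miss by a factor of two either way.
loss_under = qlike(4.0, 2.0)  # forecast too low
loss_over = qlike(4.0, 8.0)   # forecast too high
loss_exact = qlike(4.0, 4.0)
```

The under-forecast is penalized more heavily (about 0.31 versus 0.19), and a perfect forecast has zero loss.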

The tick loss function (TLF), which is used for quantile regression (Koenker and Bassett 1978), is one of the most commonly used methods for evaluating quantile forecasts (e.g., González-Rivera et al. 2004; Giacomini and Komunjer 2005). It is an asymmetric loss function that heavily penalizes observations with VaR exceptions.

$$\begin{aligned} \text{ TLF}_{t}^{\alpha } = \left( \alpha -I_{r_{t}<{\text{ VaR}_{t}^\alpha }}\right) \left( r_{t}-\text{ VaR}_{t}^{\alpha }\right) \end{aligned}$$
(23)
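Eq. (23) translates directly into code. The return and VaR values below are hypothetical and serve only to show how strongly an exception is weighted relative to a non-exception of the same distance.

```python
def tick_loss(r, var, alpha):
    """Tick (quantile) loss of a VaR forecast, Eq. (23)."""
    hit = 1.0 if r < var else 0.0  # VaR exception indicator
    return (alpha - hit) * (r - var)


# Hypothetical 1% VaR of -4: a return two points below versus two points above it.
loss_exception = tick_loss(-6.0, -4.0, 0.01)
loss_no_exception = tick_loss(-2.0, -4.0, 0.01)
```

The exception costs 1.98 while the non-exception costs 0.02, a ratio of \((1-\alpha )/\alpha =99\).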

While ES itself is not elicitable (Gneiting 2011), Fissler and Ziegel (2016) show that the pair VaR and ES is jointly elicitable with an associated scoring function (FZL). We follow Patton et al. (2019) to use the FZL to assess each model’s ability to predict VaR and ES.

$$\begin{aligned} \text{ FZL}_{t}^{\alpha } = \frac{1}{\alpha \text{ ES}_{t}^{\alpha }} I_{r_{t}< {\text{ VaR}_{t}^\alpha }}\left( r_{t}-\text{ VaR}_{t}^{\alpha }\right) +\frac{\text{ VaR}_{t}^{\alpha }}{\text{ ES}_{t}^{\alpha }}+\log \left( -\text{ ES}_{t}^{\alpha }\right) -1 \end{aligned}$$
(24)
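A minimal sketch of Eq. (24) follows. It requires a strictly negative ES forecast, as the logarithm of \(-\text{ ES}_{t}^{\alpha }\) is otherwise undefined. The numerical inputs are hypothetical.

```python
import math


def fz_loss(r, var, es, alpha):
    """Joint VaR-ES loss of Eq. (24); requires a strictly negative ES forecast."""
    if es >= 0:
        raise ValueError("the ES forecast must be strictly negative")
    hit = 1.0 if r < var else 0.0  # VaR exception indicator
    return hit * (r - var) / (alpha * es) + var / es + math.log(-es) - 1.0


# Hypothetical forecasts at alpha = 2.5%: VaR = -4 and ES = -5.
loss_calm = fz_loss(-2.0, -4.0, -5.0, 0.025)       # no exception
loss_exception = fz_loss(-6.0, -4.0, -5.0, 0.025)  # return breaches the VaR
```

Without an exception the loss depends only on the forecasts themselves; an exception adds a term that grows with the size of the breach and with \(1/\alpha\).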

Empirical analysis

Data and estimation

In this study, we focus on two of the most closely watched oil markets: European Brent crude oil (Brent) and West Texas Intermediate crude oil (WTI). The daily spot prices are obtained from the Energy Information Administration (EIA) and cover the period from July 2, 2004 to August 31, 2021 for Brent and July 9, 2004 to August 31, 2021 for WTI, each with approximately 4300 observations. Figures 1 and 2 depict the daily prices and returns for Brent and WTI, respectively. Brent and WTI follow a similar pattern, with prices rapidly rising and reaching a record high above $140 per barrel in July 2008 before collapsing to $30 per barrel by the end of 2008. The markets started to recover and stabilize before the price slump in late 2014. The most severe crash was observed in early 2020 as a result of a supply glut and a drop in demand caused by the spread of COVID-19. It should be noted that the WTI fell from $18 to −$37 per barrel on April 20, 2020, before recovering to $8 per barrel the next day. Such an extreme outlier could have a significant impact on the estimation of standard econometric models (Shi 2021). Therefore, we removed the negative price on April 20, 2020 from the WTI time series for more robust parameter estimation. As can be seen, the impact of this extreme market shock quickly faded, as the market returned to positive territory and remained there in the following days. The continuously compounded daily returns are calculated as the difference in the logarithms of daily spot prices multiplied by 100, i.e., \(R_t = \left[ \ln P_t-\ln P_{t-1}\right] \times 100\).

Fig. 1

Brent daily price and return. The graphs show the daily price and return series for Brent from July 9, 2004 to August 31, 2021, along with the return density graph. The three sub-sample forecast windows are denoted by dashed lines

Fig. 2

WTI daily price and return. The graphs show the daily price and return series for WTI from July 9, 2004 to August 31, 2021, along with the return density graph. The three sub-sample forecast windows are denoted by dashed lines. The negative price on April 20, 2020 was removed for creating return series and the density graph

Each of the models is initially estimated using the first 1000 observations to produce a one-step-ahead VaR and ES forecast on both long and short positions for day 1001. After that, the estimation sample is advanced by one day, the model is re-estimated, and the forecast for day 1002 is generated, and so on until the end of the sample. As a result, each estimation window contains 1000 observations, and approximately 3300 out-of-sample observations are available for forecast evaluation. The main advantage of using a rolling window is that it incorporates the most recent market data while discarding out-of-date observations. To assess whether model performance varies across market conditions, we further investigate the VaR and ES forecasts for the long position over three sub-samples with extreme oil price movements. Each sub-sample has 1250 observations, with the last 250 observations of each sample taken as the out-of-sample forecast period to evaluate the one-day-ahead VaR and ES forecasting performance of each model. Table 1 contains the information about each sample period. The first two sub-samples were chosen following Lyu et al. (2017) and cover the oil crises in late 2008 and late 2014. The third sub-sample corresponds to the recent COVID-19 period. The three sub-sample windows are denoted by dashed lines in the Brent and WTI price histories shown in Figs. 1 and 2.
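The rolling-window scheme described above can be sketched as an index generator. The function name is hypothetical, indices are zero-based, and the round figure of 4300 observations is used only for illustration; model estimation itself is left out.

```python
def rolling_windows(n_obs, window=1000):
    """Yield (start, end) pairs: estimate on [start, end) and forecast index `end`."""
    for end in range(window, n_obs):
        yield end - window, end


splits = list(rolling_windows(4300))
first, last = splits[0], splits[-1]
```

The first pair covers observations 0 to 999 and forecasts index 1000 (day 1001), and the scheme yields 3300 forecasts in total; a 1250-observation sub-sample gives 250.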

Table 1 Sample periods

The returns graphs show time-varying volatility, with large changes in oil price returns followed by large changes and small changes followed by small changes. During the recent COVID-19 outbreak, both Brent and WTI returns experienced extremely high volatility, resulting in a series of large positive and negative returns within a short time. Table 2 provides the descriptive statistics of the two returns series over the full sample and sub-sample periods. The daily mean returns are close to zero. Brent and WTI have the largest standard deviation and maximum profit and loss over the COVID-19 (sample III) period. The Jarque–Bera statistic rejects the normality assumption because both returns series are highly leptokurtic with negative skewness. The augmented Dickey–Fuller test indicates that both returns series are stationary. The ARCH portmanteau test for serial correlation in squared returns of order up to 20 reveals evidence of significant volatility clustering. Preliminary data analysis indicates that capturing skewness, fat tails, and time-varying volatility is critical for tail-risk forecasting.
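As an illustration of the normality check, the sketch below computes sample skewness, kurtosis, and the Jarque–Bera statistic in its standard textbook form. The data are hypothetical and the function name is an assumption; the p value uses the asymptotic chi-square distribution with two degrees of freedom, whose survival function is \(\exp(-x/2)\).

```python
import math

def jarque_bera(x):
    """Return (skewness, kurtosis, JB statistic, asymptotic p value)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2          # equals 3 for a normal distribution
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)  # chi-square(2) survival function
    return skew, kurt, jb, p_value

# Hypothetical returns: small fluctuations plus a single large loss
sample = [0.1, -0.2, 0.3, -0.1, 0.2] * 40 + [-15.0]
skew, kurt, jb, p_value = jarque_bera(sample)
```

A single large negative return is enough to produce negative skewness and kurtosis well above 3 in this toy sample, which is the pattern the test detects in the oil returns.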

Table 2 Summary statistics
Table 3 GARCH estimation

Table 3 presents the estimation results of conditional volatility models based on standard GARCH, GJRGARCH, or FIGARCH models combined with skewed Student's t innovations over the full sample period. The mean returns \(\mu\) are not statistically significantly different from zero in any of the models. The sum of the ARCH and GARCH parameters (i.e., \(\alpha +\beta\)) is close to 1, which indicates high volatility persistence. In particular, the leverage parameter \(\gamma\) of the GJRGARCH model is positive and significant, confirming the leverage effect, i.e., negative shocks have a larger impact on the conditional variance than positive shocks. The long-memory parameter d of the FIGARCH model is close to 1, suggesting that shocks decay at a slower hyperbolic rate rather than the exponential rate of the standard GARCH model. The skewed Student's t distribution parameter estimates (\(\nu\) and \(\xi\)) indicate fat tails and negative skewness. Finally, the standardized residuals display neither significant autocorrelation nor any remaining ARCH effect in any of the models. Overall, the estimation results support the use of constant mean and GARCH-type volatility models, as well as skewed and fat-tailed innovations, for oil VaR and ES modeling.
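The leverage effect captured by \(\gamma\) can be illustrated with a one-step variance update. The sketch assumes the common GJR-GARCH(1,1) parameterization \(\sigma_t^2 = \omega + (\alpha + \gamma I_{t-1})\varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2\), where \(I_{t-1}=1\) for a negative shock; the parameter values are hypothetical and are not the estimates reported in Table 3.

```python
def gjr_next_variance(shock, prev_variance, omega, alpha, gamma, beta):
    """One-step conditional variance under an assumed GJR-GARCH(1,1) form.

    The extra term gamma applies only when the previous shock is negative,
    which is what makes the response to bad news asymmetric.
    """
    leverage = gamma if shock < 0 else 0.0
    return omega + (alpha + leverage) * shock ** 2 + beta * prev_variance

# Hypothetical parameters, chosen only for illustration
params = dict(omega=0.05, alpha=0.03, gamma=0.08, beta=0.90)

after_loss = gjr_next_variance(-3.0, 4.0, **params)  # about 4.64
after_gain = gjr_next_variance(3.0, 4.0, **params)   # about 3.92
```

Shocks of equal size thus lead to a higher next-day variance when the shock is negative, provided \(\gamma > 0\).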

Results

We now examine out-of-sample forecast performance. To begin, we assess the accuracy of volatility forecasting using three GARCH-type models (GARCH, GJRGARCH, and FIGARCH) with normal and skewed t innovations (see Footnote 8). The daily squared return serves as the benchmark for comparison. The average values of the QLIKE loss function are reported in Table 4, with lower values indicating more precise volatility forecasts. The GJRGARCH model produces the most accurate volatility forecasts for both Brent and WTI, and it appears most frequently in the superior set of models based on the MCS test at the 90% confidence level. The differences between normal and skewed t innovations under each type of GARCH model are less evident.
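For illustration, one common form of the QLIKE loss is \(\ln h_t + r_t^2 / h_t\), where \(h_t\) is the variance forecast and \(r_t^2\) is the squared-return proxy. The sketch below assumes this form; the definition used for Table 4 may differ by a normalization, which would not change the ranking of models. The numbers are hypothetical.

```python
import math

def qlike(squared_returns, variance_forecasts):
    """Average QLIKE loss in the assumed form ln(h) + r^2 / h (lower is better)."""
    losses = [math.log(h) + r2 / h
              for r2, h in zip(squared_returns, variance_forecasts)]
    return sum(losses) / len(losses)

# Hypothetical proxy values and two competing forecast series
proxy = [4.0, 1.0, 9.0]
loss_matched = qlike(proxy, [4.0, 1.0, 9.0])  # forecasts equal to the proxy
loss_flat = qlike(proxy, [1.0, 1.0, 1.0])     # constant forecasts
```

For each observation the loss is minimized when the forecast equals the proxy, so the matched series attains the lower average loss.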

Table 4 Volatility forecasts accuracy (QLIKE)

The VaR and ES forecast performance is then evaluated at the 97.5% and 99% confidence levels using a set of criteria that includes the AE ratios and the p values of the \(\text{ LR}_\mathrm{uc}\), \(\text{ LR}_\mathrm{cc}\), DQ, ER, and Intercept ESR tests. A p value of 0.05 or less is interpreted as evidence for rejecting the null hypothesis. Furthermore, the maximum absolute deviation (ADMax), as well as the tick and FZL functions, are employed to compare forecast accuracy across different models (see Footnote 9). The MCS tests are run at the 90% confidence level on the tick and FZL mean values, respectively. Tables 5 and 6 report the means and standard deviations of VaR and ES forecasts, as well as their backtesting performance over the full out-of-sample forecast periods for long positions in Brent and WTI, respectively. Tables 7, 8, 9, and 10 show the breakdown of performance over the three sub-sample periods. Given the limited number of observations for forecast evaluation in the sub-samples, which may affect test significance, we do not rely on the p value tests but instead use the AE, ADMax, and the loss function values as indicators. The main findings for the long positions are summarized below.

Table 5 Brent (long) VaR and ES results: full sample
Table 6 WTI (long) VaR and ES results: full sample
Table 7 Brent (long) VaR and ES 97.5% confidence results: sub-sample
Table 8 Brent (long) VaR and ES 99% confidence results: sub-sample
Table 9 WTI (long) VaR and ES 97.5% confidence results: sub-sample
Table 10 WTI (long) VaR and ES 99% confidence results: sub-sample
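Several of the evaluation criteria described above can be sketched in a few lines. The code assumes the following conventions, which are not taken from the tables: returns and VaR/ES forecasts are expressed as returns (so long-position VaR and ES are negative, with ES below VaR), \(\alpha\) is the tail probability, \(\text{LR}_\mathrm{uc}\) is the Kupiec unconditional coverage test, the tick loss is the quantile loss \((\alpha - I_t)(r_t - v_t)\), and FZL is the FZ0 loss of Patton, Ziegel and Chen. All data are hypothetical.

```python
import math

def exceptions(returns, var):
    """Indicator of a VaR exception: the return falls below the VaR forecast."""
    return [r < v for r, v in zip(returns, var)]

def ae_ratio(returns, var, alpha):
    """Actual over expected number of exceptions."""
    return sum(exceptions(returns, var)) / (alpha * len(returns))

def kupiec_pvalue(returns, var, alpha):
    """p value of the unconditional coverage likelihood ratio test."""
    n, x = len(returns), sum(exceptions(returns, var))
    def loglik(p):
        ll = x * math.log(p) if x > 0 else 0.0
        return ll + ((n - x) * math.log(1.0 - p) if n - x > 0 else 0.0)
    lr = -2.0 * (loglik(alpha) - loglik(x / n))
    return math.erfc(math.sqrt(max(lr, 0.0) / 2.0))  # chi-square(1) tail

def ad_max(returns, var):
    """Largest absolute gap between return and VaR over the exceptions."""
    gaps = [abs(r - v) for r, v in zip(returns, var) if r < v]
    return max(gaps) if gaps else 0.0

def tick_loss(returns, var, alpha):
    """Average quantile (tick) loss for the VaR forecasts."""
    return sum((alpha - (r < v)) * (r - v) for r, v in zip(returns, var)) / len(returns)

def fz0_loss(returns, var, es, alpha):
    """Average FZ0 joint loss for VaR and ES (requires es < 0)."""
    total = 0.0
    for r, v, e in zip(returns, var, es):
        total += -(r <= v) * (v - r) / (alpha * e) + v / e + math.log(-e) - 1.0
    return total / len(returns)

# Hypothetical sample: 10 losses of -5 among 1000 returns, constant forecasts
rets = [-5.0] * 10 + [0.5] * 990
var_f, es_f = [-3.0] * 1000, [-4.0] * 1000
ae = ae_ratio(rets, var_f, 0.01)
p_uc = kupiec_pvalue(rets, var_f, 0.01)
tick = tick_loss(rets, var_f, 0.01)
fzl = fz0_loss(rets, var_f, es_f, 0.01)
```

In this toy sample the exception frequency equals the nominal 1%, so the AE ratio is one and the coverage test does not reject. The ADMax and the loss functions add information on how far the exceptions fall below the forecasts, which the frequency-based measures ignore.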

First, the standard deviations of VaR and ES forecasts increase with the confidence level, and the standard deviations of ES are greater than those of VaR at the same level of confidence. This implies that as confidence levels rise, forecasts for oil tail-risk become more variable. Moreover, the COVID-19 (sample III) period has the highest mean and standard deviations of VaR and ES forecasts, followed by the 2008 GFC (sample II) period. When comparing the COVID-19 results to the GFC period, the means of VaR and ES forecasts are roughly 50% higher, with standard deviations nearly five times higher. This implies that COVID-19 has caused more uncertainty and has had a greater impact on the oil market than the GFC. VaR and ES forecasts based on the GARCH-type models are effective at tracking market movements via the volatility filtering process, but their relative performance requires careful examination. Furthermore, the maximum absolute deviation is significantly higher during the COVID-19 period than during the GFC period, with the increase being greater for the WTI market than for the Brent market. Therefore, the model performance comparison should consider not only the frequency but also the magnitude of VaR exceptions. This is especially important during a crisis, given that the AE ratio is significantly greater than one.

Second, when the performance of innovation distributions within each type of GARCH model is compared over the entire out-of-sample period, normal innovations perform the worst, with an AE ratio significantly greater than the others and the null hypotheses of the VaR and ES tests being rejected in the majority of cases, indicating that the models are misspecified. The historical simulation approach is less accurate than the skewed t and the Cornish–Fisher expansion in predicting VaR at the 99% confidence level. Based on the tick and FZL function evaluation, the skewed t produces the most accurate joint VaR and ES forecasts. Moreover, it is worth noting that the Cornish–Fisher expansion consistently produces the most conservative tail-risk forecasts, even during the stressed sub-sample periods. It has the highest mean and standard deviation of VaR and ES forecasts, the lowest AE ratio, and the smallest maximum absolute deviations. It passes the one-sided Intercept ESR tests at the 5% significance level in all cases.

Third, the differences between various GARCH-type models are investigated. Under the same type of innovation distribution, the GJRGARCH outperforms the standard GARCH and FIGARCH models in VaR and ES forecasts based on the tick and FZL functions evaluation. This is supported by the findings from the three sub-sample periods. The results are consistent with the QLIKE test reported in Table 4, which shows that the GJRGARCH model produces the most accurate volatility forecasts. Indeed, GJRGARCH combined with skewed t innovation outperforms all others in joint VaR and ES forecasts for both Brent and WTI. It is interesting to note that the FIGARCH model has a lower maximum absolute deviation than the GARCH and GJRGARCH models, which can be seen during both the GFC and COVID-19 stressed periods.
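To illustrate why the Cornish–Fisher expansion tends to give more conservative tail-risk forecasts than the normal distribution, the sketch below applies the standard fourth-moment Cornish–Fisher quantile adjustment to a standardized innovation. It is an assumption-laden illustration rather than the specification used in this study: the skewness and excess kurtosis values are hypothetical, and the ES is approximated numerically by averaging the adjusted quantile over the tail instead of using a closed-form expression.

```python
from statistics import NormalDist

def cornish_fisher_quantile(p, skew, excess_kurt):
    """Normal quantile adjusted for skewness and excess kurtosis."""
    z = NormalDist().inv_cdf(p)
    return (z
            + (z ** 2 - 1) * skew / 6
            + (z ** 3 - 3 * z) * excess_kurt / 24
            - (2 * z ** 3 - 5 * z) * skew ** 2 / 36)

def cornish_fisher_var_es(sigma, alpha, skew, excess_kurt, mu=0.0, steps=2000):
    """Long-position VaR and ES (as returns) for volatility forecast sigma.

    ES is the average of the quantile function over (0, alpha), computed
    here with a midpoint rule, so it is a numerical approximation.
    """
    q = cornish_fisher_quantile(alpha, skew, excess_kurt)
    tail = sum(cornish_fisher_quantile(alpha * (i + 0.5) / steps, skew, excess_kurt)
               for i in range(steps)) / steps
    return mu + sigma * q, mu + sigma * tail

# Hypothetical moments: no adjustment versus negative skew and fat tails
var_n, es_n = cornish_fisher_var_es(1.0, 0.01, skew=0.0, excess_kurt=0.0)
var_cf, es_cf = cornish_fisher_var_es(1.0, 0.01, skew=-0.5, excess_kurt=3.0)
```

With zero skewness and zero excess kurtosis the adjustment vanishes and the normal quantile is recovered. Negative skewness and positive excess kurtosis push both VaR and ES further into the left tail. The expansion is only reliable for moderate skewness and kurtosis, since the adjusted quantile function can become non-monotonic otherwise.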

Next, we evaluate the VaR and ES forecasts for short positions. The results for Brent and WTI over the full out-of-sample forecast periods are presented in Tables 11 and 12, respectively. The results are generally in line with the long positions. In particular, the GJRGARCH model combined with skewed t innovations outperforms the others in joint VaR and ES forecasts based on the tick and FZL functions evaluation. The Cornish–Fisher expansion consistently produces the highest VaR and ES forecasts with the most variability when compared to other innovation distributions. In contrast to the findings for the long positions, the Cornish–Fisher expansion overestimates VaR forecasts for the short positions in WTI at the 99% confidence level, with a significantly low AE ratio and high tick and FZL function values. This highlights the limitation of the Cornish–Fisher expansion in considering both long and short trading positions for VaR and ES forecasts.

Table 11 Brent (short) VaR and ES results: full sample
Table 12 WTI (short) VaR and ES results: full sample

Finally, the relative magnitudes of VaR and ES forecasts are compared. We first follow Patra (2021) to examine the ES to VaR ratio at the 99% confidence level. Table 13 reports the average of the ratio over the entire out-of-sample period and the maximum values of the ratio during each sub-sample evaluation period for Brent and WTI, respectively. The ratio is constant at 1.15 for the normal distribution of innovations, but it is significantly higher for the non-normal distributions and varies across models and time. The highest values are observed during the COVID-19 period. For example, the ES is 30 to 40% higher than the VaR predicted by the GJRGARCH model with skewed t innovation. Furthermore, we investigate the impact of shifting risk prediction from VaR (99% confidence level) to ES (97.5% confidence level). The average of the ES to VaR ratio over the entire out-of-sample period and the maximum values of the ratio during each sub-sample evaluation period for Brent and WTI are shown in Table 14. The ratio is always greater than one, indicating that switching from VaR to ES will increase risk prediction. The increase under normal innovation is negligible, whereas the increase under historical simulation could be more than 20% for short positions in WTI during the COVID-19 period. It is worth noting that the Cornish–Fisher expansion produces a more stable increase when moving from VaR to ES compared to the skewed t and historical simulation.

Table 13 ES99 vs VaR99 ratio
Table 14 ES97.5 vs VaR99 ratio
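The constant ratio under normal innovations can be verified directly. The sketch assumes a zero conditional mean, so that the volatility forecast cancels out of the ratio; for a standard normal innovation, the VaR at tail probability \(\alpha\) is \(-\Phi^{-1}(\alpha)\) and the ES is \(\phi(\Phi^{-1}(\alpha))/\alpha\), both expressed as positive losses.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def normal_var(alpha):
    """VaR of a standard normal innovation, expressed as a positive loss."""
    return -STD_NORMAL.inv_cdf(alpha)

def normal_es(alpha):
    """ES of a standard normal innovation, expressed as a positive loss."""
    z = STD_NORMAL.inv_cdf(alpha)
    return STD_NORMAL.pdf(z) / alpha

# Same confidence level (99%): about 1.15
ratio_same_level = normal_es(0.01) / normal_var(0.01)

# Switching from VaR at 99% to ES at 97.5%: slightly above 1
ratio_switch = normal_es(0.025) / normal_var(0.01)
```

Both ratios depend only on the tail probabilities, which is why they do not vary across models or over time under normality. Any variation in the reported ratios therefore reflects the non-normal innovation distributions.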

Discussion

First, this paper complements the recent study of Rizvi and Itani (2021), which compares oil market volatility during the COVID-19 pandemic, the GFC, and the SARS outbreak of 2002–2004. They confirm the presence of volatility clustering in oil prices during the crisis periods and show more severe negative skewness, positive kurtosis, and volatility leverage effects in COVID-19 than in the other crises. Indeed, COVID-19 has had a greater impact on the oil market than the GFC (e.g., Jebabli et al. 2021; Zhang and Hamori 2021). Our analysis reveals a significantly higher number of VaR exceptions than expected during the crisis periods, as well as a higher level and variability of oil tail-risk prediction over the COVID-19 period in comparison to the GFC and the 2014 oil crash.

Second, this paper extends the oil VaR literature by incorporating ES performance for comparison under different models and across various stressed periods. We show that volatility models and innovation distributions are important not only for VaR but also for ES forecasting. In particular, the proposed GARCH-type models with the Cornish–Fisher expansion are superior for ES forecasting and have the lowest maximum absolute deviation for long positions, while being overly conservative for VaR forecasting in short positions. The findings would be of interest to financial institutions concerned about a capital shortfall as well as regulators seeking to maintain financial system stability in the event of a crisis.

Last but not least, this paper broadens the previously limited understanding of the relationship between VaR and ES forecasts. We demonstrate that the ES to VaR ratio varies across models and over time. Given the variability of this ratio, it is beneficial for institutions to monitor it closely, especially during times of stress, to inform timely risk management decisions. Furthermore, we find that switching from VaR (99% confidence) to ES (97.5% confidence) increases tail-risk prediction to varying degrees depending on the model. The Cornish–Fisher expansion leads to the most stable increase compared to the parametric and non-parametric distributions. The findings would be useful to financial institutions and regulators in preparing for the transition of risk measures mandated by the FRTB.

Conclusion

Forecasting and managing oil risk has become more important and challenging than ever before due to the growing financialization of the commodity market. Given the limitations of the VaR measure and the increasing significance of ES, this paper extends previous studies on oil VaR forecasting by shedding light on some interesting aspects of ES forecasts over the last decade. We introduce a GARCH-type model combined with the Cornish–Fisher expansion to address volatility clustering, skewness, and excess kurtosis in oil return series, and compare its VaR and ES forecast performance to some widely used GARCH-type models and innovation distributions based on recently developed VaR and ES backtesting methodology.

Four main findings are presented. First, the GJRGARCH with skewed t innovation generally outperforms the others in joint VaR and ES forecasting. Second, the Cornish–Fisher expansion produces more accurate ES forecasts than the normal, skewed t, and historical simulation, especially for long positions vulnerable to extreme oil crashes. Third, the magnitude of expected loss exceeding VaR relative to the VaR forecast varies across models and over time, with an increase observed during the COVID-19 period. Finally, switching from VaR to ES as a risk measure at comparable confidence levels increases risk prediction to varying degrees depending on the model, and the Cornish–Fisher expansion produces a more stable change than the other innovation distributions studied.

Our analysis suggests that ES, in addition to VaR, should be used to inform timely oil risk management decisions. Furthermore, we demonstrate the value of using the Cornish–Fisher expansion in conjunction with GARCH volatility models to forecast ES. In view of the recent unprecedented volatility in the oil market, the findings could help regulators, energy companies, and financial institutions quantify and manage oil price risk.