1 Introduction

Financial investments in commodities have grown rapidly over the last decades, and commodities have become an important asset class in the portfolios of institutional investors such as pension funds, insurance companies, and hedge funds. The risks associated with weather, storage, and related physical factors led to the rise of commodity indices in the early 2000s, providing a hedging opportunity for commodity producers. The volumes of exchange-traded derivatives became 20 to 30 times higher than the physical production of many commodities (Silvennoinen and Thorp 2013; Paraschiv et al. 2015). The strong interest in this asset class might be attributed to the widely held view that commodities show low correlation with traditional assets and thus provide diversification benefits in a mixed-asset portfolio (Bhardwaj et al. 2015; Gorton and Rouwenhorst 2006; Paraschiv et al. 2015). The empirical analyses in Silvennoinen and Thorp (2013) and Daskalaki and Skiadopoulos (2011) show increased integration of commodity and financial markets, with higher correlation, especially in bearish times (Cheung and Miu 2010). Tang and Xiong (2012) show that the increasing presence of index investors has exposed commodity prices to market-wide shocks, such as shocks to the world equity index, the US dollar exchange rate, and shocks to other commodities, such as oil. Furthermore, Adams and Glück (2015) show that the risk spillovers to commodities observed during the financial crisis were persistent over time, and the volatility in commodity markets increased during the past decade (Tang and Xiong 2012; Basak and Pavlova 2016). These changes in commodity characteristics are often referred to as the financialization of commodity markets (Cheng and Xiong 2014) and create the need for approaches to measure and manage the associated risks in related financial investments.

A common tool for risk management is stress testing. The European Banking Authority (2017, p. 28) points out in its new guidelines under development that “Institutions should ensure that the scenario analysis is a core part of their stress testing programme”. Implementing stress testing is now mandatory for banks, due to the Basel III regulations formed in the post-crisis environment (Basel Committee on Banking Supervision 2009).

An example in this sense is the study of Koliai (2016), which analyses existing risk models for stress testing purposes. The study presents a semi-parametric copula-GARCH risk model for equity indices, exchange rates and commodity prices to perform stress testing on hypothetical portfolios, where the marginal distributions of returns are specified using EVT. The study finds that different risk models produce significantly different stress scenarios and portfolio impacts.

The analysis in Paraschiv et al. (2015) is, to our knowledge, the only example of a stress testing methodology applied to a portfolio of commodities that takes into account specific events that impacted this asset class over time. It shows the importance of using forward-looking scenarios to enable the simulation of extreme quantiles, providing a better understanding of risk.

In this article, we apply stress testing techniques in line with the regulatory requirements from Basel III (Basel Committee on Banking Supervision 2017, 2018; Baudino et al. 2018) to a portfolio of commodity futures. The existing literature on stress testing of commodity portfolios is scarce, despite their popularity in practice. We update the analysis in Paraschiv et al. (2015), keeping the same procedure for constructing the stress portfolio as in the original study. However, we innovate in several directions. First, we extend the data set by including several new historical shocks, among which the oil price drop in 2014. Second, we enrich the spectrum of stress testing scenarios, focusing more on forward-looking ones. The analysis in Paraschiv et al. (2015) is limited to showing the effects of a reoccurring financial crisis on the portfolio profit and loss. Our study shows the importance of combining historical estimations of model parameters with more flexible forward-looking scenario construction. Furthermore, this is the first study in the literature to disentangle the effect of individual model components on the portfolio profit and loss.

For our portfolio construction we mimic the dynamics of the Dow Jones Commodity Index (DJCI). The DJCI is a broad commodity index consisting of 24 commodities in three major sectors: energy, metals and agriculture & livestock. The weights are based on the traded volume, ensuring a liquid index.

For the marginal distributions of commodity returns, we use an asymmetric AR-GARCH process and model the tails by Extreme Value Theory. We further describe the joint dynamics of portfolio components by employing a copula function. Finally, we simulate the portfolio profit and loss distributions under different scenarios in a stress testing framework.

Our results are twofold. First, we find that the simulated profit and loss distribution of the portfolio is highly sensitive to the choice of the modelling approach for the marginal distributions of portfolio components. In particular, a correct identification of tail risk is of great importance for the stress testing purpose. The dependence structure among portfolio components, however, has a less pronounced impact on the stress test results. Second, we find the construction of hybrid scenarios to be a relevant tool to combine historical information with the flexibility of forward-looking approaches, in line with the requirements from Basel III (Basel Committee on Banking Supervision 2009).

The remainder of this article is structured as follows. Section 2 offers an overview of the most relevant literature for our study. In Sect. 3 we introduce our data, focusing on the characteristics of the portfolio. In Sect. 4 we present the theoretical background of the different methodologies applied. Section 5 gives details of the implementation of the methodology for our data set and of the simulation procedure. In Sect. 6 we explain and apply the stress tests and present our analysis. Section 7 concludes.

2 Review of literature on stress testing

2.1 Regulatory requirements for stress testing

As defined in Lopez (2005), stress testing is a risk management tool used to evaluate the potential impact on portfolio values of unlikely, although plausible, events or movements in a set of financial variables. The recent financial crisis drew the attention of banks and authorities to insufficient risk management methods, and the need for more accurate stress testing became obvious, since financial institutions were not prepared to deal with the crisis. One main concern was that scenario selection and simulation were carried out by separate units for each business line and for particular risk types (Basel Committee on Banking Supervision 2009). This indicates that stress testing was conducted in isolation and did not provide a complete picture at the firm level.

Arguably the most recent development in the methodology for stress testing of portfolios is the use of Extreme Value Theory (EVT) and copulas as input to the analysis. EVT was introduced to financial risk modelling in Embrechts et al. (1997) to better capture the tail distribution of risk factors. Extreme Value Theory focuses on shaping the tails rather than the whole distribution of returns, providing more rigorous estimates of risk for financial portfolios. McNeil et al. (2015) suggest using a combination of GARCH and EVT. This methodology is popular in recent literature, with the largest proportion of new studies focusing on stock markets or single commodities (Ghorbel and Souilmi 2014; Liu 2011; Wang et al. 2010; Aepli 2011).

2.2 Types of stress tests

Stress tests can be conducted with several methodologies. One can firstly differentiate between univariate and multivariate stress tests. Univariate stress tests aim at identifying the isolated influence of stressing or shocking one single risk factor of a portfolio (Aepli 2011, p. 4). This makes univariate stress tests simple to apply, but very limited, since they do not take dependencies between the returns of portfolio components into account. Multivariate stress tests overcome this drawback. In Basel Committee on Banking Supervision (2009) we find a classification of stress test methodologies for financial institutions. One can consider different types of scenarios when running stress tests: historical, hypothetical and hybrid. The need for hypothetical scenarios was highlighted after the crisis, since risk managers mostly performed historical stress testing under Basel II (Basel Committee on Banking Supervision 2006). The European Banking Authority (2017, p. 28) pointed out that “the design of stress test scenarios should not only be based on historical events, but should also consider hypothetical scenarios, also based on non-historical events”. Forward-looking scenarios are now required for European banks according to Basel Committee on Banking Supervision (2009). Aepli (2011) proposes an extensive framework for complex stress testing of portfolios of futures that is in agreement with the Basel III regulations formed in the post-crisis environment.

2.2.1 Historical scenario

Historical scenarios are based on actual, realised data stemming from a historical episode of financial stress. This makes them realistic and easy to access. The profit and loss distribution in the historical scenario is simply given by the realised empirical distributions. In  Lopez (2005), it is pointed out that historical scenarios are developed more fully than other scenarios since they reflect an actual stressed market environment that can be studied in great detail, therefore requiring fewer judgements by risk managers.

One major drawback of historical scenarios is the assumption that past financial crises will recur with the same consequences for portfolio losses. This makes them unable to capture risks linked to new products that may have a significant impact on the outcome of a crisis. The worst observed loss in the past might not reflect the worst possible outcome in the future. This drawback proved critical in the financial crisis of 2007 and resulted in the underestimation of the risk level and of the interaction between risks (Basel Committee on Banking Supervision 2009, p. 5).

Another drawback of historical scenarios is the sample size. Due to the limited number of observations, computing risk metrics at higher confidence levels becomes problematic. This is a considerable drawback, as the most extreme losses are of greatest interest in stress testing exercises.

2.2.2 Hypothetical scenario

Hypothetical scenarios are, unlike historical scenarios, forward looking. Scenarios can be constructed in multiple ways, for example by shocking model parameters arbitrarily, based on one's own experience of market movements. Hypothetical scenarios have the advantage of being more flexible and forward looking, making them more informative if constructed correctly. More focus on hypothetical stress testing scenarios allows an institution both to be well prepared for extreme, unexpected outcomes and to lay the foundation for absorbing the associated losses.

An extensive analysis has to be in place before constructing hypothetical scenarios, which can be both time consuming and difficult. In Basel Committee on Banking Supervision (2009, p. 5) it is pointed out that banks had implemented hypothetical scenarios prior to the financial crisis, but it was difficult for risk managers to obtain the support of senior management, since the scenarios were extreme or innovative and often considered implausible. Extremes that have not yet been experienced are often difficult to imagine and to take seriously.

In the financial regulatory framework from Basel Committee on Banking Supervision (2018) we note that during stress testing exercises consideration should be given to both historical and hypothetical events. This is to take into account new information and emerging risks in the foreseeable future. Furthermore, “when conducting stress tests it is important to be aware of the limitations of the scenarios” (BIS 2018, p. 4). This also emphasizes the need to use several scenarios for a more reliable result.

2.2.3 Hybrid scenario

Hybrid scenarios combine the knowledge found in historical scenarios with the flexibility of hypothetical scenarios, making them a suitable alternative in stress testing. Hybrid scenarios are also easier to implement than more extensive forward-looking scenarios, as they are anchored in actual experienced market conditions. Hybrid scenarios are constructed by using historical data during times of financial distress to calibrate the process of risk factor evolution, but allow extrapolation beyond experienced events.

Even though hybrid scenarios allow the construction of new possible scenarios, they are still somewhat backward-looking in the sense that they do not fully explore the risk of shifting market conditions or risks associated with new products. However, Lopez (2005) points out that risk managers always face a trade-off between scenario realism and comprehensibility; that is, more fully developed scenarios generate results that are more difficult to interpret. The benefits of implementing hybrid scenarios should not be neglected, as they balance this trade-off.

3 Data selection and description

3.1 Choice of commodity indices

Commodity indices have become quite popular in the last decades and several have been developed, among which the S&P Goldman Sachs Commodity Index (S&P GSCI) and the Dow Jones Commodity Index (DJCI).

The S&P GSCI consists of 24 commodities and the weights are based on trading volume. It is therefore often seen as a benchmark for investment performance in commodities. The trading volume in energy commodities is higher than in any other commodity sector, so this index is heavily weighted toward energy (60% of the total weight in 2017; S&P Dow Jones Indices 2018). To get a more balanced portfolio across commodity sectors, the DJCI will be the focus of this article. It consists of 24 commodities divided into three major sectors: metals, energy and agriculture & livestock. The weights are based on the total volume traded, but unlike the S&P GSCI, the DJCI has constraints on the total weight allocated to each sector and commodity. By not allowing any of the three sectors to obtain more than 35% of the weight, and no single commodity to constitute less than 2% or more than 17% of the total index, the DJCI becomes well diversified. These restrictions also provide continuity and high liquidity for potential investors. The weights are rebalanced annually. See S&P Dow Jones Indices (2017) for a detailed methodology.

To select the risk factors (commodities) for our analysis we apply the method introduced in Paraschiv et al. (2015). We take the ten commodities with the largest weights in the DJCI for 2017 and form our test portfolio. This keeps the estimation tractable and the analysis more practical. The ten commodities add up to 76% of the DJCI, providing a good proxy for the movements of the entire index. To form our test portfolio we scale the weights proportionally so that they sum to 100%. The weights of the ten commodities are shown in Table 1. Our selection leaves us with three portfolio components in energy, three in metals and four in agriculture & livestock.

Table 1 Portfolio weights scaled up from the weights in DJCI 2017. Source: S&P Dow Jones Indices (2016)

3.2 Descriptive statistics

We extracted daily data from 1996 to 2017 from Thomson Reuters Eikon for continuous series of futures with approximately one year to maturity for the ten selected commodities. This leaves us with 5741 observations for each commodity. Details about the data extraction are found in Table 2.

For our data, the roll-over was done by Eikon. Their methodology can be found in Thomson Reuters (2012). For monthly futures data, the roll-over is done by jumping to the nearest futures contract, with the switch occurring on the last trading day. They use the nearest contract month to form the first values of the continuous series, and when that contract expires, the next data point is taken from the next contract with one year to maturity. They do not adjust for price differentials when rolling the series, but we found this methodology sufficient for our analysis, especially as our futures have one year to maturity.

Table 2 Data extraction details. Source: Thomson Reuters Eikon

Figure 1 shows the historical price movements of commodities measured in a relative index value. We observe a general co-movement especially in the 2000s, when the commodity markets experienced a uniform rise in prices until the financial crisis.

We observe several structural breaks across commodities, especially during the financial crisis in 2007–2009, which heavily affected commodity markets. A European Central Bank study (Delle Chiaie et al. 2017) provides evidence that global activity has clear implications for commodity markets. Their analysis shows that since the year 2000 the price drivers of oil have fundamentally changed, and during the financial crisis global activity strongly affected the oil price. The sharp drop in the oil price in 2014 was driven by several factors, among which the increased supply of unconventional oil and a significant shift in OPEC policy (Baffes et al. 2015). What differentiates the price drop in 2014 from previous collapses in the oil price is, according to Baffes et al. (2015), that the fluctuation could not be explained by weakened demand or an expansion of supply in isolation, but rather by a combination of the two.

Fig. 1 Historical daily price movements from 1996 to 2017 for the ten commodities in relative value

While all commodity sectors were affected during the financial crisis, the price drop in 2014 showed less spillover to non-energy sectors. This indicates a decoupling of the oil price from other commodities in agriculture and metals. According to Erdős (2012), the co-integration of oil and natural gas ended in 2009 after an increase in shale gas production. We observe that in more recent years the commodities in non-energy sectors do not necessarily follow the oil price as closely as in the past decade, potentially affecting the dynamics of the commodity markets.

Table 3 shows the daily descriptive statistics of our time series. McNeil, Frey and Embrechts (2015, p. 117) present six stylized facts of financial returns that can be observed, which also apply to our data: (1) Return series are not i.i.d.; (2) Series of absolute or squared returns show profound serial correlation; (3) Conditional expected returns are close to zero; (4) Volatility appears to vary over time; (5) Return series are leptokurtic or heavy-tailed; (6) Extreme returns appear in clusters.

We can also see graphically that the returns show autocorrelation, in line with stylized fact (2) (see Fig. 2). From the probability plot we see that the returns follow the t distribution better than the normal distribution, but the data deviate from the t distribution in the tails (see Fig. 3). This indicates heavy-tailed returns. Additionally, ARCH-GARCH tests show evidence of conditional heteroscedasticity, in line with stylized fact (4). We performed the test for lags 5, 10, 15 and 20. We also tested for stationarity. The augmented Dickey-Fuller, Phillips-Perron and KPSS tests show that the returns of all commodities are stationary.

Table 3 Daily descriptive statistics of commodity returns for years 1996–2017
Fig. 2 Sample autocorrelation plot of the returns and squared returns for WTI, as well as the daily logarithmic returns and a quantile–quantile plot. Corresponding graphs for all commodities are available upon request

Fig. 3 Probability plot for WTI returns versus the standard normal and t distributions. Corresponding graphs for the other commodities are available upon request

4 Methodology

4.1 Motivating the choice of modeling approach

As a result of the return characteristics of the ten commodities, shown in Sect. 3, we model the conditional volatility with a GARCH process. A GARCH process can be extended in various ways, depending on the purpose. For commodity markets it has been shown that volatility tends to increase more after large negative returns than after large positive returns (Nyström and Skoglund 2002b). We therefore see it as an appropriate choice to extend to a GARCH-GJR model (Alexander 2008a), which includes a leverage parameter to capture this asymmetry.

Since the focus of our study is on stress testing, extreme returns are of special interest. We have shown the deviations of the returns from the normal and Student t distributions, especially in the tails. Extreme Value Theory with the peak-over-threshold method has been employed in earlier studies (Aepli 2011; Paraschiv et al. 2015; Wang et al. 2010), showing good performance in modelling heavy tails. This is in line with the regulatory requirements for stress testing, which point out the need to pay special attention to tail risk in asset returns.

Due to the common bust and boom cycles and the co-integration of commodity markets, it is important to model the dependence structure between returns in a realistic way. In Aepli et al. (2017) the authors show the importance of modelling time-variation and asymmetries in the dependence structure between the risk factors of a portfolio of commodity futures. In support of the Basel III criticism of over-reliance on historical correlations, the authors introduce multivariate dynamic copula models as a superior alternative. There are numerous copulas to choose from, and the best choice depends on the aim of the analysis and on the data. The analyses in Paraschiv et al. (2015), Aepli (2011) and McNeil et al. (2015) find the t copula to be superior to the Gaussian copula in the context of modelling multivariate financial return data. For our purpose, we therefore prefer a t copula over the more common Gaussian copula. The asymmetry of our data would probably be better modelled by an asymmetric copula, but as the t copula keeps the analysis tractable and allows a direct comparison across stress tests, we find it suitable for this analysis. The subsequent subsections show the technical specification of the models employed.

4.2 GARCH

An autoregressive AR(p) process is a simple way to capture the autocorrelation in the individual commodity returns:

$$\begin{aligned} y_t=\mu +\displaystyle \sum _{i=1}^{p}\phi _iy_{t-i}+\epsilon _t \end{aligned}$$
(1)

where \(\epsilon _t\) is i.i.d. with mean zero and variance \(\sigma ^2\).

The residuals from the AR(p) model can be decomposed as:

$$\begin{aligned} \epsilon _t=z_t\sigma _t \end{aligned}$$
(2)

where \(z_t\) is i.i.d. with unit variance and \(\sigma _t\) is the conditional standard deviation.

The generalized autoregressive conditional heteroscedasticity (GARCH) model is then used to capture the time-varying volatility and the clustering of returns over time.

The symmetric normal GARCH assumes that the dynamic behaviour of the conditional variance is given by:

$$\begin{aligned} \sigma ^2_t = \omega + \alpha \epsilon ^2_{t-1}+\beta \sigma ^2_{t-1}, \quad \epsilon _t | I_{t-1}\sim N(0,\sigma ^2_t). \end{aligned}$$
(3)

The parameters of the GARCH model are estimated by maximising the value of the log likelihood function (see Alexander 2008a, p. 137).

Empirical evidence suggests that positive (negative) innovations to volatility correlate with negative (positive) innovations to returns (Nyström and Skoglund 2002a, p. 5). The rationale behind this is that negative impacts on returns tend to increase volatility, which is referred to as the “leverage” effect (McNeil et al. 2015), since a fall in equity value causes a rise in the debt-to-equity ratio (leverage) of a company, thereby making the stock more volatile. The leverage effect is captured by extending the classical specification of the GARCH model (Eq. (3)) by an extra parameter, the leverage parameter \(\lambda \). The GARCH-GJR can be written from the GARCH(1,1) model above, including the extra parameter (Alexander 2008a, p. 150):

$$\begin{aligned} \sigma ^2_t = \omega + \alpha \epsilon ^2_{t-1}+\lambda 1_{(\epsilon _{t-1}< 0)}\epsilon ^2_{t-1}+\beta \sigma ^2_{t-1} \end{aligned}$$
(4)

where the indicator function \(1_{(\epsilon _{t}< 0)}=1\) if \(\epsilon _{t}<0\). Thus, in addition to the specification in Eq. (3), an extra contribution to volatility for negative residuals accounts for the asymmetry. The sign of \(\lambda \) is naturally expected to be positive.
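To make the recursion concrete, the following minimal sketch evaluates the variance equation in Eq. (4) for a given residual series. Function and parameter names are ours, and the values are purely illustrative rather than estimates from this paper.

```python
import numpy as np

def gjr_garch_variance(eps, omega, alpha, lam, beta, sigma2_0):
    """Conditional variance recursion of Eq. (4) for a given residual series eps."""
    sigma2 = np.empty(len(eps) + 1)
    sigma2[0] = sigma2_0  # starting value for the conditional variance
    for t in range(len(eps)):
        leverage = lam * eps[t] ** 2 if eps[t] < 0 else 0.0  # indicator term for negative residuals
        sigma2[t + 1] = omega + alpha * eps[t] ** 2 + leverage + beta * sigma2[t]
    return sigma2

# illustrative call with simulated residuals and hypothetical parameter values
eps = np.random.default_rng(0).standard_normal(250) * 0.02
sigma2 = gjr_garch_variance(eps, omega=1e-6, alpha=0.05, lam=0.08, beta=0.90, sigma2_0=0.02 ** 2)
```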

In Nyström and Skoglund (2002a, pp. 10–12) the authors discuss which distribution should be assumed for the standardized residuals \(z_t\) of financial data, and find that using the normal distribution as an approximation for the high quantiles may lead to significant underestimation. Related literature suggests the t distribution as an alternative distributional assumption, which may be more accurate in capturing fat tails but cannot capture asymmetry. Nyström and Skoglund (2002a) use Extreme Value Theory to account for both the fat tails and the skewness and asymmetry of financial data. In this paper, we apply this combined method.

4.3 Extreme value theory

Extreme value theory (EVT) is the study of improbable but extreme events. EVT is more commonly used in weather and insurance applications, but has over the past decade become more popular also in financial studies. In Embrechts et al. (1997) the authors introduce a full framework for the analysis and argue that EVT should be given more attention in risk management for financial institutions. McNeil et al. (2015) propose a combined GARCH-EVT approach in which the GARCH standardized residuals are used as input to EVT, since EVT requires the observations to be i.i.d.

The theoretical framework for Extreme Value Theory is extensively shown in Nyström and Skoglund (2002b) and Embrechts et al. (1997).

The generalised Pareto distribution (GPD) is introduced for any \(\xi \in {\mathbf {R}},\beta \in \mathbf {R_+}\):

$$\begin{aligned} GP_{\xi ,\beta } (x) = 1 - \left( 1 + \xi \frac{x}{\beta }\right) _{+}^{-\frac{1}{\xi }} ,x \in {\mathbf {R}} \end{aligned}$$
(5)

where \(1/\xi \) is the tail index and x represents exceedances of the standardized residuals \(z_t\) over the threshold that delimits the extreme tail.

There are two practical methods in the literature for locating the threshold beyond which we define the tail of extreme values. The first is the block maxima method. In this approach we define blocks in the data and then extract the maximum (maximum loss) in each block. There are several drawbacks to this approach. The local maximum in a block might not capture the actual maxima in the time series, and the second- and third-largest observations in a block might be of significance to the investor but will not be captured by the block maxima approach.

The second method, the peak-over-threshold method, focuses on the events that exceed a specified, high threshold. Here the observations over the threshold are asymptotically described by the generalised Pareto distribution. The peak-over-threshold method is preferred by practitioners as it makes better use of the data, and so we will use this approach. Determining the optimal threshold is challenging, and several methods can be used. However, Nyström and Skoglund (2002b) argue that the threshold should be between 5 and 13% of the data.

We estimate the GPD parameters \({\tilde{\xi }}\) and \({\tilde{\beta }}\) by maximum likelihood. This is the preferred method as it provides estimates that are consistent and asymptotically normal as \(n \rightarrow \infty \), given that \(\xi > -1/2\). The maximum likelihood estimates are also nearly invariant to the level of the threshold, provided the threshold lies within a reasonable range.
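As a rough illustration, the peak-over-threshold fit can be performed with SciPy's generalised Pareto routine. The sketch below uses our own function names and simulated stand-in data; the paper does not state which software was used.

```python
import numpy as np
from scipy.stats import genpareto

def fit_gpd_tail(z, tail_fraction=0.10):
    """Fit the GPD of Eq. (5) to upper-tail exceedances of standardized residuals by maximum likelihood."""
    u = np.quantile(z, 1.0 - tail_fraction)           # threshold delimiting the extreme tail
    exceedances = z[z > u] - u                        # peaks over the threshold
    xi, _, beta = genpareto.fit(exceedances, floc=0)  # ML estimates of shape and scale; location fixed at zero
    return u, xi, beta

# example with heavy-tailed simulated data as a stand-in for GARCH standardized residuals
z = np.random.default_rng(1).standard_t(df=4, size=5000)
u, xi, beta = fit_gpd_tail(z, tail_fraction=0.10)
```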

Kernel smoothed interior

The data between the lower and upper tail thresholds are fitted with a Gaussian kernel estimator. A kernel estimator derives a smooth curve from the observed data that approximates the probability density.

4.4 Dependence structure

The GARCH-GJR-EVT approach focuses on the distribution of the individual risk factors by modelling the conditional volatility, the asymmetric adjustment and the fat tails. However, each risk factor is modelled in isolation, so the approach contains no information about the dependence structure, which is a very important part of stress testing.

A copula allows for modelling the joint distribution of two or more assets by only specifying the marginals. In Alexander (2008a), it is pointed out that one of the advantages of using a copula is that it isolates the dependence structure from the structure of the different marginal distributions. The theoretical background behind copulas was introduced in 1959 by Sklar. The use of copulas to measure dependence became more popular in the literature at the end of the 1990s, but only in the last decade have copulas become a popular method in financial applications.

We will only focus on the theoretical background of the symmetric t copula. For a more detailed and complementary background of copulas we refer the reader to Alexander (2008a) or McNeil et al. (2015).

The multivariate t copula can be derived from the multivariate t distribution, and is defined as (Alexander 2008a, p. 268):

$$\begin{aligned} C_v (u_1,\ldots ,u_n;\varvec{\Sigma }) = {\varvec{t}}_v(t_v^{-1} (u_1),\ldots ,t_v^{-1} (u_n)), \end{aligned}$$
(6)

where \({\varvec{t}}_v\) and \(t_v\) are multivariate and univariate Student t distribution functions. v is the degrees of freedom, and \(\varvec{\Sigma }\) is the correlation matrix.
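For illustration, dependent uniform variates from the t copula in Eq. (6) can be simulated directly from its stochastic representation. This is a minimal sketch with our own function names; the correlation matrix and degrees of freedom below are illustrative, not our estimates.

```python
import numpy as np
from scipy.stats import t as student_t

def simulate_t_copula(Sigma, nu, n_sims, seed=None):
    """Draw dependent uniform variates from a t copula with correlation Sigma and nu degrees of freedom."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)                     # imposes the correlation structure
    z = rng.standard_normal((n_sims, Sigma.shape[0])) @ L.T
    w = rng.chisquare(nu, size=(n_sims, 1)) / nu      # common chi-square mixing variable
    x = z / np.sqrt(w)                                # multivariate Student t draws
    return student_t.cdf(x, df=nu)                    # map each margin to U(0, 1)

# example: two risk factors with correlation 0.6 and 14 degrees of freedom
u = simulate_t_copula(np.array([[1.0, 0.6], [0.6, 1.0]]), nu=14, n_sims=10_000, seed=0)
```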

5 Estimation procedure

In this section, we give an overview of technical details concerning the calibration of modeling approaches specified in Sect. 4 and show estimation results.

5.1 Application of the GARCH-GJR

To find the appropriate lag structure for the GARCH(p, q) process we estimate models with p and q ranging from 1 to 6. To select the best model for the data we compute the Akaike (AIC) and Bayesian (BIC) information criteria (Box et al. 2015, p. 193). These criteria are preferred for selecting the best GARCH fit because they penalise models for additional estimated parameters.

$$\begin{aligned} AIC&=-2\log {\hat{L}}+2\,\text {NumParams}\\ BIC&=-2\log {\hat{L}}+\text {NumParams}\cdot \log (n) \end{aligned}$$

The specification that minimises these criteria is considered the best. Table 4 shows the AIC and BIC values for lags from 1 to 2. We tested up to 6 lags, but the results show insignificant parameters and higher AIC and BIC values than the displayed models. The table indicates that the GARCH(1, 1) is the optimal choice overall, and we continue with this specification. Similar results are found in Paraschiv et al. (2015) and Aepli (2011).
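A sketch of such a grid search, using the third-party Python arch package, is given below. The package choice and all names are our own assumptions; the paper does not state which software was used.

```python
from arch import arch_model  # third-party 'arch' package

def select_garch_order(returns, max_lag=2):
    """Grid search over GARCH(p, q) orders with an AR(1) mean and t-distributed residuals,
    keeping the specification with the lowest AIC."""
    best = None
    for p in range(1, max_lag + 1):
        for q in range(1, max_lag + 1):
            res = arch_model(returns, mean='AR', lags=1, vol='GARCH',
                             p=p, q=q, dist='t').fit(disp='off')
            if best is None or res.aic < best[0]:
                best = (res.aic, res.bic, p, q)
    return best  # (AIC, BIC, p, q) of the selected model
```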

Table 4 AIC and BIC criteria for t-distributed residuals for GARCH processes with various (p, q) lag structures
Table 5 Estimated GARCH-GJR(1, 1) parameters for the variance equation

Table 5 displays the estimated GARCH-GJR parameters. The parameters closely align with previous empirical results for financial assets. Following Alexander (2008a, p. 137), \(\beta \) is a measure of the persistence in conditional volatility regardless of what happens in the market. A large \(\beta \), above 0.9, indicates that high volatility following market stress will persist for a long time, which is the case for all of our commodities. \(\alpha \) measures the reaction of conditional volatility to shocks in the market. The sum of the two parameters determines the rate of convergence; for our risk factors the sum is close to 1, indicating high persistence and a relatively flat term structure of volatility forecasts. From the table we see that the estimated ARCH and GARCH coefficients, \({\hat{\alpha }}\) and \({\hat{\beta }}\), are significantly different from zero for all commodities. We find a significant leverage effect in the returns of oil products, copper and live cattle. Empirical evidence shows that after 2008 holding crude oil as a financial asset gave higher returns than holding it as a commodity, given the reduction in convenience yields and a change from backwardation to contango (see Kolodziej et al. 2014). We did not find evidence for the expected leverage effect in agricultural commodities and metals (excluding copper), though. In this case, negative impacts on returns are associated with negative shocks to volatility.

Figure 4 displays the filtered residuals and the filtered conditional standard deviation of WTI, as given in Eq. (2). The other commodities show similar results, which are available upon request. We observe that the GARCH process realistically models the volatility clustering pattern in commodity returns.

Fig. 4 Filtered residuals and filtered conditional standard deviation for WTI. Corresponding graphs for the other commodities are available upon request

Fig. 5 Sample autocorrelation plot of standardised and squared standardised WTI residuals, showing that the residuals are now i.i.d. Corresponding figures for the other commodities are available upon request

To be able to apply EVT to the tails we need to standardise the filtered residuals from each return series. The standardised residuals are calculated by dividing the filtered residuals by the conditional standard deviation, \(z_t=\frac{\epsilon _t}{\sigma _t}\), to obtain series with mean zero and unit variance. The standardised residuals are plotted in Fig. 5. We can now see graphically that the residuals are approximately i.i.d. for WTI. The other commodities show similar results, which are available upon request. The residuals can now be modelled by EVT.

5.2 Application of EVT

5.2.1 Estimation of the semi-parametric cumulative distribution functions

To locate the threshold, we fit the generalized Pareto distribution (GPD) to the standardized residuals, testing for parameter stability across threshold values between 5% and 15%. This allows us to find a threshold where the tail indexes stabilise. In Table 6 we display the upper tail index for the 7%, 10%, 11% and 12% thresholds. The remaining parameters for the different thresholds are reported in Table 13. Notice that the tail index naturally becomes smaller as the threshold allows more data into the extreme tail. However, generally speaking, the value of \(\xi \) observed at the 10% threshold is similar at subsequent thresholds.

Table 6 Comparison of upper tail parameters (\(\xi \)) for different thresholds

We therefore define the lower and upper tails of all commodity returns as starting at the 10% and 90% quantiles, respectively. Previous studies (Aepli 2011; Paraschiv et al. 2015) have chosen a 10% threshold. A common threshold also allows a direct comparison of parameter estimates across commodity returns.
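Such a stability check amounts to re-fitting the tail over a grid of thresholds and inspecting the estimated shape parameter. A minimal sketch, reusing the SciPy-based fit from Sect. 4.3, is shown below; the input file name is hypothetical.

```python
import numpy as np
from scipy.stats import genpareto

# Re-fit the upper-tail GPD shape parameter xi over a grid of candidate thresholds.
z = np.loadtxt("wti_standardized_residuals.txt")         # hypothetical input: standardized residuals
for frac in (0.05, 0.07, 0.10, 0.11, 0.12, 0.15):
    u = np.quantile(z, 1.0 - frac)                       # threshold at the (1 - frac) quantile
    xi, _, beta = genpareto.fit(z[z > u] - u, floc=0)    # ML fit to the exceedances
    print(f"tail fraction {frac:.2f}: xi = {xi:.3f}, beta = {beta:.3f}")
```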

The next step is to fit the generalised Pareto distribution to the exceedances over threshold by using maximum likelihood. By optimising the log-likelihood function we estimate the tail indexes \(\xi \) and scale parameters \(\beta \).

The estimated parameters for our risk factors are listed in Table 7. We have \(\xi >0\) for nine of the ten risk factors. This case coincides with the Fréchet distribution, which is characterized by a lower bound, and gives an indication of extreme tails in commodity returns. Only copper shows a negative sign for \(\xi \) (Weibull distribution, having an upper bound). The lower tails show a positive sign of \(\xi \) for all commodities. This suggests both fat upper and lower tails, and furthermore tail asymmetry. The findings are consistent with the theory and empirical results for financial time series (Nyström and Skoglund 2002b; Embrechts et al. 1997).

Table 7 Maximum likelihood estimators for the generalized Pareto distribution parameters
Fig. 6 Generalized Pareto upper tail of the standardised residuals, fitted versus empirical. Corresponding figures for the other commodities are available upon request

In Fig. 6 we display the empirical cumulative distribution function of the upper tail of the standardised residuals for WTI. The fitted distribution follows the empirical exceedances closely, and so the chosen distribution is well suited to estimate the tails for the commodities.

The last step is to combine the parametric generalized Pareto tails for each commodity with the corresponding Kernel smoothed interior to obtain the entire semi-parametric cumulative distribution function. Figure 7 displays the semi-parametric empirical cumulative distribution function of WTI standardized residuals. The piecewise distribution object allows interpolation within the interior of the CDF, displayed in black, and extrapolation in each tail, displayed in red and blue for the lower and upper tail, respectively. The extrapolation allows for estimation of quantiles outside the historical record, and is therefore important for the stress testing exercise.
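One possible way to assemble such a piecewise, semi-parametric CDF is sketched below with a Gaussian kernel interior and GPD tails. This is an approximation with our own function and variable names, not the exact implementation used in the paper.

```python
import numpy as np
from scipy.stats import gaussian_kde, genpareto

def semiparametric_cdf(z, u_lo, u_hi, xi_lo, beta_lo, xi_hi, beta_hi):
    """Piecewise CDF: GPD in each tail, Gaussian-kernel-smoothed interior in between."""
    kde = gaussian_kde(z)
    p_lo = np.mean(z < u_lo)                          # empirical mass assigned to the lower tail
    p_hi = np.mean(z > u_hi)                          # empirical mass assigned to the upper tail
    interior_mass = kde.integrate_box_1d(u_lo, u_hi)  # kernel mass of the interior, used for rescaling

    def cdf(x):
        if x < u_lo:                                  # lower GPD tail (extreme losses)
            return p_lo * genpareto.sf(u_lo - x, xi_lo, scale=beta_lo)
        if x > u_hi:                                  # upper GPD tail (extreme gains)
            return 1.0 - p_hi * genpareto.sf(x - u_hi, xi_hi, scale=beta_hi)
        # kernel-smoothed interior, rescaled so that the pieces join at the thresholds
        return p_lo + (1.0 - p_lo - p_hi) * kde.integrate_box_1d(u_lo, x) / interior_mass

    return cdf
```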

Fig. 7 Semi-parametric empirical cumulative distribution function of WTI. Corresponding figures for the other commodities are available upon request

5.3 Application of the t copula and simulation steps

We fit a t copula to the standardized residuals of the portfolio return series. Estimates of the degrees of freedom for the baseline scenario are given in Table 9. Given the parameters of the t copula (the correlation matrix \(\Sigma \) and the degrees of freedom parameter), we simulate jointly dependent portfolio returns. This is done by first simulating dependent uniform variates from the fitted t copula and then transforming them into standardised residuals by inverting the semi-parametric marginal cumulative distribution function of each risk factor. We thereby extrapolate into the generalized Pareto tails and interpolate within the smoothed interior. This gives simulated standardized residuals consistent with those obtained from the GARCH-GJR(1, 1) filtering process described in Sect. 5.1. The residuals show no autocorrelation and are i.i.d. with unit variance.
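The inversion step can be sketched as a numerical root search on each marginal CDF. The function and variable names below are ours; the commented usage line assumes copula uniforms and per-commodity semi-parametric CDFs built as in the earlier sketches.

```python
import numpy as np
from scipy.optimize import brentq

def invert_cdf(cdf, u, lo=-20.0, hi=20.0):
    """Numerically invert a strictly increasing marginal CDF at probability u."""
    return brentq(lambda x: cdf(x) - u, lo, hi)

# 'uniforms' would come from the t copula simulation and 'cdfs' would be the list of
# per-commodity semi-parametric CDFs; both names are ours and shown only to indicate the mapping:
# residuals = np.array([[invert_cdf(cdfs[j], u_ij) for j, u_ij in enumerate(row)] for row in uniforms])
```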

5.4 Risk metrics

To compare the implications of various stress scenarios for the portfolio profit and loss profile, the use of Value at Risk (VaR) and Conditional Value at Risk (CVaR) is common. These risk metrics provide an indication of the quantile losses. VaR is the maximum potential loss at a given confidence level. This risk metric is criticized because it is not coherent and ignores extreme values beyond the Value at Risk. CVaR corrects for the limitations of VaR (Alexander 2008b). For our analysis we include both risk metrics at various quantiles. This is in line with European Banking Authority (2017, p. 28): “The institutions should stress the identified risk factors using different degrees of severity as an important step in their analysis to reveal nonlinearities, threshold effects, i.e. critical values of risk factors beyond which stress responses accelerate”.
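As a simple illustration, both metrics can be read off a simulated P&L sample. The function name and the numbers below are ours and purely illustrative.

```python
import numpy as np

def var_cvar(pnl, alpha=0.99):
    """VaR and CVaR (expected shortfall) of a simulated P&L sample at confidence level alpha."""
    losses = -np.asarray(pnl)               # convert P&L to losses
    var = np.quantile(losses, alpha)        # alpha-quantile of the loss distribution
    cvar = losses[losses >= var].mean()     # average loss beyond the VaR
    return var, cvar

# example on a simulated P&L sample (illustrative numbers only)
pnl = np.random.default_rng(2).normal(0.0, 0.02, size=20_000)
print(var_cvar(pnl, alpha=0.99))
```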

5.5 Simulation steps

Based on the technical specifications given in Sect. 4, we simulate the profit and loss distribution of our portfolio of commodity returns at the end of the given time horizon as realistically as possible, both without and with the impact of stress. Generally speaking, in each type of stress scenario the technical simulation steps are as follows (a compact sketch of the return reconstruction follows the list):

1. To simulate jointly dependent commodity returns with the parameters of the t copula, one first of all has to generate the corresponding dependent standardized residuals. This is done by simulating dependent uniform variates based on the estimated degrees of freedom parameter and correlation matrix.

2. We transform them by inversion of each risk factor's semi-parametric marginal cumulative distribution function (Pareto tails and Gaussian kernel smoothed interior). The result is standardized residuals consistent with the ones obtained from the filtration of \(z_t\) in the GARCH model, namely i.i.d. with unit variance.

3. These simulated standardized residuals are then employed as the i.i.d. noise processes of the GARCH model. We simulate the asymmetric GARCH model to re-establish the heteroscedasticity and the autocorrelation of the original commodity returns. We use as seed for the GARCH model the last observed returns of the data set and the corresponding conditional volatilities.

4. The weights of the portfolio are held constant over the simulation horizon. We calculate the simulated profit and loss (P&L) distribution, the maximum simulated loss, the VaR (Value at Risk) and the expected shortfall (ES).
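The sketch below illustrates step 3 for a single commodity, assuming an AR(1) mean equation and the GJR-GARCH(1,1) variance of Eq. (4). Function and parameter names are ours.

```python
import numpy as np

def rebuild_returns(z_sim, omega, alpha, lam, beta, mu, phi,
                    r_last, eps_last, sigma2_last):
    """Feed simulated standardized residuals (n_sims x horizon) through the
    AR(1)-GJR-GARCH(1,1) filter of one commodity to rebuild return paths."""
    n_sims, horizon = z_sim.shape
    returns = np.empty_like(z_sim)
    sigma2 = np.full(n_sims, sigma2_last)   # seed: last observed conditional variance
    r_prev = np.full(n_sims, r_last)        # seed: last observed return
    eps_prev = np.full(n_sims, eps_last)    # seed: last filtered residual
    for t in range(horizon):
        sigma2 = omega + (alpha + lam * (eps_prev < 0)) * eps_prev ** 2 + beta * sigma2
        eps = z_sim[:, t] * np.sqrt(sigma2)
        returns[:, t] = mu + phi * r_prev + eps
        r_prev, eps_prev = returns[:, t], eps
    return returns

# The portfolio P&L over the 22-day horizon is then obtained by cumulating the simulated
# returns per commodity and aggregating them with the fixed portfolio weights.
```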

6 Stress testing and simulation results

In this section, we will perform stress tests on our portfolio of commodity futures.

By definition, stress testing is a risk management tool used to evaluate the potential impact on the portfolio profit and loss profile of unlikely, although plausible, historical or hypothetical events or movements in the portfolio risk factors. We shock various components of the model, one at a time or simultaneously, and assess which ones have the largest effect on the simulated profit and loss. Shocks are linked to stress scenarios as explained in this section.

For comparison purposes, we simulate a baseline scenario by calibrating the model on the entire data set and compare its profit and loss profile with those derived from the stress scenarios described below. For each scenario we run 20,000 simulations over a 22-day horizon, which represents the average number of working days per month. Note that the portfolio weights are held fixed over the risk horizon and that the simulation ignores any transaction costs required to re-balance the portfolio (the daily re-balancing process is assumed to be self-financing).

6.1 Stress test scenarios

We limit the study mostly to hybrid scenarios, where we calibrate the model for the risk factors or the copula to the restricted financial crisis data set versus the entire data set, and shock them simultaneously or one at a time. Besides hybrid scenarios, purely hypothetical scenarios could have been implemented. Examples could be a scenario with a recession in China, which would decrease the demand for aluminium, oil, copper, soybeans and natural gas, looking at the change in dependence structure and volatility. Other scenarios could be natural disasters that affect crops or diseases that affect grains or livestock. This is, however, out of the scope of this article. We refer the reader to Aepli (2011) for stress testing with hypothetical scenarios.

Our analysis consists of seven different scenarios. A brief description of each scenario follows before the analysis is conducted.

Baseline scenario:

The baseline scenario is a default scenario simulation with the t copula and GARCH-GJR process calibrated on the entire data set, the historical time period from 1996 to 2017. None of the parameters are stressed in the baseline scenario. The baseline scenario is constructed to be a reference for normal times to assess the effect of stressing parameters compared to the steady state.

Historical scenario:

For the historical stress scenario we use the years 2007 and 2008 to observe the severity of losses in the financial crisis. This time period is known for high market stress with high return volatility and captures the simultaneous price drop during the financial crisis (see Sect. 3 for discussion). The historical scenario stems from the empirical distribution of returns during the financial crisis; we refer to the empirical profit and loss distribution as observed between 2007 and 2008. Unfortunately, due to the limited number of observations when restricting ourselves to observed returns, extreme loss quantiles are hard to estimate.

Hybrid scenarios:

Due to the limitations of the historical empirical scenario, we construct five hybrid scenarios. Hybrid scenarios allow extrapolation beyond realized returns, and are therefore appropriate for estimating extreme quantiles and events that have not yet occurred. The focus of our hybrid scenario construction is to examine which of the estimated parameters affect the portfolio profit and loss distribution most in stress testing exercises. The parameters that change between the different scenarios are the dependencies between risk factors, measured by the degrees of freedom and correlations of the t copula, and the GARCH-GJR coefficients. The parameters of the GARCH-GJR process and the t copula are re-calibrated on the stress horizon, following the same procedure as in Sect. 4. The re-estimated generalized Pareto distribution tail parameters can be found in Table 14. To isolate the effect of the various parameters, we compare scenarios by mixing parameters from the baseline with those from the period of financial distress. The hybrid scenarios are described in Table 8.

Table 8 Description of input parameters for simulation in hybrid scenarios

Risk factor stress scenario aims to show the impact of stressing the model parameters describing the marginal distributions of the risk factors on the portfolio profit and loss distributions, without a change in the dependence between the risk factors.

Dependence stress scenario isolates the effect of stressing the dependence between the returns of portfolio components on the profit and loss distribution, without changing the parameters for the individual factors model (GARCH-GJR model).

Full stress scenario aims to simulate the effects of a recurring financial crisis on the portfolio. All model parameters refer to the financial crisis period.

In the degrees of freedom shock we shock only the degrees of freedom of the copula, leaving all other parameters unchanged.

Risk factor stress without EVT highlights how the application of Extreme Value Theory to model the tails of portfolio components' returns affects the profit and loss distribution of the portfolio. The risk factor distributions are here not modelled with EVT, but with a Student t distribution (see Sect. 4.1).

6.2 Comparative analysis of simulated profit and loss distributions

6.2.1 Baseline scenario versus historical scenario

Figure 8 displays the simulated profit and loss (P&L) distribution for the returns in the baseline scenario versus the empirical distribution of P&L in the historical scenario. The simulated returns deviate in both the upper and lower tails. This can be further seen in Table 9, where the maximum simulated loss is significantly larger for the historical returns than for the simulated baseline. These results might be linked to the symmetry of the t copula. The baseline scenario represents normal market conditions, while the historical scenario gives an indication of the portfolio's profit and loss profile under the assumption that a similar crisis recurs. This result highlights the importance of implementing forward-looking scenarios, both to simulate extreme returns in comparison to the baseline and to simulate beyond the empirically observed profit and loss profile.

The previous statement is further substantiated when we look at the very high confidence levels displayed in Table 9. The historical scenario is limited to already experienced events, so there are not enough observations in the data set to calculate the expected shortfall at very high confidence levels. This emphasizes the discussion of the scenario types in Sect. 2.2 and the drawback of using historical scenarios highlighted in Basel Committee on Banking Supervision (2009). In addition, the historical scenario neglects the dependence structure between the risk factors, which is highly relevant in stress testing. In European Banking Authority (2017, p. 24), it is stated that stress tests should take into account changes in correlations between risk types and risk factors and that correlations tend to increase during times of economic or financial distress. This statement and its implications for stress testing exercises will be further investigated in the next subsection, where we analyse the hybrid scenarios.

Fig. 8 Portfolio returns simulation, baseline versus historical scenario

Table 9 Simulation metrics for baseline scenario and historical scenario CDF
Table 10 Risk metrics for hybrid scenarios

6.2.2 Hybrid scenarios

Table 10 shows the risk metrics for the five hybrid scenarios. The tail dependence of the simulated returns is measured by the degrees of freedom parameter of the t copula. From the entire data set the estimated DoF is 15.28, while during the stressed period it shifts to 13.78. The decrease in DoF signals that the tail dependence in the commodity portfolio increases during times of stress. Lower degrees of freedom indicate a higher tendency of extreme events to occur jointly across risk factors (Paraschiv et al. 2015), which is in line with our simulation result.

Fig. 9 Simulated one-month portfolio returns CDF for baseline versus hybrid scenarios: risk factor stress, dependency stress and full stress

Table 11 Correlation increases between baseline and stress scenario

6.2.3 Risk factor stress versus dependency stress

Figure 9 shows the baseline scenario, the scenario where we stress the dependencies between the risk factors, the full stress scenario and the scenario where the individual risk factors are stressed. Starting from the baseline, we can see that by only stressing the dependencies, the simulation displays more severe losses (green vs. red). The correlation matrix and the decrease in DoF show that the dependencies between the risk factors increase in times of stress (see Table 11), which leads to larger simulated losses for the portfolio overall. However, by stressing only the GARCH-GJR-EVT parameters for the individual risk factors, the effect on the portfolio P&L is even stronger (red vs. light blue). This result indicates that stressing the model parameters describing the marginal distributions of portfolio returns has a larger impact on the profit and loss distribution than stressing the dependencies between the risk factors. Shocks to the returns of portfolio components thus have a higher impact on the profit and loss than shifts in their dependence structure and correlations.

We compare further the mentioned scenarios with the full stress scenario. Naturally this stress scenario simulates the largest tail losses since both the dependencies and the individual parameters are stressed (black line). Comparing the risk metrics in Table 10 we see that the risk factor stress scenario simulates the second largest losses, after the full stress scenario, which substantiates the previous result.

The full stress scenario in Paraschiv et al. (2015) gives more severe losses overall. Our study replicated the methodology for modeling the marginal distributions of portfolio components and dependence structure, allowing for a direct comparison of results. The difference between the results might be explained by: (i) The difference in weights of the test portfolio where our study uses weights from 2017 while Paraschiv et al. (2015) use the weights from 2013. (ii) Our extended data sample. We include the years 1996–1998, and 2011–2017 beyond the original data set. (iii) Differences might be due partially to the randomness in the scenario generation.

In Paraschiv et al. (2015) natural gas makes up 15.11% of the portfolio, while in our portfolio it is 9.6%. From the descriptive statistics, natural gas is by far the most volatile commodity, and we observe that natural gas has performed poorly over the last decades compared to most of the other commodities. Several structural breaks in natural gas prices are also included in our data set, examples being the supply shortfall in Libya in 2011 and the Russian export stop in 2012 (Nick and Thoenes 2014). The larger weight of natural gas might therefore be one of the main reasons for the more severe simulated losses in the original study.

Soybean is the commodity with the second most deviating weight from Paraschiv et al. (2015). In our portfolio soybean makes up 15.66% of the total weight, in comparison to 6.89% in Paraschiv et al. (2015). Over our time period soybean returns showed low volatility. We expect that the increased allocation to soybean in our portfolio works in the same direction as the down-scaling of natural gas.

6.2.4 DoF shock versus dependency stress

Fig. 10 Simulated one-month portfolio returns CDF for baseline versus hybrid scenarios: dependency stress and DoF shock

In Fig. 10 we compare the scenario where both correlation and DoF are stressed (green line), with the scenario where only the DoF are shocked from 15.28 for the baseline to 13.78 for the financial crisis (purple line). For both scenarios the parameters of the GARCH-GJR-EVT model are calibrated on the entire data set. By doing so we can discuss the impact of correlations as a driver of losses in isolation.

From Fig. 10 we observe that a small shock to the degrees of freedom does not provide a significant stress scenario. The baseline scenario and the DoF shock scenario do not deviate much from each other (red vs. purple), although the DoF shock scenario simulates larger extreme losses in the lower quantiles (see Table 10). Furthermore, we see that the scenario where both the correlations between the risk factors and the DoF are shocked displays the largest simulated loss. This indicates that shocking the DoF in isolation is of limited value in a stress testing exercise.

For more forward-looking hypothetical scenarios, the implementation of more severe shocks to the DoF might be of interest. We therefore tested a set of hypothetical scenarios with several more substantial downward changes to the DoF. The results can be found in Table 12. We used the degrees of freedom from the financial crisis period and the one corresponding to the baseline scenario, and then added more extreme values at both ends. For our data we found that extreme shocks to the DoF yield no substantial increase in simulated tail losses.

Table 12 Risk metrics from different DoF shock scenarios

6.2.5 Impact of EVT

In Fig. 11 we display two hybrid scenarios to highlight the importance of implementing EVT for modelling extremely large return changes of portfolio components before running the actual stress test. For both scenarios the correlation matrix and the DoF parameter are calibrated on the entire data set, so the difference between them comes from how the individual risk factors are modelled. In the risk factor stress scenario the tail distributions are modelled with EVT where the tail indexes are calibrated on the financial crisis data, and the other scenario with a Student t distribution. One can see that the scenario where EVT is implemented estimates more severe losses, where simulated 99.99% CVaR is − 35.79% in comparison to − 33.72% for the scenario without EVT. Overall, the profit and loss distribution in the stress test excluding EVT is shifted to the right. Applying EVT strengthens the accuracy and understanding of the most extreme, potential losses. In light of this, we conclude that the risk is potentially underestimated when the individual risk factor distributions disregard extreme events (Embrechts et al. 1997).

Fig. 11 Simulated one-month portfolio returns CDF for baseline and hybrid scenarios: risk factor stress and scenario without modelling with EVT

7 Conclusion

In this study, we update the analysis in Paraschiv et al. (2015) with a more extensive data set and a more detailed focus on stress testing. In particular, hybrid and hypothetical scenarios are explored, in line with the regulatory requirements for stress testing calling for forward-looking scenarios. Our stress testing exercises are based on rearranging shocks linked to specific extreme events or time periods to reveal the importance of correlations, tail dependence and extreme movements in portfolio components for the profit and loss distribution. This is the first study in the literature that clearly illustrates the marginal impact of the model assumed for the individual portfolio components versus the marginal role of tail dependence and correlations on the portfolio risk profile.

We mimic the DJCI by forming a portfolio of ten commodities. We use a GARCH-GJR approach to model the stylized facts observed in commodity return data, and implement Extreme Value Theory to model the tails accurately. To account for the dependence structure we apply a t copula. We then stress test the portfolio with different scenarios, examining the drivers of the profit and loss distribution.

Our study reveals three main results. First, we bring empirical evidence showing the importance of hybrid (forward-looking) scenarios for comprehensive stress testing. In addition, we show the value added of forward-looking over historical scenarios and show numerically the drawbacks of the latter. We confirm the Basel III stress testing requirements, according to which different stress testing approaches cannot be used in isolation but must be combined for a comprehensive picture. Our second finding is that, before implementing a stress test, special attention should be given to an accurate model identification for the evolution of returns of portfolio components and their dependence structure. Our third finding enhances the previous findings in Paraschiv et al. (2015) by disentangling the effects of stressing, one at a time, the model parameters for the individual portfolio components versus their correlations and tail dependence. We find clear evidence that the former matters more than the latter for the portfolio profit and loss profile. At the same time, our analysis represents an integration of the “model risk” concept into stress testing exercises, which is highly relevant for portfolio managers. Special attention should be given to extreme tails, in line with the regulatory frame on stress testing.

Our analysis is limited by the number of stress scenarios and by simulation noise from the random number generator. Stress scenarios display tendencies, and the numbers generated cannot be transferred directly to risk management. On the other hand, the simulations can form expectations and contribute to an overall understanding of stress testing for capital requirements. For further analysis it would be interesting to update our analysis with asymmetric copulas to better capture the dependence structure. In addition, a more extensive use of hypothetical shocks to a commodity portfolio would be of interest.