1 Introduction

In this paper, we are concerned with volatility forecasting in the Chinese commodity futures market. Volatility modeling and forecasting is a heavily researched area, as volatility is considered the "barometer for the vulnerability of financial markets and the economy" (Poon and Granger 2003, p. 479) and is central to asset pricing, derivative valuation, portfolio allocation, and risk management. We are interested in this particular market in part because it has become an important part of the global futures markets with tremendous trading volume.Footnote 1 \(^{,}\) Footnote 2 More importantly, this market is regulated by two unique institutional rules that make it interesting to explore.

The first regulation is the time-dependent margin rate, whereby the margin as a fraction of the contract value increases as contracts move closer to delivery. Take sugar as an example. The margin rate for deposit two months prior to delivery is 6 % of the contract value for an investor. In the month before delivery, it increases to 8 % in the first 10 days, 15 % between the 11th and the 20th day of the month, and 25 % in the final 10 days of the month, culminating in 30 % in the delivery month.Footnote 3 The second regulation is that, although they represent 97 % of all investors in the futures markets, individual investors are not allowed to trade nearby contracts.Footnote 4 Both regulations effectively push market participation and trading volume to more distant contracts, with implications for market liquidity.

Our contribution to the literature is that we take into account unique institutional regulations of this market and design empirical volatility forecasting exercises that are appropriate for the characteristics of the market and the data it generates. Our data on aluminum, copper, and fuel oil consistently show that contracts with three months to delivery enjoy the best liquidity. We are not the first to note this pattern (see Liu et al. 2014; Peck 2008), but we are the first to offer solid and detailed evidence. Using 5-min returns data over long sample periods, we compute three popular liquidity measures that capture different aspects of liquidity, namely the effective spread of Roll (1984), the proportion of zero returns of Lesmond et al. (1999), and the Amihud (2002) illiquidity measure (Goyenko et al. 2009). Our results show that contracts with three months to delivery are the most liquid as they exhibit the lowest effective spread, the lowest percentage of zero returns, and the smallest value for the Amihud (2002) illiquidity measure. This is different from the majority of futures markets and contracts for which the nearby contracts are usually the most liquid (see Baillie et al. 2007; Lee 2009; and the references therein). Crucially, this liquidity pattern results from the unique institutional environment in which trading takes place.

On the other hand, being an emerging market, the Chinese commodity futures market exhibits a large proportion of zero returns (Bekaert et al. 2007), and this is particularly evident in our 5-min return series. Even for the most liquid 3-month to maturity contracts, the fraction of zero returns is as high as 36.27, 23.90, and 31.50 % on average, respectively, for aluminum, copper, and fuel oil. In the existing literature, intraday data are widely adopted for volatility modeling and forecasting as they are shown to contain more information and provide more accurate and efficient forecasts (see Fuertes et al. 2015; Hseu et al. 2007; Shi and Lee 2008; and the references therein). However, the large proportion of zero returns in our data suggests that a higher data sampling frequency does not necessarily translate into better forecasting performance due to information loss or noise in the data (Bandi and Russell 2006; Phillips and Yu 2009). Hence we choose to perform volatility forecasting by aggregating 5-min data into 15-, 30-, and 60-min intraday returns and by computing daily returns from daily prices, so that we can observe and compare how well different models capture the volatility dynamics given the data.

Equally important for the volatility forecast comparison is the choice of the true volatility proxy. While true volatility is a latent variable that cannot be observed in the market, an efficient and accurate representation of it is of great importance for the evaluation of volatility forecasts [see Andersen et al. (2010) for an excellent survey]. In this paper, we adopt three different proxies for the true daily volatility. In addition to the widely adopted realized volatility measure of Andersen and Bollerslev (1998), we also consider the median-based measure of Andersen et al. (2012) and the range-based proxy advocated by Parkinson (1980), both of which are shown to be robust to zero returns, potential jumps in the underlying price dynamics, and other microstructure related effects.

In terms of volatility models, we begin with the conventional generalized autoregressive conditional heteroskedastic (GARCH) model of Bollerslev (1986, 1990). Our choice of models is also motivated by Baillie et al. (2007), who document strong long memory properties in commodity futures and argue that the fractionally integrated GARCH (FIGARCH) model captures this feature very well. At the same time, a natural alternative that works well at capturing the long memory property in realized volatility is the autoregressive fractionally integrated moving average (ARFIMA) model of Granger (1980) and Granger and Joyeux (1980). The two models differ in the manner in which information is extracted from intraday data: for the ARFIMA model, intraday returns are first aggregated to obtain daily realized volatility, which is then modeled and forecasted at the daily level, whereas for the FIGARCH model, deseasonalized intraday data are fed directly into the model. It is therefore empirically interesting to compare the performance of the two models using our data.

Our empirical analysis reveals a host of interesting findings. First, in terms of the out-of-sample forecasting performance, the Diebold and Mariano (1995) and West (1996) test applied on a pairwise basis and the superior predictive ability test of Hansen (2005), which tests across alternative models simultaneously, suggest that the ARFIMA model consistently outperforms the GARCH-type models in the out-of-sample tests. It is the best performing model in 11 out of 15 commodity/volatility proxy combinations, and for the remaining four combinations the difference between the forecasting performance of the ARFIMA model and that of the best performing model is statistically insignificant at any conventional level. In other words, the ARFIMA model consistently produces the best forecasts or forecasts not inferior to the best in statistical terms.

This result highlights the importance of incorporating the long memory dimension in volatility modeling, in line with the literature. It also contributes to the discussion in the literature of whether the FIGARCH or the ARFIMA model is empirically better at capturing the long memory feature in the volatility dynamics (Chortareas et al. 2011). Given that the intraday Chinese commodity futures data contain a large proportion of zero returns, which are fed directly into the FIGARCH model, it is not surprising that the ARFIMA model performs better.

Second, we show that within the GARCH family of models, the forecasting performance using the daily data is consistently as good as, if not better than, that using the intraday data. This finding suggests that, owing to the high percentage of zero returns, the GARCH-type models may not be very efficient in utilizing the information contained in the intraday data of this particular market for volatility forecasting purposes.

Finally, it is interesting to note that although sugar contracts with January maturity and November maturity differ massively in terms of trading volume and show different levels of liquidity, the underlying volatility dynamics are nevertheless captured by the same model at the same data sampling frequency. For example, when the median- and range-based proxies are adopted, both futures contracts are best forecasted by the ARFIMA model using daily realized volatility obtained from the 60-min returns. This further suggests that the ARFIMA model is a reliable and robust tool for forecasting volatility regardless of the underlying liquidity level, with practical implications for traders and risk managers.

The rest of the paper is structured as follows. In Sect. 2, we briefly outline the alternative volatility models, the proxies for the true volatility dynamics, and the statistical metrics for the out-of-sample volatility forecast evaluation. Section 3 describes the data and the model estimates. In Sect. 4, we discuss and analyze the main empirical findings. Finally, Sect. 5 concludes. Details of the three liquidity measures are provided in the "Appendix".

2 Models and statistical evaluation

2.1 Volatility models

In this paper, we consider four popular volatility models at four different data sampling frequencies for volatility modeling and out-of-sample forecasting. In particular, we make use of the: (1) intraday GARCH, integrated GARCH (IGARCH), and FIGARCH models at the 15-, 30-, and 60-min intervals; (2) daily GARCH, IGARCH, and FIGARCH models; and (3) ARFIMA model applied to the daily realized volatility computed from the 15-, 30-, and 60-min intervals. The model specifications are briefly outlined below.

2.1.1 GARCH model

The GARCH model is the workhorse in the volatility estimation and forecasting literature (see Bollerslev 1986, 1990; among others). We use an ARMA(1,1) process in the conditional mean equation of the GARCH-type models. To allow for possible fat tails, we assume that the innovations in the GARCH process are independently and identically distributed and follow a Student’s t-distribution, and we implement the ARMA(1,1)-GARCH(1,1) model using both intraday and daily data. The model specification is given by

$$\begin{aligned} {\tilde{r}}_{t,n} &= \mu +\gamma {\tilde{r}}_{t,n-1}+\varepsilon _{t,n}+\theta \varepsilon _{t,n-1},\quad \varepsilon _{t,n}\vert \Omega _{t,n-1}\sim D_{v}(0,h_{t,n})\nonumber \\ h_{t,n}& = \omega +\alpha \varepsilon ^{2}_{t,n-1}+\beta h_{t,n-1}, \end{aligned}$$
(1)

where \({\tilde{r}}_{t,n}\) is the deseasonalized logarithmic return on day t for the nth time interval [see Eqs. (10)–(12)], \(\mu \), \(\gamma \), and \(\theta \) are the parameters of the conditional mean equation, and \(\omega \), \(\alpha \), and \(\beta \) are the parameters of the conditional variance equation.Footnote 5 The error term \(\varepsilon _{t,n}\), which is conditional on the information set \(\Omega _{t,n-1}\), follows a Student’s t-distribution (denoted by \(D_v\)) with zero mean, variance \(h_{t,n}\), and v degrees of freedom. The GARCH model requires that \(\alpha +\beta <1\) for the volatility process to be stationary. For the IGARCH model, however, the corresponding requirement is \(\alpha +\beta =1\).
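As a minimal sketch of the conditional variance recursion in Eq. (1), the Python function below iterates \(h_{t,n}=\omega +\alpha \varepsilon ^{2}_{t,n-1}+\beta h_{t,n-1}\) over a given sequence of residuals. The function name `garch11_variance`, the choice of starting the recursion at the unconditional variance \(\omega /(1-\alpha -\beta )\), and the parameter values in the usage lines are illustrative assumptions. The parameters are taken as given here, and the ARMA(1,1) mean equation and the Student’s t likelihood are not covered.

```python
def garch11_variance(residuals, omega, alpha, beta):
    """Conditional variances from the GARCH(1,1) recursion in Eq. (1).

    Returns a list with len(residuals) + 1 entries; the last entry is the
    one-step-ahead variance forecast.
    """
    if omega <= 0 or alpha < 0 or beta < 0:
        raise ValueError("need omega > 0, alpha >= 0 and beta >= 0")
    if alpha + beta >= 1:
        raise ValueError("alpha + beta must be below 1 for a stationary GARCH(1,1)")
    # Assumption: initialise at the unconditional variance omega / (1 - alpha - beta).
    h = [omega / (1.0 - alpha - beta)]
    for eps in residuals:
        h.append(omega + alpha * eps ** 2 + beta * h[-1])
    return h


# Illustrative residuals and parameters (hypothetical values).
variances = garch11_variance([0.5, -1.0, 0.2], omega=0.1, alpha=0.1, beta=0.8)
forecast = variances[-1]
```

The stationarity check mirrors the condition \(\alpha +\beta <1\) stated above; an IGARCH variant would need a different initial value because the unconditional variance does not exist when \(\alpha +\beta =1\).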

2.1.2 FIGARCH model

The FIGARCH model extends the conditional variance equation of the standard GARCH model by adding fractional differencing in order to allow for the long memory property of the volatility process (Baillie et al. 1996; Baillie and Morana 2009). Following Baillie et al. (2000), we implement an ARMA(1,1)-FIGARCH(1,d,1) model given by

$$\begin{aligned} {\tilde{r}}_{t,n}&= \mu +\gamma {\tilde{r}}_{t,n-1}+\varepsilon _{t,n}+\theta \varepsilon _{t,n-1},\quad \varepsilon _{t,n}\vert \Omega _{t,n-1}\sim D_{v}(0,h_{t,n})\nonumber \\ h_{t,n}&= \omega +\beta h_{t,n-1}+[1-\beta L_1-(1-\varphi L_1)(1-L_1)^{d}]\varepsilon ^{2}_{t,n}, \end{aligned}$$
(2)

where \(\omega \), \(\beta \), and \(\varphi \) are the parameters of the conditional variance equation, d is the order of fractional integration, \(L_1\) is the lag operator on n, and \(D_v\) is the Student’s t-distribution defined above.

2.1.3 ARFIMA model

Granger (1980) and Granger and Joyeux (1980) introduce a flexible class of long memory processes that does not belong to the ARCH family and that can be applied directly to realized volatilities. This class has been widely adopted in the literature when long memory properties are assumed in the data (see Martin and Wilkins 1999; Pong et al. 2003; and the references therein). The ARFIMA(p, d, q) model for a process \(y_t\) is defined as

$$\phi (L_2)(1-L_2)^{d}(y_{t}-\mu )=\theta (L_2)\varepsilon _{t},$$
(3)

where d is the order of fractional integration and \(L_2\) is the lag operator on t. The AR and MA polynomial components are given as \(\phi (L_2)=1+\phi _1 L_2+\cdots +\phi _p L_2^p\) and \(\theta (L_2)=1+\theta _1 L_2+\cdots +\theta _q L_2^q\), respectively, and \(\mu \) is the mean of \(y_t\). In the empirical estimation of the ARFIMA(p, d, q) model, we follow Andersen et al. (2003) and replace \(y_{t}\) by the log of the daily realized volatility [denoted as \(\log ({\hat{\sigma}}_{t})\)] obtained from the 15-, 30-, and 60-min returns.
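The fractional difference operator \((1-L)^{d}\) that appears in both Eq. (2) and Eq. (3) can be expanded as \(\sum _{k\ge 0}w_k L^k\) with \(w_0=1\) and \(w_k=w_{k-1}(k-1-d)/k\). The Python sketch below computes these weights and applies the truncated filter to a series; the function names and the truncation at the sample length are illustrative assumptions rather than a description of the estimation routine used for the results in this paper.

```python
def frac_diff_weights(d, n_terms):
    """First n_terms coefficients w_k of (1 - L)^d = sum_k w_k L^k."""
    if n_terms < 1:
        raise ValueError("n_terms must be at least 1")
    weights = [1.0]
    for k in range(1, n_terms):
        # Recursion implied by the binomial expansion of (1 - L)^d.
        weights.append(weights[-1] * (k - 1 - d) / k)
    return weights


def frac_diff(series, d):
    """Apply (1 - L)^d to a series, truncating the expansion at the sample start."""
    w = frac_diff_weights(d, len(series))
    return [
        sum(w[k] * series[t - k] for k in range(t + 1))
        for t in range(len(series))
    ]
```

For \(d=1\) the weights reduce to the ordinary first difference, while for \(0<d<1\) they decay slowly and hyperbolically, which is the source of the long memory property discussed above.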

2.2 True volatility proxies

2.2.1 5-min realized volatility

The most popular proxy for the unobservable true volatility is the realized volatility measure proposed by Andersen and Bollerslev (1998). This is obtained by aggregating the intraday squared returns. We follow this approach and use a realized volatility series constructed from 5-min log price series, which is the highest frequency in our data. The proxy is given by

$${\hat{\sigma }}_{rv,t}^2=\sum _{n=1}^{N}r_{t,n}^2,$$
(4)

where \({\hat{\sigma}}_{rv,t}^2\) is the realized variance for day t and \(r_{t,n}^2\) is the squared 5-min (log) return on day t for interval n (\(n=1,2,\ldots ,N)\).
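Eq. (4) amounts to a sum of squared intraday returns for each day. A minimal Python sketch, with a hypothetical function name and made-up input values, is:

```python
def realized_variance(intraday_returns):
    """Realized variance of Eq. (4): sum of squared intraday log returns for one day."""
    if not intraday_returns:
        raise ValueError("need at least one intraday return")
    return sum(r * r for r in intraday_returns)


# Hypothetical 5-min log returns for part of a trading day.
rv = realized_variance([0.01, -0.02, 0.0, 0.0])
```

Zero returns add nothing to the sum, so a day on which most 5-min returns are zero is summarised by only a few non-zero squared returns.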

2.2.2 Median-based volatility

The second proxy we exploit for true volatility is the median-based volatility measure introduced by Andersen et al. (2012). The measure is robust to jumps in the underlying return dynamics and to small ("zero") returns. The median-based true volatility proxy is defined as

$${\hat{\sigma}}_{med,t}^2=\frac{\pi }{6-4\sqrt{3}+\pi } \left( \frac{N}{N-2}\right) \times \sum _{n=2}^{N-1} \text{ med }(|\Delta r_{n-1}|,|\Delta r_n|,|\Delta r_{n+1}|)^2,$$
(5)

where \({\hat{\sigma}}_{med,t}^2\) is the median-based variance for day t and \(|\Delta r_n|\) is the absolute return over the nth interval on day t.
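The median-based proxy in Eq. (5) can be sketched in Python as follows. The function name is hypothetical, and the input is assumed to be the list of intraday returns for a single day.

```python
import math
from statistics import median


def median_realized_variance(intraday_returns):
    """Median-based variance of Eq. (5) for one day of intraday returns."""
    n = len(intraday_returns)
    if n < 3:
        raise ValueError("need at least three intraday returns")
    abs_r = [abs(r) for r in intraday_returns]
    scale = math.pi / (6.0 - 4.0 * math.sqrt(3.0) + math.pi) * n / (n - 2)
    # Median of each block of three adjacent absolute returns, squared and summed.
    total = sum(median(abs_r[i - 1:i + 2]) ** 2 for i in range(1, n - 1))
    return scale * total
```

Because each term is the median of three adjacent absolute returns, a single isolated large return is replaced by one of its neighbours before squaring, which is what makes the measure insensitive to jumps.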

2.2.3 Range-based volatility

The third proxy for true volatility is the range-based measure proposed by Parkinson (1980). It has been further refined and adopted in Garman and Klass (1980), Yang and Zhang (2000), and Li and Hong (2011). By taking account of daily high and low prices, this measure is able to deal with microstructure biases in the market. The proxy is defined as follows:

$$\hat{\sigma }_{rng,t}^2=\frac{1}{4\ln 2}\left( \ln H_t-\ln L_t\right) ^2,$$
(6)

where \({\hat{\sigma}}_{rng,t}^2\) is the range-based variance for day t , and \(H_t\) and \(L_t\) are the daily high and low prices, respectively.

2.3 Forecasting accuracy

We use three different metrics to evaluate the out-of-sample forecasting accuracy of the volatility models, all of which are commonly adopted statistical measures in the literature (see, for example, Ahmed et al. 2016).

2.3.1 Root mean squared forecast error

The root mean squared forecast error (RMSFE) compares the true volatility with the forecasted volatility from a given model and is computed as

$$\text{ RMSFE }=\sqrt{\frac{1}{R}\sum _{t=1}^{R}({\hat{h}}_{t+1}-{\hat{\sigma}}_{t+1}^2)^{2}},$$
(7)

where R is the number of daily observations, \({\hat{h}}_{t+1}\) is the variance forecast, and \({\hat{\sigma}}^2_{t+1}\) is the chosen proxy for true variance in the out-of-sample period.
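A minimal Python sketch of Eq. (7) is given below; the function name and the numbers in the usage line are illustrative only.

```python
import math


def rmsfe(variance_forecasts, variance_proxies):
    """Root mean squared forecast error of Eq. (7)."""
    if len(variance_forecasts) != len(variance_proxies) or not variance_forecasts:
        raise ValueError("need two non-empty series of equal length")
    squared_errors = [
        (h - s) ** 2 for h, s in zip(variance_forecasts, variance_proxies)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical forecasts and proxy values for three out-of-sample days.
error = rmsfe([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```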

2.3.2 Diebold and Mariano (1995) and West (1996) test

The second out-of-sample statistical metric of accuracy is the Diebold and Mariano (1995) and West (1996) MSFE t-statistic, which in our case tests whether a competing volatility model outperforms the benchmark volatility model by generating more accurate variance forecasts. We choose the benchmark model based on the lowest RMSFE. The test statistic is as follows:

$$\text{MSFE}{\text{-}}t=\frac{1}{\sqrt{R{\hat{\Omega}}}}\sum _{t=1}^{R}\Delta Loss_{t+1},$$
(8)

where \(\Delta Loss_{t+1}\) is the difference between the squared forecast error loss functions of the benchmark and competing volatility models and \({\hat{\Omega}}\) is the consistent estimate of the asymptotic variance of \(R^{-0.5}\sum _{t=1}^{R}\Delta Loss_{t+1}\). The null hypothesis can be expressed as

$${\text{H}}_{0}:E[\Delta Loss_{t+1}]=0.$$
(9)

Since the volatility models are non-nested, the alternative hypothesis in this case is two-sided. The test statistic in Eq. (8) follows an asymptotic standard normal distribution under the null hypothesis of equal predictive ability. We regress \(\Delta Loss_{t+1}\) on a constant and obtain the \(\text{MSFE}{\text{-}}t\) statistic for a zero coefficient based on the Andrews and Monahan (1992) estimator. A positive (negative) and statistically significant \(\text{ MSFE}{\text{-}}t\) statistic suggests that the competing model outperforms (is outperformed by) the benchmark volatility model.
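The statistic in Eq. (8) can be sketched in Python as below. One substitution should be noted: in place of the Andrews and Monahan (1992) estimator, the sketch uses a simple Bartlett-kernel (Newey-West type) estimate of \({\hat{\Omega}}\) with a user-chosen number of lags, purely to keep the example short. The function name and the `lags` argument are assumptions of the sketch.

```python
import math


def dm_statistic(bench_errors, comp_errors, lags=0):
    """MSFE-t statistic of Eq. (8) with a Bartlett-kernel variance estimate.

    bench_errors and comp_errors are the forecast errors of the benchmark
    and competing model; lags = 0 gives the plain sample variance.
    """
    if len(bench_errors) != len(comp_errors) or not bench_errors:
        raise ValueError("need two non-empty error series of equal length")
    size = len(bench_errors)
    # Loss differential: benchmark squared error minus competitor squared error.
    d = [b * b - c * c for b, c in zip(bench_errors, comp_errors)]
    mean_d = sum(d) / size
    u = [x - mean_d for x in d]

    def autocov(k):
        return sum(u[t] * u[t - k] for t in range(k, size)) / size

    # Assumption: Bartlett weights instead of the Andrews-Monahan estimator.
    omega = autocov(0)
    for k in range(1, lags + 1):
        omega += 2.0 * (1.0 - k / (lags + 1)) * autocov(k)
    if omega <= 0.0:
        raise ValueError("long-run variance estimate is not positive")
    return sum(d) / math.sqrt(size * omega)
```

With this sign convention a positive value means that the benchmark has the larger squared errors, that is, the competing model forecasts better, in line with the interpretation given above.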

2.3.3 Superior predictive ability test

To address the multiple-testing problem in the light of data mining, we conduct the superior predictive ability (henceforth SPA) test of Hansen (2005). Under the composite null hypothesis, none of the competing volatility models has superior predictive ability. In other words, the null states that the benchmark model is not inferior to any of the alternative models. A rejection of the null hypothesis indicates that at least one competing model produces forecasts more accurate than the benchmark. Once again, we choose the benchmark model based on the lowest RMSFE and evaluate the out-of-sample forecasts based on the MSFE. For inference, we report stationary bootstrap p values obtained using 10,000 replications.

Table 1 Sample periods and trading volumes for commodity futures contracts

3 Data and estimation

The data come from the GTA Information Technology Company. We obtain contract ID, trading date, trading time, trading venue, contract expiry date, last recorded (Renminbi) price, high and low prices, and volume for 5-min time series on four commodity futures contracts: aluminum, copper, fuel oil, and sugar. The full sample period as well as the in-sample and out-of-sample periods for each commodity are provided in Table 1.Footnote 6 \(^{,}\) Footnote 7 In Panel D, we find seasonality in trading volume for each contract over the full sample period. More precisely, we observe that in terms of the average number of contracts traded for each delivery, there is not much variation across the 12 delivery months for aluminum and copper, and there is a slight variation for fuel oil. In other words, the number of contracts traded is relatively stable all year round. However, with only six delivery months per year, sugar shows a notable variation in the average number of contracts traded across the delivery months. In particular, contracts for January, May, and September exhibit huge trading volumes, while contracts for March, July, and November show the opposite. The trading volume for January delivery is the highest on average with more than 5.6 million contracts, whereas for November delivery the average trading volume is the lowest at 18,418 contracts, about 0.32 % of that for January delivery. This striking variation naturally raises the question of whether, and by how much, the volatility dynamics differ between these two delivery months. Hence, in the empirical exercises, we examine two futures contract series for sugar, one for the very liquid January delivery and the other for the very illiquid November delivery.

Table 2 Liquidity measures of commodity futures with different time to delivery

In Table 2, we report descriptive statistics of the three measures adopted to describe the liquidity of futures contracts at the 5-min interval, which is the highest sampling frequency in our data.Footnote 8 For aluminum, the Roll spread measure for nearby contracts averages 0.0006, zero returns account for 61 % of all 5-min returns on average in a trading day, and the scaled Amihud measure is 0.23. Comparing these figures to those for the contracts with 3 months to delivery, we notice a marked improvement. In particular, the Roll spread drops to 0.0004, the percentage of zero returns decreases to 36 %, and the scaled Amihud illiquidity measure drops to 0.03. Liquidity subsequently worsens with longer time to delivery. In other words, aluminum contracts with 3 months to delivery are the most liquid, and liquidity decreases for contracts with longer or shorter time to maturity. The pattern is mirrored in the liquidity estimators for the other commodities as well. Hence, in our volatility estimation and forecasting exercises for aluminum, copper, and fuel oil, we use futures contracts with 3 months to delivery, as they are the most liquid among all maturities and their volatility forecasts are the least likely to be biased by the large proportion of zero returns. To construct the return series with 3 months to maturity for aluminum, copper, and fuel oil, we use the prices of a contract from the third month prior to its delivery month until the first day of the second month prior to its delivery month. We then switch to the next contract, which matures in 3 months, to obtain a continuous time series. Hence, for these three commodities, the contract time to maturity is always around 3 months. For sugar futures, however, we are mostly interested in the effect that seasonality in trading volume has on volatility forecasting. Therefore, we take contracts from January to December for next January delivery and from November to October for next November delivery. As a result, the contract time to maturity changes over time. The practice of switching contracts to the next delivery month is common in the literature (see, for example, Baillie et al. 2007).
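To make two of the liquidity measures concrete, the Python sketch below computes the proportion of zero returns and the Roll (1984) effective spread for one series of returns or price changes. It is only an illustration under stated assumptions: the function names are hypothetical, the spread is set to zero when the first-order autocovariance is non-negative (a common convention), and the exact definitions and scaling used for Table 2 are those given in the "Appendix".

```python
import math


def zero_return_share(returns):
    """Fraction of observations that are exactly zero."""
    if not returns:
        raise ValueError("need at least one return")
    return sum(1 for r in returns if r == 0) / len(returns)


def roll_spread(changes):
    """Roll (1984) spread: 2 * sqrt(-cov) of adjacent changes, or 0 if cov >= 0."""
    if len(changes) < 3:
        raise ValueError("need at least three observations")
    current, lagged = changes[1:], changes[:-1]
    n = len(current)
    mean_c = sum(current) / n
    mean_l = sum(lagged) / n
    cov = sum((c - mean_c) * (l - mean_l) for c, l in zip(current, lagged)) / n
    # Assumption: report a zero spread when the autocovariance is not negative.
    return 2.0 * math.sqrt(-cov) if cov < 0 else 0.0
```

A series that simply bounces between bid and ask, such as alternating changes of +1 and -1, has a first-order autocovariance of -1 in this calculation and hence an implied spread of 2.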

In our sample, all commodity futures are traded for 4 h on a trading day starting at 9:00 a.m. and closing at 3:00 p.m. with a 2-h break between 11:30 a.m. and 1:30 p.m. As a result, there are 48 5-min returns on any business day. The (log) return \(r_{t,n}\) on a trading day t for the nth interval is computed as

$$r_{t,n}=\ln P_{t,n}-\ln P_{t,n-1},$$
(10)

where \(P_{t,n}\) denotes the commodity futures price on day t at the end of the nth interval. The 15-, 30-, and 60-min returns are obtained by taking the logarithmic difference between prices that are 15, 30, and 60 min apart, respectively. The daily returns are computed as \(r_{t}=\ln P_{t}-\ln P_{t-1}\).
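Because log returns are additive, the return over a 15-min interval equals the sum of the three 5-min returns it contains, which is the same as the log difference of prices 15 min apart. The Python sketch below illustrates Eq. (10) and this aggregation; the function names and the price values are hypothetical, and the price list is assumed to include the price at the start of the first interval.

```python
import math


def log_returns(prices):
    """Log returns of Eq. (10) from consecutive prices."""
    if len(prices) < 2 or any(p <= 0 for p in prices):
        raise ValueError("need at least two strictly positive prices")
    return [math.log(b) - math.log(a) for a, b in zip(prices, prices[1:])]


def aggregate_returns(returns, k):
    """Sum non-overlapping blocks of k returns (e.g. k = 3 turns 5-min into 15-min)."""
    if k < 1 or len(returns) % k != 0:
        raise ValueError("the number of returns must be a multiple of k")
    return [sum(returns[i:i + k]) for i in range(0, len(returns), k)]


# Hypothetical 5-min prices covering six intervals.
five_min = log_returns([100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0])
fifteen_min = aggregate_returns(five_min, 3)
```

With 48 5-min returns per trading day, `k` equal to 3, 6, and 12 yields the 16, 8, and 4 intraday returns at the 15-, 30-, and 60-min frequencies.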

Table 3 Descriptive statistics of commodity futures returns

In Table 3, we provide descriptive statistics of commodity futures contract returns at 5-, 15-, 30-, 60-min and daily intervals. We notice that the average returns are very close to zero irrespective of contracts and data frequencies. Returns are left skewed with fat tails, although the degree of negative skewness and excess kurtosis tends to drop with decreasing sampling frequency. In addition, the percentage of zero returns drops considerably from the 5-min to the daily interval. For fuel oil, for example, it is 31.50 % at the 5-min interval and 17 % at the 15-min interval, but only 3.60 % at the daily level. The trade-off between the improvement in data quality and the loss of information at lower frequencies could be crucial for the outcome of volatility measurement and forecasting exercises. In Fig. 1, we plot the time series of 30-min returns for aluminum, copper, fuel oil, and sugar with January delivery as an example of the data we employ in this paper.

Fig. 1

The time series of returns to the Chinese commodity futures contracts. This figure plots the 30-min return series for aluminum (top left), copper (top right), fuel oil (bottom left), and sugar with January expiry (bottom right) for the full sample

The volatility of intraday returns is known to display periodicity within a trading day, which could contaminate the estimation of conventional volatility models (Andersen and Bollerslev 1997). Following Taylor and Xu (1997), we estimate a simple seasonality term \(S_{t,n}\) by averaging the squared returns for each intraday period as follows:

$${\hat{S}}^2_{t,n}=\frac{1}{T}\sum _{t=1}^T r_{t,n}^2,$$
(11)

where T is the number of trading days in the full sample period. The deseasonalized intraday returns are obtained as

$${\tilde{r}}_{t,n}=\frac{r_{t,n}}{\hat{S}_{t,n}}.$$
(12)

We then make use of the deseasonalized returns to estimate the intraday GARCH family of models. In the out-of-sample forecasting, the intraday forecasts are based on the deseasonalized returns and are therefore transformed back to forecasts for the original returns. This is implemented as follows:

$${\hat{h}}_{t,n}={\hat{S}}_{t,n}^2 \times {\tilde{h}}_{t,n},$$
(13)

where \({\tilde{h}}_{t,n}\) is the intraday variance forecast using the deseasonalized returns and \({\hat{h}}_{t,n}\) is the transformed variance forecast for the original returns. We produce one-step ahead daily volatility forecasts for the daily models. For the intraday models, we produce 16-, 8-, and 4-step ahead forecasts at the 15-, 30-, and 60-min intervals, respectively, and aggregate them into daily forecasts. The ARFIMA model is fitted directly to the daily realized volatility aggregated from intraday returns. The out-of-sample forecasts are evaluated against the daily true volatility proxies described earlier. For all sampling frequencies, we use a rolling window forecasting scheme to obtain forecasts from all volatility models.
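The deseasonalization in Eqs. (11) and (12) and the back-transformation in Eq. (13) can be sketched in Python as follows. The function names are hypothetical, the input is assumed to be a list of days that each hold the same number of intraday returns, and the aggregation to a daily forecast is assumed to be the sum of the rescaled intraday variance forecasts over the day.

```python
import math


def seasonal_variance(returns_by_day):
    """Eq. (11): average squared return for each intraday interval across days."""
    n_days = len(returns_by_day)
    n_intervals = len(returns_by_day[0])
    s2 = [
        sum(day[n] ** 2 for day in returns_by_day) / n_days
        for n in range(n_intervals)
    ]
    if any(v == 0.0 for v in s2):
        raise ValueError("an interval with only zero returns cannot be rescaled")
    return s2


def deseasonalize(returns_by_day, s2):
    """Eq. (12): divide each return by the seasonal standard deviation."""
    return [[r / math.sqrt(v) for r, v in zip(day, s2)] for day in returns_by_day]


def daily_variance_forecast(h_tilde, s2):
    """Eq. (13) per interval, then summed over the day (aggregation is an assumption)."""
    if len(h_tilde) != len(s2):
        raise ValueError("need one variance forecast per intraday interval")
    return sum(v * h for v, h in zip(s2, h_tilde))
```

By construction, the deseasonalized returns have an average squared value of one in every intraday interval, so any remaining variation in \({\tilde{h}}_{t,n}\) reflects volatility dynamics rather than the time-of-day pattern.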

4 Empirical analysis

4.1 In-sample results

We report the in-sample parameter estimates of the intraday GARCH, FIGARCH, and IGARCH models for five futures contracts at 15-, 30-, and 60-min intervals in Table 4. For the ARMA(1,1)-GARCH(1,1) model specification in Panel A, most of the AR parameter estimates \({\hat{\gamma}}\) are statistically significant at conventional levels. Also, the MA parameter estimate \({\hat{\theta}}\) is significantly negative in most cases, capturing the first order negative autocorrelation in the returns. All the parameters in the conditional variance equations are highly significant at the 1 % level except \({\hat{\alpha}}\) for 15-min copper contracts. The fact that \({\hat{\alpha}}+{\hat{\beta}}<1\) reveals that the GARCH process is stationary, and, since \({\hat{\alpha}}+{\hat{\beta}}\) is close to 1, the volatility process is persistent. For the contract series with return innovations following a Student’s t-distribution, the degrees of freedom parameter is between 2 and 4 and statistically significant at the 1 % level. This indicates a fat tail in the return distributions.

In Panel B, when the volatility process is described by an ARMA(1,1)-FIGARCH(1,d,1) model, we notice that the parameter d, the order of fractional integration, is significantly different from zero at the 1 % level for all futures contract series. This implies that the volatility process exhibits a long memory property and attests to the importance of adding this feature in the volatility dynamics of the commodity futures contract returns under scrutiny. It is also worth noting that, similar to the results in Panel A, the degrees of freedom parameter v is highly significant. Panel C shows the parameter estimates of the ARMA(1,1)-IGARCH(1,1) model specification and the results are qualitatively similar to those in Panel A.

Table 4 In-sample parameter estimation of the intraday GARCH, FIGARCH, and IGARCH models
Table 5 In-sample parameter estimation of the daily GARCH, FIGARCH, and IGARCH models

Table 5 shows the in-sample parameter estimation for the daily GARCH, FIGARCH, and IGARCH models. These results are qualitatively similar to those in Table 4. We observe: (1) negative and significant first order autocorrelation in the conditional mean equation for each model and contract except for the daily IGARCH model using the sugar contract with January delivery; (2) statistically significant \(\hat{\beta }\) parameters; (3) highly significant fractional integration parameters \(\hat{d}\); and (4) highly significant degrees of freedom parameters \(\hat{v}\).

We present the in-sample parameter estimates of the ARFIMA model using the daily realized volatility obtained from the 15-, 30-, and 60-min returns in Table 6. For aluminum, copper, and fuel oil, we set the MA term \(q=0\) as it is statistically insignificant at any conventional level. The first order autoregression term \(\hat{p}\) is negative and highly significant and the fractional integration term \(\hat{d}\) hovers around 0.4 for each of these three commodities. In cases of January and November contracts for sugar, the first order autocorrelation \(\hat{p}\) tends to be positive and quite often significant. The MA parameter \(\hat{q}\) is close to \(-0.4\) and significant at the 1 % level. Similar to other commodities, the fractional integration parameter estimate for sugar is in the vicinity of 0.45 and is highly significant.

Table 6 In-sample parameter estimation of the ARFIMA(p, d, q) model

Overall, the in-sample estimates of the GARCH, FIGARCH, IGARCH, and ARFIMA models reported in Tables 4, 5, and 6 using intraday and daily data reveal that, for the four commodities, the return innovations are generally negatively autocorrelated with fat tails. Moreover, the underlying volatility processes are persistent with clear evidence of long memory properties.

4.2 Out-of-sample predictions

Table 7 reports RMSFEs for all volatility models, where forecast errors are computed in comparison with three alternative true volatility proxies. In Panel A, we use the most widely exploited proxy in the literature, namely, the realized volatility measure constructed from the 5-min returns. It is interesting to notice that for aluminum and copper futures contracts, the IGARCH and FIGARCH models produce the smallest RMSFEs, respectively, and both at the daily level. This preliminary evidence suggests that, when this particular true volatility proxy is used to compute forecast errors, information contained in intraday prices does not help in generating more accurate volatility forecasts. For fuel oil, the 30-min FIGARCH model produces the smallest RMSFE. It is also interesting to observe that although the January and November deliveries for sugar contracts differ massively in terms of trading volume (see Table 1), the ARFIMA model utilizing the daily realized volatility obtained from the 15-min returns provides the best forecasts for both futures contracts.

In Panel B, we consider median-based daily volatility as a proxy for true volatility. In this case, the ARFIMA model beats the rest of the competing models by producing the lowest RMSFE. More precisely, the ARFIMA model outperforms the other models for copper, fuel oil, and sugar (both January and November deliveries) when the daily realized volatility is obtained from the 60-min returns. For aluminum, it is the ARFIMA model using the daily realized volatility computed from the 30-min returns. Finally, in Panel C, we make use of range-based volatility as the true volatility proxy. Once again, the ARFIMA model is the best performing model for four out of five commodity futures contracts. In particular, the ARFIMA model applied to the daily realized volatility obtained from the 15-min returns leads to the lowest RMSFE for copper. For aluminum and the January and November deliveries of sugar contracts, it is the ARFIMA model applied to the daily realized volatility based on the 60-min returns. Fuel oil is the only exception, for which the daily IGARCH model provides the most accurate out-of-sample variance forecasts.

Table 7 Root mean squared forecast error

Taken together, we notice three interesting and consistent patterns from the preliminary results in Table 7. First, the ARFIMA model, with its long memory dimension, dominates the other three volatility models in 11 out of 15 commodity/true volatility proxy combinations. Second, GARCH-type models using daily data outperform similar models using intraday data. Third, the ARFIMA model applied to the daily realized volatility obtained from the higher frequency returns (i.e., 15-min returns) does not always beat the ARFIMA model using the daily realized volatility computed from the lower frequency returns. The latter two observations are novel for our chosen futures market because the literature seems to agree that intraday data enjoy an informational advantage over daily data and that the forecasting performance of the ARFIMA model improves with sampling frequency (Martens 2001; Martens and Zein 2004). We plot in Fig. 2 the time series of forecast errors for the ARFIMA model and the GARCH model using 30-min returns when the benchmark is the median-based volatility measure. It is quite evident that for the products depicted in this figure, the ARFIMA model provides smaller forecast errors over time.

Fig. 2

The forecast errors for different volatility models. This figure plots the out-of-sample forecast errors of the ARFIMA model and the GARCH model using the 30-min return series for aluminum (top left), copper (top right), fuel oil (bottom left), and sugar with January expiry (bottom right). The benchmark is the median-based volatility measure.

Table 8 Diebold and Mariano (1995) and West (1996) test results

In Table 8, we provide pairwise comparisons following the well-known Diebold and Mariano (1995) and West (1996) test based on the Andrews and Monahan (1992) estimator. We choose the benchmark model in each case as the one with the lowest RMSFE in Table 7. The results suggest that the competing model forecasts are either statistically as accurate as the benchmark model or, in most cases, significantly worse. It is interesting to notice that in Panel A, for aluminum, the ARFIMA model utilizing the daily realized volatility from the 15-, 30-, and 60-min returns produces inferior forecasts, but the difference from the benchmark is statistically insignificant. Put differently, the null hypothesis of equal MSFEs cannot be rejected at any conventional level. In fact, for all model/true volatility proxy combinations, whenever the best performing model utilizes daily data, the ARFIMA model provides forecasts that are statistically just as good. These include the daily IGARCH model for aluminum and the daily FIGARCH model for copper in Panel A, and the daily IGARCH model for fuel oil in Panel C. For other model/true volatility proxy combinations, the competing models tend to produce statistically inferior forecasts, including both sugar contracts in Panels A and C.
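A pairwise test of equal MSFEs of this kind can be sketched as follows. This is a simplified illustration, not the procedure used for Table 8: it replaces the Andrews and Monahan (1992) estimator with a plain Bartlett-kernel long-run variance and a rule-of-thumb lag, and it uses a two-sided normal approximation. The function name `dm_test` and its arguments are hypothetical.

```python
import math

import numpy as np


def dm_test(proxy, f_bench, f_comp, lag=None):
    """Diebold-Mariano-type test of equal MSFE for two forecast series.

    Returns (statistic, two-sided p-value). A positive statistic means the
    competing forecast has the larger squared errors on average.
    """
    proxy, f_bench, f_comp = (
        np.asarray(x, dtype=float) for x in (proxy, f_bench, f_comp)
    )
    # Loss differential: competitor squared error minus benchmark squared error.
    d = (proxy - f_comp) ** 2 - (proxy - f_bench) ** 2
    n = d.size
    if lag is None:
        # Rule-of-thumb truncation lag (an assumption of this sketch).
        lag = int(math.floor(4.0 * (n / 100.0) ** (2.0 / 9.0)))
    dc = d - d.mean()
    # Bartlett-kernel long-run variance of the loss differential.
    lrv = np.dot(dc, dc) / n
    for k in range(1, lag + 1):
        gamma_k = np.dot(dc[k:], dc[:-k]) / n
        lrv += 2.0 * (1.0 - k / (lag + 1.0)) * gamma_k
    if lrv <= 0.0:
        raise ValueError("long-run variance is not positive; test is undefined")
    stat = d.mean() / math.sqrt(lrv / n)
    # Two-sided p-value under the standard normal approximation.
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return float(stat), float(p_value)
```

In this sketch, a large p-value corresponds to the case described above in which a competing model is statistically as accurate as the benchmark, while a positive statistic with a small p-value marks a significantly worse competitor.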

As a robustness check, Tables 10, 11 and 12 report the Diebold and Mariano (1995) and West (1996) test results obtained by sequentially using each volatility model as the benchmark, in order of increasing RMSFE, against the remaining alternative models. These additional results corroborate the conclusion in Table 8 that the benchmark, chosen as the one with the lowest RMSFE in Table 7, is indeed the one with the best volatility forecasting ability.

In Table 9, we perform the SPA test of Hansen (2005) to examine out-of-sample forecasting ability across all competing models and compute the stationary bootstrap p values. The null hypothesis is that the benchmark model, the one with the lowest RMSFE, is not inferior to any of the competing models. The test results are resounding. The p value for the null hypothesis that the benchmark model is at least as good as the competing models in forecasting volatility out of sample is 1 or very close to it. Taken together, the results in Tables 8 and 9 clearly confirm and substantiate the observations in Table 7. In other words, when intraday data are directly used in the GARCH-type models, they are no better than daily data for volatility forecasting even after deseasonalization. Hence, if a model is to be recommended for volatility forecasting in the Chinese futures market, it would be the ARFIMA model, as it is consistently the best performing model or not statistically inferior to the best performing one.
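The logic of a bootstrap test of this null hypothesis can be sketched in Python. The sketch below is not Hansen's (2005) SPA test: it omits the studentization and the sample-dependent recentering of that test and is closer to the Reality Check of White (2000), combined with the stationary bootstrap of Politis and Romano (1994). All function names, the mean block length, and the number of bootstrap replications are assumptions made for illustration.

```python
import numpy as np


def stationary_bootstrap_indices(n, mean_block, rng):
    """Draw one stationary-bootstrap resample of the indices 0..n-1.

    Blocks start at a uniformly random index and have geometric length with
    mean `mean_block`; indices wrap around at the end of the sample.
    """
    if mean_block < 1:
        raise ValueError("mean_block must be at least 1")
    p = 1.0 / mean_block
    idx = np.empty(n, dtype=int)
    idx[0] = rng.integers(n)
    for t in range(1, n):
        if rng.random() < p:
            idx[t] = rng.integers(n)        # start a new block
        else:
            idx[t] = (idx[t - 1] + 1) % n   # continue the current block
    return idx


def reality_check_pvalue(loss_bench, loss_comp, n_boot=500, mean_block=10, seed=0):
    """Bootstrap p-value for H0: the benchmark is not inferior to any competitor.

    `loss_bench` has shape (n,), `loss_comp` has shape (n,) or (n, K).
    """
    loss_bench = np.asarray(loss_bench, dtype=float)
    loss_comp = np.asarray(loss_comp, dtype=float)
    if loss_comp.ndim == 1:
        loss_comp = loss_comp[:, None]
    # Positive differential: the competitor has the smaller loss.
    d = loss_bench[:, None] - loss_comp
    n = d.shape[0]
    d_bar = d.mean(axis=0)
    stat = np.sqrt(n) * d_bar.max()
    rng = np.random.default_rng(seed)
    exceed = 0
    for _ in range(n_boot):
        idx = stationary_bootstrap_indices(n, mean_block, rng)
        boot_stat = np.sqrt(n) * (d[idx].mean(axis=0) - d_bar).max()
        exceed += int(boot_stat >= stat)
    return exceed / n_boot
```

When the benchmark has the smallest average loss by a wide margin, the observed statistic is strongly negative and nearly every bootstrap draw exceeds it, which is why p values at or near 1 arise in such a setting.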

Table 9 Superior predictive ability test results

Finally, we note that although sugar contracts for January and November deliveries differ in terms of trading volume and liquidity, their underlying volatility dynamics are very similar. The in-sample parameter estimates are similar between these two series and both are best forecast by the same model. When the 5-min realized volatility is the proxy for true volatility, the ARFIMA model using the realized volatility computed from the 15-min returns produces the most accurate forecast for both series, while the ARFIMA model applied to the realized volatility computed from the 60-min interval outperforms competing models for the other two volatility proxies for both series. In other words, seasonality in trading volume and differences in liquidity do not affect volatility model selection.

5 Conclusion

In this paper, we undertake a comprehensive volatility forecasting exercise in a futures market with unique institutional regulations. In the Chinese commodity futures market, the margin rate is time-dependent and investors face a higher deposit as contracts move closer to maturity. In addition, although individuals account for the majority of investors, they are not allowed to trade nearby contracts. These two regulations result in a liquidity pattern whereby contracts with 3 months to delivery are the most liquid, and we demonstrate this by computing three popular liquidity measures with 5-min intraday data for aluminum, copper, fuel oil, and sugar. In addition, even these most liquid contract series contain a large percentage of zero returns at the 5-min interval.

We explicitly take these features into account when forecasting volatility and utilize the more distant contracts with 3 months to maturity at the daily and three different intraday sampling frequencies. We demonstrate in the in-sample volatility modeling that the long memory dimension is present in our data. When it comes to out-of-sample forecasting, we show that the ARFIMA model, which aggregates intraday returns to the daily level in generating daily forecasts, is the best-performing model, or equivalent to the best-performing model in statistical terms. The FIGARCH model, which also incorporates the long memory feature in the volatility dynamics, is less efficient in generating forecasts, probably because large proportions of intraday returns are zero and the deseasonalized intraday returns are fed directly into the model.

Furthermore, we show that within the GARCH family of models, the forecasting performance using the daily data is consistently as good as, if not better than, that using the intraday data, which also attests to the trade-off between information and noise in intraday data with many zero returns. Finally, it is interesting to note that even though the January and November contract series for sugar differ massively in terms of trading volume, their underlying volatility dynamics are well captured and forecast by the ARFIMA model at the same data sampling frequency.