Volatility forecasting in the Chinese commodity futures market with intraday data

Given the unique institutional regulations in the Chinese commodity futures market as well as the characteristics of the data it generates, we utilize contracts with three months to delivery, the most liquid contract series, to systematically explore volatility forecasting for aluminum, copper, fuel oil, and sugar at the daily and three intraday sampling frequencies. We adopt popular volatility models in the literature and assess the forecasts obtained via these models against alternative proxies for the true volatility. Our results suggest that the long memory property is an essential feature in the commodity futures volatility dynamics and that the ARFIMA model consistently produces the best forecasts or forecasts not inferior to the best in statistical terms.


Introduction
In this paper, we are concerned with volatility forecasting in the Chinese commodity futures market. Volatility modeling and forecasting is an intensively studied area of research, as volatility is considered the ''barometer for the vulnerability of financial markets and the economy'' (Poon and Granger 2003, p. 479) and is central to asset pricing, derivative valuation, portfolio allocation, and risk management. We are interested in this particular market in part because it has become an important part of the global futures markets with tremendous trading volume. 1;2 More importantly, this market is regulated by two unique institutional rules that make it interesting to explore.
The first regulation is the time-dependent margin rate, whereby the margin as a fraction of the contract value increases as contracts move closer to delivery. Take sugar as an example. The margin rate for deposit two months prior to delivery is 6 % of the contract value for an investor. In the month before delivery, it increases to 8 % in the first 10 days, 15 % from the 11th to the 20th day of the month, and 25 % in the final 10 days of the month, culminating in 30 % in the delivery month. The second regulation is that, although they represent 97 % of all investors in the futures markets, individual investors are not allowed to trade nearby contracts. Both regulations effectively push market participation and trading volume toward more distant contracts, with implications for market liquidity.
Our contribution to the literature is that we take into account unique institutional regulations of this market and design empirical volatility forecasting exercises that are appropriate for the characteristics of the market and the data it generates. Our data on aluminum, copper, and fuel oil consistently show that contracts with three months to delivery enjoy the best liquidity. We are not the first to note this pattern (see Liu et al. 2014;Peck 2008), but we are the first to offer solid and detailed evidence. Using 5-min returns data over long sample periods, we compute three popular liquidity measures that capture different aspects of liquidity, namely the effective spread of Roll (1984), the proportion of zero returns of Lesmond et al. (1999), and the Amihud (2002) illiquidity measure (Goyenko et al. 2009). Our results show that contracts with three months to delivery are the most liquid as they exhibit the lowest effective spread, the lowest percentage of zero returns, and the smallest value for the Amihud (2002) illiquidity measure. This is different from the majority of futures markets and contracts for which the nearby contracts are usually the most liquid (see Baillie et al. 2007;Lee 2009; and the references therein). Crucially, this liquidity pattern results from the unique institutional environment in which trading takes place.
On the other hand, being an emerging market, the Chinese commodity futures market exhibits a large proportion of zero returns (Bekaert et al. 2007), and this is particularly evident in our 5-min return series. Even for the most liquid 3-month to maturity contracts, the fraction of zero returns is as high as 36.27, 23.90, and 31.50 % on average, respectively, for aluminum, copper, and fuel oil. In the existing literature, intraday data are widely adopted for volatility modeling and forecasting as they are shown to contain more information and provide more accurate and efficient forecasts (see Fuertes et al. 2015; Hseu et al. 2007; Shi and Lee 2008; and the references therein). However, the large proportion of zero returns in our data suggests that a higher data sampling frequency does not necessarily translate into better forecasting performance, due to information loss or noise in the data (Bandi and Russell 2006; Phillips and Yu 2009). Hence we choose to perform volatility forecasting by aggregating 5-min data into 15-, 30-, and 60-min intraday returns and computing daily returns from daily prices, so that we can observe and compare how well different models capture the volatility dynamics given the data.
1 See the Annual Volume Survey Report 2014 published by the Futures Industry Association, the primary industry association for centrally cleared futures and swaps based in Washington D.C., at https://fia.org. The Chinese sugar futures contracts rank 3rd globally in terms of trading volume in the Agricultural Category, while copper ranks 4th in the Metals Category.
2 Our paper is related to Liu et al. (2014), who examine hedging with metal futures in China using commodity futures contracts, and to Fung et al. (2003), who adopt the bivariate GARCH framework to analyze the information flow between commodity futures traded both in the US and China.
Equally important for the volatility forecast comparison is the choice of the true volatility proxy. While true volatility is a latent variable that cannot be observed in the market, an efficient and accurate representation of it is of great importance for the evaluation of volatility forecasts [see Andersen et al. (2010) for an excellent survey]. In this paper, we employ three different proxies for the true daily volatility. In addition to the widely adopted realized volatility measure of Andersen and Bollerslev (1998), we also consider the median-based measure of Andersen et al. (2012) and the range-based proxy advocated by Parkinson (1980), both of which are shown to be robust to zero returns, potential jumps in the underlying price dynamics, and other microstructure related effects.
In terms of volatility models, we begin with the conventional generalized autoregressive conditional heteroskedastic (GARCH) model of Bollerslev (1986, 1990). Our choice of models is also motivated by Baillie et al. (2007), which document strong long memory properties in commodity futures and argue that the fractionally integrated GARCH (FIGARCH) model captures this feature very well. At the same time, a natural alternative that works well at capturing the long memory property in realized volatility is the autoregressive fractionally integrated moving average (ARFIMA) model of Granger (1980) and Granger and Joyeux (1980). The two models differ in the manner in which information is extracted from intraday data: intraday returns are first aggregated to obtain daily realized volatility before the ARFIMA model is adopted to describe and forecast realized volatility at the daily level; whereas for the FIGARCH model, deseasonalized intraday data are directly fed into the model. So it is empirically interesting to compare the performance of the two models using our data.
Our empirical analysis reveals a host of interesting findings. First, in terms of the out-of-sample forecasting performance, the Diebold and Mariano (1995) and West (1996) test applied on a pairwise basis and the superior predictive ability test of Hansen (2005), which tests across alternative models simultaneously, suggest that the ARFIMA model consistently outperforms the GARCH-type models in the out-of-sample tests. It is the best performing model in 11 out of 15 commodity/volatility proxy combinations, and for the remaining four combinations the difference between the forecasting performance of the ARFIMA model and that of the best performing model is statistically insignificant at any conventional level. In other words, the ARFIMA model consistently produces the best forecasts or forecasts not inferior to the best in statistical terms.
This result highlights the importance of incorporating the long memory dimension in volatility modeling, in line with the literature. This finding also contributes to the discussion in the literature of whether the FIGARCH or the ARFIMA model is empirically better at capturing the long memory feature in the volatility dynamics (Chortareas et al. 2011). Given that the intraday Chinese commodity futures data contain a large proportion of zero returns, which are directly fed into the FIGARCH model, it is not surprising that the ARFIMA model performs better.
Second, we show that within the GARCH family of models, the forecasting performance using the daily data is consistently as good as, if not better than, that using the intraday data. This finding suggests that the GARCH-type models may not be very efficient in utilizing the information contained in the intraday data of this particular market for volatility forecasting purposes, due to the high percentage of zero returns.
Finally, it is interesting to note that although sugar contracts with January maturity and November maturity differ massively in terms of trading volume and show different levels of liquidity, the underlying volatility dynamics are nevertheless captured by the same model at the same data sampling frequency. For example, when the median- and range-based proxies are adopted, both futures contracts are best forecasted by the ARFIMA model using daily realized volatility obtained from the 60-min returns. This further suggests that the ARFIMA model is a reliable and robust tool for forecasting volatility regardless of the underlying liquidity level, with practical implications for traders and risk managers.
The rest of the paper is structured as follows. In Sect. 2, we briefly outline the alternative volatility models, the proxies for the true volatility dynamics, and the statistical metrics for the out-of-sample volatility forecasts evaluation. Section 3 describes the data and the model estimates. In Sect. 4, we discuss and analyze main empirical findings. Finally, Sect. 5 concludes. Details of the three liquidity measures are provided in the ''Appendix''.
Models and statistical evaluation

Volatility models
In this paper, we consider four popular volatility models at four different data sampling frequencies for volatility modeling and out-of-sample forecasting. In particular, we make use of the: (1) intraday GARCH, integrated GARCH (IGARCH), and FIGARCH models at the 15-, 30-, and 60-min intervals; (2) daily GARCH, IGARCH, and FIGARCH models; and (3) ARFIMA model applied to the daily realized volatility computed from the 15-, 30-, and 60-min intervals. The model specifications are briefly outlined below.

GARCH model
The GARCH model is the workhorse in the volatility estimation and forecasting literature (see Bollerslev 1986, 1990; among others). We use an ARMA(1,1) process in the conditional mean equation of the GARCH-type models. To allow for possible fat tails, we model the innovations in the GARCH process as independently and identically distributed Student's t variates, and implement the ARMA(1,1)-GARCH(1,1) model using both intraday and daily data. The model specification is given by
$$\tilde{r}_{t,n} = \mu + \gamma \tilde{r}_{t,n-1} + \varepsilon_{t,n} + \theta \varepsilon_{t,n-1}, \qquad \varepsilon_{t,n} \mid \Omega_{t,n-1} \sim D_v(0, h_{t,n}),$$
$$h_{t,n} = \omega + \alpha \varepsilon_{t,n-1}^2 + \beta h_{t,n-1},$$
where $\tilde{r}_{t,n}$ is the deseasonalized logarithmic return on day $t$ for the $n$th time interval [see Eqs. (10)-(12)], $\mu$, $\gamma$, and $\theta$ are the parameters of the conditional mean equation, and $\omega$, $\alpha$, and $\beta$ are the parameters of the conditional variance equation. The error term $\varepsilon_{t,n}$, which is conditional on the information set $\Omega_{t,n-1}$, follows a Student's t-distribution (denoted by $D_v$) with zero mean, variance $h_{t,n}$, and $v$ degrees of freedom. The GARCH model requires that $\alpha + \beta < 1$ for the volatility process to be stationary. For the IGARCH model, however, the corresponding requirement is $\alpha + \beta = 1$.
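As a rough illustration of the GARCH(1,1) variance recursion and of how multi-step forecasts follow from it, the sketch below filters conditional variances from a residual series and iterates the forecast forward. It assumes the standard recursion $h = \omega + \alpha \varepsilon^2 + \beta h$; the function names and the choice to start the filter at the unconditional variance are illustrative assumptions, not the estimation procedure used in the paper.

```python
def garch11_variance(residuals, omega, alpha, beta):
    """Filter conditional variances h_t = omega + alpha*e_{t-1}^2 + beta*h_{t-1}."""
    if alpha + beta >= 1.0:
        raise ValueError("alpha + beta must be below 1 for a stationary GARCH(1,1)")
    # Assumption: initialise at the unconditional variance omega / (1 - alpha - beta).
    h = [omega / (1.0 - alpha - beta)]
    for e in residuals[:-1]:
        h.append(omega + alpha * e * e + beta * h[-1])
    return h


def garch11_forecast_path(last_resid, last_h, omega, alpha, beta, steps):
    """Return 1- to `steps`-ahead variance forecasts from the last observation."""
    path = [omega + alpha * last_resid ** 2 + beta * last_h]
    for _ in range(steps - 1):
        # Beyond one step, E[e^2] equals the previous variance forecast.
        path.append(omega + (alpha + beta) * path[-1])
    return path
```

Summing the elements of the forecast path over the intraday intervals of the next day gives one possible daily variance forecast from an intraday model.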

FIGARCH model
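The FIGARCH model extends the conditional variance equation of the standard GARCH model by adding fractional differences in order to allow for the long memory property of the GARCH volatility process (Baillie et al. 1996; Baillie and Morana 2009). Following Baillie et al. (2000), we implement an ARMA(1,1)-FIGARCH(1,d,1) model given by
$$\tilde{r}_{t,n} = \mu + \gamma \tilde{r}_{t,n-1} + \varepsilon_{t,n} + \theta \varepsilon_{t,n-1}, \qquad \varepsilon_{t,n} \mid \Omega_{t,n-1} \sim D_v(0, h_{t,n}),$$
$$h_{t,n} = \omega + \beta h_{t,n-1} + \left[1 - \beta L_1 - (1 - \varphi L_1)(1 - L_1)^d\right] \varepsilon_{t,n}^2,$$
where $\omega$, $\beta$, and $\varphi$ are the parameters of the conditional variance equation, $d$ is the order of fractional integration, $L_1$ is the lag operator on $n$, and $D_v$ is the Student's t-distribution defined above.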
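To make the long memory mechanism concrete, the following sketch computes the ARCH(∞) weights implied by a FIGARCH(1,d,1) process, that is, the coefficients $\lambda_k$ in $h = \omega/(1-\beta) + \sum_k \lambda_k \varepsilon^2_{-k}$. It assumes the standard Baillie et al. (1996) parameterization, $\lambda(L) = 1 - (1-\beta L)^{-1}(1-\varphi L)(1-L)^d$; the function name and the truncation lag are hypothetical choices for illustration.

```python
def figarch_arch_weights(d, phi, beta, n_lags):
    """ARCH(infinity) weights lambda_1..lambda_n of a FIGARCH(1,d,1) process."""
    # Coefficients pi_k of the expansion (1 - L)^d = sum_k pi_k L^k.
    pi = [1.0]
    for k in range(1, n_lags + 1):
        pi.append(pi[-1] * (k - 1 - d) / k)
    weights = []
    g_prev = 1.0  # coefficient on L^0 of (1 - phi L)(1 - L)^d / (1 - beta L)
    for k in range(1, n_lags + 1):
        c_k = pi[k] - phi * pi[k - 1]   # coefficient of (1 - phi L)(1 - L)^d
        g_k = beta * g_prev + c_k       # division by (1 - beta L)
        weights.append(-g_k)            # lambda(L) = 1 - g(L)
        g_prev = g_k
    return weights
```

With d = 0 the weights decay geometrically at rate beta, as in a GARCH(1,1) model; with 0 < d < 1 they decay hyperbolically, which is the long memory feature discussed in the text.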

ARFIMA model
Granger (1980) and Granger and Joyeux (1980) introduce a flexible class of long memory processes based on realized volatilities not belonging to the ARCH family. It has been widely adopted in the literature when long memory properties are assumed in the data (see Martin and Wilkins 1999; Pong et al. 2003; and the references therein). The ARFIMA(p, d, q) model for a process $y_t$ is defined as
$$\phi(L_2)(1 - L_2)^d (y_t - \mu) = \theta(L_2)\varepsilon_t,$$
where $d$ is the order of fractional integration and $L_2$ is the lag operator on $t$. The AR and MA polynomial components are given as $\phi(L_2) = 1 + \phi_1 L_2 + \cdots + \phi_p L_2^p$ and $\theta(L_2) = 1 + \theta_1 L_2 + \cdots + \theta_q L_2^q$, respectively, and $\mu$ is the mean of $y_t$. In the empirical estimation of the ARFIMA(p, d, q) model, we follow Andersen et al. (2003) and replace $y_t$ by the log of the daily realized volatility [denoted as $\log(\sigma_t)$] obtained from the 15-, 30-, and 60-min returns.
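The fractional difference operator $(1 - L)^d$ at the core of the ARFIMA model can be illustrated with a short sketch. It expands the operator into its binomial weights and applies them to a demeaned series; the function names are hypothetical, and truncating the filter at the start of the sample is a simplifying assumption rather than the estimation method used in the paper.

```python
def frac_diff_weights(d, n_lags):
    """Weights w_0..w_n of (1 - L)^d = sum_k w_k L^k."""
    w = [1.0]
    for k in range(1, n_lags + 1):
        w.append(w[-1] * (k - 1 - d) / k)
    return w


def frac_diff(series, d):
    """Apply (1 - L)^d to the demeaned series, truncating at the sample start."""
    mu = sum(series) / len(series)
    w = frac_diff_weights(d, len(series) - 1)
    out = []
    for t in range(len(series)):
        out.append(sum(w[k] * (series[t - k] - mu) for k in range(t + 1)))
    return out
```

For d = 1 the weights reduce to (1, -1, 0, ...), the ordinary first difference; for fractional d they decay slowly, so distant observations retain some influence.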

5-min realized volatility
The most popular proxy for the unobservable true volatility is the realized volatility measure proposed by Andersen and Bollerslev (1998). This is obtained by aggregating the intraday squared returns. We follow this approach and use a realized volatility series constructed from the 5-min log price series, which is the highest frequency in our data. The proxy is given by
$$\hat{\sigma}^2_{rv,t} = \sum_{n=1}^{N} r_{t,n}^2,$$
where $\hat{\sigma}^2_{rv,t}$ is the realized variance for day $t$ and $r_{t,n}^2$ is the squared 5-min (log) return on day $t$ for interval $n$ ($n = 1, 2, \ldots, N$).
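A minimal sketch of this proxy, assuming the intraday returns for one trading day are already available as a list (the function name is illustrative):

```python
def realized_variance(intraday_returns):
    """Sum of squared intraday (log) returns for one trading day."""
    return sum(r * r for r in intraday_returns)
```

Zero returns contribute nothing to the sum, which is one reason a high share of zero 5-min returns can distort this proxy.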

Median-based volatility
The second proxy we exploit for true volatility is the median-based volatility measure introduced by Andersen et al. (2012). The measure is robust to jumps in the underlying return dynamics and to small (''zero'') returns. The median-based true volatility proxy is defined as
$$\hat{\sigma}^2_{med,t} = \frac{\pi}{6 - 4\sqrt{3} + \pi}\left(\frac{N}{N-2}\right)\sum_{n=2}^{N-1} \mathrm{med}\left(|\Delta r_{n-1}|, |\Delta r_n|, |\Delta r_{n+1}|\right)^2,$$
where $\hat{\sigma}^2_{med,t}$ is the median-based variance for day $t$ and $|\Delta r_n|$ is the absolute return over the $n$th interval on day $t$.
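The median-based measure can be sketched as follows, assuming the standard MedRV estimator of Andersen et al. (2012) with its finite-sample scaling constant; the function name is hypothetical.

```python
import math
from statistics import median


def median_realized_variance(returns):
    """MedRV: scaled sum of squared medians of three adjacent absolute returns."""
    n = len(returns)
    if n < 3:
        raise ValueError("at least three intraday returns are required")
    scale = math.pi / (6.0 - 4.0 * math.sqrt(3.0) + math.pi) * n / (n - 2)
    total = 0.0
    for i in range(1, n - 1):
        window = [abs(returns[i - 1]), abs(returns[i]), abs(returns[i + 1])]
        total += median(window) ** 2
    return scale * total
```

Because each term uses the median of three neighbouring absolute returns, a single jump or a single zero return in a window does not determine that window's contribution.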

Range-based volatility
The third proxy for true volatility is the range-based measure proposed by Parkinson (1980). It has been further refined and adopted in Garman and Klass (1980), Yang and Zhang (2000), and Li and Hong (2011). Taking into account daily high and low prices, this measure is able to deal with microstructure biases in the market. The proxy is defined as follows:
$$\hat{\sigma}^2_{rng,t} = \frac{1}{4\ln 2}\left(\ln H_t - \ln L_t\right)^2,$$
where $\hat{\sigma}^2_{rng,t}$ is the range-based variance for day $t$, and $H_t$ and $L_t$ are the daily high and low prices, respectively.
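A short sketch of the range-based proxy, assuming the standard Parkinson (1980) estimator with the $1/(4\ln 2)$ scaling (the function name is illustrative):

```python
import math


def parkinson_variance(high, low):
    """Range-based daily variance from the daily high and low prices."""
    if high <= 0 or low <= 0 or high < low:
        raise ValueError("prices must be positive with high >= low")
    return (math.log(high) - math.log(low)) ** 2 / (4.0 * math.log(2.0))
```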

Forecasting accuracy
We use three different metrics to evaluate the out-of-sample forecasting accuracy of the volatility models, all of which are commonly adopted statistical measures in the literature (see, for example, Ahmed et al. 2016).

Root mean squared forecast error
The root mean squared forecast error (RMSFE) compares the true volatility with the forecasted volatility from a given model and is computed as
$$\mathrm{RMSFE} = \sqrt{\frac{1}{R}\sum_{t=1}^{R}\left(\hat{\sigma}^2_{t+1} - \hat{h}_{t+1}\right)^2},$$
where $R$ is the number of daily observations, $\hat{h}_{t+1}$ is the variance forecast, and $\hat{\sigma}^2_{t+1}$ is the chosen proxy for true variance in the out-of-sample period.

Diebold and Mariano (1995) and West (1996) test
The second out-of-sample statistical metric of accuracy is the Diebold and Mariano (1995) and West (1996) MSFE-t statistic, which in our case tests whether a competing volatility model outperforms the benchmark volatility model by generating more accurate variance forecasts. We choose the benchmark model based on the lowest RMSFE. The test statistic is as follows:
$$\text{MSFE-}t = \frac{R^{-0.5}\sum_{t=1}^{R}\Delta Loss_{t+1}}{\sqrt{\hat{\Omega}}},$$
where $\Delta Loss_{t+1}$ is the difference between the squared forecast error loss functions of the benchmark and competing volatility models and $\hat{\Omega}$ is the consistent estimate of the asymptotic variance of $R^{-0.5}\sum_{t=1}^{R}\Delta Loss_{t+1}$. The null hypothesis can be expressed as $H_0: E[\Delta Loss_{t+1}] = 0$. Since the volatility models are non-nested, the alternative hypothesis in this case is two-sided. The test statistic follows an asymptotic standard normal distribution under the null hypothesis of equal predictive ability. We regress $\Delta Loss_{t+1}$ on a constant and obtain the MSFE-t statistic for a zero coefficient based on the Andrews and Monahan (1992) estimator. A positive (negative) and statistically significant MSFE-t statistic suggests that the competing model outperforms (is outperformed by) the benchmark volatility model.
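The two metrics above can be sketched in a few lines. The RMSFE follows the definition in the text. For the MSFE-t statistic, the sketch substitutes a simple Bartlett-kernel (Newey-West type) long-run variance for the Andrews and Monahan (1992) estimator used in the paper, so it illustrates the mechanics only; the function names and the `lags` parameter are assumptions.

```python
import math


def rmsfe(proxy, forecast):
    """Root mean squared forecast error against a true-variance proxy."""
    r = len(proxy)
    return math.sqrt(sum((p - f) ** 2 for p, f in zip(proxy, forecast)) / r)


def msfe_t(proxy, benchmark, competitor, lags=0):
    """MSFE-t statistic; positive values favour the competing model."""
    # Loss differential: benchmark squared error minus competitor squared error.
    d = [(p - b) ** 2 - (p - c) ** 2 for p, b, c in zip(proxy, benchmark, competitor)]
    r = len(d)
    mean_d = sum(d) / r
    dev = [x - mean_d for x in d]
    # Assumption: Bartlett-kernel long-run variance, not Andrews-Monahan.
    lrv = sum(x * x for x in dev) / r
    for j in range(1, lags + 1):
        gamma_j = sum(dev[t] * dev[t - j] for t in range(j, r)) / r
        lrv += 2.0 * (1.0 - j / (lags + 1)) * gamma_j
    return math.sqrt(r) * mean_d / math.sqrt(lrv)
```

The statistic would be compared with standard normal critical values, in line with the two-sided test described in the text.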

Superior predictive ability test
To address the multiple-testing problem in the light of data mining, we conduct the superior predictive ability (henceforth SPA) test of Hansen (2005). Under the composite null hypothesis, no competing volatility model has superior predictive ability. In other words, the null states that the benchmark model is not inferior to any of the alternative models. A rejection of the null hypothesis indicates that at least one competing model produces forecasts more accurate than the benchmark. Once again, we choose the benchmark model based on the lowest RMSFE and evaluate the out-of-sample forecasts based on the MSFE. For inference, we report stationary bootstrap p values obtained using 10,000 replications.

Data and estimation
The data come from the GTA Information Technology Company. We obtain contract ID, trading date, trading time, trading venue, contract expiry date, last recorded (Renminbi) price, high and low prices, and volume for 5-min time series on four commodity futures contracts: aluminum, copper, fuel oil, and sugar. The full sample period as well as the in-sample and out-of-sample periods for each commodity are provided in Table 1. 6;7 In Panel D, we find seasonality in trading volume for each contract over the full sample period.
6 The starting and ending dates of the four commodities are constrained by data availability.
7 Chortareas et al. (2011) and Liu et al. (2014) adopt similar sample periods for the out-of-sample forecasting exercise with foreign exchange and commodity futures data, respectively.
More precisely, we observe that in terms of average number of contracts traded for each delivery, there is not much variation across the 12 delivery months for aluminum and copper, and there is a slight variation for fuel oil. In other words, the number of contracts traded is relatively stable all year round. However, with only six delivery months per year, sugar shows a notable variation in the average number of contracts traded across the delivery months. In particular, contracts for January, May, and September exhibit huge trading volumes, while contracts for March, July, and November show the opposite. The trading volume for January delivery is the highest on average with more than 5.6 million contracts, whereas for November delivery the average trading volume is the lowest at 18,418 contracts, about 0.32 % of that for January delivery. This striking yet interesting variation naturally raises the question of how much the volatility dynamics for these two delivery months differ, if they differ at all. Hence, in the empirical exercises, we examine two futures contract series for sugar, one for the very liquid January delivery and the other for the very illiquid November delivery.

Table 2. Notes: The table reports descriptive statistics of liquidity for aluminum, copper, fuel oil, and sugar contracts at the 5-min interval using three liquidity measures. Roll refers to the effective spread of Roll (1984) ($\times 10^3$); Zeros is the proportion of 5-min zero returns during a trading day (in per cent); and Amihud is the illiquidity measure of Amihud (2002) ($\times 10^8$). The futures contracts are grouped according to their time to delivery. The full sample period for each commodity futures contract series is reported in Table 1.

In Table 2, we report descriptive statistics of three measures adopted to describe liquidity of futures contracts at the 5-min interval, which is the highest sampling frequency in our data.
For aluminum, the Roll spread measure for nearby contracts averages 0.0006, zero returns account for 61 % of all 5-min returns on average in a trading day, and the scaled Amihud measure is 0.23. Comparing these figures to those for the contracts with 3 months to delivery, we notice a marked improvement. In particular, the Roll spread drops to 0.0004, the percentage of zero returns decreases to 36 %, and the scaled Amihud illiquidity measure drops to 0.03. The liquidity of the futures contract series subsequently worsens with longer time to delivery. That is, aluminum contracts with 3 months to delivery are the most liquid, and liquidity decreases for contracts with longer or shorter time to maturity. The pattern is mirrored in the liquidity estimators for the other commodities as well. Hence, in our volatility estimation and forecasting exercises for aluminum, copper, and fuel oil, we use futures contracts with 3 months to delivery, as they are the most liquid among all maturities and their volatility forecasts are the least likely to be biased by the large proportion of zero returns. When constructing the time series of returns with 3 months to maturity for aluminum, copper, and fuel oil, we use prices of a contract from the third month prior to its delivery month until the contract reaches the first day of the second month prior to its delivery month. We then switch to the next contract, which matures in 3 months, to obtain a continuous time series. Hence, for these three commodities, the contract time to maturity is always around 3 months. For sugar futures, however, we are mostly interested in the effect that seasonality in trading volume has on volatility forecasting. Therefore, we take contracts from January to December for the next January delivery and from November to October for the next November delivery. As a result, the contract time to maturity changes over time. The practice of switching contracts to the next delivery month is common in the literature (see, for example, Baillie et al. 2007).
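The three liquidity measures discussed above can be sketched as follows. The paper's exact definitions are given in its Appendix, which is not reproduced here, so the code uses common textbook forms as assumptions: the Roll spread as $2\sqrt{-\mathrm{cov}}$ of consecutive returns (set to zero when the covariance is positive), the share of exactly-zero returns in per cent, and the Amihud measure as the average of absolute return over volume. All function names are hypothetical.

```python
import math


def zero_return_share(returns):
    """Percentage of returns that are exactly zero."""
    return 100.0 * sum(1 for r in returns if r == 0.0) / len(returns)


def roll_spread(returns):
    """Roll-type effective spread from the first-order return autocovariance."""
    x, y = returns[1:], returns[:-1]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    # Assumption: report zero when the autocovariance is non-negative.
    return 2.0 * math.sqrt(-cov) if cov < 0 else 0.0


def amihud_illiquidity(returns, volumes):
    """Average absolute return per unit of volume, skipping zero-volume intervals."""
    ratios = [abs(r) / v for r, v in zip(returns, volumes) if v > 0]
    return sum(ratios) / len(ratios)
```

Lower values of all three measures indicate a more liquid contract, which is the sense in which the text compares contracts across times to delivery.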
In our sample, all commodity futures are traded for 4 h on a trading day, starting at 9:00 a.m. and closing at 3:00 p.m. with a 2-h break between 11:30 a.m. and 1:30 p.m. As a result, there are 48 5-min returns on any business day. The (log) return $r_{t,n}$ on trading day $t$ for the $n$th interval is computed as
$$r_{t,n} = \ln P_{t,n} - \ln P_{t,n-1}, \qquad (10)$$
where $P_{t,n}$ denotes the commodity futures price on day $t$ at the end of the $n$th interval. The 15-, 30-, and 60-min returns are obtained by taking the logarithmic difference between prices that are 15, 30, and 60 min apart. The daily returns are computed as $r_t = \ln P_t - \ln P_{t-1}$.
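The construction of returns at lower intraday frequencies can be illustrated with a small sketch. Because log returns are additive, summing k consecutive 5-min returns gives the same result as differencing log prices that are 5k minutes apart. The function names are illustrative assumptions.

```python
import math


def log_returns(prices):
    """Log returns between consecutive prices."""
    return [math.log(b) - math.log(a) for a, b in zip(prices, prices[1:])]


def aggregate_returns(returns_5min, k):
    """Sum non-overlapping blocks of k 5-min returns (k=3 gives 15-min returns)."""
    if len(returns_5min) % k != 0:
        raise ValueError("number of returns must be a multiple of k")
    return [sum(returns_5min[i:i + k]) for i in range(0, len(returns_5min), k)]
```

With 48 5-min returns per day, k = 3, 6, and 12 yield 16, 8, and 4 returns at the 15-, 30-, and 60-min frequencies.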
In Table 3, we provide descriptive statistics of commodity futures contract returns at 5-, 15-, 30-, 60-min and daily intervals. We notice that the average returns are very close to zero irrespective of contracts and data frequencies. Returns are left skewed with fat tails, although the degree of negative skewness and excess kurtosis tends to drop with decreasing sampling frequency. In addition, the percentage of zero returns drops considerably from the 5-min to daily intervals. For example, for fuel oil it is 31.50 % at the 5-min interval and 17 % at the 15-min interval, but only 3.60 % at the daily level. The trade-off between the improvement in data quality and the loss of information at lower frequencies could be crucial for the outcome of volatility measurement and forecasting exercises. In Fig. 1, we plot the time series of 30-min returns for aluminum, copper, fuel oil, and sugar with January delivery as an example of the data we employ in this paper.
The volatility of intraday returns is known to display periodicity within a trading day, which could contaminate the estimation of conventional volatility models (Andersen and Bollerslev 1997). Following Taylor and Xu (1997), we estimate a simple seasonality term $S_{t,n}$ by averaging the squared returns for each intraday period as follows:
$$\hat{S}^2_{t,n} = \frac{1}{T}\sum_{t=1}^{T} r_{t,n}^2, \qquad (11)$$
where $T$ is the number of trading days in the full sample period. The deseasonalized intraday returns are obtained as
$$\tilde{r}_{t,n} = \frac{r_{t,n}}{\hat{S}_{t,n}}. \qquad (12)$$
We then make use of the deseasonalized returns to estimate the intraday GARCH family of models. In the out-of-sample forecasting, the intraday forecasts are based on the deseasonalized returns and are therefore transformed back to forecasts for the original returns. This is implemented as follows:
$$\hat{h}_{t,n} = \hat{S}^2_{t,n} \times \tilde{h}_{t,n}, \qquad (13)$$
where $\tilde{h}_{t,n}$ is the intraday variance forecast using the deseasonalized returns and $\hat{h}_{t,n}$ is the transformed variance forecast for the original returns. We produce one-step ahead daily volatility forecasts for daily models. For intraday models, we produce 16-, 8-, and 4-step ahead forecasts for the 15-, 30-, and 60-min intervals, respectively, and aggregate them into daily forecasts. The ARFIMA model is fitted directly to daily realized volatility aggregated from intraday returns. The out-of-sample forecasts are evaluated against the daily true volatility proxies described earlier. For all sampling frequencies, we use a rolling window forecasting scheme to obtain forecasts from all volatility models.
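The deseasonalization and back-transformation steps can be sketched as follows, assuming returns are stored as one list per trading day with the same number of intraday intervals. The function names are hypothetical, and the sketch assumes every interval has at least one non-zero return so that the seasonal variance is positive.

```python
import math


def seasonal_variance(returns_by_day):
    """Average squared return for each intraday interval across all days."""
    n_days = len(returns_by_day)
    n_intervals = len(returns_by_day[0])
    return [sum(day[n] ** 2 for day in returns_by_day) / n_days
            for n in range(n_intervals)]


def deseasonalize(returns_by_day, s2):
    """Divide each return by the seasonal standard deviation of its interval."""
    # Assumption: every entry of s2 is strictly positive.
    return [[r / math.sqrt(s) for r, s in zip(day, s2)] for day in returns_by_day]


def daily_variance_forecast(deseasonalized_forecasts, s2):
    """Rescale intraday variance forecasts and sum them into a daily forecast."""
    return sum(s * h for s, h in zip(s2, deseasonalized_forecasts))
```

For a 15-min model, for instance, `deseasonalized_forecasts` would hold the 16 step-ahead variance forecasts for the next trading day.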

Empirical analysis

In-sample results
We report the in-sample parameter estimates of the intraday GARCH, FIGARCH, and IGARCH models for five futures contracts at 15-, 30-, and 60-min intervals in Table 4. [...] estimates $\hat{\gamma}$ are statistically significant at conventional levels. Also, the MA parameter estimate $\hat{\theta}$ is significantly negative in most cases, capturing the first order negative autocorrelation in the returns. All the parameters in the conditional variance equations are highly significant at the 1 % level except $\hat{\alpha}$ for 15-min copper contracts. The fact that $\hat{\alpha} + \hat{\beta} < 1$ reveals that the GARCH process is stationary, and, since $\hat{\alpha} + \hat{\beta}$ is close to 1, the volatility process is persistent. For the contract series with return innovations following a Student's t-distribution, the degrees of freedom parameter is between 2 and 4 and statistically significant at the 1 % level. This indicates a fat tail in the return distributions.

Table 4. Notes: The table reports the in-sample parameter estimates of the intraday GARCH, FIGARCH, and IGARCH models. In all panels, estimates are obtained using 15-, 30-, and 60-min deseasonalized intraday returns. The models are estimated using quasi-maximum likelihood with Student's t-distributed innovations with $v$ degrees of freedom. Only for fuel oil, the GARCH model at the 15-min interval, and for sugar (November), the GARCH, FIGARCH, and IGARCH models at the 15-, 30-, and 60-min intervals, are estimated assuming a normal distribution. Numbers in parentheses are t-statistics, and ***, **, and * indicate statistical significance at the 1, 5, and 10 % levels, respectively. The in-sample period for each commodity futures contract is reported in Table 1.
In Panel B, when the volatility process is described by an ARMA(1,1)-FIGARCH(1,d,1) model, we notice that the parameter $d$, the order of fractional integration, is significantly different from zero at the 1 % level for all futures contract series. This implies that the volatility process exhibits a long memory property and attests to the importance of adding this feature in the volatility dynamics of the commodity futures contract returns under scrutiny. It is also worth noting that, similar to the results in Panel A, the degrees of freedom parameter $v$ is highly significant. Panel C shows the parameter estimates of the ARMA(1,1)-IGARCH(1,1) model specification and the results are qualitatively similar to those in Panel A. Table 5 shows the in-sample parameter estimates for the daily GARCH, FIGARCH, and IGARCH models. These results are qualitatively similar to those in Table 4. We observe: (1) negative and significant first order autocorrelation in the conditional mean equation for each model and contract except for the daily IGARCH model using the sugar contract with January delivery; (2) statistically significant $\hat{\beta}$ parameters; (3) highly significant fractional integration parameters $\hat{d}$; and (4) highly significant degrees of freedom parameters $\hat{v}$.
We present the in-sample parameter estimates of the ARFIMA model using the daily realized volatility obtained from the 15-, 30-, and 60-min returns in Table 6. For aluminum, copper, and fuel oil, we set the MA term $q = 0$ as it is statistically insignificant at any conventional level. The first order autoregression term $\hat{p}$ is negative and highly significant and the fractional integration term $\hat{d}$ hovers around 0.4 for each of these three commodities. In the cases of the January and November contracts for sugar, the first order autocorrelation $\hat{p}$ tends to be positive and quite often significant. The MA parameter $\hat{q}$ is close to $-0.4$ and significant at the 1 % level. Similar to other commodities, the fractional integration parameter estimate for sugar is in the vicinity of 0.45 and is highly significant.
Overall, the in-sample estimates of the GARCH, FIGARCH, IGARCH, and ARFIMA models reported in Tables 4, 5, and 6 using intraday and daily data reveal that, for the four commodities, the return innovations are generally negatively autocorrelated with fat tails. Moreover, the underlying volatility processes are persistent with clear evidence of long memory properties.

Table 5. In-sample parameter estimation of the daily GARCH, FIGARCH, and IGARCH models. Notes: The table reports the in-sample parameter estimates of the daily GARCH, FIGARCH, and IGARCH models. The models are estimated using quasi-maximum likelihood with Student's t-distributed innovations with $v$ degrees of freedom. Numbers in parentheses are t-statistics, and ***, **, and * indicate statistical significance at the 1, 5, and 10 % levels, respectively. The in-sample period for each commodity futures contract is reported in Table 1.

Out-of-sample predictions
Table 7 reports RMSFEs for all volatility models, where forecast errors are computed in comparison with three alternative true volatility proxies. In Panel A, we use the most widely exploited proxy in the literature, namely, the realized volatility measure constructed from the 5-min returns. It is interesting to notice that for aluminum and copper futures contracts, the IGARCH and FIGARCH models produce the smallest RMSFEs, respectively, and both at the daily level. This preliminary evidence suggests that for this particular true volatility proxy, used in computing forecast errors, information contained in intraday prices does not help in generating more accurate volatility forecasts. For fuel oil, the 30-min FIGARCH model produces the smallest RMSFE. It is also interesting to observe that although the January and November deliveries for sugar contracts differ massively in terms of trading volume (see Table 1), the ARFIMA model utilizing the daily realized volatility obtained from the 15-min returns provides the best forecasts for both futures contracts.
In Panel B, we consider median-based daily volatility as a proxy for true volatility. In this case, the ARFIMA model beats the rest of the competing models by producing the lowest RMSFE. More precisely, the ARFIMA model outperforms the other models for copper, fuel oil, and sugar (both January and November deliveries) when the daily realized volatility is obtained from the 60-min returns. For aluminum, it is the ARFIMA model using the daily realized volatility computed from the 30-min returns. Finally, in Panel C, we make use of range-based volatility as the true volatility proxy. Once again, the ARFIMA model is the best performing model for four out of five commodity futures contracts. In particular, the ARFIMA model applied to the daily realized volatility obtained from the 15-min returns leads to the lowest RMSFE for copper. For aluminum and the January and November deliveries of sugar contracts, it is the ARFIMA model applied to the daily realized volatility based on the 60-min returns. Fuel oil is the only exception, for which the daily IGARCH model provides the most accurate out-of-sample variance forecasts.
Taken together, we notice three interesting and consistent patterns from the preliminary results in Table 7. First, the ARFIMA model, with its long memory dimension, dominates the other three volatility models in 11 out of 15 commodity/true volatility proxy combinations. Second, GARCH-type models using daily data outperform similar models using intraday data. Third, the ARFIMA model applied to the daily realized volatility obtained from the higher frequency returns (i.e., 15-min returns) does not always beat the ARFIMA model using the daily realized volatility computed from the lower frequency returns. The latter two observations are novel for our chosen futures market because the literature seems to agree that intraday data enjoy an informational advantage over daily data and that the forecasting performance of the ARFIMA model improves with sampling frequency (Martens 2001; Martens and Zein 2004). We plot in Fig. 2 the time series of forecast errors of the ARFIMA model and the GARCH model using 30-min returns when the benchmark is the median-based volatility measure. It is quite evident that for the two products depicted in this figure, the ARFIMA model provides smaller forecast errors over time.

The table reports the in-sample parameter estimates of the ARFIMA(p, d, q) model using the daily realized volatility computed from the 15-, 30-, and 60-min returns. Numbers in parentheses are t-statistics, and ***, **, and * indicate statistical significance at the 1, 5, and 10 % levels, respectively. The in-sample period for each commodity futures contract is reported in Table 1.
In Table 8, we provide pair-wise comparisons following the well-known Diebold and Mariano (1995) and West (1996) test based on the Andrews and Monahan (1992) estimator. We choose the benchmark model in each case as the one with the lowest RMSFE in Table 7. The results suggest that the competing model forecasts are either as accurate statistically as the benchmark model, or, in most cases, significantly worse. It is interesting to notice that in Panel A, for aluminum, the ARFIMA model utilizing the daily realized volatility from the 15-, 30-, and 60-min returns produces inferior forecasts but the difference from the benchmark is statistically insignificant. Put differently, the null hypothesis of equal MSFEs cannot be rejected at any conventional level. In fact, for all model/true volatility proxy combinations, whenever the best performing model utilizes daily data, the ARFIMA model provides forecasts just as good statistically. These include the daily IGARCH model for aluminum and the daily FIGARCH model for copper in Panel A, and the daily IGARCH model for fuel oil in Panel C. For other model/true volatility proxy combinations, the competing models tend to produce statistically inferior forecasts, including both sugar contracts in Panels A and C.
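As a rough illustration of the pair-wise comparison, the sketch below computes a Diebold and Mariano type statistic for equal MSFEs. It is not the procedure used for Table 8: the long-run variance is estimated with a simple Bartlett (Newey-West) kernel instead of the Andrews and Monahan (1992) prewhitened estimator, the bandwidth rule is a common rule of thumb, and the function name `dm_test` and the simulated errors are hypothetical.

```python
import math
import numpy as np

def dm_test(e_bench, e_comp, lag=None):
    """Diebold-Mariano type statistic under squared-error loss.

    A positive statistic means the competing model has the larger MSFE.
    ASSUMPTION: Bartlett-kernel long-run variance, not Andrews-Monahan.
    """
    e_bench = np.asarray(e_bench, dtype=float)
    e_comp = np.asarray(e_comp, dtype=float)
    d = e_comp ** 2 - e_bench ** 2            # loss differential series
    T = d.size
    if lag is None:
        # Rule-of-thumb bandwidth (an assumption, not from the paper).
        lag = int(math.floor(4.0 * (T / 100.0) ** (2.0 / 9.0)))
    u = d - d.mean()
    lrv = np.dot(u, u) / T                    # lag-0 autocovariance
    for k in range(1, lag + 1):
        weight = 1.0 - k / (lag + 1.0)        # Bartlett weights
        lrv += 2.0 * weight * np.dot(u[k:], u[:-k]) / T
    if lrv <= 0.0:
        raise ValueError("long-run variance estimate is not positive")
    stat = d.mean() / math.sqrt(lrv / T)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return float(stat), float(p_value)

# Simulated forecast errors: the competitor is clearly noisier.
rng = np.random.default_rng(0)
bench_errors = rng.normal(0.0, 1.0, size=500)
comp_errors = rng.normal(0.0, 3.0, size=500)
stat, p_value = dm_test(bench_errors, comp_errors)
```

Swapping the two error series flips the sign of the statistic while leaving the p-value unchanged, so the test is symmetric in which model is labeled the benchmark.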
As a robustness check, we provide the Diebold and Mariano (1995) and West (1996) test results obtained by sequentially using each volatility model as the benchmark, based on their increasing RMSFEs, against the remaining alternative models in Tables 10, 11 and 12. These additional results corroborate the conclusion in Table 8 that the benchmark, chosen as the one with the lowest RMSFE in Table 7, is indeed the one with the best volatility forecasting ability.
In Table 9, we perform the SPA test of Hansen (2005) to examine out-of-sample forecasting ability across all competing models and compute the stationary bootstrap p values. The null hypothesis is that the benchmark model, the one with the lowest RMSFE, is not inferior to any of the competing models. The test results are resounding: the stationary bootstrap p values are 1 or very close to it, so the null hypothesis that the benchmark model is at least as good as the competing models in out-of-sample volatility forecasting cannot be rejected. Taken together, the results in Tables 8 and 9 clearly confirm and substantiate the observations in Table 7. In other words, when intraday data are directly used in the GARCH-type models, they are no better than daily data for volatility forecasting even after deseasonalization. Hence, if a model is to be recommended for volatility forecasting in the Chinese futures market, it would be the ARFIMA model, as it is consistently the best performing model or not inferior to the best performing one statistically. Finally, we note that although sugar contracts for January and November deliveries differ in terms of trading volume and liquidity, the underlying volatility dynamics are very similar. The in-sample parameter estimates are similar between these two series and both are best forecasted by the same model. When the 5-min realized volatility is the proxy for true volatility, the ARFIMA model using the realized volatility computed from the 15-min returns produces the most accurate forecast for both series, while the ARFIMA model applied to the realized volatility computed from the 60-min interval outperforms competing models for the other two volatility proxies for both series. In other words, seasonality in trading volume and differences in liquidity do not affect volatility model selection.

This table reports the daily out-of-sample RMSFEs ($\times 10^{-5}$) for all models relative to the true volatility proxies: 5-min realized volatility (Panel A), median-based volatility (Panel B), and range-based volatility (Panel C). The out-of-sample period for each commodity futures contract is reported in Table 1.

Table 8
Diebold and Mariano (1995) and West (1996)
The table reports the test statistics of the Diebold and Mariano (1995) and West (1996) test based on the Andrews and Monahan (1992) estimator. The benchmark models are those with the lowest RMSFE in Table 7. The forecast errors are computed relative to 5-min realized volatility (Panel A), median-based volatility (Panel B), and range-based volatility (Panel C) measures. ***, **, and * indicate statistical significance at the 1, 5, and 10 % levels, respectively. The out-of-sample period for each commodity futures contract is reported in Table 1.

Conclusion
In this paper, we undertake a comprehensive volatility forecasting exercise in a futures market with unique institutional regulations. In the Chinese commodity futures market, the margin rate is time-dependent and investors face higher deposits as contracts move closer to maturity. In addition, although individuals account for the majority of investors, they are not allowed to trade nearby contracts. These two regulations result in a liquidity pattern whereby contracts with 3 months to delivery are the most liquid, and we demonstrate this by computing three popular liquidity measures with 5-min intraday data for aluminum, copper, fuel oil, and sugar. In addition, even these most liquid contract series contain a large percentage of zero returns at the 5-min interval. We explicitly take these features into account when forecasting volatility and utilize the more distant contracts with 3 months to maturity at the daily and three different intraday sampling frequencies. We demonstrate that the long memory dimension is present in our data in the in-sample volatility modeling. When it comes to out-of-sample forecasting, we show that the ARFIMA model, which aggregates intraday returns to the daily level in generating daily forecasts, is the best-performing model, or equivalent to the best-performing model in statistical terms. The FIGARCH model, which also incorporates the long memory feature in the volatility dynamics, is less efficient in generating forecasts, probably because large proportions of intraday returns are zero and the deseasonalized intraday returns are directly fed into the model. Furthermore, we show that within the GARCH family of models, the forecasting performance using the daily data is consistently as good as, if not better than, that using the intraday data, which also attests to the trade-off between information and noise in the intraday data with many zero returns.
Finally, it is interesting to note that even though January and November contract series for sugar differ massively in terms of trading volume, their underlying volatility dynamics are well captured and forecasted by the ARFIMA model at the same data sampling frequency.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix: Liquidity measures
We use three liquidity estimators widely adopted in the literature to describe the liquidity of the Chinese commodity futures contracts. They are the effective spread of Roll (1984), the proportion of zero returns as in Lesmond et al. (1999), and the Amihud (2002) illiquidity estimator. These measures are shown to perform quite well in capturing different aspects of asset liquidity (Goyenko et al. 2009) (Tables 10, 11, 12).

Roll spread
In the seminal paper of Roll (1984), a simple serial covariance spread estimation model is developed to capture asset liquidity. The effective spread is derived from the serial covariance properties of transaction price changes. The model has led to a burgeoning research area in the market microstructure literature with many modifications and extensions (see George et al. 1991;Chang and Chang 1993; and the references therein).
To illustrate, let $E$ and $P_t$ denote the effective spread and the closing price on day $t$, respectively, and let $\Delta$ be the change operator. Roll (1984) shows that the serial covariance between changes in prices is

$$\mathrm{Cov}(\Delta P_t, \Delta P_{t-1}) = -\frac{E^2}{4}.$$

In this paper, we follow Goyenko et al. (2009) and adopt a modified version of the Roll (1984) spread so that we can always obtain a numerical value for this liquidity measure. Denoting the price change over the nth time interval as $\Delta P_n$, the effective spread can be expressed as follows:

$$E = \begin{cases} 2\sqrt{-\mathrm{Cov}(\Delta P_n, \Delta P_{n-1})} & \text{if } \mathrm{Cov}(\Delta P_n, \Delta P_{n-1}) < 0, \\ 0 & \text{otherwise.} \end{cases}$$

Hence, the lower the effective spread, the higher the liquidity of the asset.
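A minimal sketch of the modified Roll spread follows, assuming the standard form in which the spread equals twice the square root of the negated first-order serial covariance of price changes and is set to zero when that covariance is non-negative. The function name `roll_spread` and the price paths are hypothetical illustrations.

```python
import numpy as np

def roll_spread(prices):
    """Modified Roll (1984) effective spread from a series of prices.

    ASSUMPTION: spread = 2 * sqrt(-cov) when the serial covariance of
    price changes is negative, and 0 otherwise.
    """
    prices = np.asarray(prices, dtype=float)
    dp = np.diff(prices)                      # price changes
    if dp.size < 3:
        raise ValueError("need at least four prices")
    # Sample covariance between consecutive price changes.
    cov = np.cov(dp[1:], dp[:-1])[0, 1]
    return float(2.0 * np.sqrt(-cov)) if cov < 0 else 0.0

# Prices bouncing between two levels: negative serial covariance.
bounce = [10.0, 10.1, 10.0, 10.1, 10.0, 10.1]
# Steadily accelerating prices: positive serial covariance.
trend = [10.0, 10.1, 10.3, 10.6, 11.0]

spread_bounce = roll_spread(bounce)
spread_trend = roll_spread(trend)
```

The bouncing path yields a strictly positive spread, while the trending path is assigned a spread of zero, which is the case the modification is designed to handle.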

Table 10
Diebold and Mariano (1995) and West (1996)
The table reports the test statistics of the Diebold and Mariano (1995) and West (1996) test based on the Andrews and Monahan (1992) estimator. Based on the results of the RMSFE presented in Table 7, the benchmark models are chosen in terms of increasing RMSFE. ***, **, and * indicate statistical significance at the 1, 5, and 10 % levels, respectively. The forecast errors for all models are computed relative to 5-min measure of true volatility. The out-of-sample period for each commodity futures contract is reported in Table 1.

Table 11
Diebold and Mariano (1995) and West (1996)
The table reports the test statistics of the Diebold and Mariano (1995) and West (1996) test based on the Andrews and Monahan (1992) estimator. Based on the results of the RMSFE presented in Table 7, the benchmark models are chosen in terms of increasing RMSFE. ***, **, and * indicate statistical significance at the 1, 5, and 10 % levels, respectively. The forecast errors for all models are computed relative to the median-based measure of true volatility. The out-of-sample period for each commodity futures contract is reported in Table 1.

Table 12
Diebold and Mariano (1995) and West (1996)
The table reports the test statistics of the Diebold and Mariano (1995) and West (1996) test based on the Andrews and Monahan (1992) estimator. Based on the results of the RMSFE presented in Table 7, the benchmark models are chosen in terms of increasing RMSFE. ***, **, and * indicate statistical significance at the 1, 5, and 10 % levels, respectively. The forecast errors for all models are computed relative to the range-based measure of true volatility. The out-of-sample period for each commodity futures contract is reported in Table 1.

Proportion of zero returns
The second liquidity measure we exploit is proposed in Lesmond et al. (1999) and proves especially useful and effective in studying liquidity of emerging markets (see, among others, Bekaert et al. 2007; Lesmond 2005). This measure is based on the transaction cost, that is, if the value of an information signal is insufficient to outweigh the cost associated with trading, market participants will choose not to trade, resulting in a zero return. The measure is easy to implement since it only requires a time series on transaction data. In this paper, the proportion of zero returns in a trading day is defined as follows:

$$\text{Zeros} = \frac{\#\text{ of intraday time intervals with zero returns}}{N},$$

where $N$ is the total number of time intervals in a trading day ($n = 1, 2, \ldots, N$). Intuitively, the lower is the proportion of zero returns, the better is the liquidity of the asset.
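The zero-return proportion is straightforward to compute. The sketch below assumes that the returns are log price differences over consecutive intraday intervals of one trading day; the function name `zero_return_proportion` and the price series are hypothetical.

```python
import numpy as np

def zero_return_proportion(prices):
    """Share of intraday intervals in one trading day with a zero return."""
    prices = np.asarray(prices, dtype=float)
    if prices.size < 2:
        raise ValueError("need at least two prices")
    returns = np.diff(np.log(prices))         # N interval log returns
    # An unchanged price gives a log return of exactly zero.
    return float(np.mean(returns == 0.0))

# Six interval-end prices give N = 5 returns, three of which are zero.
day_prices = [100.0, 100.0, 101.0, 101.0, 101.0, 102.0]
zeros = zero_return_proportion(day_prices)
```

Here three of the five intervals show no price change, so the measure equals 0.6 for this made-up day.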

Amihud illiquidity measure
The illiquidity measure of Amihud (2002) is another popular estimator in the literature (see, among others, Baker and Stein 2004; Amihud et al. 2012). It is a price impact measure that captures the price response associated with one unit currency of trading volume. Hence, the lower is the illiquidity measure, the better is the asset liquidity. More precisely, it is defined as the ratio given by

$$\text{Illiq} = \frac{1}{N}\sum_{n=1}^{N}\frac{|r_n|}{\text{Volume}_n},$$

where $r_n$ is the log return of the asset over the nth time interval and $\text{Volume}_n$ is the US dollar (in our case, Renminbi) trading volume over the same interval.
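A minimal sketch of an Amihud-style calculation is given below. It assumes that the daily measure is the average of absolute return over currency volume across the intraday intervals, and that intervals with zero volume are skipped; both choices, the function name `amihud_illiquidity`, and the numbers are assumptions made for illustration.

```python
import numpy as np

def amihud_illiquidity(returns, volume):
    """Average of |return| / currency volume across intraday intervals.

    ASSUMPTION: intervals with zero volume are dropped to avoid
    division by zero; the source does not state how they are treated.
    """
    returns = np.asarray(returns, dtype=float)
    volume = np.asarray(volume, dtype=float)
    if returns.shape != volume.shape:
        raise ValueError("returns and volume must have the same shape")
    traded = volume > 0
    if not traded.any():
        raise ValueError("no interval with positive volume")
    return float(np.mean(np.abs(returns[traded]) / volume[traded]))

# Made-up interval log returns and Renminbi trading volumes.
interval_returns = [0.01, -0.02, 0.0]
interval_volume = [1000.0, 2000.0, 500.0]
illiq = amihud_illiquidity(interval_returns, interval_volume)
```

A larger value indicates that a given amount of trading moves the price more, that is, lower liquidity.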