Forecasting models for the Chinese macroeconomy: the simpler the better?

We consider the problem of macroeconomic forecasting for China. Our objective is to determine whether well-established forecasting models that are commonly used to compute forecasts for Western macroeconomies are also useful for China. Our study includes 19 different forecasting models, ranging from simple approaches such as the naive forecast to more sophisticated techniques such as ARMA, Bayesian VAR, and factor models. We use these models to forecast two different measures of price inflation and two different measures of real activity, with forecast horizons ranging from 1 to 12 months, over a period that stretches from March 2005 to December 2018. We test null hypotheses of equal mean squared forecasting error between each candidate model and a simple benchmark. We find evidence that AR, ARMA, VAR, and Bayesian VAR models provide superior 1-month-ahead forecasts of the producer price index when compared to simple benchmarks, but find no evidence of superiority over simple benchmarks at longer horizons, or for any of our other variables.


Introduction
With an annual GDP of around US$13.6 trillion 1 in 2018, the Chinese economy is now the second largest in the world and, if current growth rates continue, it will become the largest within the next few years. From being a predominantly closed country in the 1970s, China is now well integrated into the world economy. As examples, in 2017-2018, it accounted for 33.7% of Australia's exports, 2 in February 2019 it held approximately 18% of US Treasury Securities, 3 and in 2018 it provided 20% of the imports of the European Union. 4 Surprisingly, despite the importance of the Chinese economy, there is a paucity of published academic research on forecasting the Chinese macroeconomy. This stands in stark contrast to the published literature on forecasting the macroeconomies of Western countries, which covers a wide range of forecasting models over long periods of time. Examples include Ang et al. (2007), Artis et al. (2005), Atkeson and Ohanian (2001), Bandt et al. (2007), Camba-Mendez et al. (2001), Ciccarelli and Mojon (2010), Faust and Wright (2013), Forni et al. (2003), Marcellino et al. (2003), Pincheira and Medel (2015), den Reijer (2005), Schneider and Spitzer (2004), Schumacher (2007), Stock and Watson (1999), Stock and Watson (2002b), and Stock and Watson (2008).
In this paper, we consider the following question: Do the models that are well established for forecasting the Western macroeconomies also work well for China? Although one might hope that this is the case, the relatively short time span of available Chinese macroeconomic data, variations in data quality and the rapid pace of ongoing structural change could limit the effectiveness of forecasts obtained using established methods. We are particularly interested in determining whether more sophisticated forecasting models that require parameter estimation (e.g. ARMA models, VAR models, and factor models) and forecasting models with fixed tuning parameters (e.g. moving averages and exponential smoothing models) provide superior forecasts to simple parameter-free methods (e.g. the naive forecast and the mean forecast).
The scholarly journals have published a small number of papers on forecasting Chinese macroeconomic variables, but none are sufficient to answer the question that we pose above. Mehrotra and Sánchez-Fung (2008), Lin and Wang (2013), Kamal (2013), Zhou et al. (2013), andHe andFan (2015) all conduct comparisons of alternative forecasting models for Chinese macroeconomic variables, and between them cover a broad range of models. However, these studies consider only a very small number of forecasts. Mehrotra and Sánchez-Fung (2008) and Lin and Wang (2013) provide forecasts for a maximum of 12 months only; Kamal (2013) generates annual out-of-sample forecasts for each of 12 years; Zhou et al. (2013) consider only six out-of-sample months, and He and Fan (2015) produce forecasts over only five quarters. Stekler and Zhang (2013) consider forecasts produced by the IMF, OECD, and private sector forecasters, but their analysis covers only 10-12 time periods. All of the above papers provide rankings of forecasting models, but none provide any measures of sampling variability or hypothesis tests. In the light of the very small number of forecasts, it is reasonable to suspect that these rankings would not be robust to changes in the sampling periods considered. In a more recent paper, Higgins et al. (2016) compute monthly forecasts of Chinese GDP and inflation from 2011 to 2015 with forecast horizons ranging from 1 year to 4 years. This provides 60 1-year-ahead forecasts and 24 4-year-ahead forecasts. However, as is the case in the papers cited above, they provide no measures of sampling variability, so it is not possible for the reader to deduce whether the differences in the mean squared forecasting errors reported for different forecasting models are statistically significantly different from zero.
In this paper, we present the results of an out-of-sample forecasting exercise for the Chinese macroeconomy. We forecast two different measures of real economic activity and two different measures of price inflation in each of at least 155 months. Our measures of price inflation are constructed from the Consumer Price Index and the Producer Price Index, and we measure real activity using data on Industrial Production and Electricity Production. As is the case for most of the studies we cite in the second paragraph of this section, we use seasonally adjusted data since the nonseasonal components of variables are generally of most interest in macroeconomic analysis.
We use 19 different forecasting models, each of which has proved itself useful for forecasting macroeconomic variables in Western economies. Our set of models includes simple approaches such as the naive forecast and the mean forecast, approaches based on smoothing, classical time series methods and factor-based forecasts. We use a rolling window of 100 observations for parameter estimation. We calculate the relative mean squared forecasting errors, test null hypotheses of mean squared error equality using the conditional and unconditional tests of Giacomini and White (2006) and control the familywise error rate using the sequentially rejective Bonferroni procedure of Holm (1979).
To the best of our knowledge, this is the first paper on forecasting the Chinese macroeconomy that uses such a wide range of different forecasting models, considers such a long time span of forecasts, and formally tests null hypotheses of equal forecasting power across alternative models instead of simply presenting point estimates of the magnitude of forecast errors.
The remainder of this paper is structured as follows. In Sect. 2, we briefly describe each of the forecasting techniques that we employ and the statistical methodology that we use. In Sect. 3, we describe the data. In Sect. 4, we provide plots of the data and the results of unit root tests. In Sect. 5, we present our empirical results. In Sect. 6, we discuss our findings, and in Sect. 7, we make some concluding comments.

Methodology
Let T denote the total number of observations available on the variables of interest. We use the first R observations to estimate model parameters. We generate forecasts for the time period R + h using several competing forecasting models, where we set h equal to 1, 3, 6, 9, and 12 months. We then drop the first observation from the estimation sample, add the (R+1) st observation to the estimation sample, re-estimate parameters, and generate forecasts for the time period R + h + 1. We continue this processupdating the estimation sample, re-estimating the parameters, and computing forecasts for the next time period-until we have computed out-of-sample forecasts for P time periods using each forecasting model with a rolling sample of R observations. Thus, In this study, we assess forecast accuracy using the mean squared forecasting error (MSE). Let y t denote the variable to be forecast. If y t is integrated of order zero (I(0)), then we fit a forecasting model to the level of the variable and, at time t, generate the h-months-ahead forecast, which we denote f t+h|t . If y t is integrated of order one (I(1)), then we fit a forecasting model to the first difference of the variable and, at time t, for s = 1, . . . , h, generate the s-months-ahead forecast of the first difference of the variable, which we denote f t+s|t . The h-month-ahead forecast of the level of the variable is then f t+h|t = y t + h s=1 f t+s|t . Our choice of forecasting models was guided by two considerations. Firstly, in this paper our interest is focused on models that are well established and widely used for forecasting macroeconomic variables in Western countries. Thus, we wish to include in our study the types of models that might commonly be used for forecasting by central banks, national treasuries, and by forecasters in industry, rather than models taken from the recent research literature. Similarly, it is not our intention in this paper to propose novel forecasting methods. Secondly, we wish to consider a range of models which includes some very simple techniques, standard smoothing methods, classical time series methods, and some more sophisticated approaches such as Bayesian vector autoregression and large-scale factor models. We settled on the following set of models: Mean For an I(0) variable, for all forecast horizons h ∈ N, the h-period-ahead simple mean forecast computed in time period R is the mean of all the available observations up to and including period R, i.e. f R+h|R = 1 R R t=1 y R−t+1 . For an I(1) variable, the corresponding forecast computed in time period R is If y t is a serially uncorrelated process or a random walk with drift, then the appropriate mean forecast is a mean-square optimal estimator of the mean-square optimal forecast. Naive For all forecast horizons h ∈ N, the h-period-ahead naive forecast of an I(0) variable is equal to the last available observation in the data set, i.e. f R+h|R = y R . If the variable is I(1), then we compute the naive forecast as f R+h|R = y R + h y R . Moving average (MA) For an I(0) variable, for all forecast horizons h ∈ N, the hperiod-ahead moving average forecast of order m computed in time period R is the mean of the last m observations in the data set, i.e f R+h|R = 1 m m t=1 y R−t+1 . For an I(1) variable, the corresponding forecast is f R+h|R = y R + h m m t=1 y R−t+1 . We consider two moving average models. For the first (MA(4)), we (arbitrarily) set the order of the moving average to 4. For the second (MA-opt), we choose the order of the moving average in each time period to minimize the mean squared error of the MA forecast up until that point in time. Thus, the order of the moving average potentially changes for every new forecast in the sequence of forecasts. Simple Exponential smoothing (SES) For an I(0) variable y t , the simple exponential smoothing model produces the smoothed series s t = α y t−1 where α is a smoothing parameter for which 0 α 1. For all forecast horizons h ∈ N, the h-period-ahead simple exponential smoothing forecast is defined as f R+h|R = s R . For an I(1) variable, the corresponding forecast is f R+h|R = y R + h s R . We consider two implementations of this model. For the first (SES(0.5)), the smoothing parameter α is (arbitrarily) set equal to 0.5. For the second (SES-opt), for each forecast, α is set equal to the value that minimizes the mean squared historical one-step-ahead forecasting error up until the time that the forecast is made. We compute the SES forecasts using the ses command in the forecast 5 package in the R programming language. 6 Direct (dAR) and multistep iterated (AR) autoregression We compute both multistep iterated and direct autoregressive forecasts. 7 For an I(0) variable and an autoregression of order p, andβ j is the ordinary least squares (OLS) estimator of β j computed using observations y 1 , . . . , y R . Similarly, for an I(1) variable and an autoregression of order p in first differences, For an I(0) variable, a direct h-step-ahead forecast at time period R is calculated by estimating by OLS the equation y t+h = β 0 + p j=1 β j y t− j + ε t using observations y 1 , . . . , y R−1 and computing f R+h|R =β 0 + p j=1β j y R+1− j . For an I(1) variable, the direct forecast is computed by estimating the equation y t+h − y t = β 0 + p j=1 β j y t− j + ε t using observations y 1 , . . . , y R−h and computing f R+h|R = y R + β 0 + p j=1β j y R+1− j . We consider two implementations of the AR and ARd models. For the first (AR(1) and ARd(1)), the order of the model is (arbitrarily) set equal to 1. For the second (AR(p) and ARd(p)), for each forecast generated, the order is chosen using Akaike Information Criterion (AIC) with a maximum allowed order of 24. Autoregressive moving average (ARMA) The ARMA(p,q) model for an I(0) variable, y t = μ + p i=1 φ i y t−i + q j=1 θ j ε t− j + ε t , may be written in state-space form and estimated using the maximum likelihood method (see Gardner et al. (1980)). Forecasts may then be computed using the Kalman filter (see Section 5.5 of Box et al. (2015)). For an I(1) variable, the model is instead We consider two implementations of the ARMA model. For the first (ARMA(1,1)), we (arbitrarily) set both orders of the model to 1. For the second (ARMA(p,q)), for each forecast generated, the order is chosen using the Akaike Information Criterion (AIC) with maximum allowed orders of 2 and 2. We use the Arima and auto.arima functions in the forecast package of the R programming language to compute the ARMA forecasts.

Vector autoregression (VAR)
We follow Stock and Watson (2002b) by specifying a vector to contain a price inflation variable, a variable measuring real activity, and an interest rate variable. To forecast a price inflation variable, the vector Y t consists of the variable to be forecast, Industrial Production, and the interest rate. To forecast a real activity variable, the vector consists of the variable being forecast, the CPI inflation rate, and the interest rate. The VAR model for an I(0) variable y t is then where y t is the first element of the vector Y t . The parameters are estimated using the OLS method, and the h-step-head forecast of the vector Y t computed in time period R is f R+h|R . For an I(1) variable, the VAR is specified in first difference In each case, the forecast of y R+h may be recovered from the forecast of the vector Y R+h .
We consider two implementations of the VAR model. For the first (VAR(1)), the order of the model is (arbitrarily) set equal to 1. For the second (VAR(p)), for each forecast generated, the order is chosen using the Akaike Information Criterion (AIC) with a maximum allowed order of 4. We compute all the VAR estimates and forecasts using the vars package 8 in the R programming language.

Bayesian vector autoregression (BVAR)
We also estimate a Bayesian vector autoregression using the Minnesota prior of Doan et al. (1984) and Litterman (1986). We refer the reader to Canova (2007) Chapter 10 for technical details of the method. 9 The construction of the vectors and forecasts is as for the frequentist VAR described above. We set the lag order to 4 for all BVAR models. We compute the BVAR forecasts using the BMR package 10 in the R programming language. Factor models Let x t be a N × 1 vector of macroeconomic variables that have been suitably transformed to become I(0). Let s t be a k ×1 vector of unobservable factors such that x t = Bs t + η t , where η t is a N × 1 vector of unobservable errors, B is a N × k matrix of unknown coefficients, and k << N . An estimator of the factor vectorŝ t may be constructed as the principal components of the sample covariance matrix of x t corresponding to the largest k eigenvalues. We standardize all the elements of x t prior to computing the principal components. For an I(0) variable y t in time period R, the h-steps-ahead forecast is computed as f R+h|R =β ŝ R whereβ is the least squares estimator of β for the equation y t+h = β ŝ t + ε t computed using observations from t = 1 to t = R − h. For an I(1) variable,β is the least squares estimator of β for the equation y t+h − y t = β ŝ t + ε t , but the rest of the procedure is identical to the I(0) case. Stock and Watson (2002a) establish conditions under whichβ − β p − → 0 andŝ t − s t p − → 0 as N and R jointly grow. We consider two implementations of this forecasting model. In the first (F(2)), the number of factors is set (arbitrarily) to k = 2. In the second (F(k)), for each time period in the forecasting exercise, the number of factors is estimated by minimizing the PC p1 (k) criterion 11 of Bai and Ng (2002) with the maximum allowed number of factors set to 10.

Factor-VAR
We also consider VAR models in which the vector consists of the variable being forecast and two factors, estimated in the way described above. Note that this is the special case of the FAVAR model of Bernanke et al. (2005). We consider two implementations of this model. In the first (F(2)VAR(1)), the VAR order is (arbitrarily) set to 1. In the second (F(2)VAR(p)), the VAR order is chosen to minimize the AIC with a maximum allowable lag of 4.
Our estimator of the mean squared error of the i th forecast is 1 For all but one forecasting model, we also test the null hypotheses where forecasting model b is a benchmark model that we choose. H u 0 is the hypothesis that model i has the same mean squared forecasting error as the benchmark model. Note that this is a hypothesis about the model with estimated parameters, so sampling uncertainty contributes to the forecasting error. It is possible that a correctly specified forecasting model could have a larger mean squared forecasting error than an incorrectly specified benchmark model if the variance of the sampling error is sufficiently large-as could be the case if the sample is quite small. H c 0 is the hypothesis of equality of the mean squared error conditional on F t , an information set available at time t. Giacomini and White (2006) provide the method and theory for testing both hypotheses, and we refer the reader to their paper for more details of the two tests. We follow Giacomini and White (2006) closely by using the Newey-West estimator of the variance with a bandwidth of h − 1 to construct the test statistics, and for H c 0 , we condition on the lagged value of the difference of the squared forecast errors from the pair of models under consideration. Since our study involves individual tests for hypotheses for many forecasting models, several forecasting horizons, and multiple variables, it is likely to yield some small p values purely by chance, even if all the null hypotheses are true. In order to control the risk of false discovery, in addition to reporting the p values for each hypothesis test, we also report the results of the sequentially rejective Bonferroni procedure proposed by Holm (1979). For a set of m null hypotheses, this consists of rejecting the null hypothesis with the smallest p value if its p value is smaller than α/m. If this hypothesis is rejected, then the hypothesis corresponding to the second smallest p value is rejected if its p value is smaller than α/(m − 1). The process continues until no more hypotheses are rejected, with the p value for the j th null hypotheses being compared to α/(m − j + 1). This procedure provides strong control of the familywise error rate-that is to say, it controls the probability of rejecting at least one true null hypothesis from a collection of multiple hypotheses, for any combination of true and false hypotheses.

Data
Empirical research in economics is often constrained by data availability. This is particularly the case for research on the Chinese macroeconomy. Many thousands of economic time series are available from China's National Bureau of Statistics (NBS), 12 the People's Bank of China, 13 and other government sources, but only a small proportion are of sufficient length to be suitable for a study such as this. Moreover, many series have missing observations and/or implausible outliers. A common concern with Chinese data is the integrity of the data collection process. China is a large and complex country which raises technical challenges for its statistical agencies. Furthermore, questions are sometimes raised about the extent to which political pressure might be such that the published data reflect production targets rather than actual outcomes. We do not explore these issues here, but instead refer the interested reader to Crabbe (2016), Holz (2014), Orlik (2011), Holz (2003, and Holz (2008).
The four variables that we forecast are the Consumer Price Index (CPI) monthon-month inflation rate, the Producer Price Index (PPI) year-on-year inflation rate, the year-on-year growth rate of industrial production (IP), and the month-on-month growth rate of the production of electricity (EP). All our variables are monthly final revisions and are measured from October 1996 to December 2018. We do not consider Gross Domestic Product (GDP) in this study since the data are available only at the quarterly and annual frequencies, which would greatly reduce the number of observations in our sample. 14 The CPI data are supplied by the NBS as month-on-month percentage changes. The PPI and IP data are supplied as year-on-year percentage changes, and the EP data are supplied in billions of kilowatt/hours which we convert to a month-on-month growth rate. We downloaded these data sets from Thomson Reuters Datastream 15 since this was easier to work with than the NBS Website. The IP and EP data sets have a number of missing values and outliers. We cleaned the data sets by removing observations for which the first difference deviates from the median 12 http://www.stats.gov.cn/english/. 13 http://www.pbc.gov.cn/en/3688006/index.html.
14 Two approaches to dealing with this problem are to interpolate the quarterly GDP data to create a monthly series (see, e.g. Higgins et al. (2016)) or to use mixed-frequency modelling techniques (see, e.g. Ghysels (2016)). We are sceptical of the use of interpolation to create monthly series from quarterly series in forecasting studies since it results in two-thirds of the forecasted data points being the outcomes of the interpolation procedure, rather than actual data points. While the presence of outliers and missing values dictates that a certain amount of interpolation is practically unavoidable when analysing Chinese macroeconomic data, our preference is that this is kept to the minimum possible. Mixed-frequency modelling techniques are interesting, but are beyond the scope of the present paper, which focuses on forecasting models that are well established in Western economies. 15 The Datastream codes for our variables IP, EP, CPI and PPI are CHIPTOT.H, CHPBRENTP, CHCPINATR, and CHPXGPRDF, respectively. of the first differences by more than five times the interquartile range. 16 All missing values were then replaced by cubic interpolation. 17 All four variables were seasonally adjusted using the X-13 method with a deterministic adjustment made for the Chinese New Year holiday. 18 Our VAR models also include the Prime Lending Rate published by the People's Bank of China. 19 For the factor models, we constructed factor estimates from 50 different monthly macroeconomic variables sourced from Datastream. Outliers and missing values were dealt with as described above, and with the exception of financial variables, all variables were seasonally adjusted. All 50 variables were either differenced or log-differenced prior to the application of the principal components estimator. "Appendix 1" includes tables which list the variable names, their Datastream codes, whether they have missing values, and a code which indicates whether they were differenced or log-differenced. All our code and data are available online. 20 Readers familiar with the literature will notice that our cross-sectional dimension of 50 variables is much smaller than that of, as examples, Stock and Watson (2002b) (215 variables), Schumacher (2007) (124 variables) or Artis et al. (2005) (81 variables). This is an important consideration since the asymptotic justification for the principal components estimator requires the number of variables to grow simultaneously with the number of observations. However, Boivin and Ng (2006) argue that "...as few as 40 pre-screened series often yield satisfactory or even better results than using all 147 series..." which may provide us with some hope that our 50 variables may suffice. 21 Nonetheless, we must concede that our variables are 'pre-screened' only in the sense that we used all the variables for which we had data available at the monthly frequency over the entire time period of interest.  The results of Augmented Dickey-Fuller (ADF) tests for unit roots 22 are presented for each variable in Table 1. In each case, the maximum lag was set equal to the largest integer smaller than 4(T /100) 2 9 . If a 5% significance level is used, then evidence against a unit root exists for the CPI, PPI, and EP growth rates, but not for the growth of IP. 23 We also estimated each of the frequentist models with fixed model orders using all the available observations for both the levels and first differences of the variables. We note that, when judged by measures such as the R 2 and the F-statistic for overall significance, most of the models appear to fit the variables reasonably well. In the 22 The ADF test statistics were computed using the adf.test command in the tseries package in R. See Trapletti and Hornik (2018). 23 Varying the lag order from 2 to 12 did not change the conclusions of the ADF test. Similar results were also found using the Phillips-Perron unit root test.
interests of saving space, these results are not presented here, but are available in an online supplement. 24

The forecasting results
The first step of the forecasting exercise is to take the first R = 100 observations from the raw 25 data set-that is, observations from October 1996 to January 2005-remove outliers, interpolate missing values, seasonally adjust, compute principal components, estimate the model orders and all the model parameters using only the data in this subsample, and then generate forecasts for the time periods h months after the end of the subsample, where h = 1, 3, 6, 9, 12. After generating forecasts from all models at each horizon, we drop the first observation from the subsample and add an extra observation of raw data to the end of the subsample of raw data. We repeat the same steps for the second iteration of the forecasting exercise, with the subsample used for estimation now running from November 1996 to February 2005. We continue in this fashion, dropping the first observation, appending an extra observation of raw data to the end of the subsample of raw data, adjusting for outliers, missing values 26 and seasonality, estimating principal components, model orders, and all the model parameters using only the data in the subsample, and computing forecasts, until we have used the entire data set. This produces a sequence of 167 1-month-ahead forecasts from February 2005 to December 2018. The sequences of h-month-ahead forecasts for h = 3, 6, 9, 12 start h − 1 months later than the corresponding 1-month-ahead forecast. These forecasts are compared to the data sets generated by removing outliers, interpolating missing values, and applying seasonal adjustment to the entire span of data available for each variable 27 (i.e. from October 1996 to December 2018). We measure the deviations of the forecasts from these variables using the mean squared error, and we test null hypotheses of the equality of the mean squared error of each forecast to that of a suitable benchmark. For each variable forecast, we use as a benchmark model whichever of the naive forecast and the simple mean forecast generates the smallest estimated mean squared error.
Estimates of the mean squared forecasting errors for inflation produced by the models are presented in Table 2. The mean squared errors have been standardized so that the mean forecast has a standardized mean squared error of 1 for the CPI 24 https://cheaton.github.io/China-Forecasting/online_supplement.pdf. 25 By 'raw,' we mean the data that have not had outliers removed, missing values interpolated, or seasonal adjustment applied. 26 For cases in which a missing value occurs at the beginning or end of the subsample, cubic extrapolation is used to replace the missing value instead of cubic interpolation. 27 The more conventional approach, followed by (e.g.) Ang et al. (2007), Artis et al. (2005), Bandt et al. (2007), Ciccarelli and Mojon (2010), Faust and Wright (2013), Marcellino et al. (2003), Marcellino et al. (2006), den Reijer (2005), Schneider and Spitzer (2004), Schumacher (2007), Stock and Watson (1999), Stock and Watson (2002b), and Stock and Watson (2008), is to make a single adjustment for outliers and seasonality to the entire data set, or to source data that are already seasonally adjusted, and then use subsets of these data in the forecasting exercise. As pointed out by an anonymous referee, this approach pollutes the forecasts with information from the future of each subsample. We have also used this approach and found that it produced results that are broadly similar to those we report in this paper.
forecasts and the naive forecast has a standardized mean squared error of 1 for the PPI forecasts since, in each case, these are the benchmarks with the smallest mean squared error. A number of features of Table 2 are worth noting. For the CPI inflation rate, the factor-VAR(1) model with two factors produced the best 1-month-ahead forecasts within-sample and, in general, the more sophisticated forecasting models outperform the simple models. For all other forecast horizons, the more sophisticated techniques either perform worse than the simple mean forecast or provide only a slight improvement. Thus, judged purely by point forecasts, there appears to be little to gain from using sophisticated methods to forecast Chinese CPI inflation at horizons longer than one month.
The results for the PPI inflation rate are markedly different. In this case, the BVAR model generates the smallest sample mean squared forecasting error for horizons of 1, 3 and 6 months, and the improvement over the benchmark naive forecast is substantial. The ARMA, AR(p), and ARd(p) models also perform relatively well at short horizons, but the relative performance declines substantially as the forecast horizon grows. In contrast, the factor models perform extremely poorly relative to the naive forecast at short horizons, but perform the best at horizons of 9 and 12 months. The MA and SES forecasts are generally outperformed by the naive forecast, and the simple mean forecast performs poorly at horizons of 1, 3, and 6 months, but performs quite well at horizons of 9 and 12 months.
The above comments are based purely on the sample estimates of the mean squared forecast errors and take no account of sampling variability. Tables 3 and 4 provide the p values for the conditional and unconditional Giacomini and White (2006) tests, using the mean forecast as the benchmark for the CPI and the naive forecast as the benchmark for the PPI. For cases in which the p value is less than 0.05, if the corresponding relative mean squared error is greater than 1 (i.e. the forecasting model is inferior to the benchmark), the p value is presented in italics. If the p value is less than 0.05 and the relative MSE is less than 1 (i.e. the forecasting model is superior to the benchmark), then the p value is displayed with a bold font. We have also applied the sequentially rejective Bonferroni adjustment of Holm (1979) to each table of p values to control the probability that at least one true null hypothesis in each table is rejected to be less than 0.05. The p values of any such rejected hypotheses are underlined. For example, in Table 3(a) in the row marked 'Factor(2)' and the column headed 'h = 6' appears the figure 0.022. This indicates that the p value for the null hypothesis that the conditional expected value of the mean squared error of the factor model with two factors, when used to forecast 6 months ahead, is equal to that of the mean forecast, is 0.022 when tested against a two-sided alternative hypothesis. This number is printed in a bold font to indicate that, if a Neyman-Pearson hypothesis test were conducted with a 5% significance level, the null hypothesis would be rejected and that the estimated mean squared error of the Factor(2) forecast is less than that of the mean forecast (i.e. the Factor(2) model is superior). Note, however, that the null would not be rejected if a 1% significance level was used. Furthermore, the fact that the number 0.022 is not underlined indicates that, when the sequentially rejective Bonferroni adjustment of Holm (1979) is applied to restrict the probability that at least one true null hypothesis in Table 3(a) is rejected to be less than 5%, the null hypothesis for the Factor(2) model at a horizon of 6 months is not rejected. For this reason, we do not consider the fact Table 2 Inflation: standardized mean squared forecast errors by horizon The rows are the forecasting models described in Sect. 2. The columns are forecast horizons. The table elements are the point estimates of the mean squared forecasting error expressed as a ratio of the mean squared forecasting error of the benchmark model for the same forecast horizon. Thus, a number less than 1 indicates that the model produced a smaller sample mean squared forecasting error than the benchmark model at that forecast horizon. The benchmark models are the simple mean forecast for the CPI and the naive forecast for the PPI Table 3 Inflation The rows are the forecasting models described in Sect. 2. The columns are forecast horizons. The table elements are the p values for the Giacomini and White (2006) conditional test that the mean squared forecasting error is equal to that of the benchmark model. If a p value is less than 0.05, and the corresponding within-sample relative mean squared error in Table 2 is less than 1, then the p value is in a bold font, indicating that the model generates significantly better forecasts at the 5% significance level.
If the p value is less than 0.05, and the corresponding within-sample relative mean squared error in Table 2 is greater than 1, then the p value is in italics, indicating that the model generates significantly worse forecasts. p values for hypotheses that are rejected by the sequentially rejective Bonferroni procedure of Holm (1979), applied to restrict the familywise error rate of each table to be less than 5%, are underlined. The benchmark model for the CPI forecasts is the mean forecast. The benchmark for the PPI forecasts is the naive forecast that the individual p value for this hypothesis is less than 0.05 to constitute strong evidence in favour of this particular forecasting model. Overall, Tables 3(a) and 4(a) show no strong evidence that the forecasting models considered are able to outperform the simple mean forecast when applied to the CPI inflation rate for China. Furthermore, when restricting the familywise error rate to be less than 5%, we find evidence only that the k-factor model with a horizon of 12 months has a worse performance than the simple mean forecast.
For the PPI with a one-month forecast horizon, when we restrict the familywise error rate to be less than 5%, the conditional test finds evidence that ARMA models provide superior forecasts to the naive model. The unconditional test also finds evidence in favour of the direct AR(p) model, the VAR(p) model, and the BVAR model. Some individual p values are also less than 0.05 for longer forecasting horizons but in each case the p values are reasonably close to 0.05 and the hypotheses are not rejected when controlling the familywise error rate. For these reasons, we do not consider these statistics to provide evidence of superior forecasting power at longer horizons.
We now consider the results for the forecasts of real activity. The relative mean squared forecasting errors are presented in Table 5. For IP at a forecasting horizon of 1 month, all the models considered are inferior to the simple mean forecast. For a forecasting horizon of 3 months, only the AR(p) and ARd(p) models provide smaller mean squared forecasting errors. At longer horizons, the relative performance of the AR, VAR, BVAR, and factor models improves. Interestingly, however, ARMA models are inferior at all forecasting horizons. Similarly, the naive, MA, and SES models always perform poorly.
For EP, the smallest relative mean squared error at any horizon is 0.974 (the Factor(2) model with a horizon of 1 month). Thus, none of the models was able to produce an economically meaningful improvement over the forecasting power of the simple mean forecast.
The p values for the conditional Giacomini and White (2006) test appear in Table 6, and for the unconditional test, in Table 7. The lack of evidence that any of the forecasting models considered are able to consistently produce mean squared forecasting errors that are less than those of the simple mean forecast is striking.

Discussion
The forecasting models that we consider in this paper are all well established, and it may be somewhat surprising that we find so little evidence of their utility in the Chinese context. However, from a theoretical perspective, the mean squared optimal forecast of a stochastic process y t relative to its past values is E(y R+h |y R , y R−1 , y R−2 , . . .) (see, e.g. Hamilton (1994), Section 4.1). Furthermore, for any process y t that has a stable ARMA representation, E(y R+h |y R , y R−1 , y R−2 , . . .) − E(y R+h ) → 0 as h → ∞. Consequently, we should expect our models to be less likely to produce forecasts superior to the mean forecast as the forecast horizon grows. As a simple example, consider the AR(1) model y t = β 0 + β 1 y t−1 + ε t . The mean squared optimal forecast is f R+h = β 0 h−1 j=0 β j 1 +β h 1 y R . Plots of the values of the OLS estimator of β 1 for each  The rows are the forecasting models described in Sect. 2. The columns are forecast horizons. The table elements are the p values for the Giacomini and White (2006) unconditional test that the mean squared forecasting error is equal to that of the benchmark model. If a p value is less than 0.05, and the corresponding within-sample relative mean squared error in Table 2 is less than 1, then the p value is in a bold font, indicating that the model generates significantly better forecasts at the 5% significance level.
If the p value is less than 0.05, and the corresponding within-sample relative mean squared error in Table 2 is greater than 1, then the p value is in italics, indicating that the model generates significantly worse forecasts. p values for hypotheses that are rejected by the sequentially rejective Bonferroni procedure of Holm (1979), applied to restrict the familywise error rate of each table to be less than 5%, are underlined. The benchmark model for the CPI forecasts is the mean forecast. The benchmark for the PPI forecasts is the naive forecast The rows are the forecasting models described in Sect. 2. The columns are forecast horizons. The table elements are the point estimates of the mean squared forecasting error expressed as a ratio of the mean squared forecasting error of the benchmark model for the same forecast horizon. Thus, a number less than 1 indicates that the model produced a smaller sample mean squared forecasting error than the benchmark model at that forecast horizon. The benchmark model is the simple mean forecast for both IP and EP Table 6 Real activity: p values for conditional test by horizon The rows are the forecasting models described in Sect. 2. The columns are forecast horizons. The table elements are the p values for the Giacomini and White (2006) conditional test that the mean squared forecasting error is equal to that of the benchmark model. If a p value is less than 0.05, and the corresponding within-sample relative mean squared error in Table 5 is less than 1, then the p value is in a bold font, indicating that the model generates significantly better forecasts at the 5% significance level.
If the p value is less than 0.05, and the corresponding within-sample relative mean squared error in Table 5 is greater than 1, then the p value is in italics, indicating that the model generates significantly worse forecasts. p values for hypotheses that are rejected by the sequentially rejective Bonferroni procedure of Holm (1979), applied to restrict the familywise error rate of each table to be less than 5%, are underlined. The benchmark model is the mean forecast  Giacomini and White (2006) unconditional test that the mean squared forecasting error is equal to that of the benchmark model. If a p value is less than 0.05, and the corresponding within-sample relative mean squared error in Table 5 is less than 1, then the p value is in a bold font, indicating that the model generates significantly better forecasts at the 5% significance level.
If the p value is less than 0.05, and the corresponding within-sample relative mean squared error in Table 5 is greater than 1, then the p value is in italics, indicating that the model generates significantly worse forecasts. p values for hypotheses that are rejected by the sequentially rejective Bonferroni procedure of Holm (1979), applied to restrict the familywise error rate of each table to be less than 5%, are underlined. The benchmark model is the mean forecast  p values are in parentheses. * , * * and * * * indicate p < 0.1, p < 0.05 and p < 0.01 of the subsamples used in the forecasting exercises reported in Sect. 5 are presented in Fig. 2, and estimates of both coefficients computed using the full sample 28 are in Table 8. Note that the coefficient estimates for the 'slope' term are fairly small for all variables except the PPI. The forecast equations for each horizon, computed using the coefficient estimates from Table 8 obtained using the whole sample, are presented in Table 9. With the exception of the PPI forecasts, the forecast equations all produce forecasts that are very close in value to the estimated unconditional means of each series for forecast horizons greater than 1. Thus, the main reason why the AR(1) model fails to produce forecasts superior to the simple mean for these variables is that 28 Coefficient estimates for all the models computed using the entire sample are available in the online supplement to the paper. the estimated autocorrelation at a lag of 1 month is small. In contrast, the estimated 'slope' coefficient for PPI is consistently close to 1 which explains why the AR(1) model produces a similar forecasting performance to the naive model, and why both are markedly superior to the mean forecast. More generally, there are a number of potential explanations for the relatively poor performance of standard forecasting models in the Chinese context.
Firstly, it might simply be that the available macroeconomic time series for China are not sufficiently long to provide statistically significant results. Our analytical approach effectively requires two data sets-one to estimate the model parameters and another to produce a series of out-of-sample forecasts. Western macroeconomic time series are generally available over much longer time spans than their Chinese counterparts. For example, Giacomini and White (2006) have 468 observations for their study of forecasting models for US industrial production, personal income, CPI inflation, and PPI inflation and generate forecasts for 318 months. In contrast, we have a total of 267 observations in total, yielding a maximum of 167 forecasts and raising the possibility that statistically significant results might emerge from a larger sample of forecasts, if it were available. Having said this, we note that only 31% of our CPI forecasts produce point estimates of mean squared errors that are smaller than those of the benchmark forecast. For the IP forecasts, this figure is 30%, and for EP it is 12%. Furthermore, as seen in Tables 3, 4, 6, and 7, in many cases, the benchmark model provides forecasts that are statistically significantly better than some of the other models considered. Consequently, we are not convinced that a larger sample would significantly change the conclusions of our study.
A second potential explanation for the failure of our more sophisticated models to consistently outperform simple benchmarks is nonstationarity. Forecasting models in the (vector) ARIMA tradition are well suited to variables that are weakly stationary (perhaps after differencing). In cases where the marginal distributions of variables are slowly and continually changing over time, or where the marginal distributions change more suddenly, the rationale for such models is less clear. It should be noted that there can be little doubt that nonstationarity of these types exists. The question is whether they are so extensive in the Chinese context as to render traditional forecasting methods impotent. One particular potential source of nonstationarity is the financial crisis that commenced in late 2008. In the online supplement 29 to this paper, we have provided plots of the mean squared forecasting errors over 24-month widows for every forecasting model, forecasting horizon, and variable. We have also provided plots of the period-by-period difference in the squared forecasting errors generated by each forecasting model, and the squared forecasting errors generated by the benchmark model. It is clear from these plots that the period immediately following the onset of the crisis was one of great volatility in the forecasting performance of many models relative to the benchmarks. However, with the mean squared errors and p values recomputed using only forecasts from January 2010 to December 2018, 30 we find little difference from the results generated using the full samples that we report in Sect. 5. Consequently, it does not appear to be the case that our results are being driven primarily by the financial crisis. More generally, the impact of nonstationarity may vary across our forecasting models. Trivially, the naive forecast is unaffected by changes in the marginal distribution that occurred more than one month previously, whereas the estimation of parameters in other forecasting models usually places equal weight on all past observations. How nonstationarity would affect the ranking of forecasting models depends on the particular changes that occur in the marginal distribution over time. Therefore, it is difficult to make general comments about these effects. Further investigation into the issue of nonstationarity might proceed by comparing models that explicitly incorporate parameter drift and structural change to the benchmark models that we have used in this paper. This is an interesting but extensive task that we leave for future research.
Finally, it is possible that traditional forecasting methods perform relatively poorly for China because of low data quality. Much has been written about the quality of Chinese macroeconomic data (see, for example, Crabbe (2016), Holz (2014), Orlik (2011), Holz (2003 and Holz (2008)), and we do not have anything to add to this literature. To understand the potential impact of poor quality data, consider the following simple model of measurement error: y t = y * t + e t where y * t is a weakly stationary stochastic process that we wish to forecast, and e t is a measurement error that we assume to be serially uncorrelated and uncorrelated with y * t at all leads and lags. It is simple to show that Corr(y t , y t− j ) Corr(y * t , y * t− j ) for all j ∈ N and that the difference between the two correlations increases with V ar(e t ). Thus, measurement error is a possible cause of the small estimates of the AR(1) slope coefficients reported in Fig. 2 and Table 8 and may contribute to the relatively poor forecasting power of models that exploit correlation structure. Ultimately, if the published data do not reliably measure the variables that they are claimed to represent, then forecasting those variables may be a hopeless task and, in any case, might be argued to be pointless.
In such a situation, the way forward is to seek better ways to measure the aspects of Chinese economic activity that are of interest, and to forecast these measures directly, rather than relying on the official data collection.

Conclusion
For analysts interested in forecasting real activity in China, the results in our paper provide clear guidance. We find no evidence that any of the forecasting models that we consider provide smaller mean squared forecasting errors than the simple mean forecast. For inflation, the results are less clear-cut. For PPI inflation, we found statistically significant evidence that AR, ARMA, VAR, and BVAR models produce superior forecasts to the naive model at a forecasting horizon of 1 month. For each of these models, the estimated mean squared forecasting error is approximately half that generated by the benchmark naive forecasting model. However, we do not find strong evidence that any of the models that we consider produce better forecasts of the PPI than the naive model at horizons longer than 1 month, and we find no evidence that any of the models we consider produce superior forecasts of the CPI than the simple mean model at any horizon.
An important contribution of our paper is that (to our knowledge) it is the first paper on forecasting the Chinese macroeconomy that has formally tested hypotheses that the forecasting models under consideration have forecasting power equal to that of simple benchmark models. Our samples of forecasts range in size from 156 (for 12month-ahead forecasts) to 167 (for 1-month-ahead forecasts). These are much larger samples than have been generated in the prior published literature, which range from five observations to 60. The fact that we find so little statistically significant evidence that sophisticated forecasting models provide forecasts superior to those available from simple benchmark models raises considerable doubt about the generality of previously published results that suggest otherwise, beyond the particular sample periods considered.
While our results might create an impression that, for the most part, Chinese macroeconomic variables cannot be forecast, such a conclusion should be resisted. Instead, our main finding is that, with the exception of the PPI, for the variables we consider there does not exist sufficiently strong serial correlation for the widely used conventional forecasting techniques to consistently outperform a forecast based on an estimator of the mean, in the sense of producing a smaller mean squared forecasting error. Therefore, for CPI, industrial production, and electricity production, we recommend the mean forecast due to its simplicity. Note, however, that our results do not necessarily imply that the more sophisticated models we consider would not perform relatively well when judged against criteria other than the mean squared forecasting error (for example, for sign forecasting). Our findings also do not preclude the possibility that using nonsample information might improve the forecasts, that less conventional forecasting models might perform better than the models we have considered, or that alternative measures of inflation and real activity might have more forecastable structure than the ones used in this paper. Thus, while our results have important implications for those interested in forecasting Chinese macroeconomic variables, considerable scope remains for future research.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.