How Good are Out of Sample Forecasting Tests on DSGE Models?
Abstract
Out-of-sample forecasting tests of DSGE models against time-series benchmarks such as an unrestricted VAR are increasingly used to check (a) the specification and (b) the forecasting capacity of these models. We carry out a Monte Carlo experiment on a widely used DSGE model to investigate the power of these tests. We find that in specification testing they have weak power relative to an in-sample indirect inference test; this implies that a DSGE model may be badly misspecified and still improve forecasts from an unrestricted VAR. In testing forecasting capacity they also have quite weak power, particularly on the left-hand tail. By contrast a model that passes an indirect inference test of specification will almost definitely also improve on VAR forecasts.
Keywords
Out-of-sample forecasts · DSGE · VAR · Specification tests · Indirect inference · Forecast performance

JEL Classification
E10 · E17

Introduction
In recent years macroeconomists have turned to out-of-sample forecasting (OSF) tests of Dynamic Stochastic General Equilibrium (DSGE) models as a way of determining their value to policymakers, both for deciding policy and for improving forecasts. Thus for example Smets and Wouters (2007) showed that their model of the US economy could beat a Bayesian Vector Autoregression (BVAR): their point was that, while they had estimated the model by Bayesian methods with strong priors, there was a need to show also that the model could independently pass a (classical specification) test of overall fit, as otherwise the priors could have dominated the model’s posterior probability. Further papers have documented models’ OSF capacity, including Gürkaynak et al. (2013); see Wickens (2014) for a survey of recent attempts by central banks to evaluate their own DSGE models’ OSF capacity.^{1} But how good are these OSF tests? This question is what this paper sets out to answer.
The value of DSGE models’ OSF capacity to policymakers comes, as we said, from two main motivations.
The first is to use DSGE models to improve economic forecasting. One can think of an unrestricted VAR as a method that uses data to forecast without imposing any theory. If one knows the true theory, one can improve the efficiency of these forecasts by imposing this theory on the VAR, to obtain the restricted VAR. This will improve the forecasts, reducing the Root Mean Square Error (RMSE) of forecasts at all horizons. However, imposing a false parameter structure on the VAR may produce worse forecasts: the further the parameters are from the truth, the worse the forecasts. There will be some ‘crossover point’ along this falseness spectrum at which the forecasts deteriorate compared with the unrestricted VAR.
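To make the crossover idea concrete, here is a minimal sketch, entirely ours, using a toy bivariate VAR(1) as the data-generating process; the ‘falseness’ scaling of the coefficient matrix and all function names are illustrative assumptions, not the paper’s design.

```python
# Minimal sketch of the 'crossover point': impose (possibly false) theory on a
# VAR(1) and compare its out-of-sample RMSE with the unrestricted VAR's.
import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[0.7, 0.1],
                   [0.2, 0.5]])            # true VAR(1) coefficient matrix

def simulate(A, T):
    """Simulate T observations of a bivariate VAR(1) with N(0, 0.1^2) shocks."""
    y = np.zeros((T, 2))
    for t in range(1, T):
        y[t] = A @ y[t - 1] + 0.1 * rng.standard_normal(2)
    return y

def ols_var1(y):
    """Unrestricted OLS estimate of the VAR(1) coefficient matrix."""
    X, Y = y[:-1], y[1:]
    return np.linalg.lstsq(X, Y, rcond=None)[0].T

def rmse_1step(A_hat, y):
    """One-step-ahead RMSE of forecasts A_hat @ y_{t-1}."""
    e = y[1:] - y[:-1] @ A_hat.T
    return np.sqrt((e ** 2).mean())

train, test = simulate(A_true, 200), simulate(A_true, 200)
A_unres = ols_var1(train)                  # theory-free benchmark
for falseness in [0.0, 0.05, 0.20]:        # 0 %, 5 %, 20 % 'false' theory
    A_theory = A_true * (1 + falseness)    # theory imposed instead of estimated
    ratio = rmse_1step(A_theory, test) / rmse_1step(A_unres, test)
    print(f"{falseness:.0%} false: RMSE ratio = {ratio:.3f}")
```

The true theory (0 % false) should produce a ratio below unity; as falseness grows the ratio rises and eventually crosses one.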
The second reason is the desire to have a wellspecified model that can be used reliably in policy evaluation; clearly in assessing the effects of a new policy the betterspecified the model, the closer it will get to predicting the true effects. The assessment of the DSGE model’s forecasting capacity is being used by policymakers with this desire, as a means of evaluating the extent of the model’s misspecification.
Notice that the two motivations are linked by the requirement of a wellspecified model. Thus for the DSGE model to give better forecasts than the unrestricted VAR it needs to be not too far from the true model—i.e. the right side of the crossover point. It is harder for us to judge how close the model needs to be to the truth for a policy evaluation: this will depend on how robust the policy is to errors in its estimated effects and this will vary according to the policy in question. But we can conclude that both reasons require us to be confident about the model’s specification.
Thus evaluations of the DSGE model’s forecasting capacity, to be useful, should provide us with a test of the model’s specification; and this indeed is how these evaluations are presented to us. Typically the model’s forecasting RMSE is compared with that of an unrestricted VAR via the ratio of the model’s RMSE to that of the VAR; there is a distribution for this ratio for the sample size involved and we can see how often the particular model’s forecasts give a ratio in say the 5 % tail, indicating model rejection. The asymptotic distribution for this ratio (of two t-distributions) cannot be derived analytically; but we establish below by numerical methods that it is a t-distribution.

We therefore ask three questions:

1. What is the small sample distribution of this ratio for a model (a) if it is true and (b) if it is marginally able to improve on other forecasts?

2. How much power do these OSF evaluations have, viewed as a test of a DSGE model’s specification? In other words, can we distinguish clearly between the forecasting performance of a badly misspecified model and that of the true model?

3. Can we say anything about the relationship between a DSGE model’s degree of misspecification and its forecasting capacity? There is a large literature on the forecast success of different sorts of models (Clements and Hendry 2005; Christoffel et al. 2011); we would like to see how success is related to specification error.
DSGE Models’ Out-of-Sample Forecasting Tests
DSGE Model OSFs
VAR Model OSFs
OSF Tests
To compare out-of-sample forecasting ability, there are two alternative statistics that focus on the difference in minimum mean-squared forecast error (MSFE) between two nested models: the Diebold–Mariano and West (DMW) and the Clark–West (CW) statistics. Diebold and Mariano (1995) and West (1996) construct t-type statistics which are assumed to be asymptotically normal and under whose null the sample difference between the two MSFEs is zero. Clark and West (2006) and Clark and West (2007) provide an alternative DMW statistic that adjusts for the negative bias in the difference between the two MSFEs.
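As an illustration, both statistics can be computed from forecast-error series in a few lines. The sketch below is ours rather than any of these papers’ code; it uses a plain sample variance in the denominator, whereas serially correlated (e.g. multi-step) forecast errors would call for a HAC variance such as Newey–West.

```python
# Hedged sketch of the DMW and Clark-West statistics for nested models.
# e_small, e_large: forecast errors of the small (null) and large model;
# f_small, f_large: the corresponding point forecasts.
import numpy as np

def dmw_stat(e_small, e_large):
    """DMW t-statistic: the MSFE difference has zero mean under the null."""
    d = e_small ** 2 - e_large ** 2
    return np.sqrt(len(d)) * d.mean() / d.std(ddof=1)

def cw_stat(e_small, e_large, f_small, f_large):
    """Clark-West: strip out the noise added by estimating the large model's
    extra parameters, which biases the MSFE difference downward."""
    adj = e_small ** 2 - (e_large ** 2 - (f_small - f_large) ** 2)
    return np.sqrt(len(adj)) * adj.mean() / adj.std(ddof=1)
```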
However, in empirical analysis both the DMW and CW test statistics take their critical values from their asymptotic distributions. Rogoff and Stavrakeva (2008) criticize the asymptotic CW test as oversized, causing too many rejections of the null hypothesis; they and Ince (2014) propose using a bootstrapped OSF test to avoid this size distortion in small samples.
Our bootstrapped OSF test statistics are similar to these. There is little difference between the simulated asymptotic distributions of the RMSE ratio and the RMSE difference, but we focus on the ratio of the RMSEs between the DSGE and the VAR model, as this is the measure usually adopted in macroeconomic forecasting studies such as those discussed here.
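A bootstrapped version of the ratio test might be sketched as follows; this is our stylised rendering of the resampling idea rather than Rogoff and Stavrakeva’s exact procedure, and the block length of 4 is an arbitrary illustrative choice to respect serial correlation in the errors.

```python
# Sketch: bootstrap the small-sample distribution of the RMSE ratio
# (DSGE over VAR) by block-resampling the paired forecast errors.
import numpy as np

def bootstrap_rmse_ratio(e_dsge, e_var, n_boot=5000, block=4, seed=0):
    rng = np.random.default_rng(seed)
    P = len(e_dsge)
    ratios = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, P - block + 1, size=P // block + 1)
        idx = np.concatenate([np.arange(s, s + block) for s in starts])[:P]
        ratios[b] = (np.sqrt((e_dsge[idx] ** 2).mean())
                     / np.sqrt((e_var[idx] ** 2).mean()))
    return ratios  # e.g. np.quantile(ratios, 0.95) for a 5 % critical value
```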
The Power of OSF Tests
Monte Carlo Experiments
We follow the basic procedures of Le et al. (2011) to design the Monte Carlo experiment. We take the model of Smets and Wouters (2007) for the US and adopt their posterior modes for all parameters, including those of the error processes; the innovations are drawn from normal distributions with their posterior standard errors (Table 1a, b, Smets and Wouters 2007).
We set the sample size (T) at 200 and generate 1000 samples. We set the initial forecast origin (M) at 133: the VAR and DSGE autoregressive processes are initially estimated over the first 133 periods. The models are then used to forecast the data series 4 or 8 periods ahead over the remaining 67 periods, with re-estimation every period (quarter). We obtain the small-sample distribution of the RMSE ratio under the relevant null hypothesis from our 1000 Monte Carlo samples. Our null hypothesis for the OSF tests is (1) the True DSGE model and (2) (discussed in Sect. 4) the False DSGE model that marginally succeeds in improving the forecast.
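The timing conventions of the experiment can be summarised in skeleton form. In the sketch below, `simulate_sw_sample`, `fit_dsge` and `fit_var` are hypothetical stand-ins for the Smets–Wouters simulator and the two estimation routines; only the rolling re-estimation logic is spelt out.

```python
# Skeleton of one Monte Carlo replication: T = 200 observations, first
# forecast origin M = 133, re-estimation each quarter, h-quarter-ahead
# forecasts over the remaining periods, RMSE ratio per variable at the end.
import numpy as np

T, M = 200, 133

def rmse_ratio_one_sample(simulate_sw_sample, fit_dsge, fit_var, h=4):
    y = simulate_sw_sample(T)                    # one artificial data set
    err_dsge, err_var = [], []
    for origin in range(M, T - h + 1):           # rolling forecast origins
        dsge = fit_dsge(y[:origin])              # re-estimate on data so far
        var = fit_var(y[:origin])
        err_dsge.append(y[origin + h - 1] - dsge.forecast(h)[-1])
        err_var.append(y[origin + h - 1] - var.forecast(h)[-1])
    err_dsge, err_var = np.array(err_dsge), np.array(err_var)
    return (np.sqrt((err_dsge ** 2).mean(axis=0))
            / np.sqrt((err_var ** 2).mean(axis=0)))

# Collecting this ratio over 1000 replications traces out its small-sample
# distribution under the chosen null (True model, or marginally-improving
# False model).
```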
Asymptotic Versus Small Sample Distributions
We then normalise the ratio statistic by adjusting its mean and standard deviation; this is plotted against a normal distribution in Fig. 2. The large-sample distribution is very close to normal. The 5 % critical value for the normalised large-sample ratio is 1.543, close to the 5 % critical value of the standard normal distribution (1.645).
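In code, the normalisation amounts to the following short routine over the vector of simulated ratios (our sketch):

```python
# Centre and scale the simulated ratio statistics, then read off the
# empirical 5 % critical value for comparison with the normal's 1.645.
import numpy as np

def normalised_critical_value(ratios, tail=0.95):
    z = (ratios - ratios.mean()) / ratios.std(ddof=1)
    return np.quantile(z, tail)   # approx. 1.543 in the large-sample case here
```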
Power of the Specification Test at the 5 % Nominal Level
Table 1 Power of the OSF test (rejection rates in %, by degree of falseness % F)

% F    GDP growth      Inflation       Interest rate   Joint 3
       4Q     8Q       4Q     8Q       4Q     8Q       4Q     8Q
True   5.0    5.0      5.0    5.0      5.0    5.0      5.0    5.0
1      10.2   5.0      5.8    4.7      4.7    4.8      6.0    4.9
3      23.2   5.0      7.9    4.8      6.5    4.2      9.4    5.2
5      34.9   5.2      13.4   5.1      11.5   4.2      15.3   6.0
7      42.5   5.1      21.3   6.9      18.9   5.4      22.9   6.6
10     52.3   5.5      35.6   10.7     30.3   6.5      36.2   9.8
15     58.0   11.0     62.7   23.7     48.9   11.9     73.8   29.5
20     49.9   60.5     97.8   72.4     62.7   21.3     99.8   90.7
These results are obtained with stationary errors and with a VAR(1) as the benchmark model. We redid the analysis under the assumption that productivity was nonstationary. The results were very similar to those above. We further looked at a case of much lower forecastability, where we reduced the AR parameters of the error processes to a minimal 0.05 (on the grounds that persistence in data can be exploited by forecasters). Again the results were very similar, perhaps surprisingly. It seems that while absolute forecasting ability of a model, whether it is a DSGE or a VAR, is indeed reduced by lesser forecastability, relative forecasting ability is rather robust to data forecastability. Finally, we redid the original analysis using a VAR(2) as the benchmark; this also produced similar results to those above. All these variants, designed to check the robustness of our results, are to be found in Appendix B.
What we see from Table 1 is that the power is weak. On a 1-year-ahead (4Q) forecast, the rejection rate of the DSGE model on its joint performance remains low until the model reaches 20 % falseness, and at the two-year horizon it does not get above 40 % even when the model is 20 % false. Notice also that the individual variable tests show some instability, which is due to the way the OSF uses re-estimated error processes for each overlapping-sample forward projection: each time the errors are re-estimated the full model is in effect changed, and sometimes this improves its forecasting performance, sometimes worsens it. Thus forecast performance does not always deteriorate with rising parameter falseness. When all variables are considered jointly this is much less of a problem, as across the different variables the effects of re-estimation on forecast performance are hardly correlated.
To put this RMSE test in perspective, consider the power of the indirect inference Wald test, in sample, using a VAR(1) on the same three variables (GDP, inflation and interest rates); this is taken from Le et al. (2012a), which also describes in full the procedures for obtaining the test, based on checking how far the DSGE model can generate in simulated samples the features found in the actual data sample (Table 2).
Table 2 Rejection rates for the in-sample Wald test and the joint OSF test for a 3-variable VAR(1)
% F    Wald in-sample II    OSF Joint 3: 4Q    OSF Joint 3: 8Q
True   5.0                  5.0                5.0
1      19.8                 6.0                4.9
3      52.1                 9.4                5.2
5      87.3                 15.3               6.0
7      99.4                 22.9               6.6
10     100.0                36.2               9.8
15     100.0                73.8               29.5
20     100.0                99.8               90.7
The Connection Between Misspecification and Forecast Improvement
For our small samples here we find that the crossover point at which the DSGE model forecasts 1 year ahead less well on average than the unrestricted VAR is for output growth 1 % false, for inflation and interest rates 7 % false; for the three variables together it is also 7 %. This reveals that the lower the power of the forecasting test for a variable the more useful are False models in improving unrestricted VAR forecasts. Thus for output growth where power is higher, the DSGE model needs to be less than 1 % false to improve the forecast; yet for inflation and interest rates where the power is very weak a model needs only to be less than 7 % false to improve the forecast. This is illustrated in the two cases shown in Fig. 3. In the lower one the false distribution with a mean RMSE ratio of unity (where the DSGE model is on average only as accurate as the unrestricted VAR) is 7 % false; hence any model less false than this will have a distribution with a mean ratio of less than unity and will therefore on average improve the forecast. In the upper one the false distribution with a mean RMSE ratio of unity is only 1 % false; so to improve output growth forecasts you need a model that is less than 1 % false. Essentially what is happening with weak power is that as the model becomes more false its RMSE ratio distribution moves little to the right, with the OSF performance deteriorating little; this, as we have pointed out, may be because as the model parameters worsen, the error parameters offset some of this worsening.
What this shows is that if all a policymaker cares about is improving forecasts and the power of the forecast test is weak, then a poorly specified model may still suffice for improvement and will be worth using. This could well account for the willingness of central banks to use DSGE models in forecasting in spite of the evidence from other tests that they are misspecified and so unreliable for policymaking. We now turn to how central banks can check on the forecasting capacity of their DSGE models using OSF tests.
OSF Tests of Whether a DSGE Model Improves Forecasts
Power of Left-Hand and Right-Hand Tails
Table 3 Power of OSF tests: right-hand tail (RHT) and left-hand tail (LHT)
       Joint (Det) RH tail     Joint (Det) LH tail
% F    4Q        8Q            4Q        8Q
True                           16.7      18.8
1                              14.2      17.4
3                              9.8       14.8
5                              7.2       12.9
7      5.0                     5.0       11.3
10     11.3                              9.4
15     46.8      5.0                     5.0
20     99.5      70.5
25     100       100
30     100       100
35     100       100
40     100       100

In each tail the null is the marginal forecast-failure model (7 % false at 4Q, 15 % false at 8Q), so its rejection rate is the nominal 5.0; blank cells were not reported.
The main problem with these tests remains that of poor power.
On the one hand, policymakers could use a DSGE model that was poor at forecasting without detection by the RH tail test. For example, a model that was 3 % more false than the marginal one would be rejected on the crucial 4Q-ahead RH tail test only 11.3 % of the time.
On the other hand, they could refuse to use a DSGE model that was good at forecasting; for example, a model that was 3 % less false than the marginal one would be rejected on the 4Q-ahead LH tail test only 9.8 % of the time.
We can design a more powerful test by going back to Table 2 and using simply the right-hand tail as a test of specification. What is needed is a test of the DSGE model’s specification (as true) that has power against a model so badly specified that it would marginally worsen forecasting performance on the joint 3 variables, i.e. the marginal forecast-failure model: as we have seen, such a model is 7 % false at the 4Q horizon and 15 % false at the 8Q horizon. The power of OSF specification tests against such a bad model is larger: Table 3 below shows that if a model is not rejected (as true) on an OSF 4Q test at 95 % confidence, then the marginal forecast-failure model (the 7 % false model) has a 22.9 % chance of rejection. On an 8Q test the equivalent model (15 % false) has a 29.5 % chance of rejection. Thus the OSF test has better power against the marginal forecast-failure model; but it is still quite weak.
Policymakers could however use the in-sample II test of whether the model is true, also shown in that table. Against the 4Q 7 % false model it has power of 99.4 %, and against the 8Q 15 % false model power of 100 %. Thus if policymakers could find a DSGE model that was not rejected by the II test, they could have virtually complete confidence that it would not worsen forecasts.
If no DSGE model can be found that fails to be rejected, then this strategy would not work and one must use the Diebold–Mariano test faute de mieux, on whatever DSGE model comes closest to passing the II specification test.
Reviewing the Evidence of OSF Tests
If we first consider the forecasting performance of these DSGE models, what we see from Table 4 is that the RMSE ratio of DSGE models relative to different time-series forecasting methods varies from better to worse according to which variable and which time-series benchmark is considered: Gürkaynak et al. (2013) note that there is a wide variety of relative RMSE performance. Wickens (2014), who reviews a wide range of country/variable forecasts, finds the same. No joint performance measures are reported in these papers; however, Smets and Wouters (2007)’s joint ratio comes out at 0.8 against a VAR(1) 4Q ahead and 0.66 8Q ahead.^{4} Thus on these joint ratios the LH tail rejects the marginal forecast-failure model, strong evidence that the SW model forecasts better than a VAR(1).
If we turn now to consider DSGE models’ specification from these results, we see first that in general they do not reject these DSGE models. But because of the low power of the OSF tests, the same would be true with rather high probability of quite false models. Le et al. (2011) show that the SW model is strongly rejected by the II Wald test, which is consistent with these OSF results, since as we have seen a false DSGE model may still forecast better than a VAR. They went on to find a version of the model, allowing for the existence of a competitive sector, that was not rejected for the Great Moderation period. By the arguments of this paper this model must also improve on timeseries forecasts.
Conclusions
OSF tests are now regularly carried out on DSGE models against time-series benchmarks such as the VAR(1) used here as typical. These tests aim to discover how good DSGE models are in terms of (a) specification and (b) forecasting performance. Our aim in this paper has been to discover how well these tests achieve these aims.
We have carried out a Monte Carlo experiment on a DSGE model of the type commonly used in central banks for forecasting purposes and on which out-of-sample forecasting (OSF) tests have been conducted. In this experiment we generated the small sample distribution of these tests and also their power as a test of specification; we found that the power of the tests for this purpose was extremely low. Thus when we apply these results to the reported tests of existing DSGE models we find that none of them is rejected on a 5 % test; but the lack of power means that models that were substantially false would also have a very high chance of not being rejected. Researchers could therefore have little confidence in these tests for this purpose. We show that they would be better off using an in-sample indirect inference test of specification, which has substantial power.
The reason for this relative weakness of OSF tests on DSGE models may be that the model errors, which are increased by the model misspecification, nevertheless compensate when projected forward for the poorer forecasting implied by the misspecified structural parameters. A corollary of the low power is thus that a DSGE model may be badly misspecified and yet still forecast well, improving on VAR forecasts.
Viewed as tests of forecasting performance against the null of doing exactly as well as the VAR benchmark, OSF tests of DSGE models are used widely, with both the left-hand tail of the distribution testing for significantly better performance and the right-hand tail for significantly worse performance. Power is again rather weak, particularly on the left-hand tail. An alternative would again be to use an in-sample indirect inference test of specification; if a DSGE model specification can be found that passes such a test, then it may not only be fit for policy analysis but will also almost definitely improve VAR forecasts.
Footnotes
 1.
 2.
We only re-estimate the errors for a given False model (for each overlapping sample). If we re-estimated the whole False model each period, it would have variable falseness.
 3.
It is defined as follows. Let \(f_y,\,f_\pi,\,f_r\) be the OSF errors of output growth, inflation and the interest rate respectively, and denote \(f=(f_y,\,f_\pi,\,f_r)'\). Then \(f\) is a \((T-l-m)\times 3\) matrix. We can calculate the covariance matrix of \(f\); the joint RMSE is defined as \(\sqrt{\vert \mathrm{cov}(f)\vert }\), the square root of its determinant. (A numerical sketch follows these footnotes.)
 4.
Smets and Wouters (2007) calculate the overall percentage gain as \((\log (\vert \mathrm{cov}(f_{VAR})\vert )-\log (\vert \mathrm{cov}(f_{DSGE})\vert ))/2k\), where \(k\) is the number of variables (here \(=3\)). We convert this to the joint ratio as follows: \((\log (\vert \mathrm{cov}(f_{VAR})\vert )-\log (\vert \mathrm{cov}(f_{DSGE})\vert ))/2k=(\log \sqrt{\vert \mathrm{cov}(f_{VAR})\vert }-\log \sqrt{\vert \mathrm{cov}(f_{DSGE})\vert })/k\approx \frac{\sqrt{\vert \mathrm{cov}(f_{VAR})\vert }-\sqrt{\vert \mathrm{cov}(f_{DSGE})\vert }}{\sqrt{\vert \mathrm{cov}(f_{VAR})\vert }\,k}=\frac{JRMSE_{VAR}-JRMSE_{DSGE}}{JRMSE_{VAR}\,k}=(1-\mathrm{Joint\ Ratio})/k\).
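For concreteness, both footnotes can be checked numerically; the sketch below is ours, with made-up error series, and `joint_rmse` is a hypothetical helper implementing the footnote-3 definition.

```python
# Numerical check of footnotes 3 and 4: the joint RMSE as the square root of
# the determinant of the OSF-error covariance, and the approximate equality
# between the Smets-Wouters percentage gain and (1 - Joint Ratio)/k.
import numpy as np

def joint_rmse(f):                      # f: (periods x 3) matrix of OSF errors
    return np.sqrt(np.linalg.det(np.cov(f, rowvar=False)))

rng = np.random.default_rng(1)
f_var = rng.standard_normal((67, 3))    # toy VAR forecast errors
f_dsge = 0.98 * f_var                   # toy DSGE errors, slightly smaller
k = 3
gain = (np.log(np.linalg.det(np.cov(f_var, rowvar=False)))
        - np.log(np.linalg.det(np.cov(f_dsge, rowvar=False)))) / (2 * k)
ratio = joint_rmse(f_dsge) / joint_rmse(f_var)
print(gain, (1 - ratio) / k)            # agree to first order in log(ratio)
```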
References
Adolfson M, Linde J, Villani M (2007) Forecasting performance of an open economy dynamic stochastic general equilibrium model. Econom Rev 26(2–4):289–328
Christoffel K, Coenen G, Warne A (2011) Forecasting with DSGE models. In: Clements M, Hendry D (eds) Oxford handbook of economic forecasting. Oxford University Press, Oxford
Clark T, West KD (2006) Using out-of-sample mean squared prediction errors to test the martingale difference hypothesis. J Econom 135:155–186
Clark T, West KD (2007) Approximately normal tests for equal predictive accuracy in nested models. J Econom 138:291–311
Clements M, Hendry D (2005) Evaluating a model by forecast performance. Oxford Bull Econ Stat 67(Supplement):931–956
Del Negro M, Schorfheide F (2012) Forecasting with DSGE models: theory and practice. In: Elliott G, Timmermann A (eds) Handbook of forecasting, vol 2. Elsevier, New York
Diebold FX, Mariano RS (1995) Comparing predictive accuracy. J Bus Econ Stat 13:253–263
Edge RM, Kiley MT, Laforte JP (2010) A comparison of forecast performance between Federal Reserve staff forecasts, simple reduced-form models, and a DSGE model. J Appl Econom 25(4):720–754
Edge RM, Gürkaynak RS (2010) How useful are estimated DSGE model forecasts for central bankers? Brook Pap Econ Act 41(2):209–259
Giacomini R, Rossi B (2010) Forecast comparisons in unstable environments. J Appl Econom 25(4):595–620
Gürkaynak RS, Kisacikoglu B, Rossi B (2013) Do DSGE models forecast more accurately out-of-sample than VAR models? CEPR discussion paper no. 9576, July 2013. CEPR, London
Ince O (2014) Forecasting exchange rates out-of-sample with panel methods and real-time data. J Int Money Finance 43(C):1–18
Juillard M (2001) DYNARE: a program for the simulation of rational expectation models. In: Computing in economics and finance, p 213
Le VPM, Meenagh D, Minford P, Wickens M (2012a) Testing DSGE models by indirect inference and other methods: some Monte Carlo experiments. Cardiff economics working paper E2012/15
Le VPM, Meenagh D, Minford P, Wickens M (2012b) What causes banking crises? An empirical investigation. Cardiff economics working paper E2012/14
Le VPM, Meenagh D, Minford P, Wickens M (2011) How much nominal rigidity is there in the US economy? Testing a New Keynesian model using indirect inference. J Econ Dyn Control 35(12):2078–2104
Rogoff KS, Stavrakeva V (2008) The continuing puzzle of short-horizon exchange rate forecasting. NBER working paper 14071
Smets F, Wouters R (2007) Shocks and frictions in US business cycles: a Bayesian DSGE approach. Am Econ Rev 97(3):586–606
West KD (1996) Asymptotic inference about predictive ability. Econometrica 64:1067–1084
Wickens M (2014) How useful are DSGE macroeconomic models for forecasting? Open Econ Rev 25(1):171–193