Testing DSGE Models by Indirect Inference: a Survey of Recent Findings

We review recent findings in the application of indirect inference to DSGE models. We show that researchers should tailor the power of their test to the model under investigation in order to achieve a balance between high power and finding a robust model; this will involve choosing only a limited number of variables on whose behaviour they should focus. Recent work also reveals that it makes little difference which variables these are or how their behaviour is measured, whether via a VAR, IRFs or moments. We also review identification issues, how to test part of a model, and whether alternative evaluation methods such as forecasting or likelihood ratio tests are potentially helpful.

Meenagh, David ORCID: https://orcid.org/0000-0002-9930-7947, Minford, Patrick ORCID: https://orcid.org/0000-0003-2499-935X, Wickens, Michael ORCID: https://orcid.org/0000-0002-6862-0674 and Xu, Yongdeng ORCID: https://orcid.org/0000-0001-8275-1585 2019. Testing DSGE models by indirect inference: a survey of recent findings. Open Economies Review 30 (3), pp. 593-620. 10.1007/s11079-019-09526-w. Publisher's page: http://dx.doi.org/10.1007/s11079-019-09526-w

Introduction
Indirect inference is a method for estimating and testing models of any size, complexity or nonlinearity by comparing the behaviour of data simulated from the model with that of selected observed data, where behaviour is summarised by a set of descriptive features known as the auxiliary model. In estimation the parameters of the structural model are chosen to best match its simulation with the selected data. In testing the simulation of the model with a given set of parameter estimates is compared with the selected data. This gives a joint test of the given parameter values and the structure of the model. Rather than relying on asymptotic theory, the model is simulated many times: in estimation this yields numerical distributions of the estimated parameters and, in testing, the numerical distribution of the test statistic. Indirect inference is not a well-known empirical tool among economists for estimating and testing macroeconomic and other structural models. It attracted attention in the 1990s as a way of estimating nonlinear models and a variety of papers were written then on its estimation properties, essentially seeking to discover whether it could mimic the asymptotic properties of maximum likelihood (it did). But its potential role in testing models, especially in small samples, and its properties as an estimator in small samples, were not explored until the work described here, which began around 2000. It is particularly useful for testing DSGE models which, as they are commonly estimated by Bayesian methods, tend not to be subjected to any formal statistical test. Le et al. (2016a) have provided a survey of testing DSGE models using indirect inference. In this paper we extend their survey by covering recent research which addresses a number of questions that have arisen in the use of indirect inference.
The earlier survey discussed the power of indirect inference tests in small samples, comparing this with the power of tests based on frequentist likelihood methods. It also suggested ways in which modellers could use this power to determine the robustness of their model, especially when used for policy analysis. It was assumed in that survey that the auxiliary model would be a VAR of low order in a few variables while the structural model would be a DSGE model, whether of small size as in Clarida et al. (1999), or large as in Smets and Wouters (2007). Since then the method has been applied to microeconomic and trade models where the features of the structural model are quite different; we do not explicitly consider these models here, but the questions we examine and the answers we give can be easily extended to them.
We address the following issues. In section 2 we discuss some important methodological issues in testing DSGE macroeconomic models, such as whether it is worth testing DSGE models when, as some claim, they are irredeemably mis-specified a priori and, if it is, how best to estimate and test them. In section 3 we compare the power of indirect inference tests in small samples with that of the main alternative frequentist method, the likelihood ratio test. Asymptotically, the two cannot be distinguished under the usual assumption that the model is true. For completeness, this section in part recaps material from our earlier survey. In section 4 we review the evidence related to a variety of issues that arise when carrying out the indirect inference test, such as how many and which variables should be used in the auxiliary model, and whether moments, IRFs or a VAR are preferable auxiliary models. In section 5 we review the relationship between indirect inference and identification. In section 6 we conclude with a review of our main findings. Some of these issues have been addressed in published papers, including our own, in which case we have drawn on these in summary form to bring them into this conspectus of available work. We hope this will be helpful to applied economists using indirect inference.

Why Test a Mis-Specified Model?
It is nearly 40 years since DSGE models (in the form of RBC models) were first used in macroeconomics, and they are now arguably the standard approach. Throughout this time, and continuing to this day, their use has been heavily criticised from a number of perspectives. We will consider some criticisms often made by classically-trained econometricians.
A philosophical question in testing DSGE models raised by some econometricians is whether these models should be dismissed because they are so 'unrealistic' that one must consider them as inherently 'mis-specified' and so cannot be treated as 'true' for testing purposes. There is an irony in this as the impetus behind the use of DSGE rather than traditional macroeconomic models was Lucas's critique that the latter were essentially reduced form and not structural models and therefore could not be used for policy or control purposes.
Influenced by the Lucas critique, the focus in DSGE models is on structural modelling (estimating deep structural rather than 'reduced-form' parameters). This has resulted in DSGE models being smaller and simpler than traditional macroeconometric models, especially in their dynamic specification. The smallness of DSGE models reflects the influence of Friedman's (1953) advocacy of the use of simple rather than complicated models but it makes them more likely to be rejected by the data than traditional macroeconometric models. Traditional models usually have a flexible, and sometimes data-determined, dynamic specification which gives them a better chance of passing statistical tests, though it raises the question of whether this is due to data-mining. An example of a DSGE model that focuses on structural specification with hardly any structural dynamics is the Smets and Wouters (2007) DSGE model of the United States. Significantly, the disturbances in this model are very highly serially correlated, which could indicate severe dynamic mis-specification. The rejection of DSGE models using conventional testing procedures and the frequent presence of highly serially correlated structural disturbances have both contributed to undermining the high ideals originally envisaged for the DSGE approach to macroeconomic modelling and may explain why some econometricians regard DSGE models as mis-specified and of little empirical relevance.
The issue of the 'truth' of models has a long history in economics and statistics. It was also addressed by Friedman (1953) who argued that the appropriate criterion for judging economic models is their predictive ability and not their 'realism' or 'truth'. Rather than dismiss DSGE models as 'incredible', as some have done, or accept that there is no point in testing them because they would fail the test, it would be better to find a way of putting them on firmer statistical foundations. Thus a test of these models should not be about whether or not they are 'true' but whether they are 'pseudo-true', meaning able to match the data of interest. This idea was developed from testing non-nested hypotheses by Cox (1961, 1962). Indirect inference tests proceed on this basis; the model is treated as though true and, if rejected, is assumed to be mis-specified. The usefulness of the test is judged by its power which, in the case of indirect inference tests, we have found to be very high and superior to that of likelihood-based tests.

How Best to Test and Estimate DSGE Macroeconomic Models?
The issue of how best to test and estimate DSGE models also has a long history. In an interview with Evans and Honkapohja (2005) Tom Sargent, referring to rational expectations models, remarked that Lucas and Prescott thought that likelihood ratio tests were rejecting "too many good models" and that "calibration was intended as a balanced approach to professing that your model, though not correct, is still worthy as a vehicle for quantitative policy analysis". The clear implication is that DSGE models, although possibly mis-specified, may still be 'good' in some a priori sense even if rejected statistically, and that calibrating rather than estimating DSGE models using classical econometric methods was a way of imposing a priori knowledge of the model parameters.
Traditional macroeconometric models are not structural but, because of the flexibility this allows, particularly in their dynamic specification, they can be specified in such a way that they pass statistical tests. DSGE models, by contrast, are structural and deliberately simple and, because they are usually rejected using classical inference, strong prior restrictions are imposed in their estimation. One reason why classical likelihood methods are likely to reject mis-specified models is that they take account of all of the model's features and parameters, and not just its key features which may be well specified. For example, the RBC model's long-run properties may be good but the short-run properties may be poor because there is little information on how to specify them. Consequently, failure to allow sufficient lags in the dynamics could easily result in rejection even though the long-run characteristics of the model may be acceptable. Likelihood methods deal with this by just choosing the parameter values to best fit the dynamics revealed in the data and may generate biased estimates in order to do so. This has led to the counter-criticism that conventional econometric models and methods are likely to involve a strong element of data-mining. In effect, Lucas and Prescott were looking for a way to give more weight to some features of the model than others.
Subsequently a compromise was reached between the strong use of a priori knowledge in calibration and classical estimation through the use of Bayesian methods. If the mode of the posterior distribution is chosen as the point estimate of a parameter then, in effect, the Bayesian estimate is a weighted average of the prior and the likelihood functions where the weights reflect the precision of the two types of information; the tighter the prior or the less informative the data likelihood, the more the posterior estimates are dominated by prior beliefs.
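The weighting can be made concrete in the simplest conjugate case. The sketch below assumes a scalar parameter with a normal prior and a normal summary of the likelihood; this is an illustrative assumption only, since DSGE posteriors are not normal in general. Under that assumption the posterior mode is the precision-weighted average of the prior mean and the likelihood-based estimate. The function name and all numbers are hypothetical.

```python
def posterior_mode(prior_mean, prior_sd, mle, mle_se):
    """Posterior mode for a normal prior combined with a normal likelihood.

    Illustrative assumption: both sources of information are normal, so
    the mode is a precision-weighted average of prior mean and MLE.
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / mle_se ** 2
    weight_on_prior = prior_precision / (prior_precision + data_precision)
    mode = weight_on_prior * prior_mean + (1.0 - weight_on_prior) * mle
    return mode, weight_on_prior


# Hypothetical numbers: the same data estimate combined with a tight
# prior and with a loose prior.
tight = posterior_mode(prior_mean=0.7, prior_sd=0.05, mle=0.4, mle_se=0.2)
loose = posterior_mode(prior_mean=0.7, prior_sd=0.50, mle=0.4, mle_se=0.2)
print("tight prior:", tight)
print("loose prior:", loose)
```

With the tight prior the estimate stays close to the prior mean; with the loose prior it moves most of the way to the data estimate.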
As a result of using Bayesian methods to estimate DSGE models, the models are rarely tested for mis-specification. This is despite the fact that the models are almost certainly to some degree incorrect. It is usually regarded as sufficient that the coefficient estimates are significantly different from zero as judged from their posterior distribution. It is common to find that the priors for the parameters are close to their posterior distributions. This could be interpreted in two ways. Either it implies that the priors are very strong relative to the data likelihood and so dominate the posterior thereby undermining this check on model specification. Or, the data likelihood is small implying that the model would not fit the data using classical inference which suggests either that the model is mis-specified, or that these data are not very informative.
Indirect inference is ideally suited to testing a previously estimated model such as a Bayesian estimated DSGE model. As we know that a DSGE model is not a 'true' representation of an economy, the test should be interpreted not as whether the model is 'true' but whether it is 'pseudo-true' and hence provides a sufficiently good representation of the data, as determined by the probability that the model is not rejected by the test. As useful as this is, it is still somewhat of a blunt instrument as, in the event that the model is rejected, it does not reveal which features of the model are causing the rejection. We have attempted to address this issue in Minford et al. (2019) by proposing an indirect inference test of parts of a model rather than the whole model. We found that the rejection of the whole Smets-Wouters model of the United States (Smets and Wouters 2003) was likely to be due to mis-specification of the wage-price sector of the model.

The Auxiliary Model
The early, calibrated, RBC models were evaluated against the data by comparing data simulated from the calibrated model with observed data by the method of "matching the moments". This consists of inspecting the variances, covariances and serial correlation of the two sets of data. Inspecting the moments is not a formal statistical test but it does illustrate the use of an auxiliary model to evaluate a model by comparing simulated and observed data. In order to obtain a formal statistical test it is necessary to derive a single statistic that summarises the model. Indirect inference provides such a test through a comparison of the properties of an auxiliary model that is estimated on the simulated and the observed data. There is more than one choice of which property to base the test on. Examples are (i) the scores, (ii) the estimated coefficients of the auxiliary model, and (iii) the impulse response functions.
A natural choice of the auxiliary model is the VAR solution of a DSGE model: a detailed example is provided in Appendix 1.
It can be shown that a linearised DSGE model of general form, where the variables $y_t$ are determined by the model and $z_t$ are exogenous variables, including shocks, has a solution that depends on expected future values of the exogenous variables. If the exogenous variables are generated by a VAR, then $E_t z_{t+s} = B^s z_t$ $(s > 0)$, and so the solution is the VARX
$$y_t = P y_{t-1} + H z_t + R z_{t-1} + S \xi_t$$
and the complete data set $(y_t, z_t)$ is a VAR. Hence, a VAR is a natural choice of auxiliary model. We may note in passing that the form of the solution of DSGE models implies that their forecasting performance will be similar to that of a VAR unless there is better information available about future values of the exogenous variables, for example, where the exogenous variables are policy variables and there is a credible and accurate announcement of the future policy stance. Supporting evidence for this is provided by Wieland and Wolters (2012) and Wickens (2014).
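The step from the VARX to a VAR in the complete data set can be sketched as follows. The derivation assumes, for illustration, that the exogenous process is the first-order VAR $z_t = B z_{t-1} + \varepsilon_t$, which is consistent with $E_t z_{t+s} = B^s z_t$; the innovation symbol $\varepsilon_t$ is our notation and is not taken from the source.

```latex
% Assumed exogenous process (illustrative): z_t = B z_{t-1} + \varepsilon_t.
% Substitute for z_t in the VARX y_t = P y_{t-1} + H z_t + R z_{t-1} + S \xi_t:
\begin{align*}
y_t &= P y_{t-1} + H\,(B z_{t-1} + \varepsilon_t) + R z_{t-1} + S \xi_t \\
    &= P y_{t-1} + (HB + R)\, z_{t-1} + S \xi_t + H \varepsilon_t .
\end{align*}
% Stacking this with the process for z_t gives a VAR(1) in (y_t, z_t):
\begin{equation*}
\begin{pmatrix} y_t \\ z_t \end{pmatrix}
=
\begin{pmatrix} P & HB + R \\ 0 & B \end{pmatrix}
\begin{pmatrix} y_{t-1} \\ z_{t-1} \end{pmatrix}
+
\begin{pmatrix} S & H \\ 0 & I \end{pmatrix}
\begin{pmatrix} \xi_t \\ \varepsilon_t \end{pmatrix} .
\end{equation*}
```

Under this assumption the composite disturbance is a linear combination of the two innovations, so the stacked system has the VAR(1) form that motivates using a low-order VAR as the auxiliary model.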
We have based our indirect tests on a comparison of the VAR coefficients from simulated and observed data. Using impulse response functions would just involve using functions of these coefficients. We have also found that using VARs of low order in only a few variables provides an indirect test with good power even for a large model such as that of Smets and Wouters, which has a reduced form solution that is a VAR of fourth order in seven variables. And we have found that using numerical estimates of the distribution of the test statistic gives tests of higher power than those based on the asymptotic distribution of the test statistic. This may reflect the small sample sizes available for estimating and testing DSGE models, which consist of 200 or fewer observations.
Although a formal test of a DSGE model should clearly be carried out, matching the moments in the way used to evaluate RBC models does have some advantages. Whereas testing the whole model via a single statistic can be carried out in a way consistent with statistical theory, inspecting individual elements of the model can help reveal which parts of the model best capture the data and which do not. Thus, as discussed in more detail in Wickens (2012), the basic RBC model based on a single productivity shock such as that of King et al. (1988) produces simulated moments which show that consumption is not smoothed as much as the theory predicts, and that real wages are not as flexible as predicted. Much of the subsequent development of DSGE models was to produce models that fitted the data for these features. Another way of focussing on the aspects of the model that are inconsistent with the data, and of doing so in a formal statistical manner, is, as noted earlier, to test parts of the model.
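The test just described can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the "structural model" is a hypothetical bivariate toy process standing in for a solved DSGE model, the auxiliary model is a VAR(1) estimated by OLS, and the descriptors are the VAR coefficients plus the data variances. The Wald statistic measures the distance of the data descriptors from the mean of their simulated distribution, and the p-value is read off the numerical (simulated) distribution of the statistic rather than an asymptotic one. All function names and parameter values are assumptions.

```python
import numpy as np


def simulate_model(theta, T, rng):
    """Toy stand-in for a solved DSGE model (hypothetical): a bivariate
    VAR(1) whose coefficient matrix depends on theta = (rho, kappa)."""
    rho, kappa = theta
    A = np.array([[rho, kappa], [0.0, 0.5]])
    burn = 50
    y = np.zeros((T + burn, 2))
    shocks = rng.standard_normal((T + burn, 2))
    for t in range(1, T + burn):
        y[t] = A @ y[t - 1] + shocks[t]
    return y[burn:]  # drop the burn-in period


def descriptors(y):
    """Auxiliary model: OLS VAR(1) coefficients plus the data variances."""
    X, Y = y[:-1], y[1:]
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return np.concatenate([coef.ravel(), y.var(axis=0)])


def ii_wald_test(data, theta, n_sim=500, seed=0):
    """Wald statistic for the data and its p-value from the simulated
    distribution implied by the model with parameters theta."""
    rng = np.random.default_rng(seed)
    T = len(data)
    sims = np.array(
        [descriptors(simulate_model(theta, T, rng)) for _ in range(n_sim)]
    )
    mean = sims.mean(axis=0)
    omega_inv = np.linalg.inv(np.cov(sims, rowvar=False))

    def wald(a):
        d = a - mean
        return float(d @ omega_inv @ d)

    w_data = wald(descriptors(data))
    w_sims = np.array([wald(a) for a in sims])
    p_value = float((w_sims >= w_data).mean())
    return w_data, p_value


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = simulate_model((0.8, 0.3), 200, rng)   # pretend this is observed
    print("correct parameters:", ii_wald_test(data, (0.8, 0.3)))
    print("wrong parameters:  ", ii_wald_test(data, (0.3, 0.3)))
```

The design choice that matters is that the distribution of the descriptors comes from simulating the model under test, not from the data sample.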

Testing DSGE Models by Frequentist Methods-Indirect Inference and the Likelihood Ratio
In our previous survey we devoted much space to comparing the standard 'direct inference' test, the likelihood ratio test (LR), to the indirect inference test (IIW) in the context of small samples. (The two test procedures are explained in detail in Appendix 2). We showed two main things. First, that the LR test had no power against a mis-specified model, whereas the IIW test had very high power, effectively rejecting a mis-specified model 100% of the time. Secondly, in the situation where the model is well specified but the parameters are numerically inaccurate, we found that the IIW test had considerably more power, rejecting models whose parameters were randomly wrong by 5-7% for 100% of the time, whereas the LR test rejects such models with only small frequency; to reject 100% of the time it needs parameters to be wrong by as much as 20%.
We explained these results by noting that the LR test, when faced with a numerically incorrect model, is similar in power to an indirect inference test where the distribution of the auxiliary model parameters (e.g. the VAR coefficients) is found from the data, as opposed to the restricted model. Thus, here, the IIW test is simulated from the structural model being tested, and the distribution of the VAR coefficients that this structural model implies is used in the test. Plainly this distribution closely reflects the features of the structural model. By contrast, the LR test takes the distribution implied by the sample data, as generated by the (unknown) true model.
The low power of the LR test when the model is mis-specified (but has unrestricted error dynamics) has a different cause. When one tests such a mis-specified model by LR, one first has to estimate it by ML. Thus one is asking: is this (mis-specified) structural model generating the data sample? To give the model the best chance of passing the test, first one re-estimates the model together with its unrestricted error dynamics. In finding a model with best fit, ML will use the estimates of the error processes, thereby minimising the forecast error that the LR test is based on. As a result, ML finds a well-fitting model by substituting fitted error processes that are good at creating low forecast error. Thus it is hard to distinguish the mis-specified model from the true model using the LR test. Moreover, the data sample from the true model will also easily fit the mis-specified model. By contrast when one re-estimates the mis-specified model by II, there is simply no way the structural model can generate the same data behaviour as the true model. Whatever estimates are found will generate different reduced form behaviour. Hence the very high rejection rate. We show the key results from our previous paper on this and report some further findings.

Mis-Specification
First we show the experiment where we test a mis-specified model. The Table below shows the results when the true (Smets-Wouters) model is New Keynesian and the mis-specified model is a New Classical version, and for the reverse case where the true model is New Classical and the mis-specified model is New Keynesian. As can be seen, the power of the II test is close to 100%, whereas that of the LR test is zero (Table 1).
We now investigate a more subtle form of mis-specification. It takes the form of a failure to include in our model features from a more complex model that is in fact generating the data. A fortiori an LR test would again be unable to detect this more subtle form of mis-specification. We set up a Monte Carlo experiment in which the data generating process (DGP) is a more complex model. Our starting point will be, as in Le et al. (2011, 2016a), the Smets-Wouters model of the US where the data are from the early 1980s. Le et al. (2011, 2016b) found that this model, when modified to allow for a competitive sector and for banking, can explain well the main US macro variables: output, inflation and interest rates. It is this model, and versions similar to it, that we have used in previous Monte Carlo experiments. To this model we can add money and a regime shift contingent on the state of the economy, from the Taylor Rule to the zero bound, as in Le et al. (2016b). This makes the model's parameters state-contingent so that it has this form of nonlinearity. We then treat this nonlinear model as if it were the DGP generating the data. Using the indirect inference test with a VAR as the auxiliary model we estimate the power function for the falseness criterion described above in order to assess the sensitivity of the power function to the presence of greater nonlinearity in the true model than the 'assumed true' DSGE model we started with.
We looked at three very similar models, of varying complexity. All three are based on the Smets-Wouters model as modified in Le et al. (2011). Model 1 is that model exactly. Model 2 is that model with the financial shock replaced by the Bernanke et al. (1999) model of banking (the 'financial accelerator'). Model 3 is the same model, together with an extension in which collateral is required and base money acts as cheap collateral, and the additional nonlinearity of the zero bound constraint, triggered whenever the Taylor Rule interest rate falls below a low threshold. These last two models are set out in Le et al. (2016b).
From the point of view of 'realism' and 'truth' we regard model 3 as the most realistic, model 2 as a linear approximation to it, and model 1 as a simpler approximation to model 2. We investigate whether in each case the simpler, less realistic, model can be treated as a valid approximation to the more realistic one.
We carried out the following three experiments with sample sizes between 75 and 200, and with 1000 sample replications. Our IIW test was based in all cases on the coefficients of a three variable VAR(1) in (y, π, R); as is the usual practice in applying these tests, it also includes the three variances making 12 parameters in all. Table 2 shows that in all cases there is an overwhelming probability of rejection, close to 100% and falling to 80% with the smallest sample size of 75 and the two models closest in complexity (models 1 and 2).
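The count of 12 follows from simple arithmetic, sketched below under the assumption that intercepts are excluded and one variance is added per variable. The helper name is hypothetical. For comparison it also evaluates the count for a fourth-order VAR in seven variables, the reduced form of the Smets-Wouters model mentioned earlier, which shows how quickly the number of descriptors grows with the order and size of the VAR.

```python
def n_descriptors(n_vars, order):
    """Number of descriptors: VAR(order) slope coefficients plus one
    variance per variable (intercepts excluded by assumption)."""
    return order * n_vars ** 2 + n_vars


print(n_descriptors(3, 1))  # three-variable VAR(1): 9 coefficients + 3 variances
print(n_descriptors(7, 4))  # seven-variable VAR(4)
```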
The models used for this experiment were those of Le et al. (2011, 2016a), where the parameters were estimated by indirect estimation using US data. In practice, when we carry out the IIW test on a model for a particular sample we re-estimate the model for each sample generated by the true model. As this is highly time-consuming, we did it selectively for two model pairs and different sample sizes: for a sample of 125, model 3 is the complex model and model 1 is the simpler model; with a sample of 75, model 2 is the complex model and model 1 is the simpler model. Table 3 shows that the rejection rate for model 1 for the first pair is still 100%, even though there is some increase in closeness. Thus, even for a sample as small as 125, the rejection of mis-specification remains virtually 100%. In the second case, Table 4, where model 1 is tested using data from model 2 with a sample of only 75, it is somewhat harder to distinguish model 1 from model 2, the two closest models: the rejection rate falls to 68.6%. However, rejection is still overwhelmingly probable. As noted above, an LR test would have had zero power for these mis-specification tests; plainly, the IIW test provides considerable power against mis-specification.
To summarise, we have found that our IIW test can establish a) whether a model can predict relevant features of the data and, if so, b) the bounds within which we can be sure of its specification and parameter values. For the widely used DSGE model examined here, we found that, if the model passes the test about the behaviour of three key macro variables, then its power largely guarantees that no other specification can be correct, and its parameter values lie within a 7-10% region of those estimated.

Parameter Inaccuracy within a Correct Specification
We now consider for a wide variety of VAR auxiliary models how the LR test performs, as compared with the IIW test for different levels of structural parameter inaccuracy. Our results, based on the simulations of the structural model being tested, are reported in Table 5.
Two things can be seen in the comparison of these two tests in the bottom two panels. First, in general, the power of the LR test does not reach 100% rejection until the model parameters reach 20% falsity, regardless of the VAR order, which reflects the detail with which data behaviour is described. The IIW test has high power even when the detail is kept rather low, e.g. with a VAR(1) in just three variables. Thus, we find that a structural model that is just 7% false is rejected with 98% frequency. Second, and by contrast, as the detail included in the VAR description rises with higher order, the IIW test acquires more power.
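A power function of this kind can be estimated by Monte Carlo along the following lines. This is a toy sketch under stated assumptions, not a reproduction of the reported experiments: the "structural model" is a hypothetical scalar AR(1), the descriptors are its estimated AR coefficient and the data variance, and a parameter set that is x% false is formed by raising one parameter and lowering the other by x%. For each degree of falseness the false model's simulated Wald distribution supplies the 5% critical value, and the rejection rate is the share of samples from the true model whose Wald statistic exceeds it.

```python
import numpy as np


def simulate_ar1(rho, sigma, T, n, rng):
    """n independent samples of length T from the toy 'structural' model
    y_t = rho * y_{t-1} + sigma * e_t (hypothetical stand-in for a DSGE model)."""
    burn = 50
    e = sigma * rng.standard_normal((n, T + burn))
    y = np.zeros((n, T + burn))
    for t in range(1, T + burn):
        y[:, t] = rho * y[:, t - 1] + e[:, t]
    return y[:, burn:]


def descriptors(y):
    """Auxiliary AR(1) coefficient and data variance for each sample (row)."""
    x, z = y[:, :-1], y[:, 1:]
    coef = (x * z).sum(axis=1) / (x * x).sum(axis=1)
    return np.column_stack([coef, y.var(axis=1)])


def rejection_rate(true_theta, falseness, T=200, n_samples=500, n_sim=500, seed=0):
    """Share of true-model samples rejected at 5% when testing a model whose
    parameters are 'falseness' (e.g. 0.05) away from the true ones."""
    rng = np.random.default_rng(seed)
    rho, sigma = true_theta
    # Assumed falsification rule: +x% on rho, -x% on sigma.
    false_theta = (rho * (1 + falseness), sigma * (1 - falseness))

    # Distribution of the descriptors implied by the (false) model under test.
    sims = descriptors(simulate_ar1(*false_theta, T, n_sim, rng))
    mean = sims.mean(axis=0)
    omega_inv = np.linalg.inv(np.cov(sims, rowvar=False))

    def wald(a):
        d = a - mean
        return np.einsum("ij,jk,ik->i", d, omega_inv, d)

    critical = np.quantile(wald(sims), 0.95)

    # Samples generated by the true model play the role of observed data.
    samples = descriptors(simulate_ar1(rho, sigma, T, n_samples, rng))
    return float((wald(samples) > critical).mean())


if __name__ == "__main__":
    for x in (0.0, 0.05, 0.10, 0.20):
        print(f"falseness {x:.0%}: rejection rate {rejection_rate((0.8, 1.0), x):.3f}")
```

At zero falseness the rejection rate should be close to the nominal size, and it should rise as the tested parameters move further from the truth.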
Turning to the top panel, where the II test is carried out using the distribution of VAR coefficients derived from the data sample VAR itself (i.e. unrestricted by the structural model being tested), two things emerge. First, as the size of the VAR increases, the information from the data sample is increasingly unable to provide well-defined estimates of the distribution of the VAR coefficients. This is because the variation in too many coefficients has to be evaluated on too little data. Second, for lower order VARs, where evaluation is possible, the power is similar to that of the LR test. This is because, as noted in Appendix 2, these two tests are transforms of each other and therefore tend to produce similar test results.
This section has summarised the findings of our earlier survey paper, from which we concluded that, whatever VAR was used as the benchmark comparator model, LR tests would provide little or no power against mis-specification, and rather low power against numerical inaccuracy. It has also added a further result, namely, that more subtle aspects of mis-specification can be detected with high precision by indirect inference tests.
Next we focus just on the IIW test, considering how best to use it and in what particular form. There are a variety of practical concerns about how best to carry out indirect inference tests. One is whether it matters which features of the data are chosen as the 'descriptors' for the Wald test. As noted above, a VAR is a natural choice of auxiliary model and the test can be based on its moments, its coefficients or its associated impulse response functions; the test results are largely the same. The key element in the power of the test is how many of these descriptors are used; this relates to the point made above that using more is equivalent to requiring the model to replicate more detailed features of the data, as with pixels in a photograph.
Thus we showed above how increasing the number of VAR coefficients raised the power. Another concern is the choice of variables used for the data description. For example, in evaluating the Smets-Wouters model we showed test results using the three main macro variables, output, inflation and interest rates. Would it be any different had we used three other variables, say consumption, investment and real wages? The answer is no, or that there is a negligible difference (Meenagh et al. 2018), as any three variables give much the same results. The reason is related to the last issue: provided the same amount of information about the data features is included, the test works similarly. One can think of each piece of information as being a nonlinear combination of the model's structural coefficients; the question is whether this is matched by their effect on the data. The number of pieces of information gives the number of matches required by the test. What matters is the amount: more VAR coefficients for example, but not which VAR coefficients (Tables 6 and 7).
This may appear to be puzzling; the original idea in DSGE modelling of comparing the moments of data simulated from the model with those from the observed data stressed that it was good to select data features of interest to the user. We have found in our Monte Carlo work on the small sample properties of the IIW test that, provided the number of descriptive features is held constant, whichever variables are focused on, and however their behaviour is measured, whether by moments, IRFs or VAR coefficients, the IIW tests provide roughly the same power and estimates with much the same bias. Why is this? An analogy would be with taking a fingerprint or footprint or skin sample from a person; given that person's DNA, each would be a test of whether that person was the source. Here we take a sample of behaviour and ask whether it comes from this DSGE model. It would seem that provided our data sample is of reasonable size and samples enough features, it can act as a rather accurate test of whether the DSGE model generated it.
Table note: the three variables used in the VAR are (y, π, r), as in Le et al. (2011). Source: Minford et al. (2016).
For policymakers this is a reassuring conclusion. If they find a model that passes ANY of these tests, whichever they choose, they can establish from the test's power with certainty how false their model could be, in different respects that concern them. Usually they will be concerned with 'general falseness' where any or all of their parameters could be mis-estimated: they can then say within what bounds the parameters would lie for their model. They could use this to calculate robustness tests for their policy proposals.

Forecasting Tests
In effect the within-sample LR test assesses how accurate a model is for 'nowcasting' as it is based on the size of within-sample predictive errors. The next Table shows the power of out-of-sample forecasting LR tests for three variables jointly 4 and 8 quarters ahead, together with the IIW test. It can be seen that the out-of-sample LR test has weak power compared to the IIW test; the rejection rises close to 100% only with 20% or more falsity. Non-stationarity does not appear to affect these results (Table 8).

Tests of Parts of Models
The focus in our tests to this point has been on a full DSGE model. As noted earlier, if the full model fails the test then it is necessary to find out which parts of the model are causing the rejection in order to re-specify it. Alternatively, the investigator may not have the time and resources to construct or investigate a full model and would be satisfied with using a few equations of a partially specified model. Minford et al. (2019) have proposed a limited information indirect inference test for testing a subset of equations of the full DSGE model. Like the indirect inference test for a full DSGE model, this limited information test of part of the model is very powerful. The test exploits a useful way to carry out the limited information estimation of a part of a simultaneous equation system. The idea is to replace the structural equations that are not to be tested by their reduced form solution while keeping the structural equations that are to be tested and then to carry out the indirect inference test for the resulting completed model. In practice the unrestricted auxiliary model is used to replace the structural equations not tested. As a result these equations in the auxiliary model based on the simulated data from the completed model will be the same as those for the observed data. The test of the subset of interest is therefore based solely on the two sets of estimates of the other auxiliary equations. The simulated data from the completed model can also be used for the analysis of a partially specified model.
To illustrate the use of this test we examine the specification of two sectors of the SW model: the wage-price equations and the consumption-investment equations. Using an auxiliary VAR(1) model, the limited information and full model tests of the wage-price sector have the power properties reported in Table 9.
The corresponding tests of the consumption-investment equations have the power properties reported in Table 10.
For both sectors the power of the limited information test is similar to that of the full information test. However, the power of the tests of the wage-price sector is much higher than that of the tests of the consumption-investment equations. The first implication is that there is little or no loss in power from using the limited information test. A second implication is that the behaviour of the Smets-Wouters model is much more sensitive to the specification of the wage-price sector than to that of the consumption-investment equations. This suggests that the reason the original full Smets-Wouters model is rejected is that the wage-price sector is mis-specified; any mis-specification of the consumption-investment equations appears not to affect the full model.

So far in this review we have considered only the properties of tests of models, especially their power in small samples, for models that have already been estimated.
In the case of DSGE macroeconomic models estimation is typically by Bayesian methods, for which the model is not subject to a test except, possibly, by inspection of the significance of the posterior estimates via their posterior distribution. We have previously noted some problems in interpreting such a check. If the prior dominates the likelihood function and hence the posterior then such a check will not reveal how well the model fits the observed data. We have also noted the claim that classical likelihood-based methods tend to reject too many 'good' models, hence the use of Bayesian estimation to limit the role of observed data. One reason why likelihood methods may produce estimates not consistent with theory is that they attempt to find a model of best fit. If the model is mis-specified then it will produce biased estimates in order to maximise the fit. This may be why 'good' models are often rejected, meaning the estimated parameters are not compatible with economic theory. Nonetheless, we have found that models estimated by maximum likelihood and tested using the LR test have low power compared with indirect inference. Specifying the model with unrestricted error dynamics is often found to be a way of improving the in-sample fit of models estimated by maximum likelihood, but it may also have the effect of lowering the power of the LR test, especially in small samples. Indirect estimation is an alternative to both Bayesian and maximum likelihood estimation. We have found that it provides an estimator with smaller biases than maximum likelihood, typically around 1% for the average absolute bias across the DSGE parameters. This property comes from the high power of the test in rejecting false parameter values: the IIW statistic tends to rise increasingly rapidly as the parameter estimates diverge from the true values.
Regardless of whether the estimator is based on VAR coefficients, IRFs or moments, the small sample bias is small: it is essentially the same for the first two, with a slightly larger bias for moments. To illustrate, when one compares the indirect estimates based on these three descriptors using different three-variable sets, we find that the bias is small and hardly differs. These results are reported in Tables 11 and 12.

Parameter definitions for Tables 11 and 12:
$\alpha$: income share of capital
$h$: external habit formation
$\iota_p$: degree of price indexation
$\iota_w$: degree of wage indexation
$\xi_p$: degree of price stickiness
$\xi_w$: degree of wage stickiness
$\varphi$: elasticity of the capital adjustment cost function
$\Phi$: 1 + the share of fixed costs in production
$\psi$: elasticity of the capital utilization adjustment cost
$r_{\Delta y}$: Taylor rule coefficient
$\rho$: Taylor rule coefficient (interest rate smoothing)
$r_\pi$: Taylor rule coefficient
$r_y$: Taylor rule coefficient
$\sigma_c$: elasticity of intertemporal substitution for labor
$\sigma_l$: elasticity of labor supply to real wage

This bias can be driven as low as one wishes by increasing the number of features in the auxiliary model, as the power rises with the number of features until the point is reached where the full reduced form VAR is used (or its equivalent in moments or IRFs). This is because the indirect estimator and the indirect inference test are based on the same objective function: deviations of the estimates of the parameters of the auxiliary model from simulated and observed data. The bias reduction comes at the cost of the indirect inference test having extremely high power against even the slightest parameter falsity. It follows that if one has a tractable model that is not exactly the true model, the parameter values it will estimate a) may not pass this powerful test, as they are simply the values that get closest to passing it, and b) may not pass the weaker-powered test either, as they were not selected as the values that get closest to passing this weaker test.
From the user's viewpoint the key aim is to find a tractable model with parameter values that pass the test; having found such a model it is then possible to assess the robustness of the whole-model results to potential falsity, as described earlier.
If, however, one cannot find a model that passes the test, there is no way to make this assessment. Hence our view is that it is best to use in indirect estimation the same criterion as for the indirect test statistic that was given above. It is notable that Dridi et al. (2007) propose a two-step procedure to achieve both objectives: estimation and evaluation of mis-specified DSGE models. In the first step the model is estimated using a well chosen set of moments; in the second step, the model is evaluated with chosen features of the data that the model tries to replicate. They derive the asymptotic distribution of the test statistic under the hypothesis that the DSGE model is mis-specified and therefore use the variance-covariance matrix from the unrestricted VAR. Hall et al. (2012) and Guerron-Quintana et al. (2017) use the IRFs as the data descriptor in the auxiliary model and discuss both the small sample and large sample properties of the indirect approach to estimation, but not as a method of testing a model. As we have seen above, Minford et al. (2016) compare the test with different data descriptors and find that mostly the properties are quite similar. The two ways of conducting indirect estimation are, of course, closely related.

There has been a persistent suspicion that DSGE model parameters cannot be uniquely recovered from the data; in other words, that the model is not identified. Canova and Sala (2009) and Le et al. (2012) have provided examples of the lack of identification in New Keynesian DSGE models; see also Appendix 1. In response to such concerns various efforts have been made to establish whether various prominent DSGE macro models are identified. One avenue has been to use the rank condition, which tests whether, with no limits on data availability, a DSGE model's parameters can be uniquely discovered: see Iskrev (2010), Komunjer and Ng (2011) and Qu and Tkachenko (2012).
Another avenue is to use indirect inference in the way suggested by Le et al. (2017). These two methods establish 'precise' identification, i.e. whether, in the presence of unlimited data, another set of parameters can be found that generates exactly the same reduced form as the DSGE model in question. A related issue is that of 'weak identification'. By this is meant whether individual model parameter values can be distinguished with any confidence from other values from some false, competing, version of the model. As the data sample becomes smaller this becomes of increasing concern; when data is unlimited this reduces to the same as precise identification. In practice data is limited and so one would like to examine identification in the context of much, but not unlimited, data. This can be addressed by indirect inference by asking how much power the test has against false parameter values in relevant data sample sizes. Le et al. (2017) find that Smets-Wouters parameters are generally strongly identified according to this criterion but, in a small three equation New Keynesian model similar to that of Clarida et al. (1999), several parameters suffer from weak identification.
This analysis used the model's full reduced form which gives maximum discrimination against false parameter values. In practice we use, as we have seen, a low order VAR in only a few variables (or the equivalent in moments or IRFs). Does using such a low-order approximation to the reduced form create a weakness of identification?
We know that the Smets-Wouters model is identified in the sense that its full reduced form is unique to it; see Le et al. (2017). We may, however, be concerned that as we reduce the number of variables and the order of the VAR that is approximating the reduced form, the amount of information included drops sufficiently to give very imprecise estimates of the DSGE model parameters. When the sample is also small, this problem could become more acute. We then have the problem of weak identification.
This problem can be investigated through Monte Carlo experiments. We falsify individual parameters of the Smets Wouters model progressively and check the rejection rate over many samples from the true model. If the power is poor so that the rejection rate barely rises with such falsity then we can regard the parameters as weakly identified as, plainly, we cannot distinguish well between possible false and true parameter values. The results of this experiment are reported in Table 13.
This exercise shows that rejection rates vary across parameters: while the rejection rates of three of the parameters rise only slowly, rejection rates for the others rise quite fast. This mirrors the findings of Le et al. (2017) for the full VAR reduced form of the Smets and Wouters model and a large sample. It seems that reducing the VAR to order 1 with only three variables, and using a much smaller sample, does not create weak identification.
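The Monte Carlo logic described above (falsify a parameter progressively, then record how often samples from the true model are rejected) can be sketched in a deliberately small setting. This is a minimal illustration, not the procedure applied to the Smets-Wouters model: the 'structural model' here is assumed to be a scalar AR(1) with a single parameter, the auxiliary features are the OLS AR(1) coefficient and residual variance, falsity is assumed to be a proportional change in the parameter, and the 5% critical value is taken from the simulated distribution of the Wald statistic. All function names are hypothetical.

```python
import numpy as np

def simulate_ar1(rho, T, n, rng):
    """Draw n independent AR(1) paths of length T+1, each started at zero."""
    y = np.zeros((n, T + 1))
    eps = rng.standard_normal((n, T))
    for t in range(T):
        y[:, t + 1] = rho * y[:, t] + eps[:, t]
    return y

def aux_features(y):
    """Auxiliary model per path: OLS AR(1) coefficient and residual variance."""
    z, x = y[:, :-1], y[:, 1:]
    a = (z * x).sum(axis=1) / (z * z).sum(axis=1)
    s2 = ((x - a[:, None] * z) ** 2).mean(axis=1)
    return np.column_stack([a, s2])

def wald(f, mean, W_inv):
    """Row-wise quadratic form (f - mean)' W^{-1} (f - mean)."""
    d = f - mean
    return np.einsum("ij,jk,ik->i", d, W_inv, d)

def rejection_rate(rho_true, falsity, T, n_sims, n_mc, rng):
    """Share of true-model samples rejected by a model falsified by `falsity`."""
    rho_false = rho_true * (1.0 + falsity)  # assumed form of falsification
    f_model = aux_features(simulate_ar1(rho_false, T, n_sims, rng))
    mean = f_model.mean(axis=0)
    W_inv = np.linalg.inv(np.cov(f_model, rowvar=False))
    # 5% critical value from the model-implied distribution of the statistic
    crit = np.quantile(wald(f_model, mean, W_inv), 0.95)
    f_true = aux_features(simulate_ar1(rho_true, T, n_mc, rng))
    return float(np.mean(wald(f_true, mean, W_inv) > crit))

rng = np.random.default_rng(0)
rates = {x: rejection_rate(0.5, x, T=200, n_sims=1000, n_mc=1000, rng=rng)
         for x in (0.0, 0.1, 0.2, 0.4)}
for x, r in rates.items():
    print(f"falsity {x:.0%}: rejection rate {r:.3f}")
```

A parameter whose rejection rate barely moves as the falsity grows would, in the sense used above, be weakly identified; in this toy case the rate should rise steeply because the single parameter maps directly into the auxiliary coefficient.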
7 Conclusions

Indirect Inference is a method for testing and estimating models of any size, complexity or nonlinearity by comparing the behaviour of data simulated from the model with that of selected observed data where behaviour is summarised by a set of descriptive features, known as the auxiliary model. Its use in the testing and estimation of macroeconomic and other structural models has increased in the past few years. Nonetheless, as it is a relatively novel procedure, many questions remain about its detailed application. As these questions go beyond the scope of the survey in Le et al. (2016a), we update that survey by providing answers to some of these additional questions.
The earlier survey discussed the power of indirect inference tests in small samples in comparison with tests based on the main frequentist testing alternative using the data likelihood. The survey suggested ways in which modellers could use this power to determine the robustness of their policy or other user results. In that survey it was assumed that the auxiliary model would be a VAR of low order in a few variables while the structural model would be a DSGE model of some sort.
In this update we have discussed the view that DSGE models are so unrealistic that it is pointless to test them. We have argued that the main purpose of testing is to establish whether, even if they are mis-specified, they are robust enough to be of practical use and they should be regarded as pseudo-true for the purposes of hypothesis testing. As DSGE models are commonly estimated using Bayesian methods in which the priors often dominate the data likelihood, but are not then tested, a check on the robustness of the estimates seems to us to be advisable. Indirect inference requires the use of an auxiliary model. We have reported results from our previous work and from new results obtained for this paper which show how the choice of auxiliary model affects the power of the test. We have found that using a VAR of low order in only a few variables may be adequate for providing a powerful test of a large model such as that of Smets and Wouters (2007) even though its true reduced form is a much higher order VAR in more variables. Moreover, the selection of which variables to use in the auxiliary VAR model is found to make little difference. It also makes little difference to the test results whether one uses a VAR as the auxiliary model or moments (as in the Simulated Method of Moments) or impulse response functions as in some recent applied work.
We also showed how one can test parts of models using indirect inference. There is little or no loss of power compared to using the full model. We have shown how indirect inference tests can be used to check both the precise and weak identification of DSGE models, although their identification does not seem to be much of a problem in practice. Likelihood ratio tests often have low power within sample; we have shown that they also have low power in forecasting tests out of sample.
In summary, we continue to think, as suggested in the original survey of Le et al. (2016a), that indirect inference provides a powerful test of structural macro models, which enables policymakers and other users to assess rather accurately how robust their models are to possible errors of specification and estimation.
Appendix 1: The auxiliary model: a VAR representation of a DSGE model

There are several ways of deriving a VAR representation of a DSGE model. We make use of the ABCD framework of Fernandez-Villaverde et al. (2007). We consider solely what these authors call the 'square' case, where the number of errors and the number of observable variables are the same. We also consider only DSGE models with no observable exogenous variables. Both the Smets-Wouters model (Smets and Wouters 2003, 2007) and the 3-equation New Keynesian model used by Le et al. (2017) and Liu and Minford (2014) for their numerous IIW tests fit this framework. (Other classes of models, for example those with 'news shocks', require a different treatment which is beyond our scope here.)

To illustrate, consider the 3-equation New Keynesian model of Le et al. (2017):

$[...] + e_{yt}$
$r_t = \gamma \pi_t + \eta y_t + e_{rt}$
$e_{it} = \rho_i e_{i,t-1} + \varepsilon_{it} \quad (i = \pi, y, r)$

This has the solution

$z_t = \Phi e_t$

where $z_t' = [\pi_t, y_t, r_t]$, $e_t' = [e_{\pi t}, e_{yt}, e_{rt}]$ and $\Phi = K \times H$. Thus the matrix $\Phi$ is restricted: it has 9 elements but consists of only 5 structural coefficients (the $\rho_i$ can be recovered directly from the error processes), implying that the model is over-identified according to the order condition. The model is not identified, however, if $\rho_i = 0$ for all $i$. 2

2 Le et al. (2017) also establish that it is identified using the IIW test in unlimited-size sampling.

The solved structural model can be written in ABCD form as follows, where $y$ (replacing $z$ above) is now the vector of endogenous variables and $x$ (replacing $e$ above) is the vector of error processes:

$x_t = A x_{t-1} + B \varepsilon_t$
$y_t = C x_{t-1} + D \varepsilon_t$

with $A = P$, $B = I$, $C = \Phi P$ and $D = \Phi$, where $P$ is the diagonal matrix of the $\rho_i$.

Note that $y_t = \Phi x_t$ is the (solved) structural model. Hence $x_t = \Phi^{-1} y_t$. The VAR representation is 3

$y_t = \Phi P \Phi^{-1} y_{t-1} + \Phi \varepsilon_t$

3 If the DSGE model also had one-period lags in one or more of the equations so that the solution became $z_t = \Phi e_t + \Lambda z_{t-1}$ then we would obtain a VAR(2) as follows: (1) $x_t = A x_{t-1} + B \varepsilon_t$; (2) $y_t = C x_{t-1} + D \varepsilon_t + \Lambda y_{t-1}$. Using $x_{t-1} = \Phi^{-1}(y_{t-1} - \Lambda y_{t-2})$ we obtain $y_t = (\Phi P \Phi^{-1} + \Lambda) y_{t-1} - \Phi P \Phi^{-1} \Lambda y_{t-2} + \Phi \varepsilon_t$.

More generally, the solution of a linearised DSGE model (including the SW model and the 3-equation model) can be summarised by a state-space representation in which $x_t$ is an $n \times 1$ vector of possible unobserved state variables, $y_t$ is a $k \times 1$ vector of variables observed by an econometrician, and $\varepsilon_t$ is an $m \times 1$ vector of economic shocks affecting both the state and the observable variables, i.e., shocks to preferences, technologies, agents' information sets, and economist's measurements. The shocks $\varepsilon_t$ are Gaussian vector white noise satisfying $E(\varepsilon_t) = 0$, $E(\varepsilon_t \varepsilon_t') = I$. The matrices A, B and C are functions of the underlying structural parameters of the DSGE model. Using the ABCD framework of Fernandez-Villaverde et al. (2007), the state-space representation can be written as a VAR.

We have assumed that the DSGE model includes no observable exogenous variables. If it does then the solution to the DSGE model contains exogenous variables as well as lagged endogenous variables: in general, lagged, current and expected future exogenous variables. If, however, the exogenous variables are assumed to be generated by a VAR process then the combined solution of both the endogenous and exogenous variables is a purely backward-looking model that can be represented as a VAR.

Appendix 2: The LR and the IIW test statistics: asymptotic comparisons

In indirect inference we do not impose the restrictions on the coefficients of the auxiliary model that are implied by the structural model.
Instead, we estimate the auxiliary model on data simulated from the structural model and compare these estimates with those obtained from using the observed data. In both cases the auxiliary model is estimated without any coefficient restrictions. The restrictions imposed by the DSGE model are reflected in the simulated data and not through explicit restrictions on the auxiliary model. Since both the LR test and the IIW test involve estimation of an unrestricted VAR, first we briefly review the maximum likelihood estimation (MLE) of a standard unrestricted VAR. Consider a randomly generated sample of $y_t$ of size $T$. If $\eta_t$ is assumed to be NID(0, $\Sigma$) then the log-likelihood function is [...]

[...] (2007), Del Negro and Schorfheide (2004, 2006) and Del Negro et al. (2007a, 2007b) (together with the comments by Christiano (2007), Gallant (2007), Sims (2007), Faust (2007) and Kilian (2007)), and Wickens (2014) [...]

In order to find the variance matrix of $\hat{V}$ it is convenient to re-express the VAR. Denoting the $T$ observations on the $i$th element of $y_t$ as the $T \times 1$ vector $y_i$ and of $\eta_t$ as $\eta_i$, each equation of the VAR may be written as

$y_i = Z v_i + \eta_i$

where $v_i'$ is the $i$th row of $V$ and $Z$ is a $T \times k$ matrix with $t$th row $y_{t-1}'$. The VAR may now be written in matrix form as [...] where $Y =$ [...]. In general $\hat{v}$ is a biased estimate of $v$ as $Z$ consists of lagged endogenous variables, but plim $\hat{v} = v$ and the limiting distribution of $\sqrt{T}$ [...]

The LR test

The LR test for a DSGE model based on the observed data compares the likelihood function of the auxiliary VAR derived from the DSGE model with the likelihood function of the unrestricted VAR computed on the observed data. The former is based on the estimate of the variance matrix of the structural errors from the solution to the DSGE model. On the assumption that the auxiliary model is the solution to the DSGE model and is a VAR, this is also the error variance matrix of a restricted version of the auxiliary VAR.
The latter is based on the estimate of the error variance matrix of the unrestricted auxiliary VAR. As the auxiliary model is a VAR, the LR test is, in effect, based on the one-period ahead forecast error matrix. Thus, the logarithm of the likelihood ratio test is

$LR = 2(\ln L_U - \ln L_R) = T(\ln|\Sigma_R| - \ln|\hat{\Sigma}|)$

where $L_R$ and $L_U$ denote the likelihood values of the restricted and unrestricted VAR, respectively, and $\Sigma_R$ and $\hat{\Sigma}$ are the restricted and unrestricted error variance matrices. Note that, given estimates of the DSGE model, we can solve the model for $v$, and hence we can calculate $\eta_t$ and $\Sigma_R = T^{-1} \sum_{t=1}^{T} \eta_t \eta_t'$. Note also that the LR test can be routinely transformed into a (direct inference) Wald test between the unrestricted and the restricted VAR coefficients, $v$.
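As a concrete illustration of this comparison of one-step-ahead error variance matrices, the following sketch computes an LR statistic for a VAR(1) on one observed sample. It assumes the standard Gaussian concentrated-likelihood form $T(\ln|\Sigma_R| - \ln|\hat{\Sigma}|)$; the matrices `V_true` and `V_false` and all function names are hypothetical and stand in for the VAR implied by a solved DSGE model, which is not derived here.

```python
import numpy as np

def simulate_var1(V, T, rng):
    """Simulate y_t = V y_{t-1} + eta_t with eta_t ~ N(0, I), y_0 = 0."""
    k = V.shape[0]
    y = np.zeros((T + 1, k))
    for t in range(1, T + 1):
        y[t] = V @ y[t - 1] + rng.standard_normal(k)
    return y

def ols_var1(y):
    """Unrestricted OLS estimate of V (equal to Gaussian ML for a VAR)."""
    B = np.linalg.lstsq(y[:-1], y[1:], rcond=None)[0]  # solves Z B = Y, B = V'
    return B.T

def error_variance(y, V):
    """T^{-1} sum of eta_t eta_t' with eta_t = y_t - V y_{t-1}."""
    eta = y[1:] - y[:-1] @ V.T
    return eta.T @ eta / eta.shape[0]

def lr_statistic(y, V_restricted):
    """T (ln|Sigma_R| - ln|Sigma_hat|) on the observed sample y."""
    T = y.shape[0] - 1
    logdet_r = np.linalg.slogdet(error_variance(y, V_restricted))[1]
    logdet_u = np.linalg.slogdet(error_variance(y, ols_var1(y)))[1]
    return T * (logdet_r - logdet_u)

rng = np.random.default_rng(0)
V_true = np.array([[0.5, 0.1], [0.0, 0.3]])   # hypothetical true VAR
V_false = V_true + 0.4 * np.eye(2)            # hypothetical false VAR
y_obs = simulate_var1(V_true, 200, rng)

lr_true = lr_statistic(y_obs, V_true)
lr_false = lr_statistic(y_obs, V_false)
print(f"LR under true restrictions:  {lr_true:.2f}")
print(f"LR under false restrictions: {lr_false:.2f}")
```

Because OLS minimises the determinant of the residual variance matrix, the statistic is non-negative, and it grows as the restricted coefficients move away from the values that generated the data.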
To obtain the power function of the LR test we endow the structural model with false values of the structural coefficients and compare the restricted VAR with the unrestricted VAR on the observed data which are assumed to be generated by the true model. The implied false model has the VAR [...]. The forecast errors for the false model are [...] $(\eta_t + q_t)(\eta_t + q_t)'$ [...] then the LR test for the false model is given by: [...]. Thus the power of the test derives from the distance [...].

The IIW test

In the IIW test we simulate data from the solution to the already estimated DSGE model, randomly drawing the samples from the DSGE model's structural errors. We then estimate the auxiliary VAR using these simulated data. We repeat this many times to obtain the average estimate of the coefficients of the VAR which we take as the estimate of the unrestricted VAR. The simulated VAR may be written

$y_{S,t} = V_S y_{S,t-1} + \eta_{S,t}$

where $y_{S,t}$ is the data simulated from the DSGE model and $V_S$ is the (average) estimate of $V$ or, in the form of eqs. (9) and (10), as [...]. The IIW test statistic, which computes the distance of these estimates from the unrestricted estimates based on the observed data, is:

$(\hat{v} - v_S)' W_S^{-1} (\hat{v} - v_S)$

where $W_S$ is the covariance matrix of the limiting distribution of $v_S$, and is given by [...]. On the null hypothesis that the DSGE model, and hence the auxiliary VAR, are correct, the asymptotic distribution of the estimate of $v_S$ is the same as that of the MLE $\hat{v}$. Moreover, asymptotically, this IIW statistic will have the same distribution as $(\hat{v} - v)' W^{-1} (\hat{v} - v)$ and hence will have the same critical values. In general, the IIW statistic differs from a standard Wald statistic in indirect inference which is $(\hat{v} - v_S)' W^{-1} (\hat{v} - v_S)$ where $W$ is the covariance matrix of the unrestricted model; we refer to this as the unrestricted IIW statistic.
The power of the IIW test is calculated, like that for the power calculations for the LR test, by simulating the DSGE model using false values of its coefficients and now using these data to estimate the unrestricted VAR from eq. (12). The IIW statistic is then computed from

$(\hat{v} - v_F)' W_F^{-1} (\hat{v} - v_F)$

where $v_F$ is the mean vector of coefficients and $W_F$ is their variance matrix, which corresponds to $W_S$. Consider the decomposition [...]. It follows that the IIW statistic can be decomposed as [...] where the last term is based on the difference between the true and the false values of the coefficients. Hence the power of the IIW test derives from the second term on the right-hand side of eq. (19).
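The steps of the IIW test can be sketched as follows for an auxiliary VAR(1). This is a minimal sketch under stated assumptions: the 'model' is any hypothetical callable that returns a simulated sample, the covariance matrix $W_S$ is approximated by the sample covariance of the simulated VAR coefficients, and the p-value is read off the simulated distribution of the statistic rather than an asymptotic one. None of the names below come from the original text.

```python
import numpy as np

def simulate_var1(V, T, rng):
    """Stand-in simulator: y_t = V y_{t-1} + eta_t, eta_t ~ N(0, I), y_0 = 0."""
    k = V.shape[0]
    y = np.zeros((T + 1, k))
    for t in range(1, T + 1):
        y[t] = V @ y[t - 1] + rng.standard_normal(k)
    return y

def ols_var1_vec(y):
    """OLS VAR(1) coefficients stacked row by row, i.e. v = vec(V')."""
    B = np.linalg.lstsq(y[:-1], y[1:], rcond=None)[0]  # B = V'
    return B.T.reshape(-1)

def iiw_test(y_obs, simulate, n_sims, rng):
    """Return the Wald statistic and its simulated p-value.

    `simulate(T, rng)` must return one sample of T+1 observations
    generated by the model under test.
    """
    T = y_obs.shape[0] - 1
    v_obs = ols_var1_vec(y_obs)
    v_sims = np.array([ols_var1_vec(simulate(T, rng)) for _ in range(n_sims)])
    v_bar = v_sims.mean(axis=0)                          # plays the role of v_S
    W_inv = np.linalg.inv(np.cov(v_sims, rowvar=False))  # approximates W_S^{-1}

    def wald(v):
        d = v - v_bar
        return float(d @ W_inv @ d)

    w_obs = wald(v_obs)
    w_sims = np.array([wald(v) for v in v_sims])
    return w_obs, float(np.mean(w_sims >= w_obs))

rng = np.random.default_rng(0)
V_true = np.array([[0.5, 0.1], [0.0, 0.3]])   # hypothetical data-generating VAR
V_false = V_true + 0.4 * np.eye(2)            # hypothetical false model
y_obs = simulate_var1(V_true, 200, rng)

w_true, p_true = iiw_test(y_obs, lambda T, r: simulate_var1(V_true, T, r), 300, rng)
w_false, p_false = iiw_test(y_obs, lambda T, r: simulate_var1(V_false, T, r), 300, rng)
print(f"true model:  Wald = {w_true:.2f}, p = {p_true:.3f}")
print(f"false model: Wald = {w_false:.2f}, p = {p_false:.3f}")
```

Note that the covariance matrix is taken from the model being tested, not from the observed data, which is the feature that distinguishes this statistic from the unrestricted Wald form discussed above.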

Comparing the power of the two tests
We have seen that the LR test compares the one-step ahead forecast error matrix of the unrestricted VAR with that of the model-restricted VAR using the observed data, whereas the IIW test asks whether the distribution of the VAR coefficients based on the simulated data (the restricted model) covers the VAR coefficients based on the observed data (the unrestricted model). We have also found that on the null hypothesis that the DSGE model is true the limiting distributions of the two sets of estimates are the same. It follows from eq. (7) that, on the null hypothesis, the error variance matrix using simulated data is [...] where $\hat{\Sigma}$ is the error variance matrix of the unrestricted VAR using the observed data and $\Delta$ is $O_p(T^{-1/2})$.
Using the result that $\mathrm{vec}(AXB) = (B' \otimes A)\mathrm{vec}(X)$, and $\mathrm{vec}(V') = v$, it can be shown that [...]. In other words, on the null hypothesis that the DSGE model is the true model, the LR test based on observed data is asymptotically equivalent to using the IIW test, which is based on simulated data.
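The two vectorisation results used in this step can be checked numerically. The short sketch below uses arbitrary random matrices (an assumption made purely for illustration) and treats vec as column stacking, so that vec(V') stacks the rows of V.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into a single vector (column-major order)."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

# vec(AXB) = (B' kron A) vec(X)
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
identity_holds = bool(np.allclose(lhs, rhs))

# vec(V') stacks the rows of V, matching v built from the rows v_i'
V = rng.standard_normal((3, 3))
rows_stacked = bool(np.array_equal(vec(V.T), V.reshape(-1)))

print(identity_holds, rows_stacked)
```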
In the power calculations we use [...]. The power of the test derives from the last term which reflects the difference between $V_S$ and $V_F$. This makes $\Delta$ of order $O_p(1)$, which does not vanish as $T \to \infty$, but causes the power of the test to tend to unity. It is worth relating this finding to the work of Dridi et al. (2007) who propose a Wald II test that treats the model being investigated as mis-specified; they therefore use the variance-covariance matrix from the unrestricted VAR which is generated by the unknown true model. This II test is asymptotically equivalent to the LR test, as we have seen, and differs from the IIW test proposed here which is based on the restricted VAR generated by the DSGE model being investigated; in effect this IIW test treats the DSGE model as pseudo-true and therefore as the null. In what follows we systematically compare the small sample properties of these different tests; as we will see the IIW test has the greatest power. It is really irrelevant whether the DSGE model being tested is regarded as 'mis-specified' or not, since, as we have already argued, this is not the issue: the issue is whether such a model is sufficiently close to the data on the test chosen to be regarded as 'pseudo-true'. For establishing this it is helpful to have a test with as much potential power as possible; as we will show below, that potential power can then be tailored flexibly to the user's problem.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
