Quantitative Macroeconomics: Lessons Learned from Fourteen Replications

I replicate all tables and figures from fourteen papers in Quantitative Macroeconomics, with an emphasis on incomplete market heterogeneous agent models. I report three main findings: (i) all (non-welfare related) major findings of the papers replicate, (ii) welfare findings based on linear approximation methods—1st-order perturbation, linear and log-linearization around steady-state, and linear-quadratic methods—should be treated as quantitatively suspect, (iii) decisions around methods for discretizing exogenous shocks have a large and unappreciated influence on results and should be prominently discussed in papers. While some smaller aspects of the papers do not replicate exactly, rather than nitpick in the body of this paper I instead describe some lessons learnt that may be useful for practitioners working with Quantitative Macroeconomic models. The replications use global methods allowing for non-linearities and I argue that these are important and need to be more widely used. I provide a checklist that researchers can use when trying to check that their work will be more easily reproducible. Matlab codes implementing the replications using the VFI Toolkit are provided, and full results of all replications are given in the online appendix. I conclude with three core points for best practice: (i) codes be made directly available (e.g., on github, not only ’on request’, and not just inside a zip file), (ii) report not just baseline parameters but also hyperparameters, equilibrium values, non-baseline parameters and initial conditions, and (iii) replication means rewriting codes from scratch, not just re-running available codes.


Mathematics Subject Classification
Imitation is the sincerest [form] of flattery.
-Colton, Charles Caleb (1824) I quasi-replicate a number of classic papers in Quantitative Macroeconomics. The replications are quasi-replications in two senses: I do not attempt to use the same numerical methods to solve the model as the original authors, and I (only) replicate all figures and tables relating to the model. 1 My interest is not in nitpicking about where the original papers report a 'wrong number' (whether due to typo, coding error, etc.), and for this reason I relegate all the actual replicated tables and figures to the appendix. The focus of this paper is instead on the lessons to be learned from these replications and on providing some suggestions for best practice based on the experience of performing the replications.
My main finding is that there is no replication crisis in the Macroeconomics of Quantitative Macroeconomics, but there is a minor crisis in the Quantitative. By this I mean that the major conclusions from all the papers replicated are unchanged, but most of the papers contain some numbers that are incorrect by a magnitude that is quantitatively important. 2 Replication is typically thought of as relating to data and statistics. So why replicate computational results from Quantitative Macroeconomics? The main reason is the exact same reason underlying the importance of replication to data and statistics: establishing the reliability of existing results. The need to do so follows directly from thinking of computational models as a form of laboratory in which we run experiments (Bona and Santos, 1997). A secondary use for replications follows as Economists often learn to write code by solving existing models and replication provides the needed reliable solutions for this. 3 If anything, simple mistakes may be more common when computing Quantitative Macroeconomic models than in other parts of Economics as they depend not only on using data and statistics but also require substantial coding. An additional reason is to understand the influence on Macroeconomics of the choice of which numerical methods are used to solve the model: I document some interesting examples of the importance of this. Table 1 provides a list of the papers I replicate with a focus on general equilibrium heterogeneous agent models with incomplete markets. In defense of the selection I simply note that the replicated papers are well cited with a mean number of citations of 394 in Ideas Repec and of 1167 in Google Scholar as of early 2021 (and a minimum number of citations of 80 and 224, respectively). The codes implementing these replications are all available at github.com/vfitoolkit/vfitoolkitmatlab-replication. Note that this covers a range of 'model-types' including partial and general equilibrium; finite and infinite horizon, including overlapping generations; stationary equilibrium and transition paths; and agent entry and exit. And involves analysing a variety of model 'outputs' including time-series properties, cross-sectional distributions, aggregates, and panel-data.
Replication of these papers was performed using discretized value function iteration with simulations and agents distributions computed on discretized state space; these methods have known reliable convergence properties to the true solution under conditions that are applicable to a broad class of Macroeconomic models (Kirkby, 2017a(Kirkby, , 2019 as well as performing well on accuracy in comparisons with other methods (Aruoba et al., 2006;Santos, 2000;Peralta-Alva and Santos, 2014) as long as sufficiently large grids are used. 4 These discretized grid methods would be inappropriate for the solution of state-of-the-art models where a trade-off between speed-and-accuracy has to be made. For replication however the appropriate focus is on accuracy and robustness at the expense of speed. Discretized grid methods combine a high accuracy, as long as large grids are used, with known convergence properties and a robustness to a wide range of model properties. While it is impossible to know for certain that the solutions of the replications given here are the true solutions I am confident that the replicated solutions are accurate as the answers given are insensitive to the grid sizes used; this 'insensitivity' is given a precise meaning below. 5 Implementation of the replications makes use of the VFI Toolkit for Matlab (Kirkby, 2017b), which has the advantage that the outputs of most functions that make up the codes involved in the replications have been widely tested and hopefully therefore less likely to contain errors. 6 Table 2 shows for each replication the quartiles of the percentage difference between the replication and original results. 7 It is based on all the entries of all the Tables from each paper: the absolute percentage difference between the replication value and the value in the original paper was calculated for every table entry, and the quartiles of these are reported. The main weakness of this is that it obviously misses any Figures. To ensure that the replication results are not driven by numerical error the replications were required to pass the test that a 'substantial' 8 increase in the grid size results in a change in the upper quartile of the absolute 4 To be precise what matters is not having a large size of the grid, but having small spacings between grid points. 5 Reassurance may be taken from the replication of Imrohoroglu (1989): the original author Ayse Imrohoroglu has since run a second replication joint with Kanika Aggarwal and got the same results as the replication reported here (communication by email from Ayse Imrohoroglu). 6 Coding errors are a genuine concern. A recent replication by Bédécarrats et al. (2019) of Crepon et al. (2015), an empirical analysis of a field experiment, found numerous coding errors, and that analysis would likely have contained way less lines of code than most quantitative Macroeconomics papers. 7 Thanks to a referee for advising the addition of such a measure. percentage difference of less than 5% between the results of the two replications (grid and substantially increased grid); note that this is much stricter than it first sounds as, e.g., many papers contain numbers like 0.1, so if this changed to 0.11 with the substantially increased grid this would be a change of greater than 5%. As a result it is believed that the replication numbers are the accurate numbers however this cannot be known for certain as, e.g., a parameter that should be set to 2.4 could instead be set to 2.6 due to a typo. Comparison of the measure across papers should be taken as illustrative rather than definitive as papers that provide, e.g., greater breakdown of statistics across different subpopulations, will somewhat naturally be likely to display greater numerical error.
The only 'substantial' failure to replicate is the welfare results of early papers. This appears to be explained by the use of linear-quadratic methods, while we use non-linear methods to solve the models. For papers such as Imrohoroglu (1989) and Díaz-Giménez et al. (1992) the methods used solved the policy function with enough accuracy that their findings on model statistics related to policies and stationary distributions replicate fine. However those same methods led to highly inaccurate welfare evaluatations as the value functions were not accurately computed. This finding is not entirely novel, but it's importance is widely underappreciated. Kim and Kim (2007) show that 1st-order approximation methods deliver incorrect welfare results if even when using the correct (to 1st-order) optimal policies (although these can be largely avoided by putting the 1st-order solution into the unapproximated welfare function), while Judd et al. (2017) show further that 1st-order solution methods are simply incorrect for many Macroeconomic models, deriving minimum error bounds that are large enough to be troubling. I conjecture Intergen.=Intergenerational linkages (in 2-period OLG) that this problem, inaccurate welfare results, is likely widespread in early Quantitative Macroeconomics papers and recommend that any welfare result from pre-2000 should be treated as quantitatively suspect until replicated. The continued widespread use of linear-quadratic methods in Ramsey optimal policy where maximizing the welfare function is part of the computational exercise leaves some major open questions about the results of that literature until replication studies are undertaken in that area. Loosely related, first-order (and second-order) perturbation methods have also been shown to give incorrect solution to the Diamond- Note: For all the entries of all the Tables from each paper: the absolute percentage difference between the replication value and the value in the original paper is calculated, then the quartiles of these are calculated. When both the replication and original values are zero this is considered to be a zero absolute percentage difference. Values of parameters and other things that are 'impossible' not to perfectly replicate are omitted from the calculations. All figures are omitted from this measure. The top decile was calculated but it is heavily influenced by, e.g., errors where the replication value is 0.003 and original is 0.001, which seems misleading and so is not reported. There were some 'Nonzeros' for which the replication is non-zero and the original is zero, the number of non-zeros were: 2 in Hansen (1985), 1 in Hopenhayn and Rogerson (1993), 1 in Aiyagari (1994), 12 in Hubbard et al. (1994), 1 in Imrohoroglu et al. (1995), 1 in Conesa and Krueger (1999)  Readme file may also give rough guidance on runtimes (a few minutes/hours/weeks).
Parameters: In codes, parameters are stored in a data structure that can be exported as JSON.
Parameters: Where parameter names differ between paper and codes a 'dictionary' is provided.
Parameters: Include general equilibrium values, initial conditions, alternative calibrations.
Parameters Bonus: Include hyperparameters for numerical methods used.
Explicit formulae provided for all model statistics reported in paper.
When codes/functions are taken from previous projects, mention their source.
Bonus: Codes contain easy to understand comments and variable names.
Bonus: Codes make it clear which parts of code are generating which results in paper.
Bonus: In paper describe the numerical methods used, even just 'same as paper X'.
Moretensen-Pissarides model of search-labor markets (Petrosky-Nadeau and Zhang, 2017). 9 One topic that requires much greater discussion in Quantitative Macroeconomics papers is the discretization of shocks. 10 Many papers contain a substantial discussion of calibration and some robustness exercises to parameter values. The choice of discretization method by contrast rarely warrants more than a passing mention, often in a footnote, despite being vastly more important in most models than many parameters. In practice the discretization choices play a key role in determining income risk and the distributions of earnings and wealth. More subtle is the relationship between the exogenous shocks and market incompleteness. Note that in most incomplete market models the incompleteness arrives precisely because there are no assets with returns that span the space of idiosyncratic shocks. Hence when the idiosyncratic shocks are small the markets are largely complete, while when idiosyncratic shocks are large markets are very incomplete. The discretization of exogenous shocks, because it determines both the riskiness and range of the idiosyncratic shocks is therefore also determining the degree of market incompleteness that distinguishes heterogeneous agent models from representative agent models. Quantitative Macroeconomic papers would be much improved by treating these choices of shock discretization to the same level of discussion, analysis and sensitivity as any other modelling decision. As an example of their importance Guerrieri and Lorenzoni (2017) use the Tauchen method to discretize an AR(1) shock in a study of the credit crisis that followed the Great Financial Crisis of 2007. 11 Just changing the hyperparameter of the Tauchen method to other reasonable values can cause the zero-lower bound on interest rates to bind for decades, rather than the few years in the baseline model (and seen in reality).
Replication in Economics Controversy about replication has raged in Psychology where a project by the Open Science Collaboration to repeat one hundred influential studies was able to successfully replicate the original results in only around 40% of cases (Collaboration, 2015). 12 Closer to home for Economists have been controversies about the results of Reinhart and Rogoff (2010) on the relationship between government debt and economic growth, and Miguel and Kremer (2004) on the effects of deworming on education in Kenya. 13 Within the field of lab experiments in Economics Camerer et al. (2016) try to replicate 18 studies published in American Economic Review and Quarterly Journal of Economics during 2011-2014, and conclude that replication is successful in 60-80% of the papers (depending on exact metric of 'success'). In a related study Dreber et al. (2015) find that prediction markets in which people can bet on which replications will succeed and fail did well in the sense that when they predicted a replication would fail it did (when prediction markets predicted that the replication would succeed this was largely unrelated to outcome of replication); this suggests that informally the profession is aware of certain existing results that are unlikely to replicate. Ferraro and Shukla (2020) provide evidence that suggests empirical environmental economics suffers from many of these issues and suggest a variety of ways the profession might adapt and improve.
While replication is important it is not a panacea for all problems. 14 Even papers that were retracted due to known error continue to be cited; 20,000 articles listed as retracted by Replication Watch were still cited 85,000 times after retraction. Other loosely related issues include p-hacking and the bias of publication to only publish statistically signicant results (Brodeur et al., 2016(Brodeur et al., , 2020. The problems don't just lie with the studies themselves, newspapers rarely report on null-findings and rarely do follow-ups to reporting on results that fail to hold in reproduction studies (Dumas-Mallet et al., 2017). 15 Replications are also often potentially difficult, expensive, and time-consuming: a recent effort to replicate 50 papers studying cancer, with a budget in excess of $1.3 million, ended up replicating just 18. Certainly, the replications in the present paper consumed a lot of time.
We are not aware of any existing replication study in Quantitative Macroeconomics (beyond the two or three individual replications mentioned in the 13 Reinhart and Rogoff (2010) originally argued that there was a cut-off for Government debt of around 90%-of-GDP, below which there was little relationship with economic growth and above which there was a strong negative correlation; but the statistical significance of the specific 90%-of-GDP cut-off was shown to be due to Excel error (Herndon et al. 2013); the broader negative correlation holds, only the cutoff failed to replicate. Miguel and Kremer (2004) argued, based on a randomized controlled trial, that deworming of children in Kenya had large positive effects on school attendance and educational outcomes. Two studies, one a replication and another a re-analysis questioned some aspects of the results. At the end of the day the results of the original study appear to stand-up well (more: links, short video). 14 For example it will not detect data fraud, which while very rare does occur; e.g Evidence of Fraud in an Influential Field Experiment About Dishonesty describes the use of fraudulent data by Shu et al. (2012); note that the paper reports multiple experiment results and the only one which uses fraudulent data was performed by Ariely & Mazar. The original findings has already been rejected by a followup paper which lead to all authors of the original revising their views. Of course, it is possible the researchers simply received 'faulty' data from the company and did not notice; Facebook accidently gave researchers erroneous data via their Social Science One project which was spotted by one of the many research groups using it. 15 ''This year, a study looked at how newspapers reported on research that associated a risk factor with a disease, both lifestyle risks and biological risks. For initial studies, newspapers didn't report on any null findings, meaning those that had results without expected outcomes. They rarely reported null findings even when they were confirmed in subsequent work. Fewer than half of the ''significant'' findings reported on by newspapers were later backed by other studies and meta-analyses. Most concerning, while 234 articles reported on initial studies that were later shown to be questionable, only four articles followed up and covered the refutations. ''(source) introduction). The closest is Chang and Li (2015) who look at research transparency or 'the basic goal of computational reproducibility' (in the words of Miguel, 2021). They take a very different approach and rather than try to replicate the results of the papers, their interest is instead whether the original authors of the papers supply codes, and when they do whether these codes can simply be run to computationally reproduce the results of the papers. A similar approach is taken by Gertler et al. (2018) who find that in 203 papers from top Economics journals while many provide code only in 37% of cases did it actually run, and in only 14% of cases was there both raw data and the code that generates the papers results (tables and figures) from this data. These approaches are in line with the AEA (American Economic Association) Code and Data Policy, 16 although the interest of the AEA policy is about ensuring that a study is reproducible, rather than whether a study has been replicated. While subtle, the distinction is important as reproducible can be thought of as true even though the code or data-treatment contains errors and would fail to replicate; that original code runs and reproduces tables and figures in no way tests for the existence of errors in the code itself although it does make it much easier to detect and resolve them.
This current approach to replication in Quantitative Economics with it's focus on reproducibility obviously misses any issues of whether the original results were themselves correct, which is the main purpose of replication. While availability of code is important, reproducibility is not replication. Replication necessarily involves writing new code as simply running existing codes includes replicating all the errors made in the original when treating the data and writing the code. Availability of code is important because code often contains information unintentionally missing from a published paper. For example, papers simply forget to state some initial condition, or the weights used during calibration, or the formula for a certain moment, or parameter values of a counterfactual exercise, etc.
Zimmermann (2015) suggests the need for a Journal of Replication in Economics as a way to overcome the current status quo in which academics typically receive little to no recognition or reward for performing replications. The area of Applied Econometrics is ahead in this area with the Journal of Applied Econometrics having a Replication Section since 2003. An online effort by ReplicationWiki Höffler (2017), hosted by the University of Göttingen, aims to provide a clearinghouse for replications, on the assumption that people already perform replications and simply need some outlet for them. Nor can citations nor a large following literature be relied on as a substitute for replication: oestrogen receptor cycling in the field of breast cancer research was built on two papers each of which had more than 1000 citations over nearly 20 years, but has now been found to be completely incorrect with neither of the original papers being replicable (Holding, 2019). Christensen et al. (2019) is a recent book that describes many of these issues, problems, and possible solutions, but with a focus on purely empirical work based on regressions and randomized controlled-trails. 17 It provides a good guide for those interested in improving the reproducibility of their own work.
For Quantitative Macroeconomics researchers interested in trying to ensure that their own computational work is reproducible Sect. 3 presents a checklist, based on my experience with difficulties commonly encountered. This checklist is strictly intended as an aid for researchers, not as a requirement to be imposed. Naturally it will be incomplete but should help researchers who wish to make their work more transparent and reproducible avoid the oversights most common in the literature.
By making replication easier to perform it is hoped that issues such as robustness of model prediction and sensitivity to parameters and model specification will become easier to perform. The importance of developing computational modelling packages such as Dynare, EconARK, GDSGE, niqlow and VFI Toolkit should viewed as part of contributing to this. 18 The literature on empirical regressions has begun developing tools to address these issues of specification searching with a good overview provided by Chapter 7 of Christensen et al. (2019). 19 Quantitative Macroeconomics would also benefit from such an approach, and simple replication of existing results is a first step on the road to being able to solve models easily enough to make this possible.
The rest of this paper simply describes some general lessons learnt from the process of replicating these papers. Much of what follows might be misread as picking on certain authors/papers by calling out their minor errors. This is far from my intention, which is to understand where common errors are being made and how the profession might do better. The best defense of my intentions is that any author/paper which appears in this work was one I have chosen to spend a few days 17 Randomized controlled-trials (RCTs) provide a gold-standard, but not a silver bullet. One issue is whether randomization ends up truly random. An EconTalk podcast with James Heckman describes an RCT for a drug to treat AIDS. Participants randomly received the AIDS drug (treatment) or a placebo (control). Because at that time AIDS was a death-sentence the participants were so terrified that they met up outside the lab, put all their pills into a bowl, and then each took a handful containing a mixture of drug and placebo. The Doctors performing the trail were unaware that their randomization had failed. A second issue for many RCTs is lack of power to find effects due to small sample sizes. An example is documented for Microfinance initiatives by Dahal and Fiala (2019), who find that of all eight peerreviewed RCT publications not a single one has sufficient sample size to have enough power to find a statistically significant result, given the likely (as indicated by point estimates) size of such a result. Note, the issue is not just the 'raw' sample size but also the compliance or take-up rate (what fraction of those offered microfinance loans actually use them); the problems with the AIDS study could be viewed as their having zero net compliance rate (no actual treatment of the treated, relative to control) and hence no statistical power. 18 Important related efforts aim to develop the underlying libraries and tools, rather than direct modelling, such as QuantEcon. 19 Blinder and Watson (2016) provide the odd case of a paper the second-half of which sets out to show that out of a few tens of possible specifications only a few lead to a statistically significant result (in explaining the 'D-G gap'). Rather than concluding that the most of the few statistically significant variables are likely the result of specification searching across various regressions leading to spurious significance, they instead present it as a robustness exercise. This has now been gamified. Riffing on the article entitled Let's Take the Con Out of Econometrics (Leamer, 1983) one might conclude that they claim to have turned the con into a pro! This is my personal opinion and the reader should obviously treat it as such; both authors have plenty of other good papers and I am a big fan of the other work of Mark Watson in particular, especially on understanding long-run relationships between variables.
of my life in replicating as I thought it was sufficiently important in the development of Quantitative Macroeconomics. 20 After all, [replication] is the sincerest form of flattery!

Lessons Learned from Common Issues
Some of the main issues encountered during the replications provide lessons for best practice that Macroeconomists can learn from. However the one common pitfall from which there is nothing to be learned is that coding bugs do occur, this appears to have affected a small fraction of the numbers reported in the papers; as a friend expressed it, if you start with n bugs and squash one you are left with n bugs. The main issues and recommendations based on these are discussed. The recommendations are then summarised as a checklist in Sect. 3.
Issue: Graphing Probability Distributions I recommend that researchers plot cumulative density functions, rather than probability density functions. Probability density functions can mislead for two reasons: first, they obviously depend on the number of grid points used; second, they appear more sensitive to numerical error. Since many solution methods in quantitative economics involve discretizing shock processes this leads to very different looking probability density functions when the number of grid points used to discretize the shock changes; loosely, doubling the number of grid points would halve the probability mass at each point. 21 This issue is minimized but not entirely eliminated when using cumulative density functions.
One alternative approach is to parametrize the probability density -say as Chebyshev polynomials, or as a mixture of parametric probability distributions, etc.
-but this approach is likely limited if the interest is in, eg., inequality and the shares of Total Income held by the Top 1% as the parametrization will implicitly impose some assumptions on these shares. 22 Comparing a number of alternatives I concluded that when probability density functions are plotted the best performance comes from graphing kernel-smoothed density functions estimated from the discretized probability mass function. 23 Issue: Only baseline case parameters are provided Papers essentially always provide all the parameter values for their baseline calibrations (a few do not report the final value of things such as general equilibrium prices that would be of much use for replications when trying to understand where differences may be arising). 20 For many of the replications reported here I chose to spend a few days replicating, but actually ended up spending a few weeks and in some cases months. 21 Probability density functions can also be misleading in the sense that they are very sensitive to numerical error. For example Imrohoroglu (1989) graphs the probability density function, finding two spikes, and provides an interpretation of the intuition said to underlie the existence of these spikes. These spikes appear to have been numerical approximation error and disappear when the grid is made much finer. 22 In theory they needn't impose any strong assumptions on the shares as the order of the polynomials/ mixtures approaches infinity. But in practice the polynomials/mixtures are typically low-order, as otherwise most of the computational advantages to using them are lost. 23 The codes replicating Imrohoroglu (1989) contain a commented out section comparing a few alternatives. However a number of papers do not report all the parameter values for alternative calibrations, such as those used for 'policy experiments' or difference 'cases' (e.g., Hubbard et al., 1994). Such parameter values would be appropriate for inclusion in a technical computational appendix.
Issue: Naming variables Many papers use different names for variables in their papers and code, complicating reading the code for anyone else. Ideally this would not occur, but a more reasonable solution might be the provision of dictionaries anywhere this does occur.
Issue: Reporting parameter values Three main problems occur: First, the reported parameter is for a different time-period to the model (e.g., report the annual value, when model the period is two months). Second, reported standard deviation is for the stochastic process, but equations describe it as being for the innovations to that process. Third, parameters that vary over life-cycle are only reported as a Figure (so exact values are unavailable). 24 To be more precise about the second of these, many papers will, e.g., have an AR(1) process and describe r to be the standard deviation of the innovations, but then when reporting the calibrated variables instead report r as the standard deviation of the AR(1) process itself. My own suggestion is to use a notation that always specifically emphasises when, e.g, a standard deviation is that for innovations to the AR(1) process z call it r , and when for the AR(1) itself call it r z . This simply helps to differentiate between the two standard deviations which are otherwise often and easily mixed up by accident during writing.
Issue: Calibration Details Many papers will describe which moments were targeted by the calibration. But they will not provide details on how the calibration itself was implemented. While in earlier papers this was fine as most moments are targeted independently, more recent papers often jointly target a number of moments. This typically will mean they have implemented a single-objective optimization that assigns each target moment a weight (multi-objective optimization is also a possibility but based on informal conversations seems rarely used by Economists). These weights are not typically reported (eg.,  do not provide such detail). I suggest that papers should more often include a technical computational appendix which provides this kind of detail. Along the same lines the initial values from which such optimization takes place are almost never given. The availability of codes turned out to be important factor in mitigating this. For example Auerbach and Kotlikoff (1987) describe the calibrated values of their age-dependent parameter e, but do not explain that these are in fact the log values, and that one must take their exponential and then normalize them so that the age one value of e 1 is set to 1; Figure 5.2 made it clear that something was missing in the original description of the calibrated values of e and as their codes are available it was easy enough to find out what.
Issue: Availability of Codes In a few cases the original codes are available from the authors website. In most cases however one had to contact the author directly, and even then some authors no longer had codes (to be fair some of these papers are from the early 1990s). As an extreme example the codes for Aiyagari (1994) are unavailable online and the author is deceased. While there is an increasing requirement from journals to provide codes 25 the most obvious improvement would be an increased use of github to make codes publicly available; journals that already provide their own online code repositories are a perfectly satisfactory substitute/complement. This issue appears to already be well recognized in Economics and is therefore likely 'already solved' as it were. Current approaches typically have journals provide codes in downloadable zip files making the process much more onerous than if each Journal simply uploaded all codes to its own github repository or similar; this would make them all instantly searchable and easily accessed and read. The importance of making codes available is the clearest lesson from the replications reported in this paper. Where authors provided codes (often on email request) these were able to resolve many other problems that arose during replication for many of the reasons described elsewhere in this paper.
Issue: Parameter Robustness and Numerical Approximation Errors Many papers have a 'default' parametrization and have performed some kinds of tests to check that their numerical methods are performing well at minimizing numerical error. They then look at how changing parameters would change certain model outputs. Often these tests will, eg., induce further curvature into certain parts of the solution and this interacts with the numerical methods to worsen their performance. For example Aiyagari (1994) reports the degree of precautionary savings (eg., as the resultant interest rate) for various parametrizations. While the results relating to low-risk and low-risk-aversion are numerically accurate, those relating to high-risk and high-risk-aversion contain substantial numerical error.
The results of tests for the magnitude of numerical errors, such as Euler Equation residuals (Santos, 2000), are sensitive to the parameter values. This fact is known to be the case from the theory underlying such tests but the issue is often ignored in practice. 26 One possibility would be that when measures of numerical accuracy are presented they should be reported across the range of parameter values that are made use of in the model. An alternative might be for the profession to move more towards the use of adaptive numerical methods, such as those in Brumm and Scheidegger (2017), which assess approximation errors and then update based on them as part of the solution method itself. Both of these suggestions are rather onerous so for the present simply having researchers more aware of this issue might be the best approach. 25 For example, providing codes is now required by all the 'Top 5' Economics journals. 26 A more serious issue is that for Euler equation residuals, and many other numerical error statistics, the errors in discretizing exogenous shocks -by numerical quadrature such as Tauchen & Rouwenhorst methods-are taken as given. For example when calculating the euler equation residuals for a standard representative agent model in the consumption euler equation there is an expectation term that is evaluated using the discretized exogenous shocks, and the euler equation residuals not only ignore that this expectation may involve substantial numerical error, but in fact penalise policies that would be true under the correctly evaluated expectation but appear incorrect with the approximated expectation. This bias toward using few points for discretizing exogenous shocks and instead putting them in the endogenous states so as to minimize euler equation residuals seems a leading contender to explain why many papers use so few points to discretize exogenous shocks.
Issue: Welfare Evaluations Some of the replicated papers used linear-quadratic methods (Díaz-Gímenez, 2001) to solve the value function problem. Replication of these papers often showed high accuracy in variables that depend on the stationary distribution and policy function. However the welfare calculations appear to contain substantial numerical error. It is suspected, but not known, that this reflects that linear-quadratic methods perform fine for computing policy functions but provide a poor approximation of the actual value function itself. Since welfare calculations are based on the value function itself they were therefore erroneous. This illustrates how numerical errors in different aspects of the model can be very different. It is common practice to report the results of tests for the magnitude of numerical errors, such as Euler Equation residuals, which look at the policy function. It is important to understand the conditions under which these also imply limited numerical errors elsewhere in the model (Santos, 2000;Fernandez-Villaverde et al., 2006;Kirkby, 2019). In the current instance of the errors in the value function and linear-quadratic methods the theory relating the value function and Euler equation residuals (Santos, 2000) does not apply. 27 Issue: Formulae for model statistics Typically, when reporting model statistics papers provide a verbal description of how they are calculated, but rarely include an explicit equation. This lead to some difficulties in replication. For example, in Díaz-Giménez et al. (1992) most statistics could be replicated exactly, but a few table entries could not, it seems likely this is simply because I was unable to turn the verbal descriptions into the precise formula. Another example: Restuccia and Urrutia (2004) calculate 'cross-sectional disparity' as the standard deviation of log earnings, but what is unclear from the written description is that in this two-period OLG model the 'cross-section' is computed conditional on age being 2, not across the whole model economy. One solution would be to put more formulas in Technical appendices, however this seems overly onerous given that the same issue can largely be solved by improved availability of codes. 27 It does not apply for two reasons: first the linear-quadratic methods themselves, secondly as there are periodically-binding constraints. This second reason is worth emphasising as it applies to almost all heterogeneous agent incomplete markets models: since we do not know where the periodically binding constraint actually binds the Euler equation residuals are not a formally valid measure of numerical error. Li (2015) explains this problem in detail and derives numerical bounds for Euler equations with periodically-binding constraints but they turn out to be insufficiently tight to be practically useful. To explain, consider the following standard Euler equation in a representative agent model c t ¼ E t ½ð1 þ r tþ1 Þc tþ1 . Euler equation residuals are defined as: EER jÊ t ½ð1 þr tþ1 Þĉ tþ1 Àĉ t j, where hat signifies the 'computed' evaluation. Theory underlying the interpretation of Euler equation residuals as a measure of numerical error (Santos, 2000) relies on the equality (i.e., that the policy solutions are interior values of the choice set). By contrast, consider the Euler equation in a standard heterogeneous agent incomplete market model, c t E t ½ð1 þ r tþ1 Þc tþ1 , where we now have an inequality. When borrowing constraints bind this will be strictly less than (\), at which point the euler equation residuals should be non-zero, so they are not a measure of numerical error. Theory tells us that the constraint will be equal and strictly unequal for different parts of the state space, e.g., in many Bewley-Huggett-Aiyagari models borrowing constraints are known to be binding for low asset/income agents. The problem is we don't know from theory where the constraint should bind. As mentioned, in principle it is possible to derive a corrected version of euler equation residuals for this, but in practice the resulting numerical error bounds are too loose to be useful (Li, 2015). This does not mean euler equation residuals are completely misleading about model accuracy, as long as the computed solution gets the equality/strict-inequality region correct they will be a good guide, but they are not a formal measure of numerical error.

Influence of Numerical Methods on Economics
The need for greater discussion in Quantitative Macroeconomics papers of the discretization of shocks -on par with the usual discussion of parameter choices and the sensitivity of results-stems from the large influence these have in many models on driving both modelling choices and quantitative results.
Why is exogenous shock discretization so important in incomplete markets models? Because the exogenous shocks and the degree of market incompleteness are essentially the same thing. In most models if the idiosyncratic shocks were zero, then markets would be complete. It is precisely the idiosyncratic shocks that make markets incomplete, and hence the discretization of these exogenous shocks is indirectly determining the degree of market incompleteness and driving the differences of the models from standard representative agent models.
The main discretization methods are quadrature methods for AR(1) shock processes with normally distributed innovations, namely the Tauchen and Rouwenhorst methods (Tauchen, 1986;Rouwenhorst, 1995). 28 Both perform acceptably in most situations as long as sufficient grid-points are used although the later is to be preferred when shocks are highly persistent. 29 The more recent Farmer-Toda method outperforms both of these, except for Rouwenhorst for very highly persistent shocks (Farmer and Toda, 2017). When any of these methods are used both grid-size and any hyperparameters need to be reported, and more importantly some sensitivity/robustness analysis to these choices should be performed. The most common 'error' in the literature is simply to choose 'too few' grid-points and ignore the large quantitative impact of this in driving results. The same is true for finitehorizon models with AR(1) shock processes with normally distributed innovations where the parameters are age-dependent: the natural extension of the Rouwenhorst method performs best, and the natural extension of the Tauchen method is transparent (Fella et al., 2019). The main point here though is not so much which method is used, but that these choices need to be discussed in the papers at least as much as any other calibration choice; they only become irrelevant with grid-sizes of a magnitude never seen in practice.
The focus of all of these common discretization processes on normallydistributed shocks also seems misguided. Given that discrete Markov processes will be used to compute the models, why run the data through the straight-jacket of an AR(1) process before it reaches the model? Why not go more directly from data to discrete Markov process? This approach allows much more general and realistic shock processes to be used, and is likely to be especially important in any attempts 28 Variations of these exist (Tauchen and Hussey 1991;Adda and Cooper 2003;Floden 2008) but I recommend against their use. The Tauchen-Hussey method (Tauchen and Hussey, 1991) in particular should no longer be used. It's poor performance is well documented and the existing alternatives are just as easy to implement. An indication of how widespread this method is comes from it's inclusion as a central part of the textbook and toolkit of Miranda and Fackler (2002), and it's inclusion in QuantEcon as a standard numerical quadrature method (the algorithm is often coded as the use of a function called qnwnorm). 29 The superior performance of the Rouwenhorst method for highly persistent shocks is documented for stochastic Real Business Cycle by Kopecky and Suen (2010) and the Diamond-Mortensen-Pissarides search model by Petrosky-Nadeau and Zhang (2017) to model income risks, rare disasters (and more broadly the impacts of climate change), and asset prices. Several methods to do this already exist and the literature would be improved by their more widespread adoption; again, alongside more discussion in papers of these discretization choices and their impact on results. Some existing approaches include the quadrature method of Farmer and Toda (2017) which allows more non-parametric approximations such as an AR(1) with gaussian-mixture shocks, the approach of  who simply calibrate a four-state Markov directly, Toda (2020) which works directly from the raw data, and the use of histograms to create 'bins' and then simply 'count-andnormalize' transitions to implement the maximum-likelihood estimator of a finitestate markov; Kirkby (2017b) explains this in detail for model of Hansen (1985).
Beyond just the choice of discretizing shock processes, the reporting of various choices of numerical methods and hyperparameters would ideally also be more widely discussed in papers. But given the onerous nature of trying to test for sensitivity/robustness of these choices this is probably best left to replication studies using different methods.
One article I would have liked to replicate but did not is Kydland and Prescott (1982). The reason is itself an interesting example of the important role played by the choice of numerical methods, especially those that involve large amounts of approximation. The model of Kydland and Prescott (1982) contains a sixdimensional state variable, making it prohibitively complicated for the discretized value function iteration methods I use in these replications. The model can however be easily solved using the linear-quadratic value function iteration methods used by Kydland and Prescott (1982), which involves solving for six coefficients, rather than a full six-dimensional object. This is because using linear-quadratic value function iteration methods means that the full distribution of the shocks does not matter for evaluating expectations of next periods value function, only their conditional mean.
The issue of the use of linear and log-linear, and first-and second-order perturbation in welfare evaluations has already been described in the Introduction. The results of Judd et al. (2017) showing that the minimum error bounds on linear, log-linear, and first-order approximations are large enough to be problematic for most Economic models should dissuade Economists from using them in any application. This is especially true thanks to the implementation of second-order and higher methods in many available codebases (including Dynare). Users should also be aware that first-order methods imply only the conditional mean matters for expectations, and that with second-order only the conditional mean and conditional second moment matter; this means they are, e.g., simply unusable for any study of the impact of rare events/disasters or conditional changes in volatility. Wherever possible Economists should be making greater use of global non-linear solution methods. 30 One potential issue we have not addressed is the uniqueness of general equilibrium. This is not thought to be a problem for any of these papers, but nor is it known for most of these that it is not a problem. Aiyagari (1994) provides a good illustration: (i) We know that the equilibrium is unique if the CES parameter (r in c 1Àr 1Àr ) is less than one (Light, 2020), mathematically interesting, but not an economically relevant calibration, (ii) We know that there is more than one equilibrium for some calibrations (Acikgoz, 2018), again, mathematically interesting, but not an economically relevant calibration (the calibration that displays multiple equilibria has a capital-output ratio of 50, empirically realistic range is 3-8). (iii) We know 'almost certainly' that for standard calibrations Aiyagari (1994) has a unique general equilibrium, an algorithm proven to converge to any and all equilibria with certainty, but then stopped after finite time, finds just one equilibrium, hence almost certainly (Kirkby, 2019). So while multiple equilibria are a theoretical concern they do not appear likely to be of concern in the present replications and none of the replication results suggest that there was a concern. Note that this issue is even stronger for general equilibrium transition paths where not only is uniqueness not known but nor is existence theoretically established. Table 3 is provided to act as a simple checklist that researchers interested in ensuring reproducibility of their work can use to avoid common omissions. The table is not intended to be comprehensive, but is intended to make it easier to avoid omissions that are common in the existing literature.

Conclusions
We end simply with an inculcation to the importance of reproducibility of results in Economic Science, and in Science more generally: ''Non-reproducible single occurrences are of no significance to science.' ' -Popper, Karl (1934, The Logic of Scientific Discovery) interpretation, please consult the papers themselves. The pdf 'appendix' for each of the papers can be found at:https://github.com/vfitoolkit/vfitoolkit-matlab-replication There is a folder named after each paper, and inside those folders there is a pdf with the name of the paper.
Codes which perform these replications, creating all the Tables and Figures from scratch can be found at (the same place): https://github.com/vfitoolkit/vfitoolkitmatlab-replication If you want to run the codes simply run the one called 'AuthorYear.m', which will automatically call all of the other files. You will need to create a folder 'SavedOutput' and two subfolders of this called '/SavedOutput/LatexInput' and './ SavedOutput/Graphs/' in which to save all the tables and figures respectively. The tables are all output as standard latex tables.
These codes were all implemented in Matlab, and for purposes of this paper were run in Matlab (versions between 2018a and 2021a) using the VFI Toolkit (vfitoolkit.com). They were run on a variety of computers all running Linux (Kubuntu is the best distro ;), with NVIDIA gpus (with 2gb to 40gb GDDR ram) and from two to twenty CPU cores and with memory of 16gb to 120gb.
The replication codes were written with robustness, transparency and ease to follow what is being done in mind, and with little to no concern for run-time (many unnecessary objects are computed). Most therefore take from a few days to two weeks to run. They often compute over 100 times as many combinations of calibrations as necessary.
To ensure accuracy the grid sizes were increased until a 'substantial' change in the grid size resulted in a 'very tiny' change in results. To be precise, the grid sizes between which the upper quartile of the absolute percentage differences for all numbers presented in tables was less than 0.05 (5%) were: The ordering of the grid sizes is always the same as in the codes and reflects the ordering of the concepts of decision variables, endogenous states, exogenous states that underlies the algorithms used by the VFI Toolkit. Note that in many models for some variables, in particular the exogenous variables, the grid size is not something that can be increased without changing the interpretation; e.g., two exogenous states that represent employment and unemployment. Some papers use different grids for baseline models and alternative models, the above reports the baseline model grid sizes. This criterion of an upper quartile difference of less than 5% is more demanding than it first sounds as, e.g., many papers have many numbers like 0.1 for which a change to 0.11 is a change of more than 5%.
Funding Open Access funding enabled and organized by CAUL and its Member Institutions.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creativecommons.org/licenses/by/4.0/.