Bayesian Estimation of Economic Simulation Models using Neural Networks

Recent advances in computing power and the potential to make more realistic assumptions due to increased flexibility have led to the increased prevalence of simulation models in economics. While models of this class, and particularly agent-based models, are able to replicate a number of empirically-observed stylised facts not easily recovered by more traditional alternatives, such models remain notoriously difficult to estimate due to their lack of tractable likelihood functions. While the estimation literature continues to grow, existing attempts have approached the problem primarily from a frequentist perspective, with the Bayesian estimation literature remaining comparatively less developed. For this reason, we introduce a Bayesian estimation protocol that makes use of deep neural networks to construct an approximation to the likelihood, which we then benchmark against a prominent alternative from the existing literature. Overall, we find that our proposed methodology consistently results in more accurate estimates in a variety of settings, including the estimation of financial heterogeneous agent models and the identification of changes in dynamics occurring in models incorporating structural breaks.

. Ultimately, the Great Recession of the late 2000s and the perceived failings of traditional approaches, particularly those built on general equilibrium theory, would lead to the birth of a growing community arguing that the adoption of new paradigms harnessing contemporary advances in computing power could lead to richer and more robust insights (Farmer and Foley 2009;. Perhaps the most prominent examples of this new wave of computational approaches are agent-based models (ABMs), which attempt to model systems by directly simulating the actions of and interactions between their microconstituents (Macal and North 2010). In theory, the flexibility offered by simulation should allow for more empirically-motivated assumptions and this, in turn, should result in a more principled approach to the modelling of the economy (Chen 2003;LeBaron 2006). The extent to which this has been achieved in practice, however, remains open for debate (Hamill and Gilbert 2016). While ABMs initially found success by demonstrating an ability to replicate a wide array of stylised facts not recovered by more traditional approaches (LeBaron 2006;Barde 2016), their simulation-based nature makes their estimation nontrivial . Therefore, while the last decade has seen the emergence of increasingly larger and more realistic macroeconomic models, such as the Eurace (Cincotti et al. 2010) and Schumpeter Meeting Keynes (Dosi et al. 2010) models, their acceptance in mainstream policy-making circles remains limited due to these and other challenges.
The aforementioned estimation difficulties largely stem from the simulation-based nature of ABMs, which, in all but a few exceptional cases 2 , renders it impossible to obtain a tractable expression for the likelihood function. As a result, most existing approaches have attempted to circumvent these difficulties by directly comparing model-simulated and empirically-measured data using measures of dissimilarity (or similarity) and searching the parameter space for appropriate values that minimise (or maximise) these metrics (Grazzini et al. 2017;Lux 2018). The most pervasive of these approaches, which Grazzini and Richiardi (2015) call simulated minimum distance (SMD) methods, is the method of simulated moments (MSM), which constructs an objective function by considering weighted sums of the squared errors between simulated and empirically-measured moments (or summary statistics).
Though MSM has been widely applied in a number of different contexts 3 and has desirable mathematical properties 4 , it suffers from a critical weakness. In more detail, the choice of moments or summary statistics is entirely arbitrary and the quality of the associated parameter estimates depends critically on selecting a sufficiently comprehensive set of moments, which has proven to be nontrivial in practice. In response, recent years have seen the development of a new generation of SMD methods that largely eliminate the need to transform data into a set of summary statistics and instead harness its full informational content (Grazzini et al. 2017).
These new methodologies vary substantially in their sophistication and theoretical underpinnings. Among the simplest of these approaches is attempting to match time series trajectories directly, as suggested by Recchioni et al. (2015). More sophisticated alternatives include information-theoretic approaches (Barde 2017;Lamperti 2017), simulated maximum likelihood estimation (Kukacka and Barunik 2017), and comparing the causal mechanisms underlying real and simulated data through the use of SVAR regressions . In addition to the development of similarity metrics, attempts have also been made to reduce the large computational burden imposed by SMD methods by replacing the costly model simulation process with computationally efficient surrogates (Salle and Yildizoglu 2014;Lamperti et al. 2018).
Interestingly, the aforementioned approaches are all frequentist in nature, with Bayesian estimation being significantly less prevalent 5 . As it currently stands, only one study in the literature (Grazzini et al. 2017) has focused extensively on the use of Bayesian techniques and recent work by Lux (2018) involving sequential Monte Carlo methods includes attempts at Bayesian estimation, though the work as a whole focuses more on a frequentist approach.
While the estimation literature has certainly been growing, it still suffers from a number of key weaknesses. Perhaps the most significant of these is a lack of a standard benchmark against which to compare the performance of new methods. For this reason, most new approaches have traditionally only been tested in isolation and comparative exercises have been relatively rare. For this reason, we compared a number of prominent estimation techniques in a previous investigation (Platt 2019) and found, rather surprisingly, that the Bayesian estimation procedure proposed by Grazzini et al. (2017) consistently outperformed a number of prominent frequentist alternatives in a series of head-to-head tests, despite its relative simplicity. We therefore argued that more interest in Bayesian methods is warranted and suggested that increased emphasis should be placed on their development.
In line with this recommendation, we introduce a method for the Bayesian estimation of economic simulation models 6 that relaxes a number of the assumptions made by the approach of Grazzini et al. (2017) through the use of a neural networkbased likelihood approximation. We then benchmark our proposed methodology through a series of computational experiments and finally conclude with discussions related to practical considerations, such as the setting of the method's hyperparameters and the associated computational costs.
In this section, we introduce the reader to a number of the essential elements of our investigation, including a brief discussion of the fundamentals of Bayesian estimation, a description of the approach of Grazzini et al. (2017) (our chosen benchmark), and an introduction to our proposed estimation methodology. .

Bayesian Estimation of Simulation Models
For our purposes, we consider a simulation model to be any mathematical or algorithmic representation of a real world system capable of producing time series (panel) data of the form where θ is a model parameter set in the space of feasible parameter values, T is the length of the simulation, i represents the seed used to initialise the model's random number generators, and x sim t,i (θ) ∈ R n for all t = 1, 2, . . . , T. In general, estimation or calibration procedures aim to determine appropriate values for θ such that X sim (θ, T, i) produces dynamics that are as close as possible to those observed in an empirically-measured equivalent, 5 There is a rather substantial literature on what are called approximate bayesian computation methods that has gained a significant following in biology and ecology (Sisson et al. 2018). Unfortunately, the vast majority of these methods rely on converting data to a set of summary statistics and their appeal for estimating economic ABMs is therefore limited. 6 It is worth noting that while we focus on ABMs, the proposed methodology is applicable to any model capable of simulating time series or panel data. For this reason, the methodology would be equally applicable to competing modelling approaches.
Bayesian estimation attempts to achieve the above by first assuming that the parameter values follow a given distribution, p(θ), which is chosen to reflect one's prior knowledge or beliefs regarding the parameter values. This is then updated in light of empirically-measured data, yielding a modified distribution, p(θ|X), called the posterior. Bayesian estimation can therefore be framed in terms of Bayes' theorem as follows: Unfortunately, obtaining an analytical expression for the posterior is typically not feasible. Firstly, the normalisation constant, p(X), is unknown or determining it is nontrivial. Secondly, the likelihood, p(X|θ), is intractable for most simulation models, particularly large-scale macroeconomic ABMs. Nevertheless, these limitations can be overcome to some extent. Grazzini et al. (2017) provide a method for approximating p(X|θ) for a particular value of θ, which then allows us to evaluate the right-hand side of p(θ|X) ∝ p(X|θ)p(θ).
The above may then be used along with Markov chain Monte Carlo (MCMC) methods, such as the Metropolis-Hastings algorithm, to sample the posterior. This is possible since most MCMC techniques only require that we are able to determine the value of a function proportional to the density function of interest rather than the density function itself. It should be apparent, however, that the overall estimation error will depend critically on the method used to approximate the likelihood.
. The Approach of Grazzini et al. (2017) As previously stated, Grazzini et al. (2017) provide a method to approximate the likelihood for simulation models, which we now discuss in more detail. In essence, the approach is based on the assumption that, for all t ≥T, we reach a statistical equilibrium such that x sim t,i (θ) fluctuates around a stationary level, E[x sim t,i (θ)|t ≥T], which allows us to further assume that x sim T,i (θ), x sim T+1 (θ), . . . , x sim T,i (θ) constitutes a random sample from a given distribution 7 . It is then possible to determine a density function that describes this distribution, which we denote byf (x|θ), using kernel density estimation (KDE), finally allowing us to approximate the likelihood of the empirically-sampled data 8 for a given value of θ as follows: It should be apparent that the above results in a simple strategy that is easy to apply in most contexts. It must be noted, however, that this is largely made possible through strong assumptions that seldom hold in practice. In more detail, notice that ordered time series (panel) data is essentially being treated as an i.i.d. random sample, implying that x sim t,i (θ) ⊥ x sim 1,i (θ), . . . , x sim t−1,i (θ) for all t = 2, 3, . . . , T. Unfortunately, such independence assumptions do hold for most simulation models, since x sim t,i (θ) is likely be dependent on at least some of the previously realised values, whether this dependence is explicit or mediated through latent variables. Additionally, such assumptions result in a likelihood function that makes no distinction between θ values that result in identical unconditional distributions but differing 7 The samples need not all be drawn from a single Monte Carlo replication and may instead be drawn from the statistical equilibria reached by each replication in an ensemble generated using various random seeds. In practice, we simulate an ensemble of R such Monte Carlo replications for each candidate set of θ values and combine the samples from each replication into a single random sample. 8 Note that we have assumed, as in the case of the simulated data, that the empirically-sampled data fluctuates around a stationary level. temporal trends. Since many economic simulation models and particularly largescale macroeconomic ABMs produce datasets that are characterised by seasonality or structural breaks, there is likely to be some impact on the quality of the resultant parameter estimates. Nevertheless, Platt (2019) demonstrates that despite the above shortcomings, the method of Grazzini et al. (2017) is able to provide reasonable parameter estimates in many contexts, while also outperforming several more sophisticated frequentist approaches. This warrants further investigation and naturally leads one to ask whether relaxing the required independence assumptions would allow for the construction of a superior Bayesian estimation method. .

Likelihood Approximation using Neural Networks
We now begin our discussion of a relatively simple extension to the likelihood approximation procedure proposed by Grazzini et al. (2017) that is capable of capturing some of the dependence of x sim t,i (θ) on past realised values. As a starting point, we assume that for all L < t ≤ T, implying that x sim t,i (θ) depends only on the past L realised values. Our task, therefore, is the estimation of the above conditional densities, for all L < t ≤ T, where φ = φ(θ) are parameters associated with the density estimation procedure. In our context, we make use of a mixture density network (MDN), a neural network-based approach to conditional density estimation introduced by Bishop (1994). The aforementioned scheme consists of two primary components 9 , a mixture of K Gaussian random variables, where we denote x sim t,i by y and x sim t−L,i , . . . , x sim t−1,i by x, and functions α k , µ k and Σ k of x which determine the mixture parameters. Here, α k , µ k and Σ k are the outputs of a feedforward neural network taking x as input and having weights and biases φ(θ), which are determined by training the network on an ensemble of R Monte Carlo replications simulated by the candidate model for parameter set θ. Using the trained MDN, it is then possible to approximate the likelihood of the empiricallysampled data for a given value of θ as follows: While alternative density estimation procedures could potentially have been employed, our consideration of MDNs is motivated primarily by their desirable properties. Specifically, MDNs are, in theory, capable of approximating fairly complex conditional distributions. This follows directly from the fact that mixtures of normal random variables are universal density approximators for sufficiently large K (Scott 2015) and the fact that neural networks are universal function approximators (Hornik et al. 1989), provided they are sufficiently expressive. Therefore, provided that K is sufficiently large and the constructed neural network sufficiently deep (and wide), the above methodology should result in accurate conditional density estimates. .

Method Comparison and Benchmarking
Given that we have now described our proposed estimation methodology, we proceed to discuss our strategy for benchmarking it against the approach of Grazzini et al. (2017), where we follow a similar strategy to that employed in Platt (2019).
We begin by letting X sim (θ, T, i) be the output of a candidate model, M. Since empirically-observed data is nothing more than a single realisation of the true datagenerating process, which may itself be viewed as a model with its own set of parameters, it follows that we may consider X = X sim (θ true , T emp , i * ) as a proxy for real data to which M may be calibrated.
In this case, we are essentially estimating a perfectly-specified model using data for which the true parameter values, θ true , are known. It can be argued that a good estimation method would, in this idealised setting, be able to recover these true values to some extent and that methods which produce estimates closer to θ true would be considered superior. This leads us to define the following loss function whereθ is the parameter estimate (posterior mean) produced by a given Bayesian estimation method. In practice, it is important that bothθ and θ true are normalised to take values in the interval [0, 1] before the loss function value is calculated. This is because even relatively small estimation errors associated with parameters that typically take on larger values will increase the loss function value substantially more than relatively large estimation errors associated with parameters that typically take on smaller values if no normalisation is performed. Therefore, for each free parameter, with an analogous transformation being applied to θ true j . The above allows us to develop a series of benchmarking exercises in which we compare the loss function values associated with our proposed method and that of Grazzini et al. (2017) for a number of different models, free parameter sets, and θ true values 10 . In all of these comparative exercises, we aim to ensure that the overall conditions of the experiments are consistent throughout, regardless of the method used to approximate the likelihood. Therefore, in all cases, we set the length of the proxy for real data to be T emp = 1000, the number of Monte Carlo replications in the simulated ensembles to be R = 100, the length of each series in the simulated ensembles to be T sim = 1000, and the priors for all free parameters to be uniform over the explored parameter ranges. Additionally, we have also used the same lag length, L = 3, for all estimation attempts involving our neural network-based method. While seemingly arbitrary, this choice has very clear motivations that are discussed in detail in Section 5.1.
Finally, the MCMC algorithm used to sample the posterior and its associated hyperparameters remain unchanged in all experiments. Rather than using a standard random walk Metropolis-Hastings algorithm, we have instead employed the adaptive scheme proposed by Griffin and Walker (2013), which allows for more effective initialisation, faster convergence, and better handling of multimodal posteriors 11 .
With our estimation and benchmarking strategies now described, we introduce the candidate models that we attempt to estimate. Their selection is primarily jus-tified by their ubiquity; each has appeared in a number of calibration and estimation studies 12 , leading them to become standard test cases in the field. While computationally-inexpensive to simulate, most are capable of producing nuanced dynamics and thus still prove to be a reasonable challenge for most contemporary estimation approaches. Since our focus here is the benchmarking of the proposed estimation procedure as opposed to estimating the candidate models using empirical data, our discussion will be relatively brief. In empirical investigations, however, it would be necessary to provide some justification that the chosen models were reasonable representations of the considered systems.
. Brock and Hommes (1998) Model The first model we introduce, and by far the most popular in the existing literature, is the Brock and Hommes (1998) model, an early example of a class of simulation models that attempt to model the trading of assets on an artificial stock market by simulating the interactions of heterogenous traders that follow various trading strategies.
We focus on a particular version of the model that can be expressed as a system of coupled equations 13 , where y t is the asset price at time t (in deviations from the fundamental value p * t ), n h,t is the fraction of trader agents following strategy h ∈ {1, 2, . . . , H} at time t, and R = 1 + r.
Each strategy, h, has an associated trend following component, g h , and bias, b h , both of which are real-valued parameters. The model also includes positive-valued parameters that affect all trader agents, regardless of the strategy they are currently employing, specifically β, which controls the rate at which agents switch between various strategies, and the prevailing market interest rate, r.
Finally, assuming an i.i.d. dividend process, the fundamental value p * t = p * is constant, allowing us to obtain the asset price at time t, .

Random Walks with Structural Breaks
The second model we consider is a random walk capable of replicating simple structural breaks, defined according to Unlike the Brock and Hommes (1998) model, the above is not a representation of a real-world system, but rather an artificially-constructed test example designed to challenge estimation methodologies 14 . Its inclusion is justified on the grounds that, as previously discussed, many large-scale ABMs produce dynamics that are characterised by structural breaks and the fact that it allows us to compare our approach against that of Grazzini et al. (2017) in cases where the considered data demonstrates clear temporal changes in the prevailing dynamics. .

Franke and Westerhoff (2012) Model
The final model we discuss shares a number of conceptual similarities with the previously described Brock and Hommes (1998) model, being a heterogeneous agent model that simulates the interactions of traders following a number of trading strategies. It is, however, different in a number of key areas, particularly in how the probability of an agent switching from one strategy to another is determined and in its incorporation of only two trader types, chartists and fundamentalists. As in the case of the Brock and Hommes (1998) model, the core elements of the model can be expressed as a system of coupled equations d f referred to as wealth and predisposition (WP). As a final remark, we consider r t = p t − p t−1 , the log return process, rather than p t in our estimation attempts.
. Brock and Hommes (1998) Model We now proceed with the presentation of the results of our comparative experiments, beginning with the Brock and Hommes (1998) model 16 .
In these experiments, we consider a market with H = 4 trading strategies and focus on estimating g 2 , b 2 , g 3 , and b 3 , the trend following and bias components for two of these strategies. For the first free parameter set, we consider g 2 , b 2 ∈ [−1, 0] and g 3 , b 3 ∈ [0, 1], corresponding to a contrarian strategy with a negative bias and a trend following strategy with a positive bias respectively. For the second free parameter set, we instead consider g 2 , b 2 , g 3 ∈ [0, 1] and b 3 ∈ [−1, 0], corresponding to trend following strategies with positive and negatives biases respectively.
Referring to Figure 1, we observe that, for the first free parameter set, there is a pronounced difference in performance between our proposed methodology and that of Grazzini et al. (2017). While both approaches perform similarly when estimating the bias components, our proposed procedure results in marginal posteriors for g 2 and g 3 that not only have means noticeably closer to the true parameter values, but are also significantly narrower and more peaked, with their density concentrated in a smaller region of the parameter space. This can be seen as indicative of reduced estimation uncertainty.  Table 1 elaborates on these findings and reveals that similar behaviours also emerge in the case of the second free parameter set. Specifically, we find that the posterior means (µ posterior ) for both methods result in more or less equivalent estimates for b 2 and b 3 , while the posterior mean for our proposed method appears to result in noticeably superior estimates for g 2 and g 3 in both cases, ultimately leading to lower loss function values. We also observe that our approach results in reduced posterior standard deviations (σ posterior ) consistently for all free parameters, in line with our observation of reduced estimation uncertainty in Figure 1.
In Appendix B, where we describe the method used to sample the posteriors, we indicate that we run the procedure multiple times with different initial conditions and combine the obtained samples into a single, larger sample from which we estimate µ posterior and σ posterior . We can, however, estimate the posterior mean for each of these runs individually and determine the standard deviation of µ posterior across the instantiations of the algorithm, which we call σ sample . As shown in Table  1, this standard deviation is generally very small for both methods, suggesting that the posterior mean estimates are generally robust 17 . .

Random Walks with Structural Breaks
Moving on from the Brock and Hommes (1998) model, we now discuss the estimation of a random walk incorporating a structural break. In these experiments, we consider a fixed structural break location, τ = 700 18 , and determine the extent to which both methods are capable of estimating the pre-and post-break drift, d 1 , d 2 ∈ [0, 1], and volatility, σ 1 , σ 2 ∈ [0, 10], for differing underlying changes in the dynamics. While the loss function described in Section 2.4 will still be used as our primary metric, we note that since the considered free parameters directly define the dynamics that characterise the different regimes of the data, it would also be worthwhile to assess the extent to which the competing approaches are able to correctly identify the relationships between the parameters and hence the shift in the pre-and post-break dynamics (∆ d and ∆ σ ). Before proceeding, however, there are a number of nuances that should be highlighted. Being a random walk, the model clearly produces non-stationary time series and therefore violates a key assumption of the method of Grazzini et al. (2017). For this reason, it is necessary to consider the series of first differences, x t − x t−1 , rather than x t itself. While our approach does not make stationarity assumptions, we have none the less considered the series of first differences when applying both methods to make the comparison as fair as possible. It should also be noted that we have assumed the location of the structural break to be unknown or difficult to determine a-priori (as is the case in most practical problems), meaning that we apply both estimation approaches to the full time series data to estimate both the pre-and post-break parameters simultaneously. If, however, the location of the structural break was known, it would be possible to estimate the relevant parameters separately using appropriate subsets of the data, a less challenging undertaking that we do not consider here. Now, referring to Table 2, we see that both our proposed estimation methodology and that of Grazzini et al. (2017) perform similarly well when attempting to estimate the pre-and post-break volatility, with both producing reasonable estimates for the free parameters and both being able to identity the correct shift in the dynamics. Referring to Tables 3 and 4, however, we see that more pronounced differences emerge when attempting to estimate the pre-and post-break drift. While this is clearly evident from the fact that the loss function values associated with our proposed methodology are noticeably lower in all cases, a more detailed analysis reveals further distinctions worth mentioning. Table 3, which presents the results for cases involving an increasing drift, reveals that our proposed methodology has correctly identified an increasing trend in both cases and has also correctly identified that the increase in drift for parameter set 4 is three times that of parameter set 3. In contrast to this, the method of Grazzini et al. (2017) incorrectly suggests a decreasing trend in both cases. Table 4, which presents the results for cases involving a decreasing drift, similarly shows that our proposed methodology delivers superior performance when attempting to identify the change in drift. This change in the relative performances of each method when estimating the drift rather than the volatility is a direct consequence of the relationship between the deterministic and stochastic components of the model. For the selected parameter ranges, the random fluctuations, t , dominate the evolution of the model, with the drift producing a more subtle effect, particularly after the structural break occurs. For this reason, correctly estimating the pre-and post-break volatility is a far less challenging task than estimating the pre-and post-break drift. Therefore, while both methods perform well when estimating parameters associated with dominant effects like volatility, our method's incorporation of dependence on previously ob-served values seems to be important when estimating parameters related to more nuanced and less dominant aspects of a model. .

Franke and Westerhoff (2012) Model
As stated in Section 3.3, the final model we consider has a number of alternate configurations differing in how the attractiveness of fundamentalism relative to chartism, a t , is determined during each period. For this reason, we consider two of these configurations, HPM and WP, and focus on estimating the parameters associated with the rules governing a t : α n ∈ [0, 2], α 0 ∈ [−1, 1], α p ∈ [0, 20], α w ∈ [0, 15000], and η ∈ [0, 1], while also estimating the standard deviation of the noise term appearing in the chartist demand equation, σ c ∈ [0, 5] 19 . Referring to Table 5, we see that our proposed estimation methodology appears slightly more effective than that of Grazzini et al. (2017) for the HPM parameter set, producing superior estimates for all but one of the considered free parameters and resulting in a lower loss function value. Nevertheless, the estimates do not differ substantially when comparing the methods. Despite this, we see, in what is a seemingly analogous trend to what was observed in the random walk experiments, that the differences in performance are more pronounced for the WP parameter set. In particular, we see a substantial difference in the loss function values associated with each method, brought about by differences in the quality of estimates produced for η. = 0.01, β = 1, φ = 0.12, χ = 1.5, and σ f = 0.758 for the HPM parameter set and µ = 0.01, β = 1, φ = 1, χ = 0.9, α 0 = 2.1, and σ f = 0.752 for the WP parameter set, as suggested by Franke and Westerhoff (2012).
As illustrated in Figure 2, the method of Grazzini et al. (2017) produces a wide posterior for η that is dispersed across the entirety of the explored parameter range, which results in a relatively poor estimate. In contrast to this, we see that the proposed methodology fares better, producing a far narrower posterior and a significantly more accurate estimate. While it is nontrivial to identify any definitive causes for the observed behaviours due to the nonlinear nature of heterogeneous agent models, it is worth pointing out that the inclusion of wealth dynamics in the WP version of the model introduces a dependence of a t on the previous return via Eqns. 24-26, which may in turn increase the strength of the relationship between the current and previously observed values in the log return time series.
As a final remark, notice that for the vast majority of the free parameters considered, the proposed methodology also results in lower posterior standard deviations, as was the case for the Brock and Hommes (1998) model.

. Overall Summary
In the preceding subsections, we have focused primarily on analysing the results on a case-by-case basis. Here, however, we provide a summative comparison across all of the considered models. This is achieved though the consideration of a number of key performance metrics, presented in Table 6, which compare the approaches at both a global and individual parameter level. The first of the aforementioned metrics, and the most important, LS mdn < LS kde , indicates how often the proposed methodology results in lower loss function values, and hence measures its relative ability to recover the true parameter set. We observe that in all cases considered, our methodology results in lower loss function values, which can be seen as indicative of dominance at the global level.
The second metric, |µ i mdn − θ i true | < |µ i kde − θ i true |, determines how often our proposed methodology produces superior estimates for individual parameters in a free parameter set. In some situations, one might find that the estimates obtained for a subset of the free parameters by the method of Grazzini et al. (2017) are superior, even if the overall estimate for the entire free parameter set is not as good. Nevertheless, we find that in over 80% of cases, our methodology also results in superior estimates at the level of individual parameters, a comfortable majority. It should also be noted that in virtually all situations where |µ i mdn − θ i true | > |µ i kde − θ i true |, such as some cases of b 2 and b 3 in the Brock and Hommes (1998) model, and σ 1 and σ 2 in the random walk model, the differences in the estimates produced by both methods are incredibly small. In contrast to this, a sizeable number of cases where |µ i mdn − θ i true | < |µ i kde − θ i true |, such as g 2 and g 3 in the Brock and Hommes (1998) model, and η in the Franke and Westerhoff (2012) model, are characterised by comparatively large differences in the estimates obtained by the competing approaches. This suggests that our proposed methodology also demonstrates a degree of dominance at the level of individual parameters.
The final metric, σ i mdn < σ i kde , indicates how frequently our proposed methodology results in reduced posterior standard deviations for individual parameters, which occurs in slightly below 80% of the considered cases, again a comfortable majority 20 .
Based on the evidence presented by the above metrics as a whole, it would appear that our proposed methodology does indeed compare favourably to that of Grazzini et al. (2017), which was itself already shown to dominate a number of other contemporary approaches in the literature by Platt (2019). This ultimately validates our method as a worthwhile addition to the growing toolbox of estimation methods for economic simulation models. .

Choosing the Lag Length
As previously stated, we set L = 3 in all estimation experiments involving our proposed method. Naturally, one may wonder whether this is an arbitrary choice or if there is a systematic way of choosing L. Similarly, one may also wonder if the obtained results are robust to this choice, even if only to some extent. We now address both issues. When applying the proposed methodology, we observed a phenomenon that appeared to be relatively consistent throughout the experiments. In more detail, we observe that while increasing L initially has a pronounced effect on the estimated conditional densities, there exists some L * ≥ 0 such that for L ≥ L * , or, in other words, the MDN essentially ignores the additional lags. We illustrate this graphically in Figure 3. Here, we train an MDN on 100 realisations of length 1000 generated using the Brock and Hommes (1998) model initialised using parameter set 1. We then randomly draw an arbitrary sequence of 6 consecutive values from a time series of length 1000, also generated by the Brock and Hommes (1998) model. This then allows us to use the MDN to plot the conditional density functions for differing choices of L, conditioned on the values generated in the previous step, and observe the aforementioned trend.
Repeating this exercise on models for which the true lag, L true , is known a-priori (see Figures 4 and 5), we see that L * = L true . This has a number of important implications. Firstly, it implies that plots of the type we have constructed here can be used as a means to systematically inform the choice of L for arbitrary models. Secondly, and perhaps more importantly, it implies that if L ≥ L true , the procedure should demonstrate at least some robustness to the choice of lag, provided that the MDN is sufficiently expressive and sufficiently well-trained. This explains why simply setting L = 3 resulted in a high level of estimation performance in our experiments, regardless of the considered model, since the models considered are not characterised by long-range dependencies 21 . .

Computational Costs
At this point, one may ask whether the proposed estimation routine compares favourably to other contemporary alternatives in terms of computational costs. As stated by Grazzini et al. (2017), the cost of generating simulated data using a candidate model is generally dominant, particularly for large-scale models that may need to be run for several minutes in order to generate a single realisation. It is therefore imperative that any estimation methodology keep the simulated ensemble size, which we call R, to a minimum. As previously stated, we have selected R = 100, which results in a relatively large training set of R(T sim − L) = 99700 training examples. This compares favourably to most alternatives in the literature on a number of grounds. Firstly, most studies which have attempted to estimate models of similar complexity make use of ensembles consisting of a far greater number of realisations, typically in excess of R = 1000 (Barde 2017;Lamperti 2017;Lux 2018). Secondly, the training set associated with R = 100 is already large relative to the complexity of the network architecture we employ 22 .
To illustrate this point, we repeat the experiments associated with parameter set 1 of the Brock and Hommes (1998) model, changing only the simulated ensemble size, which has been halved to R = 50. We find that even with this drastic decrease in the number of Monte Carlo replications, the proposed methodology still performs well and results in a lower loss function value than was obtained using the method of Grazzini et al. (2017) in the original experiments, with a ratio of LS MDN /LS KDE = 0.7249 23 . This provides some evidence that even for greatly reduced ensemble sizes, our approach remains viable, and implies that the complexity of the candidate model and hence the employed neural network would likely need to be increased substantially before any increase in R beyond 100 is required.
In addition to concerns related to the size of the simulated ensemble, it is also worthwhile to consider the actual computational costs of the neural network training procedure relative to those associated with the generation of a single model realisation. For this reason, Figure 6 demonstrates the total training time required by various neural network configurations, most of which are larger than that of the network employed in this investigation, which typically takes ∼ 5 seconds to be completely trained. We find that even for substantially more complex neural networks than those considered in our investigation, the overall training time is still 21 The interested reader should refer to Appendix C for additional discussions. typically less than 40 seconds, which compares favourably to the simulation time of large-scale models, and we additionally find that the increase in computational time is linear for both increases in the lag length and network width. Further, it should be noted that GPU parallelisation was not employed when generating the aforementioned computational cost diagrams. Given the significant speedup that could be expected with the use of such hardware, typically in the region of 20× (Oh and Jung 2004), we find there to be at least some evidence that the time taken to train the neural network will generally be negligible in comparison to the time taken to generate a single model realisation, even for far more sophisticated neural networks and candidate models. This would, however, require further testing that is beyond the scope of this investigation and we thus suggest that the proposed routine be applied to more sophisticated models in future work.
In the preceding sections, we have introduced a neural network-based protocol for the Bayesian estimation of economic simulation models (with a particular focus on ABMs) and demonstrated its estimation capabilities relative to a leading method in the existing literature.
Overall, we find that our method delivers compelling performance in a number of scenarios, including the estimation of heterogeneous agent models typically used to test estimation procedures, and less orthodox examples, such as identifying dynamic shifts in data generated by a random walk model. In all of the cases tested, we find that our proposed methodology produces estimates closer to known ground truth values than the approach proposed by Grazzini et al. (2017) and also find that it typically results in narrower and more sharply peaked posteriors for larger free parameter sets.
In addition to our primary findings, we also discuss practical issues related to the applicability of the proposed routine. We demonstrate that the lag length, which can be viewed as our approach's primary hyperparameter, can be systematically chosen and that the overall estimation performance demonstrates at least some robustness to this choice. Further, we provide a number of arguments as to the protocol's computational efficiency relative to a number of prominent alternatives in the literature and therefore suggest that attempts be made to apply it to models of a larger scale in future research.
The author would like to thank J. Doyne Farmer for helpful discussions that greatly aided the process of preparing this manuscript and the UK government for the award of a Commonwealth Scholarship. Responsibility for the conclusions herein lies entirely with the author.
While we presented an overview of our estimation procedure in Section 2, the associated discussions were primarily illustrative and omitted several key details. We thus provide a more technical, step-by-step discussion of our approach in this section. .

Training Set Construction
The primary aim of our methodology is the construction of an approximation to the likelihood function for a given set of parameter values, θ. In order to facilitate this process, we make the simplifying assumption that x sim t,i (θ) depends only on x sim t−L,i (θ), . . . , x sim t−1,i (θ), for all L < t ≤ T. Our problem therefore reduces to the estimation of conditional densities of the form p x sim t,i x sim t−L,i , . . . , x sim t−1,i : θ . In order to estimate the above conditional densities, we will require an appropriate dataset, which is constructed in a number of stages. The first of these stages involves the use of the candidate model to generate an ensemble of R Monte Carlo replications, X sim (θ, T sim , i), i = i 0 , i 0 + 1, . . . , i 0 + R − 1, for a given value of θ. This is then followed by the construction of two ordered sets for each Monte Carlo replication i in the ensemble, and Finally, the sets X train i (θ), i = i 0 , i 0 + 1, . . . , i 0 + R − 1 are concatenated, in order, to produce a single, larger ordered set, X train (θ), with an analogous procedure being applied to Y train i (θ) to yield Y train (θ). In essence, X train (θ) consists of rolling windows of length L drawn from the ensemble of Monte Carlo replications, while Y train (θ) consists of the x sim t,i (θ) values that directly follow each window in X train (θ). Together, they form a training set of size R(T − L) that can be used to approximate the required conditional densities. .

Neural Network Specification and Training
With an appropriate dataset now constructed, we proceed with a more detailed discussion of the MDN itself. As a starting point, let H be a feedforward neural network with input layer h 0 (taking in windows of length L), hidden layers h 1 , h 2 , . . . , h n−1 , output layer h n , and weights and biases ψ. The mixture parameters are then defined as and where diag(x) is a diagonal matrix with diagonal x and This results in an expanded neural network with weights and biases that takes windows of length L as input and outputs α, µ k , and Σ k as defined above. At this stage, there are a number of nuances worth highlighting. In Eqn. 30, notice that we make use of the so f tmax function. This ensures that the mixture weights, α, are strictly positive and sum to one, as required. Additionally, notice that in Eqn. 32 we consider a diagonal rather than a full covariance matrix 24 . If we had not made such an assumption, we would have to ensure that the covariance matrices returned by our neural network were positive definite. Though possible in principle, this would significantly increase the number of network parameters and have a potentially detrimental effect on computational performance (Rothfuss et al. 2019). Finally, it should be apparent from Eqn. 33 that the neural network outputs a vector of log variances rather than the diagonal covariance matrix, allowing us to avoid imposing positivity constraints on the network output. Now, all that remains is the training of our constructed network, which is achieved through the application of maximum likelihood estimation to our training set. Denoting by X train m the m-th entry in X train (θ) (with Y train m being similarly defined), maximum likelihood estimation is equivalent to solving arg min using stochastic gradient descent methods. .

Data Normalisation and Regularisation
While the scheme we have just described could be applied as is, it is likely to perform suboptimally in its current form. This is because neural networks, like most machine learning techniques with a large number of free parameters, have a tendency to overfit the training data and thus perform poorly out-of-sample, particularly when the training set is small (Murphy 2012). In practice, this is often addressed using early stopping, a technique that requires a percentage of the data to be kept separate from the training set in order to evaluate out-of-sample performance during each epoch (Prechelt 1998). Such a solution is, however, undesirable in our context, since it requires the generation of additional data, an expensive undertaking for large-scale simulation models.
Fortunately, Rothfuss et al. (2019) present a set of best practices for conditional density estimation using neural networks that provides an alternative solution for overfitting. In particular, a technique called noise regularisation is employed, in which small random perturbations are applied to the data during the training process. It can be shown that this ultimately results in a complexity penalty that favours smoother density estimates that are less prone to overfitting (Rothfuss et al. 2019). For this reason, we apply Gaussian perturbations to training examples in X train (θ) and Y train (θ), which we denote by ξ x ∼ N (0, η x I) and ξ y ∼ N (0, η y I), respectively. It should be apparent that the degree of regularisation depends directly on the magnitudes of the standard deviations η x and η y relative to the range of variation in the training data 25 . This implies that η x and η y would have to be adjusted for each candidate model in order to result in the same degree of regularisation. Rothfuss et al. (2019) therefore propose a data normalisation scheme that ensures the training data exhibits zero mean and unit variance, eliminating the need to retune these hyperparameters for each candidate model. This is achieved through the application of a simple transformation to each training example. Lettingμ x andσ x be vectors that contain estimates of the mean and standard deviation along each dimension for training examples in X train (θ), this transformation is given byX withμ y ,σ y andỸ train m being defined analogously. Once the network has been trained on the normalised dataset, we are required to evaluatef (x, y, φ), originally defined in Eqn. 8. This is achieved through a simple procedure. Firstly, the normalisation transform is applied to x and y using the samê µ y ,σ y ,μ x andσ x values defined in Eqn. 37, yieldingx andỹ.x is then fed through the trained neural network to yield corresponding mixture parameters, allowing us to evaluate the density atỹ, which we denote byg(x,ỹ,φ). It should be noted thatg does not directly correspond tof , since we have made a change of variables and the volume of the probability density is not preserved under the normalisation transform forσ y = 1. Rothfuss et al. (2019) do, however, prove that whereσ (j) y is the j-th element ofσ y , allowing us to easily calculate the required density. .

Neural Network Architecture
In essence, we have defined a general neural network-based approach to simulation model estimation that is independent of the specific network architecture (number of hidden layers, number of neurons, type of activation functions, and so on) used. Nevertheless, for the sake of completeness, we briefly introduce the (relatively simple) architecture employed in our study, which is used consistently throughout unless stated otherwise. For the mixture model itself, we set the number of mixture components to be K = 16, with the associated mixture parameter network consisting of 3 hidden layers, each with 32 neurons and ReLU activations. This was trained using the wellknown Adam optimiser (Kingma and Ba 2015) over 12 epochs 26 , with a batch size of 512 and noise regularisation parameters η x = η y = 0.2.
The above architecture, which performed well for all of the estimation tasks conducted, was, perhaps rather surprisingly, the first architecture we considered and was chosen by hand rather than through an automated optimisation procedure. Attempts to improve performance by increasing the number of hidden layers, neurons, and mixture components seemed to have little effect, suggesting that the proposed network is sufficiently expressive to produce high-quality density estimates for our considered set of problems. We suspect that this will likely hold for other models of similar complexity and therefore make the recommendation that our proposed architecture be used as a baseline for future investigations employing this estimation methodology.
For more complex models, however, it may be necessary to construct more expressive networks and in such cases we would suggest that some form of hyperparameter optimisation be carried out. This is beyond the scope of our investigation, however, and we thus leave it to future research.
In this section, we briefly discuss the adaptive Metropolis-Hastings algorithm that has been employed in all of the conducted estimation experiments. Our discussion here is mainly illustrative and positioned in the context of our investigation. The interested reader should therefore refer to the original contribution by Griffin and Walker (2013) for theoretical justifications and a more general discussion.
In essence, the approach is centred on the idea of maintaining a set of samples, s , θ 4. If accepted, set θ s+1 = θ s with θ (n) s replaced by z, otherwise simply set θ s+1 = θ s .
Repeating the above for S iterations, we obtain a sequence of sample sets that can be used to compute expectations of the form In our investigation, we set S = 5000 and N = 70 in all cases, with convergence typically observed at some point before s = 1500, leading us to discard the first 1500 sets as part of a burn-in period. When constructing the posterior samples, we repeat this entire sampling process 5 times and collect the obtained sets to form a larger collection of 5 × 3500 × 70 = 1225000 samples 27 .
Ultimately, this has become our MCMC algorithm of choice for two main reasons: 1. The number of iterations required to reach convergence in random walk Metropolis-Hastings algorithms depends significantly on the initialisation of the algorithm. If, for example, the initial candidate parameter set has a particularly low posterior density, it could take a substantial period of time before convergence is observed. Since the algorithm proposed by Griffin and Walker (2013) is initialised using a sample of points from a number of areas of the parameter space, this problem is less pronounced.

2.
Most random walk Metropolis-Hastings algorithms require careful tuning of the proposal distribution, usually with the aim of obtaining an acceptance rate of roughly 25%, in order to ensure a good balance between local exploration of high density areas of the parameter space and global coverage of the parameter space as a whole (Robert and Casella 2010). This can be difficult to achieve in practice, making an adaptive approach that determines the proposal distribution automatically particularly appealing.
In Section 5.1, we provided evidence that our proposed estimation procedure demonstrates some robustness relative to the choice of lag length, L. Here, we provide a more complete demonstration by repeating all of the previously conducted estimation experiments involving our approach, changing only the lag length, which we have increased to L = 4. Referring to the summary presented in Table 7, we find that the overall performance of the procedure relative to our chosen benchmark is virtually unchanged 28 , verifying the robustness of our conclusions.
28 Since there are a total of 27 individual parameter cases, the percentage shifts correspond to changes in only a single binary relation for both |µ i mdn − θ i true | < |µ i kde − θ i true | and σ i mdn < σ i kde .