A new use of importance sampling to reduce computational burden in simulation estimation
Abstract
Simulation estimators (Lerman and Manski 1981; McFadden, Econometrica 57(5):995–1026, 1989; Pakes and Pollard, Econometrica 57:1027–1057, 1989) have been of great use to applied economists and marketers. They are simple and relatively easy to use, even for very complicated empirical models. That said, they can be computationally demanding, since these complicated models often need to be solved numerically, and need to be solved many times within an estimation procedure. This paper suggests methods that combine importance sampling techniques with changes of variables to address this caveat. These methods can dramatically reduce the number of times a particular model needs to be solved in an estimation procedure, significantly decreasing computational burden. The methods have other advantages as well, e.g. they can smooth otherwise nonsmooth objective functions and can allow one to compute derivatives analytically. There are also caveats: if one is not careful, they can magnify simulation error. We illustrate with examples and a small Monte Carlo study.
Keywords
Simulation estimators · Importance sampling · Monte Carlo study
JEL Classifications
C13 · C16 · C63
1 Introduction
Simulation methods such as Simulated Maximum Likelihood (SML) (Lerman and Manski 1981) and the Method of Simulated Moments (MSM) (McFadden 1989; Pakes and Pollard 1989) have great value to applied economists and marketers estimating structural models. However, use of these methods is still limited by computational constraints. This is because one often needs to numerically solve the equilibria of these models many times within an estimation procedure. Often such numeric procedures are CPU intensive. Examples include 1) complicated static equilibrium problems, e.g. discrete games or complicated auction models, and 2) dynamic programming with large state spaces or significant amounts of heterogeneity. In these estimation procedures, straightforward simulation involves solving the equilibrium of such a model numerous times, typically once for every simulation draw, for every observation, and for every parameter vector that is ever evaluated in a numeric optimization procedure. If one has N observations, performs S simulation draws per observation, and numeric optimization requires R function evaluations, estimation requires “solving” the model N ∗ S ∗ R times. This can be unwieldy for these complicated problems.
We suggest using a change of variables and importance sampling to help alleviate this problem. Importance sampling (Kloek and Van Dijk 1978) is a very simple numerical integration technique that has been used for various purposes in the prior literature, e.g. to reduce levels of simulation error (e.g. Berry et al. 1995), to smooth objective functions (e.g. McFadden 1989, Gasmi et al. 1991), or to aid in Bayesian estimation (e.g. Geweke 1989). We show that importance sampling can also be used to dramatically reduce the number of times a complicated model needs to be solved within SML and MSM estimation procedures. With this change of variables and importance sampling one only needs to solve the model N ∗ S times or S times, instead of N ∗ S ∗ R times. Since R can be quite large, particularly as the dimension of the parameter vector becomes large, this can lead to very significant time savings.
There are also a number of additional benefits of our importance sampling procedure. First, as illustrated in a simple model by McFadden (1989), importance sampling procedures can smooth objective functions that would be nonsmooth using straightforward simulation techniques. Second, the objective functions generated by our procedure will often have analytic first (and second) derivatives. This can also generate time savings and increase accuracy when one is using derivative-based optimization procedures, because one does not need to use numerical derivatives. Lastly, the importance sampling technique is easily parallelizable, which can generate increased time savings with modern multiprocessor computers.
There is another strand of literature on computational methods in dynamic structural models, starting with Hotz and Miller (1993) (HM), and continuing with Hotz et al. (1994), Aguirregabiria and Mira (2002, 2007), Jofre-Bonet and Pesendorfer (2003), Bajari et al. (2007a), Pakes et al. (2007), Pesendorfer and Schmidt-Dengler (2008), and Bajari et al. (2008). These are useful and popular techniques, and they can usually reduce computation time by more than our methods (since they often never require explicit solution of equilibrium strategies). But we believe that our suggested techniques can still be valuable, since they are very straightforward to use in models that allow for lots of unobserved heterogeneity (e.g. consumer or firm-specific, time-invariant unobservables). In contrast, the HM-related techniques can become considerably harder to use when there is more unobserved heterogeneity. Our importance sampling approach is actually more related to the approaches suggested by Keane and Wolpin (1994) and Rust (1997). We discuss how they relate and how the two approaches can be complementary.^{1}
Since a prior working paper version of this paper (Ackerberg 2001), a number of researchers have used our procedure successfully, e.g. Hartmann (2006), Pantano (2008), Bajari et al. (2009a), Goettler and Clay (2009). Also, Bajari et al. (2007b) (BFR) (and Bajari et al. 2009a) have suggested a related set of techniques. Their techniques can reduce computational burden of numerically challenging dynamic programming problems in a way similar to our importance sampling based techniques. Both their approach and our approach essentially pre-solve the dynamic programming model a fixed number of times, and input these pre-solved solutions into an optimization problem. Deciding which is preferable for a given situation might depend on whether one prefers to work with parameterized continuous distributions, or with more discrete nonparametric approximations. Our suggested techniques/models are more naturally parametric (of course, these parametric specifications can be chosen flexibly), while the BFR techniques are more naturally nonparametric. An advantage of the BFR technique is that in many cases, computation of the parameter estimates reduces to a linear programming problem. Numerically, these are easier and more reliably solved than the nonlinear programming problems generally used in structural estimation (and in our suggested techniques). On the other hand, because it is naturally nonparametric, one may need considerable amounts of data to use the BFR approach effectively, especially with high dimensional unobserved heterogeneity.
While these ideas can potentially be very powerful in reducing computational burden, the use of importance sampling is not without an important caveat. Importance sampling can dramatically change variance properties of simulation based estimators (either increasing or decreasing simulation error). It has the potential to significantly increase levels of simulation error, particularly with high dimensional unobserved heterogeneity. This can be much more dangerous than straightforward simulators that do not involve importance sampling. Often, standard simulation estimators have a very useful property: there is a natural bound on the level of simulation error relative to the level of sampling error. Importance sampling simulators do not generally have such a bound. In some cases, the variance of the simulation error can (in theory) be infinite. Hence, one needs to be very careful in applying this methodology. We discuss these issues in depth throughout the paper, particularly in Section 5, and these issues are also illustrated in the Monte Carlo experiments in Section 6.
2 The basic MSM estimator
Note that this simulation procedure can be thought of as a data generating procedure. Each draw ε _{ is } generates simulated dependent variables y _{ is }. Moments of these simulated y _{ is }’s are then compared to moments of the observed y _{ i }’s in the data. This illuminates how general this estimation procedure is. To estimate the model, one only needs to be able to simulate data according to the model.
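The data-generating view described above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the binary-outcome model `f`, the single moment that interacts the prediction error with x, and the crude grid search are hypothetical stand-ins, not the paper's specification.

```python
import random

def f(x, eps, theta):
    """Hypothetical stand-in for the (possibly expensive) model: a binary outcome."""
    return 1.0 if theta * x + eps > 0.0 else 0.0

def simulate_expected_outcome(x, eps_draws, theta):
    """Pure frequency simulator: average of f over the S draws for one observation."""
    return sum(f(x, e, theta) for e in eps_draws) / len(eps_draws)

def msm_objective(theta, xs, ys, draws):
    """Squared sample moment: mean over i of (y_i - simulated E[y_i | x_i, theta]) * x_i."""
    n = len(xs)
    moment = sum((y - simulate_expected_outcome(x, d, theta)) * x
                 for x, y, d in zip(xs, ys, draws)) / n
    return moment ** 2

rng = random.Random(0)
N, S, theta_true = 200, 50, 1.0
xs = [rng.uniform(-2, 2) for _ in range(N)]
ys = [f(x, rng.gauss(0, 1), theta_true) for x in xs]   # "observed" data
# Simulation draws are taken once and held fixed as theta changes.
draws = [[rng.gauss(0, 1) for _ in range(S)] for _ in range(N)]

grid = [0.25 * k for k in range(13)]                    # candidate theta values 0.0 .. 3.0
theta_hat = min(grid, key=lambda t: msm_objective(t, xs, ys, draws))
```

Note that every evaluation of `msm_objective` calls `f` N ∗ S times, which is the cost the next section seeks to avoid.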
3 Importance sampling and a change of variables to reduce computational burden
A significant caveat of the above simulation procedure is that f(x _{ i },ε _{ is },θ) may be time-consuming to compute, often requiring numeric methods. The problem is that the arguments of f(x _{ i },ε _{ is },θ) change across observations (x _{ i }), across simulation draws (ε _{ is }), and across different parameter vectors (θ). Hence, in an estimation procedure, f(x _{ i },ε _{ is },θ) typically needs to be evaluated N ∗ S ∗ R times: once for each observation (N), for each simulation draw (S), for each parameter vector ever evaluated by one’s optimization routine (R denotes this total number of function evaluations needed for optimization). This is particularly problematic as the number of parameters increases since R can increase quickly in the number of parameters. This paper shows how importance sampling and a change of variables can be used to significantly reduce the number of times that f(x _{ i },ε _{ is },θ) needs to be evaluated.

Property (CS), “Constant Support”: The econometric model y _{ i } = f(x _{ i },ε _{ i },θ) can be expressed as \(y_{i}=\widetilde{f}(u(x_{i},\epsilon _{i},\theta ))\), where the “change-of-variables” vector-valued function u(x _{ i },ε _{ i },θ) satisfies:
I) ∀ x _{ i },θ, the function u(x _{ i },ε _{ i },θ) and the distribution p(ε _{ i }∣x _{ i },θ) are such that one can analytically (or quickly) compute the change of variables density of the random vector u _{ i } = u(x _{ i },ε _{ i },θ)
II) ∀ x _{ i }, the support of the random vector u _{ i } = u(x _{ i },ε _{ i },θ) does not depend on θ.
For a model to satisfy Property (CS) it needs to be able to be expressed in a form where the finite set of parameters enter the model only through functions u which have I) easily computable change-of-variables densities and II) supports that do not depend on θ. For I) to hold, the u functions will generally need to have fairly simple structure. In contrast, the \(\widetilde{f}\) function will typically be complicated, since we are concerned with cases where the original function f is complicated or time consuming to evaluate. So in a sense, rewriting y _{ i } = f(x _{ i },ε _{ i },θ) as \(y_{i}=\widetilde{f} (u(x_{i},\epsilon _{i},\theta ))\) divides a complicated original model f into a model with two components, one simple (u) and one complicated (\( \widetilde{f}\)).
Note that for II) to hold, one needs a sufficient amount of heterogeneity in one’s model. For example, one cannot have an element of θ enter u through an element (recall that u is a vector) that does not depend on some of the unobservables ε _{ i }, e.g. u(x _{ i },θ) or u(θ). The support of such a deterministic function would necessarily depend on θ, contradicting II).
Many commonly used econometric models satisfy Property (CS); this is exhibited in examples later. We will also discuss cases where it may not be satisfied and show how one can either 1) still benefit from computational savings using our technique, or 2) perturb an econometric model so that it satisfies Property (CS).
The most important aspect of Eq. 3 for our purposes is that when θ changes, the u _{ is }’s can be held constant. As a result, \(\widetilde{f}\) does not need to be recomputed when θ changes. The only components of Eq. 3 that need to be reevaluated as θ changes are the numerators of the importance sampling weights, i.e. p(u _{ is }∣x _{ i },θ). Unlike recomputing the complicated f (or \(\widetilde{f}\)), recomputing this change of variables density is typically not computationally burdensome.^{4} In summary, an estimation procedure using \(\widetilde{Ef}_{i}(\theta )\) only needs to compute the complicated part of the model N ∗ S times, rather than N ∗ S ∗ R times with the conventional simulator \(\widehat{Ef}_{i}(\theta )\).
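The reweighting logic can be sketched in a few lines, assuming the simulator is the average over s of \(\widetilde{f}(u_{is})\) times the weight \(p(u_{is}\mid x_{i},\theta )/g(u_{is}\mid x_{i})\), as the discussion of the weights suggests. The toy model is hypothetical: u = θx + ε with ε standard normal, so that u given x is N(θx, 1) with support that does not depend on θ, and `f_tilde` is a trivial indicator standing in for an expensive model solution. A counter records how often the "complicated" part is solved.

```python
import math
import random

def normal_pdf(u, mean, sd):
    z = (u - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

calls = {"f_tilde": 0}

def f_tilde(u):
    """Hypothetical 'complicated' part of the model; counts how often it is solved."""
    calls["f_tilde"] += 1
    return 1.0 if u > 0.0 else 0.0

def p_density(u, x, theta):
    """Change-of-variables density of u = theta*x + eps with eps ~ N(0, 1)."""
    return normal_pdf(u, theta * x, 1.0)

rng = random.Random(1)
N, S, theta_init = 50, 200, 0.5
xs = [rng.uniform(-1, 1) for _ in range(N)]

# One-time setup: draw u_is from g(u | x_i) = p(u | x_i, theta_init) and solve the model.
u_draws = [[rng.gauss(theta_init * x, 1.0) for _ in range(S)] for x in xs]
g_vals = [[p_density(u, x, theta_init) for u in row] for x, row in zip(xs, u_draws)]
f_vals = [[f_tilde(u) for u in row] for row in u_draws]

def is_simulator(i, theta):
    """Importance sampling simulator for observation i: only the weights depend on theta."""
    return sum(fv * p_density(u, xs[i], theta) / gv
               for fv, u, gv in zip(f_vals[i], u_draws[i], g_vals[i])) / S

# Evaluate at many parameter values, as an optimizer would (R = 25 here).
values = [[is_simulator(i, 0.1 * r) for i in range(N)] for r in range(25)]
```

After the loop, `calls["f_tilde"]` equals N ∗ S rather than N ∗ S ∗ R, and at θ equal to `theta_init` all weights are one, so the simulator coincides with a pure frequency average of the stored `f_vals`.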
Under appropriate regularity conditions, the estimator using the simulator \( \widetilde{Ef}_{i}(\theta )\) can be shown to be consistent and asymptotically normal using, e.g. Theorems 3.1 and 3.3 of Pakes and Pollard (1989). However, it is very important to note that these regularity conditions will tend to be considerably stronger than what is necessary when using the conventional simulator \(\widehat{Ef}_{i}(\theta )\). The problem is that importance sampling can dramatically affect the precision of simulation (both positively and negatively). The variance of the simulator \(\widetilde{Ef}_{i}(\theta )\) will depend on the variance of the importance sampling weights \(\frac{p(u_{is}\mid x_{i},\theta )}{g(u_{is}\mid x_{i})}\). This variance depends on the behavior of p and g in the tails, and can be infinite even if \(\widetilde{f}\) is bounded. So, for example, it is possible that to obtain the theoretical result, one may need to artificially bound the support of u _{ i } (e.g. \(u_{i}\in [-R,R]\) for some large R ) in order to theoretically bound the simulation error. This is in stark contrast with the conventional simulator \(\widehat{Ef}_{i}(\theta )\), for which the level of simulation error is naturally bounded (if f is bounded) because the importance sampling weights are implicitly always equal to 1.
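As a concrete illustration of how the tails of p and g drive this variance, consider a one-dimensional case with normal densities chosen purely for illustration (they are not taken from the paper's examples): g is the standard normal density and p(· ∣ θ) is the N(0, σ²) density, so that θ = σ. The second moment of the importance sampling weight under g is then

\[
E_{g}\!\left[ \left( \frac{p(u\mid \sigma )}{g(u)}\right) ^{2}\right]
=\int \frac{p(u\mid \sigma )^{2}}{g(u)}\,du
=\frac{1}{\sigma ^{2}\sqrt{2\pi }}\int \exp \left( -u^{2}\left( \frac{1}{\sigma ^{2}}-\frac{1}{2}\right) \right) du
=\begin{cases}
\dfrac{1}{\sigma \sqrt{2-\sigma ^{2}}} & \text{if } \sigma ^{2}<2, \\[2ex]
+\infty & \text{if } \sigma ^{2}\geq 2.
\end{cases}
\]

At σ = 1 the weights are identically 1 and the second moment equals 1. As σ² approaches 2 the second moment diverges, even though each individual weight is finite. In this illustration, a g whose tails are too thin relative to p is what produces the infinite variance.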
These additional regularity conditions are not just a theoretical curiosity. They have important practical implications. Conventional pure frequency simulators like \(\widehat{Ef}_{i}(\theta )\) often have the property that the simulation inflates the asymptotic variance of the parameter estimates by proportion \(\frac{1}{S}\) (Pakes and Pollard 1989; McFadden 1989). Importance sampling simulators do not have this property—the additional variance imparted by the simulation depends on the choice of g and can be significantly greater than \(\frac{1}{S}\). As a result, one needs to pay particular attention to the issue of simulation error, much more so than with conventional simulators.^{5} We discuss this issue in more detail in Section 5.3, including ideas on how to choose g. For more on the general problems and issues that can arise using importance sampling simulators, see Geweke (1989).
Additional computational savings are possible if one chooses an importance sampling density g(u _{ i }∣x _{ i }) that is the same across observations, i.e. g(u _{ i }). In this case, the draws u _{ is } can also be held constant across observations i. Hence, in this case \(\widetilde{f}\) only needs to be computed S times. One caveat here is that using the same simulation draws across observations may limit the extent to which simulation error averages out across observations, and may change its asymptotic properties.
Note that \(\widetilde{Ef}_{i}(\theta )\) is not much harder to program than \( \widehat{Ef}_{i}(\theta )\)—one only needs some additional density calculations. In addition, for a similar reason as noted by Bajari et al. (2007b), the S necessary calculations of \(\widetilde{f}\) are done before the search procedure. Hence, the calculations can easily be parallelized to take advantage of modern multiprocessor computers.
Lastly note that there is some intuition behind our alternative simulator \( \widetilde{Ef}_{i}(\theta )\). As θ changes, rather than holding each of the ε _{ is } and their implicit weights (\(\frac{1}{S})\) constant, this procedure holds the u _{ is } constant and varies the “weights” \(\left( \frac{1}{S}\frac{ p(u_{is}\mid x,\theta )}{g(u_{is})}\right) \) on each of the draws. Put another way, rather than changing our “simulated observations” when we change θ, we change the weight which we put on each “simulated observation”. This avoids needing to recompute the complicated \(\widetilde{ f}\) for new simulated observations. Another way to think about this estimator is that the estimator uses the simulation draws to in a sense “span” parameter space. The simulator is approximating the objective function at different parameter values by reweighting these simulation draws in different ways.
3.1 Application to simulated maximum likelihood problems
Our importance sampling methodology can also be applied to Simulated Maximum Likelihood (SML) estimation procedures. SML estimators are not quite as general as MSM estimators. One reason is that straightforward SML is often a bit complicated with continuous outcome variables (although measurement error methods suggested by Keane and Wolpin 2000 can address this issue). However, SML is typically more efficient and often easier to use than MSM in panel settings, and thus has frequently been applied to models with discrete outcome variables, e.g. dynamic programming discrete choice models (e.g. Erdem and Keane 1996; Ackerberg 2003; Crawford and Shum 2005; Hendel and Nevo 2006; and Hartmann 2006).
4 Examples
We next provide three simple examples of applications of our importance sampling simulator. The examples show the wide range of problems to which it can be applied. They also aid in interpreting Property (CS) and help illustrate important caveats of the simulator described in the next section.
4.1 Example 1: a dynamic programming problem
This example is similar to Hartmann (2006), who applies the importance sampling simulator in his empirical work. We also use this example in our Monte Carlo experiments below.
As parameters change, the u _{ is }’s need not change. As such, the conditional likelihood function \(\widetilde{f}(c_{i}\mid p_{i},u_{is})\) (and thus the dynamic programming problem) only needs to be computed N ∗ S times, once for each simulation draw for each individual. As described above, one could reduce the number of needed dynamic programming solutions to S by choosing a g(u _{ is }∣x _{ i }) that is the same across observations (i.e. does not depend on x _{ i }) and using the same simulation draws u _{ is } for each individual (see Section 6 for an example of this).
As discussed further at the end of Section 5.1, our techniques have an important side benefit in these sorts of problems. Specifically, one can run many alternative specifications (more specifically, alternative specifications with different sets of x _{ ij }’s entering the \(\big(\big\{ \alpha _{ij}\big\} _{j=1}^{J},\beta _{i},\gamma _{i}\big)\)) without having to re-solve new dynamic programming problems. As long as the g(u _{ is }∣x _{ i }) and the simulation draws are held constant, one can try alternative sets of x _{ ij }’s without having to re-solve the \(\widetilde{f}\)’s. The only thing that needs to change across the specifications is the p(u _{ is }∣x _{ i },θ)’s. This can allow researchers to try alternative specifications or investigate robustness in a very computationally cheap way.
4.2 Example 2: a discrete game
4.3 Example 3: an asymmetric auction model
X _{ ij } are auction-bidder specific factors that are common knowledge to all bidders and observed by the econometrician, η _{ ij } is an auction-bidder specific factor that is common knowledge to all firms but unobserved by the econometrician, and λ _{ ij } is bidder j’s private value in auction i, observed only by bidder j. Suppose that η _{ ij }∼iid \(N(0,\sigma _{\eta }^{2})\), λ _{ ij }∼iid \( N(0,\sigma _{\lambda {i}}^{2})\), and \(\sigma _{\lambda{i}}^{2}\sim iid\) \(\ln N(\mu _{\lambda },\sigma _{\lambda }^{2})\). Note that the across-bidder variance of the private value component, \(\sigma _{\lambda i}^{2},\) is allowed to vary across different auctions i. A more parsimonious specification might restrict this variance to be identical across auctions. The additional heterogeneity has been added to the model to help satisfy Property (CS) (see Section 5.1 for further discussion).
5 Discussion
5.1 Satisfying or partially satisfying property (CS)
In our three examples, we were able to find change of variables functions u that satisfied Property (CS). However, in some models this may not be the case. One common example of this is when there are parameters in one’s model that 1) do not vary unobservably across the population and 2) do not enter into “index” functions that have at least one unobservable component that varies across the population. In Example 1, we did not consider estimation of the discount factor (it was implicitly assumed to be known). Suppose that one did want to estimate the discount factor δ, and furthermore wanted to assume that all consumers have the same discount factor δ _{ i } = δ. Such a model would not easily satisfy Property (CS). The problem is that in this case, it will be hard to find a change of variables function u that summarizes the impact of the discount factor parameter on the model, that has a constant support, and that has an easily computable change of variables density. From a more intuitive perspective, the problem here is that since there is no heterogeneity in δ across the population, one cannot easily “span” δ space using the simulation draws (so we cannot learn anything about the likelihood function when, e.g., δ = 0.8 from solutions to the model when, e.g., δ = 0.9).^{18}
A first approach is to apply the importance sampling approach to only a subset of parameters. Suppose that the parameter vector can be divided into components θ _{1} and θ _{2}. Suppose that the model can be expressed in a form where the θ _{2} parameters enter through change of variables functions u that have constant support, but that this cannot be done with the θ _{1} parameters. Then \(\widetilde{f}\) will need to be recomputed as θ _{1} changes (though not as θ _{2} changes). In Example 1 with a homogeneous discount factor parameter that needs to be estimated, the discount factor would be in θ _{1}, the rest of the parameters in θ _{2}.^{19}
In these situations where Property (CS) is partially satisfied, a first option is to use derivative based optimization methods. In computing first derivatives, \(\widetilde{f}\) needs to be recomputed only when elements of θ _{1} are perturbed. This will reduce the time needed to compute these derivatives to approximately a fraction \(\frac{\dim (\theta _{1})}{\dim (\theta )}\) of that required by straightforward simulation using numeric derivatives. A second alternative is to use a nested search algorithm. On the outside, one searches over θ _{1}; on the inside, over θ _{2}. During the inside search algorithm, one need not recompute the \(\widetilde{f}\)’s. As these nested search algorithms are generally inefficient, this approach may only be reasonable if the dimension of θ _{1} is small.
A second general approach to satisfying Property (CS) is to note that if a coefficient is heterogeneous across the population (and has a constant support, e.g. normals, lognormals, or functions of such variables), it will automatically satisfy Property (CS). Therefore, if one allows heterogeneity in the discount factor across the population, e.g. \(\delta _{i}=\frac{\exp (\delta +\sigma _{\delta }\mu _{i})}{1+\exp (\delta +\sigma _{\delta }\mu _{i})}\) where μ _{ i } has support (− ∞ , ∞) (as in an early version of Hartmann 2006),^{20} Property (CS) can be satisfied by including \(u_{i}=\frac{ \exp (\delta +\sigma _{\delta }\mu _{i})}{1+\exp (\delta +\sigma _{\delta }\mu _{i})}\) as an element of the change of variables function. Strictly speaking, this is not a generalization of the model with δ _{ i } = δ, since the variance of μ _{ i } needs to be bounded away from 0 to apply the importance sampling simulator.^{21} It is interesting, however, that the importance sampling method in a sense works better with more unobserved heterogeneity in one’s model.^{22} This contrasts with methods for estimating dynamic programming problems related to HM, which tend to be harder to apply when there is unobserved heterogeneity (though these methods do have other advantages, e.g. being able to estimate structural parameters without even solving a single dynamic programming problem).
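For the heterogeneous discount factor above, the change of variables density is available in closed form once a distribution for μ _{ i } is fixed. The sketch below assumes μ _{ i } is standard normal (the text only requires support (− ∞ , ∞)); the function names and the numerical check are illustrative.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def discount_factor_density(u, delta, sigma):
    """Density of u = logistic(delta + sigma*mu), mu ~ N(0, 1), for u in (0, 1)."""
    z = (math.log(u / (1.0 - u)) - delta) / sigma   # recover mu from u
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    jacobian = 1.0 / (sigma * u * (1.0 - u))        # |d mu / d u|
    return phi * jacobian

def integrate(fn, lo, hi, n=20000):
    """Midpoint rule; adequate for this smooth density on (0, 1)."""
    h = (hi - lo) / n
    return sum(fn(lo + (k + 0.5) * h) for k in range(n)) * h

# The support is (0, 1) for every (delta, sigma) with sigma > 0, as Property (CS) II) requires.
total_a = integrate(lambda u: discount_factor_density(u, 2.0, 0.5), 0.0, 1.0)
total_b = integrate(lambda u: discount_factor_density(u, 0.0, 1.0), 0.0, 1.0)
```

Both integrals should be numerically close to one. Only this cheap density, not the dynamic programming solution at each drawn discount factor, has to be re-evaluated when (δ, σ _{ δ }) changes.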
Importantly, note that while having all coefficients in one’s model be heterogeneous across the population (as in the dynamic models considered by Bajari et al. 2007b) is often a sufficient condition for Property (CS) to hold, it is not a necessary condition. In our method, coefficients need not be heterogeneous across the population as long as they enter the model through change-of-variables functions that include sufficient heterogeneity. This turns out to be very useful for introducing new individual level covariates into a model in a parsimonious way. Moreover, it also allows a researcher to estimate many alternative specifications (e.g. models with different sets of individual level covariates) without needing to compute new \(\widetilde{f}\) ’s.
Example 1 is again illustrative of this. Note that the consumer characteristics x _{ i } enter the model through index functions, \( x_{ij}^{\prime }\theta _{j}+\epsilon _{ij}\), where the parameter vectors θ _{ j } are homogeneous across observations i (i.e. the effect of a change in x _{ ij } on the mean of the taste distribution is the same across i). What is crucial is that these fixed parameters enter the model through index functions that contain at least some unobserved heterogeneity (and that admit a simple change of variables density).
This property of our estimator makes it very convenient to add new x _{ ij } ’s to the model, e.g. when an empirical researcher is trying alternative specifications. First, one can add a new x _{ ij } to the model by just introducing a single new parameter, not an entire new distribution. Second, one can add (or subtract) x _{ ij }’s to the model and reestimate the new model without having to recompute the \(\widetilde{f}\) ’s (as long as one continues to use the same importance sampling density and same simulation draws). We feel that this is a very important benefit, as it may allow researchers to experiment with more alternative specifications than they otherwise would.
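A minimal sketch of this reuse follows, under assumptions that are purely illustrative: a scalar index u that is N(x′θ, 1) given x, a wide normal g that is common to all observations, and a cheap logistic function standing in for an expensive model solution. The two "specifications" differ only in which covariates enter the mean of u.

```python
import math
import random

def normal_pdf(u, mean, sd=1.0):
    z = (u - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

counter = {"solves": 0}

def f_tilde(u):
    """Hypothetical expensive model solution (e.g. a dynamic program) given the index u."""
    counter["solves"] += 1
    return 1.0 / (1.0 + math.exp(-u))

rng = random.Random(2)
N, S = 30, 100
covariates = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(N)]  # (x1, x2) per observation

# g does not depend on x_i or theta: a wide normal with one common set of S draws.
G_SD = 3.0
u_draws = [rng.gauss(0.0, G_SD) for _ in range(S)]
g_vals = [normal_pdf(u, 0.0, G_SD) for u in u_draws]
f_vals = [f_tilde(u) for u in u_draws]          # solved once, before any estimation

def simulated_mean(index_mean):
    """Importance sampling estimate of E[f_tilde(u)] when u ~ N(index_mean, 1)."""
    return sum(fv * normal_pdf(u, index_mean) / gv
               for fv, u, gv in zip(f_vals, u_draws, g_vals)) / S

def spec_a(theta, x):   # specification A: only x1 enters the index
    return simulated_mean(theta[0] * x[0])

def spec_b(theta, x):   # specification B: x1 and x2 both enter
    return simulated_mean(theta[0] * x[0] + theta[1] * x[1])

fit_a = [spec_a((0.5,), x) for x in covariates]
fit_b = [spec_b((0.5, -0.3), x) for x in covariates]
```

Both specifications are evaluated from the same S stored solutions; `counter["solves"]` stays at S no matter how many specifications or parameter values are tried.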
5.2 Smoothness and analytic derivatives
As noted in the introduction, there are additional benefits of the importance sampling estimator. As McFadden (1989) originally noted in the case of a multinomial probit model, importance sampling can be used to smooth simulated objective functions. In many cases, f(x _{ i },ε _{ is },θ) is discontinuous in θ due to discreteness in one’s model. As a result, standard simulated objective functions have flats (areas with zero derivatives w.r.t. θ) and discontinuous jumps. In contrast, for distributions that are commonly used, p(u _{ is }∣x _{ i },θ) is typically smooth in θ. The change of variables and importance sampling technique essentially moves θ from inside f to inside p. Thus, it can convert an objective function that is discontinuous in θ to one that is continuous in θ.^{23} The discrete quantity game discussed above is an example of one of these cases; note that \(\widehat{E}Q_{i}(\theta )\) is discontinuous in θ, while \(\widetilde{EQ}_{i}(\theta )\) is continuous in θ.^{24} Smoothness can be a big advantage in estimation. First, it allows one to use derivative based search algorithms, which are often faster than nonderivative based routines. Second, even nonderivative based routines can have significant problems trying to optimize discontinuous functions with flats and jumps.
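The contrast between the two simulators can be seen in a toy binary model, which is an assumption made for illustration rather than one of the paper's examples: the outcome is the indicator of θ + ε > 0 with ε standard normal, so that u = θ + ε is N(θ, 1).

```python
import math
import random

def normal_pdf(u, mean):
    return math.exp(-0.5 * (u - mean) ** 2) / math.sqrt(2.0 * math.pi)

rng = random.Random(3)
S = 100
eps = [rng.gauss(0, 1) for _ in range(S)]

def frequency_sim(theta):
    """Step function in theta: each term flips only when theta crosses -eps_s."""
    return sum(1.0 for e in eps if theta + e > 0.0) / S

theta_init = 0.0
u = [theta_init + e for e in eps]                 # draws from g = N(theta_init, 1)
g = [normal_pdf(v, theta_init) for v in u]
f_tilde = [1.0 if v > 0.0 else 0.0 for v in u]    # discrete outcome, computed once

def importance_sim(theta):
    """Smooth in theta: theta enters only through the normal density in the weights."""
    return sum(f * normal_pdf(v, theta) / gv for f, v, gv in zip(f_tilde, u, g)) / S

h = 1e-6
# Difference of the frequency simulator over a tiny step: zero unless a draw is crossed.
flat = frequency_sim(0.3 + h) - frequency_sim(0.3)
# Central finite difference of the importance sampling simulator: a well-defined slope.
slope = (importance_sim(0.3 + h) - importance_sim(0.3 - h)) / (2 * h)
```

The frequency simulator only takes values that are multiples of 1/S, so over a small step it is typically exactly flat, while the importance sampling simulator has a nonzero, well-behaved slope at the same point.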
A related advantage of importance sampling objective functions concerns derivatives with respect to θ. When f is complicated, the objective functions of most standard simulation estimators often do not have analytic derivatives. If one is using a derivative based optimization routine, this lack of analytic derivatives necessitates use of numerical methods to obtain derivative information. This can be time-consuming and is potentially imprecise. In contrast, our importance sampling objective functions often do have analytic derivatives.
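To see why, suppose (as the discussion of the importance sampling weights suggests) that the simulator is \(\widetilde{Ef}_{i}(\theta )=\frac{1}{S}\sum_{s}\widetilde{f}(u_{is})\frac{p(u_{is}\mid x_{i},\theta )}{g(u_{is}\mid x_{i})}\). Since neither \(\widetilde{f}(u_{is})\) nor \(g(u_{is}\mid x_{i})\) depends on θ, differentiation only touches p:

\[
\frac{\partial \widetilde{Ef}_{i}(\theta )}{\partial \theta }
=\frac{1}{S}\sum_{s=1}^{S}\widetilde{f}(u_{is})\,
\frac{\partial p(u_{is}\mid x_{i},\theta )/\partial \theta }{g(u_{is}\mid x_{i})}.
\]

As an illustrative special case (an assumption, not a requirement of the method), if u is a scalar index distributed \(N(x_{i}^{\prime }\theta ,\sigma ^{2})\) given x _{ i }, then

\[
\frac{\partial p(u_{is}\mid x_{i},\theta )}{\partial \theta }
=p(u_{is}\mid x_{i},\theta )\,\frac{u_{is}-x_{i}^{\prime }\theta }{\sigma ^{2}}\,x_{i},
\]

so the derivative of the simulator is obtained by multiplying each existing term by \(\frac{u_{is}-x_{i}^{\prime }\theta }{\sigma ^{2}}x_{i}\), with no further evaluations of \(\widetilde{f}\).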
Note that these properties suggest that importance sampling can also be helpful in computing the derivatives necessary to estimate standard errors. With discontinuous objective functions (especially those with flats), calculating these derivatives can be unreliable. This is not a problem when using an importance sampled simulator.
5.3 Choice of g and simulation error
As mentioned earlier, one traditional use of importance sampling is to reduce the variance of simulation estimators. An appropriate choice of g can accomplish this goal. Unfortunately, if one is not careful, importance sampling can also dramatically increase the variance of simulation estimators. When performing the above change of variables and importance sampling procedure, one needs to be very aware of this issue. Unlike standard pure frequency simulators, where the simulation error is naturally bounded, this is not necessarily the case with importance sampling simulators. This can result in large amounts of simulation error if one is not careful.
Perhaps the most obvious choice for g is p itself at some arbitrary initial parameter vector θ ^{ init }, i.e. \(g(u_{i}\mid x_{i})=p(u_{i}\mid x_{i},\theta ^{init})\). This importance sampling simulator is then identical to the pure frequency simulator when evaluated at θ = θ ^{ init }. Of course, θ ^{ init } is generally not going to equal the true parameter vector θ _{0}. And with g based on θ ^{ init }, the effect of simulation error on estimates can be quite large if θ _{0} is far away from θ ^{ init }. This can be a significant issue in practice, both for efficiency, and for the reliability of the nonlinear search over θ. We have a few informal suggestions for minimizing these problems, but a general point to remember is that one needs to be much more careful with importance sampling simulators than with standard simulators because of these issues.
A first suggestion is to iterate the entire importance sampling estimation procedure multiple times. In other words, set g _{1}(u _{ i }∣x _{ i }) \( =p(u_{i}\mid x_{i},\theta ^{init})\) at some exogenously chosen θ ^{ init }, use the resulting simulator to form a simulated objective function, and optimize to obtain an estimate \(\widehat{\theta }^{1}\). Then iterate the entire estimation procedure by creating a new importance sampling distribution, \(g_{2}(u_{i}\mid x_{i})=p(u_{i}\mid x_{i},\widehat{ \theta }^{1})\), taking new simulation draws, and reestimating to obtain a new estimate \(\widehat{\theta }^{2}\).^{25} This iterating can obviously be continued. Each of \(\widehat{\theta } ^{1}\), \(\widehat{\theta }^{2}\), \(\widehat{\theta }^{3}\),... is a consistent estimator of θ _{0}. But the hope is that after one (or multiple) iterations, the g distribution will be closer to p(u _{ i }∣x _{ i },θ _{0}), and the simulation error in the estimates will be smaller. Of course, there is a computational cost to iterating, as the “complicated” functions \(\widetilde{f}\) need to be recomputed each time g is “reinitialized” and new draws are taken. But as we show in our Monte Carlo experiments, computational burden is still far less than that with a straightforward simulator.
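The iteration can be sketched as follows. The setup is a hypothetical toy problem, not the paper's Monte Carlo design: u is N(θ, 1), `f_tilde` is a cheap smooth function standing in for the expensive model, a single moment is matched, and a crude local grid search replaces a proper optimizer.

```python
import math
import random

def normal_pdf(u, mean):
    return math.exp(-0.5 * (u - mean) ** 2) / math.sqrt(2.0 * math.pi)

def f_tilde(u):
    """Hypothetical expensive model solution, here a simple smooth function."""
    return math.tanh(u)

def estimate_once(theta_g, target, rng, S=1000):
    """One full pass: draw from g = p(. | theta_g), solve f_tilde, then search over theta."""
    u = [rng.gauss(theta_g, 1.0) for _ in range(S)]
    g = [normal_pdf(v, theta_g) for v in u]
    f = [f_tilde(v) for v in u]                    # recomputed once per iteration

    def objective(theta):
        sim = sum(fv * normal_pdf(v, theta) / gv for fv, v, gv in zip(f, u, g)) / S
        return (sim - target) ** 2

    grid = [theta_g - 1.0 + 0.02 * k for k in range(101)]   # crude local grid search
    return min(grid, key=objective)

rng = random.Random(4)
theta_0 = 1.5
# "Data" moment at the true parameter, approximated with many direct draws.
target = sum(f_tilde(rng.gauss(theta_0, 1.0)) for _ in range(100000)) / 100000

theta_hat = 0.0                       # theta_init, deliberately far from theta_0
path = [theta_hat]
for _ in range(4):                    # g_1, g_2, ... re-centred at the previous estimate
    theta_hat = estimate_once(theta_hat, target, rng)
    path.append(theta_hat)
```

Each pass re-solves `f_tilde` only S times, and the importance sampling density moves toward the region of the parameter space where the estimate lies, so later passes reweight draws by less.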
There are some unanswered questions regarding such an iteration process that are beyond the scope of this paper, e.g. Is this iteration process guaranteed to converge starting at any θ ^{ init }?; Does it converge to the same \(\widehat{\theta }^{\infty }\) regardless of θ ^{ init }?;^{26} What are the asymptotic properties of \(\widehat{ \theta }^{\infty }\)? Another question is how to appropriately compute standard errors of such an estimator. Note that an estimated variance of \( \widehat{\theta }^{n}\) based on standard SML or MSM formulas is not exactly right, since it ignores the variation in the importance sampling density g _{ n } due to the estimation of \(\widehat{\theta }^{n-1}\) from the prior iteration.
A second suggestion is to use importance sampling only for the derivative calculations within a derivative based search. Suppose one starts the nonlinear search at θ ^{ init }. One could use importance sampling (with importance sampling density \(g(u_{i}\mid x_{i})=p(u_{i}\mid x_{i},\theta ^{init})\)) to compute the derivatives of the objective function without re-solving \(\widetilde{f}\). Then for the “step” of the derivative based procedure, one could use standard simulators (this would require re-solving \(\widetilde{f}\)). After the step, at the new θ ^{′}, importance sampling could again be used to compute derivatives (using \(g(u_{i}\mid x_{i})=p(u_{i}\mid x_{i},\theta ^{\prime })\) ). This could then be repeated. This is probably the most conservative way to use the importance sampling idea, since at every θ, one is using an importance sampling density g that is identical to p at that θ. Of course, there is also a higher computational burden, as the time savings only apply to the derivative calculations.^{28} There is also an important caveat that such a search is not guaranteed to converge. The reason is that the derivative information is not exactly right. More precisely, in this procedure, one is optimizing the objective function based on \(\widehat{Ef_{i}}(\theta )\), but derivative information is coming from the objective function based on \(\widetilde{Ef}_{i}(\theta )\). These derivatives will be similar (they both converge to the true derivative of Ef _{ i }(θ) as S→ ∞), but are not numerically equivalent.
A third set of possibilities comes from the importance sampling literature. As noted by Geweke (1989), among others, it can help for an importance sampling density to have thick tails. This can prevent high levels of simulation error. Intuitively, a wide g means that the initial set of u ^{ s } points are spread out; thus they can be weighted to approximate behavior at a wider range of θ. One way to pick a wide g is to base g on a θ ^{ init } where the variance parameters are set relatively large. In the iterative procedures above, one might want to artificially inflate the variance-related parameters when choosing g’s.
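The trade-off in choosing a wide g can be illustrated numerically. The sketch below is not from the paper: it assumes a scalar u with p(u | θ) = N(θ, 1) and g = N(0, g_sd²), and it summarizes weight quality with the effective sample size \((\sum_{s}w_{s})^{2}/\sum_{s}w_{s}^{2}\), a common diagnostic for importance sampling weights that the paper does not itself use.

```python
import math
import random


def normal_pdf(u, mean, sd):
    """Density of N(mean, sd^2) at u."""
    z = (u - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def effective_sample_size(theta, g_sd, n_draws=5000, seed=0):
    """Effective sample size of the weights p(u | theta) / g(u).

    Draws come from g = N(0, g_sd^2); the target is p = N(theta, 1).
    Values near n_draws mean the weights are nearly equal; small values
    mean a few draws dominate, i.e. high simulation error.
    """
    rng = random.Random(seed)
    draws = [rng.gauss(0.0, g_sd) for _ in range(n_draws)]
    weights = [normal_pdf(u, theta, 1.0) / normal_pdf(u, 0.0, g_sd)
               for u in draws]
    return sum(weights) ** 2 / sum(w * w for w in weights)


for theta in (0.0, 2.0):
    narrow = effective_sample_size(theta, g_sd=1.0)
    wide = effective_sample_size(theta, g_sd=2.0)
    print(f"theta={theta}: narrow g -> {narrow:.0f}, wide g -> {wide:.0f}")
```

In this toy case the narrow g is best when θ equals the value at which g was built (all weights equal one), while the wide g degrades far less when θ moves away from it. That is the sense in which inflating variance parameters can protect against simulation error over a wider range of θ.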
Lastly, note that if one is particularly concerned with these issues, one can alternatively use “importance sampled objective functions” simply as a numeric tool to get “close” to the parameter estimates. The idea is to start by optimizing an objective function based on \(\widetilde{Ef}_{i}(\theta )\) that is easy to compute (perhaps updating g occasionally as suggested above), but eventually switch to an objective function based on a more standard simulator, e.g. \(\widehat{Ef_{i}}(\theta )\). Note that there are no implications of doing this on estimated standard errors, since the results using \(\widetilde{Ef}_{i}(\theta )\) are only used as starting values for optimizing based on \(\widehat{Ef_{i}}(\theta )\).
5.4 Comparison to discretization/randomization approaches
Keane and Wolpin (1994) and Rust (1997) (KW/R) suggest using randomization techniques to approximate \(V(p_{it},\nu _{it},c_{it-1},\left\{ \alpha _{ij}\right\} _{j=1}^{J},\beta _{i},\gamma _{i})\). Instead of discretizing the arguments of V in a deterministic way, one randomly chooses K points at which to approximate the value function (or alternative-specific value functions). After using such an approach to approximate V, simulation estimation can proceed by taking simulation draws \((\left\{ \alpha _{ijs}\right\} _{j=1}^{J},\beta _{is},\gamma _{is})\) (conditional on θ) to simulate the likelihood. Again, since these simulation draws will generally not equal the random points at which the value function has been approximated, one needs to use additional approximation (e.g. interpolation, polynomial approximation) in \((\left\{ \alpha _{ij}\right\} _{j=1}^{J},\beta _{i},\gamma _{i})\) space to calculate the simulated likelihood.
Note that there are two sources of simulation error in the KW/R approach. One source comes from the random draws of the K points at which to solve V, and one source comes from the random draws of \((\left\{ \alpha _{ijs}\right\} _{j=1}^{J},\beta _{is},\gamma _{is})\) in computing the simulated likelihood function. One can think about the importance sampling approach as a modification of the KW/R approach that 1) only has one source of simulation error, and 2) doesn’t require additional approximation in \((\left\{ \alpha _{ij}\right\} _{j=1}^{J},\beta _{i},\gamma _{i})\) space. The importance sampling approach does this because it allows one to hold the draws \((\left\{ \alpha _{ijs}\right\} _{j=1}^{J},\beta _{is},\gamma _{is})\) constant regardless of the value of θ. Hence, the importance sampling approach can use the same set of simulation draws \(\left\{ (\left\{ \alpha _{ijs}\right\} _{j=1}^{J},\beta _{is},\gamma _{is})\right\} _{s=1}^{S}\) both to solve for V and to simulate the likelihood. Thus, there is only one source of simulation error. This also implies there is no additional approximation necessary in \((\left\{ \alpha _{ij}\right\} _{j=1}^{J},\beta _{i},\gamma _{i})\) space, since the points used in simulating the likelihood are exactly the points where V has been solved.^{29} Lastly, note that with the importance sampling approach, standard results on Monte-Carlo integration imply that the importance sampling approach breaks the “curse of dimensionality” in the dimension of \((\left\{ \alpha _{ij}\right\} _{j=1}^{J},\beta _{i},\gamma _{i})\).^{30}
5.5 Relation to Keane and Wolpin (2000, 2001)
Independently, in two empirical papers, Keane and Wolpin use an importance sampling procedure that is related to ours in order to solve problems of unobserved state variables. These papers analyze dynamic programming problems of educational choice (Keane and Wolpin 2001) and fertility/marriage choice (Keane and Wolpin 2000). In the first paper, where individuals’ schooling, work, and savings decisions are analyzed over a lifetime, a significant problem is that assets (a state variable) are not observed in some years of the data (there are other state variables, choice variables, and initial conditions, e.g. schooling and hours worked, that are also occasionally unobserved). To estimate this using standard methods would be exceedingly complex, as one would need to integrate out over very complicated conditional distributions of the missing data.
Their approach starts by simulating S unconditional (i.e. there are no predetermined variables) outcome paths—these are what they call their “simulated paths”. To create each of these paths, one needs to solve the simulated agent’s dynamic programming problem. If all outcome variables were discrete, one could in theory compute the likelihood for observation i by the proportion of “simulated paths” that match observation i’s path. Practically, since there are so many possible paths (and since some of the outcome variables are continuous), this would result in likelihood zero events. To mitigate this problem, Keane and Wolpin add measurement error to all outcome variables. This gives any observed path a positive likelihood and allows for estimation using SML.
What is similar to our paper is the fact that Keane and Wolpin use importance sampling while searching over θ. This means that as they change θ, there is no need to draw new simulated paths. Instead, one only needs to compute the likelihood of the original simulated paths at the new θ. This likelihood is much simpler than the original problem since the simulated paths have no missing data. The importance sampling also smooths the likelihood function in θ. However, unlike our procedure, it generally does require re-solving S dynamic programming problems when θ changes.
6 Monte-Carlo results
We chose N = 500 and T = 20. We consider the discount factor to be known. In fact, for computational reasons, we set the discount factor equal to 0, i.e. we assume consumers are myopic. Because this implies that there is no dynamic programming problem, it allows us to do more Monte-Carlo repetitions than would otherwise be possible. Of course, this implies that actual measures of computational time are not particularly relevant. So the way we compare computational time across different procedures is by a simple count of the number of times the dynamic programming problem would need to have been solved within an estimation procedure (if in fact there was one to be solved). Given we are using this metric, we see no obvious reason why the relative performance of the various estimation algorithms would differ if the discount factor were nonzero, but it is certainly possible.
For our importance sampling densities g, we use p’s evaluated at some initial \(\theta ^{init}=\Big( \Big\{ \theta _{j}^{0}\Big\} _{j=1}^{J+2},\Big\{ \theta _{j}^{1}\Big\} _{j=1}^{J+2},\Big\{ \sigma _{j}\Big\} _{j=1}^{J+2}\Big) \). As discussed above, choice of this θ ^{ init } may be quite important. We do two separate experiments. In the first experiment, we use “good” starting values. Across Monte-Carlo replications, starting values are randomly drawn from U(0,1) distributions centered at the true parameters. In this experiment, we only run the importance sampling estimation routine once for each replication.
In the second experiment, we use “bad” starting values. The θ ^{0} and θ ^{1} parameters are drawn from U( − 5,5), and the σ parameters are drawn from U(0.5,10.5).^{31} In this experiment, we iterate the importance sampling estimation routine as described in Section 5.3. At the end of each estimation routine iteration, we construct a new importance sampling density g, based on p at the current estimates, for use in the next estimation routine. Thus at each restart, we need to re-solve the “dynamic programming problems”. We iterated the estimation routine in this way until it “converged” up to a numeric tolerance. Interestingly, it always converged, and the average number of estimation iterations was 17.24.
We compare our importance sampling routine to SML simulation based on a standard simulated likelihood Eq. 5. In the standard SML routine, we set S = 30, i.e. 30 simulation draws per observation i. This means that there are a total of N ∗ S = 15000 simulation draws and that the “dynamic programming problem” needs to be solved 15000 times for each likelihood function evaluation. In the importance sampling routine, we also use a total of 15000 draws. However, recall that with the importance sampling routine, if one chooses the same importance sampling density g across observations, the same simulation draws can be used for all observations. Since this seems to be the most efficient use of the draws, we do this—i.e. we use the same g(u _{ i }) and the same 15000 draws for all observations.^{32} This also requires 15000 “dynamic programming problem” solutions, but unlike the standard SML routine, these solutions do not have to be recalculated as θ changes.
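The construction of a common g and a shared set of draws can be sketched as follows. This is an illustration under stated assumptions, not the code used for the experiments: the model is a toy one in which u is scalar with p(u | x, θ) = N(θ0 + θ1·x, σ²), the sample has 50 observations rather than 500 to keep the run short, and the function names are hypothetical. The default f(u) = u stands in for the solved model \(\widetilde{f}\).

```python
import math
import random


def normal_pdf(u, mean, sd):
    """Density of N(mean, sd^2) at u."""
    z = (u - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def p_density(u, x, theta):
    """Toy p(u | x, theta): u ~ N(theta0 + theta1 * x, sigma^2)."""
    theta0, theta1, sigma = theta
    return normal_pdf(u, theta0 + theta1 * x, sigma)


def draw_from_mixture(xs, theta_init, draws_per_obs, rng):
    """Take draws_per_obs draws from p(. | x_i, theta_init) for each x_i."""
    theta0, theta1, sigma = theta_init
    return [rng.gauss(theta0 + theta1 * x, sigma)
            for x in xs for _ in range(draws_per_obs)]


def mixture_density(u, xs, theta_init):
    """g(u): equal-probability mixture of the p(. | x_i, theta_init)."""
    return sum(p_density(u, x, theta_init) for x in xs) / len(xs)


rng = random.Random(0)
xs = [rng.random() for _ in range(50)]       # observed covariates (toy data)
theta_init = (0.0, 1.0, 1.0)
draws = draw_from_mixture(xs, theta_init, 30, rng)
# g depends on neither theta nor the observation, so it is evaluated once.
g_vals = [mixture_density(u, xs, theta_init) for u in draws]


def simulated_mean(x_i, theta, f=lambda u: u):
    """Importance-sampled E[f(u) | x_i, theta] using the shared draws."""
    total = 0.0
    for u, g_val in zip(draws, g_vals):
        total += f(u) * p_density(u, x_i, theta) / g_val
    return total / len(draws)


print(simulated_mean(xs[0], theta_init), theta_init[0] + theta_init[1] * xs[0])
```

In an application, f would be evaluated once at each shared draw and stored; changing θ or moving to another observation then only changes the weights.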
Monte-Carlo results
Column 1: Standard SML. Column 2: Importance Sampling, “Good” Start Values. Columns 3 to 5: Importance Sampling, “Bad” Start Values (3: 1st Iteration; 4: 3rd Iteration; 5: Convergence).

| Parameters | Truth | (1) Mean | (1) SD | (2) Mean | (2) SD | (3) Mean | (3) SD | (4) Mean | (4) SD | (5) Mean | (5) SD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Intercepts | | | | | | | | | | | |
| \(\uptheta^0_1\) | 0 | 0.2366 | 0.0889 | 0.0014 | 0.1284 | −0.4293 | 1.1052 | −0.0685 | 0.3438 | 0.0930 | 0.1252 |
| \(\uptheta^0_2\) | 0 | 0.2492 | 0.0847 | 0.0028 | 0.1270 | −0.2865 | 1.2800 | −0.0593 | 0.3745 | 0.0909 | 0.1308 |
| \(\uptheta^0_3\) | 0 | 0.2452 | 0.0820 | −0.0001 | 0.1245 | −0.2833 | 1.3319 | −0.0560 | 0.4202 | 0.0769 | 0.1355 |
| \(\uptheta^0_4\) | 0 | 0.2219 | 0.0767 | −0.0194 | 0.1204 | −0.3684 | 1.3072 | −0.0675 | 0.3535 | 0.0801 | 0.1304 |
| \(\uptheta^0_5\) | 0 | 0.2520 | 0.0895 | 0.0106 | 0.1243 | −0.5726 | 1.1853 | −0.0825 | 0.3836 | 0.0927 | 0.1322 |
| \(\uptheta^0_6\) | 0 | 0.2313 | 0.0833 | −0.0233 | 0.1245 | −0.5034 | 1.2058 | −0.0695 | 0.3405 | 0.0974 | 0.1218 |
| \(\uptheta^0_7\) | 0 | 0.2470 | 0.0780 | 0.0134 | 0.1332 | −0.4626 | 1.0521 | −0.0738 | 0.3234 | 0.0980 | 0.1295 |
| \(\uptheta^0_8\) | 0 | 0.2308 | 0.0953 | −0.0148 | 0.1120 | −0.3787 | 1.1953 | −0.0743 | 0.3517 | 0.0924 | 0.1235 |
| \(\uptheta^0_9\) | 0 | −0.0133 | 0.0628 | −0.0050 | 0.0516 | −0.0043 | 0.2904 | −0.0056 | 0.0492 | −0.0032 | 0.0474 |
| \(\uptheta^0_{10}\) | 0 | 0.3942 | 0.0672 | 0.0743 | 0.0732 | −0.4345 | 0.4803 | −0.0169 | 0.0793 | 0.1068 | 0.0645 |
| Slopes | | | | | | | | | | | |
| \(\uptheta^1_1\) | 1 | 0.8559 | 0.0662 | 0.8391 | 0.1066 | 1.4697 | 0.5612 | 0.9664 | 0.1083 | 0.8498 | 0.0924 |
| \(\uptheta^1_2\) | 1 | 0.8414 | 0.0641 | 0.8250 | 0.0976 | 1.3624 | 0.5364 | 0.9327 | 0.1302 | 0.8247 | 0.0896 |
| \(\uptheta^1_3\) | 1 | 0.8473 | 0.0702 | 0.8296 | 0.0992 | 1.3767 | 0.5923 | 0.9284 | 0.1461 | 0.8388 | 0.0800 |
| \(\uptheta^1_4\) | 1 | 0.8569 | 0.0700 | 0.8546 | 0.0899 | 1.3811 | 0.5569 | 0.9466 | 0.1115 | 0.8489 | 0.0845 |
| \(\uptheta^1_5\) | 1 | 0.8521 | 0.0673 | 0.8270 | 0.1024 | 1.4750 | 0.5597 | 0.9477 | 0.1275 | 0.8314 | 0.0825 |
| \(\uptheta^1_6\) | 1 | 0.8525 | 0.0670 | 0.8506 | 0.0921 | 1.4863 | 0.6070 | 0.9572 | 0.1229 | 0.8389 | 0.0870 |
| \(\uptheta^1_7\) | 1 | 0.8460 | 0.0717 | 0.8117 | 0.0920 | 1.4566 | 0.5093 | 0.9626 | 0.1017 | 0.8338 | 0.0853 |
| \(\uptheta^1_8\) | 1 | 0.8483 | 0.0566 | 0.8444 | 0.0892 | 1.4197 | 0.5169 | 0.9401 | 0.1124 | 0.8339 | 0.0829 |
| \(\uptheta^1_9\) | 1 | 0.8165 | 0.0669 | 0.9372 | 0.0524 | 1.5056 | 0.2489 | 1.0168 | 0.0658 | 0.9482 | 0.0522 |
| \(\uptheta^1_{10}\) | 1 | 0.8971 | 0.0648 | 0.9017 | 0.0835 | 1.4227 | 0.3361 | 0.9539 | 0.0818 | 0.9115 | 0.0735 |
| \(\upsigma\) Terms | | | | | | | | | | | |
| \(\upsigma_1\) | 1 | 0.5312 | 0.2243 | 0.9845 | 0.1162 | 3.4664 | 1.1010 | 1.2305 | 0.1508 | 0.9411 | 0.1033 |
| \(\upsigma_2\) | 1 | 0.5227 | 0.2303 | 0.9694 | 0.1179 | 3.2377 | 1.1721 | 1.1783 | 0.2837 | 0.9118 | 0.2174 |
| \(\upsigma_3\) | 1 | 0.4877 | 0.2281 | 0.9727 | 0.1174 | 3.2143 | 1.1595 | 1.1630 | 0.3458 | 0.9064 | 0.2840 |
| \(\upsigma_4\) | 1 | 0.5411 | 0.2094 | 0.9893 | 0.1042 | 3.2114 | 1.2027 | 1.1904 | 0.2809 | 0.9269 | 0.1889 |
| \(\upsigma_5\) | 1 | 0.5340 | 0.1895 | 0.9743 | 0.1163 | 3.4441 | 1.0813 | 1.2049 | 0.1562 | 0.9284 | 0.0962 |
| \(\upsigma_6\) | 1 | 0.5277 | 0.1960 | 0.9953 | 0.1055 | 3.4007 | 1.2647 | 1.1941 | 0.2721 | 0.9150 | 0.2138 |
| \(\upsigma_7\) | 1 | 0.5026 | 0.2106 | 0.9537 | 0.1238 | 3.3815 | 0.9892 | 1.2218 | 0.1340 | 0.9326 | 0.0900 |
| \(\upsigma_8\) | 1 | 0.5294 | 0.1869 | 0.9898 | 0.1072 | 3.3104 | 1.0056 | 1.2174 | 0.1304 | 0.9322 | 0.0929 |
| \(\upsigma_9\) | 1 | 0.8938 | 0.0502 | 0.9404 | 0.0434 | 1.6888 | 0.5982 | 0.9739 | 0.3635 | 0.8938 | 0.3268 |
| \(\upsigma_{10}\) | 1 | 0.8478 | 0.0875 | 0.9607 | 0.0916 | 2.4352 | 0.6084 | 1.0619 | 0.2229 | 0.9426 | 0.2027 |
| # of Dynamic Program Solutions | | 45.6 million | | 15000 | | 15000 | | 45000 | | 258600 | |
| Average “True” LnLikelihood at Starting Values | | −15609.24 | | −15609.24 | | −28551.32 | | − | | − | |
| Average “True” LnLikelihood at Estimates | | −15303.15 | | −15120.62 | | −16862.99 | | −15133.09 | | −15081.79 | |
In the second column are results from the importance sampling routine using the “good” starting values. In this case we do not iterate the estimation routine, so each estimation only requires 15000 “dynamic programming solutions” (which compared to 45.6 million is a speed improvement of over 3000 times). The parameter results also illustrate some small sample biases, though in most cases not nearly as big as in Column 1. Interestingly, the comparison of standard deviations of the estimated parameters across Monte-Carlo repetitions is ambiguous between Columns 1 and 2. There is no clear winner: in some cases the Column 1 estimates are less variable, while in other cases the Column 2 estimates are.
The last row of the table computes the average (across Monte-Carlo repetitions) “true” likelihood, evaluated at the point estimates from the two procedures. To compute the “true” likelihood, we simply used a standard simulator with far more simulation draws, i.e. 10000 per observation. Less negative values of this imply that the estimation procedure has found a “better” parameter vector according to the “true” likelihood. On average, the standard SML procedure generates a point estimate with a “true” likelihood of −15303.15, while the importance sampling procedure on average generates a point estimate with a “true” likelihood of −15120.62. So the importance sampling procedure is not only far quicker, but by using far more draws also seems to be producing better estimates.
Columns 3 through 5 run the importance sampling estimator using the “bad” starting values. Here we iterated the estimation routine until convergence. In Column 3 we present results after the first iteration, in Column 4 we present results after the third iteration, and in Column 5 we present results after convergence (which took an average of 17.24 iterations). Since the “dynamic programming solutions” need to be re-solved at each iteration, the number of “dynamic programming solutions” necessary for each column is just the number of estimation iterations times 15000.
Column 3 indicates how important starting values are if one is going to run the importance sampling routine only once. The estimates are very imprecise and highly biased in comparison to even Column 1. The average “true” likelihood at the estimates is also much worse, −16862.99. One needs to keep this in mind when using this procedure.
However, iterating seems to make these biases disappear quite quickly. In Column 4, after the 3rd iteration, the biases seem to be at approximately the same level as the biases in Column 1, and the average “true” likelihood is considerably better, −15133.09. At convergence in Column 5, the biases seem quite low. The average “true” likelihood at convergence is −15081.79, better than all the other results, including the importance sampled estimates using the “good” starting values (though it is interesting that in some dimensions, Column 1 or Column 2 perform better). Compared to the standard SML approach in Column 1, the converged results in Column 5 require approximately 0.006 times the number of the “dynamic programming solutions”. So even at the bad starting values, the iterated importance sampling method is both quicker and produces what are arguably better estimates.
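The solution counts behind these comparisons can be reproduced with simple arithmetic. In the sketch below, the figure of 3040 function evaluations for the standard SML routine is taken from footnote 33 and is consistent with the 45.6 million reported in the table; the variable names are illustrative only.

```python
N_OBS = 500
DRAWS_PER_OBS = 30
SML_FUNCTION_EVALS = 3040         # from footnote 33; matches 45.6 million below
AVG_ITERATIONS_BAD_START = 17.24  # average iterations to convergence (Column 5)

draws_total = N_OBS * DRAWS_PER_OBS                    # 15000 draws in total

# Standard SML re-solves the model at every draw for every function evaluation.
solves_standard_sml = draws_total * SML_FUNCTION_EVALS

# Importance sampling solves once per draw per estimation iteration.
solves_is_good_start = draws_total                     # Column 2: one iteration
solves_is_third_iter = draws_total * 3                 # Column 4: three iterations
solves_is_converged = round(draws_total * AVG_ITERATIONS_BAD_START)  # Column 5

speedup_good_start = solves_standard_sml / solves_is_good_start
ratio_converged = solves_is_converged / solves_standard_sml

print(solves_standard_sml)      # 45600000
print(speedup_good_start)       # 3040.0
print(solves_is_converged)      # 258600
print(round(ratio_converged, 4))
```

The last ratio is a little under 0.006, in line with the approximate figure quoted above.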
7 Conclusion
This paper suggests a new use of importance sampling to reduce computational burden in simulation estimation of complicated models in economics and marketing. We show that combining a change of variables with importance sampling can reduce computational time by dramatically reducing the number of times that a complicated model needs to be solved or simulated in an estimation procedure. The technique is applicable to a wide range of models, including single or multiple agent dynamic programming problems, and complicated equilibrium problems such as discrete games or auction models. The technique is particularly amenable to allowing considerable amounts of unobserved heterogeneity in one’s model. We hope that this technique allows researchers to estimate models that allow for more unobserved heterogeneity, and, more generally, more realistic models. The technique is not without caveats though. In particular, special care must be taken, since misuse of importance sampling can potentially generate high levels of simulation error.
Footnotes
 1.
There are at least two other sets of work that address similar computationally burdensome equilibrium models. Imai et al. (2009) and Norets (2009) suggest computational techniques for Bayesian estimation of dynamic programming models. Judd and Su (2008) suggest treating computationally burdensome functions as the constraints of a numeric nonlinear programming approach. Computational burden can be decreased with this approach because the equilibrium constraints and the optimization problem are being solved simultaneously (in contrast to a “nested” fashion, which is typically computationally inefficient).
 2.
Note that the vector y _{ i } can contain higher order moments of the outcome variables (e.g. \(y_{i}^{2}, y_{1i}y_{2i}\), etc.).
 3.
Another nice property of these estimators is that the extra variance imparted on the estimates due to the simulation is relatively small—asymptotically it is 1/S. This means, e.g., that if one uses just 10 simulation draws, simulation increases the asymptotic variance of the parameter estimates by just 10%. It is important to note that this property will not hold for the importance sampling procedure suggested here.
 4.
For example suppose u(x _{ i },ε _{ i },θ) = f(x _{ i },θ) + ε _{ i } and that ε _{ i } is multivariate normal. Then the distribution of u _{ i } is also multivariate normal, and computation of p is trivial. Computation of p is also trivial for more flexible distributions, e.g. mixtures of normals.
 5.
Obviously, one cannot adjust standard errors for the importance sampler by simply multiplying the normal GMM variance formulas (i.e. ignoring simulation error) by \(1+\frac{1}{S}\). Instead, to adjust the standard errors, one would need to formally estimate the variance in \(\widetilde{Ef} _{i}(\theta )\). This can be done using the variation in \(\widetilde{f} (u_{is})\frac{p(u_{is}\mid x_{i},\theta )}{g(u_{is}\mid x_{i})}\) across simulation draws.
 6.
As with the MSM version, if one uses the same g for all observations, one can use the same simulation draws for all observations. The only caveat is that in the SML case, f depends on y _{ i }. As a result, one would still need to compute S ∗ N different f’s. However, it is often the case that computing f(y _{ i } | u _{ is }) for different y _{ i } (holding u _{ is } constant) is relatively easy. In a dynamic discrete choice problem, the solution to the dynamic programming problem only depends on u _{ is }, not the realization of y _{ i }. Thus, computing f(y _{ i } | u _{ is }) for different y _{ i } (holding u _{ is } constant) does not require re-solving the dynamic programming problem.
 7.
At this “faster than \(\sqrt{N}\)” rate, the asymptotic variance of the estimates is the same as the asymptotic variance of estimates if one could compute the integrals analytically (i.e. without simulating). Hence, standard MLE variance formulas can be used.
 8.
Gourieroux and Monfort (1991) show that if one uses the same simulation draws across observations, one needs S to increase at a faster rate (faster than N).
 9.
Using the “alternative specific” value function methodology of Rust (1987), this is made considerably easier by the i.i.d. logit assumption on the ν _{ ijt }. It becomes even easier if one assumes consumers believe prices follow an i.i.d. process over time.
 10.
This is just a simple example. One can easily use nonlinear index functions, or more flexible distributions, e.g. mixtures of normals.
 11.
As discussed at the end of Section 5.1, our technique has the benefit that one can run many alternative specifications (more specifically, alternative specifications with different sets of x _{ ij }’s) without having to re-solve dynamic programming problems.
 12.
Note that this likelihood ignores potential initial conditions problems (i.e. c _{ i0} is assumed constant). Pantano (2008) suggests a clever way to model initial conditions problems while using this importance sampling simulator.
 13.
If one wanted to ensure that costs are positive, one could use an alternative specification such as c(q _{ ij }) = (h _{1}(βx _{ ij } + ε _{ ij }) + h _{2}(αx _{ ij } + η _{ ij })q _{ ij })q _{ ij } where the h functions have only positive (but constant) support, e.g. exponential functions.
 14.
It is a perfect information game in the sense that all firms observe the unobservables of all other firms (as well as the market specific shock).
 15.
This normalization is different than what might typically be used (e.g. that the variance of one of the unobservables equals one) but is an identical model given that own profits depend negatively on other firms’ number of stores. Interestingly, this alternative normalization helps the model satisfy Property (CS). Bajari et al. (2009b) use a similar normalization in their application of the importance sampling simulator. This illustrates that when using the importance sampling simulator, it may be beneficial to carefully consider choice of normalization.
 16.
Note that u satisfies Property (CS). The support of the first two sets of elements is the real line, the support of the last element is the positive real line. One could also easily restrict the reservation values to be positive if one was so inclined.
 17.
p _{ C } is a log normal distribution and p _{ A } and p _{ B } are multivariate normal distributions.
 18.
A similar situation would arise in Example 2 if, e.g., \(\sigma _{\eta }^{2}=0 \), or in Example 3 if the variance of the private values were the same across auctions (i.e. \(\sigma _{\lambda }^{2}=0\)).
 19.
The simulator in this case would be \(\widetilde{L}_{i}=\frac{1}{S}\sum_{s} \widetilde{f}(c_{i}\mid p_{i},u_{is},\theta _{1})\frac{p(u_{is}\mid x_{i},\theta _{2})}{g(u_{is}\mid x_{i})}\), so changes in θ _{2} are adjusted for with importance sampling weights, while changes in θ _{1} are adjusted for with changes in \(\widetilde{f}\).
 20.
Obviously, the functional form is chosen to restrict the discount factor between 0 and 1.
 21.
And as noted previously, there may be large amounts of simulation error if the variance approaches 0. In practice, one should be careful to watch for these variances (e.g. σ _{ δ }) approaching zero during estimation. If they do, it may be best to switch to the alternative approach suggested next, i.e. applying the importance sampling approach to only a subset of the parameters.
 22.
This statement ignores two important caveats. First, additional unobserved heterogeneity might create identification problems (we ignore these in this paper by simply assuming identification). Second, increased dimensionality of the unobserved heterogeneity may generate higher levels of simulation error.
 23.
The 1999 working paper version of this work contained a number of more elaborate examples of how importance sampling can be used to smooth even very complicated economic models. For a copy please consult the author.
 24.
Note that Example 1 is not a good example of this smoothing property because the likelihood function there is already smooth due to the analytically integrated logit errors.
 25.
To make full use of past solutions of \(\widetilde{f}\), one could actually use both the old draws (from g _{1}(u _{ i } | x _{ i })) and the new draws (from g _{2}(u _{ i } | x _{ i })) when the estimation procedure is iterated. In this case, the g for the full set of draws would be a mixture of g _{1} and g _{2}. More generally, at the tth iteration, one could use draws (and \(\widetilde{f}\) solutions) from all past iterations.
 26.
In our Monte-Carlo experiments, we (very) casually investigated this and did find that the iterations always converged to the same parameter vector for a wide range of θ ^{ init }. But this is obviously far from a proof: this is only one example, and Monte-Carlo generated data may be better behaved than actual data.
 27.
Of course, one would prefer to evaluate the denominator at θ rather than θ ^{ init }. But doing this would require re-solving \(\widetilde{f }\).
 28.
This method is also convenient when a subset of the parameters do not satisfy Property (CS). One simply needs to recompute the f functions for numeric perturbations of the parameters in the subset.
 29.
If the other state variables (e.g. p _{ it },ν _{ it }, and c _{ it − 1}) are continuous, either approach would generally need approximations in these dimensions. Of course, if the v _{ it }’s (or p _{ it }) are i.i.d., they can be removed from the “effective” state space by working with alternative specific value functions (Rust 1987).
 30.
In contrast, it is not clear that the standard KW/R approach breaks the curse of dimensionality in \((\left\{ \alpha _{ij}\right\} _{j=1}^{J},\beta _{i},\gamma _{i})\) space. Because \((\left\{ \alpha _{ij}\right\} _{j=1}^{J},\beta _{i},\gamma _{i})\) are constant over time, their transition density does not satisfy Rust’s (1997) Assumption (A4). Of course, as proven by Rust, the KW/R approach can break the curse of dimensionality in state variables that evolve stochastically (and smoothly) over time. So, for example, if the KW/R approach is used for the state variable p _{ it }, and if the importance sampling simulator is used for \((\left\{ \alpha _{ij}\right\} _{j=1}^{J},\beta _{i},\gamma _{i})\); then Example 1 has no curse of dimensionality as the number of products J increases (since c _{ it − 1} takes a finite set of values and v _{ ijt }’s are i.i.d. logit errors).
 31.
The choice for the σ parameters was governed by 1) the fact that σ must be ≥ 0, and 2) the discussion above suggesting that in practice, it is probably better to choose larger values for the importance sampling densities of variance-related parameters.
 32.
The way we construct a reasonable g that is the same across observations is to define g(u _{ i }) to be a mixture (with equal probabilities) of the 500 \(p\big(u_{i}\big|x_{i},\theta ^{init}\big)\) distributions (i.e. each of these 500 distributions depends on one of the x _{ i }’s in the sample). More precisely we use the mixture distribution:
$$ g(u_{i})=\frac{1}{500}\sum_{i}p\big(u_{i}\big|x_{i},\theta ^{init}\big) $$
A simple way to take 15000 draws from this mixture distribution is to simply take 30 draws from \(p(u_{i}\mid x_{i},\theta ^{init})\) for each of the 500 x _{ i }’s observed in the dataset.
 33.
The 3040 includes those function evaluations necessary for derivative calculations. We used derivative based methods for all the optimization. Starting values for optimization with the standard SML routine were the “good” starting values—this number of function evaluations necessary would be slightly higher than 3040 using the “bad” starting values.
Notes
Acknowledgements
Thanks to Pat Bajari, Peter Davis, Gautam Gowrisankaran, Wes Hartmann, Mike Keane, Whitney Newey, Ariel Pakes, Juan Pantano, and particularly to Steve Berry for suggestions at an early stage. I would also like to thank 3 anonymous referees, the editor Peter Rossi, and participants at the 2000 Cowles Conference on Strategy and Decision Making, the MIT Econometrics Lunch, UCLA, and the 2000 SITE Conference on Structural Econometric Methods for helpful discussions. A 1999 version of this paper circulated under the title “Importance Sampling and the Method of Simulated Moments”, though the 2001 NBER working paper version has the current title. All errors are my own.
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
References
 Ackerberg, D. (2001). A new use of importance sampling to reduce computational burden in simulation estimation. NBER working paper #T273Google Scholar
 Ackerberg, D. (2003). Advertising, learning, and consumer choice in experience. Good markets: A structural empirical examination. International Economic Review, 44, 1007–1040.CrossRefGoogle Scholar
 Aguirregabiria, V., & Mira, P. (2002). Swapping the nested fixed point algorithm: A class of estimators for discrete Markov decision models. Econometrica 70, 1519–1543.CrossRefGoogle Scholar
 Aguirregabiria, V., & Mira, P. (2007). Sequential estimation of dynamic discrete games. Econometrica 75, 1–53.CrossRefGoogle Scholar
 Bajari, P. (1998a). Econometrics of the first price auction with asymmetric bidders. Mimeo, Stanford Univ.Google Scholar
 Bajari, P. (1998b). Econometrics of sealed bid auctions. In 1998 Proceedings of the Business and Economic Statistics Section of the American Statistical Association (pp. 41–49).Google Scholar
 Bajari, P., Benkard, C. L., & Levin, J. (2007a). Estimating dynamic models of imperfect competition. Econometrica 75, 1331–1370.CrossRefGoogle Scholar
 Bajari, P., Fox, J., & Ryan, S. (2007b). Linear regression estimation of discrete choice models with nonparametric random coefficient distributions. American Economic Review, 97, 459–463.CrossRefGoogle Scholar
 Bajari, P., Chernozhukov, V., Hong, H., & Nekipelov, D. (2008). Nonparametric and semiparametric analysis of a dynamic game model. Unpublished working paper.Google Scholar
 Bajari, P., Fox, J., Kim, K., & Ryan, S. (2009a). Discrete choice models with a nonparametric distribution of random coefficient. Mimeo, U. of MN.Google Scholar
 Bajari, P., Hong, H., & Ryan, S. (2009b). Identification and estimation of discrete games of complete information. Econometrica, in press.Google Scholar
 Berry, S. T. (1992). Estimation of a model of entry in the airline industry. Econometrica, 60, 889–917.CrossRefGoogle Scholar
 Berry, S., Levinsohn, J., & Pakes, A. (1995). Automobile prices in market equilibrium. Econometrica, 63, 841–890.CrossRefGoogle Scholar
 Crawford, G., & Shum, M. (2005). Uncertainty and learning in pharmaceutical demand. Econometrica, 73, 1135–1174.CrossRefGoogle Scholar
 Davis, P. (2006). Estimation of quantity games in the presence of indivisibilities and heterogeous firms. Journal of Econometrics, 134, 187–214.CrossRefGoogle Scholar
 Erdem, T., & Keane, M. (1996). Decision making under uncertainty: Capturing dynamic brand choice processes in turbulent consumer goods markets. Marketing Science, 15, 1–20.CrossRefGoogle Scholar
 Gasmi, F., Laffont, JJ, & Vuong, Q. (1991). Econometric analysis of collusive behavior in a soft drink industry. Journal of Economics and Management Strategy, 1, 277–311.Google Scholar
 Geweke, J. (1989). Efficient simulation from the multivariate normal distribution subject to linear inequality constraints and the evaluation of constraint probabilities. Econometrica, 57, 1317–1339.
 Goettler, R., & Clay, K. (2009). Tariff choice with consumer learning and switching costs. Mimeo, Chicago GSB.
 Gourieroux, C., & Monfort, A. (1991). Simulation based inference in models with heterogeneity. Annales d’Economie et de Statistique, 20/21, 69–107.
 Hansen, L. (1982). Large sample properties of generalized method of moments estimators. Econometrica, 50, 1029–1064.
 Hartmann, W. (2006). Intertemporal effects of consumption and their implications for demand elasticity estimates. Quantitative Marketing and Economics, 4, 325–349.
 Hendel, I., & Nevo, A. (2006). Measuring the implications of sales and consumer inventory behavior. Econometrica, 74, 1137–1173.
 Hotz, V. J., & Miller, R. A. (1993). Conditional choice probabilities and the estimation of dynamic models. Review of Economic Studies, 60, 497–529.
 Hotz, V. J., Miller, R. A., Sanders, S., & Smith, J. (1994). A simulation estimator for dynamic models of discrete choice. Review of Economic Studies, 61, 265–289.
 Imai, S., Jain, N., & Ching, A. (2009). Bayesian estimation of dynamic discrete choice models. Econometrica, in press.
 Jofre-Bonet, M., & Pesendorfer, M. (2003). Estimation of a dynamic auction game. Econometrica, 71, 1443–1489.
 Judd, K., & Su, C. (2008). Constrained optimization approaches to estimation of structural models. Mimeo, Stanford.
 Keane, M., & Wolpin, K. (1994). The solution and estimation of discrete choice dynamic programming models by simulation and interpolation. Review of Economics and Statistics, 76, 648–672.
 Keane, M., & Wolpin, K. (2000). Estimating the effect of welfare on the education, employment, fertility and marriage decisions of women. Mimeo, UPenn.
 Keane, M., & Wolpin, K. (2001). The effect of parental transfers and borrowing constraints on educational attainment. International Economic Review, 42, 1051–1103.
 Lerman, S., & Manski, C. (1981). On the use of simulated frequencies to approximate choice probabilities. In C. Manski & D. McFadden (Eds.), Structural analysis of discrete data with econometric applications (pp. 305–319). Cambridge: MIT.
 Maskin, E., & Riley, J. (1996). Existence of equilibrium in sealed, high bid auctions. Mimeo.
 McFadden, D. (1989). A method of simulated moments for estimation of discrete response models without numerical integration. Econometrica, 57(5), 995–1026.
 Kloek, T., & Van Dijk, H. (1978). Bayesian estimation of equation system parameters: An application of integration by Monte Carlo. Econometrica, 46, 1–20.
 Norets, A. (2009). Inference in dynamic discrete choice models with serially correlated unobserved state variables. Econometrica, in press.
 Pakes, A., Ostrovsky, M., & Berry, S. (2007). Simple estimators for the parameters of discrete dynamic games, with entry/exit examples. RAND Journal of Economics, 38, 373–399.
 Pakes, A., & Pollard, D. (1989). Simulation and the asymptotics of optimization estimators. Econometrica, 57, 1027–1057.
 Pantano, J. (2008). Labor market stigma in a forward looking model of criminal behavior. Mimeo, Washington U, St. Louis.
 Pesendorfer, M., & Schmidt-Dengler, P. (2008). Asymptotic least squares estimators for dynamic games. Review of Economic Studies, 75, 901–928.
 Rust, J. (1987). Optimal replacement of GMC bus engines: An empirical model of Harold Zurcher. Econometrica, 55, 999–1033.
 Rust, J. (1997). Using randomization to break the curse of dimensionality. Econometrica, 65, 487–516.
 Train, K. (2003). Discrete choice methods with simulation. Cambridge: Cambridge University Press.