Fast estimation of matrix exponential spatial models

Abstract

The matrix exponential spatial specification (MESS) is an alternative to spatial autoregressive-type (SAR-type) specifications with several attractive properties. The spatial dependence in MESS-type models is formulated through a matrix exponential term, and estimating these models may require computing matrix exponential terms many times within an estimation procedure. It is well documented in the literature that the computation of matrix exponential terms can pose challenges in terms of reliability, stability, accuracy, and efficiency. We propose a matrix-vector products approach, based on truncating the Taylor series expansion of the matrix exponential terms, for the fast estimation of MESS-type models. We show how to implement this approach efficiently for a first-order MESS model, and provide extensive simulation evidence for its computational advantage over the default method utilized by popular statistical software.

Introduction

Spatial econometric models account for potential (weak) cross-sectional dependence among observations located on some relevant space. The Cliff-Ord-type spatial models (i.e., the spatial autoregressive (SAR)-type) impose an autoregressive specification for the variable of interest using a weights matrix which quantifies the strength of a measure of nearness between spatial units (Anselin 1988; Cliff and Ord 1969, 1973; Whittle 1954). The spatial autoregressive specification ensures that the spatial dependence among spatial units decays at a geometric rate. The likelihood-based estimation of these models requires the calculation of matrix determinant terms (the Jacobian terms) in each pass of the numerical optimization scheme, and for large matrices that are not sparse this can create computational challenges. See, for example, LeSage and Pace (2009, Chapter 4) for a variety of approximation methods suggested in the literature to alleviate this computational problem.

An alternative to SAR-type models is the matrix exponential spatial specification (MESS) suggested by LeSage and Pace (2007). In the MESS-type models, the spatial dependence is formulated with a matrix exponential term of the form \(e^{\alpha A}=\sum _{i=0}^{\infty }(\alpha A)^{i}/i!\), where \(\alpha\) is a scalar spatial parameter, A is the \(n\times n\) spatial weights matrix and n is the number of spatial units. Therefore, the MESS-type models impose an exponential rate of decay for the cross-sectional dependence and have several features that make them more convenient for estimation (Chiu et al. 1996; Leonard and Hsu 1992). Likelihood-based estimation is greatly simplified because the likelihood function does not involve any matrix determinant terms. Since matrix exponential terms are always invertible, there is no need to impose restrictions on the parameter space of the spatial parameters, i.e., the MESS-type models always have reduced forms. Moreover, when the model involves heteroskedasticity of an unknown form in the error terms, the maximum likelihood estimator (MLE) remains consistent provided that the weights matrices are commutative (Debarsy et al. 2015).Footnote 1

Despite the aforementioned advantages of the MESS-type models, the likelihood-based and generalized method of moments (GMM) based estimation of these models requires the optimization solver to compute matrix exponential terms in each iteration, which can be computationally costly. The literature has suggested several alternative ways to compute \(e^{\alpha A}\), such as Taylor series approximation, Padé approximation, ordinary differential equation methods, polynomial methods, matrix decomposition methods, splitting methods and Krylov space methods. Moler and Van Loan (1978, 2003) assess the effectiveness of nineteen methods according to the following attributes: (i) generality, (ii) reliability, (iii) stability, (iv) accuracy, (v) efficiency, (vi) storage requirements, (vii) ease of use, and (viii) simplicity. Although Moler and Van Loan (1978, 2003) state that “none (of the methods in their paper) are completely satisfactory”, they claim that a scaling and squaring method with either the rational Padé or Taylor approximants can be the most effective way to compute the matrix exponential terms.Footnote 2 Popular software such as Python, R, MATLAB and Mathematica have modules, packages and functions to compute the matrix exponential of a given matrix. For example, MATLAB (function expm), Mathematica (function MatrixExp) and Python (function scipy.linalg.expm) use a scaling and squaring method combined with a Padé approximation for the computation of matrix exponential terms (Al-Mohy and Higham 2010; Higham 2005).
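
To make the two computational routes concrete, the following minimal Python sketch contrasts the scaling and squaring route (scipy.linalg.expm, the same family of algorithms as MATLAB's expm) with a truncated Taylor series applied directly to a vector; the matrix A, the vector v and the values of n, q and \(\alpha\) below are purely illustrative.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n, q, alpha = 400, 15, 0.5
A = rng.random((n, n))
np.fill_diagonal(A, 0.0)                # weights matrices have zero diagonals
A /= A.sum(axis=1, keepdims=True)       # row normalization
v = rng.standard_normal(n)

# Route 1: form the full n x n matrix exponential, then multiply by v.
w_expm = expm(alpha * A) @ v

# Route 2: truncated Taylor series using only matrix-vector products.
term, w_taylor = v.copy(), v.copy()
for i in range(1, q + 1):
    term = (alpha / i) * (A @ term)     # equals (alpha A)^i v / i!
    w_taylor += term

print(np.max(np.abs(w_expm - w_taylor)))  # the two routes agree closely
```

Route 2 never forms a dense \(n\times n\) matrix beyond A itself, which is the property exploited throughout this paper.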

As pointed out by Moler and Van Loan (1978, 2003), all methods suggested in the literature for computing the matrix exponential terms are “dubious” in the sense that a sole method may not be entirely reliable for all applications.Footnote 3 In other words, a method that is effective for a particular application may not be reliable for another application. In the context of MESS-type models, the scaling and squaring method combined with the Padé approximation as implemented in MATLAB through its expm function can be highly costly in terms of computation time. For example, a typical Monte Carlo simulation designed for a MESS model that has spatial dependence in the dependent variable and the disturbance term (for short MESS(1, 1)) with 1000 resamples and a couple of different parameter combinations can take days or even weeks (see Sect. 4 for the details of the simulation setting). The expm function is also costly for estimating empirical applications with large sample sizes. For example, we show that estimating the MESS(1, 1) model for an empirical illustration involving 3107 observations with the expm function takes 1072 s for the quasi maximum likelihood (QML) estimator, 4805 s for the GMM estimator and 47742 s for the Bayesian estimator (see Sect. 5 for the details).

In this paper, we propose a matrix-vector products method based on the truncation of the Taylor series expansion of the matrix exponential terms for the fast estimation of MESS-type models. Our analysis of the estimation of MESS-type models indicates that estimation requires the computation of matrix exponential terms applied to vectors, rather than the matrix exponential terms in isolation. For example, the estimation of the MESS(1, 1) model requires the computation of terms such as \(e^{\alpha A}e^{\tau B}v\) and \(e^{\tau B}X\), where v is an \(n\times 1\) vector, X is an \(n\times k\) matrix, and \(\alpha\) and \(\tau\) are the spatial parameters. The matrix-vector products method provides approximations to \(e^{\alpha A}e^{\tau B}v\) and \(e^{\tau B}X\) in terms of matrix-vector products rather than providing approximations for \(e^{\alpha A}\) and \(e^{\tau B}\). In this paper, we show how this approach can be implemented for the QML, GMM and Bayesian estimation of the MESS(1, 1) model. Using our suggested approach in the context of the MESS(1, 1) model, we provide extensive simulation evidence on the computational time gains for three estimation methods (the QML, GMM and Bayesian methods).

In the literature, it is well known that the matrix-vector products approach can reduce the computational burden substantially. For example, Moler and Van Loan (2003) write “One of the most significant changes in numerical linear algebra in the past 25 years is the rise of iterative methods for sparse matrix problems, in which only matrix vector products are needed.” In particular, LeSage and Pace (2009) extensively use sparse matrix-vector operations for the likelihood and Bayesian estimation of spatial models to reduce the computational burden. The Krylov space methods [the twentieth method in Moler and Van Loan (2003)] suggested for the computation of matrix exponential terms also depend on the matrix-vector products approach. See, for example, Saad (1992), Gallopoulos and Saad (1992), Hochbruck and Lubich (1997), Sidje (1998), and related references. When A is sparse, in the first step of this approach, \(e^{\alpha A}v\) is approximated by an element of the Krylov subspace \(K_m=\text {span}\{v, (\alpha A)v,\ldots ,(\alpha A)^mv\}\), where m, the dimension of the Krylov subspace, is small compared to n. The operations in the first step of this method involve only matrix-vector products [see Sidje (1998) for the implementation of this method].

The remainder of this paper is organized as follows. Section 2 presents the model under consideration and lays out briefly the details on the QML, GMM, and Bayesian MCMC estimation. This section also shows how the impact measures and their dispersion measures can be estimated. Section 3 provides the details on the computation of the matrix exponential terms using the matrix-vector products approach. We then show how the matrix-vector products approach can be applied to the QML, GMM and Bayesian estimation methods. Section 4 presents the setting for our Monte Carlo study and the simulation results. Section 5 illustrates the computational time advantage of the matrix-vector products method using a large dataset from the spatial econometric literature. We conclude in Sect. 6. Some simulation results are relegated to an appendix.

Model and estimation approaches

Model specification

We consider the following first-order matrix exponential spatial model (for short MESS(1, 1))

$$\begin{aligned} e^{\alpha _0 W}y=X\beta _0+u, \quad e^{\tau _0 M}u=\epsilon , \end{aligned}$$
(2.1)

where \(y=\left( y_{1},\ldots ,y_{n}\right) ^{'}\) is the \(n\times 1\) vector of observations on the dependent variable, X is the \(n\times k\) matrix of non-stochastic exogenous variables with the associated parameter vector \(\beta _0\), and W and M are the \(n\times n\) spatial weights matrices of known constants with zero diagonal elements. The scalar parameters \(\alpha _0\) and \(\tau _0\) are called the spatial parameters. We let \(u=\left( u_{1},\ldots ,u_{n}\right) ^{'}\) denote the \(n\times 1\) vector of regression disturbance terms and \(\epsilon =\left( \epsilon _{1},\ldots ,\epsilon _{n}\right) ^{'}\) the \(n\times 1\) vector of disturbances (or innovations).

The matrix exponential terms \(e^{\alpha W}\) and \(e^{\tau M}\) in (2.1) are defined as \(e^{\alpha W}=\sum _{i=0}^{\infty }(\alpha W)^{i}/i!\) and \(e^{\tau M}=\sum _{i=0}^{\infty }(\tau M)^{i}/i!\), and are always invertible with the inverses \(e^{-\alpha W}\) and \(e^{-\tau M}\) (Chiu et al. 1996). Thus, the reduced form of the model always exists and is given by \(y=e^{-\alpha W}X\beta _0+e^{-\alpha W}e^{-\tau M}\epsilon\). On the other hand, in a SAR-type model, we need to restrict the parameter space of the spatial parameters so that it has a reduced form. On the parameter space of the spatial parameters in SAR-type models, see, among others, Elhorst (2014), Kelejian and Prucha (2010), Lee (2004), LeSage and Pace (2009). In the SAR-type counterpart of the MESS(1, 1) model, the matrix exponential terms \(e^{\alpha _0 W}y\) and \(e^{\tau _0 M}u\) in (2.1) are respectively replaced by \((I_n-\lambda _0 W)y\) and \((I_n-\rho _0 M)u\), where \(I_n\) is the \(n\times n\) identity matrix, and \(\lambda _0\) and \(\rho _0\) are scalar spatial parameters. Under the assumption that \(\Vert \lambda _0W\Vert <1\) for some matrix norm \(\Vert \cdot \Vert\), we have \((I_n-\lambda _0 W)^{-1}=\sum _{i=0}^{\infty }(\lambda _0W)^{i}\) (Horn and Johnson 2012). Thus, the SAR model imposes a geometric decay pattern of spatial dependence among spatial units, while the MESS(1, 1) model exhibits an exponential decay.Footnote 4

Maximum likelihood approach

Under the assumption that the \(\epsilon _i\)’s are i.i.d. normal with mean zero and variance \(\sigma ^2_0\), the log-likelihood function of the model can be expressed as

$$\begin{aligned} \ln L(\theta )=-\frac{n}{2}\ln (2\pi \sigma ^2)-\frac{1}{2\sigma ^2}(e^{\alpha W}y-X\beta )^{'}e^{\tau M^{'}}e^{\tau M}(e^{\alpha W}y-X\beta ), \end{aligned}$$
(2.2)

where \(\theta =(\alpha , \tau , \beta ^{'}, \sigma ^2)^{'}\). Note that (2.2) does not include the Jacobian terms, because \(\ln |e^{\alpha W}|=\ln \left( e^{\alpha \mathrm {tr}(W)}\right) =\alpha \mathrm {tr}(W)=0\) and \(\ln |e^{\tau M}|=\ln \left( e^{\tau \mathrm {tr}(M)}\right) =\tau \mathrm {tr}(M)=0\), since W and M have zero diagonal elements, where \(|\cdot |\) denotes the determinant operator, and \(\mathrm {tr}(\cdot )\) is the trace operator. Let \(\psi =(\alpha ,\tau )^{'}\). Then, for a given value of \(\psi\), the first-order conditions of (2.2) with respect to \(\beta\) and \(\sigma ^2\) yield

$$\begin{aligned}&{\hat{\beta }}(\psi )=(X^{'}e^{\tau M^{'}}e^{\tau M}X)^{-1}X^{'}e^{\tau M^{'}}e^{\tau M}e^{\alpha W}y, \end{aligned}$$
(2.3)
$$\begin{aligned}&{\hat{\sigma }}^2(\psi )=\frac{1}{n}(e^{\alpha W}y-X{\hat{\beta }})^{'}e^{\tau M^{'}}e^{\tau M}(e^{\alpha W}y-X{\hat{\beta }})=\frac{1}{n}y^{'}e^{\alpha W^{'}}e^{\tau M^{'}}H(\tau )e^{\tau M}e^{\alpha W}y, \end{aligned}$$
(2.4)

where \(H(\tau )=I_n-e^{\tau M}X(X^{'}e^{\tau M^{'}}e^{\tau M}X)^{-1}X^{'}e^{\tau M^{'}}\). Then, ignoring constant terms, the concentrated likelihood function can be written as

$$\begin{aligned} \ln L^c(\psi )=-\frac{n}{2}\ln ({\hat{\sigma }}^2(\psi ))=-\frac{n}{2}\ln \left( \frac{1}{n}y^{'}e^{\alpha W^{'}}e^{\tau M^{'}}H(\tau )e^{\tau M}e^{\alpha W}y\right) . \end{aligned}$$
(2.5)

Thus, we can define the QMLE of \(\psi _0\) as

$$\begin{aligned} {\hat{\psi }}={{\,\mathrm{argmin}\,}}_{\psi }\left( y^{'}e^{\alpha W^{'}}e^{\tau M^{'}}H(\tau )e^{\tau M}e^{\alpha W}y\right) . \end{aligned}$$
(2.6)
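
As an illustration, the following hedged sketch evaluates (2.6) naively with full matrix exponentials (Sect. 3 replaces the expm calls with matrix-vector products). Since \(H(\tau )\) is the annihilator matrix of \(e^{\tau M}X\), the quadratic form in (2.6) equals the residual sum of squares from regressing \(e^{\tau M}e^{\alpha W}y\) on \(e^{\tau M}X\); all variable names are ours.

```python
import numpy as np
from scipy.linalg import expm
from scipy.optimize import minimize

def qml_objective(psi, W, M, X, y):
    alpha, tau = psi
    E_tau = expm(tau * M)
    z  = E_tau @ expm(alpha * W) @ y    # e^{tau M} e^{alpha W} y
    Xt = E_tau @ X                      # e^{tau M} X
    beta_hat, *_ = np.linalg.lstsq(Xt, z, rcond=None)
    r = z - Xt @ beta_hat               # H(tau) e^{tau M} e^{alpha W} y
    return r @ r

# psi_hat = minimize(qml_objective, x0=[0.0, 0.0], args=(W, M, X, y),
#                    method="Nelder-Mead").x
```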

Let \(\gamma =(\alpha , \tau , \beta ^{'})^{'}\), and let \(\gamma _0=(\alpha _0, \tau _0, \beta ^{'}_0)^{'}\) denote the true parameter vector. Then, the QMLE \({\hat{\gamma }}\) has the following asymptotic normal distribution (Debarsy et al. 2015),Footnote 5

$$\begin{aligned} \sqrt{n}\left( {\hat{\gamma }}-\gamma _{0}\right) {\mathop {\rightarrow }\limits ^{d}} N\left( 0, \lim _{n \rightarrow \infty } C^{-1} \Omega C^{-1}\right) . \end{aligned}$$
(2.7)

Here, \(\Omega =2 \sigma _{0}^{2} C+\Omega _{1}\) with

$$\begin{aligned} C&=\frac{1}{n} \begin{pmatrix} \sigma _{0}^{2}{\text {tr}}\left( {\mathbb {W}}^{s} {\mathbb {W}}^{s}\right) +2\left( {\mathbb {W}} e^{\tau _{0} M} X \beta _{0}\right) ^{\prime }\left( {\mathbb {W}} e^{\tau _{0} M} X \beta _{0}\right) & * & * \\ \sigma _{0}^{2} {\text {tr}}\left( {\mathbb {W}}^{s} M^{s}\right) &\sigma _{0}^{2} {\text {tr}}\left( M^{s} M^{s}\right) & * \\ -2\left( e^{\tau _{0} M} X\right) ^{\prime } {\mathbb {W}} e^{\tau _{0} M} X \beta _{0} & 0 &2\left( e^{\tau _{0} M} X\right) ^{\prime }\left( e^{\tau _{0} M} X\right) \\ \end{pmatrix}, \end{aligned}$$
(2.8)
$$\begin{aligned} \Omega _{1}&=\frac{1}{n} \begin{pmatrix} \left( \mu _{4}-3 \sigma _{0}^{4}\right) {\text {vec}}_{\mathrm {D}}^{\prime }\left( {\mathbb {W}}^{s}\right) {\text {vec}}_{\mathrm {D}}\left( {\mathbb {W}}^{s}\right) +4 \mu _{3}\left( {\mathbb {W}} e^{\tau _{0} M} X \beta _{0}\right) ^{\prime } {\text {vec}}_{D}\left( {\mathbb {W}}^{s}\right) & * & * \\ 0 & 0 & * \\ -2 \mu _{3}\left( e^{\tau _{0} M} X\right) ^{\prime } {\text {vec}}_{\mathrm {D}}\left( {\mathbb {W}}^{s}\right) & 0 & 0\end{pmatrix}, \end{aligned}$$
(2.9)

where \({\mathbb {W}}=e^{\tau _0M}We^{-\tau _0M}\), \(\mu _{3}=\mathrm {E} \epsilon _{i}^{3}\), \(\mu _{4}=\mathrm {E} \epsilon _{i}^{4}\), \({\text {vec}}_{\mathrm {D}}(A)\) denotes a vector containing the diagonal elements of any square matrix A, and \(B^s=B+B^{'}\) for any \(n\times n\) matrix B.

GMM approach

When the error terms are homoskedastic, a set of moment functions that consists of linear and quadratic moment functions can be arranged such that the resulting GMME is as efficient as the QMLE under the normal case, and asymptotically more efficient than the QMLE under the non-normal case. In the case of heteroskedasticity of an unknown form, an optimal GMME (OGMME) can be defined such that it is also more efficient than the QMLE (Debarsy et al. 2015).

To define the best set of moment functions when the error terms are simply i.i.d., we introduce the following notation. Let Diag(a) be the \(n\times n\) matrix whose diagonal entries are the elements of the \(n\times 1\) vector a, and Diag(A) be the \(n\times n\) diagonal matrix whose diagonal entries are those of the \(n\times n\) matrix A. Let \(B^{(t)}=B-I_n\mathrm {tr}(B)/n\) for any \(n\times n\) matrix B, and let \(X^*\) be the submatrix of X with the intercept term removed, with \(k^{*}\) columns. Define \(P_{1}^*={\mathbb {W}}, P_{2}^*={\text {Diag}}({\mathbb {W}})\), \(P_{3}^{*}={\text {Diag}}\left( e^{\tau _{0} M} W X \beta _{0}\right) ^{(t)}\), \(P_{4}^{*}=M\), \(P_{l+4}^*={\text {Diag}}\left( e^{\tau _{0} M} X_{l}^{*}\right) ^{(t)}\) for \(l=1, \ldots , k^{*}\) and \(F^{*}=\left( F_1^{*}, F_{2}^{*}, F_{3}^{*}, F_{4}^{*}\right)\) with \(F_{1}^{*}=e^{\tau _{0} M} X^{*}, F_{2}^{*}=e^{\tau _{0} M} W X \beta _{0}, F_{3}^{*}=l_{n},\) and \(F_{4}^{*}={\text {vec}}_{D}\left( {\mathbb {W}}\right)\). Then, under the assumption that the error terms are i.i.d., the best set of moment functions suggested in Debarsy et al. (2015) takes the following form:

$$\begin{aligned} g^{*}(\gamma )=\frac{1}{n}\left( \epsilon ^{'}(\gamma )P_{1}^{*} \epsilon (\gamma ), \ldots , \epsilon ^{'}(\gamma )P_{k^{*}+4}^{*} \epsilon (\gamma ), \epsilon ^{'}(\gamma )F^{*}\right) ^{\prime }, \end{aligned}$$
(2.10)

where \(\epsilon (\gamma )=e^{\tau M}\left( e^{\alpha W}y-X\beta \right)\). Then, the best GMME (BGMME) is defined as

$$\begin{aligned} {\hat{\gamma }}_B={{\,\mathrm{argmin}\,}}_{\gamma }g^{*'}(\gamma )V^{*-1}g^{*}(\gamma ), \end{aligned}$$
(2.11)

where \(V^{*}=n \mathrm {E}\left( g^{*}\left( \gamma _{0}\right) g^{*^{\prime }}\left( \gamma _{0}\right) \right)\). It can be shown that

$$\begin{aligned} V^*=\frac{1}{n}\left( \begin{array}{cc}\frac{\sigma _{0}^{4}}{2} \omega ^{\prime } \omega +\frac{1}{4}\left( \mu _{4}-3 \sigma _{0}^{4}\right) \omega _{d}^{\prime } \omega _{d} & \frac{1}{2} \mu _{3} \omega _{d}^{\prime } F^{*} \\ \frac{1}{2} \mu _{3} F^{*\prime } \omega _{d} & \sigma _{0}^{2} F^{*\prime } F^{*}\end{array}\right) , \end{aligned}$$
(2.12)

where \(\omega _{d}=\left( {\text {vec}}_{D}\left( P_{1}^{*s}\right) , \ldots , {\text {vec}}_{\mathrm {D}}\left( P^{*s}_{k^{*}+4}\right) \right)\), \(\omega =\left( {\text {vec}}\left( P^{*s}_{1}\right) , \ldots , {\text {vec}}(P^{*s}_{k^{*}+4})\right)\) and \(\text {vec}(A)\) denotes the vectorization of matrix A. Under some regularity conditions, it follows that (Debarsy et al. 2015)

$$\begin{aligned} \sqrt{n}\left( {\hat{\gamma }}_{B}^{}-\gamma _{0}\right) {\mathop {\rightarrow }\limits ^{d}} N\left( 0, \lim _{n \rightarrow \infty } (G^{*'} V^{*-1} G^{*})^{-1} \right) , \end{aligned}$$
(2.13)

where

$$\begin{aligned} G^*={\mathbb {E}}\left( \frac{\partial g^*(\gamma _0)}{\partial \gamma ^{'}}\right) =\frac{1}{n}\left( \begin{array}{ccc}\frac{\sigma _{0}^{2}}{2} \omega ^{\prime } {\text {vec}}\left( {\mathbb {W}}^{s}\right) & \frac{\sigma _{0}^{2}}{2} \omega ^{\prime } {\text {vec}}\left( M^{s}\right) & 0 \\ F^{*\prime } {\mathbb {W}} e^{\tau _{0} M} X \beta _{0} & 0 & -F^{*\prime } e^{\tau _{0} M} X\end{array}\right) . \end{aligned}$$
(2.14)

Note that the BGMME defined in (2.11) is not feasible, since \(V^*\), \(P_1^*, \ldots , P_{k^*+4}^*\) and \(F^*\) are functions of the unknown parameters. In practice, an initial consistent estimator of \(\gamma _0\) can be used to replace the unknown parameters in these terms.Footnote 6 A feasible estimator formulated in this way can be shown to have the same asymptotic distribution as that of \({\hat{\gamma }}_B\) (Debarsy et al. 2015).

When the disturbance terms have heteroskedasticity of an unknown form, the matrices \(P_i\) used in the quadratic moment functions need to have zero diagonal elements. In the heteroskedastic case, an OGMME can be defined by using the following vector of moment functions,

$$\begin{aligned} {\hat{g}}_{d}(\gamma )= \frac{1}{n}\left( \epsilon ^{\prime }(\gamma )\left( \hat{{\mathbb {W}}}-{\text {Diag}}(\hat{{\mathbb {W}}})\right) \epsilon (\gamma ), \epsilon ^{\prime }(\gamma )M \epsilon (\gamma ), \epsilon ^{\prime }(\gamma )\left( \hat{{\mathbb {W}}} e^{{\hat{\tau }} M} X{\hat{\beta }}, e^{{\hat{\tau }}M} X\right) \right) ^{\prime }, \end{aligned}$$
(2.15)

where \(\hat{{\mathbb {W}}}=e^{{\hat{\tau }}M}We^{-{\hat{\tau }}M}\).
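For concreteness, a minimal sketch of (2.15) is given below, assuming initial estimates tau_hat and beta_hat are available; the function names are ours, and in practice the expm calls would again be replaced by the matrix-vector products of Sect. 3.

```python
import numpy as np
from scipy.linalg import expm

def eps(gamma, W, M, X, y):
    # epsilon(gamma) = e^{tau M} (e^{alpha W} y - X beta)
    alpha, tau, beta = gamma[0], gamma[1], gamma[2:]
    return expm(tau * M) @ (expm(alpha * W) @ y - X @ beta)

def g_d(gamma, W, M, X, y, tau_hat, beta_hat):
    n = y.shape[0]
    e = eps(gamma, W, M, X, y)
    E, E_inv = expm(tau_hat * M), expm(-tau_hat * M)
    W_hat = E @ W @ E_inv                     # e^{tau M} W e^{-tau M}
    P1 = W_hat - np.diag(np.diag(W_hat))      # zero-diagonal quadratic matrix
    F = np.column_stack([W_hat @ E @ X @ beta_hat, E @ X])
    return np.concatenate([[e @ P1 @ e, e @ M @ e], F.T @ e]) / n
```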

Bayesian approach

To complete the model specification, we need to specify the prior distributions for \(\alpha , \tau , \beta\) and \(\sigma ^2\). We assume the following prior distributions: \(\alpha \sim N(\mu _\alpha , V_\alpha )\), \(\tau \sim N(\mu _\tau , V_\tau )\), \(\beta \sim N(\mu _\beta , V_\beta )\), and \(\sigma ^2 \sim IG(a_0, b_0)\), where IG denotes the inverse-gamma distribution. The likelihood is given by

$$\begin{aligned} y|\alpha , \tau , \beta , \sigma ^2 \sim N(e^{-\alpha W}X\beta , \sigma ^2e^{-\alpha W}e^{-\tau M}e^{-\tau M^{'}}e^{-\alpha W^{'}}). \end{aligned}$$
(2.16)

The standard Bayesian analysis for a linear regression model can be used to obtain the conditional posterior distributions of \(\beta\) and \(\sigma ^2\). On the other hand, combining the likelihood function with the prior densities of the spatial parameters indicates that the conditional posterior distributions of \(\alpha\) and \(\tau\) are non-standard. In the following algorithm, we suggest a Gibbs sampler that shows how to generate random draws from the joint posterior distribution \(p(\beta , \sigma ^2,\alpha , \tau |y)\).Footnote 7 A compact code sketch of this sampler is given after the algorithm.

Algorithm 1

  1.

    Sampling step for \(\beta\):

    $$\begin{aligned} \beta |y, \alpha , \tau , \sigma ^2 \sim N({\hat{\beta }}, K_\beta ), \end{aligned}$$
    (2.17)

    where \(K_\beta =(V_\beta ^{-1}+X'e^{\tau M'}e^{\tau M}X/\sigma ^2)^{-1}\) and \({\hat{\beta }}=K_\beta (X'e^{\tau M'}e^{\tau M}e^{\alpha W}y/\sigma ^2+V_{\beta }^{-1}\mu _{\beta })\).

  2.

    Sampling step for \(\sigma ^2\):

    $$\begin{aligned} \sigma ^2 | y, \alpha , \tau , \beta \sim IG({\hat{\sigma }}^2, K_{\sigma ^2}), \end{aligned}$$
    (2.18)

    where \({\hat{\sigma }}^2=a_0+\frac{n}{2}\) and \(K_{\sigma ^2}=b_0+\frac{1}{2}(e^{\alpha W}y-X\beta )'e^{\tau M'}e^{\tau M}(e^{\alpha W}y-X\beta )\).

  3.

    Sampling step for \(\alpha\):

    $$\begin{aligned}&p(\alpha |y, \beta , \tau , \sigma ^2)\propto \exp \left( -\frac{1}{2}\left( \sigma ^{-2}(e^{\alpha W}y-X\beta )^{'}e^{\tau M^{'}}e^{\tau M}(e^{\alpha W}y-X\beta )\nonumber \right. \right. \nonumber \\&\left. \left. +V_{\alpha }^{-1}(\alpha ^2-2\mu _\alpha \alpha )\right) \right) , \end{aligned}$$
    (2.19)

    which is a non-standard distribution. We can use a random-walk Metropolis-Hastings algorithm to sample from this distribution (LeSage and Pace 2009). A candidate value \(\alpha ^{new}\) is generated according to

    $$\begin{aligned} \alpha ^{new}=\alpha ^{old}+c_{\alpha }\times N(0,1), \end{aligned}$$
    (2.20)

    where \(c_{\alpha }\) is the tuning parameter.Footnote 8 The candidate value \(\alpha ^{new}\) is accepted with probability

    $$\begin{aligned} {\mathbb {P}}(\alpha ^{new}, \alpha ^{old})=\text {min}\left( 1, \frac{p(\alpha ^{new}|y,\beta , \sigma ^2, \tau )}{p(\alpha ^{old}|y,\beta , \sigma ^2, \tau )}\right) . \end{aligned}$$
    (2.21)
  4.

    Sampling step for \(\tau\):

    $$\begin{aligned}&p(\tau |y, \beta , \alpha , \sigma ^2)\propto \exp \left( -\frac{1}{2}\left( \sigma ^{-2}(e^{\alpha W}y-X\beta )^{'}e^{\tau M^{'}}e^{\tau M}(e^{\alpha W}y-X\beta )\nonumber \right. \right. \\&\quad \left. \left. +V_{\tau }^{-1}(\tau ^2-2\mu _\tau \tau )\right) \right) . \end{aligned}$$
    (2.22)

    We use the random-walk Metropolis-Hastings algorithm described in Step 3 to generate random draws from \(p(\tau |y, \beta , \alpha , \sigma ^2)\).
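
Below is a compact, hedged sketch of Algorithm 1 in Python. The expm calls are kept for clarity (Sect. 3 replaces them with matrix-vector products), the hyperparameters mirror the priors used in our simulations, and the fixed tuning constant c is a placeholder for the adaptive rule in Footnote 8.

```python
import numpy as np
from scipy.linalg import expm

def gibbs(W, M, X, y, ndraw=1500, burn=500, c=0.1, seed=0):
    n, k = X.shape
    mu_b, V_b = np.zeros(k), np.eye(k)        # beta ~ N(mu_b, V_b)
    a0, b0 = 3.0, 2.0                         # sigma^2 ~ IG(a0, b0)
    mu_s, V_s = 0.0, 10.0                     # alpha, tau ~ N(0, 10)
    alpha, tau, beta, sig2 = 0.0, 0.0, np.zeros(k), 1.0
    rng = np.random.default_rng(seed)
    draws = []

    def log_post(a, t):                       # kernels (2.19) and (2.22)
        e = expm(t * M) @ (expm(a * W) @ y - X @ beta)
        return -0.5 * (e @ e / sig2
                       + ((a - mu_s) ** 2 + (t - mu_s) ** 2) / V_s)

    for it in range(ndraw):
        Et = expm(tau * M)
        z, Xt = Et @ expm(alpha * W) @ y, Et @ X
        # Step 1: beta | y, alpha, tau, sigma^2 ~ N(beta_hat, K_beta), (2.17)
        K = np.linalg.inv(np.linalg.inv(V_b) + Xt.T @ Xt / sig2)
        beta = rng.multivariate_normal(
            K @ (Xt.T @ z / sig2 + np.linalg.inv(V_b) @ mu_b), K)
        # Step 2: sigma^2 | rest ~ IG(a0 + n/2, b0 + e'e/2), (2.18)
        r = z - Xt @ beta
        sig2 = 1.0 / rng.gamma(a0 + n / 2, 1.0 / (b0 + 0.5 * (r @ r)))
        # Steps 3-4: random-walk Metropolis-Hastings for alpha and tau
        for which in (0, 1):
            prop = (alpha + c * rng.standard_normal(), tau) if which == 0 \
                else (alpha, tau + c * rng.standard_normal())
            if np.log(rng.random()) < log_post(*prop) - log_post(alpha, tau):
                alpha, tau = prop
        if it >= burn:
            draws.append((alpha, tau, beta.copy(), sig2))
    return draws
```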

Remark 1

LeSage and Pace (2007) develop an efficient Bayesian estimation method for the MESS(1, 0) model by assuming a normal-gamma prior for \(\beta\) and \(\sigma ^2\), and a normal prior for \(\alpha\). By using some properties of the multivariate normal distribution and the inverse gamma distribution, LeSage and Pace (2007) analytically derive the marginal posterior distribution of \(\alpha\). Since the marginal distribution of \(\alpha\) is not in a known form, they suggest using a univariate numerical integration method for the computation of posterior moments. Also, since the marginal posterior distributions of \(\beta\) and \(\sigma ^2\) depend on \(\alpha\), they suggest computing the posterior moments of these parameters by numerical integration over \(\alpha\). Our approach presented in Algorithm 1 differs in two important ways. First, we use the random-walk Metropolis-Hastings algorithm suggested by LeSage and Pace (2009) to generate posterior draws for \(\alpha\). Second, we suggest independent prior distributions for \(\beta\) and \(\sigma ^2\), and use the Gibbs sampler in Algorithm 1 to generate posterior draws for \(\beta\) and \(\sigma ^2\).

Impact measures

The dispersion of parameter estimators can be estimated in different ways (Arbia 2020; Debarsy et al. 2015; Elhorst 2014; LeSage and Pace 2009; Taşpınar et al. 2018). In the case of the QMLE and GMME defined in Sects. 2.2 and 2.3, the closed forms of the variance-covariance matrices are available. Thus, we can use the plug-in method for these estimators. That is, the unknown parameters in these variance-covariance matrices can be replaced by the corresponding estimates obtained from consistent estimators. In the case of the Bayesian approach, we can use the empirical standard deviations of the random draws generated through Algorithm 1 as estimates of the standard errors of the parameters.

According to the model in (2.1), the derivative of y with respect to the kth explanatory variable \(x_k\) gives the marginal effect \(e^{-\alpha _0 W}\beta _{0k}\), where \(\beta _{0k}\) is the kth element of the true coefficient vector \(\beta _0\). To ease the interpretation and presentation of this marginal effect, LeSage and Pace (2009) define three scalar measures: the average direct impact, the average indirect impact, and the average total impact. The average direct impact is the average of the main diagonal elements of \(e^{-\alpha _0 W}\beta _{0k}\), the average indirect impact is the average of the off-diagonal elements of \(e^{-\alpha _0 W}\beta _{0k}\), and the average total impact is the average of all elements of \(e^{-\alpha _0 W}\beta _{0k}\). For statistical inference, one needs to determine the dispersions of these scalar impact measures. In the Bayesian approach, a sequence of random draws obtained through Algorithm 1 can be used to generate a sequence of random draws for each impact measure. Then, the mean and the standard deviation calculated from each sequence of impact measures can be used for inference.

In the QML and GMM cases, the classical delta method can be used to determine the dispersions of the impact measures. The estimator of the average direct effect is given by \(\frac{1}{n}\mathrm {tr}(e^{-{\hat{\alpha }}W}{\hat{\beta }}_{k})\). Then, by the mean value theorem, we obtain

$$\begin{aligned}&\frac{1}{\sqrt{n}}\left( \mathrm {tr}(e^{-{\hat{\alpha }}W}{\hat{\beta }}_{k})-\mathrm {tr}(e^{-\alpha _0W}\beta _{0k})\right) \nonumber \\&\quad =A_{1}\times \sqrt{n}({\hat{\alpha }}-\alpha _0,{\hat{\beta }}_{k}-\beta _{0k})^{'}+o_p(1)\xrightarrow {d} N(0,\lim _{n\rightarrow \infty }A_{1}BA^{'}_{1}), \end{aligned}$$
(2.23)

where \(A_{1}=\left( -\frac{1}{n}\mathrm {tr}(e^{-\alpha _0 W}W\beta _{0k}),\frac{1}{n}\mathrm {tr}(e^{-\alpha _0 W})\right)\) and B is the asymptotic covariance matrix of \(\sqrt{n}({\hat{\alpha }}-\alpha _0,{\hat{\beta }}_{k}-\beta _{0k})\). Thus, the asymptotic variance of the average direct effect can be estimated by \(\frac{1}{n}{\hat{A}}_{1}{\hat{B}}{\hat{A}}^{'}_{1}\), where \({\hat{A}}_{1}=\left( -\frac{1}{n}\mathrm {tr}(e^{-{\hat{\alpha }}W}W{\hat{\beta }}_{k}), \frac{1}{n}\mathrm {tr}(e^{-{\hat{\alpha }}W})\right)\), and \({\hat{B}}\) is the estimated asymptotic covariance matrix of \(\sqrt{n}({\hat{\alpha }}-\alpha _0,{\hat{\beta }}_{k}-\beta _{0k})\). Applying the mean value theorem to the estimator of the average total effect \(\frac{1}{n}{\hat{\beta }}_{k}l^{'}_ne^{-{\hat{\alpha }}W}l_n\), where \(l_n\) is the \(n\times 1\) vector of ones, we obtain

$$\begin{aligned}&\frac{1}{\sqrt{n}}\left( {\hat{\beta }}_{k}l^{'}_ne^{-{\hat{\alpha }}W}l_n-\beta _{0k}l^{'}_ne^{-\alpha _0W}l_n\right) \nonumber \\&\quad =A_{2}\times \sqrt{n}({\hat{\alpha }}-\alpha _0,{\hat{\beta }}_{k}-\beta _{0k})^{'}+o_p(1)\xrightarrow {d}N(0,\lim _{n\rightarrow \infty }A_{2}BA^{'}_{2}), \end{aligned}$$
(2.24)

where \(A_{2}=\left( -\frac{1}{n}\beta _{0k}l^{'}_ne^{-\alpha _0W}Wl_n,\frac{1}{n}l^{'}_ne^{-\alpha _0W}l_n\right)\). Thus, Var\((\frac{1}{n}{\hat{\beta }}_{k}l^{'}_ne^{-{\hat{\alpha }}W}l_n)\) can be estimated by \(\frac{1}{n}{\hat{A}}_{2}{\hat{B}}{\hat{A}}^{'}_{2}\), where \({\hat{A}}_{2}=\left( -\frac{1}{n}{\hat{\beta }}_{k}l^{'}_ne^{-{\hat{\alpha }}W}Wl_n,\frac{1}{n}l^{'}_ne^{-{\hat{\alpha }}W}l_n\right)\). Finally, applying the mean value theorem to the estimator of the average indirect effect \(\frac{1}{n}\left( {\hat{\beta }}_{k}l^{'}_ne^{-{\hat{\alpha }}W}l_n-\mathrm {tr}(e^{-{\hat{\alpha }}W}{\hat{\beta }}_{k})\right)\), we obtain

$$\begin{aligned}&\frac{1}{\sqrt{n}}\left( \left( {\hat{\beta }}_{k}l^{'}_ne^{-{\hat{\alpha }}W}l_n-\mathrm {tr}(e^{-{\hat{\alpha }}W}{\hat{\beta }}_{k})\right) -\left( \beta _{0k}l^{'}_ne^{-\alpha _0W}l_n-\mathrm {tr}(e^{-\alpha _0W}\beta _{0k})\right) \right) \nonumber \\&\quad =(A_{2}-A_{1})\times \sqrt{n}({\hat{\alpha }}-\alpha _0,{\hat{\beta }}_{k}-\beta _{0k})^{'}+o_p(1)\nonumber \\&\qquad \xrightarrow {d}N(0,\lim _{n\rightarrow \infty }(A_{2}-A_{1})B(A_{2}-A_{1})^{'}). \end{aligned}$$
(2.25)

Then, an estimate of Var\(\left( \frac{1}{n}\left( {\hat{\beta }}_{k}l^{'}_ne^{-{\hat{\alpha }}W}l_n-\mathrm {tr}(e^{-{\hat{\alpha }}W}{\hat{\beta }}_{k})\right) \right)\) is given by \(\frac{1}{n}({\hat{A}}_{2}-{\hat{A}}_{1}){\hat{B}}({\hat{A}}_{2}-{\hat{A}}_{1})^{'}\).
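
A minimal sketch of these delta-method variance estimates is given below; here alpha_hat, beta_k and B_hat stand for \({\hat{\alpha }}\), \({\hat{\beta }}_k\) and \({\hat{B}}\), and the helper name is ours.

```python
import numpy as np
from scipy.linalg import expm

def impact_variances(alpha_hat, beta_k, B_hat, W):
    # B_hat: estimated 2 x 2 asymptotic covariance matrix of
    # sqrt(n) * (alpha_hat - alpha_0, beta_k_hat - beta_0k)
    n = W.shape[0]
    E = expm(-alpha_hat * W)
    ones = np.ones(n)
    A1 = np.array([-np.trace(E @ W) * beta_k / n, np.trace(E) / n])
    A2 = np.array([-beta_k * (ones @ E @ W @ ones) / n,
                   (ones @ E @ ones) / n])
    var_direct   = A1 @ B_hat @ A1 / n                  # eq. (2.23)
    var_total    = A2 @ B_hat @ A2 / n                  # eq. (2.24)
    var_indirect = (A2 - A1) @ B_hat @ (A2 - A1) / n    # eq. (2.25)
    return var_direct, var_total, var_indirect
```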

The matrix-vector products method

Our analysis in Sect. 2 indicates that the estimation of the MESS(1, 1) model specifically requires the evaluation of \(e^{\tau M} e^{\alpha W}y\) and \(e^{\tau M}X\). In the case of the QMLE, the objective function in (2.6) consists of \(e^{\tau M} e^{\alpha W}y\) and the term \(H(\tau )\), which is a function of \(e^{\tau M}X\). For the BGMME in (2.11), the vector of best moment functions \(g^*(\gamma )\) contains the disturbances \(\epsilon (\gamma )\), which is a function of \(e^{\tau M} e^{\alpha W}y\) and \(e^{\tau M}X\). Algorithm 1 indicates that the Bayesian estimator also requires the evaluation of \(e^{\tau M} e^{\alpha W}y\) and \(e^{\tau M}X\) in each pass through the Gibbs sampler. In this section, we show how the matrix-vector products approach can be used to compute \(e^{\tau M} e^{\alpha W}y\) and \(e^{\tau M}X\) based on the truncation of the Taylor series expansion of the matrix exponential terms.

We start with \(e^{\tau M} e^{\alpha W}y\). By definition, we have \(e^{\tau M}e^{\alpha W}y=\sum _{i=0}^{\infty } \frac{\tau ^i M^i}{i!}\sum _{j=0}^{\infty } \frac{\alpha ^j W^j}{j!}y\). Truncating each Taylor series after the \(q\)th-order term yields \(e^{\tau M}e^{\alpha W}y\approx \sum _{i=0}^{q} \frac{\tau ^i M^i}{i!}\sum _{j=0}^{q} \frac{\alpha ^j W^j}{j!}y\). Note that we can express \(\sum _{i=0}^q \frac{\tau ^i M^i}{i!}\sum _{j=0}^q \frac{\alpha ^j W^j}{j!}\) as

$$\begin{aligned} \sum _{i=0}^q \frac{\tau ^i M^i}{i!}\sum _{j=0}^q \frac{\alpha ^j W^j}{j!}=\sum _{i=1}^q \sum _{j=0}^{i-1}\left( \frac{\tau ^i \alpha ^j M^i W^j}{i!j!}+\frac{\tau ^j \alpha ^i M^j W^i}{i!j!}\right) +\sum _{i=0}^{q}\frac{\tau ^i \alpha ^i M^i W^i}{(i!)^2}. \end{aligned}$$
(3.1)

Let \(\text {Diag}(a_1,a_2,\ldots ,a_n)\) be the \(n\times n\) diagonal matrix with diagonal elements \(\{a_1,a_2,\ldots ,a_n\}\). Then, using (3.1), we can express \(e^{\tau M} e^{\alpha W}y\) in the following way,

$$\begin{aligned} e^{\tau M}e^{\alpha W}y&\approx \sum _{i=1}^q \sum _{j=0}^{i-1}\frac{\tau ^i \alpha ^j M^i W^j}{i!j!}y+\sum _{i=1}^q \sum _{j=0}^{i-1}\frac{\tau ^j \alpha ^i M^j W^i}{i!j!}y+\sum _{i=0}^{q}\frac{\tau ^i \alpha ^i M^i W^i}{(i!)^2}y \nonumber \\&=Y_1D_1\nu _1(\alpha , \tau )+Y_2D_2\nu _2(\alpha , \tau )+Y_3D_3\nu _3(\alpha , \tau ), \end{aligned}$$
(3.2)

where \(Y_1\) and \(Y_2\) are \(n\times \frac{q(q+1)}{2}\) matrices, \(Y_3\) is an \(n \times (q+1)\) matrix, \(D_1\) and \(D_2\) are \(\frac{q(q+1)}{2} \times \frac{q(q+1)}{2}\) matrices, \(D_3\) is a \((q+1) \times (q+1)\) matrix, \(\nu _1(\alpha , \tau )\) and \(\nu _2(\alpha , \tau )\) are \(\frac{q(q+1)}{2} \times 1\) vectors, and \(\nu _3(\alpha , \tau )\) is a \((q+1) \times 1\) vector. The terms in (3.2) are

$$\begin{aligned}&Y_1=\left[ My, M^2y, M^2Wy, M^3y, M^3Wy, M^3W^2y, \ldots , M^qy, M^qWy, \ldots , M^qW^{q-1}y\right] , \end{aligned}$$
(3.3)
$$\begin{aligned}&Y_2=\left[ Wy, W^2y, MW^2y, W^3y, MW^3y, M^2W^3y, \ldots , W^qy, MW^qy, \ldots , M^{q-1}W^qy\right] , \end{aligned}$$
(3.4)
$$\begin{aligned}&Y_3=\left[ y, MWy, M^2W^2y, \ldots , M^qW^qy\right] , \end{aligned}$$
(3.5)
$$\begin{aligned}&D_1=D_2=\text {Diag}\left( \frac{1}{0!1!}, \frac{1}{0!2!},\frac{1}{1!2!},\ldots ,\frac{1}{0!q!},\ldots ,\frac{1}{(q-1)!q!}\right) , \end{aligned}$$
(3.6)
$$\begin{aligned}&D_3=\text {Diag}\left( \frac{1}{(0!)^2},\frac{1}{(1!)^2},\frac{1}{(2!)^2},\ldots ,\frac{1}{(q!)^2}\right) , \end{aligned}$$
(3.7)
$$\begin{aligned}&\nu _1(\alpha , \tau )=\left[ \tau , \tau ^2, \tau ^2\alpha , \tau ^3, \tau ^3\alpha , \tau ^3\alpha ^2, \ldots , \tau ^q, \tau ^q\alpha , \ldots , \tau ^q\alpha ^{q-1}\right] ^{'}, \end{aligned}$$
(3.8)
$$\begin{aligned}&\nu _2(\alpha , \tau )=\left[ \alpha , \alpha ^2, \alpha ^2\tau ,\alpha ^3, \alpha ^3\tau , \alpha ^3\tau ^2,\ldots , \alpha ^q, \alpha ^q\tau ,\ldots , \alpha ^q\tau ^{q-1}\right] ^{'}, \end{aligned}$$
(3.9)
$$\begin{aligned}&\nu _3(\alpha , \tau )=\left[ 1, \tau \alpha , \tau ^2 \alpha ^2, \tau ^3 \alpha ^3, \ldots , \tau ^q \alpha ^q\right] ^{'}. \end{aligned}$$
(3.10)
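
The following sketch illustrates the idea behind (3.2)–(3.10): the columns \(M^iW^jy\) are precomputed once through repeated matrix-vector products, after which \(e^{\tau M}e^{\alpha W}y\) can be evaluated for any \((\alpha ,\tau )\) by cheap weighted sums. For clarity the code evaluates the truncated double sum directly; the split into \(Y_1D_1\nu _1\), \(Y_2D_2\nu _2\) and \(Y_3D_3\nu _3\) in (3.2) groups the same \((q+1)^2\) columns by \(i>j\), \(i<j\) and \(i=j\). All names are illustrative.

```python
import numpy as np
from math import factorial

def precompute_columns(W, M, y, q=15):
    cols = [y]
    for _ in range(q):
        cols.append(W @ cols[-1])                  # cols[j] = W^j y
    table = [cols]
    for _ in range(q):
        table.append([M @ v for v in table[-1]])   # table[i][j] = M^i W^j y
    return table

def apply_exp(table, alpha, tau, q=15):
    # truncated e^{tau M} e^{alpha W} y, with no further products by W or M
    out = np.zeros_like(table[0][0], dtype=float)
    for i in range(q + 1):
        for j in range(q + 1):
            out += tau ** i * alpha ** j / (factorial(i) * factorial(j)) \
                * table[i][j]
    return out
```

With \(q=15\), the table stores \((q+1)^2=256\) vectors of length n, exactly the combined number of columns of \(Y_1\), \(Y_2\) and \(Y_3\), and each evaluation costs only \(O(nq^2)\) flops.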

Next, we show how the matrix-vector products approach can be used to get an approximation of \(e^{\tau M}X\). Let \(X=[X_1,X_2,\ldots ,X_k]\), where \(X_i\) is the ith column. Then, we can write \(e^{\tau M}X\) as

$$\begin{aligned} e^{\tau M}X&=\left[ e^{\tau M}X_1, e^{\tau M}X_2,\ldots , e^{\tau M}X_k\right] \approx {\mathbb {X}}D_4\mu (\tau ), \end{aligned}$$
(3.11)

where \({\mathbb {X}}\) is an \(n \times k(q+1)\) matrix, \(D_4\) is a \(k(q+1) \times k(q+1)\) matrix and \(\mu (\tau )\) is a \(k(q+1) \times k\) matrix. It can be shown that

$$\begin{aligned}&{\mathbb {X}}=\left[ X_1, MX_1, M^2X_1,\ldots , M^qX_1, X_2, MX_2, M^2X_2, \ldots , M^qX_2, \ldots , X_k, MX_k, M^2X_k, \ldots ,\ M^qX_k\right] , \end{aligned}$$
(3.12)
$$\begin{aligned}&D_4=I_k\otimes \text {Diag}\left( \frac{1}{0!},\frac{1}{1!},\frac{1}{2!},\ldots ,\frac{1}{q!} \right) , \end{aligned}$$
(3.13)
$$\begin{aligned}&\mu (\tau )=I_k\otimes \left[ 1,\tau ,\tau ^2,\ldots ,\tau ^q\right] ^{'}. \end{aligned}$$
(3.14)
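
A corresponding sketch for (3.11)–(3.14) precomputes the blocks \(M^iX\) once and recovers \(e^{\tau M}X\) for any \(\tau\) by weighting them with \(\tau ^i/i!\); again, the names are ours.

```python
import numpy as np
from math import factorial

def precompute_MX(M, X, q=15):
    blocks = [X]
    for _ in range(q):
        blocks.append(M @ blocks[-1])   # blocks[i] = M^i X, an n x k matrix
    return blocks

def apply_exp_X(blocks, tau, q=15):
    out = np.zeros_like(blocks[0], dtype=float)
    for i in range(q + 1):
        out += (tau ** i / factorial(i)) * blocks[i]
    return out
```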

Next, we discuss how the matrix-vector products approach can be applied to the estimators given in Sect. 2. Recall that the QMLE is defined by \({\hat{\psi }}={{\,\mathrm{argmin}\,}}_{\psi }\left( y^{'}e^{\alpha W^{'}}e^{\tau M^{'}}H(\tau )e^{\tau M}e^{\alpha W}y\right)\), where the objective function requires the evaluation of \(e^{\tau M}e^{\alpha W}y\) and \(e^{\tau M}X\) in each iteration. Using the matrix-vector products approach, we can avoid computing these terms in each iteration of the optimization routine: we define \({\mathbb {X}}\), \(Y_i\), and \(D_j\) for \(i\in \{1,2,3\}\) and \(j\in \{1,2,3,4\}\) once, and then supply these terms as inputs to the objective function in the optimization solver.
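
The following sketch wires the precomputed tables into the QML optimization, using the illustrative helpers from the previous two sketches: the tables are built once outside the solver, so each evaluation of the objective in (2.6) involves only cheap recombinations.

```python
import numpy as np
from scipy.optimize import minimize

def qml_mvp(W, M, X, y, q=15):
    table  = precompute_columns(W, M, y, q)   # for e^{tau M} e^{alpha W} y
    blocks = precompute_MX(M, X, q)           # for e^{tau M} X

    def objective(psi):
        alpha, tau = psi
        z  = apply_exp(table, alpha, tau, q)
        Xt = apply_exp_X(blocks, tau, q)
        beta_hat, *_ = np.linalg.lstsq(Xt, z, rcond=None)
        r = z - Xt @ beta_hat
        return r @ r

    return minimize(objective, x0=np.zeros(2), method="Nelder-Mead")
```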

In the case of the GMM approach, the vector of moment functions in (2.10) contains the disturbance terms \(\epsilon (\gamma )=e^{\tau M}e^{\alpha W}y-e^{\tau M}X\beta\), which can also be expressed via the matrix-vector products approach using (3.2) and (3.11). When implementing (2.11), similarly to the case of the QMLE, we can define \({\mathbb {X}}\), \(Y_i\), and \(D_j\) for \(i\in \{1,2,3\}\) and \(j\in \{1,2,3,4\}\), and then supply these terms as inputs to the objective function in the optimization solver.

Finally, in the case of the Bayesian approach, we need to work with the expressions in Algorithm 1. Before implementing the Gibbs sampler described in Algorithm 1, we can compute the required terms for the matrix exponential terms and then pass these terms to the sampler. That is, we can define \({\mathbb {X}}\), \(Y_i\), and \(D_j\) for \(i\in \{1,2,3\}\) and \(j\in \{1,2,3,4\}\), and then supply these terms to the Gibbs sampler. Thus, we can avoid the computation of these terms in each pass of the Gibbs sampler.

Remark 2

The matrix-vector products approach is general and can easily be adapted to other MESS models. For example, the expressions for two special cases, MESS(1, 0) and MESS(0, 1), can be obtained simply by setting M and W to the zero matrix, respectively. Similarly, the Durbin versions of MESS(1, 1), MESS(1, 0) and MESS(0, 1) can also be estimated by defining the independent variables appropriately. The matrix-vector products approach also applies to MESS models for panel data by arranging the terms involving the matrix exponential terms appropriately. For higher order MESS models, or MESS(p, q), equations similar to (3.1)–(3.14) can be derived. For example, for the MESS(2, 2), we have

$$\begin{aligned}&e^{\tau _1 M_1+\tau _2M_2}e^{\alpha _1W_1+\alpha _2W_2}y \approx \sum _{i=0}^{q}\frac{1}{i!}\left( \tau _1M_1+\tau _2M_2\right) ^i\sum _{j=0}^{q}\frac{1}{j!}\left( \alpha _1W_1+\alpha _2W_2\right) ^jy\nonumber \\&\quad =\sum _{i=1}^q\sum _{j=0}^{i-1}\left( \frac{1}{i!j!}\sum _{k_1=0}^i\sum _{k_2=0}^j\left( {\begin{array}{c}i\\ k_1\end{array}}\right) \left( {\begin{array}{c}j\\ k_2\end{array}}\right) \tau _1^{i-k_1}\tau _2^{k_1}\alpha _1^{j-k_2}\alpha _2^{k_2}M_1^{i-k_1}M_2^{k_1}W_1^{j-k_2}W_2^{k_2}\right. \nonumber \\&\qquad +\left. \frac{1}{i!j!}\sum _{k_1=0}^j\sum _{k_2=0}^i\left( {\begin{array}{c}j\\ k_1\end{array}}\right) \left( {\begin{array}{c}i\\ k_2\end{array}}\right) \tau _1^{j-k_1}\tau _2^{k_1}\alpha _1^{i-k_2}\alpha _2^{k_2}M_1^{j-k_1}M_2^{k_1}W_1^{i-k_2}W_2^{k_2}\right) y\nonumber \\&\qquad +\sum _{i=0}^q\frac{1}{(i!)^2}\sum _{k_1=0}^i\sum _{k_2=0}^i\left( {\begin{array}{c}i\\ k_1\end{array}}\right) \left( {\begin{array}{c}i\\ k_2\end{array}}\right) \tau _1^{i-k_1}\tau _2^{k_1}\alpha _1^{i-k_2}\alpha _2^{k_2}M_1^{i-k_1}M_2^{k_1}W_1^{i-k_2}W_2^{k_2}y\nonumber \\&\quad ={\mathbb {Y}}_1{\mathbb {D}}_1{\omega }_1(\alpha ,\tau )+{\mathbb {Y}}_2{\mathbb {D}}_2{\omega }_2(\alpha ,\tau )+{\mathbb {Y}}_3{\mathbb {D}}_3{\omega }_3(\alpha ,\tau ), \end{aligned}$$
(3.15)

where \({\mathbb {Y}}_l\), \({\mathbb {D}}_l\) and \({\omega }_l\) for \(l\in \{1,2,3\}\) can be defined accordingly; the binomial expansions above require the weights matrices within each exponential to commute, i.e., \(M_1M_2=M_2M_1\) and \(W_1W_2=W_2W_1\). In the case of \(e^{\tau _1 M_1+\tau _2M_2}X\), we have

$$\begin{aligned}&e^{\tau _1 M_1+\tau _2M_2}X\approx \left[ \sum _{i=0}^{q}\frac{1}{i!}(\tau _1M_1+\tau _2M_2)^iX_1,\ldots ,\sum _{i=0}^{q}\frac{1}{i!}(\tau _1M_1+\tau _2M_2)^iX_k \right] \nonumber \\&\quad =\left[ \sum _{i=0}^{q}\frac{1}{i!}\sum _{j=0}^i\left( {\begin{array}{c}i\\ j\end{array}}\right) (\tau _1M_1)^{i-j}(\tau _2M_2)^{j}X_1,\ldots , \sum _{i=0}^{q}\frac{1}{i!}\sum _{j=0}^i\left( {\begin{array}{c}i\\ j\end{array}}\right) (\tau _1M_1)^{i-j}(\tau _2M_2)^{j}X_k \right] \nonumber \\&\quad ={\mathbb {X}}\,{\mathbb {D}}_4\kappa (\tau ), \end{aligned}$$
(3.16)

where

$$\begin{aligned}&{\mathbb {X}}=\left[ X_1, M_1X_1, M_2X_1, M_1^2X_1, M_1M_2X_1, M_2^2X_1,\ldots , M_1^qX_1, M_1^{q-1}M_2X_1,\ldots ,M_1M_2^{q-1}X_1, M_2^qX_1,\right. \nonumber \\&\qquad \left. \ldots , X_k, M_1X_k, M_2X_k, M_1^2X_k, M_1M_2X_k, M_2^2X_k,\ldots , M_1^qX_k, M_1^{q-1}M_2X_k,\ldots ,M_1M_2^{q-1}X_k, M_2^qX_k\right] ,\nonumber \\&{\mathbb {D}}_4=I_k\otimes \text {Diag}\left( \frac{\left( {\begin{array}{c}0\\ 0\end{array}}\right) }{0!},\frac{\left( {\begin{array}{c}1\\ 0\end{array}}\right) }{1!},\frac{\left( {\begin{array}{c}1\\ 1\end{array}}\right) }{1!},\ldots ,\frac{\left( {\begin{array}{c}q\\ 0\end{array}}\right) }{q!},\frac{\left( {\begin{array}{c}q\\ 1\end{array}}\right) }{q!},\ldots ,\frac{\left( {\begin{array}{c}q\\ q\end{array}}\right) }{q!}\right) ,\nonumber \\&\kappa (\tau )=I_k\otimes \left[ 1,\tau _1,\tau _2,\tau _1^2,\tau _1\tau _2,\tau _2^2,\ldots ,\tau _1^q,\tau _1^{q-1}\tau _2,\ldots ,\tau _1\tau _2^{q-1},\tau _2^q\right] ^{'}. \end{aligned}$$

Remark 3

The matrix-vector products approach with a large q value can provide better accuracy, but it also increases the computation time. We may determine a satisfactory q value from the inverse error analysis for a scaling and squaring method with Taylor series approximation (the third method in Moler and Van Loan (1978, 2003)). Let \(T_q(\alpha W)=\sum _{j=0}^q \frac{\alpha ^j W^j}{j!}\). Since \(e^{\alpha W}=\left( e^{2^{-m}\alpha W}\right) ^{2^m}\), we may consider the approximation \(\left( T_q(2^{-m}\alpha W)\right) ^{2^m}\). By Corollary 1 in Moler and Van Loan (1978, 2003), if \(\Vert 2^{-m}\alpha W\Vert _{\infty }\le 0.5\), then \(\left( T_q(2^{-m}\alpha W)\right) ^{2^m}=e^{\alpha W+E}\), where

$$\begin{aligned} \frac{\Vert E\Vert _{\infty }}{\Vert \alpha W\Vert _{\infty }}\le \left( \frac{1}{2}\right) ^{q-3}\frac{1}{q+1}\approx {\left\{ \begin{array}{ll} 1.5\times 10^{-5}\quad \,\,\text {if}\quad q=15\\ 3.6\times 10^{-7}\quad \,\,\text {if}\quad q=20\\ 2.4\times 10^{-10}\quad \text {if}\quad q=30 \end{array}\right. } . \end{aligned}$$

This inverse error analysis indicates that a value of \(q=15\) can be satisfactory.
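
The quoted values can be verified directly from the bound:

```python
for q in (15, 20, 30):
    print(q, 0.5 ** (q - 3) / (q + 1))   # 1.5e-05, 3.6e-07, 2.4e-10
```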

Remark 4

In terms of reliability, stability and accuracy, researchers should be mindful of the choice of q in the Taylor series expansion of the matrix exponential terms. Remark 3 indicates that \(q=15\) can be satisfactory for the estimation of MESS-type models in economics, which is also confirmed by the results from our simulations and the empirical illustration for the MESS(1, 1) model. In terms of efficiency and storage requirements, the representation in (3.2) requires calculating and storing the terms in (3.3)–(3.10). The largest of these terms are (3.3) and (3.4), which are of size \(n\times q(q+1)/2\). Note also that the terms in (3.3)–(3.5) are computed sequentially, with one matrix-vector product per pass. Therefore, (3.3) and (3.4) are computed in \(O(n^2 q(q+1)/2)\) operations for dense W and M. In terms of simplicity, note again that the representation in (3.2) needs to be derived and its terms defined before the estimation. This can be tedious for higher order MESS-type models, but we think that the computational advantage of the matrix-vector products method outweighs this cost.

Monte Carlo simulations

In this section, we will investigate the implications of using the matrix-vector products method with the truncated Taylor series approximation versus the scaling and squaring algorithm with the Padé approximation (the expm function in MATLAB R2020b). To this end, we will explore the properties of the two competing methods in terms of computation time as well as their effects on the finite sample properties of the estimators described in Sect. 2.

We consider the following data generating process,

$$\begin{aligned} e^{\alpha W}y=\beta _1X_1+\beta _2X_2+u, \quad e^{\tau W}u=\epsilon , \end{aligned}$$
(4.1)

where the elements of \(X_1\) and \(X_2\) are independently drawn from \(U(0,\sqrt{12})\) and N(0, 1), respectively. For the spatial weights matrix \(W=(w_{ij})\), we consider two cases, rook contiguity and queen contiguity. To this end, n spatial units are randomly allocated on a \(\sqrt{n}\times \sqrt{n}\) square lattice. In the rook contiguity case, \(w_{ij} = 1\) if the jth observation is adjacent (left/right/above or below) to the ith observation on the lattice. In the queen contiguity case, \(w_{ij}=1\) if the jth observation is adjacent to the ith observation horizontally, vertically or diagonally, i.e., shares an edge or a corner with it. The weights matrices are then row normalized. We set \((\beta _1, \beta _2)^{'}=(2,1)^{'}\), and let \(\alpha\) and \(\tau\) take values from \(\{-2, -0.2, 0.2, 2\}\). The disturbance terms are generated according to \(\epsilon _i\sim \text {i.i.d.}\, N(0,1)\). We consider two sample sizes, \(n=169\) and \(n=361\). For the QMLE and GMME, we set the number of repetitions to 1000. In the case of the Bayesian estimator, we choose the following priors: \(\alpha \sim N(0, 10)\), \(\tau \sim N(0, 10)\), \(\beta \sim N(0_{2\times 1}, I_2)\), and \(\sigma ^2 \sim IG(6/2, 4/2)\). We set the number of repetitions to 100, the number of draws to 1500 and the number of burn-ins to 500. We use the matrix-vector products approach with \(q=15\) in all cases.
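
For concreteness, a hedged sketch of this data generating process on a rook contiguity lattice is given below; the helper names are ours, and expm is used only once per resample to form the reduced form.

```python
import numpy as np
from scipy.linalg import expm

def rook_weights(side):
    # side x side lattice; w_ij = 1 for left/right/above/below neighbours
    n = side * side
    W = np.zeros((n, n))
    for r in range(side):
        for c in range(side):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < side and 0 <= cc < side:
                    W[r * side + c, rr * side + cc] = 1.0
    return W / W.sum(axis=1, keepdims=True)   # row normalization

def generate(side=13, alpha=-0.2, tau=0.2, seed=0):
    rng = np.random.default_rng(seed)
    W = rook_weights(side)                    # n = 169 when side = 13
    n = W.shape[0]
    X = np.column_stack([rng.uniform(0, np.sqrt(12), n),
                         rng.standard_normal(n)])
    eps = rng.standard_normal(n)
    u = expm(-tau * W) @ eps                  # from e^{tau W} u = epsilon
    y = expm(-alpha * W) @ (X @ np.array([2.0, 1.0]) + u)  # (4.1)
    return W, X, y
```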

We use the matrix-vector products method (denoted as mvp) and the scaling and squaring algorithm with the Padé approximation (denoted as expm) to obtain the parameter estimates, and the corresponding bias, root mean squared error (RMSE) and coverage rate. We also compute the impact measures, which include the average direct effect, the average total effect and the average indirect effect, and their respective bias, RMSE and empirical coverage. We report the total computation time in seconds over 1000 resamples.Footnote 9 In the case of the QMLE, the computation time for \({\hat{\psi }}=({\hat{\alpha }}, {\hat{\tau }})^{'}\) includes the time to compute the estimates using the concentrated log-likelihood function in (2.5). The computation time for \({\hat{\beta }}\) includes the time for \({\hat{\psi }}\) and for \({\hat{\beta }}\), which is computed using (2.3). The time for the impact measures is consequently the sum of the computation times for \({\hat{\psi }}\), \({\hat{\beta }}\) and the respective measures. In the case of the BGMME, an initial GMM estimation is carried out to construct \(V^{*}\), \(F^{*}\) and the \(P^{*}\)’s. We use the following set of moments for the initial stage: \((W\epsilon (\gamma ), M\epsilon (\gamma ), WX, X)^{'}\epsilon (\gamma )\). Thus, the computation time for the BGMME includes both stages of the estimation. The time for the impact measures is then computed by adding the corresponding computation time for the impact measures. In the case of Bayesian estimation, the computation time includes the time for collecting 2000 draws including the burn-ins. Finally, the computation time for the impact measures is computed similarly to those in the cases of the QMLE and GMME.

We focus on the simulation results provided in Tables 1, 2, 3, 4 and 5.Footnote 10 Tables 1, 2 and 3 report the simulation results for the QMLE case. The mvp method reduces the computation time by approximately 98 to \(99\%\) compared to the expm method for different sample sizes, while providing the same estimates for the parameters and the impact measures. In all cases, we obtain the same values for bias, RMSE and coverage rates under both methods. When \(n=169\), the computation time for the mvp approach is about \(2\%\) of the computation time for the expm method. For example, when \(\alpha = -2\) and \(\tau = -2\), the bias, RMSE and coverage of \({\hat{\alpha }}\) using both methods are 0.003, 0.037 and 0.923, respectively. However, the computation time is 606.5 s for the expm method, and 11.8 s for the mvp method. This means that on average, each computation takes 0.6065 s using the expm method, and 0.0118 s using the mvp method. For \(n=361\), the bias, RMSE and coverage of \({\hat{\alpha }}\) are again the same for both methods, but the computation time is 4168.4 s for the expm method, and 30.7 s for the mvp method, leading to average running times of 4.1684 and 0.0307 s, respectively. These results show that the matrix-vector products method reduces the computational burden significantly, while maintaining the finite sample properties of the estimators. Table 2 presents the results for \({\hat{\beta }}=({\hat{\beta }}_1, {\hat{\beta }}_2)^{'}\). The findings are similar to those from Table 1. While providing a similar performance in terms of the finite sample properties of the QMLE, the mvp method reduces the computation time by about \(98\%\) to \(99\%\) compared to the scaling and squaring algorithm. Table 3 presents the results for the average direct effect estimates for \(X_1\) and \(X_2\). There is no difference between the two competing methods in terms of bias, RMSE and coverage, but we again observe that the mvp method is computationally more efficient than the expm method.

Table 4 presents the results for \({\hat{\alpha }}\) and \({\hat{\tau }}\) in the GMM estimation case. The findings are very similar to the findings in the QML estimation case. The mvp method reduces the computation time by about \(95\%\) for \(n=169\) and \(97\%\) for \(n=361\) compared to the expm method. For example, when \(n=169\), the computation time for \({\hat{\gamma }}\) is 1407.9 s using the expm method, and 60.6 s using the mvp method. It means that on average, one resample takes 1.4079 s to run for the expm method, and 0.0606 s to run for the mvp method. When \(n=361\), they are respectively 8663.3 s and 219.6 s, leading to an average time of 8.6633 s and 0.2196 s to run each resample.

Table 5 presents the results for \({\hat{\alpha }}\) and \({\hat{\tau }}\) in the Bayesian estimation case. The mvp method again provides a substantial computational advantage over the expm method, while maintaining the same finite sample properties. It takes only about \(1\%\) of the expm method's total computation time when \(n=169\), and less than \(1\%\) when \(n=361\). For example, for \(\alpha =-2\), \(\tau =-2\) and \(n=169\), it takes 2779.4 s for the expm method, and 20.4 s for the mvp method to compute \({\hat{\alpha }}\). Thus, on average it takes respectively 27.794 s and 0.204 s to collect draws for one resample. When \(n=361\), the computation time for \({\hat{\alpha }}\) using the expm method increases approximately sevenfold to 19741.3 s, whereas the computation time for the mvp method increases by only about \(27\%\) to 25.9 s. Thus, it takes an average of 197.413 s to collect draws for each resample using the expm method, and 0.259 s using the mvp method.

Table 1 The QMLE results for \({\hat{\alpha }}\) and \({\hat{\tau }}\)
Table 2 The QMLE results for \({\hat{\beta }}_1\) and \({\hat{\beta}}_2\)
Table 3 The QMLE results of average direct effects for \(X_1\) and \(X_2\)
Table 4 The GMME results for \({\hat{\alpha }}\) and \({\hat{\tau }}\)
Table 5 The Bayesian results for \({\hat{\alpha }}\) and \({\hat{\tau }}\)

An empirical illustration

In this section, we illustrate the computational time advantage of the matrix-vector products method using an empirical application. To this end, we use an example from Pace and Barry (1997) on the US presidential election in 1980. The dataset contains variables on the election results and county characteristics for 3107 US counties. In our model, the outcome variable is the (logged) proportion of voting age population that voted in the election (\(Y=\text {Vote}\)). The explanatory variables include the log percentage of population with a twelfth grade or higher education (\(X_1=\text {Educ}\)), the log percentage of population with homeownership (\(X_2=\text {Homeowners}\)), and the log per capita income (\(X_3=\text {Income}\)). We consider the following MESS(1, 1) specification

$$\begin{aligned} e^{\alpha W}Y=\beta _0 l_n+\beta _1X_1+\beta _2X_2+\beta _3X_3+u, \quad e^{\tau W}u=\epsilon . \end{aligned}$$
(5.1)

The spatial weights matrix W is a contiguity-based weights matrix constructed using the latitude and longitude of the counties. We estimate this specification using the QML, GMM and Bayesian methods. For the Bayesian method, the same priors as those in the MC simulations are used. We use the expm and mvp methods to compute the estimation results and record the corresponding computation times. In the case of the mvp method, the truncation order q is set to 15.

The parameter estimation results are summarized in Table 6. While the mvp method provides the same parameter estimates and standard errors (in parentheses) as those obtained from the expm method, the computation time is significantly shorter for the mvp method. In the QML case, the computation time is 1072.4 s for the expm method, but only 6.0 s for the mvp method. In the GMM case, the computation takes 4805.8 s for the expm method, and 21.1 s for the mvp method. For the Bayesian estimation, the number of draws is set to 1500 and the first 500 draws are discarded as burn-ins. The results show that the difference in computation time is largest for the Bayesian estimation case, costing 47742.3 s (13.26 h) for the expm method, and 11.4 s for the mvp method. Overall, these results show that the matrix-vector products method is not only useful in Monte Carlo simulations, where resamples are drawn hundreds or thousands of times, but also in empirical applications with large sample sizes.

Table 6 The parameter estimates of the presidential election voting example

To investigate the impact measures, we also compute the average direct effects, average total effects and average indirect effects discussed in Sect. 2.5, their measures of dispersion, and the computation times. The results are summarized in Table 7. The reported dispersion measures are calculated by the delta method for the QML and GMM methods. In the case of the Bayesian estimator, the standard deviation of the posterior draws is used. The results show that the corresponding impact measures are very similar across the QML, GMM and Bayesian methods. For example, the average direct effect estimate for Educ is 0.320 using the QML method, 0.305 using the GMM method and 0.320 using the Bayesian method. However, in terms of computation time, the mvp method takes much less time than the expm method. For example, the computation times for the expm method are 1088.2, 4822.3 and 71584.0 s for the QML, GMM and Bayesian methods, respectively. The corresponding computation times for the mvp method are 22.0, 37.6 and 24491.0 s.Footnote 11

Table 7 The impact measures of the presidential election voting example

Conclusion

The MESS-type models provide a class of alternatives to SAR-type models with attractive properties. However, the estimation of these models requires the computation of matrix exponential terms in each iteration of a numerical optimization scheme, which can be computationally costly. In this paper, for the calculation of the matrix exponential terms, we propose a matrix-vector products method based on the truncation of the Taylor series expansion of the matrix exponential terms. Because the estimation of MESS-type models requires the computation of matrix exponential terms applied to vectors, rather than the matrix exponential terms in isolation, our approach provides an efficient alternative to the default method available in several popular statistical software packages. The results from our extensive simulation study and empirical illustration confirm the computational time gains from the matrix-vector products method for three estimation methods for MESS-type models (i.e., the QML, GMM and Bayesian methods).

Notes

  1.

    See Debarsy et al. (2015) for the formal results on the maximum likelihood (ML) and generalized method of moments (GMM) estimation of the MESS models.

  2.

    On the scaling and squaring method with either the rational Padé or Taylor approximants, see also Higham (2005) and Bader et al. (2019). Sidje (1998) provides an extensive package named Expokit (both Fortran and MATLAB versions are available) to compute matrix exponential terms with the Krylov subspace method using the Arnoldi process approximation.

  3.

    The computation of matrix exponential terms may be necessary for many applications from different fields. For example, MATLAB uses its expm function in its Control Toolbox, System Identification Toolbox, Neural Net Toolbox, Mu-Analysis and Synthesis Toolbox, Model Predictive Control toolbox, and Simulink.

  4.

    On the properties of MESS and SAR type models, see Debarsy et al. (2015), Kelejian and Prucha (2010), Lee (2004), LeSage and Pace (2009), LeSage and Pace (2007).

  5.

    Note that under heteroskedasticity of an unknown form, the QMLE is still consistent and has an asymptotically normal distribution when W and M commute, i.e., \(WM=MW\) (Debarsy et al. 2015).

  6.

    Among other consistent estimators, the following initial GMME can be used: \({\hat{\gamma }}={{\,\mathrm{argmin}\,}}_{\gamma }g^{'}(\gamma )g(\gamma )\), where \(g(\gamma )=(W\epsilon (\gamma ), M\epsilon (\gamma ), WX, X)^{'}\epsilon (\gamma )\) and \(\epsilon (\gamma )=e^{\tau M}\left( e^{\alpha W}y-X\beta \right)\). A code sketch of this estimator is given after these notes.

  7.

    We use \(p(\cdot )\) to denote the relevant density function, and we omit X from the conditioning sets for simplicity of exposition.

  8.

    The tuning parameter is determined during the estimation such that the acceptance rate falls between \(40\%\) and \(60\%\); a schematic tuning rule is sketched after these notes.

  9.

    We use a 2016 MacBook Pro with a 2.4 GHz Intel Core i7 processor and 8 GB of 1867 MHz LPDDR3 memory to run our simulations.

  10.

    Some additional results are given in Tables 8, 9, 10, 11, 12, 13, 14, 15, 16 and 17 of Appendix A. The simulation results in these additional tables also confirm that our suggested mvp method is computationally more efficient than the expm method.

  11.

    Note that the impact measures involve traces of \(n\times n\) matrices. For example, the average direct effect equals \(\frac{1}{n}\mathrm {tr}(e^{-{\hat{\alpha }}W_n}{\hat{\beta }}_{k})\) for \(k=1, 2\) and 3, so we need to compute the trace of an \(n\times n\) matrix for each draw through Algorithm 1 in the Bayesian method. This is why the time to compute the impact measures is relatively long in the Bayesian case. The time to sample the parameter draws, however, is much shorter, as shown in Table 6.
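
The initial GMME in Note 6 can be implemented with the same matrix-vector products machinery. The following is a minimal sketch under assumptions made here: a dense regressor matrix X, sparse weights matrices W and M, the parameter vector packed as \(\gamma =(\alpha ,\tau ,\beta ')'\), a derivative-free optimizer, and the expm_times_vector helper from the sketch in the Conclusion. It is not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def initial_gmme(y, X, W, M, q=20):
    """Sketch of the initial GMME in Note 6: minimize g'(gamma) g(gamma), with
    g(gamma) = (W eps, M eps, WX, X)' eps  and
    eps(gamma) = e^{tau M} (e^{alpha W} y - X beta).
    """
    WX = W @ X  # fixed instrument block, computed once

    def objective(gamma):
        alpha, tau, beta = gamma[0], gamma[1], gamma[2:]
        u = expm_times_vector(alpha, W, y, q) - X @ beta
        eps = expm_times_vector(tau, M, u, q)
        Q = np.column_stack([W @ eps, M @ eps, WX, X])  # instrument matrix
        g = Q.T @ eps                                   # empirical moments
        return g @ g

    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]        # OLS starting values
    gamma0 = np.concatenate(([0.0, 0.0], beta0))        # alpha = tau = 0 start
    return minimize(objective, gamma0, method="Nelder-Mead").x
```

Each evaluation of the objective costs only a handful of matrix-vector products, which is precisely where the computational advantage over forming \(e^{\alpha W}\) and \(e^{\tau M}\) explicitly comes from.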

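Note 8 leaves the adaptation rule for the Metropolis-Hastings tuning parameter unspecified. One common scheme, shown below purely as an assumption of this sketch, rescales the random-walk step after each batch of draws until the realized acceptance rate falls within the target band.

```python
def tune_step(step, acc_rate, lo=0.40, hi=0.60, factor=1.1):
    """One possible tuning rule (an assumption, not the authors' rule):
    enlarge the random-walk step when acceptance is too frequent, shrink it
    when too rare, and leave it unchanged inside the [lo, hi] band."""
    if acc_rate > hi:
        return step * factor
    if acc_rate < lo:
        return step / factor
    return step

# Schematic usage inside a Metropolis-Hastings loop: after every batch of,
# say, 100 draws, recompute the realized acceptance rate and update the step:
#   step = tune_step(step, n_accepted / 100)
```
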
References

  1. Al-Mohy AH, Higham NJ (2010) A new scaling and squaring algorithm for the matrix exponential. SIAM J Matrix Anal Appl 31(3):970–989

  2. Anselin L (1988) Spatial econometrics: methods and models. Springer, New York

  3. Arbia G et al (2020) Testing impact measures in spatial autoregressive models. Int Reg Sci Rev 43(1–2):40–75

  4. Bader P, Blanes S, Casas F (2019) Computing the matrix exponential with an optimized Taylor polynomial approximation. Mathematics 7(12):1174

  5. Chiu TYM, Leonard T, Tsui K-W (1996) The matrix-logarithmic covariance model. J Am Stat Assoc 91(433):198–210

  6. Cliff AD, Ord JK (1969) The problem of spatial autocorrelation. In: Scott AJ (ed) London Papers in Regional Science 1. Studies in Regional Science. Pion, London, pp 25–55

  7. Cliff AD, Ord JK (1973) Spatial autocorrelation. Pion, London

  8. Debarsy N, Jin F, Lee L (2015) Large sample properties of the matrix exponential spatial specification with an application to FDI. J Econom 188(1):1–21

  9. Elhorst JP (2014) Spatial econometrics: from cross-sectional data to spatial panels. SpringerBriefs in Regional Science. Springer, Berlin, Heidelberg

  10. Gallopoulos E, Saad Y (1992) Efficient solution of parabolic equations by Krylov approximation methods. SIAM J Sci Stat Comput 13(5):1236–1264

  11. Higham NJ (2005) The scaling and squaring method for the matrix exponential revisited. SIAM J Matrix Anal Appl 26(4):1179–1193

  12. Hochbruck M, Lubich C (1997) On Krylov subspace approximations to the matrix exponential operator. SIAM J Numer Anal 34(5):1911–1925

  13. Horn RA, Johnson CR (2012) Matrix analysis. Cambridge University Press, Cambridge

  14. Kelejian HH, Prucha IR (2010) Specification and estimation of spatial autoregressive models with autoregressive and heteroskedastic disturbances. J Econom 157(1):53–67

  15. Lee L-F (2004) Asymptotic distributions of quasi-maximum likelihood estimators for spatial autoregressive models. Econometrica 72(6):1899–1925

  16. Leonard T, Hsu JSJ (1992) Bayesian inference for a covariance matrix. Ann Stat 20(4):1669–1696

  17. LeSage J, Pace RK (2007) A matrix exponential spatial specification. J Econom 140(1):190–214

  18. LeSage J, Pace RK (2009) Introduction to spatial econometrics. CRC Press, Boca Raton

  19. Moler C, Van Loan C (1978) Nineteen dubious ways to compute the exponential of a matrix. SIAM Rev 20(4):801–836

  20. Moler C, Van Loan C (2003) Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Rev 45(1):3–49

  21. Pace RK, Barry R (1997) Quick computation of spatial autoregressive estimators. Geogr Anal 29(3):232–247

  22. Saad Y (1992) Analysis of some Krylov subspace approximations to the matrix exponential operator. SIAM J Numer Anal 29(1):209–228

  23. Sidje RB (1998) Expokit: a software package for computing matrix exponentials. ACM Trans Math Softw 24(1):130–156

  24. Taşpınar S, Doğan O, Vijverberg WPM (2018) GMM inference in spatial autoregressive models. Econom Rev 37(9):931–954

  25. Whittle P (1954) On stationary processes in the plane. Biometrika 41(3/4):434–449

Author information

Correspondence to Ye Yang.

Appendix

A Additional simulation results

In this appendix, we provide additional simulation results on the performance of the expm and mvp methods. Tables 8 and 9 report the simulation results for the average indirect and average total effects of \(X_1\) and \(X_2\) in the QMLE case. Tables 10, 11, 12 and 13 provide additional results for the GMME. The remaining tables, Tables 14, 15, 16 and 17, report additional simulation results for the Bayesian estimator. Overall, the simulation results in these tables confirm that the mvp method is computationally more efficient than the expm method.

Table 8 The QMLE results of average indirect effects for \(X_1\) and \(X_2\)
Table 9 The QMLE results of average total effects for \(X_1\) and \(X_2\)
Table 10 The GMME results for \({\hat{\beta }}_1\) and \({\hat{\beta }}_2\)
Table 11 The GMME results of average direct effects for \(X_1\) and \(X_2\)
Table 12 The GMME results of average indirect effects for \(X_1\) and \(X_2\)
Table 13 The GMME results of average total effects for \(X_1\) and \(X_2\)
Table 14 The Bayesian results for \({\hat{\beta }}_1\) and \({\hat{\beta }}_2\)
Table 15 The Bayesian results of average direct effects for \(X_1\) and \(X_2\)
Table 16 The Bayesian results of average indirect effects for \(X_1\) and \(X_2\)
Table 17 The Bayesian results of average total effects for \(X_1\) and \(X_2\)

Cite this article

Yang, Y., Doğan, O. & Taşpınar, S. Fast estimation of matrix exponential spatial models. J Spat Econometrics 2, 9 (2021). https://doi.org/10.1007/s43071-021-00015-2

Keywords

  • Matrix exponential
  • MESS
  • QML
  • GMM
  • Bayesian
  • Inference
  • Impact measures

JEL Classification

  • C13
  • C21
  • C31