1 Introduction

Progress in spatial econometrics has been stimulated by the increasing availability of spatio-temporal data, as shown by Anselin et al. (2008), Elhorst (2014), Pesaran (2015) and Baltagi (2021), among others. Static spatial panel data methods are long established, but dynamic spatial panel data models have more recently come to the fore as highly informative and relevant (see, for example, Parent and LeSage 2012; Yu and Lee 2008; Elhorst 2014). The predominant approach in the spatial panel literature has been to assume that the regressors are exogenous, apart from spatial and/or temporal lags of the dependent variable, and estimation has typically focused on maximum likelihood and related methods. This presents a challenge in the presence of endogenous regressors. The focus of this paper is on dynamic spatial panel models in which endogeneity extends beyond spatial and temporal lags of the dependent variable. Baltagi et al. (2019) provide extensive Monte Carlo simulations demonstrating the consistency of the estimates produced by a spatial version of 'difference generalized method of moments (GMM)', an approach which easily accommodates endogenous regressors.

The main contribution of this paper is to extend the approach of Le Gallo and Páez (2013), who advocated the use of synthetic instruments to eliminate endogeneity bias in cross-sectional regressions, to GMM estimation of dynamic spatial panel models. The intuition behind synthetic instruments is that, because they are based on the invariant topology of the geographic space, they are typically exogenous to the spatio-temporal process being modelled. Moreover, because an endogenous variable almost invariably has a non-random spatial distribution, some inherent dimensions of the topology will often be quite strongly correlated with it. Synthetic instruments derived from these inherent dimensions therefore tend to possess the hard-to-find ideal properties of instrumental variables: exogeneity combined with correlation with the endogenous variable.

In this paper, synthetic instruments help resolve some pitfalls of GMM estimation arising when there is an overabundance of instrumental variables. This can be exacerbated with spatial data, where the suite of instruments might be enhanced by the use of the spatial lags of variables in addition to the variables per se (Kelejian and Prucha 1998, 1999; Pace et al. 2012; Baltagi et al. 2019). Moreover, instruments are often weak, with negligible correlation with the endogenous variables. In contrast, synthetic instruments are invariably strong, with typically very high correlations. By replacing many weak instruments with fewer synthetic instruments, one is likely to obtain more reliable inference. In particular, reducing the number of instruments mitigates problems relating to the crucial Sargan–Hansen test of overidentifying restrictions. An associated problem is the downward bias in parameter standard errors associated with two-step GMM estimation, which causes upward bias in t-ratios. To remedy this, two related finite sample corrections are reported: the well-known Windmeijer correction and the 'HKL' double correction (Hwang et al. 2022), which also allows for overidentification bias. Finally, as an illustration, synthetic instruments and standard error corrections are applied to real data to provide a stronger inferential basis for published research.

2 A dynamic spatial panel model

Consider first the estimation of the simple dynamic model given by Eq. (1),

$$y_{it} = \gamma y_{it - 1} + \rho \sum\limits_{j = 1}^{N} {w_{ij} y_{jt} } + \theta \sum\limits_{j = 1}^{N} {w_{ij} y_{jt - 1} } + \beta_{1} x_{it} + \beta_{2} \tilde{x}_{it} + \varepsilon_{it}; \quad i = 1,\ldots,N,\; t = 1,\ldots,T$$
(1)

in which there are \(N\) regions/locations/individuals and \(T\) time periods, \(x\) is an exogenous variable, \(\tilde{x}\) is an endogenous variable, \(w_{ij}\) is the \(i,j\)th element of an exogenous, time-invariant \(N\) by \(N\) connectivity matrix \({\mathbf{W}}_{N}\), and \(\gamma ,\rho ,\theta ,\beta_{1}\) and \(\beta_{2}\) are parameters to be estimated. The error term is compound, thus

$$\varepsilon_{it} = \mu_{i} + \nu_{it}$$

where \(\mu_{i}\) is a set of individual effects, one for each of the \(N\) regions, controlling for unobserved time-invariant heterogeneity across regions or locations. The term \(\nu_{it}\) varies both by region and by time and represents other, unpredictable, random effects. The assumption is that the \(\mu_{i}\) and \(\nu_{it}\) are random draws from independent and identically distributed distributions, with \(\mu_{i} \sim iid(0,\sigma_{\mu }^{2} )\) and \(\nu_{it} \sim iid(0,\sigma_{\nu }^{2} )\), and with \(\mu_{i}\) and \(\nu_{it}\) independent of each other and among themselves. Given \(\sigma_{\mu }^{2} > 0\), there is interregional heterogeneity, with \(\mu_{i}\) capturing unmodelled individual effects such as physical geography, together with regional variation in other unobserved effects.

A more general specification written in matrix terms is

$${\mathbf{y}}_{t} = {\mathbf{B}}_{N}^{ - 1} {\mathbf{C}}_{N} {\mathbf{y}}_{t - 1} + {\mathbf{B}}_{N}^{ - 1} {\mathbf{x}}_{t} \beta_{1} + {\mathbf{B}}_{N}^{ - 1} {\tilde{\mathbf{x}}}_{t} \beta_{2} + {\mathbf{B}}_{N}^{ - 1} {{\varvec{\upvarepsilon}}}_{t}$$
(2)

in which \({\mathbf{B}}_{N} = \left( {{\mathbf{I}}_{N} - \rho {\mathbf{W}}_{N} } \right)\) and \({\mathbf{C}}_{N} = \left( {\gamma {\mathbf{I}}_{N} + \theta {\mathbf{W}}_{N} } \right)\); \({\mathbf{B}}_{N}\), \({\mathbf{C}}_{N}\) and the identity matrix \({\mathbf{I}}_{N}\) are matrices of dimension \(N\) by \(N\); \(\rho ,\gamma\) and \(\theta\) are scalar coefficients; \({\mathbf{y}}_{t}\) is an \(N\) by 1 vector; \({\mathbf{x}}_{t}\) is an \(N\) by \(k_{1}\) matrix of exogenous regressors with \(\beta_{1}\) a \(k_{1}\) by 1 vector of coefficients; \({\tilde{\mathbf{x}}}_{t}\) is an \(N\) by \(k_{2}\) matrix of endogenous regressors with \(\beta_{2}\) a \(k_{2}\) by 1 vector of coefficients; and \({{\varvec{\upvarepsilon}}}_{t} = {{\varvec{\upmu}}} + {{\varvec{\upnu}}}_{t}\) is an \(N\) by 1 compound error term. Standard assumptions are that \({\mathbf{B}}_{N}\) is non-singular and that \({\mathbf{W}}_{N}\), \({\mathbf{B}}_{N}^{ - 1}\) and the regressors are uniformly bounded in absolute value. The model satisfies stationarity conditions only if the maximum absolute characteristic root of \({\tilde{\mathbf{A}}} = {\mathbf{C}}_{N} {\mathbf{B}}_{N}^{ - 1}\) is less than one (Elhorst 2001, 2014; Parent and LeSage 2011, 2012; Debarsy et al. 2012).
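The stationarity condition on \({\tilde{\mathbf{A}}} = {\mathbf{C}}_{N} {\mathbf{B}}_{N}^{ - 1}\) is straightforward to check numerically. The following sketch (Python with NumPy; the small ring-shaped connectivity matrix is purely illustrative, not from the paper) tests whether a given \((\gamma ,\rho ,\theta )\) combination is dynamically stable.

```python
import numpy as np

def is_stationary(W, gamma, rho, theta):
    """Check the stability condition for Eq. (2): the largest absolute
    eigenvalue of A_tilde = C_N B_N^{-1} must be below one, where
    B_N = I - rho*W and C_N = gamma*I + theta*W."""
    N = W.shape[0]
    B = np.eye(N) - rho * W
    C = gamma * np.eye(N) + theta * W
    A_tilde = C @ np.linalg.inv(B)
    return bool(np.max(np.abs(np.linalg.eigvals(A_tilde))) < 1.0)

# Toy example: 4 regions on a ring, row-standardised contiguity.
W = np.array([[0., .5, 0., .5],
              [.5, 0., .5, 0.],
              [0., .5, 0., .5],
              [.5, 0., .5, 0.]])
print(is_stationary(W, gamma=0.75, rho=0.3, theta=-0.2))
```

Because \({\mathbf{B}}_{N}\) and \({\mathbf{C}}_{N}\) are polynomials in \({\mathbf{W}}_{N}\), the eigenvalues of \({\tilde{\mathbf{A}}}\) are \((\gamma + \theta \omega )/(1 - \rho \omega )\) for each eigenvalue \(\omega\) of \({\mathbf{W}}_{N}\), which the code above verifies by brute force.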

Consistent estimation of Eq. (2) by maximum likelihood is complicated by the presence of the endogenous variables, by assumptions regarding initial conditions, and by how \(T\) and \(N\) tend to infinity. Bond (2002) argues that the distribution of the dependent variable depends in a non-negligible way on what is assumed about the distribution of the initial conditions. For example, the initial condition could be stochastic or non-stochastic, correlated or uncorrelated with the individual effects, or required to satisfy stationarity properties. Different assumptions about the nature of the initial conditions lead to different likelihood functions, and the resulting ML estimators can be inconsistent when the assumptions on the initial conditions are misspecified; Hsiao (2003, pp. 80–135) gives more details. GMM makes weaker assumptions about the initial conditions, indeed according to Baltagi (2013, p. 158) the 'GMM estimator requires no knowledge concerning the initial conditions', and it naturally accommodates the presence of endogenous regressors. The paper therefore focuses on GMM, which is very well documented in the literature, so only a brief summary is provided here.

Following Arellano and Bond (1991), estimation of linear GMM panel data regressions involves first differences, to avoid dynamic panel bias (Nickell 1981), eliminating the individual effects \(\mu_{i}\) which would otherwise be correlated with the spatial lag and time lag of the dependent variable. First differencing Eq. (2) gives

$$\Delta {\mathbf{y}}_{t} = \gamma \Delta {\mathbf{y}}_{t - 1} + \rho {\mathbf{W}}_{N} \Delta {\mathbf{y}}_{t} + \theta {\mathbf{W}}_{N} \Delta {\mathbf{y}}_{t - 1} + \Delta {\mathbf{x}}_{t} \beta_{1} + \Delta {\tilde{\mathbf{x}}}_{t} \beta_{2} + \Delta {{\varvec{\upvarepsilon}}}_{t}$$
(3)

Because of the presence of endogenous variables, instrumental variables are required for consistent estimation. Typically, instruments that are correlated with endogenous variables and yet independent of the errors are difficult to find. One solution is to use lags of regressors already present in the model, but for IV estimation generally, more lags means less data. The usual instrument set for difference GMM, namely HENR instruments (after Holtz-Eakin, Newey and Rosen 1988), avoids this by zeroing out missing observations while including separate instruments for each time period. With HENR one therefore has one instrument per variable, time period and lag distance, amounting to \((T - 2)(T - 1)/2\) instruments for each endogenous variable. Because endogenous variables are contemporaneously correlated with the errors, and provided the \(\nu_{it}\) are not serially autocorrelated of order one, regressors lagged by two or more periods satisfy the orthogonality conditions relating instruments and differenced errors. Arellano and Bond (1991) provide a test for serial correlation, the \(m_{2}\) test statistic, which tests for second-order serial correlation in the first-differenced residuals and is asymptotically normal under the null of zero correlation. Additionally, following the approach adopted by Baltagi et al. (2019), given data with identifiable spatial locations, spatially weighted earlier time-lagged levels of the dependent and explanatory variables are also potentially viable instruments. Accordingly, Baltagi et al. (2019) set out moments equations thus

$$E\left( {y_{il} \Delta \nu_{it} } \right) = 0 \quad {\text{hence}} \quad \sum\limits_{i} {y_{il} \Delta \nu_{it} } = 0, \quad \forall i,\; l = 0,1,\ldots,t - 2,\; t = 2,3,\ldots,T$$
(4)
$$E\left( {{\mathbf{w}}_{i} {\mathbf{y}}_{l} \Delta \nu_{it} } \right) = 0 \quad {\text{hence}} \quad \sum\limits_{i} {\sum\limits_{j \ne i} {w_{ij} y_{jl} } \Delta \nu_{it} } = 0, \quad \forall i,\; l = 0,1,\ldots,t - 2,\; t = 2,3,\ldots,T$$
(5)

where \({\mathbf{w}}_{i} = \left( {w_{i1} ,\ldots,w_{iN} } \right)\) is a 1 by \(N\) vector corresponding to the \(i\)th row of \({\mathbf{W}}_{N}\). Similar expressions give additional moments equations involving the lagged endogenous regressors: for regressor \(j\), \(\tilde{x}_{j,il}\) and \({\tilde{\mathbf{x}}}_{j,l}\) replace \(y_{il}\) and \({\mathbf{y}}_{l}\) in Eqs. (4) and (5). With regard to strictly exogenous regressors, there is no feedback from the dependent variable, and in this case the moments conditions are

$$E\left( {x_{j,im} \Delta \nu_{it} } \right) = 0, \, \forall i,j, \, m = 1,...,T; \quad t = 2,...,T$$
(6)
$$E\left( {{\mathbf{w}}_{i} {\mathbf{x}}_{j,m} \Delta \nu_{it} } \right) = 0, \, \forall i,j, \, m = 1,...,T; \quad t = 2,...,T \,$$
(7)

Baltagi et al. (2019) also introduce a spatial moving average (SMA) error dependence process, but for simplicity the Monte Carlo simulations assume that the errors are spatially independent.Footnote 4

HENR instruments lead to quadratic growth in the number of instruments with respect to \(T\), so there is the possibility of an overabundance of instruments. One solution is to limit the number of lags applying to endogenous variables in the moments equations. A second solution is to collapse the instrument matrix so that there is one instrument for each variable and lag distance, rather than one for each time period, variable, and lag distance; this amounts to adding together columns of the instrument matrix, replacing the set of instruments, one for each period, with a single column. The two solutions can also be combined. Under collapsing, the set of moments equations given by Eqs. (4) and (5) is replaced by

$$\sum\limits_{i,t} {y_{it - 2} \Delta \nu_{it} } = 0$$
(8)
$$\sum\limits_{i,t} {\sum\limits_{j \ne i} {w_{ij} y_{jt - 2} } \Delta \nu_{it} } = 0$$
(9)

with similar expressions for the other endogenous regressors.

However, these approaches have limitations. Limiting lags alone may not solve the instrument proliferation problem, depending on the context, and collapsing, by omitting time variation, will tend to give less precise estimates. Alternatively, each strictly exogenous variable can be introduced as a single column in the matrix of instruments, producing far fewer instruments than the moments conditions of Eqs. (6) and (7). These are referred to as IV-style instruments, rather than HENR-type instruments.
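To make the collapsing operation concrete, the sketch below (Python/NumPy; an illustrative implementation, not the authors' code) builds the HENR instrument block for a single individual and its collapsed counterpart, in which HENR columns sharing a lag distance are added together.

```python
import numpy as np

def henr_block(y_i, T):
    """HENR (Holtz-Eakin/Newey/Rosen) instrument block for one individual:
    one column per (period, lag) pair, with missing entries zeroed out.
    Rows index the T-2 differenced equations; the columns for equation t
    hold the available levels y_{i0}, ..., y_{i,t-2}."""
    rows = T - 2
    cols = (T - 2) * (T - 1) // 2
    Z = np.zeros((rows, cols))
    c = 0
    for r, t in enumerate(range(2, T)):
        for l in range(t - 1):              # levels l = 0, ..., t-2
            Z[r, c] = y_i[l]
            c += 1
    return Z

def collapsed_block(y_i, T):
    """Collapsed instruments: one column per lag distance only, formed by
    adding together the HENR columns that share a lag distance."""
    rows = T - 2
    Z = np.zeros((rows, rows))              # lag distances 2, ..., T-1
    for r, t in enumerate(range(2, T)):
        for d in range(2, t + 1):           # y_{i,t-d}, zero when unavailable
            Z[r, d - 2] = y_i[t - d]
    return Z
```

With \(T = 10\), as in the simulations below, the 36 HENR columns per variable collapse to 8, matching the \((T - 2)(T - 1)/2\) and \((T - 2)\) counts in the text.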

3 Consequences of instrument proliferation

3.1 Parameter standard errors

With numerous instruments, the estimated asymptotic standard errors of the efficient two-step GMM estimator are downward biased in small samples. Windmeijer (2005) corrects for the bias which results from estimating the optimal weight matrix used in the second step of linear two-step GMM. The optimal weight matrix is the inverse of the covariance of the sample moments, leading to the smallest covariance matrix for the GMM estimator. The bias results from the weight matrix being evaluated at estimated, rather than true, parameter values. However, additional bias may occur because of overidentification bias, which affects the finite sample bias of the GMM estimator itself; Hwang et al. (2022) also correct for overidentification bias.

3.2 The Sargan–Hansen J test statistic

Theoretically, the moments conditions require that instruments be orthogonal to the error term. However, with more instruments than variables to be instrumented, the model is overidentified and not all the moments equations can be satisfied exactly and simultaneously. The solution is to satisfy the moments as closely as possible, and the success of this is measured by Sargan–Hansen's J test (Sargan 1958; Hansen 1982), as defined by Eq. (10), which tests the null hypothesis of joint validity of the moments conditions under overidentification. Though it is robust to non-sphericity of the errors, it can be greatly weakened by instrument proliferation (Andersen and Sørensen 1996; Bowsher 2002; Roodman 2009a, b).

The \(J\) test statistic is given by

$$J = {\mathbf{S}}_{1} {\mathbf{A}}{\mathbf{S}}_{2} /N$$
(10)
$$\begin{gathered} {\mathbf{S}}_{1} = \sum\limits_{i = 1}^{N} {\Delta \nu_{i2}^{\prime } {\mathbf{Z}}_{i} } \hfill \\ {\mathbf{S}}_{2} = \sum\limits_{i = 1}^{N} {{\mathbf{Z}}_{i}^{\prime } \Delta \nu_{i2} } \hfill \\ {\mathbf{A}} = \left( {\frac{1}{N}\sum\limits_{i = 1}^{N} {{\mathbf{Z}}_{i}^{\prime } \Delta \nu_{i1} \Delta \nu_{i1}^{\prime } {\mathbf{Z}}_{i} } } \right)^{ - 1} \hfill \\ \end{gathered}$$

In \({\mathbf{S}}_{1}\) and \({\mathbf{S}}_{2}\), the \(\Delta \nu_{i2}\) are differenced second-step errors, and \({\mathbf{Z}}\) is the matrix of instruments, comprising \(N\) blocks \({\mathbf{Z}}_{i}\), each of dimension \((T - 2)\) by \(p\), where \(p\) is the number of instruments. Under the null hypothesis that the moments conditions are valid, \(J\) is distributed as \(\chi_{p - k}^{2}\), where \(k\) is the number of estimated parameters and \(p > k\); if \(J\) exceeds the relevant critical value of \(\chi_{p - k}^{2}\), some or all of the moments conditions are not supported by the data. The preliminary (one-step) consistent estimator giving the differenced first-step errors is based on

$${\mathbf{A}}_{1} = \left( {\frac{1}{N}\sum\limits_{i = 1}^{N} {{\mathbf{Z}}_{i}^{\prime } {\mathbf{HZ}}_{i} } } \right)^{ - 1}$$
(11)

In the above, \({\mathbf{H}}\) is a \((T - 2)\) by \((T - 2)\) matrix with 2s on the main diagonal, −1s on the adjacent upper and lower diagonals, and zeros elsewhere.
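A compact sketch of the computation in Eqs. (10) and (11) is given below (Python/NumPy; the array shapes are assumptions made for illustration, and the data in the test are synthetic).

```python
import numpy as np

def h_matrix(T):
    """The (T-2) by (T-2) first-difference matrix H: 2 on the main
    diagonal, -1 on the adjacent off-diagonals, zeros elsewhere."""
    m = T - 2
    return 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)

def hansen_j(dv2, dv1, Z):
    """Sargan-Hansen J of Eq. (10), a sketch.
    dv2 : (N, T-2) second-step differenced residuals
    dv1 : (N, T-2) first-step differenced residuals
    Z   : (N, T-2, p) instrument blocks Z_i"""
    N = Z.shape[0]
    S2 = sum(Z[i].T @ dv2[i] for i in range(N))      # S_2, a p-vector
    A = np.linalg.inv(sum(np.outer(Z[i].T @ dv1[i],
                                   Z[i].T @ dv1[i]) for i in range(N)) / N)
    return float(S2 @ A @ S2) / N                    # S_1 is the transpose of S_2
```

Because \({\mathbf{A}}\) is the inverse of a positive definite moment covariance, \(J\) is non-negative by construction, and it is compared with \(\chi_{p - k}^{2}\) critical values as described above.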

4 Synthetic instruments

Normally, it is difficult to find legitimate external exogenous instruments that correlate with the endogenous variables and yet are unrelated to the error term. Typically, instruments that correlate closely with endogenous variables tend also to correlate with the errors, while avoiding correlation with the errors commonly leads to instruments that are weak and irrelevant with respect to the endogenous variable. A solution to the problem is based on the spatial filtering literature (notably Griffith 1988, 1996, 2000, 2003; Getis and Griffith 2002; Boots and Tiefelsdorf 2000; Patuelli et al. 2006), which is the basis for the construction of synthetic instruments, as advocated by Le Gallo and Páez (2013) for cross-sectional regression. Synthetic instruments have the properties of ideal instruments, because normally they are well correlated with the endogenous variables and yet independent of the errors. Specifically, the synthetic instruments are the fitted values resulting from regressing the endogenous regressors and their spatial lags on weighted linear combinations of subsets of orthogonal eigenvectors deriving from a symmetric \(N\) by \(N\) contiguity matrix \({\mathbf{M}}_{N}\), in which the \(m_{ij}\), \(i = 1,\ldots,N\), \(j = 1,\ldots,N\), take the values

$$\begin{gathered} m_{ij}^{{}} = 1{\text{ if}}\; \, i\;{\text{ and}}\; \, j \, \;{\text{are }}\;{\text{neighbours}} \hfill \\ m_{ij}^{{}} = 0{\text{ otherwise}} \hfill \\ \end{gathered}$$

\({\mathbf{M}}_{N}\) simply reflects the spatial connectivity of \(N\) regions and this is normally unaffected by the data under analysis. Likewise, the eigenvectors are exogenous, in other words not determined by \({\mathbf{y}}_{t}\), and so are an appropriate basis for synthetic instruments.

The effectiveness of the eigenvectors as instruments derives from the fact that each one represents a different orthogonal latent map pattern and so it is likely that one or more will correlate strongly with a non-randomly spatially distributed endogenous variable. Following Griffith (2000) and much related literature, we first consider the Moran Coefficient spatial autocorrelation index \(MC_{t}\) which measures the spatial autocorrelation in \({\mathbf{y}}\) at time \(t\) as given by

$$MC_{t} = \frac{N}{{{\mathbf{1}}_{N}^{\prime } {\mathbf{M}}_{N} {\mathbf{1}}_{N} }}\frac{{{\mathbf{y}}_{t}^{\prime } {\mathbf{P}}_{N} {\mathbf{y}}_{t} }}{{{\mathbf{y}}_{t}^{\prime } \left( {{\mathbf{I}}_{N} - {\mathbf{1}}_{N} {\mathbf{1}}_{N}^{\prime } /N} \right){\mathbf{y}}_{t} }}$$
(12)

where

$${\mathbf{P}}_{N} = \left( {{\mathbf{I}}_{N} - {\mathbf{1}}_{N} {\mathbf{1}}_{N}^{\prime } /N} \right){\mathbf{M}}_{N} \left( {{\mathbf{I}}_{N} - {\mathbf{1}}_{N} {\mathbf{1}}_{N}^{\prime } /N} \right)$$

in which \({\mathbf{I}}_{N}\) is an \(N\) by \(N\) identity matrix, and \({\mathbf{1}}_{N}\) denotes an \(N\) by 1 vector of ones. The \(N\) by \(N\) matrix \({\mathbf{P}}_{N}\) yields \(N\) 'orthogonal' eigenvectors \({\mathbf{E}}_{i}\), \(i = 1,\ldots,N\), each of dimension \(N\) by 1. Replacing \({\mathbf{y}}_{t}\) in Eq. (12) by \({\mathbf{E}}_{i}\) measures the spatial autocorrelation of eigenvector \({\mathbf{E}}_{i}\). So each of the eigenvectors of \({\mathbf{P}}_{N}\) can be understood as a distinctive map pattern, with its own \(MC\), ranging from strongly positive to strongly negative autocorrelation, given by the different \({\mathbf{E}}_{i}\) values distributed across the regions implied by the connectivity matrix \({\mathbf{M}}_{N}\).
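The construction of \({\mathbf{P}}_{N}\), its eigenvectors, and the Moran Coefficient of Eq. (12) can be sketched as follows (Python/NumPy; the 4-region ring contiguity matrix is purely illustrative). For an eigenvector of \({\mathbf{P}}_{N}\) with non-zero eigenvalue \(\lambda\), the \(MC\) reduces to \(N\lambda /{\mathbf{1}}_{N}^{\prime } {\mathbf{M}}_{N} {\mathbf{1}}_{N}\), which the code exploits as a check.

```python
import numpy as np

def moran_eigenvectors(M):
    """Eigendecomposition of P_N = (I - 11'/N) M (I - 11'/N). Each
    eigenvector is a latent orthogonal map pattern whose eigenvalue is
    proportional to its Moran Coefficient."""
    N = M.shape[0]
    C = np.eye(N) - np.ones((N, N)) / N       # centring projector
    vals, vecs = np.linalg.eigh(C @ M @ C)    # symmetric M => symmetric P_N
    order = np.argsort(vals)[::-1]            # most positive MC first
    return vals[order], vecs[:, order]

def moran_coefficient(z, M):
    """Moran Coefficient of Eq. (12) for an N by 1 vector z."""
    N = M.shape[0]
    C = np.eye(N) - np.ones((N, N)) / N
    return (N / M.sum()) * (z @ C @ M @ C @ z) / (z @ C @ z)

# Four regions on a ring: 1-2-3-4-1.
M = np.array([[0., 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
vals, vecs = moran_eigenvectors(M)
```

Here the last-ordered eigenvector is the alternating pattern on the ring, the most negatively autocorrelated map possible for this \({\mathbf{M}}_{N}\).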

The synthetic instruments actually employed are weighted linear combinations of subsets of the \({\mathbf{E}}_{i}\) which are referred to in the literature as spatial filters (see Griffith 2003; Boots and Tiefelsdorf 2000). In the empirical analysis below, we use iterative regression to identify which subset of the \({\mathbf{E}}_{i}\) are appropriate for a given endogenous regressor, with the regression coefficient estimates giving the weights to apply in the weighted linear combination.

Applying such a spatial filter to even a completely spatially random variable will tend to find a significant relationship, and a moderately strong correlation between the random variable and the synthetic instrument, because the filter is the outcome of a search through many candidate \({\mathbf{E}}_{i}\)s. With a spatially organised variable, for example a quadratic trend surface defined by its geographic coordinates, the outcome will typically be a much stronger correlation. Because spatio-temporal panel data are unlikely to be randomly distributed and are almost invariably spatially organised in some way, the spatial filter can be used to obtain a synthetic instrument that is highly correlated with an endogenous variable. One thus has a way to generate relevant instrumental variables that are unrelated to the error term, and yet which are highly correlated with endogenous variables that are related to the error term. This is very helpful, because relevant and exogenous instrumental variables are difficult to find. As noted by Le Gallo and Páez (2013), working in the context of cross-sectional data, 'Synthetic variables, being artificial map patterns derived from the spatial configuration of the system, provide a near ideal solution—as long as spatial partitioning is not codetermined with other variables, which is typically the case'.

The aim is to obtain spatial filters for the endogenous regressors \(\tilde{x}_{itk}\); \(i = 1,\ldots,N\), \(t = 1,\ldots,T\), \(k = 1,\ldots,k_{2}\). The approach adopted involves iteratively fitting regressions in which the dependent variable is the \(k\)th endogenous variable \({\tilde{\mathbf{x}}}_{tk}\) and the independent variables are the eigenvectors \({\mathbf{E}}_{i}\). The outcome is the isolation of the relevant subset of the \({\mathbf{E}}_{i}\) and their relative weights, as given by the estimated regression coefficients. For each endogenous variable we can then form a weighted linear combination of the subset so as to give an appropriate synthetic instrument. This can be summarised thus:

1. Set up empty vectors \({\mathbf{z}}_{k} ;\quad k = 1,\ldots,k_{2}\), each ultimately to contain the synthetic instrument for variable \(k\).

2. Use data for the panel at time \(t = 1\).

3. Set \(k = 1\).

4. Set the \(N\) by 1 vector \({\mathbf{V}} = {\mathbf{1}}\).

5. Set \(j = 1\).

6. For variable \(k\) and eigenvector \(j\), regress \({\tilde{\mathbf{x}}}_{tk}\) on \({\mathbf{E}}_{j}\).

7. If the regression coefficient \(\beta\) is significantly different from 0, set \({\mathbf{V}} = {\mathbf{V}} + \beta {\mathbf{E}}_{j}\).

8. Set \(j = j + 1\); if \(j \le N\) go to step 6.

9. Regress \({\tilde{\mathbf{x}}}_{tk}\) on \({\mathbf{V}}\) to obtain the fitted values \({\mathbf{\hat{\tilde{x}}}}_{tk}\).

10. Append \({\mathbf{\hat{\tilde{x}}}}_{tk}\) so that \({\mathbf{z}}_{k} = \left[ {{\mathbf{z}}_{k} ;{\mathbf{\hat{\tilde{x}}}}_{tk} } \right]\).

11. Set \(k = k + 1\); if \(k \le k_{2}\) go to step 4.

12. Set \(t = t + 1\); if \(t \le T\) go to step 3.

The \(NT\) by 1 vectors \({\mathbf{z}}_{k} ,k = 1,...,k_{2}\) are then used as external synthetic instruments in GMM estimation.
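A minimal implementation of this procedure might look as follows (Python/NumPy). It is a sketch under stated assumptions: the 5% two-sided critical value is approximated by 1.96 rather than the exact t quantile, and the orthonormal matrix and data in the usage lines are synthetic stand-ins for the eigenvectors of \({\mathbf{P}}_{N}\) and an endogenous regressor.

```python
import numpy as np

def synthetic_instruments(X_endog, E, crit=1.96):
    """Steps 1-12 above: for each period t and endogenous variable k, scan
    the eigenvectors E_j, accumulate the significant ones into the spatial
    filter V, and stack the fitted values as the synthetic instrument z_k.
    X_endog : (T, N, k2) endogenous regressors by period
    E       : (N, N) matrix whose columns are the eigenvectors E_j
    crit    : approximate 5% two-sided critical value (an assumption)"""
    T, N, k2 = X_endog.shape
    Z = np.zeros((N * T, k2))
    for t in range(T):
        for k in range(k2):
            x = X_endog[t, :, k]
            V = np.ones(N)                           # step 4: V = 1
            for j in range(N):                       # steps 5-8
                e = E[:, j]
                sxx = e @ e - N * e.mean() ** 2
                if sxx < 1e-12:                      # skip constant patterns
                    continue
                b, a = np.polyfit(e, x, 1)           # simple regression of x on e
                resid = x - (a + b * e)
                tstat = b / np.sqrt(resid @ resid / (N - 2) / sxx)
                if abs(tstat) > crit:
                    V = V + b * e                    # step 7: accumulate filter
            A = np.column_stack([np.ones(N), V])     # step 9: fitted values
            coef, *_ = np.linalg.lstsq(A, x, rcond=None)
            Z[t * N:(t + 1) * N, k] = A @ coef       # step 10: append
    return Z

# Usage with synthetic data: x built from two latent patterns plus noise.
rng = np.random.default_rng(1)
N, T = 60, 2
E, _ = np.linalg.qr(rng.normal(size=(N, N)))   # stand-in orthonormal eigenvectors
x = 3 * E[:, 2] - 2 * E[:, 5] + 0.1 * rng.normal(size=N)
X = np.broadcast_to(x[None, :, None], (T, N, 1)).copy()
Z = synthetic_instruments(X, E)
```

In this constructed example, the filter recovers the two latent patterns, so the synthetic instrument correlates very strongly with the endogenous variable, consistent with the argument above.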

5 The Monte Carlo simulations

5.1 The basic DGP

Our simulations assume that there are four regressors, though these are subsequently extended to give a spatial Durbin type of specification. At its simplest, our data generating process (DGP) is based on a version of Eqs. (2) and (3) thus

$$\begin{gathered} y_{it} = \gamma y_{it - 1} + \rho \sum\limits_{j = 1}^{N} {w_{ij} y_{jt} } + \theta \sum\limits_{j = 1}^{N} {w_{ij} y_{jt - 1} } + \beta_{1} x_{1it} + \beta_{2} x_{2it} + \beta_{3} \tilde{x}_{3it} + \beta_{4} \tilde{x}_{4it} + \varepsilon_{it}; \quad i = 1,\ldots,N,\; t = 1,\ldots,T \hfill \\ \varepsilon_{it} = \mu_{i} + \nu_{it} \hfill \\ \end{gathered}$$
(13)

The aim is to devise a design that captures all sources of endogeneity, the ultimate outcome of which is \(\tilde{x}_{3it}\) and \(\tilde{x}_{4it}\) being correlated with \(\varepsilon_{it}\). The approach adopted is similar to Liu and Saraiva (2015), but in the context of compound errors, so that endogeneity occurs because of correlation between the regressors and \(\nu\) and hence \(\varepsilon\). In this simple initial case, the DGP draws from the Gaussian multivariate distribution,

$$\left( {\begin{array}{*{20}c} \nu \\ {\tilde{\nu }} \\ {{\mathbf{x}}_{1} } \\ {{\mathbf{x}}_{2} } \\ {{\tilde{\mathbf{x}}}_{3} } \\ {{\tilde{\mathbf{x}}}_{4} } \\ \end{array} } \right)\sim N\left( {\left( {\begin{array}{*{20}c} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ \end{array} } \right),\left[ {\begin{array}{*{20}c} {\sigma_{\nu }^{2} } & {p_{1} } & 0 & 0 & 0 & 0 \\ {p_{1} } & 1 & 0 & 0 & {p_{2} } & {p_{3} } \\ 0 & 0 & 1 & {p_{4} } & 0 & 0 \\ 0 & 0 & {p_{4} } & 1 & 0 & 0 \\ 0 & {p_{2} } & 0 & 0 & 1 & 0 \\ 0 & {p_{3} } & 0 & 0 & 0 & 1 \\ \end{array} } \right]} \right)$$
(14)

in which the leading diagonal of the covariance matrix contains the variances and the off-diagonal elements \(p_{1} ,p_{2} ,p_{3}\) and \(p_{4}\) are the covariances between the random variables. The set-up in Eq. (14) indicates that the exogenous variables are unrelated to the other variables, except that they are correlated with each other via \(p_{4}\). The endogenous regressors \({\tilde{\mathbf{x}}}_{3}\) and \({\tilde{\mathbf{x}}}_{4}\) are correlated with \(\tilde{\nu }\), which belongs to a separate equation system, but \(\tilde{\nu }\) is correlated with the remainder error component \(\nu\) (with \(\tilde{\nu }\) separate from \(\nu\), there is the option of different \(\tilde{\nu }\)s for different endogenous regressors, as applied subsequently). The outcome is a set of \(NT\) by 1 random vectors. The individual error component \(\mu_{i}\) is generated via a univariate normal distribution with zero mean and variance \(\sigma_{\mu }^{2}\). Combining the error components \(\mu_{i}\) and \(\nu_{it}\) gives \(\varepsilon_{it}\), so that \(\tilde{x}_{3it}\) and \(\tilde{x}_{4it}\) are correlated with \(\varepsilon_{it}\). A similar approach is applied subsequently in the context of non-spatial data and the spatial Durbin specification.

Given true values of the various parameters, drawing in each replication from the multivariate normal distribution provides numerous realisations of \(y_{it} ,i = 1,\ldots,N;t = 1,\ldots,T\). Draws leading to a maximum absolute characteristic root of \({\tilde{\mathbf{A}}}\) equal to or greater than 1 are rejected, so the simulated data sets are all dynamically stable and stationary. These data are the basis of estimates of the model parameters, and the aim is to compare the resulting estimates with the true parameter values of Eq. (13).

In practice, various alternative true parameter values have been considered, but the results presented subsequently for the DGP are based on \((\sigma_{\mu }^{2} ,\sigma_{\nu }^{2} ) = (0.2,0.8)\) and \((0.8,0.2)\). The simulations thus encompass low and high individual heterogeneity, paired with high and low levels of remainder variance. Setting \(p_{1} = 0.5\), \(p_{2} = 0.75\), \(p_{3} = 0.25\) and \(p_{4} = 0.3\) makes \(\tilde{x}_{3it}\) strongly endogenous and \(\tilde{x}_{4it}\) weakly endogenous. It is also assumed that \(\gamma = 0.75\), \(\rho = 0.3\), \(\theta = - 0.2\) and \(\beta_{1} = 4\), \(\beta_{2} = 3\), \(\beta_{3} = 2\), \(\beta_{4} = 1\).

The reported outcomes for this simple specification are based on the 'r ahead and r behind' connectivity matrix of Kelejian and Prucha (1999), subsequently row-standardised. Setting r = 5 means that each row of the spatial matrix \({\mathbf{W}}_{N}\) (i.e. \(w_{ij}\), \(i = 1,\ldots,N\), \(j = 1,\ldots,N\)) has up to 10 connections (five ahead and five behind, each with equal weight), with zeros elsewhere and on the main diagonal. We also subsequently consider results based on a dense \({\mathbf{W}}_{N}\) matrix in the context of the spatial Durbin specification.

Results are reported for 100 replications, which smooths out aberrant outcomes and is sufficient to highlight the main traits in the simulation. In each replication, the initial 51 simulation outcomes of \(x_{1it} ,x_{2it} ,\tilde{x}_{3it} {\text{ and }}\tilde{x}_{4it}\) are discarded in order to minimise any effect of the zero initial values at \(t = - 50\) (i.e. simulation outcomes for \(t = - 50, - 49,\ldots,0\) are discarded). Also, \(T = 10\) and there are \(N = 100\) regions.
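The draw from Eq. (14) underlying each replication can be sketched directly (Python/NumPy; the parameter values are those stated above, and the sample size of 100,000 is chosen purely so that the sample correlations settle close to the design values):

```python
import numpy as np

# Covariance of (nu, nu_tilde, x1, x2, x3_tilde, x4_tilde) as in Eq. (14),
# using sigma_nu^2 = 0.8, p1 = 0.5, p2 = 0.75, p3 = 0.25, p4 = 0.3.
s2, p1, p2, p3, p4 = 0.8, 0.5, 0.75, 0.25, 0.3
S = np.array([[s2, p1, 0,  0,  0,  0],
              [p1, 1,  0,  0,  p2, p3],
              [0,  0,  1,  p4, 0,  0],
              [0,  0,  p4, 1,  0,  0],
              [0,  p2, 0,  0,  1,  0],
              [0,  p3, 0,  0,  0,  1]])

rng = np.random.default_rng(7)
d = rng.multivariate_normal(np.zeros(6), S, size=100_000)

# Sample correlations recover the design: corr(nu_tilde, x3_tilde) is near p2,
# corr(nu_tilde, x4_tilde) near p3, and corr(x1, x2) near p4.
print(np.corrcoef(d[:, 1], d[:, 4])[0, 1])
```

A useful preliminary check is that the stated covariance matrix is positive definite, so that the multivariate normal draw is well defined; the assertions below confirm this for these parameter values.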

5.2 Results for simple spatial DGP

The results obtained depend on various set-ups regarding instruments. The idea is to show the impact on estimation of having fewer instruments, as well as the effect of different error distribution assumptions. The largest number of instruments is given by applying the standard solution to the existence of endogenous variables, namely HENR instrumentation. Full HENR instrumentation for the temporally lagged endogenous variables \(y_{it}\) and \(Wy_{it}\), together with \(Wy_{it - 1} ,\tilde{x}_{3it} ,\tilde{x}_{4it}\) and their spatial lags \(W^{2} y_{it - 1} ,W\tilde{x}_{3it} ,W\tilde{x}_{4it}\), amounts to \(8(T - 1)(T - 2)/2\) = 288 instruments. In addition, there are four IV-style instruments equal to the exogenous variables and their spatial lags, namely \(x_{1it} ,x_{2it} ,Wx_{1it} ,Wx_{2it}\). The result is 288 + 4 = 292 instruments overall.

One side-issue relating to the existence of many instruments is the possibility that some are almost collinear. One could drop these from the instrument set, but this would change the number of degrees of freedom available for the Sargan–Hansen J statistic. Moreover, the definition of collinearity is somewhat subjective, and eliminating collinear instruments has minimal impact on outcomes.

Synthetic instruments are usually strongly correlated with the variables they instrument. For example, on the basis of 100 replications, the mean correlations between \(\tilde{x}_{4it}\) and its spatial lag \(W\tilde{x}_{4it}\) and their respective synthetic instruments are 0.6687 and 0.7205. Applying synthetic instruments to the endogenous regressors and their spatial lags, \(\tilde{x}_{3it} ,W\tilde{x}_{3it}\) and \(\tilde{x}_{4it} ,W\tilde{x}_{4it}\), together with the exogenous regressors and their spatial lags, gives eight IV-style instruments. Full HENR instrumentation for \(y_{it,} Wy_{it,} Wy_{it - 1,} W^{2} y_{it - 1}\) adds \(4(T - 1)(T - 2)/2\) = 144 instruments, giving a total of 152 instruments overall. A considerable reduction in the number of instruments results from combining collapsing with synthetic instruments. Collapsing the standard HENR instrumentation for \(y_{it} ,Wy_{it} ,Wy_{it - 1}\) and \(W^{2} y_{it - 1}\) gives \(4(T - 2)\) = 32 instruments; adding the aforementioned eight IV-style instruments gives an overall total of \(4(T - 2) + 8\) = 40 instruments. Collapsing standard HENR instrumentation for \(y_{it}\) alone gives \((T - 2)\) = 8 instruments; using synthetic instruments for \(Wy_{it} ,Wy_{it - 1}\) and \(W^{2} y_{it - 1}\), together with \(\tilde{x}_{3it} ,W\tilde{x}_{3it} ,\tilde{x}_{4it} ,W\tilde{x}_{4it}\), creates seven IV-style instruments, and there are four more from the exogenous variables and their spatial lags \(x_{1it} ,x_{2it} ,Wx_{1it} ,Wx_{2it}\). Combined, there are 19 instruments in total.
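As a check on the arithmetic, the four instrument counts just quoted follow directly from T = 10 (a simple verification, not part of the estimation itself):

```python
# Instrument counts for T = 10, as quoted in the text.
T = 10
full_henr = 8 * (T - 1) * (T - 2) // 2 + 4   # full HENR for 8 variables + 4 IV-style
synthetic = 4 * (T - 1) * (T - 2) // 2 + 8   # HENR for 4 variables + 8 IV-style
collapsed = 4 * (T - 2) + 8                  # collapsed HENR for 4 variables + 8 IV-style
minimal = (T - 2) + 7 + 4                    # collapsed y only + 7 synthetic + 4 exogenous
print(full_henr, synthetic, collapsed, minimal)
```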

The results obtained show remarkably little adverse consequence from reducing the number of instruments, together with the significant benefit of an unbiased Sargan–Hansen J statistic and less biased standard errors. Tables 1 and 2 show the mean parameter estimates resulting from 100 Monte Carlo simulations according to the number of instruments, true parameter values, and error distribution assumptions. It is clear that the estimates closely approximate the true values. Table 3 summarises the outcomes, showing the mean absolute parameter bias, obtained by averaging across the mean absolute bias for each parameter, and the average of the mean RMSEs, again averaging across the mean RMSE of each parameter. Bias tends to be smaller with larger \(\sigma_{\nu }^{2}\), but the opposite is true for RMSE. There is little variation in either as the number of instruments varies. In Tables 4 and 5, for each cell relating to each parameter, we give the average of the mean simulation outcomes, averaging across classic two-step, Windmeijer and HKL standard errors. Also, the mean classic two-step, Windmeijer and HKL standard errors are obtained by taking the average of the mean standard errors, averaging across parameter standard errors. The tables indicate that standard errors tend to rise as the number of instruments falls, though this is confounded by the fact that with 152 and 292 instruments the weight matrix is not symmetric positive definite and a generalised inverse is applied, so the standard errors cannot be guaranteed to be accurate. Compared with the uncorrected classic two-step standard errors, the Windmeijer and HKL corrections are associated with larger standard errors. Tables 4 and 5 also indicate that standard errors are higher when remainder variance is high and individual variance is low.

Table 1 Mean parameter estimates: \(\sigma_{\mu }^{2} = 0.8\), \(\sigma_{\nu }^{2} = 0.2\)
Table 2 Mean parameter estimates: \(\sigma_{\mu }^{2} = 0.2\), \(\sigma_{\nu }^{2} = 0.8\)
Table 3 Mean parameter bias and root mean squared error (RMSE) for selected data generating processes (DGPs)
Table 4 Mean standard errors: \(\sigma_{\mu }^{2} = 0.8\), \(\sigma_{\nu }^{2} = 0.2\)
Table 5 Mean standard errors: \(\sigma_{\mu }^{2} = 0.2\), \(\sigma_{\nu }^{2} = 0.8\)
Table 6 Mean value of diagnostics: \(\sigma_{\mu }^{2} = 0.8\), \(\sigma_{\nu }^{2} = 0.2\)

Tables 6 and 7 summarise the diagnostic indicators in terms of means across the 100 Monte Carlo simulations, again broken down by the number of instruments and error distribution assumptions. A prominent feature is the downward bias in the Sargan–Hansen J statistic for overidentifying restrictions when the number of instruments is large. The J test statistics reflect the low power associated with too many moment conditions, leading to an implausibly high mean p-value of 1.000 (Andersen and Sørensen 1996; Bowsher 2002; Roodman 2009b). Reducing from 292 to 152 instruments is insufficient to bring the mean p-value much below 1.0. While the J statistic is therefore an unreliable indicator of instrument validity, the z-ratios for error serial correlation, referred to the N(0,1) distribution, point to significant negative first-order correlation (\(m_{1}\)) and effectively zero second-order serial correlation (\(m_{2}\)) in the first-differenced residuals (see Arellano and Bond 1991). This points clearly to consistent estimates. The tables of diagnostics also show that the estimates obtained are dynamically stable, as measured by the maximum absolute eigenvalue of \({\tilde{\mathbf{A}}}\).

Table 7 Mean value of diagnostics: \(\sigma_{\mu }^{2} = 0.2\), \(\sigma_{\nu }^{2} = 0.8\)
Table 8 Mean parameter estimates: \(\sigma_{\mu }^{2} = 0.8\), \(\sigma_{\nu }^{2} = 0.2\)

5.3 Results with non-spatial data

Thus far, we have reduced instrument proliferation using collapsing combined with synthetic instruments derived from the \(N\) by \(N\) contiguity matrix \({\mathbf{M}}_{N}\). In this section, we explore the efficacy of this approach in the absence of spatial effects. Consider therefore the DGP of Eq. (13) but with \(\rho = 0,\theta = 0\). There are four estimators under consideration. First, full HENR instruments based on the endogenous variables \(y_{it}\), \(\tilde{x}_{3it}\) and \(\tilde{x}_{4it}\), with IV-style instruments for the exogenous \(x_{1it}\) and \(x_{2it}\), gives \(3(T - 2)(T - 1)/2 + 2 = 110\) instruments. Secondly, full HENR instruments based on \(y_{it}\), with synthetic instruments for \(\tilde{x}_{3it}\) and \(\tilde{x}_{4it}\) which become IV-style instruments alongside \(x_{1it}\) and \(x_{2it}\), gives \((T - 2)(T - 1)/2 + 4 = 40\) instruments. Thirdly, collapsing the HENR instruments based on \(y_{it}\), plus the four above-mentioned IV-style instruments, gives \((T - 2) + 4 = 12\) instruments. Fourthly, collapsing the HENR instruments based on \(y_{it}\), \(\tilde{x}_{3it}\) and \(\tilde{x}_{4it}\) gives \(3(T - 2) + 2 = 26\) instruments.
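The four counts follow one rule: full HENR yields \((T-1)(T-2)/2\) moment conditions per instrumented variable, while collapsing reduces this to \(T-2\). A small sketch, again assuming \(T = 10\) as implied by \(3(T-2)(T-1)/2 + 2 = 110\) (the helper name is ours, not the paper's):

```python
# Count check for the four non-spatial estimators. henr() is a
# hypothetical helper encoding the per-variable HENR formulas.
def henr(T, n_vars, collapsed=False):
    per_var = (T - 2) if collapsed else (T - 1) * (T - 2) // 2
    return n_vars * per_var

T = 10
counts = [
    henr(T, 3) + 2,                  # full HENR: y, x3~, x4~; IV-style x1, x2
    henr(T, 1) + 4,                  # full HENR: y; synthetic + exogenous IVs
    henr(T, 1, collapsed=True) + 4,  # collapsed HENR: y; 4 IV-style
    henr(T, 3, collapsed=True) + 2,  # collapsed HENR: y, x3~, x4~; 2 IV-style
]
print(counts)  # → [110, 40, 12, 26]
```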

Data are generated by again randomly sampling from a multivariate normal distribution as in Eq. (15). In this case correlation between \({\mathbf{x}}_{1}\) and endogenous regressors \({\tilde{\mathbf{x}}}_{3}\) and \({\tilde{\mathbf{x}}}_{4}\) is introduced via \(p_{5} = 0.4\) and \(p_{6} = 0.5\). Also \(p_{1} = 0.5,p_{2} = 0.75,p_{3} = 0.25\) and \(p_{4} = 0.3\).

$$\left( {\begin{array}{*{20}c} \nu \\ {\tilde{\nu }} \\ {{\mathbf{x}}_{1} } \\ {{\mathbf{x}}_{2} } \\ {{\tilde{\mathbf{x}}}_{3} } \\ {{\tilde{\mathbf{x}}}_{4} } \\ \end{array} } \right)\sim N\left( {\left( {\begin{array}{*{20}c} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ \end{array} } \right),\left[ {\begin{array}{*{20}c} {\sigma_{\nu }^{2} } & {p_{1} } & 0 & 0 & 0 & 0 \\ {p_{1} } & 1 & 0 & 0 & {p_{2} } & {p_{3} } \\ 0 & 0 & 1 & {p_{4} } & {p_{5} } & {p_{6} } \\ 0 & 0 & {p_{4} } & 1 & 0 & 0 \\ 0 & {p_{2} } & {p_{5} } & 0 & 1 & 0 \\ 0 & {p_{3} } & {p_{6} } & 0 & 0 & 1 \\ \end{array} } \right]} \right)$$
(15)

Tables 8, 9, 10, 11, 12, 13 and 14 summarise the outcomes over a range of simulation set-ups. Broadly stated, there is minimal impact from reducing the number of instruments, apart from the usual effects of eliminating bias in the Sargan–Hansen J test statistic and in the parameter standard errors. However, it is worth noting that estimates of \(\beta_{1}\) are persistently downwardly biased, and \(\beta_{2}\), \(\beta_{3}\) and \(\beta_{4}\) upwardly biased. Also, Table 10 indicates that bias tends to be larger with greater individual heterogeneity, but no obvious change in bias or RMSE occurs with varying numbers of instruments. In contrast, as shown in Tables 11 and 12, there is a clear increase in parameter standard errors as the number of instruments diminishes. Comparing Tables 11 and 12, standard errors also tend to be larger as remainder error variance increases and individual heterogeneity reduces. Again, the corrected standard errors are invariably larger than the classic two-step standard errors. Tables 13 and 14 highlight the usual downward bias in the Sargan–Hansen J statistic, reflecting the rule of thumb that the J statistic should be more reliable when the number of instruments is smaller than the number of individuals (\(N = 100\)).

Table 9 Mean parameter estimates: \(\sigma_{\mu }^{2} = 0.2\), \(\sigma_{\nu }^{2} = 0.8\)
Table 10 Mean parameter bias and root mean squared error (RMSE) for selected data generating processes (DGPs)
Table 11 Mean standard errors: \(\sigma_{\mu }^{2} = 0.8\), \(\sigma_{\nu }^{2} = 0.2\)
Table 12 Mean standard errors: \(\sigma_{\mu }^{2} = 0.2\), \(\sigma_{\nu }^{2} = 0.8\)
Table 13 Mean value of diagnostics: \(\sigma_{\mu }^{2} = 0.8\), \(\sigma_{\nu }^{2} = 0.2\)
Table 14 Mean value of diagnostics: \(\sigma_{\mu }^{2} = 0.2\), \(\sigma_{\nu }^{2} = 0.8\)

5.4 Results for Spatial Durbin specification with dense connectivity matrix

The data generating process for spatial data implemented a row-normalised 'five ahead and five behind' matrix \({\mathbf{W}}_{N}\) for the spatial lag of the dependent variable and for instruments. In this section, we replace this with a Lehmer matrix, a symmetric positive definite matrix \({\mathbf{W}}_{N}^{*}\) with elements \(w_{ij}^{*}\), \(i = 1,...,N\), \(j = 1,...,N\), in which

$$w_{ij}^{*} = \begin{cases} i/j, & j \ge i \\ j/i, & j < i \end{cases}$$
(16)

\({\mathbf{W}}_{N}^{*}\) is a dense matrix with cell values diminishing away from the main diagonal. \({\mathbf{W}}_{N}\) is \({\mathbf{W}}_{N}^{*}\) with its main diagonal replaced by zeros, then row normalised, as illustrated in Fig. 1. As with the other \({\mathbf{W}}_{N}\), the row and column sums are uniformly bounded.
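The construction of \({\mathbf{W}}_{N}\) from the Lehmer matrix is straightforward; a minimal sketch (the function name is ours):

```python
import numpy as np

def lehmer_connectivity(N):
    """Build W_N from the Lehmer matrix W*_ij = min(i,j)/max(i,j)
    (i.e. i/j for j >= i, j/i for j < i), zero the main diagonal,
    then row-normalise, as described in the text."""
    idx = np.arange(1, N + 1)
    W_star = np.minimum.outer(idx, idx) / np.maximum.outer(idx, idx)
    W = W_star.copy()
    np.fill_diagonal(W, 0.0)          # remove self-connectivity
    return W / W.sum(axis=1, keepdims=True)
```

After row normalisation \({\mathbf{W}}_{N}\) is no longer symmetric, but each row sums to one, consistent with the row-standardised weight matrices used elsewhere in the simulations.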

Fig. 1 Connectivity matrix based on the Lehmer matrix

To add diversity, we attempt to approximate the reality of many analytical situations in which all variables, denoted by \(\tilde{x}_{1it}, \tilde{x}_{2it}, \tilde{x}_{3it}\) and \(\tilde{x}_{4it}\), are endogenous. Also, we include spatial lags of the explanatory variables to give a spatial Durbin specification (see Halleck Vega and Elhorst 2015 for discussion). Inclusion of the spatial lags \(W\tilde{x}_{1it}, W\tilde{x}_{2it}, W\tilde{x}_{3it}, W\tilde{x}_{4it}\), with corresponding parameters \(\beta_{5}, \beta_{6}, \beta_{7}\) and \(\beta_{8}\), gives

$$\begin{gathered} y_{it} = \gamma y_{it - 1} + \rho \sum\limits_{j = 1}^{N} {w_{ij} y_{jt} } + \theta \sum\limits_{j = 1}^{N} {w_{ij} y_{jt - 1} } + \beta_{1} \tilde{x}_{1it} + \beta_{2} \tilde{x}_{2it} + \beta_{3} \tilde{x}_{3it} + \beta_{4} \tilde{x}_{4it} \hfill \\ \quad + \beta_{5} \sum\limits_{j = 1}^{N} {w_{ij} \tilde{x}_{1jt} } + \beta_{6} \sum\limits_{j = 1}^{N} {w_{ij} \tilde{x}_{2jt} } + \beta_{7} \sum\limits_{j = 1}^{N} {w_{ij} \tilde{x}_{3jt} } + \beta_{8} \sum\limits_{j = 1}^{N} {w_{ij} \tilde{x}_{4jt} } + \varepsilon_{it} ;\quad i = 1,...,N,\;t = 1,...,T \hfill \\ \varepsilon_{it} = \mu_{i} + \nu_{it} \hfill \\ \end{gathered}$$
(17)

Again the variables are generated from a multivariate Gaussian distribution, in this case allowing each endogenous variable to depend on a separate error process, as might occur in reality with diverse sources of endogeneity. Thus, in Eq. (18), \(p_{7}\) is the correlation between \({\tilde{\mathbf{x}}}_{1}\) and \(\tilde{\nu }_{1}\), \(p_{8}\) the \({\tilde{\mathbf{x}}}_{2}\), \(\tilde{\nu }_{2}\) correlation, \(p_{2}\) the \({\tilde{\mathbf{x}}}_{3}\), \(\tilde{\nu }_{3}\) correlation and \(p_{3}\) the correlation between \({\tilde{\mathbf{x}}}_{4}\) and \(\tilde{\nu }_{4}\). The correlation between the \(\tilde{\nu }\)s and \(\nu\) is \(p_{1}\), and the correlations between the regressors are \(p_{4}, p_{5}\) and \(p_{6}\).

$$\left( {\begin{array}{*{20}c} \nu \\ {\tilde{\nu }_{1} } \\ {\tilde{\nu }_{2} } \\ {\tilde{\nu }_{3} } \\ {\tilde{\nu }_{4} } \\ {{\tilde{\mathbf{x}}}_{1} } \\ {{\tilde{\mathbf{x}}}_{2} } \\ {{\tilde{\mathbf{x}}}_{3} } \\ {{\tilde{\mathbf{x}}}_{4} } \\ \end{array} } \right)\sim N\left( {\left( {\begin{array}{*{20}c} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ \end{array} } \right),\left[ {\begin{array}{*{20}c} {\sigma_{\nu }^{2} } & {p_{1} } & {p_{1} } & {p_{1} } & {p_{1} } & 0 & 0 & 0 & 0 \\ {p_{1} } & 1 & 0 & 0 & 0 & {p_{7} } & 0 & 0 & 0 \\ {p_{1} } & 0 & 1 & 0 & 0 & 0 & {p_{8} } & 0 & 0 \\ {p_{1} } & 0 & 0 & 1 & 0 & 0 & 0 & {p_{2} } & 0 \\ {p_{1} } & 0 & 0 & 0 & 1 & 0 & 0 & 0 & {p_{3} } \\ 0 & {p_{7} } & 0 & 0 & 0 & 1 & {p_{4} } & {p_{5} } & {p_{6} } \\ 0 & 0 & {p_{8} } & 0 & 0 & {p_{4} } & 1 & 0 & 0 \\ 0 & 0 & 0 & {p_{2} } & 0 & {p_{5} } & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & {p_{3} } & {p_{6} } & 0 & 0 & 1 \\ \end{array} } \right]} \right)$$
(18)

Variables entering Eq. (17) are generated on the basis of \(p_{1} = 0.5, p_{2} = 0.75, p_{3} = 0.25\), \(p_{4} = 0.3\), \(p_{5} = 0.4\), \(p_{6} = 0.5\), \(p_{7} = 0.8\) and \(p_{8} = 0.2\). Additionally, data are generated on the basis of \(p_{1} = 0.75, p_{2} = 0.75, p_{3} = 0.25\), \(p_{4} = -0.5\), \(p_{5} = -0.2\), \(p_{6} = 0.75\), \(p_{7} = 0.8\) and \(p_{8} = 0.3\). In this case, the stronger correlation between \(\nu\) and the \(\tilde{\nu }\)s implies greater endogeneity.

The true parameter values utilised in Eq. (17) are \(\gamma = 0.3, \rho = 0.2, \theta = -0.4\) and \(\beta_{1} = 4, \beta_{2} = 3, \beta_{3} = 2\), \(\beta_{4} = 1, \beta_{5} = 1, \beta_{6} = 2\), \(\beta_{7} = 3\) and \(\beta_{8} = 4\). Consider first an estimator which applies synthetic instruments, based on \({\mathbf{M}}_{N}\), to the endogenous variables \(\tilde{x}_{1it}, \tilde{x}_{2it}, \tilde{x}_{3it}, \tilde{x}_{4it}\) and their spatial lags \(W\tilde{x}_{1it}, W\tilde{x}_{2it}, W\tilde{x}_{3it}, W\tilde{x}_{4it}\). Combining these eight IV-style instruments with HENR instruments from \(y_{it}, Wy_{it}, Wy_{it - 1}\) and \(W^{2} y_{it - 1}\) gives \(4(T - 1)(T - 2)/2 + 8 = 152\) instrumental variables. Secondly, applying synthetic instruments to all variables except \(y_{it}\) and its spatial lag \(Wy_{it}\) results in \((T - 1)(T - 2) + 10 = 82\) instruments. Collapsing these latter two HENR sets of instruments gives \(2(T - 2) + 10 = 26\) instruments. A simple comparator assumes \(\beta_{5} = 0, \beta_{6} = 0, \beta_{7} = 0\) and \(\beta_{8} = 0\), so that the spatial lags and their respective instruments are eliminated, leaving 148, 78 and 22 instruments, respectively.

Tables 15, 16, 17, 18, 19, 20, 21, 22 and 23 give the mean outcomes, broken down by the number of instruments, the values assigned to \(\sigma_{\mu }^{2}\) and \(\sigma_{\nu }^{2}\), and the values of \(p_{1},...,p_{8}\), denoted either by \(p_{1} = 0.5\) etc. or \(p_{1} = 0.75\) etc. The italicised numbers are the outcomes for the simple comparator. The salient features are the usual many-instrument biases in the Sargan–Hansen J statistic and in the parameter standard errors. For the spatial Durbin specification, some parameter estimation bias is apparent in Tables 15 and 16; notably, the \(\rho\) and \(\theta\) estimates are persistently upwardly biased regardless of the number of instruments or error assumptions, and likewise the spatial lag parameters \(\beta_{5}, \beta_{6}, \beta_{7}\) and \(\beta_{8}\) are negatively biased.Footnote 11 In contrast, the bias for the comparator is negligible, and since everything else is the same, it appears that the primary cause of the larger bias is the presence of spatial lags. For the spatial Durbin specification, Table 17 shows that on the whole bias tends to increase as the number of instruments decreases. It is also larger as endogeneity increases, with \(p_{1}\) going from 0.5 to 0.75, and it increases as error variance goes from \(\sigma_{\nu }^{2} = 0.2\) to \(\sigma_{\nu }^{2} = 0.8\). Mean RMSE follows a similar pattern. The simple comparator shows similar traits, but its bias and RMSE are always much smaller. Tables 18 and 19 show that for the spatial Durbin specification more instruments are again associated with extra downward bias in standard errors and therefore upward bias in t-ratios. Downward bias is also evident from comparing the classic two-step standard errors with the corrected Windmeijer and HKL standard errors.
The spatial Durbin diagnostics in Tables 20, 21, 22 and 23 all point to the viability of the instruments, according to the evidence provided by the \(m_{2}\) statistic, although again the J test statistic is severely undersized when the number of instruments is large.

Table 15 Mean parameter estimates: spatial Durbin and simple specifications
Table 16 Mean parameter estimates: spatial Durbin and simple specifications
Table 17 Mean parameter bias and root mean squared error (RMSE) for selected data generating processes (DGPs): spatial Durbin and simple specifications
Table 18 Mean standard errors: spatial Durbin specification
Table 19 Mean standard errors: spatial Durbin specification
Table 20 Spatial Durbin specification, mean value of diagnostics: \(\sigma_{\mu }^{2} = 0.8\), \(\sigma_{\nu }^{2} = 0.2\), \(p_{1} = 0.5\) etc.
Table 21 Spatial Durbin specification, mean value of diagnostics: \(\sigma_{\mu }^{2} = 0.2\), \(\sigma_{\nu }^{2} = 0.8\), \(p_{1} = 0.5\) etc.
Table 22 Spatial Durbin specification, mean value of diagnostics: \(\sigma_{\mu }^{2} = 0.8\), \(\sigma_{\nu }^{2} = 0.2\), \(p_{1} = 0.75\) etc.
Table 23 Spatial Durbin specification, mean value of diagnostics: \(\sigma_{\mu }^{2} = 0.2\), \(\sigma_{\nu }^{2} = 0.8\), \(p_{1} = 0.75\) etc.

6 Example with real data

Fingleton et al. (2020) provide a good example where the application of synthetic instruments yields valuable information regarding the reality of causal effects. The data analysed are taken from successive UK censuses carried out in 1971, 1981, 1991, 2001 and 2011 and relate to employment across each of 760 small areas of Greater London known as wards. For each ward and each census, data are available on the level of employment and on the number of people born in Ireland, India, Pakistan, mainland Europe, the UK, and the rest of the world (i.e. elsewhere), the so-called country-of-birth cohorts. Thus, for each ward and each census year there are six country-of-birth cohorts. Additionally, data are available on the location quotientFootnote 12 of the unemployment rate. There are therefore seven possibly endogenous right-hand-side explanatory variables. These data, in logarithmic form, are analysed via a dynamic spatial panel data model in the spirit of Eqs. (1) and (2), in which log employment is the dependent variable: employment depends on the level of employment in the previous census, the level of employment in nearby (contiguous) wards, and the level of employment in contiguous wards in the previous census. Additionally, unlike Fingleton et al. (2020), year dummies are introduced, since they are statistically significant. The data allow only two, because lagging and differencing leave only three usable years, and one year dummy must be dropped to avoid perfect collinearity. The choice of dummy years has no effect on the estimates obtained, apart from the dummy variable parameter estimates themselves. Time-invariant heterogeneity across wards is eliminated by differencing. The analysis employs a row-standardised contiguity matrix \({\mathbf{W}}_{N}\) in order to capture spatial spillover effects. Additionally, following Baltagi et al. (2019), spatial error dependence is represented by a spatial moving average (SMA) error process. As they explain, this should control for omitted spatial lags of regressors typical of the spatial Durbin specification. With SMA errors, a negative coefficient for the spatial error parameter \(\lambda\) indicates positive spatial error dependence.

Fingleton et al. (2020) consider two scenarios: one in which the explanatory variables are exogenous, and one in which the regressors are endogenous, with instruments controlling for endogeneity-inducing effects, for example due to reverse causation whereby an increase in the level of employment, the dependent variable, causes country-of-birth numbers to increase, perhaps through inward migration attracted by employment opportunities. Eliminating this kind of feedback is likely to provide evidence of causal impacts, isolating the effect of an increase in a country-of-birth cohort on the level of employment. However, their conclusions are cautionary because of the short time period considered. The data allow only \(T = 4\) effective census years, with 1971 providing the lagged data, so one is unable to calculate the Arellano and Bond (1991) \(m_{2}\) test statistic for zero second-order serial correlation in the first-differenced residuals, which is only defined for \(T \ge 5\).

Applying synthetic instruments involves just 22 instruments and leads to a Sargan–Hansen J test that supports an assumption of consistent estimation. The instrument set comprises 14 IV-style synthetic instruments derived from the seven presumably endogenous explanatory variables and their spatial lags, plus eight collapsed HENR instruments based on lagged values of \(y_{it}, Wy_{it}, Wy_{it - 1}\) and \(W^{2} y_{it - 1}\). The synthetic instruments are based on the eigenvectors \({\mathbf{E}}_{i}, i = 1,...,N\), deriving from the matrix \({\mathbf{P}}_{N}\) and the symmetric contiguity matrix \({\mathbf{M}}_{N}\). Given the large number of eigenvectors \((N = 760)\), a quite stringent rule is applied to rule out spurious correlation between eigenvectors and each endogenous variable, with a t-ratio threshold of 2.58, equivalent to a two-sided p-value of 1%. The estimates are given in Tables 24 and 25.
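The selection rule can be sketched as follows. This is not the authors' code: the paper defines \({\mathbf{P}}_{N}\) and the synthetic-instrument construction earlier; the sketch assumes \({\mathbf{P}}_{N}\) is the usual centering projector of Moran eigenvector spatial filtering, and `synthetic_instrument` is a hypothetical helper returning the fitted values from regressing the endogenous variable on the retained eigenvectors.

```python
import numpy as np

def synthetic_instrument(M, x, t_crit=2.58):
    """Sketch: eigenvectors of the doubly-centred contiguity matrix are
    screened by a bivariate-regression t-ratio threshold; the synthetic
    instrument is the fitted value from regressing x on those retained."""
    N = M.shape[0]
    P = np.eye(N) - np.ones((N, N)) / N     # assumed form of P_N (centering projector)
    _, E = np.linalg.eigh(P @ M @ P)        # candidate eigenvectors (unit norm columns)
    xc = x - x.mean()
    keep = []
    for j in range(N):
        e = E[:, j]
        b = e @ xc                          # bivariate OLS slope (e'e = 1)
        resid = xc - b * e
        se = np.sqrt(resid @ resid / (N - 2))
        if se > 0 and abs(b) / se > t_crit: # stringent screen against spurious fits
            keep.append(j)
    if not keep:                            # nothing passes: fall back to the mean
        return np.full(N, x.mean())
    X = np.column_stack([np.ones(N), E[:, keep]])
    beta, *_ = np.linalg.lstsq(X, x, rcond=None)
    return X @ beta                         # fitted values = synthetic instrument
```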

Table 24 Model diagnostics
Table 25 Parameter estimates using synthetic instruments

The diagnostics in Table 24 indicate acceptable model estimates. The maximum characteristic root of \({\tilde{\mathbf{A}}}\) points to a dynamically stable and stationary model. Moreover, there is evidence from the Sargan–Hansen J test that the 10 overidentifying restrictions are valid, indicating that endogeneity bias has been purged, that the estimates are consistent, and that the effects can be interpreted as causal.

The interpretation of Table 25 reflects to some extent the analysis of Fingleton et al. (2020). The signs of the parameter estimates are not dissimilar, although the presence of the year dummies in the current specification has an impact; for example, the estimate of the spatial lag parameter \(\rho\) is smaller though still significantly different from zero (see Lee and Yu 2010). Some estimates are larger, and although the classic standard errors are also larger, t-ratios are larger and pick out some significant effects. On the basis of the uncorrected standard errors, the significant variables are the Irish, Indian, UK, Pakistani and rest-of-the-world countries of birth, plus a significant negative effect of the unemployment location quotient and significant positive spatial spillovers from log employment (\(\rho\)) and residuals (\(\lambda\)) in contiguous districts. A 1% increase in Indian-born residents evidently causes an approximately 0.16% increase in employment. For UK-born residents, the estimate is a 0.6% increase, and for migrants from the rest of the world, a 0.22% increase. In contrast, a 1% increase in the Irish-born population evidently causes a 0.35% reduction in employment, and for Pakistani-born residents there is a 0.085% reduction. While we might infer causality from the avoidance of endogeneity bias in our estimation, caution is required, partly because these conclusions are based on uncorrected standard errors. As shown in Table 25, the Windmeijer and HKL corrections do increase standard errors, but the significant effects persist.

An additional consideration is that the parameter point estimates are not the true elasticities, because the significant spatial spillover effect (\(\rho\)) magnifies the initial point estimates. Accordingly, the true elasticity for Indian-born residents (variable \(x_{2}\)) is derived from

$$\left[ {\frac{dy}{{dx_{12} }}...\frac{dy}{{dx_{N2} }}} \right]_{t} = \hat{\beta }_{2} {\mathbf{B}}_{N}^{ - 1}$$
(19)

which is an \(N\) by \(N\) matrix of partial derivatives. From this, a simplified average measure of the total effect of a 1% increase in the Indian-born population at time t is given by the mean column sum of \(\hat{\beta }_{2} {\mathbf{B}}_{N}^{ - 1}\), which is equal to 0.269%. For migrants born in the rest of the world, the elasticity is 0.374%. By comparison, the true elasticity for the UK-born cohort is 1.01%. On the negative side, the elasticity for Pakistani-born residents is − 0.144% and for the Irish-born it is − 0.603%.
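The mean-column-sum calculation in Eq. (19) is easily sketched. The form \({\mathbf{B}}_{N} = {\mathbf{I}}_{N} - \hat{\rho }{\mathbf{W}}_{N}\) is our assumption (the paper defines \({\mathbf{B}}_{N}\) earlier), and the numbers in the usage note are illustrative rather than the paper's estimates.

```python
import numpy as np

def mean_total_effect(beta_hat, rho_hat, W):
    """Mean column sum of beta_hat * B_N^{-1} as in Eq. (19),
    assuming B_N = I_N - rho_hat * W_N (an assumed standard form)."""
    N = W.shape[0]
    effects = beta_hat * np.linalg.inv(np.eye(N) - rho_hat * W)  # N x N partial derivatives
    return effects.sum(axis=0).mean()                            # average total effect
```

For a row-standardised \({\mathbf{W}}_{N}\) this reduces exactly to \(\hat{\beta }/(1 - \hat{\rho })\), so with an illustrative \(\hat{\beta } = 0.16\) and \(\hat{\rho } = 0.2\) the average total effect is 0.2.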

Extending beyond time \(t\), assuming dynamic stability and stationarity, elasticities converge in the long-run to steady-state levels given by

$$\left[ {\frac{dy}{{dx_{1k} }}...\frac{dy}{{dx_{Nk} }}} \right] = \left[ { - {\mathbf{C}}_{N} + {\mathbf{B}}_{N} } \right]^{ - 1} \left( {\hat{\beta }_{k} {\mathbf{I}}_{N} } \right)$$
(20)

The total long-run effect of a 1% increase in the Indian-born population is given by the mean column sum of Eq. (20) with \(k = 2\), which is equal to 0.699%. For migrants born in the rest of the world, the elasticity is 0.971%. By comparison, the true elasticity for the UK-born cohort is 2.599%. On the negative side, the elasticity for Pakistani-born residents is − 0.374% and for the Irish-born it is − 1.567%.
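The long-run calculation in Eq. (20) differs from the time-\(t\) one only in the matrix being inverted. A self-contained sketch, assuming \({\mathbf{B}}_{N} = {\mathbf{I}}_{N} - \hat{\rho }{\mathbf{W}}_{N}\) and \({\mathbf{C}}_{N} = \hat{\gamma }{\mathbf{I}}_{N} + \hat{\theta }{\mathbf{W}}_{N}\) (assumed forms; the paper defines both matrices earlier):

```python
import numpy as np

def long_run_total_effect(beta_hat, rho_hat, gamma_hat, theta_hat, W):
    """Mean column sum of (B_N - C_N)^{-1} (beta_hat I_N) as in Eq. (20),
    under the assumed forms B_N = I - rho*W and C_N = gamma*I + theta*W."""
    N = W.shape[0]
    B = np.eye(N) - rho_hat * W
    C = gamma_hat * np.eye(N) + theta_hat * W
    effects = np.linalg.inv(B - C) * beta_hat   # N x N long-run partial derivatives
    return effects.sum(axis=0).mean()           # steady-state average total effect
```

For a row-standardised \({\mathbf{W}}_{N}\) this equals \(\hat{\beta }/(1 - \hat{\rho } - \hat{\gamma } - \hat{\theta })\); for example, with the Sect. 5.4 simulation values \(\gamma = 0.3, \rho = 0.2, \theta = -0.4\) and an illustrative \(\hat{\beta } = 1\), the long-run effect is \(1/0.9 \approx 1.11\).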

7 Conclusion

One of the main problems in analysing dynamic (spatial) panel data estimated by GMM with endogenous explanatory variables is the difficulty of finding appropriate instruments. A standard approach is to use instruments that are internal to the model, lagging the regressors appropriately in order to satisfy the moment conditions, which require instruments to be orthogonal to the differenced errors. The default HENR instrumentation typically generates a large number of internal instruments, as can the use of multiple spatial lags suggested in the spatial econometrics literature. But instrument proliferation makes the Sargan–Hansen J test statistic severely undersized and low-powered, and therefore unreliable as an indicator of estimation consistency. Several options are available to reduce the problem, and this paper proposes a new one: extending the use of synthetic instruments, advocated by Le Gallo and Páez (2013) for cross-sectional data, to dynamic spatial panel data modelling. The key to the approach is a set of exogenous instruments derived from a connectivity matrix which is not causally related to the data. Nevertheless, the synthetic instruments thus created tend to be strongly correlated with the endogenous regressors and can thus remedy the problem of weak instruments. Using Monte Carlo simulation, the paper provides evidence that collapsing and applying synthetic instruments, thereby significantly reducing the number of instruments, alleviates the problems associated with instrument overabundance while producing, on the whole, plausible parameter estimates, although there is evidence of some parameter estimation bias. An important additional problem is the downward bias in parameter standard error estimates, which can result in serious upward bias in t-ratios; collapsing and the use of synthetic instruments greatly diminish this problem.
The results of Monte Carlo simulation are of course conditional on assumptions made, so one cannot be too dogmatic regarding the generality of the conclusions reached. However, application to real data given in Fingleton et al. (2020) produces plausible outcomes and new insights which put the analysis provided in that paper on a stronger inferential footing.