1 Introduction

The spatial dimension of economic and other activities of interest is often vital for understanding processes at work. Spatial spillover effects, i.e. impacts of an observational unit i on other units \(j \ne i\), are commonplace in theory, but can pose considerable challenges for econometric modelling. Spatial econometrics allows researchers to analyse and control for this dimension in a parsimonious way. Spatial models are an important part of the empirical toolkit of many disciplines, in particular of economics. They have successfully been used to investigate the determinants of growth (Acemoglu et al. 2019; Crespo Cuaresma et al. 2014; Lesage and Fischer 2008; Panzera and Postiglione 2021), the drivers of land use change (Arima et al. 2011; Chakir and Le Gallo 2013; Kuschnig et al. 2021), international trade (Behrens et al. 2012; Krisztin and Fischer 2015; Yang et al. 2017), and many more pressing issues.

Bayesian approaches to spatial models allow for flexible specifications that loosen restrictive assumptions and impose structure where needed. The result is models that better reflect reality and natively account for uncertainty. Related developments in the field of spatial econometrics include Bayesian model averaging and model selection (LeSage and Parent 2007; Lesage and Fischer 2008; Crespo Cuaresma and Feldkircher 2013; Pfarrhofer and Piribauer 2019), hierarchical and mixture modelling (Cornwall and Parent 2017; Dong and Harris 2015; Dong et al. 2015; Lacombe and McIntyre 2016), the flexible treatment of connectivity (Debarsy and LeSage 2018, 2020; Han and Lee 2016; Krisztin and Piribauer 2021), and models with limited dependent variables (Krisztin et al. 2021; LeSage 2000). However, Bayesian approaches are rarely used for applied spatial econometrics. This is due, in part, to a lack of dedicated software for Bayesian spatial modelling.

The flexibility of Bayesian modelling and the peculiarities of spatial econometrics challenge good software implementations. General-purpose software for Bayesian modelling, such as JAGS (Plummer 2003) and Stan (Carpenter et al. 2017), is not optimised for these models and can be computationally inefficient (see Wolf et al. 2018). Bivand and Piras (2015) first document a lack of software for Bayesian spatial econometric methods, with the notable exception of the MATLAB (The MathWorks Inc 2021) Econometrics Toolbox by LeSage (1999). At the time of writing, some toolbox routines were ported to the spatialreg package (Bivand and Piras 2015), the R-INLA (Lindgren and Rue 2015) package allows for marginal inference (also see Bivand et al. 2015; Gómez-Rubio et al. 2021), and there exist many implementations of specific (e.g. Dong et al. 2016) or related (e.g. Morris et al. 2019) models. However, comprehensive free and open source software for spatial econometric models remains scarce—particularly for the R (R Core Team 2021) ecosystem.

In this paper, I present a layered object-oriented software architecture that bridges the flexibility of Bayesian modelling with the necessities of spatial econometrics. I provide an implementation in bsreg (Kuschnig 2021), an R package for estimating spatial econometric models. The package provides routines for common models and, thanks to its architecture, can readily be adapted and extended. A combination of two distinct programming and user interfaces allows for a universal underlying structure, while accommodating existing R workflows. The R6 (Chang 2021) object-oriented system is used for the underlying structure; outputs are provided in a standard format, integrating into established procedures with base, third-party, or custom functionality. This duality of a flexible, object-oriented structure and a familiar user interface makes the proposed software architecture maintainable, extensible, and convenient to use.

I demonstrate the flexibility of Bayesian spatial modelling and bsreg by reproducing a frequentist analysis of cigarette demand. Halleck Vega and Elhorst (2015) propose a model with local spatial lags, where distance-based connectivities are parameterised—I present a Bayesian variant of their approach. I find that Bayesian posterior densities yield much improved measures of uncertainty: uncertainty in non-linear models is characterised by heavy tails that conventional measures do not reflect. I also find that the estimates of average spillover effects by Halleck Vega and Elhorst (2015) were inflated considerably for distance-based connectivities. This is due to a scaling issue that, at the time of writing, also exists in the widely used spatialreg package. This suggests that (a) Bayesian methods are suitable for flexibly extending spatial models, (b) they yield improved insights into uncertainty, and (c) the complexities of spatial models, including their interpretation, present a dangerous source of errors for ad-hoc, untested approaches.

The remainder of this paper is structured as follows. First, I introduce relevant econometric methods to the reader. This provides a foundation to discuss the software architecture and implementation. I start by sketching out spatial econometrics, and refer the interested reader to Anselin (2013) and LeSage and Pace (2009) for further information. Then, I introduce Bayesian inference and simulation methods for Bayesian estimation, where I also present a Bayesian approach to the parameterised connectivity model of Halleck Vega and Elhorst (2015). For more information on Bayesian inference and computation, I refer to Lancaster (2004), Gelman et al. (2013), and Gamerman and Lopes (2006). In Sect. 4, I present the proposed software architecture, discuss the implementation of the bsreg package and demonstrate its use. In Sect. 5, I estimate several spatial models of cigarette demand and discuss similarities and differences of the Bayesian approach to previous results by Halleck Vega and Elhorst (2015). Section 6 concludes and provides an outlook for future work.

2 Spatial econometric models

Consider a standard linear regression model

$$\begin{aligned} {\mathbf {y}} = {\mathbf {X}} \beta + {\mathbf {e}}, \end{aligned}$$
(1)

where \({\mathbf {y}} \in {\mathbb {R}}^N\), \({\mathbf {X}} \in {\mathbb {R}}^{N \times K}\), and \({\mathbf {e}} \in {\mathbb {R}}^N\) is an error term with mean zero. The idea behind spatial econometric models is to extend this model with spatial information by using neighbouring values. A comprehensive spatial econometric model, also termed the ‘general nested model’, is given by

$$\begin{aligned} {\mathbf {y}}&= \lambda _1 {\mathbf {W}} {\mathbf {y}} + {\mathbf {X}} \beta + {\mathbf {W}} {\mathbf {X}} \theta + {\mathbf {u}}, \\ {\mathbf {u}}&= \lambda _2 {\mathbf {W}} {\mathbf {u}} + {\mathbf {e}}. \end{aligned}$$

where \({\mathbf {W}} \in {\mathbb {R}}^{N \times N}\) is a connectivity matrix that imposes a neighbourhood structure, with elements \({\mathbf {w}}_{ij} > 0\) for neighbours i and j (with \(i \ne j\)) and 0 otherwise. The matrix is usually scaled in some way (see Anselin 2013 for further information on connectivity matrices), and variables are assumed to be centred for notational convenience.

This model includes three spatial lags that are the building blocks of spatial econometric models. These are the spatial autoregressive lag \({\mathbf {W}} {\mathbf {y}}\), the spatially lagged regressors \({\mathbf {W}} {\mathbf {X}}\), and the spatially lagged error \({\mathbf {W}} {\mathbf {u}}\). In practice, this comprehensive model sees little use and models with one or two of the spatially lagged terms are preferred. I briefly introduce the most prominent ones below.

The spatial lag of \({\mathbf {X}}\) (SLX) model is given by

$$\begin{aligned} {\mathbf {y}} = {\mathbf {X}} \beta + {\mathbf {W}} {\mathbf {X}} \theta + {\mathbf {e}}. \end{aligned}$$
(2)

This model is notable for two main reasons. First, it allows for local spillovers from neighbour to neighbour that are captured in the parameter \(\theta\). Second, for a given connectivity matrix, the SLX model is linear in parameters. We can express it in the form of a standard linear model (Eq. (1)) by collecting the regressors and their spatial lags in a single regressor matrix \(\left[ {\mathbf {X}},\, {\mathbf {W}}{\mathbf {X}} \right]\). This means that the SLX model is straightforward to estimate, while yielding insights into spillover effects. For a more in-depth discussion of the SLX model, I refer to LeSage and Pace (2009) and Halleck Vega and Elhorst (2015).
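To illustrate this linearity, the sketch below estimates an SLX specification with ordinary least squares on simulated data; all names and values are purely illustrative.

    set.seed(1)
    N <- 100
    W <- matrix(runif(N^2), N, N); diag(W) <- 0
    W <- W / rowSums(W)                        # row-stochastic connectivity
    X <- cbind(rnorm(N), rnorm(N))
    y <- X %*% c(1, -0.5) + W %*% X %*% c(0.3, 0.2) + rnorm(N)

    # SLX as a standard linear model: regressors and their spatial lags
    coef(lm(y ~ cbind(X, W %*% X)))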

The next notable models are the spatial autoregressive (SAR) model and the spatial error model (SEM). We can express the SAR model in two useful ways, which are

$$\begin{aligned} {\mathbf {y}}&= \lambda {\mathbf {W}} {\mathbf {y}} + {\mathbf {X}} \beta + {\mathbf {e}}, \end{aligned}$$
(3)
$$\begin{aligned} \left( {\mathbf {I}} - \lambda {\mathbf {W}} \right) {\mathbf {y}}&= {\mathbf {X}} \beta + {\mathbf {e}}, \end{aligned}$$
(4)

and the SEM in the following way

$$\begin{aligned} {\mathbf {y}}&= {\mathbf {X}} \beta + {\mathbf {u}}, \qquad {\mathbf {u}} = \lambda {\mathbf {W}} {\mathbf {u}} + {\mathbf {e}}, \end{aligned}$$
(5)
$$\begin{aligned} \left( {\mathbf {I}} - \lambda {\mathbf {W}} \right) {\mathbf {y}}&= \left( {\mathbf {I}} - \lambda {\mathbf {W}} \right) {\mathbf {X}} \beta + {\mathbf {e}}. \end{aligned}$$
(6)

Both the SEM and SAR model induce non-linear spatial filters \(\left( {\mathbf {I}} - \lambda {\mathbf {W}} \right)\), in which the strength of connectivity is captured by a single spatial parameter, \(\lambda\). Non-linearity can be a challenge for estimation; suitable modes of estimation are the generalised method of moments (Kelejian and Prucha 1998, 1999), maximum likelihood (Lee 2004), and Bayesian methods (LeSage and Pace 2009). The SAR model’s autoregressive parameter captures global spillovers, which occur across all units. This offers additional insights, but complicates interpretation (as discussed below). The SEM allows for spatial dependence of shocks; this is usually interpreted as spatial heterogeneity as opposed to a spatial spillover (see e.g. LeSage and Pace 2009).

Three further spatial econometric models combine two of the three spatially lagged terms. They are the spatial Durbin model (SDM), spatial Durbin error model (SDEM), and the spatial autoregressive combined (SAC) model. Arguably the most notable one is the SDM—a combination of the spatial terms in the SLX and SAR models. Here I will just note that the SDM has convenient properties that lead LeSage and Pace (2009) to argue for considering it as the ‘default’ spatial model. See Elhorst (2010) for a discussion of this notion and Halleck Vega and Elhorst (2015) for a different argument, favouring the SLX model. Note that this discussion is shaped by the previously dominant role of the SAR model, which is sometimes referred to as the ‘spatial lag model’.

2.1 Interpretation of spatial econometric models

Partial effects of an explanatory variable k are directly captured by the respective coefficient \(\beta _k\) in the standard linear model. This is not generally the case for models with spatial lags of the dependent and explanatory variables, such as the SLX and SAR models. In these models, partial effects are (1) matrices, and (2) depend on the spatial parameters and connectivity matrix.

Partial effects of a variable k in a SAR model involve the spatial filter and are given by

$$\begin{aligned} \frac{\partial {\mathbf {y}}}{\partial {\mathbf {x}}_k} = \left( {\mathbf {I}} - \lambda {\mathbf {W}} \right) ^{-1} {\mathbf {I}} \beta _k = {\mathbf {S}}^{-1} {\mathbf {I}} \beta _k, \end{aligned}$$

whereas in an SLX model, they are given by

$$\begin{aligned} \frac{\partial {\mathbf {y}}}{\partial {\mathbf {x}}_k} = {\mathbf {I}} \beta _k + {\mathbf {W}} \theta _k = \begin{pmatrix} \beta _k & w_{1 2} \theta _k & \dots & w_{1 N} \theta _k \\ w_{2 1} \theta _k & \beta _k & \dots & w_{2 N} \theta _k \\ \vdots & \vdots & \ddots & \vdots \\ w_{N 1} \theta _k & w_{N 2} \theta _k & \dots & \beta _k \end{pmatrix}. \end{aligned}$$

The resulting matrices can be interpreted using the summary measures of average total, direct, and indirect effects (LeSage and Pace 2009). These are given by

$$\begin{aligned} \mathrm {E}_k^{tot}&= N^{-1} \sum _i \sum _j \left[ \frac{\partial {\mathbf {y}}}{\partial {\mathbf {x}}_k}\right] _{i j}, \\ \mathrm {E}_k^{dir}&= N^{-1} \sum _i \left[ \frac{\partial {\mathbf {y}}}{\partial {\mathbf {x}}_k}\right] _{i i}, \\ \mathrm {E}_k^{ind}&= \mathrm {E}_k^{tot} - \mathrm {E}_k^{dir}. \end{aligned}$$

For the SAR model, these summary measures involve the inverse of \({\mathbf {S}}\), which can be prohibitive to compute. Average direct effects in the SLX model, i.e. the average trace of the partial effect matrix, are directly given by \(\beta _k\). In certain situations, e.g. when using a row-stochastic connectivity matrix \({\mathbf {W}}\), the average indirect effect is also given by \(\theta _k\). However, this is not the case in general. The average indirect effect in the SLX model is given by \(s_{\theta } \theta _k\), where \(s_{\theta }\) is a scaling factor given by the average row sum of \({\mathbf {W}}\), i.e. \(s_{\theta } = N^{-1} \sum _i \sum _j w_{ij}\).
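These summary measures translate directly into code. The sketch below computes them from a partial effects matrix for the SLX and SAR cases; the connectivity matrix and parameter values are illustrative.

    set.seed(1)
    N <- 50
    W <- matrix(runif(N^2), N, N); diag(W) <- 0
    W <- W / rowSums(W)                           # row-stochastic connectivity

    # Average total, direct, and indirect effects of a partial effects matrix
    avg_effects <- function(P) {
      tot <- mean(rowSums(P))                     # N^-1 * sum over all i and j
      dir <- mean(diag(P))                        # N^-1 * trace
      c(total = tot, direct = dir, indirect = tot - dir)
    }

    avg_effects(diag(0.5, N) + W * 0.2)           # SLX: I beta_k + W theta_k
    avg_effects(solve(diag(N) - 0.4 * W) * 0.5)   # SAR: (I - lambda W)^-1 beta_k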

3 Bayesian inference

In Bayesian statistics, probability expresses a degree of belief. Bayes’ theorem is used as a tool to update probabilities in the light of new information and is given by

$$\begin{aligned} p(\theta | {\mathcal {D}}) \propto p({\mathcal {D}} | \theta ) p(\theta ), \end{aligned}$$

where \(\theta\) denotes a set of unknown quantities, \({\mathcal {D}}\) denotes observed quantities, p is the probability density of a probability distribution, and \(\propto\) reads as ‘is proportional to’. The posterior probability, \(p(\theta | {\mathcal {D}})\), is obtained by updating the prior information, \(p(\theta )\), with the likelihood, \(p({\mathcal {D}} | \theta )\). The notion of sequentially updating information—e.g. in a hierarchical setup—is central to Bayesian inference and provides a rich and intuitive modelling framework.

3.1 Closed form inference and the linear model

A Bayesian model is built on distribution assumptions. Consider, for example, the standard linear regression model in Eq. (1). Assuming that errors are distributed normally, with mean zero and constant variance \(\sigma ^2\), the model can be expressed as

$$\begin{aligned} {\mathbf {y}} \sim {\mathcal {N}}_N ({\mathbf {X}} \beta , \sigma ^2 {\mathbf {I}}_N). \end{aligned}$$
(7)

where \({\mathcal {N}}_q(\mu , \mathbf {\Sigma })\) denotes the density of a q-dimensional Normal distribution with mean \(\mu\) and covariance \(\mathbf {\Sigma }\), \({\mathbf {I}}_N\) is the identity matrix, the unknown quantities are \(\theta = \left( \beta , \sigma ^2 \right)\), and the observed ones are \({\mathcal {D}} = \left( {\mathbf {y}}, {\mathbf {X}} \right)\). The likelihood of this model is given by

$$\begin{aligned} p({\mathbf {y}} | \beta , \sigma ^2) = (2 \pi \sigma ^2)^{-N/2} \exp \left\{ -\frac{1}{2\sigma ^2} \left( {\mathbf {y}} - {\mathbf {X}}\beta \right) ' \left( {\mathbf {y}} - {\mathbf {X}}\beta \right) \right\} . \end{aligned}$$
(8)

Next, prior distributions have to be specified for the unobserved quantities, e.g.

$$\begin{aligned} p(\beta | \sigma ^2)&\sim {\mathcal {N}}_K (b_0, \sigma ^2 {\mathbf {B}}_0), \nonumber \\ p(\sigma ^2)&\sim {\mathcal {G}}^{-1} (c_0, C_0), \end{aligned}$$
(9)

where \({\mathcal {G}}^{-1}(a, b)\) refers to the density of an inverted Gamma distribution with shape a and scale b. This prior is often termed the conjugate Normal Inverse-Gamma prior.

Bayesian inference proceeds by combining the priors and likelihood into the joint posterior, \(p(\beta , \sigma ^2 | {\mathcal {D}})\). In this setup, the prior is conjugate, meaning that the posterior is from the same family of probability distributions as the prior. As a result, there is a closed form expression for the joint posterior and knowledge of its probability distribution can be used for inference or to draw samples directly from the posterior.
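As an illustration, the posterior moments under this conjugate prior follow standard textbook results (see e.g. Lancaster 2004), and independent draws can then be generated directly; the sketch below uses illustrative prior values.

    # Independent draws from the conjugate Normal Inverse-Gamma posterior
    conjugate_draws <- function(y, X, n = 1000L, b0 = 0, B0 = 1e4, c0 = 2, C0 = 1) {
      K <- ncol(X); N <- length(y)
      B0_inv <- diag(1 / B0, K); b0 <- rep(b0, K)
      Bn_inv <- B0_inv + crossprod(X)
      Bn <- chol2inv(chol(Bn_inv))                # posterior scale matrix of beta
      bn <- Bn %*% (B0_inv %*% b0 + crossprod(X, y))
      cn <- c0 + N / 2                            # posterior shape of sigma^2
      Cn <- C0 + drop(crossprod(y) + t(b0) %*% B0_inv %*% b0 - t(bn) %*% Bn_inv %*% bn) / 2
      sigma2 <- 1 / rgamma(n, cn, Cn)             # sigma^2 | y
      beta <- t(vapply(sigma2, function(s2)       # beta | sigma^2, y
        drop(bn + t(chol(s2 * Bn)) %*% rnorm(K)), numeric(K)))
      list(beta = beta, sigma2 = sigma2)
    }

    X <- cbind(1, rnorm(100)); y <- X %*% c(1, 2) + rnorm(100)
    str(conjugate_draws(y, X))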

Conjugate priors and the closed form inference they allow play an important role in Bayesian modelling. However, they are limited to special settings and cannot be used for arbitrarily flexible models implied by Bayes’ theorem. In general, posteriors are not of a well-known form and another approach is needed for inference.

3.2 Posterior simulation and the SLX(\(\delta\)) model

Consider adapting the conjugate Normal Inverse-Gamma prior in Eq. (9) so that the prior covariance of \(\beta\) is independent of \(\sigma ^2\). The resulting prior is often termed the independent Normal Inverse-Gamma prior and is given by

$$\begin{aligned} p(\beta )&\sim {\mathcal {N}}_K (b_0, {\mathbf {B}}_0),\nonumber \\ p(\sigma ^2)&\sim {\mathcal {G}}^{-1} (c_0, C_0). \end{aligned}$$
(10)

This small adaptation means that the joint posterior is no longer available in closed form. Samples from the posterior distribution can still be obtained using Markov chain Monte Carlo (MCMC) methods. Closed forms of the conditional posteriors \(p(\beta | {\mathcal {D}}, \sigma ^2)\) and \(p(\sigma ^2 | {\mathcal {D}}, \beta )\) are available, and Gibbs sampling can be used to obtain dependent draws from the posterior. These dependent samples are subject to some autocorrelation, meaning that their effective size, or utility for inferential purposes, will be lower than that of an independent sample of equal size. A Gibbs sampling algorithm for this case works as follows.

0. Let i be zero and set an initial value \(\beta _{(i)}\).

1. Draw \(\sigma ^2_{(i + 1)}\) from the conditional posterior \(p(\sigma ^2 | {\mathcal {D}}, \beta _{(i)})\).

2. Draw \(\beta _{(i + 1)}\) from the conditional posterior \(p(\beta | {\mathcal {D}}, \sigma ^2_{(i + 1)})\).

3. Increment i and go to Step 1 if more samples are desired.

Algorithm 1: Gibbs sampler for the independent Normal Inverse-Gamma prior

Assuming that \(\beta _{(0)}\) from Step 0 is a valid draw from \(p(\beta | {\mathcal {D}})\) it is easy to show that subsequent draws are valid draws from the posterior. For an arbitrary initial value \(\beta _{(0)}\), the Gibbs sampler converges to this state under mild conditions. The stationary distribution of the Gibbs sampler will mirror the joint posterior and draws from it can be used for inference. A number of the initial samples is often discarded as burn-in, to ensure that draws are from the stationary distribution and to limit the influence of the initial value.
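Algorithm 1 translates almost line by line into code. The following is a didactic sketch of such a sampler, not the bsreg implementation; the prior parameters and the least-squares starting value are illustrative choices.

    # Gibbs sampler for the linear model with an independent Normal Inverse-Gamma prior
    gibbs_ing <- function(y, X, n_save = 1000L, n_burn = 500L,
                          b0 = 0, B0 = 1e4, c0 = 2, C0 = 1) {
      K <- ncol(X); N <- length(y)
      B0_inv <- diag(1 / B0, K); b0 <- rep(b0, K)
      XtX <- crossprod(X); Xty <- crossprod(X, y)
      beta <- qr.solve(X, y)                         # Step 0: least-squares start
      draws <- matrix(NA_real_, n_save, K + 1L,
                      dimnames = list(NULL, c(paste0("beta_", 1:K), "sigma2")))
      for (i in seq_len(n_burn + n_save)) {
        e <- y - X %*% beta                          # Step 1: sigma^2 | beta, y
        sigma2 <- 1 / rgamma(1, c0 + N / 2, C0 + drop(crossprod(e)) / 2)
        Bn <- chol2inv(chol(B0_inv + XtX / sigma2))  # Step 2: beta | sigma^2, y
        bn <- Bn %*% (B0_inv %*% b0 + Xty / sigma2)
        beta <- drop(bn + t(chol(Bn)) %*% rnorm(K))
        if (i > n_burn) draws[i - n_burn, ] <- c(beta, sigma2)  # discard burn-in
      }
      draws
    }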

For many models of interest, the conditional posteriors are not conveniently available. Consider the SLX model with parameterised connectivity (Halleck Vega and Elhorst 2015), which we will term the SLX(\(\delta\)) model. Departing from the Bayesian linear model in Eq. (7), we can express this model as

$$\begin{aligned} {\mathbf {y}} \sim {\mathcal {N}}_N ({\mathbf {Z}}(\delta ) \beta , \sigma ^2 {\mathbf {I}}_N), \end{aligned}$$
(11)

where \({\mathbf {Z}}(\delta ) = \left[ {\mathbf {X}},\, \mathbf {\Psi }(\delta ){\mathbf {X}}\right]\) and \(\mathbf {\Psi }(\delta )\) is a connectivity function with parameter \(\delta\) and some form of scaling. For fixed values \({\bar{\delta }}\), this model collapses to the standard SLX model (see Eq. (2)) with connectivity matrix \({\mathbf {W}} = \mathbf {\Psi }({\bar{\delta }})\). An example for \(\mathbf {\Psi }\) is an inverse-distance decay function with \(\psi _{ij} = d_{ij}^{-\delta }\) for \(i \ne j\) and 0 otherwise, where \(d_{ij}\) is some distance between observations i and j. In this case, a suitable prior for \(\delta\) is

$$\begin{aligned} p(\delta ) \sim {\mathcal {G}}^{-1} (d_0, D_0). \end{aligned}$$
(12)

In order to obtain samples from the joint posterior \(p(\delta , \beta , \sigma ^2 | {\mathcal {D}})\), the sampling approach from above needs to be adapted. First, note that conditional on knowing \(\delta\) the Gibbs sampling steps for \(\beta\) and \(\sigma ^2\) can be left unchanged—the hurdle is to obtain draws of \(\delta\). Gibbs sampling is not an option, since the conditional posterior of \(\delta\) has no well-known form. It is given by

$$\begin{aligned} p(\delta | {\mathcal {D}}, \beta , \sigma ^2) \propto \exp \left\{ -\frac{1}{2\sigma ^2} \left( {\mathbf {y}} - {\mathbf {Z}}(\delta )\beta \right) ' \left( {\mathbf {y}} - {\mathbf {Z}}(\delta )\beta \right) \right\} p(\delta ). \end{aligned}$$

An alternative is the Metropolis–Hastings (MH) algorithm, another MCMC method that does not rely on a well-known conditional posterior distribution. In an MH algorithm, a draw is proposed from an arbitrary proposal density \({\mathcal {Q}}(\delta ^\star | \delta _{(i)})\) and is accepted with a certain probability \(\alpha (\delta ^\star | \delta _{(i)})\), such that draws will reflect the joint posterior. Note that the Gibbs sampler outlined above is a special case of the MH algorithm, where draws are proposed in such a way that they are always accepted. An MH sampler for \(\delta\) works as follows.

0. Let i be zero and set an initial draw \(\delta _{(i)}\).

1. Propose a value \(\delta ^\star\) using \({\mathcal {Q}}(\delta ^\star | \delta _{(i)})\).

2. Set \(\delta _{(i + 1)}\) to \(\delta ^\star\) with probability \({\alpha } (\delta ^\star | \delta _{(i)})\) and to \(\delta _{(i)}\) otherwise.

3. Increment i and go to Step 1 if more samples are desired.

Algorithm 2: Metropolis–Hastings sampler for a connectivity parameter \(\delta\)

The acceptance probability in Step 2 is composed of the ratio of the conditional posterior and the proposal density evaluated at the two values. It is given by

$$\begin{aligned} \alpha (\delta ^\star | \delta _{(i)}) = \min \left( 1, \frac{p(\delta ^\star | {\mathcal {D}}, \beta , \sigma ^2)}{p(\delta _{(i)} | {\mathcal {D}}, \beta , \sigma ^2)} \frac{{\mathcal {Q}}(\delta _{(i)} | \delta ^\star )}{{\mathcal {Q}}(\delta ^\star | \delta _{(i)})} \right) . \end{aligned}$$

Assume for now that the second ratio cancels out, as is the case for a symmetric proposal density. Then the first ratio, i.e. the conditional posterior evaluated at \(\delta ^\star\) and at \(\delta _{(i)}\), implies that the sampler will always move to a more probable value \(\delta ^\star\) than the current \(\delta _{(i)}\); for a less probable value it will move with a certain probability. Samples obtained this way will converge to a stationary distribution mirroring the joint posterior under mild conditions. As in the case of the Gibbs sampler, it is common to discard a number of initial samples.
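A random-walk MH step for \(\delta\) only requires the logarithm of the conditional posterior kernel above. The sketch below assumes a symmetric Normal proposal, a distance matrix D, and the inverse-Gamma prior from Eq. (12); all names, and the omission of any scaling of \(\mathbf {\Psi }\), are illustrative simplifications.

    # Log conditional posterior of delta (up to a constant), cf. the kernel above;
    # coefs stacks the coefficients on X and Psi(delta) X, i.e. c(beta, theta)
    log_post_delta <- function(delta, y, X, D, coefs, sigma2, d0 = 2, D0 = 2) {
      if (delta <= 0) return(-Inf)               # outside the prior support
      Psi <- D^(-delta); diag(Psi) <- 0          # inverse-distance decay, unscaled
      e <- y - cbind(X, Psi %*% X) %*% coefs
      drop(-crossprod(e) / (2 * sigma2)) +
        (-(d0 + 1) * log(delta) - D0 / delta)    # inverse-Gamma log prior kernel
    }

    # One random-walk Metropolis-Hastings update with a symmetric Normal proposal,
    # so that the proposal densities cancel in the acceptance probability
    mh_step_delta <- function(delta, ..., scale = 0.1) {
      proposal <- rnorm(1, delta, scale)
      log_alpha <- log_post_delta(proposal, ...) - log_post_delta(delta, ...)
      if (log(runif(1)) < log_alpha) proposal else delta
    }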

3.3 Sampling considerations and other spatial models

Bayesian estimation of models with non-linear spatial filters, such as the SAR model, works similarly to the approach described above. A common prior for the parameter \(\lambda\), which is constrained to the support \((-1, 1)\), is a Beta density

$$\begin{aligned} p(\frac{\lambda + 1}{2}) \sim \mathrm {Beta}(d_0, D_0). \end{aligned}$$

In such a model, sampling of the parameters \(\beta\) and \(\sigma ^2\) works as described above. An MH step can be used to sample \(\lambda\) from its conditional posterior. Letting \({\mathbf {S}}(\lambda ) = \left( {\mathbf {I}} - \lambda {\mathbf {W}}\right)\), the conditional posterior of \(\lambda\) in the SAR model is given by

$$\begin{aligned} p(\lambda | {\mathcal {D}}, \beta , \sigma ^2) \propto |{\mathbf {S}}(\lambda )| \exp \left\{ -\frac{1}{2\sigma ^2} \left( {\mathbf {S}}(\lambda ){\mathbf {y}} - {\mathbf {X}}\beta \right) ' \left( {\mathbf {S}}(\lambda ){\mathbf {y}} - {\mathbf {X}}\beta \right) \right\} p(\lambda ). \end{aligned}$$

This expression can be prohibitive to compute (mainly due to the \(N \times N\) determinant), which is a central issue for Bayesian (and maximum likelihood) estimation (Bivand et al. 2013). The determinant generally has a computational complexity of \({\mathcal {O}}(N^3)\) that can be prohibitive for sampling. However, it can be computed efficiently using the eigenvalues of \({\mathbf {W}}\), or other approaches that include sparse matrix factorisations and approximate methods (see e.g. Barry and Pace 1999; Pace et al. 2004; Smirnov and Anselin 2009). Efficient sampling is crucial and can be attained, e.g., by concentrating the likelihood with respect to \(\lambda\) and employing suitable simulation methods.
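To illustrate the eigenvalue approach, the eigenvalues of \({\mathbf {W}}\) are computed once and reused to evaluate the log-determinant for any proposed \(\lambda\); the connectivity matrix below is simulated for illustration.

    set.seed(1)
    N <- 200
    W <- 1 * (matrix(runif(N^2), N, N) < 0.05); diag(W) <- 0
    W <- W / pmax(rowSums(W), 1)               # row-stochastic, guarding empty rows

    ev <- eigen(W, only.values = TRUE)$values  # computed once, reused per proposal
    log_det_S <- function(lambda) {
      # det(I - lambda W) = prod(1 - lambda * ev); imaginary parts cancel in pairs
      Re(sum(log(1 - lambda * ev)))
    }
    log_det_S(0.4)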

Bayesian simulation methods aim to generate a usable number of effective draws from the posterior in as little time as possible. This requires balancing the computational cost of obtaining draws and the effective size of these draws. The Gibbs and MH algorithms are computationally cheap ways of producing dependent draws. Their efficiency in producing effective draws relies on rapid mixing, i.e. movement through the support of the posterior. A sampler that mixes slowly has high autocorrelation, hence producing a low number of effective draws. Another important concept in simulation methods is posterior exploration, i.e. how well a sampler explores the full posterior density. Exploration is a prerequisite for proper sampling, since any number of draws is irrelevant if they do not reflect the posterior.

An important advantage of the MH algorithm over Gibbs sampling is that it can be tuned. The MH sampling steps can be improved by adjusting the proposal density to achieve suitable acceptance rates. If the acceptance rate is too high or too low, the sampler will suffer from poor posterior exploration and converge to the stationary distribution very slowly. This can be diagnosed by a high degree of autocorrelation in the posterior samples. MH samplers are usually tuned to acceptance rates between 0.2 and 0.5 (see e.g. Bédard 2008; Sherlock and Roberts 2009, for research on optimal acceptance rates). There exist many more MCMC methods that can be very effective, e.g. ones based on Hamiltonian Monte Carlo. See Wolf et al. (2018) for a review of sampling methods in a spatial econometric context. Markov chain Monte Carlo simulation methods and increasing computational power play an important role in the prominence of Bayesian inference. They enable inference in countless settings that are analytically intractable and can readily be combined for modern flexible Bayesian modelling.

4 Software

The flexibility of Bayesian modelling can be a challenge for clean and efficient software implementations. On the one hand, general-purpose software may be inefficient in certain important cases. Adapting to corner cases leads to convoluted routines that introduce inefficiencies and are hard to maintain properly. On the other hand, more focused implementations are of limited use; necessary adaptations may be error-prone, inefficient to create, and hard to maintain.

In the following, I present a software architecture that bridges this gap. This software architecture builds on an object-oriented system with a layered interface. I provide an implementation in the bsreg (Kuschnig 2021) R (R Core Team 2021) package. The package follows this layered design, with an object-oriented base system that is accessed via an internal programming interface, and an idiomatic R user interface. This structure makes bsreg efficient and maintainable code-wise, flexible and extensible for advanced modelling, and accessible to users.

In this section, I introduce the software architecture and design philosophy behind it. Then, I discuss the technical implementation of bsreg to illustrate the software architecture. Finally, I demonstrate the use of the package. In the next section, I provide an application to real-world data, where I reproduce and extend an analysis of cigarette demand.

4.1 Design philosophy

The MCMC methods of interest to us can be understood as a machine with a state—the parameters—and a set of rules to update this state—the sampling steps. To instantiate this machine, some inputs are necessary—namely the data and prior settings, i.e. the immutable parts of the model, and an initial state. See Fig. 1 for a visualisation of the concept. We can use this idealised sampler to construct a prototype in an object-oriented framework. This framework allows us to extend the prototype with additional functionality where needed.

Fig. 1: An idealised Markov chain Monte Carlo sampler

The linear model in Eq. (7) with an independent Normal Inverse-Gamma prior (Eq. (10)) is a sensible starting point for conceptualising this approach. First, we require a method to instantiate our object. The necessary inputs are the priors—\(p(\beta )\) and \(p(\sigma ^2)\)—the data—\({\mathbf {y}}\) and \({\mathbf {X}}\)—and an initial state. Second, we need a method for updating the state. For our model setup, we can employ the Gibbs sampling algorithm described in Sect. 3. Third, we need to be able to access the current state.
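A stripped-down sketch of such a prototype, written with R6, could look as follows; it mirrors the instantiate, setup, and sample pattern together with the Gibbs updates of Algorithm 1, not the actual bsreg class hierarchy.

    library(R6)

    SamplerSketch <- R6Class("SamplerSketch",
      public = list(
        priors = NULL, y = NULL, X = NULL, beta = NULL, sigma2 = 1,
        # Instantiation fixes the immutable prior settings
        initialize = function(b0 = 0, B0 = 1e4, c0 = 2, C0 = 1) {
          self$priors <- list(b0 = b0, B0 = B0, c0 = c0, C0 = C0)
        },
        # Setup provides the data and an initial state
        setup = function(y, X) {
          self$y <- y; self$X <- X
          self$beta <- qr.solve(X, y)            # least-squares starting value
          invisible(self)
        },
        # One sweep of the Gibbs updates; the state is accessible via the fields
        sample = function() {
          p <- self$priors; K <- ncol(self$X); N <- length(self$y)
          e <- self$y - self$X %*% self$beta
          self$sigma2 <- 1 / rgamma(1, p$c0 + N / 2, p$C0 + drop(crossprod(e)) / 2)
          V <- chol2inv(chol(diag(1 / p$B0, K) + crossprod(self$X) / self$sigma2))
          m <- V %*% (rep(p$b0 / p$B0, K) + crossprod(self$X, self$y) / self$sigma2)
          self$beta <- drop(m + t(chol(V)) %*% rnorm(K))
          invisible(self)
        }
      )
    )

An instance would be created with ‘SamplerSketch$new()’, supplied with data via ‘setup()’, and advanced repeatedly with ‘sample()’ while reading off the state.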

Departing from this prototype, we can formulate a number of relevant setups. A simple example is an adaptation to the conjugate variant of the prior (Eq. (9)). The conjugacy implies that we can draw independent samples directly from the posterior. This means that we no longer require an initial state and that the update step can be simplified. Another simple example is the SLX model. It can be accommodated implicitly, where regressors already include their spatial lag, or explicitly, via an extra input argument for the connectivity matrix \({\mathbf {W}}\) that is used to lag regressors during instantiation.

The prototype can also be extended in more elaborate ways, for example to the parameterised SLX(\(\delta\)) model, the SAR model, or the SEM. For these models, the pattern of extension essentially remains the same. First, they require additional inputs for their respective parameters and connectivities. Previous sampling steps can be left unchanged, conditional on the new parameters (cf. the alternative SAR and SEM formulations in Eqs. (4) and (6)). The new parameters can be updated using the MH algorithm, which itself could be expressed as an instance of an object-oriented MH class. These building blocks already allow us to express all common spatial models, with compound models like the SDM expressed via layered inheritance.

4.2 Technical implementation

The implementation of bsreg follows the object-oriented structure outlined above. For the object-oriented system, I rely on the third-party R6 package (Chang 2021) over more idiomatic, native systems. This type of object-oriented system is somewhat alien to R, but the layered interface allows the package to remain idiomatic for practical purposes. The top-level user interface builds on base formats and the standard S3 system. This allows for customary usage patterns that are linked to the R6 system via an intermediate programming interface.

The implementation starts with an abstract ‘Base’ class that provides a skeleton for successors. Instantiating this class is done in a layered way. First, the ‘initialize()’ method sets up meta information and evaluates related inheritor methods that provide priors and other settings. Next, the ‘setup()’ method takes in observed data and evaluates related inheritor methods. These may take in observed quantities and additional settings, and set starting values. This layered structure allows for the correct ordering of steps required for full instantiation. Then, update steps are summarised in a ‘sample()’ method that calls individual sampling steps and updates meta information using a ‘finalize()’ method. This base class also provides fields for accessing and storing the data, cache, meta information, and potential Metropolis–Hastings samplers.

The first functional inheritor is the ‘NormalGamma’ class, which implements the linear model with an independent Normal Inverse-Gamma prior. Its ‘initialize()’ method has arguments for the prior mean and precision of \(\beta\), as well as the shape and rate of \(\sigma ^2\). In its ‘setup()’ method, prior parameters are adapted to fit the model dimensions and the known posterior shape is computed. Starting values can be supplied as additional arguments; otherwise, they are determined using least squares. The sampling steps for \(\beta\) and \(\sigma ^2\) follow the Gibbs sampling steps outlined in Sect. 3. The class also contains methods to access the parameters, residuals, and settings in their respective fields. The ‘NormalGamma’ class serves as a prototype for all other classes. Among them, the ‘ConjugateNormalGamma’ class is a straightforward inheritor. Since its posterior parameters are known, the sampling steps can be adapted accordingly, and independent draws of \(\beta\) and \(\sigma ^2\) can be obtained directly from their posteriors.

A spatial inheritor is the ‘SpatialLX’ class. It accommodates both the standard SLX and the SLX(\(\delta\)) model, depending on the connectivity and settings provided at instantiation. For the standard variant, the ‘setup()’ method requires only a connectivity matrix (or function and parameter) and optionally an indicator for regressors to lag spatially. Regressors and other cached values are modified accordingly, and the sampling proceeds as before. The more flexible SLX(\(\delta\)) variant, which considers uncertainty around the connectivity, makes some additional adaptations. First, settings and priors for the free \(\delta\) parameter must be provided to the ‘initialize()’ method. At this stage, functions for updating the parameter and the cached values that depend on it, as well as an object for the MH sampling step, are created internally. These are then fully set up in the ‘setup()’ method, where observed quantities are provided. The sampling step is extended with an MH step for the parameter \(\delta\). By using an object-oriented system for MH samplers, this can be achieved with very little code.

The next spatial inheritors accommodate the SEM and SAR model. The respective classes are ‘SpatialEM’ and ‘SpatialAR’. Both have ‘initialize()’ methods that set priors and options and create internal functions for updating parameters and dependent quantities, as well as the MH sampler. The ‘setup()’ methods are used to provide connectivities and set up the filtered quantities and the MH sampler. Note that updating the spatial parameter \(\lambda\) affects other steps via the filtered dependent and/or explanatory variables. In my implementation, this works by redirecting access methods to the filtered quantities. Computation of the (log) determinant in the marginal likelihood of \(\lambda\) is facilitated by (a) a spectral decomposition of the connectivity, or (b) a grid-based approximation using LU decompositions. Additional options for panel settings, where connectivities are repeated in a block-diagonal structure, are available. This aspect of the sampler can be extended in several ways (see Bivand et al. 2013).

These are the core blocks that form a suitable foundation for estimating standard spatial econometric models. The linear classes and spatial inheritors can be used in a straightforward way. The composite SDM or SDEM can be created by adjusting the parent class of ‘SpatialAR’ or ‘SpatialEM’ to ‘SpatialLX’. At the moment, layered adjustment of filtered quantities, as used for the SAC model, is not supported. To link this object-oriented core to the idiomatic and accessible user interface, I use an intermediate programming interface. This interface provides constructor functions that allow classes to be created dynamically, enabling composite models or different base classes. Additional functions help with the instantiation, perform robustness checks, and provide sensible default values. Finally, helper functions for obtaining and working with a number of samples from the posterior facilitate Bayesian inference.

4.3 Usage and the interface

Useful software relies on accessibility, i.e. a user interface that simplifies interaction with the underlying code. The bsreg package achieves this with an idiomatic R user interface on top of the programming interface and object-oriented structure. This top-level interface focuses on facilitating posterior inference and uses the established formula system and basic S3 methods. The outputs come in a standard format that allows users to stick with familiar procedures for analysis. There are dedicated functions in the style of ‘lm()’ that wrap spatial models of interest and perform the instantiation, setup, and sampling in a single call. In this way, bsreg enables near effortless Bayesian inference for spatial econometric models.

The interface builds on the ‘bm()’ function for Bayesian models and uses wrappers for specific specifications, like the classical linear, SLX, SLX(\(\delta\)), and SAR models, as well as the SEM, SDM, and SDEM. These use the customary R formula interface, similarly to ‘lm()’ or the spatialreg package.

A linear model with default settings can be estimated as follows.

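The sketch below illustrates such a call, assuming that ‘bm()’ accepts the formula interface directly and that the packaged ‘cigarettes’ data contain columns named ‘sales’, ‘price’, and ‘income’; the actual names may differ, and fixed effects are omitted for brevity.

    library("bsreg")

    # Linear model with default priors and settings; column names are assumed
    x <- bm(log(sales) ~ log(price) + log(income), data = cigarettes,
      n_save = 1000L, n_burn = 500L, verbose = TRUE)
    print(x)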

The first argument is a formula with a symbolic description of the model. The variables are obtained from the environment or are provided via the ‘data’ argument; in this case, we use the packaged ‘cigarettes’ dataset by Baltagi and Li (2004) (see the next section for more information). Other arguments include the number of posterior draws to save (‘n_save’) and discard (‘n_burn’), and ‘verbose’ to control console output. The argument ‘options’ and related helper functions—most importantly ‘set_options()’—can be used to adjust further settings, including model priors.

The default values of ‘set_options()’ amount to relatively uninformative priors and generally applicable settings. However, priors and other settings are best chosen specifically for the problem at hand. Below, we adjust the prior settings of the linear model above to use a conjugate prior with more informative parameters. For this, we use the prior-specific argument ‘NG’ and the helper function ‘set_NG()’.

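A sketch of this adjustment, where the composition via ‘set_options()’ and the argument names of ‘set_NG()’ are assumptions made for illustration:

    # Conjugate Normal Inverse-Gamma prior with more informative values;
    # the argument names of set_NG() are assumed here
    x_conj <- bm(log(sales) ~ log(price) + log(income), data = cigarettes,
      options = set_options(NG = set_NG(mu = 0, precision = 1, shape = 3, rate = 2)))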

Spatial models additionally rely on the ‘W’ argument for the connectivity, and ones with spatially lagged explanatories have an optional argument to specify variables to lag. With a suitable connectivity matrix \({\mathbf {W}}\), e.g. based on the distance between observations, we can estimate spatial models, like the SDM. See the appendix for how to construct the connectivity matrix below from the packaged ‘us_states’ dataset. Below, we estimate an SDM, which we adjust to use a Uniform prior (equivalent to the \(\mathrm {Beta}(1, 1)\) prior) on the spatial autoregressive parameter \(\lambda\) using the argument ‘SAR’ and helper function ‘set_SAR()’. In addition, we increase the length of the burn-in and the number of saved posterior draws.

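A sketch of the SDM call, where the wrapper name (‘bsdm()’) and the arguments of ‘set_SAR()’ are assumptions, and ‘W’ is the connectivity matrix constructed in the appendix:

    # Spatial Durbin model with a Uniform, i.e. Beta(1, 1), prior on lambda;
    # the wrapper name and the set_SAR() argument names are assumed here
    x_sdm <- bsdm(log(sales) ~ log(price) + log(income), data = cigarettes, W = W,
      n_save = 5000L, n_burn = 2500L,
      options = set_options(SAR = set_SAR(shape_a = 1, shape_b = 1)))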

From this call, we get an S3 object that supports the standard ‘print()’, ‘plot()’, and ‘summary()’ methods. These can be used for quick analysis but, more importantly, the object can be used directly. It essentially consists of two elements—a matrix with saved posterior draws and the underlying R6 model object that was used to obtain these draws. We can use the matrix in custom or third-party procedures; interfacing to packages like coda (Plummer et al. 2006) is straightforward. The model object preserves the state of the sampler, making it easy to obtain additional posterior draws. This facilitates interactive work and is done by simply using our object in a new call to the estimation function or the generic ‘bm()’ function.

After estimation, we would typically check convergence of the sampler. For this, we can call ‘plot(x)’, which offers basic trace plots. Alternatively, the coda interface provides advanced diagnostic statistics and plots and is accessible via an ‘as.mcmc(x)’ method. If we are satisfied with our samples, we can turn to analysis. By calling ‘print(x)’ we get posterior parameter means, similarly to ‘lm()’. More detailed summary measures, such as various types of credible intervals, can be computed directly (using the matrix slot ‘x[[1]]’), or e.g. with coda (‘coda::HPDinterval()’). Posteriors can also be visualised using base functionality, or the output can be adapted to specialised packages like ggplot2 (Wickham 2016) and ggdist (Kay 2021). If, instead, we find that the number of draws is not sufficient (e.g. via ‘coda::effectiveSize()’), we can get additional samples by simply calling ‘bm(x)’.
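Put together, a post-estimation workflow along these lines might look as follows (method and slot names as described above):

    plot(x_sdm)                          # basic trace plots for convergence checks
    mc <- coda::as.mcmc(x_sdm)           # hand the posterior draws to coda
    coda::effectiveSize(mc)              # effective number of draws
    coda::HPDinterval(mc, prob = 0.99)   # highest posterior density intervals
    print(x_sdm)                         # posterior means, similar to lm()
    draws <- x_sdm[[1]]                  # raw matrix of posterior draws
    x_sdm <- bm(x_sdm)                   # continue sampling from the stored state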

As can be seen, bsreg features an easy-to-use interface for applied work. It builds on an extensible and flexible object-oriented system and programming interface. The package, more extensive information on its use, and up-to-date documentation are available from the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/package=bsreg or from the repository at https://github.com/nk027/bsreg.

5 Cigarette demand model

In this section, I use bsreg to estimate various specifications of the demand model for cigarettes in the continental United States (US) by Baltagi and Li (2004). With this application, I follow Halleck Vega and Elhorst (2015) and focus on specifics of the Bayesian approach. The cigarette demand model is

$$\begin{aligned} \ln C_{it} = \alpha + \beta _{price} \ln P_{it} + \beta _{inc} \ln I_{it} + \mu _i + \phi _t + \varepsilon _{it}, \end{aligned}$$
(13)

where subscripts indicate a subset of US states and Washington, D.C. (\(i = 1, \dots , 46\)) and time periods (\(t = 1, \dots , 30\)). The dependent variable C is real per capita cigarette sales; the explanatory variables are the average price of cigarettes (P) and real per capita disposable income (I). The parameters \(\mu _i\) and \(\phi _t\) are region- and time-fixed effects; errors \(\varepsilon _{it}\) are assumed to follow a multivariate Normal distribution with mean zero and spherical variance-covariance \(\sigma ^2 {\mathbf {I}}\).

I consider the linear model (LM) in Eq. (13) and the five spatial econometric models currently supported by bsreg. For the spatial lags, I follow Halleck Vega and Elhorst (2015), who use connectivity matrices based on (1) binary contiguity (row-stochastic), and (2) inverse-distance decay (scaled with the maximum eigenvalue). Connectivities are the same for every year. I estimate these models using bsreg as a demonstration of the package and Bayesian methods; for more information on the data, actual application, and additional background see Baltagi and Li (2004), Kelejian and Piras (2014), and Halleck Vega and Elhorst (2015).
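For reference, the inverse-distance decay connectivity scaled by its maximum eigenvalue can be built along the following lines; the coordinate matrix of state centroids (called ‘coords’ here, e.g. derived from the packaged ‘us_states’ data) is an assumed input.

    # Inverse-distance decay connectivity, scaled by the maximum eigenvalue
    build_w <- function(coords, delta = 1) {
      d <- as.matrix(dist(coords))       # pairwise distances between centroids
      psi <- d^(-delta); diag(psi) <- 0  # zero diagonal, no self-neighbours
      psi / max(Re(eigen(psi, only.values = TRUE)$values))
    }

    # The binary contiguity matrix is instead scaled by its row sums
    row_stochastic <- function(w) w / rowSums(w)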

Estimation results for the linear model and spatial models using contiguity-based connectivity are presented in Table 1. The posterior point estimates are close to the frequentist ones obtained by Halleck Vega and Elhorst (2015), as can be seen in Fig. 2. Spatial dependence plays an important role—spatial lags are significant in all cases. Estimates of \(\beta _{price}\) and \(\beta _{inc}\) are relatively stable throughout, but omit important details related to spillover effects. In order to adequately compare and interpret the models, we need to consider partial effects of the variables, or summary statistics thereof. As discussed in Sect. 2, the partial effects of models with spatial lag are (a) not directly given by the coefficients, and (b) carry additional locational information (see LeSage and Pace 2009). It is helpful to work with (average) total effects, which can be further divided into direct (within a location itself) and indirect (impacting neighbouring locations) effects. The average total effects of price and income are visualised in Fig. 2. In the plot, we can clearly discern the impact of omitting spatial spillover effects.

Table 1 Bayesian estimates of the cigarette demand model in linear and spatial specifications

Following the Bayesian paradigm, summary statistics of parameters are easy to compute, and uncertainty measurements are straightforward to obtain. To calculate, e.g., the average total effects, we simply compute this effect for every draw from the posterior, or a subset thereof. We obtain a posterior density of the effect in question that reflects all available information on it and uncertainty around it. We can compute the mean, standard error, or credible intervals of this density. The workings of Bayesian uncertainty can be seen in Fig. 2. The linear model (orange, left) entails the assumption of no spatial dependence, omitting all uncertainty that concerns spatial parameters and yielding a narrow posterior. The SLX model (yellow, centre-right) introduces local spillover effects (with given connectivity), which are reflected in the location and scale of the posterior. The SDM (blue, right) additionally considers global spillover effects that act as a filter for the dependent. For it, and the related SAR model, we see clear benefits of obtaining the full posterior—total effects are non-Gaussian (also see the appendix for quantile–quantile plots). Uncertainty around these non-linear models is characterised by heavy tails—they are not adequately summarised by their means and standard errors.

Fig. 2: Posterior densities of the average total effects of price and income by model. Bayesian posteriors are indicated by a boxplot and dotplot; frequentist point estimates are indicated by a cross on the boxplot.

One layer of uncertainty that we have ignored so far concerns the structure of spatial connectivity. By fixing \({\mathbf {W}}\) to a contiguity-based structure, we impose the type of connectivity and implicitly assume perfect knowledge of it. In practice, researchers often check the robustness of results against different connectivity structures. However, this is usually done in an ad-hoc manner and—most of the time—only a few, related types of connectivity are considered (Plümper and Neumayer 2010; Neumayer and Plümper 2016), which are known to have limited impact on results (see LeSage and Pace 2014). More formal approaches to uncertainty around the spatial connectivity structure are Bayesian model averaging (see e.g. Debarsy and LeSage 2020; Krisztin and Piribauer 2021) or the parameterisation of connectivity, e.g. in the SLX(\(\delta\)) model in Eq. (11). This model treats the connectivity structure as a function with parameters to estimate. One useful functional form is that of inverse-distance decay, with one parameter \(\delta\) controlling the speed of decay (see Sect. 3). With this in mind, I extend the estimation results of the SLX model in Table 1 to four variants with different inverse-distance decay connectivities in Table 2.

Table 2 Bayesian estimates of the cigarette demand model for SLX specifications with varying connectivity. The reported values are posterior means, \({\hat{\mu }}\), and credible intervals, CI (in square brackets), covering 99% of the posterior

The inverse-distance decay connectivity of the SLX model in Table 2 mirrors the results of Halleck Vega and Elhorst (2015) and paints a different picture than the specification using contiguity-based connectivity. Differences in local spillovers are pronounced—indirect effects of the price variable switch sign, while indirect effects of the income variable increase considerably in size. As can be seen in Fig. 3, these differences have a large impact on the total effects of these variables. Total effects of prices are lower, with a lower direct effect that is additionally offset by positive local spillovers. With inverse-distance based connectivity, the total effects of income are larger than before. To the keen-eyed reader, and considering the findings of Halleck Vega and Elhorst (2015), this may seem paradoxical—estimates of \(\beta _{inc}\) are comparable while \(\theta _{inc}\) is considerably more negative.

However, the large values of \(\theta\) in the inverse-distance decay specification can be misleading due to the scaling applied. Partial derivatives of the SLX model are not always as trivial as they may seem. The average direct effect of an explanatory variable k is given by \(\beta _k\). The average indirect effect, i.e. the average impact on other units \(j \ne i\), is generally more complicated to obtain. It is given by \(N^{-1} \sum _i \sum _j w_{i j} \theta _k\). For a row-stochastic connectivity matrix \({\mathbf {W}}\), as used for Table 1, this value is equivalent to \(\theta _k\). For connectivity that is scaled in some other way, e.g. using the maximum eigenvalue, this is not the case and the average indirect effects need to be computed explicitly. In Table 2, I additionally report \(s_\theta\), the appropriate scaling factors for obtaining average indirect effects from the reported estimates of \(\theta\). For the SLX(3) specification, estimates are inflated by a factor of seven. Scaled estimates that can be compared to the contiguity specification are \(\theta _{price} = 0.036\) and \(\theta _{inc} = -0.115\), suggesting less pronounced differences between the specifications.

Fig. 3: Posterior densities of the average total effects of price and income by connectivity specification. Note that due to scaling, effects do not differ as starkly as implied by the parameters in Table 2.

With Table 2 we have five different SLX specifications with different implications—which one should we use? A standard approach would be model selection based on information criteria, like the Schwarz information criterion (SIC). For the four classical SLX models, we arrive at SIC values of −2758 (contiguity-based), −3004 (\(\delta = 2\)), −3047 (\(\delta = 3\)), and −3037 (\(\delta = 4\)). These values heavily favour the inverse-distance decay specifications, and among them the SLX(3) specification. In fact, Bayesian model averaging using the SIC approximation for the marginal likelihood would place more than 99% of the weight on the SLX(3) variant and an essentially zero weight (less than \(1\mathrm {e}{-60}\)) on the contiguity variant. However, the exact parameterisation of the three inverse-distance decay specifications compared is essentially arbitrary. The SLX(\(\delta\)) model allows us to accurately reflect uncertainty around this parameterisation. For this flexibility, we require a prior and need to closely monitor samples of the parameter for convergence and mixing behaviour.

In the SLX(\(\delta\)) model, I use an inverse-gamma prior with \(d_0 = D_0 = 2\) (see Eq. (12)), which has support \((0, \infty )\), a mean of 2, and places 99% of the prior weight on the interval (0.27, 19.22). See the appendix for trace and density plots indicating good mixing and proper convergence of the sampler. The posterior indicates that connectivity decays rather quickly, with \(\delta\) close to 3. This means that very close neighbours carry comparatively high weight (see the appendix for a visualisation of connectivity strengths). Compared to the contiguity specification, most of the connectivity is concentrated among a few states, while connectivity at medium and long distances remains relevant. This implies that the SLX(2) specification picks up effects at larger distances, which may not be relevant to the issue at hand. Meanwhile, the SLX(4) specification is focused on extremely close neighbours, potentially omitting the effects of mid-range neighbours. Regarding results, we find that the SLX(2) model overestimates average indirect effects and the SLX(4) model underestimates them when compared to the SLX(3) model of choice.

Focusing on the preferred fixed SLX(3) and flexible SLX(\(\delta\)) specifications, we find the aforementioned uncertainty around the connectivity specification at work in Table 2. The former pinpoints the maximum likelihood value of \(\delta\), while the latter considers all possible values of \(\delta\), reflecting the associated uncertainty in all posterior densities. The 99% credible interval of \(\delta\) covers (2.54, 3.74); notably, the previously considered alternatives of \(\delta \in \{2, 4\}\) fall well outside this range. This uncertainty surrounding \(\delta\) is reflected in other posterior densities, such as the indirect price effect parameter \(\theta _{price}\). With the (unscaled) 99% credible interval covering [0.022, 0.507], the effect could be interpreted as barely significant. Meanwhile, the SLX(3) specification that remains ignorant of uncertainty around \(\delta\) yields a much narrower (unscaled) 99% interval of [0.116, 0.402]. Moreover, posterior densities in Fig. 3 again imply that the posterior effects of the flexible SLX(\(\delta\)) model are non-Gaussian. The Bayesian paradigm offers a suitable tool for dealing with these inherent uncertainties.

6 Concluding remarks

In this paper, I presented a software architecture for Bayesian modelling in the spatial econometric domain. I implemented this object-oriented system with a layered interface for users and programmers in the bsreg package. The result is flexible, easy-to-use software that can readily be extended. For the purpose of demonstration, I used the package to reproduce an analysis of cigarette demand in the United States. I demonstrated the flexibility of Bayesian methods and their suitability for acknowledging and accommodating uncertainty in non-linear spatial econometric models. I documented continued issues with the interpretation of spatial models—highlighting the need for tried and tested software.

Departing from this paper, much future work on Bayesian spatial software and Bayesian spatial methods remains to be done. Regarding software, larger-scale efforts are required—to extend bsreg or replace it. Potential features include (1) a more optimised object-oriented core system, (2) extensions to more models and settings, such as large panel settings, (3) more focus on exposing the programming interface and tighter integration into the existing ecosystem, and (4) a more comprehensive and well-connected user interface, with more methods for analysis and links to existing packages. Alternatively, better integration of spatial econometric methods in existing probabilistic languages could be pursued. Method-wise, the uncertainties of spatial econometric models need to be addressed—both in terms of more flexible models and in terms of communicating and presenting results. The summary measures of average direct and indirect effects have raised the bar on spatial econometrics—perhaps posterior densities can help find its range.