1 Introduction

Modeling and forecasting criminal activities in urban areas is an important field of study in a range of social science disciplines like, e.g., economics, criminology, and geography. The results of which are increasingly used for public policy purposes. The steadily growing access to high-quality spatio-temporal data sets with a high level of spatial detail allows uncovering interesting relationships within and between small regional units, like, e.g., census tracts. Variables of interest often studied in this context are in the form of counts (Taddy 2010; Aldor-Noiman et al. 2016; Weinborn et al. 2017). In particular, data coherent forecasting of such crime counts has to take the integer and non-negative nature of the data into account. Most commonly, the observed counts are transformed into continuous data and well-established linear spatial panel models are employed (see e.g. Jin et al. 2020, for a recent contribution in this respect). Spatial panel data models that meet the criterion of coherency are relatively sparse, in particular in the econometrics literature.

A prominent and recent contribution in this area is Liesenfeld et al. (2017), who propose a Poisson random effects panel framework with a latent Gaussian spatio-temporal state process for the number of severe crimes. Inference for the model is based on a (high dimensional) likelihood function, which is not available in closed form and thus requires a Monte Carlo integration technique, like the efficient importance sampling (EIS) proposed by Liesenfeld et al. (2016). Their model is applied to monthly data for the 138 census tracts of Pittsburgh, PA, available from January 2008 to December 2013.

Complementary to this work, we propose a static and a dynamic version of a fixed effects spatial panel data model for counts and demonstrate how these models can be used for predictive purposes. The two model variants are fitted using pseudo maximum likelihood (PMLE) and generalized methods of moments (GMM), respectively. In comparing and contrasting our approach to that of Liesenfeld et al. (2017), we illustrate our approaches using the same data set on severe crime counts in Pittsburgh, PA.

Our modeling framework extends the standard Poisson fixed effects panel count data model, which is the workhorse model for longitudinal counts (see, for example, Cameron and Trivedi 2013, Ch. 9). As in the popular spatial autoregressive (SAR) model for continuous data, we include the spatially lagged counts into the Poisson conditional mean, which allows us to estimate spatial spillover effects along with other parameters of interest. We thereby provide a class of spatio-temporal panel count models for which the regressors and entity fixed effects do not necessarily have to be uncorrelated. Moreover, as pointed out by Elhorst (2014, p. 55) and Cameron and Trivedi (2013, Ch. 9) fixed effect specifications in spatial panel models have the conceptual advantage over random effects models in confining the analysis to explain the sample at hand, which is precisely the aim in our application.

Forecasting criminal activity in neighborhoods within a given specific city limit based on crime counts by using the spatial and temporal correlation of the data constitutes a relevant and critical focus in the literature. An excellent and up-to-date survey of the recent work done in this area is provided by Kounadi et al. (2020). Also, the paper by Liesenfeld et al. (2017) is a rich source of highly relevant papers in this context.

To foreshadow our main findings, we qualitatively confirm the results reported in Liesenfeld et al. (2017). We obtain out-of-sample forecasting results and find that our fixed effects model performs competitively to theirs. This is a striking result considering the fact that the random effects model can incorporate socio-demographic indicators in yearly frequency, which are often good predictors for future crimes. Instead, these effects are absorbed by our entity and time fixed effects.

The paper’s remainder is organized as follows: Section 2 introduces the proposed models and their respective estimation frameworks. Section 3 presents and discussed the results of our Monte Carlo simulation experiments. Our results of the empirical crime forecasting application are provided in Sect. 4 while Sect. 5 concludes.

2 Spatial panel count data models

The econometric framework proposed here is quite general, but takes into account the main features of the crime counts to be found in the empirical application below as documented in Chapter 2 of Liesenfeld et al. (2017).

We start with the static Poisson spatial panel model, which is a straightforward extension of the spatial autoregressive model proposed in Glaser and Jung (2021) to accommodate longitudinal data. It uses multiplicative fixed effects to account for individual heterogeneity. This model allows for incorporating both a contemporaneous and a lagged spatial term in the conditional mean function. We derive an approximate loglikelihood function using the concept of composite likelihood theory and obtain estimates for all relevant parameters of interest using standard numerical optimization procedures. Moreover, a predictive distribution based on the underlying Poisson distribution is readily available to form density forecasts of the counts.

Since crime counts often exhibit substantial autocorrelation (Li et al. 2014; Liesenfeld et al. 2017), we attempt to capture this dependence structure by including time dynamic effects into the spatial panel model and hope to improve the forecasting performance of the model. More precisely, we include the lagged outcome in the conditional mean equation. This approach extends the well-known linear feedback model of Blundell et al. (2002) to the spatial dimension. Since the spatial panel linear feedback model exhibits a dynamic panel structure and assumes entity fixed effects, we have to rely on quasi-differenced GMM to estimate the coefficients consistently (Blundell et al. 2002; Windmeijer 2005, 2008). The price to be paid for the richer dynamic structure is that only point forecasts based on this model are available.

2.1 A Poisson spatial panel model

2.1.1 Model specification, estimation, and inference

We start with a static predictive model for spatial count data which allows us to set up an approximate likelihood function for inferential purposes and obtain probabilistic forecasts. To achieve this goal, we choose a specification for the conditional mean function and the individual or entity-specific effect, which ultimately leads to the well-known form of the likelihood function in non-spatial panel models for counts as proposed in Hausman et al. (1984).

The general model specification is as follows: denote the discrete count variable to be explained with \(y_{it}\), \(i=1,\dots ,N\) and \(t=1,\dots ,T\) and assume it follows a (conditional) Poisson distribution as it is standard in count data regression models. For the regression structure of the Poisson spatial panel model with multiplicative entity-specific effects \(\nu _{i}\), we propose to use

$$\begin{aligned} E[y_{it}|\mathcal{F}_{it},\nu _i] = \mu _{it}\nu _i, \end{aligned}$$
(1)

where the conditioning information set \(\mathcal{F}_{it}\) contains appropriately defined vectors of the counts and additional predetermined regressors to be defined below in more detail.

To close the model, \(\mu _{it}\) needs to be specified. The literature on count data regression and time series models for counts has developed a range of specifications to balance out modelling issues like avoiding explosive behaviour of the counts over time, problems with logarithmic transformations of zero counts and useful interpretations for the model parameters while preserving the non-negativity of the entire regression function. All of these have advantages and disadvantages while none of them has emerged as a standard (see the discussion in Glaser and Jung 2021). The specification proposed here is a straightforward extension of the regression framework used in Glaser and Jung (2021) for cross-section spatial counts and is also influenced by the so-called linear feedback model of Blundell et al. (2002) for non-spatial count data panels. We experimented with alternative specifications but found them to be inferior in our simulation experiments and empirical application.

The base variant for \(\mu _{it}\) takes the form

$$\begin{aligned} \mu _{it}=\rho \sum_{\begin{array}{c} j=1\\ j\ne i \end{array}}^{N} w_{ij} y_{jt} + \exp (X_{it.} \beta ) \; , \end{aligned}$$
(2)

where \(X_{it.}=[x_{it1},\dots ,x_{itK}]\) is the row vector of explanatory variables and \(\beta\) is the corresponding column vector of regression parameters. \(w_{ij}\) is an element of the \(N \times N\) time-invariant spatial weights matrix W, which is assumed to be row-standardized. In the spirit of the spatial autoregressive (SAR) model for continuous data, the first term on the right-hand side of Equation (2) captures possible spatial autocorrelations in the observed counts and therefore, \(\rho\) is called a (global) spatial autocorrelation parameter. The regressors enter the conditional mean function in log-linear form as it is custom in count data regression models. Accordingly, the conditioning information set for this model specification takes the form \(\mathcal{F}_{it} = \{X_{it}, y_{-it} \}\), where \(y_{-it}\) is the vector of observations of all neighbors to unit i in period t. With this specification, we implicitly define the nonlinear regression model

$$\begin{aligned} y_{it} = \mu _{it}\nu _i + u_{it} \end{aligned}$$
(3)

for the observed counts, where the error term \(u_{it}\) is assumed to be uncorrelated with \(y_{jt}\) for \(i \ne j\). Consequently, our model belongs to the broad class of conditional spatial regression models as discussed e.g. in Cressie (1993, p. 408), which enables us to use a conditional likelihood function to estimate the model parameters consistently. Note further, that (3) does not allow for direct interactions as in a linear SAR model for continuous data.

Still, within the static modeling framework, we consider an extension to the base variant specification, which allows for dependence between the observation \(y_{it}\) and the neighboring observations in the previous time period \(t-1\). This dependence is in addition to spatial effects in the cross-section of the data and is captured in the parameter \(\lambda\) as follows

$$\begin{aligned} \mu _{it} = \rho \sum _{\begin{array}{c} j=1\\ j\ne i \end{array}}^{N} w_{ij} y_{jt} +\lambda \sum _{\begin{array}{c} j=1\\ j\ne i \end{array}}^{N} w_{ij} y_{jt-1} + \exp (X_{it.} \beta ), \end{aligned}$$
(4)

where the conditioning information set now is of the form \(\mathcal{F}_{it} = \{X_{it}, y_{-it},y_{-it-1} \}\).

To preserve the non-negativity of the conditional mean (1), restrictions on the parameter spaces of \(\rho\) and, if included, \(\lambda\) have to be imposed. Naturally, these restrictions depend on the exact values of W, X, and \(\beta\) and would have to be checked individually for each application. In principle, \(\rho\) and \(\lambda\) are allowed to be negative as long as the estimated conditional expectation is positive for each unit. Alternatively, the conditions \(\rho , \lambda \ge 0\) can be set independently of the data, although the parameter space might then be restricted more strictly than necessary.Footnote 1

As discussed in Glaser and Jung (2021), the derivation of a full likelihood function based on the multiplicative structure (1) with conditional mean specification (2) or (4) is not possible due to their nonlinear form. Moreover, an operational high-dimensional multivariate count distribution is also not readily available. Therefore, as in Glaser and Jung (2021), we propose to approximate the likelihood using a pseudo-likelihood function as introduced by Besag (1975) and further developed into the composite likelihood approach of Varin et al. (2011). Within this framework, it can be shown that under suitable regularity conditions maximizing the conditional pseudo-likelihood yields consistent estimators for the parameters of interest.

As already outlined above, to cope with the unobserved entity-specific effect \(\nu _i\), we follow the proposal of Palmgren (1981) and Hausman et al. (1984) and condition the pseudo-loglikelihood on \(\nu _i\). Based on (4) and ignoring terms that do not depend on the model parameters, it is straightforward to arrive at the well-known functionFootnote 2

$$\begin{aligned} \log L_p (\rho , \lambda , \beta ) = \sum _{i=1}^{N} \sum _{t=1}^{T} \bigg [y_{it} \log (\mu _{it})-y_{it}\log \bigg ( \sum _{s=1}^{T}\mu _{is} \bigg )\bigg ]. \end{aligned}$$
(5)

Note the special form of the (conditional) pseudo loglikelihood function which is related to that of the multinomial logit model as discussed in Hausman et al. (1984). The function (5) can be maximized using standard numerical methods and it is straightforward to obtain parameter estimates and their associated standard errors, if desired. Moreover, an estimate of the entity-specific effect \(\nu _i\) can easily be obtained as follows:

$$\begin{aligned} \widehat{\nu} _i=\frac{\sum _{t=1}^{T} y_{it}}{\sum _{t=1}^{T} \widehat{\mu }_{it}} \; , \end{aligned}$$
(6)

where \(\mu _{it}\) is either of the form (2) or (4).

2.1.2 Forecasting using the Poisson spatial panel model

As the focus of this paper is on forecasting, we now derive the minimum mean squared error (MMSE) predictor for the Poisson spatial panel model. We focus on one-step-ahead forecasts, which involves an evaluation of the conditional mean function (1) at \(t=T+1\) to start with. Due to its SAR structure which involves a contemporaneous spatial lag of the observed counts, we make use of the predictors suggested in the literature on SAR models for continuous data (see e.g. Kelejian and Piras 2017, Ch. 4). To simplify the exposition, we introduce a matrix notation for the one-step-ahead MMSE predictor as follows

$$\begin{aligned} E[y_{T+1}|y_{T}, X_{T+1}, \nu ] \equiv f_{T+1|T} = \mu _{T+1|T} \circ \nu \; , \end{aligned}$$
(7)

where \(\mu _{T+1|T}\) is an \(N \times 1\) vector of conditional means, \(\nu = (\nu _1, \ldots , \nu _N)'\), \(y_{T+1} = (y_{1T+1},\ldots , y_{NT+1})'\) and \(y_T\) accordingly, \(X_{T+1}\) is a \(N \times k\) matrix, with k being the number of regressors and \(\circ\) denotes the Hadamard product. \(\mu _{T+1|T}\) can in principle be either of the form (2) or (4). For the exposition here, we employ the latter and present a forecasting approach within a model with contemporaneous and lagged spatial effects.

The conditional mean component of the MMSE predictor for the Poisson spatial panel model can then be written in matrix notation as

$$\begin{aligned} f_{T+1|T} = \Bigl [ \rho W f_{T+1|T} + \lambda W y_{T} + \exp (X_{T+1}\beta ) \Bigr ] \circ \nu \; , \end{aligned}$$
(8)

where the exponential function is applied element-wise to the vector \(X_{T+1}\beta\). This exponential can also be expressed via a power series as \(\exp (A) = \sum _{k=0}^\infty {1/k!} A^{\circ k}\), where \(A ^{\circ k}\) is a Hadamard power. Solving (8) for the MMSE predictor, we then have

$$\begin{aligned} f_{T+1|T} = (I_N - \rho W D_{\nu })^{-1} \left\{ [\lambda W y_T + \exp (X_{T+1} \beta ) ] \circ \nu \right\} \; , \end{aligned}$$
(9)

where \(I_N\) is an \(N \times N\) identity matrix and \(D_{\nu }\) is the diagonal matrix of \(\nu\). A one-step-ahead prediction of counts can be obtained by replacing the parameter (vectors) in (9) with their estimated counterparts and element-wise multiplication by the estimated entity-specific effect. As the prediction is not necessarily an integer, an appropriate rounding procedure can be applied to produce point forecasts.

Data coherent forecasts can directly be obtained from the one-step-ahead density forecast

$$\begin{aligned} \widehat{p}(y_{T+1|T}) = Pois (\hat{f}_{T+1|T} ) \; , \end{aligned}$$
(10)

and data coherent point forecasts by taking, for example, its mode. The predictive models can now be compared across different model specifications using (proper) scoring rules, like, e.g., the logarithmic, the quadratic, and the ranked probability score, and evaluated using a number of tools based on the probability integral transform (PIT) method (see e.g. Czado et al. 2009; Jung et al. 2016). Specifically, it is straightforward to obtain the (nonrandomized) PIT for the probabilistic forecasts based on the Poisson spatial panel model to assess the distributional assumption. In the application below, we discuss the results of \(\chi ^2\) goodness-of-fit tests of the null hypotheses of uniform distribution of the nonrandomized PITs (see e.g. Jung et al. 2016).

2.2 The spatial panel linear feedback model

2.2.1 Model specification, estimation, and inference

In many applications, the observed counts exhibit serial correlation. One way to accommodate this is to extend the static model specification above with time dynamic components to improve its forecasting performance. A parsimonious time dynamic specification introduces a lagged dependent variable into the regression equation (1). In the log-linear framework of count data regression models, this is, however, not straightforward to accomplish because the inclusion of lagged outcomes in the exponential function could lead to explosive series or to problems with transforming zero values (see e.g. Cameron and Trivedi 2013, Ch. 7). An elegant solution to the problem has been proposed by Blundell et al. (2002) in the context of a dynamic non-spatial panel model for counts. Their specification is called linear feedback model (LFM) and is rooted in the linear regression properties of certain integer valued autoregressive processes (INAR) for time series data (see e.g. Jung and Tremayne 2011, for a survey).

Upon this basis, we propose to specify the conditional mean of the so-called spatial panel linear feedback model with fixed effects to take the following form

$$\begin{aligned} E[y_{it}|y_{-it},y_{t-1},X_{it.},\nu _i] = \gamma y_{it-1}+\rho \sum _{\begin{array}{c} j=1\\ j\ne i \end{array}}^{N} w_{ij}y_{jt}+ \exp (X_{it.}\beta ) \nu _i. \end{aligned}$$
(11)

As in the LFM, the entity-specific effects are only multiplied with \(\exp (X_{it.}\beta )\) and can therefore be understood as a permanent scaling factor for the individual specific mean. Note that this specification represents a slight deviation from the static Poisson spatial panel model in Sect. 2.1. However, it has the conceptual advantage that the spatial panel linear feedback model reduces to the well-known LFM if \(\rho = 0\).

As in the INAR framework, the autocorrelation parameter \(\gamma\) needs to fulfill \(\gamma \ge 0\) to ensure non-negativity of the conditional expectation (11). In addition, \(\rho \ge 0\) must be assumed as in the Poisson spatial panel model specification above. Note that unlike in the static specification, we do not include a lagged spatial term in our spatio-temporal model because the inclusion of such terms leads to identification problems (see, e.g., Anselin et al. 2008; Elhorst 2010, for a more detailed discussion).

Aside from the restrictions on the autocorrelation and spatial correlation parameters that guarantee a positive conditional expectation, the parameters also need to fulfill conditions to warrant stationarity. Stationarity conditions for miscellaneous dynamic spatial (panel) models are summarized in Elhorst (2012). Relevant for this model are the following conditions, derived in Parent and LeSage (2011) and Elhorst (2012), which imply that there is a trade-off between the autocorrelation and spatial correlation to maintain a stable model (Elhorst 2008):

$$\begin{aligned} \rho \, w_{max} - 1< \gamma&< 1 - \rho \, w_{max} \qquad \text {if} \ \rho \ge 0 \nonumber \\ \rho \, w_{min} - 1< \gamma&< 1 - \rho \, w_{min} \qquad \text {if} \ \rho < 0, \end{aligned}$$
(12)

where \(\omega _{min}\) and \(\omega _{max}\) denote the smallest and largest characteristic root of the spatial weights matrix W. The largest characteristic root of row-standardized spatial weights matrices is unity by definition.

Estimation of the spatial panel linear feedback model can be based on quasi-differenced GMM (qdGMM). In the non-spatial framework, such a two-step GMM estimator is asymptotically efficient and consistent (Blundell et al. 2002; Windmeijer 2005). In order to apply the GMM method, the fixed effects have to be eliminated from the conditional mean (11) for which in principle both the Chamberlain (1992) and Wooldridge (1997) quasi-differencing transformations of the regression errors, defined to be \(u_{it} = y_{it}- \text {E}[y_{it}|y_{-it},y_{t-1},X_{it.},\nu _i]\), are available.

As long as the regressor matrix X does not contain endogenous variables, both transformations can be applied. If the regressor matrix contains endogenous variables, the multiplicative specification of the fixed effects together with additive error terms cannot be estimated using the Chamberlain moment conditions because they would not be valid anymore (Windmeijer 2008). In contrast, the Wooldridge moment conditions are also valid for endogenous explanatory variables (Windmeijer 2000). Consequently, we base our discussion on the more general Wooldridge quasi-differenced errors (\(q_{it}\)), which are given by

$$\begin{aligned} q_{it}= & {} \frac{u_{it}}{\exp (X_{it.}\beta )}-\frac{u_{it-1}}{\exp (X_{it-1.}\beta )} \nonumber \\= & {} \frac{y_{it}-\gamma y_{it-1}-\rho \sum \limits _{\begin{subarray}{c} j=1\\ j\ne i \end{subarray}}^{N} w_{ij}y_{jt}}{\exp (X_{it.}\beta )}-\frac{y_{it-1}-\gamma y_{it-2}-\rho \sum \limits _{\begin{subarray}{c} j=1\\ j\ne i \end{subarray}}^{N} w_{ij}y_{jt-1}}{\exp (X_{it-1.}\beta )}. \end{aligned}$$
(13)

For predetermined regressors (\(E[X_{it}u_{it+j}]=0, \; j\ge 0\) and \(E[X_{it}u_{it-s}]\ne 0, \; s\ge 1\)) it holds that

$$\begin{aligned} E[q_{it}|y^{t-2}, X_i^{t-1}]= & {} \; 0, \end{aligned}$$
(14)

with \(y^{t-2}=[y_{1},\dots , y_{t-2}]\) and \(X_i^{t-1}=[X_{i1},\dots , X_{it-1}]\). In contrast to the linear feedback model discussed in Windmeijer (2008), we note that it is necessary to condition on the past of all entities (\(y^{t-2}\)) instead of the individual past (\(y_i^{t-2}\)), since the spatial panel linear feedback model uses information from all neighbours.

These moment conditions guide the choice of instruments for the GMM estimation, which are gathered in matrix Z below. For the model considered here, two lags of the dependent variable, \(y_{t-2}\) and \(y_{t-3}\), two lags of the simultaneous spatial term, \(Wy_{t-2}\) and \(Wy_{t-3}\), and two lags of the predetermined regressor \(X_{t}\), \(X_{t-1}\) and \(X_{t-2}\), form the instrument matrix. Additional exogenous variables added to the model can also be included in Z.Footnote 3

To reduce the instrument count and to limit the overfitting of the endogenous variables due to too many instruments, Roodman (2009) proposes the use of collapsed instruments. That is, moment conditions are summarized over t so that the estimator minimizes, for example, the magnitude of the empirical moment \(\sum _t \Delta y_{it-2} q_{it}\) instead of the \(T-2\) empirical moments \(\Delta y_{it-2} q_{it}, \;t=3,\dots , T\). Using collapsed instruments leads to a more precise estimation of the optimal weight matrix in the second step and reduces bias (Roodman 2009). The instrument matrix then takes the form:

$$\begin{aligned} Z_i = \begin{bmatrix} y_{i1} &{}\quad 0 &{}\quad [Wy_1]_i &{}\quad 0 &{}\quad X_{i2} &{}\quad X_{i1}\\ y_{i2} &{}\quad y_{i1} &{}\quad [Wy_2]_i &{}\quad [Wy_1]_i &{}\quad X_{i3} &{}\quad X_{i2} \\ y_{i3} &{}\quad y_{i2} &{}\quad [Wy_3]_i &{}\quad [Wy_2]_i &{}\quad X_{i4} &{}\quad X_{i3} \\ \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \vdots \\ y_{iT-2} &{}\quad y_{iT-3} &{}\quad [Wy_{T-2}]_i &{}\quad [Wy_{T-3}]_i &{}\quad X_{iT-1} &{}\quad X_{iT-2} \end{bmatrix}, \end{aligned}$$

where \([W y_{t}]_i\) denotes the \(i{th}\) row of the product \(W y_{t}\) for \(t = 1, \dots , T-2\).

The quasi-differenced error terms and the instrument matrix enter the GMM function which is given by

$$\begin{aligned} Q(\theta ) = \bigg ( \frac{1}{N}\sum _{i=1}^N q_i(\theta )'Z_i\bigg )H^{-1}\bigg ( \frac{1}{N}\sum _{i=1}^N Z_i'q_i(\theta )\bigg ), \end{aligned}$$
(15)

where \(q_i(\theta )=(q_{i3},q_{i4},\dots ,q_{iT})'\), \(\theta\) is the vector of parameters to be estimated, and H is a weight matrix. This function is minimized with respect to the parameters \(\theta\) in two steps to obtain consistent and efficient estimates. The weight matrix is set to be equal to the identity matrix for the first step and the weight matrix for the second step is calculated using the results from the first step such that

$$\begin{aligned} H(\hat{\theta }_1)=\frac{1}{N}\sum _{i=1}^N Z_i'q_i(\hat{\theta }_1)q_i(\hat{\theta }_1)'Z_i. \end{aligned}$$
(16)

The asymptotic variance of the resulting efficient two-step GMM estimator (\(\hat{\theta }_2\)) is

$$\begin{aligned} \widehat{var}(\hat{\theta }_2)=\frac{1}{N}\bigg (C(\hat{\theta }_2)'H^{-1}(\hat{\theta }_1)C(\hat{\theta }_2)\bigg )^{-1}, \end{aligned}$$
(17)

with

$$\begin{aligned} C(\hat{\theta }_2)=\frac{1}{N}\sum _{i=1}^N\frac{\partial Z_i'q_i(\theta )}{\partial \theta }\Big |_{\hat{\theta }_2}. \end{aligned}$$

2.2.2 Forecasting using the spatial panel linear feedback model

To derive a one-step-ahead predictor, it should be noticed that it cannot be obtained straightforwardly. Unlike in the case of difference GMM for continuous data, the quasi-differenced dependent variable forecast cannot simply be added to the level of the previous period to obtain a level forecast. Instead, another equation has to be found which provides a forecast \(\hat{y}_{T+1}\) based only on values known at time T. Especially, the function is not allowed to depend on the multiplicative fixed effects \(\nu _i\). Therefore, we employ the following equation which is based on the Wooldridge transformation for the model in \(t=T+1\),

$$\begin{aligned} \frac{y_{iT+1}}{\exp (X_{iT+1.}\beta )}-\frac{y_{iT}}{\exp (X_{iT.}\beta )}= & {} \frac{\gamma y_{iT}+\rho{ \sum _{\begin{array}{c} j=1\\ j\ne i \end{array}}^{N}} w_{ij}y_{jT+1}+u_{iT+1}}{\exp (X_{iT+1.}\beta )}\nonumber \\&-\frac{\gamma y_{iT-1}+\rho \sum _{\begin{array}{c} j=1\\ j\ne i \end{array}}^{N} w_{ij}y_{jT}+u_{iT}}{\exp (X_{iT.}\beta )}. \end{aligned}$$

Rearranging yields an expression for \(y_{T+1}\):

$$\begin{aligned} y_{iT+1}= & {} \bigg (\frac{y_{iT}}{\exp (X_{iT.}\beta )}-A_{i.}\bigg ( \gamma \bigg (\frac{y_{T-1}}{\exp (X_{iT.}\beta )}-\frac{y_{T}}{\exp (X_{iT+1.}\beta )} \bigg ) \nonumber \\&+ \frac{\rho W y_{T}}{\exp (X_{iT+1.}\beta )} \bigg ) \bigg )\exp (X_{iT+1.}\beta )\nonumber \\&- A_{i.} \bigg ( \frac{u_{iT}}{\exp (X_{iT.}\beta )}-\frac{u_{iT+1}}{\exp (X_{iT+1.}\beta )}\bigg )\exp (X_{iT+1.}\beta ), \end{aligned}$$
(18)

with \(A_{i.}\) being the \(i^{th}\) row of the Leontief inverse \(A=(I-\rho W)^{-1}\). Because of the assumed predetermination of X and Equations (13) and (14), the expected value of Equation (18) is given by

$$\begin{aligned} E[y_{iT+1}|y^{T}, X_i^{T+1}]= & {} \bigg (\frac{y_{iT}}{\exp (X_{iT.}\beta )}-A_{i.}\bigg (\gamma \bigg (\frac{y_{T-1}}{\exp (X_{iT.}\beta )}-\frac{y_{T}}{\exp (X_{iT+1.}\beta )} \bigg ) \nonumber \\&+ \frac{\rho W y_{T}}{\exp (X_{iT+1.}\beta )} \bigg ) \bigg ) \exp (X_{iT+1.}\beta ), \end{aligned}$$
(19)

which can serve as a one-step-ahead predictor to obtain point predictions. Note that no straightforward density forecast is available within this framework.

3 Monte Carlo simulations

Before presenting an empirical application of both models proposed in the previous section, we provide some simulation evidence about key properties of parameter estimates in empirically relevant situations.Footnote 4 According to the data-generating process implied by our proposed models, obtaining adequate simulated data sets requires the specification of a suitable spatial weights matrix W and draws of random numbers from the joint distribution of the counts which include the spatial effects.

As already discussed above, in the context of deriving a suitable likelihood function for the models under study here, such a discrete-valued joint distribution is not readily available. Following Glaser and Jung (2021), we therefore propose to produce the joint distribution to draw from by means of the Gibbs sampling algorithm of Geman and Geman (1984). This algorithm iteratively draws from the conditional distributions of \(y_{it}\), \(i=1,\dots ,N\) given all other \(y_j\), \(j \ne i\) at a given time t to obtain the joint distribution of \(y_1, \dots y_n\) asymptotically. Thus, in the \(k^{th}\) iteration of this algorithm at time t, the data-generating process for the static model draws from the following distributions:

$$\begin{aligned} y_{1t}^{(k)}\sim & {} \; Pois(y_{1t}|y_{2t}^{(k-1)}, \dots , y_{Nt}^{(k-1)})\nonumber \\ y_{2t}^{(k)}\sim & {} \; Pois(y_{2t}|y_{1t}^{(k)}, y_{3t}^{(k-1)}, \dots , y_{Nt}^{(k-1)})\nonumber \\ \vdots&\qquad\qquad \vdots \nonumber \\ y_{Nt}^{(k)}\sim & {} \;Pois(y_{Nt}|y_{1t}^{(k)}, \dots , y_{N-1t}^{(k)}) \, . \end{aligned}$$

The starting values \(y_{it}^{(0)}\) are drawn from a distribution with a non-spatial model specification.

For the generation of the spatial weights matrix W, we use a nearest neighbors inverse distance matrix.Footnote 5 To compute it, N random coordinates on the two-dimensional Cartesian plane are generated from a \(U(0,100) \times U(0,100)\) distribution. Then, we compute the Euclidean inverse distance matrix for these points, which represent the N spatial units of the simulated dataset. Furthermore, we select a threshold value to determine the nearest neighbors of each unit, all other entries of the matrix are set to zero and the resulting matrix is finally row-standardized. In our simulations, we consider two choices of the threshold value, \(\pi \in \lbrace 25, 50 \rbrace\), to generate a sparse and a dense spatial weight matrix.

3.1 The Poisson spatial panel model

First, we study the properties of the static Poisson spatial panel model from Sect. 2.1. Parameter estimates for this model are obtained using the pseudo maximum likelihood method. The following data-generating process is employed

$$\begin{aligned} y_{i,t}\sim & {} Pois\left( \left[ \rho \sum_{\begin{array}{c} j=1\\ j\ne i \end{array}}^{N} w_{ij}y_{j,t} + \lambda \sum _{\begin{array}{c} j=1\\ j\ne i \end{array}}^{N} w_{ij}y_{j,t-1} + \exp (\beta x_{i,t}) \right] \nu _i \right) , \nonumber \\ x_{i,t}\sim & {} \;i.i.d.\; N(0,\sigma ^2_x), \end{aligned}$$
(20)

where \(\nu _i = \exp \eta _i\) and \(\eta _i \sim \; i.i.d. \; N(0,\sigma ^2_{\eta })\), \(\sigma ^2_x = 0.5\), and \(\sigma ^2_{\eta } = 0.5\). We consider panels with \(T = \lbrace 4, 8, 16 \rbrace\) time periods and \(N = \lbrace 100, 200, 400 \rbrace\) entities.

Table 1 Simulation results for the Poisson spatial panel model

The results are reported in Table 1. Overall, we find that the results for both specifications of the spatial weight matrix are very similar with slightly better results for the less dense matrix. We find that the pseudo maximum likelihood estimator of both spatial effects is biased in small samples but the bias vanishes for larger samples. However, the estimator’s properties for contemporaneous and lagged spatial effects are affected differently by the increase of N and T. It generally seems to be more difficult to estimate the contemporaneous effect. Specifically, we report a smaller RMSE for the lagged spatial effect in the first specification with equal coefficient values. \(\beta\) can be estimated without problems, even in situations where the other coefficients are slightly biased. Similarly, the distribution of our entity-specific effects can be recovered using Equation (6). To verify this, we compute the mean entity-specific effect for each replication and compare it to its true value of \(E(\nu _i) = \exp (\sigma ^2_{\eta }/2) \approx 1.284\). We find that the mean of the estimated fixed effects converges to this number for increasing N and T.Footnote 6 We conclude that spatial effects can be estimated consistently but not very precisely in small-N/small-T settings.

3.2 The spatial panel linear feedback model

The remainder of our simulation study provides some insights into the quasi-differenced GMM estimates’ behavior for specifications of the spatial panel linear feedback model. To obtain a detailed picture of the model’s properties, the values of the spatio-temporal parameters \(\gamma\), and \(\rho\) are varied, and the resulting combinations are used in samples of different size, obtained by varying N and T. Four exemplary configurations of the parameters are chosen for this purpose, representing different combinations of autocorrelation and spatial correlation. The autocorrelation and spatial correlation parameters take the values: \((\gamma , \rho ) := \{(0.5,0.2), (0.5,0.4), (0.7,0.1), (0.2,0.3)\}\). All specifications are chosen such that the stationarity conditions for spatio-temporal models are satisfied. Since we include a third lag in the instrument matrix, the minimum number of periods to estimate the model is four. Consequently, we consider the sample sizes \((T, N):=\lbrace 8, 16 \rbrace \times \lbrace 50, 100, 200, 400 \rbrace\) for all combinations of parameters. The simulation set-up closely follows Blundell et al. (2002), where the model contains one additional regressor \(x_{it}\) whose coefficient \(\beta\) is set to 0.5 in all configurations. We generate 1000 replications for each Monte Carlo experiment. The dependent variable outcomes are generated using the iterative Gibbs sampling algorithm of Geman and Geman (1984) which first iterates 50 times over the N units for time t and repeats this for all time periods discarding a burn-in of 50 time periods. After B iterations over all units \(i=1\dots N\), the last vector is kept as \(y_{t}\) and is then used as a regressor for generating the outcomes for \(t+1\). The data-generating process is given by

$$\begin{aligned} y_{i,t}\sim & {} Pois(\gamma y_{j,t-1} + \rho \sum_{\begin{array}{c} j=1\\ j\ne i \end{array}}^{N} w_{ij}y_{j,t} + \exp (\beta x_{i,t} + \eta _i)), \nonumber \\ x_{i,t} &= {} \varrho _x x_{i,t-1} + \tau \eta _i + \varepsilon _{i,t}, \nonumber \\ x_{i,0} &= {} \frac{\tau }{1 - \varrho _x} \eta _i + \xi _i, \nonumber \\ y_{i,0}\sim & {} Pois(\exp (\beta x_{i,0} + \eta _i)), \end{aligned}$$
(21)

where the error term \(\varepsilon _{i,t} \sim \;i.i.d.\; N(0,\sigma ^2_{\varepsilon })\), the entity effects \(\eta _i \sim \; i.i.d. \; N(0,\sigma ^2_{\eta })\), and \(\xi _i \sim \;i.i.d.\; N\bigl(0,\sigma ^2_{\varepsilon }/(1 - \varrho _x^2)\bigr)\). The spatial weight matrix W is again generated for each replication using randomly generated coordinates. Our results are based on the parameters \(\beta = 0.5\), \(\varrho _x = 0.5\), \(\tau = 0.1\), \(\sigma ^2_{\varepsilon } = 0.5\), \(\sigma ^2_{\eta } = 0.5\).

The results for our four main specifications are displayed in Table 2. Again, the difference between both specifications of the spatial weights matrix are marginal. The most striking result is that quasi-difference GMM estimation generally requires much larger samples to produce accurate coefficient estimates than the MLE for the static model. It appears difficult to estimate spatio-temporal effects in small samples. Blundell et al. (2002) also remark that small sample bias and imprecision are common problems with the quasi-differenced GMM estimator. Still, the quasi-differenced GMM estimator appears to be asymptotically unbiased and the RMSE decreases with the sample size if substantial spatio-temporal effects are present. While \(\gamma\) is on average overestimated, \(\lambda\) is underestimated for small to moderate sample sizes. As expected, \(\gamma\) converges faster by increasing the time dimension of the panel than by increasing the number of entities. We need a sufficient number of time periods to estimate \(\gamma\) precisely for more persistent model specifications. Unfortunately, wrongly estimated spatio-temporal effects using quasi-differenced GMM can have negative effects on the precision of \(\beta\) estimates (see, for example, the third panel of Table 2). It seems that the relative bias of \(\beta\) is positively related to the spatio-temporal persistence of the panel model.

To investigate how the degree of spatial autocorrelation affects the properties of the quasi-difference GMM estimator, we increase only the spatial autocorrelation going from the first to the second specification. The results show that the estimator still converges for all parameters if we move closer to the bounds of the stationarity region. More specifically, the relative bias of \(\hat{\rho }\) is substantially lower in the second specification. The third specification is characterized by high temporal autocorrelation and low spatial autocorrelation. Here, we observe that \(T=8\) time periods are not enough to estimate the spatio-temporal effects properly. Not surprisingly, we find that increasing the time dimension to \(T=16\) leads to greater performance gains than doubling the number of entities. Finally, our fourth specification has temporal autocorrelation coefficient that is slightly lower than the spatial autocorrelation coefficient.Footnote 7 In this case, we find that moving from \(T=8\) to \(T=16\) does not improve the estimates by much. However, using a combination of moderate N and T yields relatively precise results.

Table 2 Simulation results for the spatial panel linear feedback model

4 Application

The proposed models will now be applied in a predictive exercise using the same data set as in Liesenfeld et al. (2017) on severe crime counts in Pittsburgh, PA. We briefly describe the data set and then present empirical results from fitting the Poisson spatial panel model and the spatial panel linear feedback model. For further details about the data and the predictors to be included in the regression framework, we refer to Chapter 2 of that paper.

4.1 The data

The data set consists of crime counts for the 138 census tracts of Pittsburgh, Pennsylvania (in the borders for the 2010 census) sampled at a monthly frequency from January 2008 to December 2013. It contains 33 different crime types which are reported separately and summarized into two categories according to the U.S. Department of Justice and Federal Bureau of Investigation (2004). Part I crimes encompass major crimes such as aggravated assault, burglary, larceny, motor vehicle theft, murder/manslaughter, negligent manslaughter, rape, and robbery. Part II crimes are the remaining minor offences like drunken driving, fraud, and vandalism, etc. (see U.S. Department of Justice and Federal Bureau of Investigation 2004, p. 8 for the full list). The logarithm of the lagged Part II crimes is used here as an explanatory variable for the Part I crimes models according to the so-called ‘broken-windows’ hypothesis (Wilson and Kelling 1982; Kelling and Coles 1996). It suggests that the intensity of less severe crimes in a neighborhood often precedes the occurrence of more severe crimes.

Figure 1 displays the geographical distribution of the average number of Part I crimes in Pittsburgh’s census tracts. Colors indicate the deciles of the data with dark red tracts belonging to the highest decile of Part I crimes and dark blue ones to the lowest deciles. The highest number of Part I crimes with an average of 76 per month is observed for census tract 201 (“Downtown”) followed by census tract 1702 (“Southside”), which lies to its southwest and has an average number of Part I crimes of 42. Small values are observed for tracts at the city border (dark blue areas in Fig. 1) and those with a small population density. Also, the dataset contains several cross-sectional variables measuring the socio-demographic characteristics of the inhabitants of each census tract, more specifically, their median income, the fraction of the population age 18 or older, the fraction of the population which has a bachelor’s degree or higher, and the fraction of female-led households. These variables are known from the criminology literature to indicate the criminal activity in a neighborhood and are used for forecasting in the random effects model proposed by Liesenfeld et al. (2017). Unfortunately, these variables are only available in yearly sampling frequency and would be picked off in our model by the inclusion of time fixed effects.

Fig. 1
figure 1

Map of time averages of Part I crimes in Pittsburgh, grouped in deciles. Dark red (dark blue) areas indicate highest (lowest) deciles

Table 3 reports selected descriptive statistics of the variables in the dataset for the whole sample period and for each year separately. On average, more Part II crimes than Part I crimes are observed per month and census tract, although the difference is quite small considering the large difference in the number of crime types in both classes. The highest amount of crimes was reported in 2008. A downward tendency is recognizable over time in this sample, although the differences between the years are small. 2013 has the lowest average crime rates.

Table 3 Descriptive statistics of the Pittsburgh crime dataset

Figure 2(a) gives a visual impression of the seasonal pattern of the crime counts. Generally, there are more crimes in the first half of the year and a hump in the Part II crimes during August and September. The autocorrelation functions (ACF) of the Part I crimes for each of the 138 census tracts are plotted in Fig. 2(b). They indicate significant but low autocorrelation at the first lag for most census tracts.

The spatial weights matrix used for this application is a queen contiguity matrix. Its entries are defined in the following: \(w_{ij} = \frac{a_{ij}}{\#neighbours}\), \(i,j = 1, \dots , N\), where \(a_{ij}=1\), if i is a neighbour of j (sharing a common border or a common vertex) and \(a_{ij}=0\), if i is not a neighbour of j. Each census tract has on average 5.6 neighbors. Finally, the spatial weights matrix is row-standardized.

To measure spatial association, we use Moran’s I which was introduced by Moran (1950) for binary weights and generalised for arbitrary weight matrices by Cliff and Ord (1981, p. 17). It is the most prevalent measure of spatial association, quantifying the dependence across the entire dataset by summarising cross-products of deviations from the mean as follows

$$\begin{aligned} I_t=\frac{N}{W_0}\frac{\sum _{i=1}^{N}\sum _{j=1}^{N} w_{ij}(y_{it}-\bar{y}_t)(y_{jt}-\bar{y}_t)}{\sum _{i=1}^N (y_{it}-\bar{y}_t)^2}, \qquad t = 1, \dots , T, \end{aligned}$$
(22)

where W is the chosen spatial weights matrix and \(W_0=\sum _{i=1}^{N}\sum _{j\ne i}^{N} w_{ij}\), which equals N for row-standardized weight matrices. The measure was developed for a cross-section of data, so we have to compute it repeatedly for each month, indicated by the subscript t. Unfortunately, there is no distributional theory available for Moran’s I calculated from count data, but we can employ a bootstrap procedure to obtain inferential statements (Lin et al. 2011; Ren et al. 2014; Jin and Lee 2015). To do so, we draw elements of \(y_t\), place it randomly on our map, and recompute \(I_t\). Using 400 bootstrap draws, we can compare the original \(I_t\) statistic to the bootstrap distribution and determine the p-value.

Figure 2(c) and (d) show the values of Moran’s I and standardized Moran’s I for each month in the dataset. The lowest values are obtained for August 2009, June 2010, and December 2012. Using our bootstrap test, we are able to reject the null hypothesis of no spatial dependence at the 5% level for all months except August 2009. Hence, these initial tests based on Moran’s I clearly point to spatial correlation in the data.

Fig. 2
figure 2

Descriptives for Pittsburgh crime data. The blue horizontal lines in (b) indicate the 95 % confidence interval

4.2 Forecasting Pittsburgh’s crime counts using spatial panel count data models

In this section, we begin by applying static Poisson spatial panel models to Pittsburgh’s crime counts. To forecast the number of Part I crimes in the census tracts of Pittsburgh for each month of 2013, a one-step-ahead expanding window forecast is employed. For this matter, we choose the estimation sample ranging from January 2008 to December 2012. Resulting parameter estimates of this first estimation sample are reported in Table 4. We employ different Poisson spatial panel model specifications with additional time fixed effects dummies for each month to capture seasonal effects. While Model (1) only contains the contemporaneous spatial term, Model (2) only contains the lagged spatial term, and Models (3) has the full spatial specification. The coefficient of lagged Part II crimes is highly significant in all specifications, and spatial terms are highly significant for each configuration. We find that the lagged spatial correlation parameter \(\lambda\) is relatively small compared to the contemporaneous spatial term \(\rho\). Its size remains stable, whether or not Part II crimes are included as a regressor, which entails information about the same period as the lagged spatial term.Footnote 8 Time fixed effects are insignificant for all specifications of the static model. The summer months have the highest coefficient values, which seem to support the hypothesis that criminal activity increases with the temperature because a bad temper is evoked faster in hotter conditions (Gorr et al. 2003).

Fig. 3
figure 3

Scoring rules of density forecasts for the Poisson spatial panel model with contemporaneous spatial term (\(\circ\)), with lagged spatial term (\(\triangle\)), and with both spatial terms (\(+\))

To compare and contrast our results with those provided in the literature, we first obtain point forecasts for the Part I crimes. The lower panel of Table 4 gives the root mean squared forecast error (RMSFE) and mean absolute forecast error (MAFE) averaged over the 12 months of the forecasting period (January to December 2013). Table 5 displays the forecast results for each month separately. In general, both measures show a very similar pattern across models. Still, we find very different predictions across model specifications for some months, for example, in February and June. It turns out that the model specification, which only includes the lagged spatial term, performs best in terms of RMSE and MAFE. This is surprising considering that the contemporaneous spatial effect was larger than the lagged spatial effect during the estimation period. Additional estimates (not reported) reveal that specifications with entity fixed effects clearly outperform pooled and time fixed effects models in terms of point forecast accuracy. Comparing these results to the one-step predictions obtained for the random effects model in Liesenfeld et al. (2017), which has an average RMSE (MAFE) over 2013 of 3.466 (2.535), we find that our fixed effects model performs competitively in terms of both measures. This is a striking result considering that the random effects model can incorporate socio-demographic indicators in yearly frequency, which are often good predictors for future crimes. Instead, these effects are absorbed by our entity and time fixed effects.

Table 4 Estimation results for different specifications of the Poisson spatial panel model
Table 5 Point forecasting evaluation for the Poisson spatial panel model

As pointed out in Sect. 2.1, density forecasts can also be obtained for the Poisson spatial panel models based on (10). Figure 3 displays various scoring rules of the density forecasts for all static model specifications over the different forecast months. All model specifications perform very similarly across all three scoring rules, and it is difficult to pick a preferred model specification based on this criterion. Using the logarithmic and ranked probability scores, we find that Model (2) outperforms Model (1) and Model (3) during the latter half of 2013.

Finally, we can use the nonrandomized PITs to check the distributional assumption’s adequacy for the Poisson spatial panel model. The \(\chi ^2\) goodness-of-fit tests of the null hypotheses of uniformity of the distribution of the nonrandomized PITs cannot be rejected for all forecasting months and all three model specifications.Footnote 9 We conclude from these results, that the data do not reject the conditional Poisson assumption.

Now, we present the results based on the spatial panel linear feedback model (Model (4)) and compare our results to those for the static model (Models (1)–(3)). The relevant stationarity conditions from Equation (12) are fulfilled for the spatio-temporal model specification (\(\omega _{min}=-0.333\), \(\omega _{max}=1\)) and the Sargan test does not reject instrument exogeneity. The spatial regressor and the lagged Part II crimes are highly significant which is in line with our results for the static model where spatial autocorrelation coefficients are statistically significant over all specifications. We observe some substantial differences in the estimated coefficient values. Both the spatial autocorrelation coefficient and the coefficient of lagged Part II crimes are much higher for the spatial panel linear feedback model. However, to properly compare direct and indirect effects of lagged Part II crimes would involve the estimation of marginal effects (LeSage and Pace 2009), something that is not attempted at this stage. Strikingly, the coefficient of the dynamic effect, although significant, is quite small numerically.

We evaluate the model specifications’ predictive performance by comparing them to one-step-ahead predictions for the static panel models in Table 5. Here, it becomes obvious that the inclusion of a dynamic effect leads to a worse forecasting performance for this specific dataset. Our dynamic models therefore also perform worse than the random effects model of Liesenfeld et al. (2017) in terms of RMSFE and MAFE. Together with the results from the Monte Carlo illustration these outcomes indicate that the quasi-differenced GMM estimation approach requires larger sample sizes and may not be useful for this specific application. Although the total sample size is moderate, we cannot benefit from the large-T panel setting if the spatial effects drives forecasting performance and the dynamic effect only contributes marginally. Blundell et al. (2002) also report similar shortcomings of the quasi-differenced GMM estimation for the non-spatial linear feedback model and propose to alternatively use a pre-sample mean estimator which has better small sample properties. However, for this estimator, a long time series of the dependent variable must be available prior to the actual estimation sample. Therefore, its applicability critically depends on the data availability, which is not given here.

5 Conclusion

This paper proposes spatial panel models for counts with fixed effects and apply them to forecast crime counts in an urban environment. While the static Poisson spatial panel model proved to be reliable in simulations with small sample sizes, the dynamic spatial panel linear feedback model requires rather large sample sizes to yield accurate results. The former model can be estimated using pseudo maximum likelihood estimation, while the latter is estimated using quasi-differenced GMM. The models are applied to forecast crime counts for the census tracts of Pittsburgh, PA in 2013. For this, the static model outperforms the dynamic model. Modeling the dynamic effect is outweighed in terms of forecasting performance by the less precise quasi-differenced GMM estimation. Our preferred model specification (Poisson spatial panel model with lagged spatial effects) performs comparably to the random effects model suggested in Liesenfeld et al. (2017).

The proposed spatial models for count data incorporate spatially lagged dependent variables to estimate a global spatial effect. Estimation of these models is relatively straightforward and only imposes moderate computational costs. These aspects should foster the applicability of these models by applied researchers. Due to the inclusion of lagged observed counts, rather than lagged intensities, the interpretation of the measured spatial correlation is closer to the one usually encountered when working with linear spatial models for continuous data, making this class of count data models more accessible.