1 Introduction

Zero-inflated spatial data are spatially dependent observations characterized by an excess of zeros. Such data are common in many disciplines; examples include counts of harbor seals on glacial ice (Ver Hoef and Jansen 2007), annual mental health expenditures among US federal employees (Neelon et al. 2011), and the number of torrential rainfall events in a region of interest (Lee and Kim 2017). These data are typically a mixture of zeros and either discrete counts or positive real numbers. Standard probability distributions may not be appropriate for modeling zero-inflated data as they are unable to account effectively for the excess zeros (cf. Agarwal et al. 2002; Rathbun and Fei 2006; Lambert 1992).

A class of parametric mixture distributions called two-part models (Mullahy 1986; Lambert 1992) is popular for modeling zero-inflated data. Two-part models account for both the excess zeros and the skewed distribution of nonzero values by using two latent random variables, one for occurrence and the other for prevalence. The occurrence process dictates whether a structural zero or nonzero value is observed, and the prevalence process determines the value of the structural nonzero observations. Two-part models have been extended to model zero-inflated spatial observations (cf. Agarwal et al. 2002; Ver Hoef and Jansen 2007; Olsen and Schafer 2001) where the occurrence and prevalence random variables vary with respect to space (i.e., they are spatial random processes). Spatial two-part models are an extension of the well-known spatial generalized linear mixed models (SGLMMs) (Diggle et al. 1998), albeit with two separate latent spatial processes.

Modeling large zero-inflated spatial datasets remains a key challenge from both a computational and a modeling standpoint. First, fitting spatial two-part models can be computationally prohibitive for large datasets because these models are generally overparameterized (Recta et al. 2012; Ver Hoef and Jansen 2007) with more unknown parameters than observations. For high-dimensional data, model-fitting can be computationally burdensome due to large matrix operations, estimation of high-dimensional latent random effects, and slow-mixing Markov chain Monte Carlo (MCMC) algorithms. Second, the underlying occurrence and prevalence spatial processes may exhibit complex spatial dependence characteristics such as non-stationarity and anisotropy. Third, the occurrence and prevalence processes may be strongly correlated, and directly modeling the cross-correlations can be computationally expensive (Recta et al. 2012).

Novel modeling approaches for zero-inflated spatial data have been proposed, but these methods may not scale to larger datasets. In non-spatial settings, frequentist methods have been used to fit two-part models using Gauss-Hermite quadrature (Min and Agresti 2005), expectation-maximization (Lambert 1992; Roeder et al. 1999), and restricted maximum quasi-likelihoods (Kim et al. 2012). However, these methods do not address a key component of spatial two-part models: inference for the two n-dimensional vectors of latent spatial random effects from the occurrence and prevalence spatial processes. Lyashevska et al. (2016) employ a Monte Carlo maximum likelihood approach to model a moderately large zero-inflated spatial dataset (\(n=\) 4029), but this approach still requires estimation of the high-dimensional spatial random effects and costly matrix operations.

In the Bayesian framework, the literature primarily focuses on the sophistication of zero-inflated spatial models. However, there is a dearth of studies examining the computational issues associated with large zero-inflated spatial datasets. Studies have modeled spatio-temporal dependence (Fernandes et al. 2009; Neelon et al. 2016; Arcuti et al. 2016), addressed overdispersion (Gschlößl and Czado 2008; Lee et al. 2016), used skewed distributions (Dreassi et al. 2014; Liu et al. 2016), used t-distributions to model heavy-tailed behavior (Neelon et al. 2015), and modeled prevalence with scale mixtures of normal distributions (Fruhwirth-Schnatter and Pyne 2010) or Student-t processes (Bopp et al. 2020). Another study facilitates posterior sampling for zero-inflated negative binomial distributions (ZINB) by using latent variables that are represented as scale mixtures of normal distributions (Neelon 2019). A notable exception is Wang et al. (2015), which models the presence and abundance of Atlantic cod in 1325 locations along the Gulf of Maine using predictive processes (Banerjee et al. 2008). Predictive processes still require \({\mathcal {O}}(nm^2+m^3)\) operations where m is the number of knots and n is the dataset size, and careful attention must be paid to knot selection (Guhaniyogi et al. 2011).

In this study, we introduce a computationally efficient approach for fitting a broad range of two-part models to high-dimensional zero-inflated spatial data. We use projection-based intrinsic conditional autoregression (PICAR) (Lee and Haran 2022) to reduce the dimensions of and correlation between the spatial random effects in two-part spatial models. The PICAR method represents the spatial random effects using empirical basis functions based on the Moran’s I statistic and piecewise linear basis functions. Various basis representations have been directly or indirectly used to model spatial data; for instance, predictive processes (Banerjee et al. 2008), random projections (Guan and Haran 2018, 2019; Banerjee et al. 2013; Park and Haran 2020), Moran’s basis for areal models (Hughes and Haran 2013), stochastic partial differential equations (Lindgren et al. 2011), kernel convolutions (Higdon 1998), eigenvector spatial filtering (Griffith 2003), and multi-resolution basis functions (Nychka et al. 2015; Katzfuss 2017), among others. Hughes and Haran (2013) made a significant contribution in modeling large binary and count spatial datasets within the discrete spatial domain while de-confounding the spatial random effects; however, their method does not extend to zero-inflated spatial data or to spatial datasets in the continuous domain.

To our knowledge, this is the first approach that readily lends itself to a wide range of user-specified spatial two-part models while also reducing computational costs for large datasets. We demonstrate the applicability of our proposed approach (PICAR-Z) via simulation studies as well as two real-world applications: a bivalve species abundance dataset and ice thickness measurements in West Antarctica. Because our approach allows for efficient inference, it enables practitioners to explore the advantages (and disadvantages) of various modeling choices in two-part models more thoroughly than was previously possible.

The rest of the manuscript is organized as follows. In Sect. 2, we describe the two motivating zero-inflated spatial datasets. In Sect. 3, we introduce the general two-part modeling framework for zero-inflated data. We also provide an overview of spatial two-part models and examine the inherent modeling and computational challenges. In Sect. 4, we propose a computationally efficient approach to fit high-dimensional spatial two-part models (PICAR-Z) and provide some implementation guidelines. We demonstrate the utility of PICAR-Z through four different simulation studies in Sect. 5 as well as two high-dimensional environmental datasets in Sect. 6. Finally, a summary and directions for future research are provided in Sect. 7.

2 Description of the data

Our proposed methodology (PICAR-Z) is motivated by two challenges in the environmental sciences (ecology and glaciology). We provide an overview of the motivating research questions as well as descriptions of the corresponding zero-inflated spatial datasets.

Bivalve Species Abundance in the Wadden Sea. The Dutch Wadden Sea (DWS) is a UNESCO World Heritage region and a crucial protected ecological habitat composed of sand barriers, salt marshes, mudflats, and gullies (Compton et al. 2013; Lyashevska et al. 2016). Most notably, the DWS is a critical sanctuary for hundreds of thousands of shorebirds (Lyashevska et al. 2016) as it serves as a vital feeding and breeding location for a wide array of bird species (Boere and Piersma 2012). The DWS plays a critical role in the bird species’ yearly migratory journey by providing an abundance of food resources, such as the Baltic tellin (Macoma balthica), a benthic invertebrate. The Baltic tellin is one of the most commonly found macrobenthic species in the western DWS (van der Meer et al. 2003) and one of the preferred prey of bird species (Lyashevska et al. 2016); however, there has been a steady decline in the population over the past two decades (Dairain et al. 2020). Recent studies have examined this decline in adult survival with respect to factors such as global warming (Beukema and Dekker 2020), habitat selection (van der Meer et al. 2003), and proliferative disorders (Dairain et al. 2020). In addition, there are environmental factors that affect species survival such as the silt content, median grain size, and altitude (Lyashevska et al. 2016). Analyzing the abundance of the Baltic tellin, while considering spatial dependence and external environmental factors, can provide valuable insights for ecologists and conservationists. This information can guide resource allocation and the development of conservation policies, benefiting not only the Baltic tellin but also the diverse bird species within the Dutch Wadden Sea ecosystem.

We examine spatial abundance data of the Baltic tellin species from Lyashevska et al. (2016) originally obtained from the synoptic intertidal benthic surveys (SIBES) monitoring program (Compton et al. 2013; Bijleveld et al. 2012). The dataset includes counts of the Baltic tellin (Macoma balthica) species sampled at n = 4029 locations. The dataset exhibits both zero-inflation, as \(65.9\%\) of the locations have zero counts, and spatial dependence. The occurrence (presence vs. absence) and prevalence (values of positive counts) maps are provided in the supplement. Given the zero-inflated counts, spatial dependence, and large number of locations, computationally efficient spatial two-part models are well-suited for analyzing this dataset.

Antarctic Ice Sheet Thickness. Based on available geological records, mass loss from the Antarctic ice sheets can potentially lead to drastic global sea level rise (Deschamps et al. 2012), in some cases up to 60 m (Fretwell et al. 2012). Existing studies (Serreze and Barry 2011; Zhang 2007) suggest that a portion of the Antarctic ice sheet could experience significant mass-loss in the next few centuries as a consequence of global climate change. This presents a significant threat since a substantial portion of the world’s population resides in low-elevation coastal regions (Greve et al. 2011). In fact, nearly eight percent of the global population is threatened by a mere five-meter rise in sea level (Nicholls et al. 2008) and 13 percent of the global urban population is threatened by a ten-meter sea level rise (McGranahan et al. 2007). Since Antarctic mass-loss is both a threat to heavily-populated coastal regions and a key indicator of global climate change, an important first step entails understanding the underlying spatial patterns of mass-loss by modeling the current thickness of the Antarctic ice sheet.

We examine semicontinuous observations of ice thickness from the Bedmap2 dataset (Fretwell et al. 2012), generated using satellite altimetry, airborne and ground radar surveys, and seismic sounding. Similar to Chang et al. (2016), we examine gridded ice thickness observations at 20 km resolution over a \(171 \times 171\) grid spanning the entirety of Antarctica including vulnerable regions in the Amundsen Sea Embayment (West Antarctica). The resulting dataset consists of n = 29,241 semicontinuous observations of ice sheet thickness where 10,327 (\(35.3\%\)) are zeros. The observed dataset poses significant modeling and computational challenges due to: (1) high-dimensional observations; (2) the presence of both zeros and positive thickness measurements; and (3) observations collected at regularly spaced intervals, leaving many unobserved locations in the spatial domain. Traditional spatial generalized linear mixed models are unable to account for the zero-inflation, and existing spatial two-part models are unable to scale to large datasets. In Sect. 6.2, we utilize PICAR-Z to model the semicontinuous ice thickness and interpolate, or downscale, at unobserved locations.

3 Zero-inflated spatial models

In this section, we provide an overview of the two-part modeling framework (Mullahy 1986; Lambert 1992) for spatially dependent zero-inflated observations (cf. Ver Hoef and Jansen 2007; Neelon et al. 2016; Wang et al. 2015). Let \(Z(\textbf{s})\) be a zero-inflated observation at spatial location \(\textbf{s}\in {\mathcal {D}}\) within the spatial domain \({\mathcal {D}}\subset {\mathbb {R}}^{2}\). In the spatial two-part modeling framework, \(Z(\textbf{s})\) is generated as follows:

$$\begin{aligned} Z(\textbf{s}) = \left\{ \begin{array}{ll} 0 &{} \text{ if } O(\textbf{s}) = 0, \\ P(\textbf{s}) &{} \text{ if } O(\textbf{s}) = 1, \end{array} \right. \end{aligned}$$
(1)

where \(O(\textbf{s})\) and \(P(\textbf{s})\) are the spatial occurrence and prevalence processes, respectively. The occurrence process is typically specified as \(O(\textbf{s})\sim \textrm{Bern}(\cdot |\pi (\textbf{s}))\) with spatially varying probabilities \(\pi (\textbf{s})\in (0,1)\). The prevalence process is modeled as \(P(\textbf{s})\sim {\tilde{F}}(\cdot |{\varvec{\theta }}(\textbf{s}))\) where \({\tilde{F}}(\cdot |{\varvec{\theta }}(\textbf{s}))\) is a discrete or continuous probability distribution with spatially-varying parameters \({\varvec{\theta }}(\textbf{s})\). The key distinction from the univariate case is that the occurrence O and prevalence P random variables now vary across space; hence, we model the occurrence and prevalence as spatial random processes \(O(\textbf{s})\) and \(P(\textbf{s})\).

Spatial two-part models typically fall into two classes—hurdle and mixture models. In hurdle models, the occurrence random process \(O(\textbf{s})\) specifies which locations are associated with zero- or nonzero values. For the nonzero data, their respective positive values are generated by the prevalence random process \(P(\textbf{s})\). In the discrete case, \({\tilde{F}}(\cdot |{\varvec{\theta }}(\textbf{s}))\) is a zero-truncated probability mass function such as the zero-truncated Poisson or the zero-truncated negative binomial distribution. For semi-continuous observations, \({\tilde{F}}(\cdot |{\varvec{\theta }}(\textbf{s}))\) can be a probability density function with positive support such as a log-normal or gamma distribution. For mixture models, the zero-valued observations can be generated by both processes \(O(\textbf{s})\) and \(P(\textbf{s})\). Here, \(O(\textbf{s})\) determines whether a location is classified as a structural zero or nonzero case. For the structural nonzero cases, the prevalence random process \(P(\textbf{s})\) generates both zeros and positive values. In the discrete case, \({\tilde{F}}(\cdot |{\varvec{\theta }}(\textbf{s}))\) is a non-degenerate mass function such as the Poisson or Negative-Binomial distribution. For semi-continuous observations, \({\tilde{F}}(\cdot |{\varvec{\theta }}(\textbf{s}))\) can be a censored model such as a Tobit Type I.
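The distinction between the two classes can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: it assumes count data, a constant (non-spatial) occurrence probability `pi` interpreted as \(P(O(\textbf{s})=1)\), and a scalar Poisson intensity `lam`; the function names are hypothetical.

```python
import numpy as np

def zero_truncated_poisson(rng, lam, size):
    """Draw from Poisson(lam) conditioned on being >= 1, by rejection (scalar lam)."""
    out = rng.poisson(lam, size=size)
    while np.any(out == 0):
        redo = out == 0
        out[redo] = rng.poisson(lam, size=redo.sum())
    return out

def simulate_two_part(rng, pi, lam, kind):
    """Simulate counts Z at len(pi) locations.

    pi   : P(O(s) = 1), the probability of a structural nonzero
    lam  : scalar Poisson intensity of the prevalence process
    kind : "hurdle" (zero-truncated Poisson) or "mixture" (ordinary Poisson)
    """
    pi = np.asarray(pi, dtype=float)
    occ = rng.random(pi.shape) < pi              # occurrence O(s)
    if kind == "hurdle":
        prev = zero_truncated_poisson(rng, lam, pi.shape)
    elif kind == "mixture":
        prev = rng.poisson(lam, size=pi.shape)
    else:
        raise ValueError("kind must be 'hurdle' or 'mixture'")
    return np.where(occ, prev, 0), occ           # Z(s) = 0 when O(s) = 0

rng = np.random.default_rng(1)
pi = np.full(2000, 0.4)
z_hurdle, occ_h = simulate_two_part(rng, pi, lam=1.5, kind="hurdle")
z_mixture, occ_m = simulate_two_part(rng, pi, lam=1.5, kind="mixture")
```

In the hurdle draw, every zero comes from the occurrence process; in the mixture draw, some locations with \(O(\textbf{s})=1\) still record a zero because the Poisson prevalence process places mass at zero.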

3.1 Modeling framework: spatial two-part models

Here, we outline the general hierarchical modeling framework for spatial two-part models. The occurrence process \(O(\textbf{s})\) follows a Bernoulli distribution with either a probit or a logit link function. The prevalence process \(P(\textbf{s})\) employs the appropriate probability distribution based on the observation class (discrete vs. semi-continuous) and zero-inflation structure (hurdle vs. mixture). For hurdle models, a zero-truncated or strictly positive distribution (e.g., zero-truncated Poisson, zero-truncated negative binomial, lognormal, or gamma) is a sensible choice for \({\tilde{F}}(\cdot |{\varvec{\theta }})\). Mixture models utilize a distribution with non-negative support (e.g., Poisson, negative binomial, or Tobit Type I). For both processes, their respective linear predictors (\({\varvec{\eta }}_{o}\) and \({\varvec{\eta }}_{p}\)) include the fixed and spatial random effects. The linear predictor for the occurrence process is \({\varvec{\eta }}_{o}=\textbf{X}{\varvec{ \beta }}_{o} + \textbf{W}_{o} + {\varvec{\epsilon }}_{o}\), where \({\varvec{ \beta }}_{o}\) and \(\textbf{W}_{o}\) are the vectors of the fixed effects and spatial random effects, respectively, and \({\varvec{\epsilon }}_{o}\) are the iid observational errors. The linear predictor for the prevalence process \({\varvec{\eta }}_{p}\) can be constructed similarly to \({\varvec{\eta }}_{o}\). The latent spatial random sub-processes \(W_o(\textbf{s})\) and \(W_p(\textbf{s})\) can each be modeled as a stationary zero-mean Gaussian process with a Matérn covariance function (Stein 2012), a widely used class of stationary and isotropic covariance functions. Link functions \(g_o(\cdot )\) and \(g_p(\cdot )\) are specified according to the distributions chosen for the occurrence and prevalence processes. To complete the Bayesian hierarchical model, we designate prior distributions for the model parameters.

The Bayesian hierarchical framework for spatial two-part models is:

$$\begin{aligned} \begin{array}{ll} {{\textbf {Data Model:}}} ~&{} \qquad Z(\textbf{s})|O(\textbf{s}),P(\textbf{s}) \sim F\big (\cdot |O(\textbf{s}),P(\textbf{s})\big ) \\ {{\textbf {Process Model:}}} ~&{} \qquad O(\textbf{s})|\pi (\textbf{s}) \sim \text{ Bern }(\cdot |\pi (\textbf{s})) \\ ~&{} \qquad P(\textbf{s})|\theta (\textbf{s}) \sim {\tilde{F}}(\cdot |\theta (\textbf{s})) \\ {{\textbf {Sub-process Model 1: }}} ~&{} \qquad \pi (\textbf{s})|\eta _{o}(\textbf{s})=g^{-1}_{o}(\eta _{o}(\textbf{s}))\\ {{\textbf {(Occurrence) }}} ~&{} \qquad \eta _{o}(\textbf{s})|{\varvec{ \beta }}_{o},W_{o}(\textbf{s}),\epsilon _{o}(\textbf{s}) =\textbf{X}(\textbf{s})^{\prime }{\varvec{ \beta }}_{o} + W_{o}(\textbf{s}) +\epsilon _{o}(\textbf{s}) \\ ~&{} \qquad \textbf{W}_{o}=(W_{o}(\textbf{s}_{1}),\ldots ,W_{o}(\textbf{s}_{n}))^{\prime } \\ ~&{} \qquad \textbf{W}_{o}|\phi _{o},\sigma ^{2}_{o}\sim {\mathcal {N}}({{\varvec{0}}},\sigma ^{2}_{o}\textbf{R}_{\phi _{o}}) \\ ~&{} \qquad \epsilon _{o}(\textbf{s})|\tau ^{2}_{o}\sim {\mathcal {N}}(0,\tau ^{2}_{o})\\ {{\textbf {Sub-process Model 2: }}} ~&{} \qquad \theta (\textbf{s})|\eta _{p}(\textbf{s})=g^{-1}_{p}(\eta _{p}(\textbf{s})) \\ {{\textbf {(Prevalence)}}} ~&{} \qquad \eta _{p}(\textbf{s})|{\varvec{ \beta }}_{p},W_{p}(\textbf{s}),\epsilon _{p}(\textbf{s}) =\textbf{X}(\textbf{s})^{\prime }{\varvec{ \beta }}_{p} + W_{p}(\textbf{s}) +\epsilon _{p}(\textbf{s}) \\ ~&{} \qquad \textbf{W}_{p}=(W_{p}(\textbf{s}_{1}),\ldots ,W_{p}(\textbf{s}_{n}))^{\prime } \\ ~&{} \qquad \textbf{W}_{p}|\phi _{p},\sigma ^{2}_{p}\sim {\mathcal {N}}({{\varvec{0}}},\sigma ^{2}_{p}\textbf{R}_{\phi _{p}}) \\ ~&{} \qquad \epsilon _{p}(\textbf{s})|\tau ^{2}_{p}\sim {\mathcal {N}}(0,\tau ^{2}_{p}) \\ {{\textbf {Parameter Model:}}} ~&{}\qquad {\varvec{ \beta }}_{o}\sim p({\varvec{ \beta }}_{o}),\quad {\varvec{ \beta }}_{p}\sim p({\varvec{ \beta }}_{p}), \quad \phi _{o} \sim p(\phi _{o}) , \quad \phi _{p} \sim p(\phi _{p}) \\ ~&{} \qquad \sigma ^{2}_{o} \sim p(\sigma ^{2}_{o}), \quad \sigma ^{2}_{p} \sim p(\sigma ^{2}_{p}), \quad \tau ^{2}_{o} \sim p(\tau ^{2}_{o}), \quad \tau ^{2}_{p} \sim p(\tau ^{2}_{p}) \end{array} \end{aligned}$$
(2)

where \(F\big (\cdot |O(\textbf{s}),P(\textbf{s})\big )\) is the distribution function of a spatial two-part model. From Eq. 2, the likelihood function \(f\big (z|O(\textbf{s}),P(\textbf{s})\big )\) is defined as:

$$\begin{aligned} f\big (z|O(\textbf{s}),P(\textbf{s})\big ) = \left\{ \begin{array}{ll} (1-\pi (\textbf{s})) +\pi (\textbf{s})\times {\tilde{f}}(0;\theta (\textbf{s})), &{} \text{ if } z=0, \\ \pi (\textbf{s})\times {\tilde{f}}(z;\theta (\textbf{s})), &{} \text{ if } z>0, \end{array} \right. \end{aligned}$$
(3)

where \(\pi (\textbf{s})\) and \(\theta (\textbf{s})\) are the spatially-varying occurrence probabilities and prevalence intensities, respectively, and \({\tilde{f}}(z;\theta (\textbf{s}))\) is the density (or mass) function of the prevalence process. As noted above, spatial two-part models fall into two classes (hurdle and mixture models). The key difference between the two classes lies in the choice of \({\tilde{F}}(\cdot |\theta (\textbf{s}))\), the distribution of the prevalence process \(P(\textbf{s})\). We provide a detailed description of four spatial two-part models (count hurdle, semicontinuous hurdle, count mixture, and semicontinuous mixture) in the Supplement.
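To make the role of \({\tilde{f}}\) concrete, the sketch below evaluates a two-part likelihood of the form in Eq. 3 for count data. It is illustrative only and rests on two assumptions: `pi` denotes \(P(O(\textbf{s})=1)\), the probability of a structural nonzero under Eq. 1, so the zero mass is `(1 - pi) + pi * f(0)`; and the prevalence intensity is a Poisson rate `lam`. All function names are hypothetical.

```python
import math

def poisson_pmf(z, lam):
    """Poisson mass function, used as the prevalence distribution of a count mixture model."""
    return math.exp(-lam + z * math.log(lam) - math.lgamma(z + 1))

def zt_poisson_pmf(z, lam):
    """Zero-truncated Poisson: no mass at 0, renormalized over z >= 1 (count hurdle model)."""
    return 0.0 if z == 0 else poisson_pmf(z, lam) / (1.0 - math.exp(-lam))

def two_part_pmf(z, pi, lam, prevalence_pmf):
    """P(Z = z) when pi = P(O = 1) and the prevalence mass function is prevalence_pmf."""
    if z == 0:
        return (1.0 - pi) + pi * prevalence_pmf(0, lam)
    return pi * prevalence_pmf(z, lam)

def log_likelihood(zs, pis, lams, prevalence_pmf):
    """Sum of log two-part masses over locations, given per-location pi and lam."""
    return sum(math.log(two_part_pmf(z, p, l, prevalence_pmf))
               for z, p, l in zip(zs, pis, lams))

# The same data scored under the hurdle and the mixture specification.
zs, pis, lams = [0, 2, 0, 5], [0.3, 0.6, 0.2, 0.8], [2.0, 2.0, 1.0, 4.0]
ll_hurdle = log_likelihood(zs, pis, lams, zt_poisson_pmf)
ll_mixture = log_likelihood(zs, pis, lams, poisson_pmf)
```

Swapping `zt_poisson_pmf` for `poisson_pmf` is the only change needed to move between the hurdle and mixture classes, which mirrors the statement that the two classes differ only in the choice of the prevalence distribution.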

3.2 Computational challenges

Spatial two-part models are subject to computational obstacles in the high-dimensional setting. Both the occurrence \(O(\textbf{s})\) and prevalence \(P(\textbf{s})\) processes include latent spatial random fields, which can be computationally prohibitive to model for even moderately large datasets (more than 1000 observations) (Haran 2011). These fields require a costly evaluation of an n-dimensional multivariate normal likelihood function with \({\mathcal {O}}(n^{3})\) operations at each iteration of the MCMC algorithm. The highly correlated spatial random effects can also result in poorly mixing Markov chains (cf. Christensen and Waagepetersen 2002; Haran et al. 2003). Another consideration is modeling the cross-covariance between the occurrence and prevalence processes, which also requires expensive matrix inversions and computing determinants (Recta et al. 2012).
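The source of the \({\mathcal {O}}(n^{3})\) cost can be seen in a direct evaluation of the Gaussian process density. The sketch below is a simplified illustration, assuming an exponential correlation function (the Matérn family with smoothness 1/2) and hypothetical function names; the Cholesky factorization of the dense \(n\times n\) covariance matrix is the cubic-cost step that would be repeated for both latent fields at every MCMC iteration.

```python
import numpy as np

def exponential_correlation(coords, phi):
    """R[i, j] = exp(-||s_i - s_j|| / phi): Matern correlation with smoothness 1/2."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.exp(-np.sqrt((diff ** 2).sum(axis=-1)) / phi)

def gaussian_logpdf(w, sigma2, R):
    """Log density of N(0, sigma2 * R) at w; the Cholesky step costs O(n^3)."""
    n = w.shape[0]
    L = np.linalg.cholesky(sigma2 * R)          # dense factorization: the bottleneck
    alpha = np.linalg.solve(L, w)               # alpha = L^{-1} w
    log_det = 2.0 * np.log(np.diag(L)).sum()    # log |sigma2 * R|
    return -0.5 * (n * np.log(2.0 * np.pi) + log_det + alpha @ alpha)

rng = np.random.default_rng(0)
coords = rng.random((200, 2))                   # 200 locations in the unit square
R = exponential_correlation(coords, phi=0.2)
w = np.linalg.cholesky(R) @ rng.standard_normal(200)   # one draw of the latent field
value = gaussian_logpdf(w, sigma2=1.0, R=R)
```

Doubling the number of locations multiplies the factorization cost by roughly eight, which is why dense covariance evaluations become impractical well before n reaches the sizes of the datasets in Sect. 2.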

Past studies employed Gauss-Hermite quadrature (Min and Agresti 2005), expectation-maximization (Lambert 1992; Roeder et al. 1999), or restricted maximum quasi-likelihoods (Kim et al. 2012) to fit zero-inflated models. However, such approximations may not scale well with high-dimensional random effects (Neelon et al. 2016) that exhibit spatial correlation. In the spatial setting, Monte Carlo maximum likelihood (Lyashevska et al. 2016) and predictive processes (Wang et al. 2015) have been incorporated into the two-part modeling framework. Yet, these approaches are still computationally costly and they may not scale well to larger datasets. For instance, Lyashevska et al. (2016) required over 72 h to model 4029 zero-inflated spatial observations. Wang et al. (2015) model a dataset with 1325 locations using predictive processes (Banerjee et al. 2008), which scale at \({\mathcal {O}}(nm^2+m^3)\) where n is the number of observed locations and m denotes the number of knot locations. Selecting the proper knot locations (Guhaniyogi et al. 2011) can also be challenging.

4 A computationally efficient approach for fitting two-part models

In this section, we propose a scalable method (PICAR-Z) for fitting high-dimensional spatial two-part models. Our approach builds upon the projection-based intrinsic conditional autoregression (PICAR) framework (Lee and Haran 2022). We present the general hierarchical modeling framework, practical guidelines for implementation, and a discussion of the computational speedup associated with PICAR-Z.

Consider the vector of spatial random effects \(\textbf{W}=(W(\textbf{s}_{1}),\ldots ,W(\textbf{s}_{n}))'\), which can be approximated as a linear combination of spatial basis functions: \({\textbf{W}}\approx \varvec{\Phi }{\varvec{\delta }}\) where \(\varvec{\Phi }\) is an \(n\times p\) basis function matrix where each column denotes a basis function, and \({\varvec{\delta }}\in {\mathbb {R}}^p\) are the re-parameterized spatial random effects (or basis coefficients). Moreover, \({\varvec{\delta }}\sim {\mathcal {N}}(0,\Sigma _{\delta })\) where \(\Sigma _\delta \) is the \(p\times p\) covariance matrix for the coefficients. Basis functions can be interpreted as a set of distinct spatial patterns and a weighted sum of these patterns constructs a spatial random field. Basis representation has been a popular approach to model spatial data (cf. Cressie and Johannesson 2008; Banerjee et al. 2008; Hughes and Haran 2013; Lindgren et al. 2011; Rue et al. 2009; Haran et al. 2003; Griffith 2003; Higdon 1998; Nychka et al. 2015). Basis representations tend to be computationally efficient (Cressie and Wikle 2015) as they help bypass large matrix operations and reduce the dimensions of and correlation among the spatial random effects.

4.1 Projection-based intrinsic conditional autoregression (PICAR)

The projection-based intrinsic conditional autoregression (PICAR) approach consists of three components: (1) generate a triangular mesh on the spatial domain \({\mathcal {D}}\subset {\mathbb {R}}^{2}\); (2) construct a spatial field on the mesh vertices using nonparametric basis functions; (3) interpolate onto the observation locations using piece-wise linear basis functions. See Lee and Haran (2022) for extensions to other classes of hierarchical spatial models such as spatial generalized linear mixed models, spatially-varying coefficients models, and ordinal spatial models.

4.1.1 Mesh construction

Prior to fitting the model, we generate a mesh enveloping the observed spatial locations via Delaunay triangulation (Hjelle and Dæhlen 2006). The triangulation partitions the spatial domain \({\mathcal {D}}\) into a collection of non-intersecting irregular triangles. The triangles can share a common edge, corner (i.e., nodes or vertices), or both. The mesh generates a latent undirected graph \(G=\{V,E\}\), where \(V =\{1,2,\ldots , m\}\) are the mesh vertices and E are the edges. Each edge in E is represented as a pair (i, j) denoting the connection between vertices i and j. The graph G is characterized by its weights matrix \({\textbf{N}}\), an \(m\times m\) matrix where \(N_{ii}=0\) and \(N_{ij} =1\) when mesh node i is connected to node j and \(N_{ij} =0\) otherwise. The triangular mesh can be built using the R-INLA package (Lindgren and Rue 2015).
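As a small illustration of how the weights matrix \({\textbf{N}}\) follows from a triangulation, the sketch below builds it from a list of triangles given as triples of vertex indices. The toy mesh and the function name are hypothetical; in practice the triples would come from a mesh generator (for example, the `simplices` attribute of `scipy.spatial.Delaunay`, or a mesh exported from R-INLA).

```python
import numpy as np

def adjacency_from_triangles(triangles, m):
    """Build the m x m binary weights matrix N from triangle vertex indices."""
    N = np.zeros((m, m), dtype=int)
    for a, b, c in triangles:
        # Each triangle contributes its three edges; N stays symmetric with zero diagonal.
        for i, j in ((a, b), (b, c), (a, c)):
            N[i, j] = N[j, i] = 1
    return N

# Toy mesh: the unit square split into four triangles around a centre vertex (index 4).
triangles = [(0, 1, 4), (1, 2, 4), (2, 3, 4), (3, 0, 4)]
N = adjacency_from_triangles(triangles, m=5)
```

Here the centre vertex is connected to all four corners, while opposite corners (for example, vertices 0 and 2) are not neighbors because they share no triangle edge.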

4.1.2 Moran’s basis functions

Next, we construct the Moran’s basis functions (Hughes and Haran 2013; Griffith 2003) on the set of mesh vertices V of graph G. The Moran’s basis functions refer to the leading p eigenvectors of the Moran’s operator \((\textbf{I}-\textbf{1}\textbf{1}'/m)\textbf{N}(\textbf{I}-\textbf{1}\textbf{1}'/m)\), where \({\textbf{I}}\) is the \(m\times m\) identity matrix and \({\textbf{1}}\) is an m-dimensional vector of 1’s. Note that this operator is a component of the Moran’s I statistic:

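$$\begin{aligned} I(\textbf{N})=\frac{m}{\textbf{1}'\textbf{N}\textbf{1}}\,\frac{\textbf{Z}'(\textbf{I}-\textbf{1}\textbf{1}'/m)\textbf{N}(\textbf{I}-\textbf{1}\textbf{1}'/m)\textbf{Z}}{\textbf{Z}'(\textbf{I}-\textbf{1}\textbf{1}'/m)\textbf{Z}}, \end{aligned}$$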

a diagnostic of spatial dependence (Moran 1950) used for areal spatial data. Values of the Moran’s I above \(-\frac{1}{m-1}\) indicate positive spatial autocorrelation and values below \(-\frac{1}{m-1}\) indicate negative spatial autocorrelation (Griffith 2003). For the triangular mesh, the eigenvectors associated with positive eigenvalues represent the patterns of spatial clustering, or dependence, among the mesh nodes, and their corresponding eigenvalues denote the magnitude of clustering. We construct the Moran’s basis function matrix \({\textbf{M}}\in {\mathbb {R}}^{m\times p}\) by selecting the first p eigenvectors of the Moran’s operator where \(p\ll m\). Rank selection for p proceeds via an automated heuristic (Lee and Haran 2022) based on out-of-sample validation. A spatial random field can be constructed through a linear combination of the Moran’s basis functions (contained in matrix \({\textbf{M}}\)) and their corresponding weights \({\varvec{\delta }}\in {\mathbb {R}}^{p}\).
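The construction of \({\textbf{M}}\) can be sketched in a few lines. The example below is illustrative rather than the authors' code: it assumes a regular lattice graph as a stand-in for the triangular mesh, fixes p by hand instead of using the validation-based heuristic, and uses hypothetical function names.

```python
import numpy as np

def lattice_adjacency(k):
    """Adjacency matrix of a k x k lattice (a stand-in for the mesh graph G)."""
    m = k * k
    N = np.zeros((m, m))
    for r in range(k):
        for c in range(k):
            i = r * k + c
            if c + 1 < k:
                N[i, i + 1] = N[i + 1, i] = 1.0
            if r + 1 < k:
                N[i, i + k] = N[i + k, i] = 1.0
    return N

def morans_basis(N, p):
    """Return the p leading eigenvectors and eigenvalues of (I - 11'/m) N (I - 11'/m)."""
    m = N.shape[0]
    C = np.eye(m) - np.ones((m, m)) / m             # centering projector I - 11'/m
    eigvals, eigvecs = np.linalg.eigh(C @ N @ C)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:p]           # indices of the p largest
    return eigvecs[:, order], eigvals[order]

def morans_i(z, N):
    """Moran's I of a vector z on the graph with weights matrix N."""
    zc = z - z.mean()
    return (len(z) / N.sum()) * (zc @ N @ zc) / (zc @ zc)

N = lattice_adjacency(6)
M, lam = morans_basis(N, p=5)
field = M @ np.array([1.0, -0.5, 0.25, 0.0, 0.1])   # a field as a weighted sum of patterns
```

Because each column of `M` is a unit-norm, mean-zero eigenvector, its Moran's I equals its eigenvalue scaled by \(m/\textbf{1}'\textbf{N}\textbf{1}\), so the leading columns correspond to the most strongly clustered patterns on the graph.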

4.1.3 Piece-wise linear basis functions

To complete the PICAR approach, we introduce a set of piece-wise linear basis functions (Brenner and Scott 2007) to interpolate points within the triangular mesh (i.e., the undirected graph \(G=\{V,E\}\)). Following Lee and Haran (2022), we construct a spatial random field on the mesh nodes \({\tilde{\textbf{W}}}=(W(\textbf{v}_{1}),\ldots ,W(\textbf{v}_{m}))'\) where \(\textbf{v}_{i}\in V\) and then project, or interpolate, onto the observed locations \({\textbf{W}}=(W(\textbf{s}_{1}),\ldots ,W(\textbf{s}_{n}))'\) where \(\textbf{s}_{i}\in {\mathcal {D}}\). The latent spatial random field \({\textbf{W}}\) can be represented as \(\textbf{W}=\textbf{A}{\tilde{\textbf{W}}}\), where \({\textbf{A}}\) is an \(n\times m\) projector matrix containing the piece-wise linear basis functions. Each row of \({\textbf{A}}\) corresponds to an observation location \(\textbf{s}_{i}\in {\mathcal {D}}\), and each column corresponds to a mesh node \(\textbf{v}_{j}\in V\). The ith row of \({\textbf{A}}\) contains the weights to linearly interpolate \(W(\textbf{s}_{i})\) from the values at the mesh nodes. In practice, we use an \(n\times m\) projector matrix \({\textbf{A}}\) for fitting the hierarchical spatial model. For model validation and prediction, we generate an \(n_{CV} \times m\) projector matrix \({\textbf{A}}_{CV}\) that interpolates onto the \(n_{CV}\) validation locations.
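One way to obtain piece-wise linear interpolation weights is through barycentric coordinates within the triangle containing each location. The sketch below assumes this construction and a brute-force search over triangles; the function names and the toy mesh are hypothetical, and mesh software typically provides an optimized equivalent.

```python
import numpy as np

def barycentric_weights(point, tri_coords):
    """Weights w with w1 + w2 + w3 = 1 and sum_k w_k * v_k = point (tri_coords is 3 x 2)."""
    T = np.vstack([tri_coords.T, np.ones(3)])       # rows: x-coords, y-coords, ones
    return np.linalg.solve(T, np.append(point, 1.0))

def projector_matrix(points, vertices, triangles):
    """n x m matrix A; row i holds the interpolation weights for points[i]."""
    A = np.zeros((len(points), len(vertices)))
    for i, s in enumerate(points):
        for tri in triangles:
            idx = list(tri)
            w = barycentric_weights(s, vertices[idx])
            if np.all(w >= -1e-12):                 # s lies inside or on this triangle
                A[i, idx] = w
                break                               # rows stay zero for points off the mesh
    return A

# Toy mesh: unit square with a centre vertex, and three locations inside it.
vertices = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.5, 0.5]])
triangles = [(0, 1, 4), (1, 2, 4), (2, 3, 4), (3, 0, 4)]
points = np.array([[0.5, 0.25], [0.1, 0.5], [0.9, 0.9]])
A = projector_matrix(points, vertices, triangles)
```

Each row of `A` has at most three nonzero entries that sum to one, so `A` is sparse and any field that is linear over a triangle is reproduced exactly at locations inside it.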

The PICAR approach (Lee and Haran 2022) is specifically designed for spatial models in the continuous spatial domain and includes both a projection (P) and an intrinsic CAR (ICAR) component. The discretizing step of PICAR resembles models like ICAR for areal spatial data (Besag and Kooperberg 1995). However, because the full PICAR approach ultimately extends to the continuous domain and includes dimension-reduction, it allows practitioners to interpolate and quantify uncertainty at unobserved spatial locations on a continuous domain, as shown in examples in Sect. 6. As discussed in Lee and Haran (2022) and Hughes and Haran (2013), \({\varvec{\delta }}\sim {{\mathcal {N}}}({{\varvec{0}}}, (\textbf{M}'\textbf{Q}\textbf{M})^{-1})\) where \(\textbf{Q}\) represents the ICAR precision matrix. If \(rank(\textbf{M})=m\), then the latent process \({\tilde{\textbf{W}}}\sim {{\mathcal {N}}}({{\varvec{0}}}, \textbf{Q}^{-1})\), which is analogous to a Gaussian Markov random field with an ICAR precision matrix located on the mesh vertices. As for the projection component, PICAR represents the latent continuous spatial process as \(\textbf{W}= \textbf{A}{\tilde{\textbf{W}}}= \textbf{A}\textbf{M}{\varvec{\delta }}\) which is simply a projection of \({\tilde{\textbf{W}}}\) onto the continuous domain through projection matrix \(\textbf{A}\).
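The prior on the basis coefficients can be sketched as follows. This example assumes the standard ICAR form \(\textbf{Q}=\text{diag}(\textbf{N}\textbf{1})-\textbf{N}\) for the precision matrix, a lattice graph in place of the mesh, and a precision multiplier `tau`; these choices and all function names are illustrative assumptions rather than details confirmed by the text above.

```python
import numpy as np

def lattice_adjacency(k):
    """Adjacency matrix of a k x k lattice (a stand-in for the mesh graph)."""
    m = k * k
    N = np.zeros((m, m))
    for r in range(k):
        for c in range(k):
            i = r * k + c
            if c + 1 < k:
                N[i, i + 1] = N[i + 1, i] = 1.0
            if r + 1 < k:
                N[i, i + k] = N[i + k, i] = 1.0
    return N

def icar_precision(N):
    """Assumed ICAR precision Q = diag(N 1) - N for a binary weights matrix N."""
    return np.diag(N.sum(axis=1)) - N

def morans_basis(N, p):
    """Leading p eigenvectors of the Moran's operator (I - 11'/m) N (I - 11'/m)."""
    m = N.shape[0]
    C = np.eye(m) - np.ones((m, m)) / m
    eigvals, eigvecs = np.linalg.eigh(C @ N @ C)
    return eigvecs[:, np.argsort(eigvals)[::-1][:p]]

def draw_delta(rng, M, Q, tau=1.0):
    """Draw delta ~ N(0, (tau * M'QM)^{-1}) using a Cholesky factor of the precision."""
    K = tau * (M.T @ Q @ M)                     # p x p, so the factorization is cheap
    L = np.linalg.cholesky(K)                   # K = L L'
    return np.linalg.solve(L.T, rng.standard_normal(M.shape[1]))

rng = np.random.default_rng(42)
N = lattice_adjacency(5)
Q = icar_precision(N)
M = morans_basis(N, p=4)
delta = draw_delta(rng, M, Q, tau=2.0)
W_mesh = M @ delta                              # field on the mesh vertices
```

The only matrix factorized is \(p\times p\), and multiplying `W_mesh` by a projector matrix \(\textbf{A}\) would carry the field from the mesh vertices to the observation locations.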

PICAR’s projected Moran’s basis functions (\(\textbf{A}\textbf{M}\)) and the restricted Moran’s basis functions (HH) from Hughes and Haran (2013) share some features. Both bases are eigenvectors of a variant of the Moran’s operator, \(\mathbf {(I-11'/}m\mathbf {)N(I-11'/}m)\) for PICAR and \(\mathbf {(I-\textbf{P})N(I-\textbf{P})}\) for HH where \(\textbf{P}=\textbf{X}(\textbf{X}'\textbf{X})^{-1}\textbf{X}'\). However, PICAR’s bases can project onto the continuous spatial domain using projector matrix \(\textbf{A}\), while the HH bases are limited to the discrete spatial domain. Recent critiques of low-rank restricted spatial regression models (RSR) (Zimmerman and Ver Hoef 2022; Khan and Calder 2022) show that RSR approaches like HH have poor inferential and predictive performance compared to non-RSR SGLMMs and even non-spatial models. However, PICAR and PICAR-Z do not suffer from these issues because neither is an RSR approach; the focus is entirely on creating a computationally efficient approach and no attempt is made to examine spatial confounding.

4.2 PICAR approach for zero-inflated spatial data

The PICAR approach readily extends to spatial two-part models (Eq. 2). Using PICAR, we approximate the latent spatial random effects of both the occurrence \(O(\textbf{s})\) and prevalence \(P(\textbf{s})\) processes as expansions of Moran’s basis functions (Griffith 2003; Hughes and Haran 2013). PICAR-Z approximates the spatial random effects for the occurrence and prevalence processes, respectively, as \(\textbf{W}_{o}\approx \textbf{A}_{o}\textbf{M}_{o}{\varvec{\delta }}_{o}\) and \(\textbf{W}_{p}\approx \textbf{A}_{p}\textbf{M}_{p}{\varvec{\delta }}_{p}\) using projector matrices \(\textbf{A}_{o}\) and \(\textbf{A}_{p}\), Moran’s basis functions matrices \(\textbf{M}_{o}\) and \(\textbf{M}_{p}\), and basis coefficients \({\varvec{\delta }}_{o}\) and \({\varvec{\delta }}_{p}\). The PICAR-Z hierarchical framework for spatial two-part models is:

$$\begin{aligned} \begin{array}{ll} {{\textbf {Data Model: }}} &{} \qquad Z(\textbf{s})|O(\textbf{s}),P(\textbf{s}) \sim F\big (\cdot |O(\textbf{s}),P(\textbf{s})\big )\\ {{\textbf {Process Model: }}} &{} \qquad O(\textbf{s})|\pi (\textbf{s}) \sim \text{ Bern }(\pi (\textbf{s}))\\ &{} \qquad P(\textbf{s})|\theta (\textbf{s}) \sim {\tilde{F}}(\cdot | \theta (\textbf{s}))\\ {{\textbf {Sub-process Model 1: }}} &{} \qquad \pi (\textbf{s})|\eta _{o}(\textbf{s})=g^{-1}_{o}(\eta _{o}(\textbf{s}))\\ {{\textbf {(Occurrence) }}} &{} \qquad \eta _{o}(\textbf{s})|{\varvec{ \beta }}_{o},{\varvec{\delta }}_{o}= \textbf{X}(s)^{\prime }{\varvec{ \beta }}_{o} + [\textbf{A}_{o}\textbf{M}_{o}{\varvec{\delta }}_{o}](\textbf{s})\\ &{} \qquad {\varvec{\delta }}_{o}|\tau _{o} \sim {\mathcal {N}}({{\varvec{0}}},\tau _{o}^{-1}(\textbf{M}_{o}'\textbf{Q}_{o}\textbf{M}_{o})^{-1})\\ {{\textbf {Sub-process Model 2: }}} &{} \qquad \theta (\textbf{s})|\eta _{p}(\textbf{s})=g^{-1}_{p}(\eta _{p}(\textbf{s}))\\ {{\textbf {(Prevalence)}}} &{} \qquad \eta _{p}(\textbf{s})|{\varvec{ \beta }}_{p},{\varvec{\delta }}_{p}=\textbf{X}(s)^{\prime }{\varvec{ \beta }}_{p} + [\textbf{A}_{p}\textbf{M}_{p}{\varvec{\delta }}_{p}](\textbf{s}) \\ &{} \qquad {\varvec{\delta }}_{p}|\tau _{p}\sim {\mathcal {N}}({{\varvec{0}}},\tau _{p}^{-1}(\textbf{M}_{p}'\textbf{Q}_{p}\textbf{M}_{p})^{-1}),\\ {{\textbf {Parameter Model: }}} &{} \qquad {\varvec{ \beta }}_{o}\sim {{\mathcal {N}}}({\varvec{\mu }}_{\beta _o}, \Sigma _{\beta _o}),\quad {\varvec{ \beta }}_{p}\sim {{\mathcal {N}}}({\varvec{\mu }}_{\beta _p}, \Sigma _{\beta _p})\\ &{} \qquad \tau _o \sim {{\mathcal {G}}}(\alpha _{\tau _o},\beta _{\tau _o}), \quad \tau _p \sim {{\mathcal {G}}}(\alpha _{\tau _p},\beta _{\tau _p}) \end{array} \end{aligned}$$

where \(F\big (\cdot |O(\textbf{s}),P(\textbf{s})\big )\) is the distribution of the specified two-part model such as the hurdle and mixture models. \({\tilde{F}}(\cdot |\theta (\textbf{s}))\) denotes the distribution function of the prevalence process. \(\textbf{A}_{o}\) and \(\textbf{A}_{p}\) are the projector matrices for the occurrence and prevalence processes, respectively. For the occurrence and prevalence processes, we now incorporate the Moran’s basis functions matrices \({\textbf{M}}_{o}\) and \({\textbf{M}}_{p}\), the basis coefficients \({\varvec{\delta }}_{o}\) and \({\varvec{\delta }}_{p}\), and the precision parameters \(\tau _{o}\) and \(\tau _{p}\). \(\textbf{Q}\), the \(m \times m\) prior precision matrix for the mesh vertices, is typically fixed prior to model fitting (see Lee and Haran (2022) for additional details). \([\textbf{A}_{o}\textbf{M}_{o}{\varvec{\delta }}_{o}](\textbf{s})\) denotes the value of the basis expansion \(\textbf{A}_{o}\textbf{M}_{o}{\varvec{\delta }}_{o}\) corresponding to location \(\textbf{s}\). The interpretation of \([\textbf{A}_{p}\textbf{M}_{p}{\varvec{\delta }}_{p}]\) follows similarly.
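To make the two sub-process models concrete, the sketch below evaluates the linear predictors \(\eta _{o}(\textbf{s})\) and \(\eta _{p}(\textbf{s})\) and maps them to \(\pi (\textbf{s})\) and \(\theta (\textbf{s})\). It assumes a logit link for occurrence and a log link for prevalence (as in the count examples later in the paper). The projector and basis matrices are random placeholders with the right dimensions, not objects computed from a real mesh, and the same placeholder projector is used for both processes.

```python
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

def linear_predictor(X, beta, A, M, delta):
    """eta(s) = X(s)' beta + [A M delta](s), evaluated at all n locations."""
    return X @ beta + A @ (M @ delta)

rng = np.random.default_rng(0)
n, k, m, p_o, p_p = 8, 2, 20, 3, 5

X = rng.normal(size=(n, k))

# Placeholder projector: each row holds three non-negative weights that sum to one
A = np.zeros((n, m))
for i in range(n):
    A[i, rng.choice(m, size=3, replace=False)] = rng.dirichlet(np.ones(3))

# Placeholder orthonormal bases standing in for M_o (m x p_o) and M_p (m x p_p)
M_o = np.linalg.qr(rng.normal(size=(m, p_o)))[0]
M_p = np.linalg.qr(rng.normal(size=(m, p_p)))[0]

beta_o = np.array([1.0, 1.0])
beta_p = np.array([1.0, 1.0])
eta_o = linear_predictor(X, beta_o, A, M_o, rng.normal(size=p_o))
eta_p = linear_predictor(X, beta_p, A, M_p, rng.normal(size=p_p))

pi = inv_logit(eta_o)      # occurrence probability surface (logit link)
theta = np.exp(eta_p)      # prevalence intensity surface (log link)
```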

The PICAR-Z approach can be modified to capture the cross-covariance between the occurrence \(O(\textbf{s})\) and prevalence \(P(\textbf{s})\) processes. Past studies have examined methodology for estimating the cross-covariance between two spatial random fields (Oliver 2003; Recta et al. 2012). We extend the general modeling framework from Recta et al. (2012) to high-dimensional settings by imposing correlation on the dimension-reduced Moran’s basis coefficients \({\varvec{\delta }}_{o}\) and \({\varvec{\delta }}_{p}\). We provide additional details for modeling the cross-correlation under the PICAR-Z framework in the supplement.

4.3 Tuning mechanisms

Though most of the PICAR-Z approach is readily automated, there are key tuning mechanisms left to the practitioner: (1) selecting the rank of Moran’s basis functions matrices (\(p_o\) and \(p_p\)); (2) specifying the precision matrices of the mesh vertices (\(\textbf{Q}_o\) and \(\textbf{Q}_p\)); and (3) identifying the appropriate two-part model. Here, we examine these tuning mechanisms in detail and provide practical guidelines for implementation.

First, we provide an automated heuristic to select the appropriate ranks (\(p_{o}\) and \(p_{p}\)) of the Moran’s basis function matrices for both processes, \({\textbf{M}}_{o}\) and \({\textbf{M}}_{p}\). We begin by generating two augmented datasets, \({\textbf{Z}}_{o}^{*}\) and \({\textbf{Z}}_{p}^{*}\), constructed from the original zero-inflated spatial dataset \({\textbf{Z}}\). The first dataset is generated as:

$$\begin{aligned} Z_{o}^{*}(s) = \left\{ \begin{array}{ll} 0, &{} \text{ if } Z(s)=0 \\ 1, &{} \text{ if } Z(s)>0. \end{array} \right. \end{aligned}$$
(4)

The second dataset \({\textbf{Z}}_{p}^{*}\in {\mathbb {R}}^{n_{p}}\) is the collection of all observations such that \(Z(s)>0\) and \(n_{p}\) corresponds to the sample size of \({\textbf{Z}}_{p}^{*}\). Next, we generate a set \({\mathcal {P}}\) consisting of h equally spaced points within the interval [2, P] where P is the maximum rank and h is the interval resolution (\(h=P-1\) by default). Here, \(P<m\) and both P and h are chosen by the user.
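The data augmentation and the candidate rank set \({\mathcal {P}}\) described above can be sketched as follows. The function names `augment` and `rank_grid` are illustrative, and rounding the equally spaced points to integers is an assumption, since ranks must be whole numbers.

```python
import numpy as np

def augment(Z):
    """Split a zero-inflated response into the binary indicator Z_o* and the positive part Z_p*."""
    Z = np.asarray(Z)
    positive = Z > 0
    Z_o = positive.astype(int)     # 1 if Z(s) > 0, 0 if Z(s) = 0
    Z_p = Z[positive]              # the n_p nonzero observations
    return Z_o, Z_p, positive

def rank_grid(P, h=None):
    """h equally spaced candidate ranks in [2, P]; h = P - 1 by default."""
    if h is None:
        h = P - 1
    pts = np.linspace(2, P, h)
    return np.unique(np.round(pts).astype(int))

Z = np.array([0, 3, 0, 0, 1, 7, 0, 2])
Z_o, Z_p, positive = augment(Z)
default_grid = rank_grid(10)        # every integer rank from 2 to 10
coarse_grid = rank_grid(50, h=5)    # a coarser grid for a larger maximum rank
```

The boolean mask `positive` is returned so that the corresponding rows of the covariate and projector matrices can be extracted for the prevalence fit.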

For the augmented dataset \({\textbf{Z}}_{o}^{*}\), we proceed in the following way. For each \(p\in {\mathcal {P}}\), we construct an \(n\times (k+p)\) matrix of augmented covariates \({\tilde{X}}_{o}=[X \quad \mathbf {A_{o}M}_{p}]\) where \(X\in {\mathbb {R}}^{n\times k}\) is the original covariate matrix, \({\textbf{A}}_{o} \in {\mathbb {R}}^{n\times m}\) is the projector matrix, and \({\textbf{M}}_{p} \in {\mathbb {R}}^{m\times p}\) are the leading p eigenvectors of the Moran’s operator. Next, we use maximum likelihood approaches to fit the appropriate generalized linear model (GLM) for binary responses with a logit or probit link function. Finally, we set \(p_{o}\) to be the rank p that yields the lowest out-of-sample root mean squared prediction error (rmspe) or the highest area under the ROC curve (AUC).
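The occurrence-rank search can be sketched as a loop over candidate ranks, fitting a logistic GLM on a training split and scoring it on a held-out split. The sketch below is a minimal NumPy version under several assumptions: the logit link is used, AUC is the selection criterion, the matrix `AM` is a random stand-in for \(\textbf{A}_{o}\textbf{M}\) with columns ordered by eigenvalue, and a tiny ridge term is added purely as a numerical safeguard (it is not part of the procedure in the text).

```python
import numpy as np

def fit_logistic(X, y, n_iter=50, ridge=1e-6):
    """Logistic regression via Newton-Raphson; `ridge` only stabilises the linear solve."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30, 30)
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = mu * (1.0 - mu)
        grad = X.T @ (y - mu) - ridge * beta
        hess = X.T @ (X * w[:, None]) + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta

def auc(y, score):
    """Probability that a random positive case outscores a random negative case (ties count 1/2)."""
    diff = score[y == 1][:, None] - score[y == 0][None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

def select_rank(X, AM, y, train, test, ranks):
    """Return the candidate rank with the highest out-of-sample AUC."""
    best_rank, best_auc = None, -np.inf
    for p in ranks:
        X_aug = np.hstack([X, AM[:, :p]])          # n x (k + p) augmented covariates
        beta = fit_logistic(X_aug[train], y[train])
        a = auc(y[test], X_aug[test] @ beta)
        if a > best_auc:
            best_rank, best_auc = p, a
    return best_rank, best_auc

rng = np.random.default_rng(42)
n, k, P = 300, 2, 10
X = rng.normal(size=(n, k))
AM = rng.normal(size=(n, P))                        # stand-in for A_o M (leading P columns)
eta = X @ np.array([1.0, 1.0]) + AM[:, :3] @ np.array([2.0, -2.0, 1.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))     # binary indicator Z_o*
train, test = np.arange(200), np.arange(200, 300)
ranks = range(2, P + 1)
p_o, best_auc = select_rank(X, AM, y, train, test, ranks)
```

The same loop applies to the prevalence process after swapping in the appropriate GLM for positive responses and using rmspe (lower is better) as the criterion.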

We implement a similar procedure for the second augmented dataset \({\textbf{Z}}_{p}^{*}\in {\mathbb {R}}^{n_{p}}\). For each \(p\in {\mathcal {P}}\), we construct an \(n_{p}\times (k+p)\) matrix of augmented covariates \({\tilde{\textbf{X}}}_{p}=\begin{bmatrix} \textbf{X}_{p}&{\textbf{A}}_{p}\textbf{M}_{p} \end{bmatrix}\) where \(\textbf{X}_{p}\in {\mathbb {R}}^{n_{p}\times k}\) is the matrix of covariates, \({\textbf{A}}_{p} \in {\mathbb {R}}^{n_{p}\times m}\) is the projector matrix, and \(\mathbf {M_{p}} \in {\mathbb {R}}^{m\times p}\) are the leading p eigenvectors of the Moran’s operator. Note that the rows of \(\textbf{X}_{p}\), \({\textbf{A}}_{p}\), and \(\mathbf {M_{p}}\) correspond to the \(n_{p}\) locations with nonzero values. Next, we use maximum likelihood approaches to fit the appropriate generalized linear model (GLM) for positive responses. For count data, the likelihood function is a zero-truncated Poisson distribution. For semi-continuous data in the hurdle model framework, we employ a lognormal distribution as the likelihood function. For semi-continuous data in the mixture model framework, we simply fit the traditional linear model. Then, we set \(p_{p}\) to be the rank p that yields the lowest out-of-sample root mean squared prediction error (rmspe).

Next, we provide some choices for \({\textbf{Q}}\), the prior precision matrix for the mesh vertices \({\tilde{\textbf{W}}}\). By default, we set \({\textbf{Q}}\) to be the precision matrix of an intrinsic conditional auto-regressive model (ICAR). Another option would set \({\textbf{Q}}\) as the precision matrix of a conditional auto-regressive model (CAR) with estimable autocorrelation parameter \(\rho \in (0,1)\) such that \({\textbf{Q}}=(\text{diag}(\textbf{N}\textbf{1})-\rho {\textbf{N}})\), where \({\textbf {N}}\) is the adjacency matrix. In practice, estimating \(\rho \) could potentially offset the computational gains of the PICAR-Z approach. Note that there are two latent processes (occurrence and prevalence), hence we must estimate two additional parameters, \(\rho _o\) and \(\rho _p\), and construct two precision matrices \(\textbf{Q}_{o}\) and \(\textbf{Q}_{p}\). At each iteration of the MCMC algorithm, both \(\rho _o\) and \(\rho _p\) must be sampled, and determinants \(|\textbf{Q}_{o}|\) and \(|\textbf{Q}_{p}|\) must be recomputed. This would increase computational costs on the order of \({\mathcal {O}}(\frac{2}{3}m^{3})\) operations (i.e., two additional Cholesky decompositions of an \(m\times m\) precision matrix). Another alternative is setting \({\textbf{Q}}=I\), where the basis coefficients \({\varvec{\delta }}_o\) and \({\varvec{\delta }}_p\) are uncorrelated a priori.
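The three choices of \({\textbf{Q}}\) can be written compactly, as in the sketch below. It assumes the common parameterization \({\textbf{Q}}=\text{diag}(\textbf{N}\textbf{1})-\rho {\textbf{N}}\), so that \(\rho =1\) recovers the ICAR precision, and it uses a five-vertex path graph as a toy adjacency structure. The log-determinant helper illustrates the Cholesky-based computation that would have to be repeated at every MCMC iteration if \(\rho \) were estimated.

```python
import numpy as np

def car_precision(N, rho=1.0):
    """diag(N 1) - rho * N; rho = 1 gives the (rank-deficient) ICAR precision."""
    return np.diag(N.sum(axis=1)) - rho * N

def logdet_chol(Q):
    """Log-determinant of a positive definite matrix from its Cholesky factor."""
    L = np.linalg.cholesky(Q)
    return 2.0 * np.sum(np.log(np.diag(L)))

# Toy adjacency matrix: a path graph on five vertices
N = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)

Q_icar = car_precision(N)             # default choice
Q_car = car_precision(N, rho=0.9)     # proper CAR with a fixed rho for illustration
Q_id = np.eye(N.shape[0])             # coefficients uncorrelated a priori

logdet_car = logdet_chol(Q_car)       # recomputed whenever rho changes
```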

Specifying the class (hurdle vs. mixture) of two-part model is an active area of research (cf. Feng 2021; Vuong 1989; Neelon et al. 2016). Past studies have selected the appropriate model class using the Akaike Information Criterion (AIC) (Feng 2021) or the Vuong test statistic (Vuong 1989). However, the choice between a hurdle and mixture model depends on the aims of the investigator and prior scientific knowledge regarding the zero-generating processes (i.e., whether the prevalence process should also generate zeros). For practitioners, we suggest conducting a sensitivity analysis using both mixture and hurdle models, and then selecting the appropriate model based on out-of-sample validation. Since PICAR-Z is computationally efficient and scales to larger datasets, conducting such a sensitivity analysis should be feasible in many settings.

4.4 Computational advantages

The PICAR-Z approach requires shorter walltimes per iteration (of the MCMC algorithm) as well as fewer iterations for the Markov chain to converge. The computational speedup results from exploiting lower-dimensional and weakly correlated basis coefficients \({\varvec{\delta }}_{o}\) and \({\varvec{\delta }}_{p}\) and also bypassing expensive matrix operations (e.g. Cholesky decompositions). The PICAR-Z approach has a computational complexity of \({\mathcal {O}}(2np)\) as opposed to \({\mathcal {O}}(2n^{3})\) for the full hierarchical spatial two-part model.

We examine mixing in MCMC algorithms within the context of spatial two-part models. The PICAR-Z approach generates a faster mixing MCMC algorithm than fitting the full two-part model using the ‘reparameterized’ method (Christensen and Waagepetersen 2002) and a competing approach using bi-square basis functions (Cressie and Johannesson 2008). This is corroborated by the larger effective sample size per second (ES/sec), which approximates the rate at which samples are produced from an MCMC algorithm that are equivalent to samples from an IID sampler. Larger values of ES/sec indicate faster mixing. In the simulated examples (Sect. 5), the PICAR-Z approach returns a larger ES/sec than the ‘reparameterized’ approach across all model parameters and spatial random effects (see supplement).
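For readers who want to compute ES/sec themselves, the sketch below uses a batch-means estimate of the effective sample size. This is one common estimator and is an assumption here; the text does not state which ESS estimator was used. The default of roughly \(\sqrt{n}\) batches is likewise a conventional choice, and the walltime passed to `es_per_sec` is a made-up number for illustration.

```python
import numpy as np

def ess_batch_means(chain, n_batches=None):
    """Effective sample size: n * (sample variance) / (batch-means asymptotic variance)."""
    x = np.asarray(chain, dtype=float)
    if n_batches is None:
        n_batches = int(np.floor(np.sqrt(x.size)))
    b = x.size // n_batches                      # batch length
    x = x[: b * n_batches]                       # drop any leftover draws
    batch_means = x.reshape(n_batches, b).mean(axis=1)
    sigma2_bm = b * batch_means.var(ddof=1)      # asymptotic variance estimate
    return x.size * x.var(ddof=1) / sigma2_bm

def es_per_sec(chain, seconds):
    return ess_batch_means(chain) / seconds

rng = np.random.default_rng(3)
n = 10_000
iid_chain = rng.standard_normal(n)

# Strongly autocorrelated AR(1) chain, mimicking a slowly mixing sampler
ar_chain = np.empty(n)
ar_chain[0] = 0.0
for t in range(1, n):
    ar_chain[t] = 0.9 * ar_chain[t - 1] + rng.standard_normal()

ess_iid = ess_batch_means(iid_chain)
ess_ar = ess_batch_means(ar_chain)
rate_iid = es_per_sec(iid_chain, seconds=2.0)    # hypothetical walltime of 2 seconds
```

The autocorrelated chain yields a much smaller effective sample size than the independent one for the same number of draws, which is the behavior ES/sec is meant to expose.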

For PICAR, the two major computational bottlenecks are constructing the Moran’s operator and computing its k leading eigencomponents. The Moran’s operator requires the matrix operation \(\mathbf {(I-11'/}m){\textbf{N}}(\mathbf {I-11'}/m)\) which results in \(2m^{3}-m^{2}\) floating point operations (FLOPs) where m is the number of mesh vertices. For dense mesh structures (large m), we can generate the Moran’s operator by leveraging the embarrassingly parallel operations as well as the sparsity of the weights matrix \({\textbf{N}}\). Next, the first k eigencomponents of the Moran’s operator can be computed using a partial eigendecomposition approach such as the Implicitly Restarted Arnoldi Method (Lehoucq et al. 1998) from the RSpectra package (Qiu and Mei 2019). Since the PICAR-Z approach generally selects the leading \(p_o\) or \(p_p\) eigenvectors where \(p_o\ll m\) and \(p_p\ll m\), an expensive full eigendecomposition of the Moran’s operator is not necessary.
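The partial eigendecomposition can be sketched in Python as an analogue of the RSpectra workflow. The sketch uses SciPy's `eigsh`, which wraps ARPACK's implicitly restarted Lanczos method for symmetric problems; this is a substitute for, not a reproduction of, the R implementation cited above. The Moran's operator is never formed explicitly: since \((\textbf{I}-\textbf{11}'/m)\textbf{x}\) simply centers \(\textbf{x}\), each matrix-vector product needs only one sparse multiplication with \(\textbf{N}\). A rectangular lattice again stands in for the mesh graph.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, eigsh

def lattice_adjacency_sparse(nrow, ncol):
    """Sparse rook-neighbour adjacency matrix (stand-in for the mesh adjacency N)."""
    rows, cols = [], []
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            if c + 1 < ncol:
                rows += [i, i + 1]
                cols += [i + 1, i]
            if r + 1 < nrow:
                rows += [i, i + ncol]
                cols += [i + ncol, i]
    m = nrow * ncol
    return csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(m, m))

def moran_operator(N):
    """Matrix-free (I - 11'/m) N (I - 11'/m): center, multiply by N, center again."""
    m = N.shape[0]
    def matvec(x):
        x = np.asarray(x).ravel()
        y = N @ (x - x.mean())
        return y - y.mean()
    return LinearOperator((m, m), matvec=matvec, dtype=float)

def leading_moran_eigs(N, k):
    """k largest eigenvalues and eigenvectors, sorted in decreasing order."""
    vals, vecs = eigsh(moran_operator(N), k=k, which="LA")
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

N = lattice_adjacency_sparse(15, 17)     # m = 255 vertices
m, k = N.shape[0], 10
vals, vecs = leading_moran_eigs(N, k)
```

For very large meshes, the matrix-free product avoids storing the dense \(m\times m\) operator altogether.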

5 Simulation study

In this section, we conduct an extensive simulation study focusing on four commonly-used spatial two-part models: (1) hurdle model with count data; (2) mixture model with count data; (3) hurdle model with semi-continuous data; and (4) mixture model with semi-continuous data. In addition, we provide comparisons to a low-rank approach with nested bisquare basis functions (Sengupta and Cressie 2013; Cressie and Johannesson 2008) and a ‘reparameterized’ method (Christensen and Waagepetersen 2002).

5.1 Simulation study design

For all four two-part models, we simulate \(B=100\) samples for a total of \(4\times 100=400\) datasets in the simulation study. In each sample, we generate a set of 1400 randomly-selected locations on the unit square. 1000 observations are allocated for model-fitting and the remaining 400 reserved for model validation. We chose a smaller sample size \(n=1000\) to allow for comparisons against a ‘reparameterized’ method, which may be difficult to implement with larger datasets (see Sect. 3.2). Please see the supplement for details on the ‘reparameterized’ approach.

For each sample, we randomly generate a matrix of covariates \(\textbf{X}=[\textbf{X}_1, \textbf{X}_2]\). We use the same regression coefficients \({\varvec{ \beta }}_{o}={\varvec{ \beta }}_{p}=(1,1)^{T}\) for all datasets in the simulation study. The spatial random effects \(\textbf{W}_o\) and \(\textbf{W}_p\) are generated from the multivariate Gaussian process proposed in Recta et al. (2012). Note that the prohibitively high computational costs made it challenging to explore model structures carefully and extend them to higher-dimensional settings.

$$\begin{aligned} \begin{bmatrix} \textbf{W}_o\\ \textbf{W}_p \end{bmatrix} \sim {{\mathcal {N}}}\Bigg ( \begin{bmatrix} {{\varvec{0}}}\\ {{\varvec{0}}}\end{bmatrix}, \begin{bmatrix} \textbf{C}(\cdot |\nu _o,\phi _o,\sigma ^2_o) &{} \rho \textbf{L}_o\textbf{L}_p^{T}\\ \rho \textbf{L}_o\textbf{L}_p^{T}&{} \textbf{C}(\cdot |\nu _p,\phi _p,\sigma ^2_p ) \end{bmatrix} \Bigg ). \end{aligned}$$

where \(\textbf{C}(\cdot |\nu _o,\phi _o,\sigma ^2_o)\) and \(\textbf{C}(\cdot |\nu _p,\phi _p,\sigma ^2_p )\) are covariance matrices for \(\textbf{W}_o\) and \(\textbf{W}_p\), respectively. \(\rho \) represents cross-correlation between the occurrence and prevalence processes at the same location. We fix \(\rho =0.7\) to impose moderate positive cross-correlation between the occurrence and prevalence processes. \(\textbf{L}_o\) and \(\textbf{L}_p\) are the lower-triangular Cholesky factors of \(\textbf{C}(\cdot |\nu _o,\phi _o,\sigma ^2_o)\) and \(\textbf{C}(\cdot |\nu _p,\phi _p,\sigma ^2_p )\), respectively. That is, \(\textbf{C}(\cdot |\nu _o,\phi _o,\sigma ^2_o )=\textbf{L}_o\textbf{L}_o^{T}\) and \(\textbf{C}(\cdot |\nu _p,\phi _p,\sigma ^2_p )=\textbf{L}_p\textbf{L}_p^{T}\). The covariance matrices \(\textbf{C}(\cdot |\nu _o,\phi _o,\sigma ^2_o)\) and \(\textbf{C}(\cdot |\nu _p,\phi _p,\sigma ^2_p )\) are from the Matérn class (cf. Rasmussen and Williams 2006; Rasmussen 2004) of covariance functions with parameters \(\nu _o=\nu _p=0.5\), \(\sigma ^{2}_o=\sigma ^{2}_p=1\), and \(\phi _o=\phi _p=0.2\). We use the exponential covariance function (\(\nu =0.5\)) to generate a “rough” latent spatial process that is not mean square differentiable (Rasmussen 2004).
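One way to draw from a bivariate process with this cross-covariance is sketched below. It assumes the exponential covariance is parameterized as \(\sigma ^{2}\exp (-d/\phi )\), which is one common convention for the Matérn class with \(\nu =0.5\); the function names and the tiny diagonal jitter are illustrative choices. Two independent standard normal vectors are combined so that \(\text{Cov}(\textbf{W}_o,\textbf{W}_p)=\rho \textbf{L}_o\textbf{L}_p^{T}\); the lower-left block of the joint covariance is then its transpose, which coincides with \(\rho \textbf{L}_o\textbf{L}_p^{T}\) when the two processes share covariance parameters, as they do in this study.

```python
import numpy as np

def exp_cov(coords, sigma2=1.0, phi=0.2):
    """Exponential (Matern, nu = 0.5) covariance matrix, assuming sigma2 * exp(-d / phi)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return sigma2 * np.exp(-d / phi)

def simulate_cross_correlated(coords, rho, rng, jitter=1e-10):
    """Draw (W_o, W_p) with Cov(W_o, W_p) = rho * L_o L_p^T."""
    n = coords.shape[0]
    L_o = np.linalg.cholesky(exp_cov(coords) + jitter * np.eye(n))
    L_p = np.linalg.cholesky(exp_cov(coords) + jitter * np.eye(n))
    z1 = rng.standard_normal(n)
    z2 = rng.standard_normal(n)
    W_o = L_o @ z1
    W_p = L_p @ (rho * z1 + np.sqrt(1.0 - rho**2) * z2)
    return W_o, W_p, L_o, L_p

rng = np.random.default_rng(11)
coords = rng.uniform(size=(50, 2))       # random locations on the unit square
rho = 0.7
W_o, W_p, L_o, L_p = simulate_cross_correlated(coords, rho, rng)
```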

We first generate realizations from the latent occurrence process \(O(\textbf{s})\) such that the underlying probability surface is modeled as \(\pi (\textbf{s})=\text{ logit}^{-1}(\textbf{X}(s)^{\prime }{\varvec{ \beta }}_{O}+\textbf{W}_{O}(s))\). Next, we generate realizations from the prevalence process \(P(\textbf{s})\) from the corresponding prevalence distribution \({\tilde{F}}(\cdot |\theta (\textbf{s}))\) (Eq. 2) with spatially varying intensity (or mean) processes \(\theta (\textbf{s})\). We use a zero-truncated Poisson, lognormal, Poisson, and Type I Tobit distribution for the hurdle count, hurdle semi-continuous, mixture count, and mixture semi-continuous cases, respectively. For the hurdle count and mixture count cases, we model the underlying intensity process as \(\theta (\textbf{s})=\exp \{\textbf{X}(\textbf{s})^{\prime }{\varvec{ \beta }}_P + \textbf{W}_p(\textbf{s})\}\). For the semi-continuous models, we specify \(\theta (\textbf{s})=({\varvec{\mu }}(\textbf{s}),\tau ^2)\) with the mean process \({\varvec{\mu }}(\textbf{s})=\textbf{X}(\textbf{s})^{\prime }{\varvec{ \beta }}_P + \textbf{W}_p(\textbf{s})\) and nugget variance \(\tau ^2=0.1\). Finally, the observed data are drawn from the respective distribution of the spatial two-part model \(F(\cdot |O(\textbf{s}),P(\textbf{s}))\). Our simulated datasets exhibit balance in the number of zero- and nonzero-valued observations (see summary in the supplement).
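The data-generating step for the two count cases can be sketched as below. The sketch assumes the usual constructions: in the hurdle model the observation is zero when \(O(\textbf{s})=0\) and a zero-truncated Poisson draw otherwise, while in the mixture model the observation is \(O(\textbf{s})P(\textbf{s})\) with an ordinary Poisson \(P(\textbf{s})\) that may itself be zero. The zero-truncated sampler uses simple rejection, which is adequate unless \(\theta (\textbf{s})\) is extremely small, and the linear predictors are random placeholders rather than values built from covariates and spatial effects.

```python
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

def rztpois(theta, rng):
    """Zero-truncated Poisson draws by rejection: redraw any zeros until none remain."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    draws = rng.poisson(theta)
    zero = draws == 0
    while zero.any():
        draws[zero] = rng.poisson(theta[zero])
        zero = draws == 0
    return draws

def simulate_hurdle_counts(eta_o, eta_p, rng):
    """Hurdle count model: zeros come only from the occurrence process."""
    O = rng.binomial(1, inv_logit(eta_o))
    P = rztpois(np.exp(eta_p), rng)
    return O * P, O, P

def simulate_mixture_counts(eta_o, eta_p, rng):
    """Mixture (zero-inflated Poisson) model: the prevalence process can also yield zeros."""
    O = rng.binomial(1, inv_logit(eta_o))
    P = rng.poisson(np.exp(eta_p))
    return O * P, O, P

rng = np.random.default_rng(7)
eta_o = rng.normal(size=500)             # placeholder for X'beta_o + W_o
eta_p = rng.normal(loc=1.0, size=500)    # placeholder for X'beta_p + W_p
Z_h, O_h, P_h = simulate_hurdle_counts(eta_o, eta_p, rng)
Z_m, O_m, P_m = simulate_mixture_counts(eta_o, eta_p, rng)
```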

5.1.1 Implementation and competing methods

To complete the hierarchical framework, we specify prior distributions for the zero-inflated spatial model parameters. We assign a multivariate normal prior for the regression coefficients where \({\varvec{ \beta }}_O\sim {{\mathcal {N}}}({{\varvec{0}}}, 100{{\mathcal {I}}})\) and \({\varvec{ \beta }}_P\sim {{\mathcal {N}}}({{\varvec{0}}}, 100{{\mathcal {I}}})\). For the variance of the spatial basis coefficients, we specify non-informative inverse gamma priors where \(\sigma ^{2}_{O}\sim {{\mathcal {I}}}{{\mathcal {G}}}(0.002, 0.002)\) and \(\sigma ^{2}_{P}\sim {{\mathcal {I}}}{{\mathcal {G}}}(0.002, 0.002)\). The cross-correlation coefficient between the occurrence and prevalence processes \(\rho \) follows a uniform distribution \(\rho \sim {\mathcal {U}}(-1,1)\). For the semi-continuous cases, the nugget variance \(\tau ^2_\epsilon \) for the prevalence process follows an inverse gamma distribution \(\tau ^2_\epsilon \sim {{\mathcal {I}}}{{\mathcal {G}}}(0.002, 0.002)\). For the PICAR-Z approach, we ran 150,000 iterations of the MCMC algorithm. The MCMC algorithm is implemented using the programming language nimble (de Valpine et al. 2017). The selected ranks, \(p_o\) and \(p_p\), vary across datasets and the class of two-part models. The median rank for the occurrence processes (\(p_o\)) is smaller than the median rank for the prevalence processes (\(p_p\)), which is consistent with results from a previous study (Lee and Haran 2022). We provide a table and summary of the chosen ranks in the supplement.

We compare the PICAR-Z method against the low-rank (bisquare) approach (Sengupta and Cressie 2013; Cressie and Johannesson 2008) as well as the ‘reparameterized’ approach (Christensen and Waagepetersen 2002). Due to computational constraints, we elected to use the approach from Christensen and Waagepetersen (2002) over the full spatial hierarchical two-part model from Sect. 3.1. The ‘reparameterized’ approach is, by design, a computationally efficient method for modeling latent spatial random processes: it preserves the rank of the spatial random effects and improves mixing in the MCMC algorithm by considerably reducing the correlation in the spatial random effects. For the ‘reparameterized’ approach, we assume that the class of covariance function (Matérn) and the smoothness parameter (\(\nu =0.5\)) are known a priori, which may not necessarily be the case in other scenarios. Additional details for the competing approaches are provided in the supplement.

We provide the following validation metrics averaged over all samples in the simulation study: (1) out-of-sample root mean squared prediction error (rmspe total) for all observations; (2) area under the receiver operating characteristic curve (AUC) for the zero-valued observations; and (3) rmspe for the nonzero-valued observations (rmspe positive). The AUC is used to assess how well each approach classifies zero-valued observations in a binary classification setting. The out-of-sample root mean squared prediction error (rmspe) is \(\text{ rmspe }=\sqrt{\frac{1}{n_{\textrm{CV}}}\sum _{i=1}^{n_{\textrm{CV}}}(Y^{*}_{i}-{\hat{Y}}^{*}_{i})^{2}}\), where \(n_{\textrm{CV}}=400\), \(Y^{*}_{i}\) denotes the i-th value in the validation sample, and \({\hat{Y}}^{*}_{i}\) is the predicted value at the i-th location. In addition, we compare the computational walltimes to draw 150,000 samples from the respective posterior distributions via MCMC. We assess convergence of the Markov chains using the batch means standard errors. The computation times are based on a single 2.4 GHz Intel Xeon Gold 6240R processor. All the code was run on the George Mason University Office of Research Computing (ORC) HOPPER high-performance computing infrastructure.
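The three validation metrics can be computed as in the sketch below. The function `validation_metrics` and its argument names are illustrative. The sketch assumes that "rmspe positive" restricts the sum to validation locations whose observed value is nonzero, and that the AUC treats a nonzero observation as the positive class scored by the predicted occurrence probability; the labeling direction does not change the AUC as long as the score is flipped accordingly.

```python
import numpy as np

def rmspe(y, y_hat):
    """Root mean squared prediction error over a validation sample."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean((y - y_hat) ** 2))

def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    labels, scores = np.asarray(labels), np.asarray(scores, dtype=float)
    diff = scores[labels == 1][:, None] - scores[labels == 0][None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

def validation_metrics(y, y_hat, prob_nonzero):
    """Return rmspe total, rmspe among nonzero observations, and zero-vs-nonzero AUC."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    nonzero = y > 0
    return {
        "rmspe_total": rmspe(y, y_hat),
        "rmspe_positive": rmspe(y[nonzero], y_hat[nonzero]),
        "auc": auc(nonzero.astype(int), prob_nonzero),
    }

# Tiny made-up validation sample
y = np.array([0.0, 2.0, 0.0, 5.0])
y_hat = np.array([0.0, 2.0, 1.0, 3.0])
prob_nonzero = np.array([0.1, 0.35, 0.4, 0.8])
metrics = validation_metrics(y, y_hat, prob_nonzero)
```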

5.2 Results

Table 1 contains the out-of-sample prediction results for the entire validation sample (rmspe), positive-valued observations (rmspe), and zero versus nonzero values (AUC) as well as the average model-fitting walltimes. Results of the simulation study suggest that PICAR-Z outperforms both competing approaches in prediction across all four classes of two-part models (see Table 1). All approaches perform comparably for binary classification of the zero versus nonzero cases, as corroborated by similar AUC values. However, the PICAR-Z methods (with and without cross-correlation) provide more accurate predictions for the nonzero (i.e., positive-valued) observations, in comparison to the other two methods. Estimating the correlation parameter does not strongly affect accuracy, save for the semi-continuous hurdle case. Note that the PICAR-Z approach outperformed the ‘reparameterized’ approach in predictive performance, which is consistent with results from past studies that examined basis representations of spatial latent fields (Bradley et al. 2019; Lee and Haran 2022). Figure 1 provides a visual representation of the latent probability \(\pi (\textbf{s})\) and log-intensity \(\textrm{log}(\theta (\textbf{s}))\) surfaces.

Table 1 Simulation Study Results: Median values for all 100 samples in the simulation study

We also consider mixing in MCMC algorithms by examining the effective sample size per second (ES/sec), or the rate at which independent samples are generated by the MCMC algorithm. Larger values of ES/sec are indicative of faster mixing Markov chains. Note that the PICAR-Z approach generates a faster mixing MCMC algorithm than the ‘reparameterized’ approach (Christensen and Waagepetersen 2002), a method specifically designed to improve mixing for SGLMMs. For model parameters \(\beta _{1o}\), \(\beta _{2o}\), \(\beta _{1p}\), and \(\beta _{2p}\), PICAR-Z yields an ES/sec of 218.89, 214.15, 44.00, and 43.83, respectively. The ‘reparameterized’ approach returns an ES/sec of 0.66, 0.63, 0.26, and 0.25, respectively. For the spatial random effects \({\textbf{W}}_{o}(s)\) and \({\textbf{W}}_{p}(s)\), the median ES/sec is 53.09 and 28.70 for the PICAR-Z approach and 0.71 and 0.40 for the ‘reparameterized’ approach, an improvement by a factor of roughly 74.3 and 72.5. Across all four model classes, the PICAR-Z approach has shorter walltimes to run 150,000 iterations of the Metropolis-Hastings algorithm than the ‘low-rank (bisquare)’ and ‘reparameterized’ approaches (Table 1). Against the ‘reparameterized’ approach, PICAR-Z exhibits a speed-up factor of roughly 152.4, 121.2, 203.9, and 177.4 for the count hurdle, semi-continuous hurdle, count mixture, and semi-continuous mixture models, respectively. We also conducted a sensitivity analysis regarding the proportion of zeros within the sample and various model performance metrics. Results (see supplement) indicate that datasets with a low proportion of zeros have lower AUC (poorer classification) than datasets with a larger proportion of zeros. Low proportions of zeros are linked with shorter model-fitting walltimes. Boxplots of the relevant metrics (total rmspe, nonzero rmspe, AUC for the zero-valued observations, and walltimes) are also provided in the supplement.

Fig. 1

Prediction results from a single simulated example. Data are generated from the spatial count mixture model in Sect. 5. Top row includes the true and predicted probability surfaces \(\pi (\textbf{s})\) for the occurrence random process \(O(\textbf{s})\). Bottom row presents the true and predicted log-intensity surfaces \(\log (\theta (\textbf{s}))\) for the prevalence random process \(P(\textbf{s})\). Column 1 presents the true latent probability and log-intensity surfaces. Columns 2–4 include the predicted surfaces for PICAR-Z (column 2), low-rank approach with bi-square basis functions (column 3), and the ‘reparameterized’ approach (column 4). In the fifth column, a color scale is provided for the probability and log-intensity surfaces

6 Applications

We demonstrate the scalability and flexibility of PICAR-Z on two large environmental datasets with spatially-referenced zero-inflated observations—zero-inflated counts of a bivalve species and high-resolution ice thickness measurements over West Antarctica.

6.1 Bivalve species in the Dutch Wadden Sea

We randomly select 3220 observations to fit our model and hold out 806 observations for validation. Covariates include environmental variables that affect the abundance of the Macoma balthica species such as: (1) median grain size of the sediments; (2) silt content of the sediments; and (3) altitude. Using PICAR-Z, we fit the hurdle count model and the zero-inflated Poisson model (mixture). We construct a triangular mesh with m = 4028 mesh vertices. In both the count hurdle and mixture cases, the automated heuristic (Sect. 4.3) chose ranks \(p_{o}=14\) and \(p_{p}=64\) for the occurrence and prevalence processes, respectively.

We employ similar model specifications and prior distributions as in the simulated examples (Sect. 5.2), including a comparison to the low-rank (bisquare) approach. Comparative results are provided in Table 2. Both the PICAR-Z and correlated PICAR-Z approaches outperform the low-rank approach in predictive accuracy and have shorter model-fitting walltimes. These results hold in both the hurdle and mixture modeling approaches. Note that PICAR-Z provides more accurate predictions, compared to the low-rank approach, among the nonzero observations. Comparisons to the ‘reparameterized’ approach (Christensen and Waagepetersen 2002) are computationally prohibitive due to the long walltimes associated with the MCMC algorithms. Under PICAR-Z, both the hurdle and mixture models provide comparable out-of-sample predictions, yet the hurdle model has a shorter walltime; therefore, we recommend the count hurdle model for this particular case. We present the inferential results for the regression coefficients \({\varvec{ \beta }}_{o}\) and \({\varvec{ \beta }}_{p}\) in the supplement.

Table 2 Bivalve Species Results: Results are grouped by two-part model (hurdle vs. mixture) and approach (PICAR-Z, PICAR-Z with cross-correlation, and low-rank approach using bi-square basis functions)

6.2 Ice-sheet thickness data for West Antarctica

Similar to the previous application, we partition our dataset by assigning 23,000 locations as a training set and the remaining 6241 locations as the validation set. We model the observed ice thickness data using: (1) a hurdle semi-continuous model with a lognormal prevalence process; and (2) a semi-continuous mixture model using the Tobit Type I model. We rescale the spatial domain into the unit square, and include the X- and Y-coordinates as covariates. We implement similar settings as the semi-continuous cases in Sect. 5 including the parameters’ prior distributions. Due to the high-dimensional observations, we omit comparisons to the correlated PICAR-Z and ‘reparameterized’ approaches. For PICAR-Z, we construct a triangular mesh with m = 5888 mesh vertices. For the hurdle model, the automated heuristic chose ranks \(p_{o}=46\) and \(p_{p}=200\) for the occurrence and prevalence processes, respectively. In the mixture model, we use ranks \(p_{o}=46\) and \(p_{p}=80\). For the low-rank approach we use the quad-tree structure (84 basis functions) from Sect. 5.

Comparative results are provided in Table 3 and maps of the predicted values are provided in Fig. 2. PICAR-Z yields more accurate out-of-sample predictions than the low-rank approach in both the total rmspe (hurdle: 353.31 vs. 372.53 and mixture: 314.47 vs. 323.77) and the rmspe among positive values (hurdle: 436.39 vs. 461.27 and mixture: 384.32 vs. 398.51). Both approaches yield comparable results among zero vs. nonzero predictions based on AUC (hurdle: 0.95 vs. 0.96 and mixture: 0.94 vs. 0.92). For the hurdle model, walltimes are slightly shorter for the ‘low-rank (bisquare)’ approach, which can be attributed to fewer basis functions chosen to represent the prevalence process. Despite the longer walltimes, PICAR-Z provides more accurate predictions and a larger effective sample size. The mixture model (Tobit Type I) is able to predict more accurately than the hurdle model; however, it comes at the cost of slightly longer walltimes. Maps of the corresponding prediction standard deviations are provided in the supplement. Note that PICAR-Z yields more accurate predictions, albeit with larger uncertainties.

Table 3 West Antarctica Ice Thickness Results: Results are grouped by two-part model (hurdle vs. mixture) and approach (PICAR-Z vs. low-rank with bi-square basis functions)
Fig. 2

Maps of the observed ice thickness at the 6241 validation locations (top left). Predicted values using the hurdle semi-continuous model with PICAR-Z (top middle) and ‘low-rank (bisquare)’ (top right). Predictions using the mixture semi-continuous model with PICAR-Z (bottom middle) and ‘low-rank (bisquare)’ (bottom right)

7 Discussion

In this study, we propose a computationally efficient approach (PICAR-Z) to model high-dimensional zero-inflated spatial count and semi-continuous observations. Our method approximates the two latent spatial random fields using the PICAR representation. PICAR-Z scales well to higher dimensions, is automated, and extends to a wide range of spatial two-part models. In both simulated and real-world examples, PICAR-Z yields comparable results to the ‘reparameterized’ approach in both inference and prediction, yet incurs just a fraction of the computational costs. In addition, PICAR-Z outperforms a competing approach in both predictions and computational costs. Our method can be easily implemented in a programming language for Markov chain Monte Carlo algorithms such as nimble or stan. PICAR-Z significantly reduces computational overhead while maintaining model performance, thereby allowing practitioners to investigate a wider range of two-part spatial models than was previously possible.

While this study focuses on four types of two-part models, a natural extension would bring ideas from more complex two-part models to the spatial setting. Examples include hurdle models with skewed distributions (Dreassi et al. 2014; Liu et al. 2016), t-distributions to model heavy tailed behavior (Neelon et al. 2015), or scale mixtures of normal distributions (Fruhwirth-Schnatter and Pyne 2010). Next, our approach does not provide a procedure for choosing between hurdle and mixture models, at least prior to model-fitting. Developing a formal test or automated heuristic to select the appropriate class of spatial two-part models would be a promising area of future research.

Extending PICAR-Z to the multivariate or spatio-temporal setting has potential for future research. The latent spatial processes can be linked using nonstationary multivariate covariance functions (Kleiber and Nychka 2012) or basis functions weighted with Gaussian graphical vectors (Krock et al. 2021). For spatio-temporal data, modeling basis coefficients as a vector-autoregressive process (Bradley et al. 2015) can induce temporal dependencies in the latent spatial occurrence and prevalence processes. In light of recent critiques from Zimmerman and Ver Hoef (2022) and Khan and Calder (2022), subsequent studies would benefit from rigorously examining spatial confounding in zero-inflated spatial datasets by modifying PICAR-Z and the de-confounded Moran’s basis functions (Hughes and Haran 2013).