Erlang mixture modeling for Poisson process intensities

We develop a prior probability model for temporal Poisson process intensities through structured mixtures of Erlang densities with common scale parameter, mixing on the integer shape parameters. The mixture weights are constructed through increments of a cumulative intensity function which is modeled nonparametrically with a gamma process prior. Such model specification provides a novel extension of Erlang mixtures for density estimation to the intensity estimation setting. The prior model structure supports general shapes for the point process intensity function, and it also enables effective handling of the Poisson process likelihood normalizing term resulting in efficient posterior simulation. The Erlang mixture modeling approach is further elaborated to develop an inference method for spatial Poisson processes. The methodology is examined relative to existing Bayesian nonparametric modeling approaches, including empirical comparison with Gaussian process prior based models, and is illustrated with synthetic and real data examples.


Introduction
Poisson processes play a key role in both theory and applications of point processes. They form a widely used class of stochastic models for point patterns that arise in biology, ecology, engineering and finance among many other disciplines. The relatively tractable form of the non-homogeneous Poisson process (NHPP) likelihood is one of the reasons for the popularity of NHPPs in applications involving point process data.
Theoretical background for the Poisson process can be found, for example, in Kingman (1993) and Daley and Vere-Jones (2003). Regarding Bayesian nonparametric modeling and inference, prior probability models have been developed for the NHPP mean measure (e.g., Lo, 1982, 1992), and mainly for the intensity function of NHPPs over time and/or space. Modeling methods for NHPP intensities include: mixtures of non-negative kernels with weighted gamma process priors for the mixing measure (e.g., Lo and Weng, 1989; Wolpert and Ickstadt, 1998; Ishwaran and James, 2004; Kang et al., 2014); piecewise constant functions driven by Voronoi tessellations with Markov random field priors (Heikkinen and Arjas, 1998, 1999); Gaussian process priors for logarithmic or logit transformations of the intensity (e.g., Rodrigues and Diggle, 2012); and Dirichlet process mixtures for the NHPP density, i.e., the intensity function normalized in the observation window (e.g., Kottas, 2006; Kottas and Sansó, 2007; Taddy and Kottas, 2012).
Here, we seek to develop a flexible and computationally efficient model for NHPP intensity functions over time or space. We focus on temporal intensities to motivate the modeling approach and to detail the methodological development, and then extend the model for spatial NHPPs. The NHPP intensity over time is represented as a weighted combination of Erlang densities indexed by their integer shape parameters and with a common scale parameter. Thus, different from existing mixture representations, the proposed mixture model is more structured with each Erlang density identified by the corresponding mixture weight. The non-negative mixture weights are defined through increments of a cumulative intensity on R + . Under certain conditions, the Erlang mixture intensity model can approximate in a pointwise sense general intensities on R + (see Section 2.1).
A gamma process prior is assigned to the primary model component, that is, the cumulative intensity that defines the mixture weights. Mixture weights driven by a gamma process prior result in flexible intensity function shapes, and, at the same time, ready prior-to-posterior updating given the observed point pattern. Indeed, a key feature of the model is that it can be implemented with an efficient Markov chain Monte Carlo (MCMC) algorithm that does not require approximations, complex computational methods, or restrictive prior modeling assumptions in order to handle the NHPP likelihood normalizing term. The intensity model is extended to the two-dimensional setting through products of Erlang densities for the mixture components, with the weights built from a measure modeled again with a gamma process prior. The extension to spatial NHPPs retains the appealing aspect of computationally efficient MCMC posterior simulation.
The paper is organized as follows. Section 2 presents the modeling and inference methodology for NHPP intensities over time. The modeling approach for temporal NHPPs is illustrated through synthetic and real data in Section 3. Section 4 develops the model for spatial NHPP intensities, including two data examples. Finally, Section 5 concludes with a discussion of the modeling approach relative to existing Bayesian nonparametric models, as well as of possible extensions of the methodology.

Methodology for temporal Poisson processes
The mixture model for NHPP intensities is developed in Section 2.1, including discussion of model properties and theoretical justification. Sections 2.2 and 2.3 present a prior specification approach and the posterior simulation method, respectively.

The mixture modeling approach
A NHPP on R_+ can be defined through its intensity function, λ(t), for t ∈ R_+, a non-negative and locally integrable function such that: (a) for any bounded B ⊂ R_+, the number of events in B, N(B), is Poisson distributed with mean Λ(B) = ∫_B λ(u) du; and (b) given N(B) = n, the times t_i, for i = 1, ..., n, that form the point pattern in B arise independently and identically distributed (i.i.d.) according to density λ(t)/Λ(B).
Consequently, the likelihood for the NHPP intensity function, based on the point pattern {0 < t_1 < ... < t_n < T} observed in time window (0, T), is proportional to exp{−∫_0^T λ(u) du} Π_{i=1}^n λ(t_i). Our modeling target is the intensity function, λ(t). We denote by ga(· | α, β) the gamma density (or distribution, depending on the context) with mean α/β. The proposed intensity model involves a structured mixture of Erlang densities, ga(t | j, θ^{-1}), mixing on the integer shape parameters, j, with a common scale parameter θ. The non-negative mixture weights are defined through increments of a cumulative intensity function, H, on R_+, which is assigned a gamma process prior. More specifically,

λ(t) ≡ λ(t | H, θ) = Σ_{j=1}^J ω_j ga(t | j, θ^{-1}), with ω_j = H(jθ) − H((j−1)θ), and H ∼ G(H_0, c_0),    (1)

where G(H_0, c_0) is a gamma process specified through H_0, a (parametric) cumulative intensity function, and c_0, a positive scalar parameter (Kalbfleisch, 1978). For any t ∈ R_+, E(H(t)) = H_0(t) and Var(H(t)) = H_0(t)/c_0, and thus H_0 plays the role of the centering cumulative intensity, whereas c_0 is a precision parameter. As an independent increments process, the G(H_0, c_0) prior for H implies that, given θ, the mixture weights are independent ga(ω_j | c_0 ω_{0j}(θ), c_0) distributed, where ω_{0j}(θ) = H_0(jθ) − H_0((j−1)θ). As shown in Section 2.3, this is a key property of the prior model with respect to implementation of posterior inference.
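As an illustrative sketch (not part of the paper's implementation), the structure of model (1) under the exponential centering H_0(t) = t/b of Section 2.2 can be simulated and evaluated as follows; all parameter values below are hypothetical choices for illustration.

```python
import numpy as np
from scipy.stats import gamma as gamma_dist

def draw_weights(rng, J, theta, c0, b):
    """Draw the mixture weights, which given theta are independent
    ga(c0 * omega_0j(theta), c0) under the gamma process prior; with
    H_0(t) = t / b, each increment omega_0j(theta) equals theta / b."""
    omega0 = theta / b
    return rng.gamma(shape=c0 * omega0, scale=1.0 / c0, size=J)

def intensity(t, omega, theta):
    """lambda(t) = sum_j omega_j * ga(t | j, theta^{-1}); the j-th Erlang
    density has integer shape j and common scale theta (mean j * theta)."""
    t = np.atleast_1d(t)
    shapes = np.arange(1, len(omega) + 1)
    dens = gamma_dist.pdf(t[:, None], a=shapes[None, :], scale=theta)
    return dens @ omega

rng = np.random.default_rng(0)
omega = draw_weights(rng, J=50, theta=0.4, c0=0.5, b=0.05)
lam = intensity(np.linspace(0.1, 20.0, 100), omega, theta=0.4)
```

Repeated draws of `omega` trace out prior realizations of the intensity, as in the prior simulations discussed in Section 2.2.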
The model in (1) is motivated by Erlang mixtures for density estimation, under which a density g on R_+ is represented as g(t) ≡ g_{J,θ}(t) = Σ_{j=1}^J p_j ga(t | j, θ^{-1}), for t ∈ R_+. Here, p_j = G(jθ) − G((j−1)θ), where G is a distribution function on R_+; the last weight can be defined as p_J = 1 − G((J−1)θ) to ensure that (p_1, ..., p_J) is a probability vector.
Erlang mixtures can approximate general densities on the positive real line, in particular, as θ → 0 and J → ∞, g J,θ converges pointwise to the density of distribution function G that defines the mixture weights. This convergence property can be obtained from more general results from the probability literature that studies Erlang mixtures as extensions of Bernstein polynomials to the positive real line (e.g., Butzer, 1954); a convergence proof specifically for the distribution function of g J,θ can be found in Lee and Lin (2010). Density estimation on compact sets via Bernstein polynomials has been explored in the Bayesian nonparametrics literature following the work of Petrone (1999a,b). Regarding Bayesian nonparametric modeling with Erlang mixtures, we are only aware of Xiao et al. (2021) where renewal process inter-arrival distributions are modeled with mixtures of Erlang distributions, using a Dirichlet process prior (Ferguson 1973) for distribution function G. Venturini et al. (2008) study a parametric Erlang mixture model for density estimation on R + , working with a Dirichlet prior distribution for the mixture weights. Therefore, the modeling approach in (1) exploits the structure of the Erlang mixture density model to develop a prior for NHPP intensities, using the density/distribution function and intensity/cumulative intensity function connection to define the prior model for the mixture weights. In this context, the gamma process prior for cumulative intensity H is the natural analogue to the Dirichlet process prior for distribution function G; recall that the Dirichlet process can be defined through normalization of a gamma process (e.g., Ghosal and van der Vaart, 2017). To our knowledge, this is a novel construction for NHPP intensities that has not been explored for intensity estimation in either the classical or Bayesian nonparametrics literature. 
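The approximation property can be checked numerically. The following sketch (illustrative, not from the paper) builds the weights p_j from the distribution function of an arbitrary target, here a lognormal, and verifies that refining θ (while growing J so that Jθ stays fixed) shrinks the pointwise error.

```python
import numpy as np
from scipy.stats import gamma as gamma_dist, lognorm

def erlang_mixture_pdf(t, theta, J, G_cdf):
    """g_{J,theta}(t) = sum_j p_j * ga(t | j, theta^{-1}), with
    p_j = G(j*theta) - G((j-1)*theta) built from the target cdf G."""
    j = np.arange(1, J + 1)
    p = G_cdf(j * theta) - G_cdf((j - 1) * theta)
    dens = gamma_dist.pdf(np.atleast_1d(t)[:, None], a=j[None, :], scale=theta)
    return dens @ p

target = lognorm(s=0.5, scale=1.0)   # arbitrary test density on R_+
t_grid = np.linspace(0.2, 4.0, 50)

def max_err(theta, J):
    approx = erlang_mixture_pdf(t_grid, theta, J, target.cdf)
    return np.max(np.abs(approx - target.pdf(t_grid)))

# Shrinking theta with J*theta held fixed reduces the approximation error.
err_coarse = max_err(theta=0.25, J=40)
err_fine = max_err(theta=0.05, J=200)
```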
The following lemma, which can be obtained by applying Theorem 2 from Butzer (1954), provides theoretical motivation and support for the mixture model.

Lemma. Let h be the intensity function of a NHPP on R_+, with cumulative intensity function H̃(t) = ∫_0^t h(u) du. Then, for any continuity point t of h, the Erlang mixture intensity λ_{J,θ}(t) = Σ_{j=1}^J { H̃(jθ) − H̃((j−1)θ) } ga(t | j, θ^{-1}) converges to h(t), as θ → 0 and J → ∞ such that Jθ → ∞.

The form of the prior model for the intensity in (1) allows ready expressions for other NHPP functionals. For instance, the total intensity over the observation time window is

Λ(T) ≡ ∫_0^T λ(u) du = Σ_{j=1}^J ω_j K_j(T),    (2)

where K_j(T) = ∫_0^T ga(u | j, θ^{-1}) du is the j-th Erlang distribution function at T. In the context of the MCMC posterior simulation method, this form enables efficient handling of the NHPP likelihood normalizing constant.
Regarding the role of the different model parameters, we reiterate that (1) corresponds to a structured mixture. The Erlang densities, ga(t | j, θ^{-1}), play the role of basis functions in the representation for the intensity. In this respect, of primary importance is the flexibility of the nonparametric prior for the cumulative intensity function H that defines the mixture weights. In particular, the gamma process prior provides realizations for H with general shapes that can concentrate on different time intervals, thus favoring different subsets of the Erlang basis densities through the corresponding ω_j. Here, the key parameter is the precision parameter c_0, which controls the variability of the gamma process prior around H_0, and thus the effective mixture weights. As an illustration, Figure 1 shows prior realizations for the weights ω_j (and the resulting intensity function) for different values of c_0, keeping all other model parameters the same. Note that as c_0 decreases, so does the number of practically non-zero weights.
The prior mean for H is taken to be H_0(t) = t/b, i.e., the cumulative intensity (hazard) of an exponential distribution with scale parameter b > 0. Although it is possible to use more general centering functions, such as the Weibull H_0(t) = (t/b)^a, the exponential form is sufficiently flexible in practice, as demonstrated with the synthetic data examples of Section 3. Based on the role of H in the intensity mixture model, we typically anticipate realizations for H that are different from the centering function H_0, and thus, as discussed above, the more important gamma process parameter is c_0. Moreover, the exponential form for H_0 allows for an analytical result for the prior expectation of the Erlang mixture intensity model. Under H_0(t) = t/b, the prior expectation for the weights is given by E(ω_j | θ, b) = ω_{0j}(θ) = θ/b, for all j. Therefore, conditional on all model hyperparameters, the expectation of λ(t) over the gamma process prior can be written as E(λ(t) | θ, c_0, b) = (θ/b) Σ_{j=1}^J ga(t | j, θ^{-1}), which converges to b^{-1}, as J → ∞, for any t ∈ R_+ (and regardless of the value of θ and c_0). In practice, the prior mean for the intensity function is essentially constant at b^{-1} for t ∈ (0, Jθ), which, as discussed below, is roughly the effective support of the NHPP intensity. This result is useful for prior specification as it distinguishes the role of b from that of parameters θ and c_0.
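The limit b^{-1} follows because Σ_{j≥1} θ ga(t | j, θ^{-1}) = e^{−t/θ} Σ_{k≥0} (t/θ)^k / k! = 1. A quick numerical check of this identity, with illustrative parameter values:

```python
import numpy as np
from scipy.stats import gamma as gamma_dist

# Verify that (theta / b) * sum_{j=1}^J ga(t | j, theta^{-1}) is essentially
# constant at 1/b on (0, J*theta) once J is large. Values are illustrative.
theta, b, J = 0.4, 0.05, 200
t = np.linspace(0.5, 20.0, 40)            # well inside (0, J*theta) = (0, 80)
j = np.arange(1, J + 1)
prior_mean = (theta / b) * gamma_dist.pdf(
    t[:, None], a=j[None, :], scale=theta).sum(axis=1)
# prior_mean matches 1/b = 20 to high accuracy across the grid
```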
Also key are the two remaining model parameters, the number of Erlang basis densities J, and their common scale parameter θ. Parameters θ and J interact to control both the effective support and shape of NHPP intensities arising under (1). Regarding intensity shapes, as the lemma suggests, smaller values of θ and larger values of J generally result in more variable, typically multimodal intensities. Moreover, the representation for λ(t) in (1) utilizes Erlang basis densities with increasing means jθ, and thus (0, Jθ) can be used as a proxy for the effective support of the NHPP intensity. Of course, the mean underestimates the effective support; a more accurate guess can be obtained using, say, the mean plus two standard deviations of the last basis density, Jθ + 2θJ^{1/2}.

Prior specification
To complete the full Bayesian model, we place prior distributions on the parameters c 0 and b of the gamma process prior for H, and on the scale parameter θ of the Erlang basis densities. A generic approach to specify these hyperpriors can be obtained using the observation time window (0, T ) as the effective support of the NHPP intensity.
We work with exponential prior distributions for parameters c_0 and b. Using the prior mean for the intensity function, which as discussed in Section 2.1 is roughly constant at b^{-1} within the time interval of interest, the total intensity in (0, T) can be approximated by T/b. Therefore, taking the size n of the observed point pattern as a proxy for the total intensity in (0, T), we can use T/n to specify the mean of the exponential prior distribution for b. Given its role in the gamma process prior, we anticipate that small values of c_0 will be important to allow prior variability around H_0, as well as sparsity in the mixture weights. Experience from prior simulations, such as the ones shown in Figure 1, is useful to guide the range of "small" values. Note that the pattern observed in Figure 1 is not affected by the length of the observation window. In general, a value around 10 can be viewed as a conservative guess at a high percentile for c_0. For the data examples of Section 3, we assigned an exponential prior with mean 10 to c_0, observing substantial learning for this key model hyperparameter with its posterior distribution supported by values (much) smaller than 1.
Also given the key role of parameter θ in controlling the intensity shapes, we recommend favoring sufficiently small values in the prior for θ, especially if prior information suggests a non-standard intensity shape. Recall that θ, along with J, control the effective support of the intensity, and thus "small" values for θ should be assessed relative to the length of the observation window. Again, prior simulation, as in Figure 2, is a useful tool.
A practical approach to specify the prior range of θ values involves reducing the Erlang mixture model to the first component. The corresponding (exponential) density has mean θ, and we thus use (0, T) as the effective prior range for θ. Because T is a fairly large upper bound, and since we wish to favor smaller θ values, rather than an exponential prior, we use a Lomax prior, p(θ) ∝ (1 + θ/d_θ)^{-3}, with shape parameter equal to 2 (thus implying infinite variance), and median d_θ(√2 − 1). The value of the scale parameter, d_θ, is specified such that Pr(0 < θ < T) ≈ 0.999. This simple strategy is effective in practice in identifying a plausible range of θ values. For the synthetic data examples of Section 3, for which T = 20, we assigned a Lomax prior with scale parameter d_θ = 1 to θ, obtaining overall moderate prior-to-posterior learning for θ.
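The scale d_θ has a closed form: with shape parameter 2, the Lomax cdf is F(x) = 1 − (1 + x/d_θ)^{-2}, so F(T) = p gives d_θ = T/((1 − p)^{-1/2} − 1). The helper below is our own illustrative sketch of this computation; for T = 20 it returns a value of the same order as the d_θ = 1 used in Section 3 (which yields Pr(0 < θ < 20) ≈ 0.998).

```python
# Closed-form Lomax scale so that the prior for theta puts probability
# approximately p on (0, T); shape parameter fixed at 2 as in the paper.
def lomax_scale(T, p=0.999):
    return T / ((1.0 - p) ** -0.5 - 1.0)

def lomax_cdf(x, d):
    """cdf of the Lomax(shape=2, scale=d) distribution."""
    return 1.0 - (1.0 + x / d) ** -2

d_theta = lomax_scale(20.0)               # T = 20, as in the synthetic examples
median_theta = d_theta * (2 ** 0.5 - 1)   # prior median d_theta * (sqrt(2) - 1)
```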
Finally, we work with fixed J, the value of which can be specified exploiting the role of θ and J in controlling the support of the NHPP intensity. In particular, J can be set equal to the integer part of T /θ * , where θ * is the prior median for θ. More conservatively, this value can be used as a lower bound for values of J to be studied in a sensitivity analysis, especially for applications where one expects non-standard shapes for the intensity function. In practice, we recommend conducting prior sensitivity analysis for all model parameters, as well as plotting prior realizations and prior uncertainty bands for the intensity function to graphically explore the implications of different prior choices.
The number of Erlang basis densities is the only model parameter which is not assigned a hyperprior. Placing a prior on J significantly complicates the posterior simulation method, as it necessitates the use of variable-dimension MCMC techniques, while offering relatively little from a practical point of view. The key observation is again that the Erlang densities play the role of basis functions rather than of kernel densities in traditional (less structured) finite mixture models. Also key is the nonparametric nature of the prior for the function H that defines the mixture weights, which select the Erlang densities to be used in the representation of the intensity. This model feature effectively guards against over-fitting if one conservatively chooses a larger value for J than may be necessary.
In this respect, the flexibility afforded by random parameters c 0 and θ is particularly useful. Overall, we have found that fixing J strikes a good balance between computational tractability and model flexibility in terms of the resulting inferences.

Posterior simulation
Denote as before by {0 < t_1 < ... < t_n < T} the point pattern observed in time window (0, T). Under the Erlang mixture model of Section 2.1, the NHPP likelihood is proportional to exp{ −Σ_{j=1}^J ω_j K_j(T) } Π_{i=1}^n { Σ_{j=1}^J ω_j ga(t_i | j, θ^{-1}) }, where K_j(T) = ∫_0^T ga(u | j, θ^{-1}) du is the j-th Erlang distribution function at T. For the posterior simulation approach, we augment the likelihood with auxiliary variables γ = {γ_i : i = 1, ..., n}, where γ_i identifies the Erlang basis density to which time event t_i is assigned. Then, the augmented, hierarchical model for the data can be expressed as follows:

p(data, γ | ω, θ) ∝ exp{ −Σ_{j=1}^J ω_j K_j(T) } Π_{i=1}^n ω_{γ_i} ga(t_i | γ_i, θ^{-1}),
ω_j | θ, c_0, b ∼ ga(ω_j | c_0 ω_{0j}(θ), c_0), independently for j = 1, ..., J,
θ ∼ p(θ), c_0 ∼ p(c_0), b ∼ p(b),

where ω = {ω_j : j = 1, ..., J}, and p(θ), p(c_0), and p(b) denote the priors for θ, c_0, and b.
Recall that, under the exponential form for H_0, ω_{0j}(θ) = θ/b, for all j. We utilize Gibbs sampling to explore the posterior distribution. The sampler involves ready updates for the auxiliary variables γ_i, and, importantly, also for the mixture weights ω_j. More specifically, the posterior full conditional for each γ_i is a discrete distribution on {1, ..., J} such that Pr(γ_i = j | θ, ω, data) ∝ ω_j ga(t_i | j, θ^{-1}), for j = 1, ..., J.
Denote by N_j = |{t_i : γ_i = j}|, for j = 1, ..., J; that is, N_j is the number of time points assigned to the j-th Erlang basis density. The posterior full conditional distribution for ω is derived as follows:

p(ω | γ, θ, c_0, b, data) ∝ { Π_{j=1}^J ga(ω_j | c_0 ω_{0j}(θ), c_0) } exp{ −Σ_{j=1}^J ω_j K_j(T) } Π_{j=1}^J ω_j^{N_j} ∝ Π_{j=1}^J ga(ω_j | c_0 ω_{0j}(θ) + N_j, c_0 + K_j(T)),

where we have used the fact that Σ_{j=1}^J N_j = n. Therefore, given the other parameters and the data, the mixture weights are independent, and each ω_j follows a gamma posterior full conditional distribution. This is a practically important feature of the model in terms of convenient updates for the mixture weights, and with respect to efficiency of the posterior simulation algorithm as it pertains to this key component of the model parameter vector.
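The two conjugate updates can be sketched as follows. This is an illustrative reduction of the sampler, with the hyperparameters θ, c_0, b held fixed (in the full algorithm they receive Metropolis-Hastings updates) and a hypothetical uniform point pattern standing in for data; all numeric values are illustrative.

```python
import numpy as np
from scipy.stats import gamma as gamma_dist

rng = np.random.default_rng(1)
T, J, theta, c0, b = 20.0, 50, 0.4, 0.5, 0.05
times = np.sort(rng.uniform(0, T, size=100))     # stand-in point pattern
shapes = np.arange(1, J + 1)
K_T = gamma_dist.cdf(T, a=shapes, scale=theta)   # j-th Erlang cdf at T
omega0 = theta / b                               # omega_0j(theta) under H_0(t)=t/b

# Erlang basis densities evaluated at the observed times (n x J matrix)
dens = gamma_dist.pdf(times[:, None], a=shapes[None, :], scale=theta)
omega = rng.gamma(c0 * omega0, 1.0 / c0, size=J)     # initialize from the prior

for _ in range(200):
    # gamma_i | ... : discrete with Pr(gamma_i = j) prop. to omega_j * ga(t_i | j, theta^{-1})
    probs = dens * omega
    probs /= probs.sum(axis=1, keepdims=True)
    labels = np.array([rng.choice(J, p=p) for p in probs])   # 0-based labels
    # omega_j | ... : independent ga(c0*omega0 + N_j, c0 + K_j(T))
    N = np.bincount(labels, minlength=J)
    omega = rng.gamma(c0 * omega0 + N, 1.0 / (c0 + K_T))
```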
Finally, each of the remaining parameters, c 0 , b, and θ, is updated with a Metropolis-Hastings (M-H) step, using a log-normal proposal distribution in each case.

Model extensions to incorporate marks
Here, we discuss how the Erlang mixture prior for NHPP intensities can be embedded in semiparametric models for point patterns that include additional information on marks.
Consider the setting where, associated with each observed time event t_i, marks y_i ≡ y_{t_i} are recorded (marks are only observed when an event is observed). Without loss of generality, we assume that marks are continuous variables taking values in mark space M ⊆ R^d, for d ≥ 1. As discussed in Taddy and Kottas (2012), a nonparametric prior for the intensity of the temporal process, T, can be combined with a mark distribution to construct a semiparametric model for marked NHPPs. In particular, consider a generic marked NHPP {(t, y_t) : t ∈ T, y_t ∈ M}, that is: the temporal process T is a NHPP on R_+ with intensity function λ; and, conditional on T, the marks {y_t : t ∈ T} are mutually independent. Now, assume that, conditional on T, the marks have density m_t that depends only on t (i.e., it does not depend on any earlier time t' < t). Then, by the "marking" theorem (e.g., Kingman, 1993), we have that the marked NHPP is a NHPP on the extended space R_+ × M with intensity λ*(t, y_t) = λ(t) m_t(y_t). Therefore, since each mark density m_t is normalized over M, the likelihood for the observed marked point pattern {(t_i, y_i) : i = 1, ..., n} can be written as exp{ −∫_0^T λ(u) du } Π_{i=1}^n λ(t_i) m_{t_i}(y_i). Hence, the MCMC method of Section 2.3 can be extended for marked NHPP models built from the Erlang mixture prior for intensity λ, and any time-dependent model for the mark density m_t.
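The factorization of the marked NHPP likelihood into a temporal term and a mark term can be sketched generically; `lam` and `mark_logpdf` below are hypothetical callables standing in for any intensity model and time-dependent mark density.

```python
import numpy as np

def marked_nhpp_loglik(times, marks, lam, cum_lam_T, mark_logpdf):
    """log L = -Lambda(T) + sum_i log lam(t_i) + sum_i log m_{t_i}(y_i);
    the temporal and mark terms separate, so each model component can be
    updated with its own MCMC steps."""
    temporal = -cum_lam_T + np.sum(np.log(lam(times)))
    mark_term = np.sum([mark_logpdf(t, y) for t, y in zip(times, marks)])
    return temporal + mark_term
```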

Data examples
To empirically investigate inference under the proposed model, we present three synthetic data examples corresponding to decreasing, increasing, and bimodal intensities. We also consider the coal-mining disasters data set, which is commonly used to illustrate NHPP intensity estimation.
We used the approach of Section 2.2 to specify the priors for c_0, b and θ, and the value for J. In particular, we used the exponential prior for c_0 with mean 10 for all data examples. For the three synthetic data sets (for which T = 20), we used the Lomax prior for θ with shape parameter equal to 2 and scale parameter equal to 1. Prior sensitivity analysis results for the synthetic data example of Section 3.3 are provided in the Supplementary Material. Overall, results from prior sensitivity analysis (also conducted for all other data examples) suggest that the prior specification approach of Section 2.2 is effective as a general strategy. Moreover, more dispersed priors for parameters c_0, b and θ have little to no effect on the posterior distribution for these parameters and essentially no effect on posterior estimates for the NHPP intensity function, even for point patterns with relatively small size, such as the one (n = 112) for the data example of Section 3.3.
The Supplement provides also computational details about the MCMC posterior simulation algorithm, including study of the effect of the number of basis densities (J) and the size of the point pattern (n) on effective sample size and computing time.

Decreasing intensity synthetic point pattern
The first synthetic data set involves 491 time points generated in time window (0, 20) from a NHPP with intensity of the form β^{-1}α(β^{-1}t)^{α−1}, with α < 1. This form corresponds to the hazard function of a Weibull distribution with shape parameter less than 1, thus resulting in a decreasing intensity function. The Erlang mixture model was applied with J = 50, and an exponential prior for b with mean 0.04. The model captures the decreasing pattern of the data generating intensity function; see Figure 3. We note that there is significant prior-to-posterior learning in the intensity function estimation; the prior intensity mean is roughly constant at a value of about 25, with prior uncertainty bands that cover almost the entire top left panel in Figure 3.
Prior uncertainty bands were similarly wide for all other data examples.

Increasing intensity synthetic point pattern
We consider again the form β^{-1}α(β^{-1}t)^{α−1} for the NHPP intensity function, but here with (α, β) = (6, 7) such that the intensity is increasing. A point pattern comprising 565 points was generated in time window (0, 20). The Erlang mixture model was applied with J = 50, and an exponential prior for b with mean 0.035. Figure 4 reports inference results.
This example demonstrates the model's capacity to effectively recover increasing intensity shapes over the bounded observation window, even though the Erlang basis densities are ultimately decreasing.

Bimodal intensity synthetic point pattern
The data examples in Sections 3.1 and 3.2 illustrate the model's capacity to uncover monotonic intensity shapes, associated with a parametric distribution different from the Erlang distribution that forms the basis of the mixture intensity model. Here, we consider a data generating intensity with a bimodal shape.

Coal-mining disasters data
Our real data example involves the "coal-mining disasters" data (e.g., Andrews and Herzberg, 1985). We fit the Erlang mixture model with J = 50, using a Lomax prior for θ with shape parameter 2 and scale parameter 2,000, such that Pr(0 < θ < 40,550) ≈ 0.998, and an exponential prior for b with mean 213. We also implemented the model with J = 130, obtaining essentially the same inference results for the NHPP functionals as the ones reported in Figure 6.
The estimates for the point process intensity and density functions (Figure 6, top row) suggest that the model successfully captures the multimodal intensity shape indicated by the data, with the fit driven by the Erlang basis densities that are most influential in the model.
The bottom right panel of Figure 6 reports results from graphical model checking, using the "time-rescaling" theorem (e.g., Daley and Vere-Jones, 2003). If the point pattern is a realization from a NHPP with cumulative intensity Λ, then the transformed point pattern {Λ(t_i) : i = 1, ..., n} is a realization from a unit-rate homogeneous Poisson process.
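The time-rescaling idea behind this check can be sketched in a few lines: under a NHPP with cumulative intensity Λ, the gaps between successive rescaled times Λ(t_i) should behave as i.i.d. Exp(1) draws. The example below (illustrative, not the coal-mining analysis) uses the hypothetical intensity λ(t) = 2t on (0, 10), for which Λ(t) = t².

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(2)
# Simulate a NHPP with lambda(t) = 2t on (0, 10): Lambda(10) = 100 events on
# average, with times drawn i.i.d. from density 2t/100 (inverse-cdf sampling).
n = rng.poisson(100.0)
times = np.sort(10.0 * np.sqrt(rng.uniform(size=n)))

rescaled = times ** 2                     # Lambda(t) = t^2
gaps = np.diff(np.concatenate(([0.0], rescaled)))
# KS test of the gaps against Exp(1); a small p-value would signal lack of fit
pvalue = kstest(gaps, 'expon').pvalue
```

In practice, the check is applied with Λ replaced by its posterior estimate, as in the bottom right panel of Figure 6.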

Modeling for spatial Poisson process intensities
In Section 4.1, we extend the modeling framework to spatial NHPPs with intensities defined on R_+ × R_+. The resulting inference method is illustrated with synthetic and real data examples in Sections 4.2 and 4.3, respectively.

The Erlang mixture model for spatial NHPPs
A spatial NHPP is again characterized by its intensity function, λ(s), for s = (s_1, s_2) ∈ R_+ × R_+. The NHPP intensity is a non-negative and locally integrable function such that, for any bounded B ⊂ R_+ × R_+, the number of events in B is Poisson distributed with mean ∫_B λ(u) du, and, given the number of events in B, the locations arise i.i.d. with density λ(s)/{∫_B λ(u) du}. Therefore, the structure of the likelihood for the intensity function is similar to the temporal NHPP case. In particular, for a spatial point pattern {s_i : i = 1, ..., n} observed in region D, the likelihood is proportional to exp{ −∫_D λ(s) ds } Π_{i=1}^n λ(s_i). As is typically the case in standard applications involving spatial NHPPs, we consider a regular, rectangular domain for the observation region D, which can therefore be taken without loss of generality to be the unit square.
The spatial NHPP intensity is modeled with a structured mixture of products of Erlang basis densities, λ(s_1, s_2) = Σ_{j_1=1}^J Σ_{j_2=1}^J ω_{j_1 j_2} ga(s_1 | j_1, θ_1^{-1}) ga(s_2 | j_2, θ_2^{-1}), where the (j_1, j_2)-th basis density is associated with the rectangle A_{j_1 j_2} = ((j_1 − 1)θ_1, j_1 θ_1] × ((j_2 − 1)θ_2, j_2 θ_2]. The corresponding weight is defined through a random measure H supported on R_+ × R_+, such that ω_{j_1 j_2} = H(A_{j_1 j_2}). This construction extends the one for the weights of the temporal NHPP model. We again place a gamma process prior, G(H_0, c_0), on H, where c_0 is the precision parameter and H_0 is the centering measure on R_+ × R_+. As a natural extension of the exponential cumulative hazard used in Section 2.1 for the gamma process prior mean, we specify H_0 to be proportional to area; in particular, H_0(A_{j_1 j_2}) = θ_1 θ_2 / b. Using the independent increments property of the gamma process, and under the specific choice of H_0, the prior for the mixture weights is given by ω_{j_1 j_2} | θ_1, θ_2, c_0, b ∼ ga(ω_{j_1 j_2} | c_0 θ_1 θ_2 b^{-1}, c_0), independently for j_1, j_2 = 1, ..., J, which, as before, is a practically important feature of the model construction as it pertains to MCMC posterior simulation.
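A minimal sketch of the spatial construction, with illustrative parameter values: the J × J weight matrix is drawn from its independent-gamma prior, and the intensity at a location is a bilinear form in the two Erlang basis vectors.

```python
import numpy as np
from scipy.stats import gamma as gamma_dist

rng = np.random.default_rng(3)
# Illustrative hyperparameter values; the unit square is the observation region.
J, theta1, theta2, c0, b = 30, 0.05, 0.05, 0.5, 1.0 / 500
# omega_{j1 j2} ~ ga(c0 * theta1 * theta2 / b, c0), independently
omega = rng.gamma(c0 * theta1 * theta2 / b, 1.0 / c0, size=(J, J))

def spatial_intensity(s1, s2, omega, theta1, theta2):
    """lambda(s1, s2) = sum_{j1, j2} omega_{j1 j2}
       * ga(s1 | j1, theta1^{-1}) * ga(s2 | j2, theta2^{-1})."""
    j = np.arange(1, omega.shape[0] + 1)
    d1 = gamma_dist.pdf(s1, a=j, scale=theta1)   # Erlang basis in s1
    d2 = gamma_dist.pdf(s2, a=j, scale=theta2)   # Erlang basis in s2
    return d1 @ omega @ d2

lam = spatial_intensity(0.4, 0.6, omega, theta1, theta2)
```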
To complete the full Bayesian model, we place priors on the common scale parameters for the basis densities, (θ 1 , θ 2 ), and on the gamma process prior hyperparameters c 0 and b. The role played by these model parameters is directly analogous to the one of the corresponding parameters for the temporal NHPP model, as detailed in Section 2.1.
Therefore, we apply similar arguments to the ones in Section 2.2 to specify the model hyperpriors. More specifically, we work with (independent) Lomax prior distributions for scale parameters θ 1 and θ 2 , where the shape parameter of the Lomax prior is set equal to 2 and the scale parameter is specified such that Pr(0 < θ 1 < 1)Pr(0 < θ 2 < 1) ≈ 0.999.
Recall that the observation region is taken to be the unit square; in general, for a square observation region, this approach implies the same Lomax prior for θ 1 and θ 2 . The gamma process precision parameter c 0 is assigned an exponential prior with mean 10. The result of Section 2.1 for the prior mean of the NHPP intensity can be extended to show that E(λ(s 1 , s 2 ) | b, θ 1 , θ 2 ) converges to b −1 , as J → ∞, for any (s 1 , s 2 ) ∈ R + × R + , and for any (θ 1 , θ 2 ) (and c 0 ). The prior mean for the spatial NHPP intensity is practically constant at b −1 within its effective support given roughly by (0, Jθ 1 ) × (0, Jθ 2 ). Hence, taking the size of the observed spatial point pattern as a proxy for the total intensity, b is assigned an exponential prior distribution with mean 1/n. Finally, the choice of the value for J can be guided from the approximate effective support for the intensity, which is controlled by J along with θ 1 and θ 2 . Analogously to the approach discussed in Section 2.2, the value of J (or perhaps a lower bound for J) can be specified through the integer part of 1/θ * , where θ * is the median of the common Lomax prior for θ 1 and θ 2 .

Synthetic data example
Here, we illustrate the spatial NHPP model using synthetic data based on a bimodal intensity function built from a two-component mixture of bivariate logit-normal densities.
Denote by BLN(µ, Σ) the bivariate logit-normal density arising from the logistic transformation of a bivariate normal with mean vector µ and covariance matrix Σ. A spatial point pattern of size 528 was generated over the unit square from a NHPP with intensity built from a two-component mixture of BLN densities. Finally, we note the substantial prior-to-posterior learning for all model hyperparameters.

Real data illustration
Our final data example involves a spatial point pattern that has been previously used to illustrate NHPP intensity estimation methods (e.g., Diggle, 2014; Kottas and Sansó, 2007). The data set involves the locations of 514 maple trees in a 19.6 acre square plot in Lansing Woods, Clinton County, Michigan, USA; the maple trees point pattern is included in the left column panels of Figure 8. To apply the spatial Erlang mixture model, we specified the hyperpriors for θ_1, θ_2, c_0 and b following the approach discussed in Section 4.1, and set J = 70. As with the synthetic data example, the posterior distributions for model hyperparameters are substantially concentrated relative to their priors; see the bottom right panel of Figure 8. The estimates for the spatial intensity of maple tree locations reported in Figure 8 demonstrate the model's capacity to uncover non-standard, multimodal intensity surfaces.

Discussion
We have proposed a Bayesian nonparametric modeling approach for Poisson processes over time or space. The approach is based on a mixture representation of the point process intensity through Erlang basis densities, which are fully specified save for a scale parameter shared by all of them. The weights assigned to the Erlang densities are defined through increments of a random measure (a random cumulative intensity function in the temporal NHPP case) which is modeled with a gamma process prior. A key feature of the methodology is that it offers a good balance between model flexibility and computational efficiency in implementation of posterior inference. Such inference has been illustrated with synthetic and real data for both temporal and spatial Poisson process intensities.
To discuss our contribution in the context of Bayesian nonparametric modeling methods for NHPPs (briefly reviewed in the Introduction), note that the main approaches can be grouped into two broad categories: placing the prior model on the NHPP intensity function; or, assigning separate priors to the total intensity and the NHPP density (both defined over the observation window).
In terms of applications, especially for spatial point patterns, the most commonly explored class of models falling in the former category involves Gaussian process (GP) priors for logarithmic (or logit) transformations of the NHPP intensity (e.g., Møller et al., 1998). The NHPP likelihood normalizing term renders full posterior inference under GP-based models particularly challenging. This challenge has been bypassed using approximations of the stochastic integral that defines the likelihood normalizing term (Brix and Møller, 2001), data augmentation techniques (Adams et al., 2009), and different types of approximations of the NHPP likelihood along with integrated nested Laplace approximation for approximate Bayesian inference (Illian et al., 2012; Simpson et al., 2016). In contrast, the Erlang mixture model can be readily implemented with MCMC algorithms that do not involve approximations to the NHPP likelihood or complex computational techniques. The Supplementary Material includes comparison of the proposed model with two GP-based models: the sigmoidal Gaussian Cox process (SGCP) model (Adams et al., 2009) for temporal NHPPs; and the log-Gaussian Cox process (LGCP) model for spatial NHPPs, as implemented in the R package lgcp. The results, based on the synthetic data considered in Sections 3.3 and 4.2, suggest that the Erlang mixture model is substantially more computationally efficient than the SGCP model, as well as less sensitive to model/prior specification than LGCP models, for which the choice of the GP covariance function can have a large effect on the intensity surface estimates.
Since it involves a mixture formulation for the NHPP intensity, the proposed modeling approach is closer in spirit to methods based on Dirichlet process mixture priors for the NHPP density (e.g., Kottas and Sansó, 2007;Taddy and Kottas, 2012). Both types of approaches build posterior simulation from standard MCMC techniques for mixture models, using latent variables that configure the observed points to the mixture components.
Models that build from density estimation with Dirichlet process mixtures benefit from the wide availability of related posterior simulation methods (e.g., the number of mixture components in the NHPP density representation does not need to be specified), and from the various extensions of the Dirichlet process to dependent distributions that can be explored to develop flexible models for hierarchically related point processes (e.g., Taddy, 2010; Kottas et al., 2012; Xiao et al., 2015; Rodriguez et al., 2017). However, by construction, this approach is restricted to modeling the NHPP intensity only on the observation window, in fact with a separate prior for the NHPP density and for the total intensity over the observation window. The Erlang mixture model overcomes this limitation. For instance, in the temporal case, the prior model supports the intensity on R+, and the priors for the total intensity and the NHPP density over (0, T) (given in Equation (2)) are compatible with the prior for the NHPP intensity.
The proposed model admits a parsimonious representation for the NHPP intensity with the Erlang basis densities defined through a single parameter, the common scale parameter θ. Such intensity representations offer a nonparametric Bayesian modeling perspective for point processes that may be attractive in other contexts and for different types of applications. For instance, Zhao and Kottas (2021) study representations for the intensity through weighted combinations of structured beta densities (with different priors for the mixture weights), which are particularly well suited to flexible and efficient inference for spatial NHPP intensities over irregular domains.
Finally, we note that the Erlang mixture prior model is useful as a building block towards Bayesian nonparametric inference for point processes that can be represented as hierarchically structured, clustered NHPPs. Current research is exploring fully nonparametric modeling for a key example, the Hawkes process (Hawkes 1971), using the Erlang mixture prior for the Hawkes process immigrant (background) intensity function.

Supplementary Information
The Supplementary Material includes results from prior sensitivity analysis, information on computing times and effective sample sizes, the technical details of the MCMC algorithm for the spatial NHPP model, and results from comparison of the Erlang mixture model with two Gaussian process based models for temporal or spatial NHPPs.

Prior sensitivity analysis

The second row corresponds to a more dispersed exponential prior for θ, with scale parameter d_θ = 9 (instead of d_θ = 1), such that Pr(0 < θ < T) ≈ 0.9 (instead of Pr(0 < θ < T) ≈ 0.999), where T = 20. Analogously, the third and fourth rows correspond to more dispersed exponential priors for c_0 and b, respectively. Note that the posterior distributions for θ and c_0 are largely unaffected by the change in the dispersion of the prior, whereas there is some effect on the tail of the posterior density for b. Importantly, for a point pattern of relatively small size, posterior estimates for the intensity function are essentially the same under the different prior choices.

Computing times and effective sample sizes

Increasing J increases the dimension of the parameter space. Figure 2 shows autocorrelation functions (at 48 lags) for the intensity function evaluated at the 51 equally-spaced grid points in (0, T). The difference in the autocorrelations in the two panels of Figure 2 is compatible with the corresponding mean ESS given in Table 2.

Computational details for the MCMC algorithm
Convergence and mixing of the MCMC algorithm can also be assessed graphically through trace plots of the intensity function evaluated at specific time points within the observation window. An example is given in Figure 3 for the synthetic data of Section 3.1 of the paper. The plots in Figure 3 are representative of intensity trace plots obtained for all other data examples.

Comparison with Gaussian process based models
Here, we compare our model with Bayesian nonparametric models based on Gaussian process (GP) priors for logarithmic or logit transformations of the NHPP intensity. Such models are popular in both the statistics and the machine learning literature. For the temporal case, Section 2.1 considers comparison with the sigmoidal Gaussian Cox process (SGCP) model. Since we were not able to find publicly available software, results are based on our implementation of the SGCP model, which allows for a detailed comparison involving full inference results and computational efficiency. Software in the form of R packages is available for spatial and spatio-temporal log-Gaussian Cox process (LGCP) models (Baddeley and Turner 2005), although its output is limited in terms of inferences that are of interest in our setting. Therefore, for the spatial NHPP case, Section 2.2 focuses on graphical comparison of point estimates for the intensity surface in the context of the synthetic data considered in Section 4.2 of the paper.

Temporal Poisson process models
Under the SGCP model, the temporal NHPP intensity is represented as λ(t) = λ* σ(g(t)), where λ* is an upper bound on the intensity function, σ(z) = (1 + e^(−z))^(−1) is the logistic function, and g is a real-valued random function assigned a GP prior. Posterior simulation under the SGCP model proceeds by MCMC, using latent variables (associated with "thinned" events) to handle the intractable NHPP likelihood normalizing term.
To compare the Erlang mixture and SGCP models, we work with the synthetic data set of Section 3.3 of the paper. This is a choice arising from practical considerations; as discussed below, the SGCP model is particularly computationally intensive, and we thus consider the data example with the smallest point pattern size.
We applied the SGCP model using a GP prior for the function g with constant mean μ and squared-exponential covariance function, cov(g(t_1), g(t_2)) = θ_1 exp(−θ_2 (t_1 − t_2)^2). The SGCP model parameters therefore comprise λ* and (μ, θ_1, θ_2), all four of which are assigned priors. We note that a prior specification strategy is not provided for the SGCP model.
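A prior draw under this representation can be sketched as follows; the parameter values below are hypothetical and this is an illustration of the model structure, not the authors' implementation:

```python
import numpy as np

def sq_exp_cov(t, theta1, theta2):
    """Squared-exponential covariance:
    cov(g(t_i), g(t_j)) = theta1 * exp(-theta2 * (t_i - t_j)^2)."""
    d = t[:, None] - t[None, :]
    return theta1 * np.exp(-theta2 * d ** 2)

def sgcp_intensity_draw(t, lam_star, mu, theta1, theta2, rng):
    """One prior draw of lambda(t) = lam_star * sigmoid(g(t)),
    with g a GP with constant mean mu and SE covariance."""
    K = sq_exp_cov(t, theta1, theta2) + 1e-8 * np.eye(len(t))  # jitter for stability
    g = rng.multivariate_normal(mu * np.ones(len(t)), K)
    return lam_star / (1.0 + np.exp(-g))  # logistic transform bounds the intensity

rng = np.random.default_rng(0)
t = np.linspace(0.0, 20.0, 50)
# Hypothetical values for the four SGCP parameters.
lam = sgcp_intensity_draw(t, lam_star=10.0, mu=0.0, theta1=1.0, theta2=0.1, rng=rng)
```

The sketch makes the role of λ* visible: every draw of the intensity is bounded between 0 and λ*, which is why λ* simultaneously governs prior dispersion and, in the thinning-based MCMC, the number of latent variables.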
We observed that, at least for this data set, inference for all SGCP model parameters is sensitive to the prior choice. Moreover, there is a conflict regarding the role of parameter λ * : it controls the extent of prior uncertainty, with larger λ * values resulting in wider prior interval bands for the intensity, but at the same time, increasing λ * increases the number of latent variables and thus also the MCMC algorithm computing time.
We tuned the priors for the SGCP model parameters to obtain the best possible estimates for the intensity function. The resulting estimates are reported in the left panel of Figure 4. Even though we intentionally favored the SGCP model through the prior selection, the Erlang mixture model posterior mean estimate captures the peaks of the underlying intensity more accurately than the SGCP model, and it does so under larger levels of prior uncertainty (middle panel of Figure 4).
The right panel of Figure 4 shows boxplots of 51 ESS values computed from the posterior samples for λ(t) at 51 equally-spaced time points on a grid over (0, T) = (0, 20). The ESS values are based on 15,000 posterior samples, obtained after discarding the first 5,000 as burn-in.
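The ESS values come from standard autocorrelation-based diagnostics. A minimal sketch of such a computation, truncating the autocorrelation sum at the first non-positive sample autocorrelation (illustrative; not necessarily the exact estimator used for the reported values):

```python
import numpy as np

def ess(x, max_lag=200):
    """Effective sample size: n / (1 + 2 * sum of autocorrelations),
    truncating the sum at the first non-positive sample autocorrelation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    var = np.dot(xc, xc) / n
    s = 0.0
    for k in range(1, max_lag + 1):
        rho = np.dot(xc[:-k], xc[k:]) / (n * var)
        if rho <= 0.0:
            break
        s += rho
    return n / (1.0 + 2.0 * s)

rng = np.random.default_rng(7)
iid = rng.normal(size=15_000)   # independent draws: ESS close to n
ar = np.empty(15_000)           # AR(1) with phi = 0.9: heavily autocorrelated chain
ar[0] = 0.0
for t in range(1, len(ar)):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()

ess_iid = ess(iid)   # close to 15,000
ess_ar = ess(ar)     # far smaller, roughly n * (1 - phi) / (1 + phi)
```

Comparing the two chains shows how autocorrelation in the posterior samples for λ(t) translates into the mean ESS differences reported in the tables.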

Spatial Poisson process models
The LGCP model for spatial NHPP intensities can be expressed as λ(s) = λ̄(s) exp(g(s)), where the function g is assigned a GP prior with isotropic correlation function and variance σ². The correlation function includes a parameter φ > 0, which controls the rate at which correlation decays with distance, and it may contain additional parameters (as in, e.g., the Matérn case). The mean of the GP prior is set to −σ²/2, which implies E(exp(g(s))) = 1, and thus E(λ(s)) = λ̄(s).
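The mean correction can be checked directly: if g is normal with mean −σ²/2 and variance σ², the lognormal mean formula gives E(exp(g)) = exp(−σ²/2 + σ²/2) = 1. A quick Monte Carlo sketch with an arbitrary value of σ²:

```python
import numpy as np

sigma2 = 1.5  # arbitrary illustrative value of the GP variance
rng = np.random.default_rng(42)

# Draws of g at a single site: normal with mean -sigma2/2 and variance sigma2.
g = rng.normal(loc=-sigma2 / 2.0, scale=np.sqrt(sigma2), size=2_000_000)

# Monte Carlo estimate of E(exp(g)); should be close to 1, so that
# E(lambda(s)) = lambda_bar(s) under the LGCP model.
mc_mean = float(np.exp(g).mean())
```

This is why λ̄(s) is interpretable as the marginal expected intensity surface, with exp(g(s)) acting as a multiplicative, mean-one spatial perturbation.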
To obtain point estimates for the spatial intensity function under the LGCP model, one can use two R packages: spatstat (Baddeley and Turner 2005) for parameter estimation, and lgcp for intensity estimation.
The function lgcpPredictSpatial of the lgcp package performs posterior inference for the intensity, obtaining posterior samples for a discretized version of the function g through Metropolis-adjusted Langevin algorithms (Taylor and Diggle 2014). However, to use lgcpPredictSpatial, values for σ² and φ, and the function λ̄(s), need to be provided. Since the lgcp package does not contain functions for inference about the LGCP model hyperparameters, we use the spatstat package, which provides non-Bayesian estimates for σ², φ, and λ̄(s). In particular, the spatstat function density.ppp computes a nonparametric kernel intensity estimator, which can be used to estimate λ̄(s), and the function lgcp.estpcf yields estimates for σ² and φ through a nonparametric kernel estimator for the pair correlation function.
Evidently, the approach described above is not fully Bayesian. And, since it involves a two-stage estimation procedure, comparison with the Erlang mixture model in terms of uncertainty estimates is not particularly meaningful. We thus consider graphical comparison of the Erlang mixture model posterior mean intensity estimate with the point estimates obtained from the LGCP model under two different GP correlation functions.
The comparison is based on the synthetic data example presented in Section 4.2 of the paper, for which we have the true underlying intensity as the point of reference.
The top row of Figure 5 shows the true intensity, and the posterior mean intensity under the Erlang mixture model, obtained under the prior specification discussed in Section 4.2 of the paper. Note that the Erlang mixture model estimates are fairly robust to the prior choice for the model hyperparameters, indeed, there is substantial learning for the hyperparameters under very dispersed priors (refer to Figure 7 of the paper).
The bottom row of Figure 5 plots the LGCP model point estimates for the intensity, using either an exponential or a Matérn GP correlation function. A further challenge with the use of LGCP models is that the spatstat package does not provide estimation for the additional parameters of correlation functions more general than the exponential; in particular, it does not provide an estimate for the smoothness parameter ν of the Matérn correlation function. We selected the value for ν by comparing graphically the pair correlation function estimator and the covariance function (with σ² and φ estimated).
The LGCP model estimates retrieve the bimodal global pattern of the true intensity, but with localized behavior (especially in the case of the exponential correlation function) that is not present in the underlying intensity surface. In addition to the challenge of obtaining full inference with appropriate uncertainty quantification, the sensitivity of the point estimates to the choice of the GP correlation function is evident.