A spatial scan statistic for zero-inflated Poisson process

Cançado, André L. F.; da-Silva, Cibele Q.; da Silva, Michel F.

doi:10.1007/s10651-013-0272-1

Published: 11 March 2014

Volume 21, pages 627–650, (2014)

Environmental and Ecological Statistics

Abstract

The scan statistic is widely used in spatial cluster detection applications of inhomogeneous Poisson processes. However, real data may present substantial departure from the underlying Poisson process. One of the possible departures has to do with zero excess. Some studies point out that when applied to data with excess zeros, the spatial scan statistic may produce biased inferences. In this work, we develop a closed-form scan statistic for cluster detection of spatial zero-inflated count data. We apply our methodology to simulated and real data. Our simulations revealed that the Scan-Poisson statistic steadily deteriorates as the number of zeros increases, producing biased inferences. On the other hand, our proposed Scan-ZIP and Scan-ZIP+EM statistics are, most of the time, either superior or comparable to the Scan-Poisson statistic.

References

Agarwal DK, Gelfand AE, Citron-Pousty S (2002) Zero-inflated models with application to spatial count data. Environ Ecol Stat 9:341–355
Agresti A (1990) Categorical data analysis. Wiley, New York
Böhning D, Dietz E, Schlattmann P, Mendonça L, Kirchner U (1999) The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology. J R Stat Soc 162(2):195–209
Casella G, Berger RL (1990) Statistical inference. Duxbury Press, Belmont, CA
Gómez-Rubio V, López-Quílez A (2010) Statistical methods for the geographical analysis of rare diseases. Adv Exp Med Biol 686:151–171
Kulldorff M (1997) A spatial scan statistic. Commun Stat Theory Methods 26(6):1481–1496
Kulldorff M, Tango T, Park P (2003) Power comparisons for disease clustering tests. Comput Stat Data Anal 42:665–684
Kulldorff M, Huang L, Konty K (2009) A scan statistic for continuous data based on the normal probability model. Int J Health Geogr 8:58
Lambert D (1992) Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34:1–14
Özmen I, Famoye F (2007) Count regression models with an application to zoological data containing structural zeros. J Data Sci 5:491–502
Rathbun SL, Fei S (2006) A spatial zero-inflated Poisson regression model for oak regeneration. Environ Ecol Stat 13:409–426

Acknowledgments

The authors would like to thank the anonymous referees for their careful reading of the manuscript and for constructive suggestions that considerably improved the article. André L. F. Cançado was partially supported by DPP/UnB. Cibele Q. da-Silva was supported by the National Research Council (CNPq-Brazil, BPPesq) and by the Office to Improve University Research (CAPES-Brazil) via Project PROCAD-NF 2008.

Author information

Authors and Affiliations

Statistics Department, Universidade de Brasília, Campus Darcy Ribeiro, Brasília, DF, 70910-900, Brazil
André L. F. Cançado, Cibele Q. da-Silva & Michel F. da Silva

Corresponding author

Correspondence to André L. F. Cançado.

Additional information

Handling Editor: Pierre Dutilleul.

Appendix

Proof of Theorem 1

Let $\lambda (D)$ and $\lambda (D^{\prime })$ denote the values of the test statistic for the two data sets $D$ and $D^{\prime }$. Further, consider

$$\begin{aligned} I(Z)=\left[ \,\frac{\sum _{i \in {Z}}^{}x_i(1-d_i)}{\sum _{i \in {Z}}^{}n_i(1-d_i)}\right] \quad \hbox {and} \quad O(Z)=\left[ \,\frac{\sum _{i \notin {Z}}^{}x_i(1-d_i)}{\sum _{i \notin {Z}}^{}n_i(1-d_i)}\right] . \end{aligned}$$

Similarly, let

$$\begin{aligned} I'(Z)=\left[ \,\frac{\sum _{i \in {Z}}^{}x_i'(1-d_i')}{\sum _{i \in {Z}}^{}n_i'(1-d_i')}\right] \quad \hbox {and} \quad O'(Z)=\left[ \,\frac{\sum _{i \notin {Z}}^{}x_i'(1-d_i')}{\sum _{i \notin {Z}}^{}n_i'(1-d_i')}\right] , \end{aligned}$$

and let

$$\begin{aligned} C&= \sum _{i=1}^{k}x_i(1-d_i) = \sum _{j=1}^{k}x_j'(1-d_j'),\\ R&= \sum _{i=1}^{k}n_i(1-d_i)=\sum _{j=1}^{k}n_j'(1-d_j'),\\ c(Z)&= \sum _{i \in {Z}}^{}x_i(1-d_i),\\ r(Z)&= \sum _{i \in {Z}}^{}n_i(1-d_i),\\ c'(Z)&= \sum _{i \in {Z}}^{}x_i'(1-d_i'),\\ r'(Z)&= \sum _{i \in {Z}}^{}n_i'(1-d_i'),\\ K&= \left[ \,\frac{\sum _{i=1}^{k}x_i(1-d_i)}{\sum _{i=1}^{k}n_i(1-d_i)}\right] ^{\sum _{i=1}^{k}x_i(1-d_i)}. \end{aligned}$$

As stated previously, these quantities are related to the sufficient statistics of the MLEs.
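The quantities defined above can be computed directly from the data. The following is a minimal Python sketch, not the authors' implementation: it assumes `x`, `n`, and `d` are equal-length sequences holding $x_i$, $n_i$, and $d_i$ (with $d_i \in \{0,1\}$), that a zone is a set of region indices, and that both $r(Z)$ and $R-r(Z)$ are positive. The function names are hypothetical. It evaluates, on the log scale, the ratio $\left[ I(Z)\right] ^{c(Z)}\left[ O(Z)\right] ^{C-c(Z)}/K$ for one candidate zone.

```python
import math


def xlogy(x, y):
    """Return x * log(y), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def zone_quantities(x, n, d, zone):
    """Return (c(Z), r(Z), C, R) for a zone given as a set of region indices."""
    C = sum(xi * (1 - di) for xi, di in zip(x, d))
    R = sum(ni * (1 - di) for ni, di in zip(n, d))
    c = sum(x[i] * (1 - d[i]) for i in zone)
    r = sum(n[i] * (1 - d[i]) for i in zone)
    return c, r, C, R


def log_ratio(x, n, d, zone):
    """Log of I(Z)^c(Z) * O(Z)^(C - c(Z)) / K for a single zone.

    Assumes r(Z) > 0 and R - r(Z) > 0 so that I(Z) and O(Z) are defined.
    """
    c, r, C, R = zone_quantities(x, n, d, zone)
    log_K = xlogy(C, C / R)  # K = (C / R)^C
    return xlogy(c, c / r) + xlogy(C - c, (C - c) / (R - r)) - log_K
```

Working on the log scale avoids underflow when the counts are large; regions with $d_i = 1$ contribute nothing to any of the sums.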

Under the null hypothesis, $\lambda (D)=1,$ and this implies that

$$\begin{aligned} K&= \hbox {sup}_{Z \in \mathcal {Z}} \left[ I(Z)\right] ^{c(Z)} \left[ O(Z)\right] ^{C-c(Z)}= \left[ I(\hat{Z})\right] ^{c(\hat{Z})} \left[ O(\hat{Z})\right] ^{C-c(\hat{Z})} \\&= \hbox {sup}_{Z \in \mathcal {Z}} \left[ I'(Z)\right] ^{c'(Z)} \left[ O'(Z)\right] ^{C-c'(Z)} = \left[ I'(\tilde{Z}^{\prime })\right] ^{c'(\tilde{Z}^{\prime })} \left[ O'(\tilde{Z}^{\prime })\right] ^{C-c'(\tilde{Z}^{\prime })}. \end{aligned}$$

Thus, $c(\hat{Z})=C=c'(\tilde{Z}^{\prime })$, and the distributions of $\lambda (D)$ and $\lambda (D^{\prime })$ are the same.

Under the alternative hypothesis, $\lambda (D)>1$, and we need to show that $\lambda (D') \ge \lambda (D)$. Notice that under the conditions of the theorem, $c'(\tilde{Z}^{\prime }) \ge c(\hat{Z})$, since

$$\begin{aligned} \sum _{i \in \tilde{Z}^{\prime }}^{}x_i'(1-d_i')= \sum _{i \in \hat{Z}}^{}x_i'(1-d_i')+ \sum _{i \in \overline{\hat{Z}}\cap \tilde{Z}^{\prime }}^{}x_i'(1-d_i') \ge \sum _{j \in \hat{Z}} x_j(1-d_j), \end{aligned}$$

as $\overline{\hat{Z}}\cap \tilde{Z}^{\prime }$ might be different from the empty set. When $\lambda (D) > 1$, we have from Eq. (9) that

$$\begin{aligned} \lambda (D)&= \hbox {sup}_{Z \in \mathcal {Z}} \frac{1}{K} \left[ I(Z)\right] ^{c(Z)} \left[ O(Z)\right] ^{C-c(Z)}\\&= \frac{1}{K} \left[ I(\hat{Z})\right] ^{c(\hat{Z})} \left[ O(\hat{Z})\right] ^{C-c(\hat{Z})} \\&= \frac{1}{K} \left[ \frac{c(\hat{Z})}{r(\hat{Z})}\right] ^{c(\hat{Z})} \left[ \frac{C-c(\hat{Z})}{R-r(\hat{Z})}\right] ^{C-c(\hat{Z})}\\&\le \frac{1}{K}\left[ \frac{c'(\hat{Z})}{r'(\hat{Z})}\right] ^{c'(\hat{Z})} \left[ \frac{C-c'(\hat{Z})}{R-r'(\hat{Z})}\right] ^{C-c'(\hat{Z})}\\&\le \hbox {sup}_{Z \in \mathcal {Z}} \frac{1}{K} \left[ I'(Z)\right] ^{c'(Z)} \left[ O'(Z)\right] ^{C-c'(Z)}\\&= \frac{1}{K} \left[ I'(\tilde{Z}^{\prime })\right] ^{c'(\tilde{Z}^{\prime })} \left[ O'(\tilde{Z}^{\prime })\right] ^{C-c'(\tilde{Z}^{\prime })}= \lambda (D^{\prime }). \end{aligned}$$
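The supremum over $\mathcal {Z}$ in the first line of this chain can be sketched as a loop over a finite collection of candidate zones. The Python sketch below is illustrative only: the function name `scan_statistic`, the representation of zones as sets of indices, and the rule that zones with $I(Z) \le O(Z)$ contribute a ratio of 1 are assumptions made here (the last one mirrors the convention of Kulldorff (1997) and the statement above that $\lambda (D) > 1$ implies $I(\hat{Z}) > O(\hat{Z})$). It also assumes at least one region has $d_i = 0$, so that $R > 0$.

```python
import math


def _xlogy(x, y):
    """Return x * log(y), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def scan_statistic(x, n, d, zones):
    """Return (log lambda(D), most likely zone) over the candidate zones.

    x, n, d: sequences of x_i, n_i and d_i (d_i in {0, 1}).
    zones:   iterable of sets of region indices (hypothetical representation).
    Returns (0.0, None) when no zone has an elevated rate, i.e. lambda(D) = 1.
    """
    C = sum(xi * (1 - di) for xi, di in zip(x, d))
    R = sum(ni * (1 - di) for ni, di in zip(n, d))
    log_K = _xlogy(C, C / R)  # K = (C / R)^C

    best_log_lambda, best_zone = 0.0, None
    for zone in zones:
        c = sum(x[i] * (1 - d[i]) for i in zone)
        r = sum(n[i] * (1 - d[i]) for i in zone)
        if r == 0 or r == R:
            continue  # I(Z) or O(Z) is undefined for this zone
        inside, outside = c / r, (C - c) / (R - r)
        if inside <= outside:
            continue  # assumption: such zones contribute a ratio of 1
        log_ratio = _xlogy(c, inside) + _xlogy(C - c, outside) - log_K
        if log_ratio > best_log_lambda:
            best_log_lambda, best_zone = log_ratio, zone
    return best_log_lambda, best_zone
```

In practice the candidate zones would come from a scanning window (for example, circles of increasing radius around each centroid); that construction is not shown here.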

The first inequality holds because, for any positive constants $\alpha $, $\beta $, and $N$, the function $(\alpha n)^n(\beta (N-n))^{N-n}$ is increasing in $n$ whenever $\alpha n > \beta (N-n)$. This condition is satisfied here, since $\lambda (D) > 1$ implies that $I(\hat{Z})>O(\hat{Z})$, that is, $\frac{c(\hat{Z})}{r(\hat{Z})}>\frac{C}{R}$. It also holds for the second data set, that is, $I'(\hat{Z}) > O'(\hat{Z})$. To verify this by contradiction, suppose that $I'(\hat{Z}) \le O'(\hat{Z})$. Then $\frac{c'(\hat{Z})}{r'(\hat{Z})}\le \frac{C}{R}$. Since $c'(\hat{Z}) \ge c(\hat{Z})$, this implies that $\frac{c(\hat{Z})}{r(\hat{Z})} \le \frac{c'(\hat{Z})}{r(\hat{Z})} \le \frac{C}{R}$ whenever $r(\hat{Z})=r'(\hat{Z})$, which contradicts $\frac{c(\hat{Z})}{r(\hat{Z})}>\frac{C}{R}$. $\square $
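The monotonicity property used for the first inequality can be checked with a short calculation. The following is a sketch that treats $n$ as a continuous variable on $(0,N)$ and takes $\alpha ,\beta > 0$; the function $f$ is notation introduced here for illustration. Write

$$\begin{aligned} f(n)&= \log \left[ (\alpha n)^n(\beta (N-n))^{N-n}\right] = n\log (\alpha n) + (N-n)\log (\beta (N-n)),\\ f'(n)&= \log (\alpha n) + 1 - \log (\beta (N-n)) - 1 = \log \frac{\alpha n}{\beta (N-n)}. \end{aligned}$$

Hence $f'(n) > 0$ exactly when $\alpha n > \beta (N-n)$. In the chain of inequalities above, this appears to correspond to taking $n=c(\hat{Z})$, $N=C$, $\alpha =1/r(\hat{Z})$, and $\beta =1/(R-r(\hat{Z}))$, with $r(\hat{Z})=r'(\hat{Z})$, so that the condition $\alpha n > \beta (N-n)$ reads $I(\hat{Z}) > O(\hat{Z})$.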

Proof of Theorem 2

According to Definition 1, in order to prove that $\lambda $ is an IMP test, it is necessary to show that if statements (1) and (2) are true, then (3) cannot hold. This is equivalent to showing that for any $(Z,\theta _0,\theta _Z) \ \in A_Z$,

$$\begin{aligned} P( w \in R^{\prime }_Z \mid (Z,\theta _0,\theta _Z))-P( w \in R_Z \mid (Z,\theta _0,\theta _Z))\le 0. \end{aligned} \tag{13}$$

For an arbitrary $Z$, let $D_{-}=\{w: w \in R_Z, w \notin R^{\prime }_Z \}$ and $D_{+}=\{w: w \in R^{\prime }_Z, w \notin R_Z \}$. Define

$$\begin{aligned} M = \hbox {sup}_{ \ w \in D_{+}} \frac{L(Z,\theta _Z,\theta _0 \mid w)}{L(\theta _0 \mid w)}. \end{aligned}$$

By the definition of $D_{+}$ and $D_{-}$, since $R_Z$ is described in terms of $Z$, which is the most likely cluster in a subset of the sample space, we have that each $w$ in $D_{-}$ has a higher likelihood ratio than any $w$ in $D_{+}$; that is,

$$\begin{aligned} M&= \hbox {sup}_{ \ w \in D_{+}} \frac{L(Z,\theta _Z,\theta _0 \mid w)}{L(\theta _0 \mid w)} \le \hbox {inf}_{ \ w \in D_{-}} \frac{L(Z,\theta _Z,\theta _0 \mid w)}{L(\theta _0 \mid w)},\\ M&= \hbox {sup}_{ \ w \in D_{+}} \frac{\left[ \frac{\sum _{i \in Z_{w}}^{}x_i(1-d_i)}{\sum _{i \in Z_{w}}^{}n_i(1-d_i)} \right] ^{\sum _{i \in Z_{w}}^{}x_i(1-d_i)} \left[ \frac{\sum _{j \notin Z_{w}}^{}x_j(1-d_j)}{\sum _{j \notin Z_{w}}^{}n_j(1-d_j)} \right] ^{\sum _{j \notin Z_{w}}^{}x_j(1-d_j)}}{\left[ \frac{\sum _{i=1}^{k}x_i(1-d_i)}{\sum _{i=1}^{k}n_i(1-d_i)} \right] ^{\sum _{i=1}^{k}x_i(1-d_i)}}\\&\le \hbox {inf}_{ \ w \in D_{-}} \frac{\left[ \frac{\sum _{i \in Z_{w}}^{}x_i(1-d_i)}{\sum _{i \in Z_{w}}^{}n_i(1-d_i)} \right] ^{\sum _{i \in Z_{w}}^{}x_i(1-d_i)} \left[ \frac{\sum _{j \notin Z_{w}}^{}x_j(1-d_j)}{\sum _{j \notin Z_{w}}^{}n_j(1-d_j)} \right] ^{\sum _{j \notin Z_{w}}^{}x_j(1-d_j)}}{\left[ \frac{\sum _{i=1}^{k}x_i(1-d_i)}{\sum _{i=1}^{k}n_i(1-d_i)} \right] ^{\sum _{i=1}^{k}x_i(1-d_i)}}\\&= \hbox {inf}_{ \ w \in D_{-}} \frac{L(Z,\theta _Z,\theta _0 \mid w)}{L(\theta _0 \mid w)}. \end{aligned}$$

The proof of inequality (13) for any $(Z, \theta _0, \theta _Z) \in A_Z$ follows largely from Kulldorff (1997), where it is verified that

$$\begin{aligned}&P( w \in R^{\prime }_Z \mid (Z,\theta _0,\theta _Z))-P( w \in R_Z \mid (Z,\theta _0,\theta _Z)) \\&\quad \le M(P(w \in R^{\prime } \mid H_0) - P(w \in R \mid H_0))=0. \end{aligned}$$

The last equality holds since $R_j=R_j^{\prime }$ for all $j \ne Z$, according to statement 2 in Definition 1. $\square $

About this article

Cite this article

Cançado, A.L.F., da-Silva, C.Q. & da Silva, M.F. A spatial scan statistic for zero-inflated Poisson process. Environ Ecol Stat 21, 627–650 (2014). https://doi.org/10.1007/s10651-013-0272-1

Received: 03 September 2012
Revised: 09 December 2013
Published: 11 March 2014
Issue Date: December 2014
DOI: https://doi.org/10.1007/s10651-013-0272-1
