Abstract
A scan statistic is examined for testing the existence of a global peak in a random process of dependent variables with an arbitrary distribution. The tail probability of the scan statistic is obtained from the covariance of the moving sums process, thereby accounting for the spatial nature of the data as well as the size of the searching window. Exact formulas linking this covariance to the window size and the correlation coefficient are developed under general, common, and auto covariance structures of the variables in the original process. The implementation and applicability of the formulas are demonstrated on multiple processes of t-statistics, including the case of unknown covariance. A sensitivity analysis provides further insight into how the tail probability varies with the parameters that influence it. R code for the tail probability computation and the data analysis is provided in the supplementary material.
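To make the link between the moving-sums covariance, the window size, and the correlation coefficient concrete, the following is a minimal sketch and not the paper's own formulas or its R code. It assumes unit-variance variables and uses only the elementary fact that the covariance of two window sums is the sum of all pairwise covariances between their members. The function names (`moving_sum_cov`, `common_corr_matrix`, `common_corr_closed_form`) and the example values are hypothetical and chosen for illustration.

```python
def moving_sum_cov(sigma, w):
    """Covariance matrix of the moving sums S_i = X_i + ... + X_{i+w-1}.

    sigma is the n x n covariance matrix of X (list of lists), w the window size.
    Each entry is the sum of sigma[k][l] over k in window i and l in window j.
    """
    n = len(sigma)
    m = n - w + 1  # number of window positions
    if w < 1 or m < 1:
        raise ValueError("window size must satisfy 1 <= w <= n")
    return [
        [
            sum(sigma[k][l] for k in range(i, i + w) for l in range(j, j + w))
            for j in range(m)
        ]
        for i in range(m)
    ]


def common_corr_matrix(n, rho):
    """Assumed structure: unit variances and one common correlation rho."""
    return [[1.0 if k == l else rho for l in range(n)] for k in range(n)]


def common_corr_closed_form(i, j, w, rho):
    """Closed form under the assumed common-correlation structure.

    Two windows share max(0, w - |i - j|) variables; each shared variable
    contributes its variance (1), every other pair contributes rho.
    """
    overlap = max(0, w - abs(i - j))
    return overlap + (w * w - overlap) * rho


if __name__ == "__main__":
    cov = moving_sum_cov(common_corr_matrix(6, 0.2), 3)
    for row in cov:
        print([round(v, 3) for v in row])
```

Under this assumed structure the brute-force sum and the closed form agree, and the covariance between two window sums shrinks as the windows overlap less, levelling off at w²ρ once they no longer share any variable.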
Cite this article
Reiner-Benaim, A. Scan Statistic Tail Probability Assessment Based on Process Covariance and Window Size. Methodol Comput Appl Probab 18, 717–745 (2016). https://doi.org/10.1007/s11009-015-9447-6
DOI: https://doi.org/10.1007/s11009-015-9447-6