Methodology and Computing in Applied Probability, Volume 18, Issue 3, pp 717–745

Scan Statistic Tail Probability Assessment Based on Process Covariance and Window Size

  • Anat Reiner-Benaim

Abstract

A scan statistic is examined for testing the existence of a global peak in a random process of dependent variables with an arbitrary distribution. The tail probability of the scan statistic is obtained from the covariance of the moving sums process, thereby accounting for the spatial nature of the data as well as the size of the search window. Exact formulas linking this covariance to the window size and the correlation coefficient are developed under general, common, and auto-covariance structures of the variables in the original process. The implementation and applicability of the formulas are demonstrated on multiple processes of t-statistics, including the case of unknown covariance. A sensitivity analysis provides further insight into how the tail probability interacts with the parameters that influence it. R code for the tail probability computation and the data analysis is provided in the supplementary material.
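To make the link between process covariance, window size, and tail probability concrete, here is a minimal Python sketch. It is not the paper's exact formulas or its R code. It assumes, purely for illustration, a Gaussian AR(1) process with unit variance and covariance rho^|s-t|, and the function names (`ar1_cov`, `moving_sum_cov`, `scan_tail_prob_mc`) are hypothetical. The covariance of the moving sums is computed directly by summing entries of the original covariance matrix, and the tail probability of the maximum moving sum is estimated by simulation rather than by an analytic expression.

```python
import random


def ar1_cov(n, rho):
    """Assumed example structure: Cov(X_s, X_t) = rho ** |s - t|, unit variance."""
    return [[rho ** abs(s - t) for t in range(n)] for s in range(n)]


def moving_sum_cov(sigma, w):
    """Covariance matrix of the moving sums S_i = X_i + ... + X_{i+w-1}.

    Cov(S_i, S_j) is the sum of sigma[a][b] over a in window i and b in window j.
    """
    n = len(sigma)
    m = n - w + 1  # number of windows
    return [
        [
            sum(sigma[a][b] for a in range(i, i + w) for b in range(j, j + w))
            for j in range(m)
        ]
        for i in range(m)
    ]


def scan_tail_prob_mc(n, w, rho, threshold, n_sim=2000, seed=0):
    """Monte Carlo estimate of P(max_i S_i > threshold) for a Gaussian AR(1) process."""
    rng = random.Random(seed)
    innov_sd = (1.0 - rho ** 2) ** 0.5  # keeps the process variance at 1
    exceed = 0
    for _sim in range(n_sim):
        # Simulate a stationary AR(1) path of length n.
        x = [rng.gauss(0.0, 1.0)]
        for _t in range(n - 1):
            x.append(rho * x[-1] + innov_sd * rng.gauss(0.0, 1.0))
        # Slide the window and track the largest moving sum (the scan statistic).
        s = sum(x[:w])
        best = s
        for i in range(1, n - w + 1):
            s += x[i + w - 1] - x[i - 1]
            best = max(best, s)
        if best > threshold:
            exceed += 1
    return exceed / n_sim


if __name__ == "__main__":
    cov_s = moving_sum_cov(ar1_cov(50, 0.5), 5)
    print("Var(S_1) =", cov_s[0][0], " Cov(S_1, S_2) =", cov_s[0][1])
    print("Estimated tail probability:", scan_tail_prob_mc(50, 5, 0.5, 8.0))
```

Widening the window or raising rho increases the variance of each moving sum and the correlation between neighbouring sums, which is why both quantities enter the tail probability.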

Keywords

Scan statistic · Tail probability · Moving sums · Covariance structure · Peak detection · Sequence search

Mathematics Subject Classification (2010)

62G32 62J15 30C40 

Supplementary material

11009_2015_9447_MOESM1_ESM.pdf (308 kb)

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. University of Haifa, Haifa, Israel