Applications of the Scan Statistic in DNA Sequence Analysis

Part of the book series: Statistics for Industry and Technology (SIT)

Abstract

Advances in biochemical techniques have made available large databases of long DNA sequences. These sequences reflect conglomerates of random and nonrandom letter strings from the nucleotide alphabet {A, C, G, T}. As the databases expand, mathematical methods play an increasingly important role in analyzing and interpreting the rapidly accumulating DNA data. In this chapter, we discuss a specific example: identifying nonrandom clusters of palindromes in a family of herpesvirus genomes using the r-scan statistic. Palindrome positions on the genome are modeled by i.i.d. random variables uniformly distributed on the unit interval (0,1). After a comparison of three Poisson-type approximations, the r-scan distribution is computed by a compound Poisson approximation proposed by Glaz et al. (1994). Some of the significant palindrome clusters are located at genome regions containing origins of replication and regulatory signals of the herpesviruses.
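
The r-scan idea in the abstract can be illustrated with a minimal sketch. It assumes the usual definition of an r-scan as the sum of r consecutive spacings between ordered positions, and it uses a simple Poisson heuristic for the minimum r-scan of n i.i.d. uniform points, not the compound Poisson approximation adopted in the chapter. The function names and the simulated data are hypothetical and serve only as illustration.

```python
import math
import random


def r_scans(positions, r):
    """Return the sums of r consecutive spacings between sorted positions."""
    pts = sorted(positions)
    return [pts[i + r] - pts[i] for i in range(len(pts) - r)]


def min_r_scan(positions, r):
    """Smallest r-scan: the tightest span that contains r + 1 points."""
    return min(r_scans(positions, r))


def poisson_cluster_pvalue(n, r, a):
    """Approximate P(min r-scan <= a) for n i.i.d. uniform points on (0, 1).

    Simple Poisson heuristic (an assumption of this sketch, not the
    chapter's compound Poisson method): the number of r-scans below a
    is treated as Poisson with mean n * (n * a)**r / r!.
    """
    lam = n * (n * a) ** r / math.factorial(r)
    return 1.0 - math.exp(-lam)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed; simulated positions, not genome data
    n, r = 300, 5
    points = [rng.random() for _ in range(n)]
    m = min_r_scan(points, r)
    print("min r-scan:", m)
    print("approx. P(min r-scan <= observed):", poisson_cluster_pvalue(n, r, m))
```

In an application, the positions would be palindrome locations divided by the genome length, and an unusually small minimum r-scan would flag a candidate cluster. Because small r-scans overlap and tend to occur in clumps, the simple Poisson figure is only a rough guide, which is one motivation for compound Poisson approximations.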

References

  1. Aldous, D. (1989). Probability Approximations via the Poisson Clumping Heuristic. New York: Springer-Verlag.

  2. Arratia, R., Goldstein, L. and Gordon, L. (1989). Two moments suffice for Poisson approximations: The Chen-Stein method. Annals of Probability 17, 9–25.

  3. Arratia, R., Goldstein, L. and Gordon, L. (1990). Poisson approximation and the Chen-Stein method. Statistical Science 5, 403–434.

  4. Barbour, A. D., Holst, L. and Janson, S. (1992). Poisson Approximation. Oxford: Clarendon Press.

  5. Berman, M. and Eagleson, G. K. (1985). A useful upper bound for the tail probabilities of the scan statistic when the sample size is large. Journal of the American Statistical Association 80, 886–889.

  6. Chen, L. H. Y. (1975). Poisson approximation for dependent trials. Annals of Probability 3, 534–545.

  7. Cressie, N. (1977). The minimum of higher order gaps. Australian Journal of Statistics 19, 132–143.

  8. Dembo, A. and Karlin, S. (1992). Poisson approximations for r-scan processes. Annals of Applied Probability 2, 329–357.

  9. Doolittle, R. F. (Ed.) (1990). Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences. Methods in Enzymology 183. San Diego: Academic Press.

  10. Farrell, P. J. (1993). Epstein-Barr virus. In Genetic Maps Sixth Edition Book 1 Viruses (Ed., S. J. O’Brien), pp. 1.120–1.133, Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.

  11. Glaz, J. (1989). Approximations and bounds for the distribution of the scan statistic. Journal of the American Statistical Association 84, 560–566.

  12. Glaz, J. (1992). Approximations for tail probabilities and moments of the scan statistic. Computational Statistics & Data Analysis 14, 213–227.

  13. Glaz, J. and Naus, J. (1991). Tight bounds and approximations for scan statistic probabilities for discrete data. Annals of Applied Probability 1, 306–318.

  14. Glaz, J., Naus, J., Roos, M. and Wallenstein, S. (1994). Poisson approximations for the distribution and moments of ordered m-spacings. Journal of Applied Probability 31, 271–281.

  15. Huffer, F. E. and Lin, C.-T. (1998a). Computing the exact distribution of the extremes of sums of consecutive spacings. Computational Statistics & Data Analysis (to appear).

  16. Huffer, F. E. and Lin, C.-T. (1998b). Approximating the distribution of the scan statistic using moments of the number of clumps. Journal of the American Statistical Association (to appear).

  17. Karlin, S., Blaisdell, B. E., Sapolsky, R. J., Cardon, L. and Burge, C. (1993). Assessments of DNA inhomogeneities in yeast chromosome III. Nucleic Acids Research 21, 703–711.

  18. Karlin, S. and Brendel, V. (1992). Chance and statistical significance in protein and DNA sequence analysis. Science 257, 39–49.

  19. Karlin, S. and Cardon, L. R. (1994). Computational DNA sequence analysis. Annual Review of Microbiology 48, 619–654.

  20. Karlin, S., Mrázek, J. and Campbell, A. M. (1996). Frequent oligonucleotides and peptides of the Haemophilus influenzae genome. Nucleic Acids Research 24, 4263–4272.

  21. Karlin, S., Mrázek, J. and Campbell, A. M. (1997). Compositional biases of bacterial genomes and evolutionary implications. Journal of Bacteriology 179, 3899–3913.

  22. Karlin, S. and Taylor, H. M. (1981). A Second Course in Stochastic Processes. Second edition, New York: Academic Press.

  23. Labrecque, L. G., Barnes, D. M., Fentiman, I. S. and Griffin, B. E. (1995). Epstein-Barr virus in epithelial cell tumors: a breast cancer study. Cancer Research 55, 39–45.

  24. Leung, M. Y., Blaisdell, B. E., Burge, C. and Karlin, S. (1991). An efficient algorithm for identifying matches with errors in multiple long molecular sequences. Journal of Molecular Biology 221, 1367–1378.

  25. Leung, M. Y., Schachtel, G. A. and Yu, H. S. (1994). Scan statistics and DNA sequence analysis: the search for an origin of replication in a virus. Nonlinear World 1, 445–471.

  26. Leung, M. Y., Marsh, G. M. and Speed, T. P. (1996). Over- and under-representation of short DNA words in herpesvirus genomes. Journal of Computational Biology 3, 345–360.

  27. Masse, M. J., Karlin, S., Schachtel, G. A. and Mocarski, E. S. (1992). Human cytomegalovirus origin of DNA replication (oriLyt) resides within a highly complex repetitive region. Proceedings of the National Academy of Sciences USA 89, 5246–5250.

  28. McGeoch, D. J. and Schaffer, P. A. (1993). Herpes Simplex Virus. In Genetic Maps Sixth Edition Book 1 Viruses (Ed., S. J. O’Brien), pp. 1.147–1.156, Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.

  29. Naus, J. I. and Sheng, K.-N. (1996). Screening for unusual matched segments in multiple protein sequences. Communications in Statistics—Simulation and Computation 25, 937–952.

  30. Roos, M. (1993). Compound Poisson approximations for the numbers of extreme spacings. Advances in Applied Probability 25, 847–874.

  31. Sheng, K. and Naus, J. (1994). Pattern matching between two nonaligned random sequences. Bulletin of Mathematical Biology 56, 1143–1162.

  32. Vital, C., Monlun, E., Vital, A., Martin-Negrier, M. L., Cales, V., Leger, F., Longy-Boursier, M., Le Bras, M. and Bloch, B. (1995). Concurrent herpes simplex type 1 necrotizing encephalitis, cytomegalovirus ventriculoencephalitis and cerebral lymphoma in an AIDS patient. Acta Pathologica 89, 105–108.

  33. Waterman, M. S. (Ed.) (1989). Mathematical Methods for DNA Sequences. Boca Raton: CRC Press.

  34. Waterman, M. S. (1995). Introduction to Computational Biology. New York: Chapman and Hall.

  35. Weston, K. (1988). An enhancer element in the short unique region of human cytomegalovirus regulates the production of a group of abundant immediate early transcripts. Virology 162, 406–416.

Copyright information

© 1999 Springer Science+Business Media New York

About this chapter

Cite this chapter

Leung, MY., Yamashita, T.E. (1999). Applications of the Scan Statistic in DNA Sequence Analysis. In: Glaz, J., Balakrishnan, N. (eds) Scan Statistics and Applications. Statistics for Industry and Technology. Birkhäuser, Boston, MA. https://doi.org/10.1007/978-1-4612-1578-3_12

  • DOI: https://doi.org/10.1007/978-1-4612-1578-3_12

  • Publisher Name: Birkhäuser, Boston, MA

  • Print ISBN: 978-1-4612-7201-4

  • Online ISBN: 978-1-4612-1578-3

  • eBook Packages: Springer Book Archive
