Applications of the Scan Statistic in DNA Sequence Analysis

Leung, Ming-Ying; Yamashita, Traci E.

doi:10.1007/978-1-4612-1578-3_12

Part of the book series: Statistics for Industry and Technology (SIT)

Abstract

Advances in biochemical techniques have made available large databases of long DNA sequences. These sequences reflect conglomerates of random and nonrandom letter strings from the nucleotide alphabet {A, C, G, T}. As the databases expand, mathematical methods play an increasingly important role in analyzing and interpreting the rapidly accumulating DNA data. In this chapter, we discuss a specific example of identifying nonrandom clusters of palindromes in a family of herpesvirus genomes using the r-scan statistic. Palindrome positions on the genome are modeled by i.i.d. random variables uniformly distributed on the unit interval (0,1). After a comparison of three Poisson-type approximations, the r-scan distribution is computed by a compound Poisson approximation proposed by Glaz (1994). Some of the significant palindrome clusters are located at genome regions containing origins of replication and regulatory signals of the herpesviruses.
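
As an illustration of the model described in the abstract, the following is a minimal Python sketch, not the chapter's method. It assumes the common definition of an r-scan as the distance spanned by r consecutive spacings between ordered points, and it substitutes plain Monte Carlo simulation for the compound Poisson approximation used in the chapter. The function names (`r_scans`, `min_r_scan`, `monte_carlo_p_value`) and the default trial count are hypothetical choices made for this example.

```python
import random


def r_scans(positions, r):
    """Return the r-scans X_(i+r) - X_(i) of the ordered positions.

    Each r-scan is the sum of r consecutive spacings (assumed definition).
    """
    xs = sorted(positions)
    return [xs[i + r] - xs[i] for i in range(len(xs) - r)]


def min_r_scan(positions, r):
    """Smallest r-scan; a small value suggests a tight cluster of r + 1 points."""
    return min(r_scans(positions, r))


def monte_carlo_p_value(observed_min, n, r, trials=2000, seed=0):
    """Estimate P(min r-scan <= observed_min) for n i.i.d. Uniform(0,1) points.

    Simulation stands in here for an analytic approximation.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        if min_r_scan(sample, r) <= observed_min:
            hits += 1
    return hits / trials
```

In use, palindrome positions would first be divided by the genome length so that they fall in (0,1). A small estimated probability for the observed minimum r-scan then indicates a cluster that is unlikely under the uniform model.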

References

Aldous, D. (1989). Probability Approximations via the Poisson Clumping Heuristic, New York: Springer-Verlag.
Arratia, R., Goldstein, L. and Gordon, L. (1989). Two moments suffice for Poisson approximations: The Chen-Stein method, Annals of Probability, 17, 9–25.
Arratia, R., Goldstein, L. and Gordon, L. (1990). Poisson approximation and the Chen-Stein method, Statistical Science, 5, 403–434.
Barbour, A. D., Holst, L. and Janson, S. (1992). Poisson Approximation, Oxford: Clarendon Press.
Berman, M. and Eagleson, G. K. (1985). A useful upper bound for the tail probabilities of the scan statistic when the sample size is large, Journal of the American Statistical Association, 80, 886–889.
Chen, L. H. Y. (1975). Poisson approximation for dependent trials, Annals of Probability, 3, 534–545.
Cressie, N. (1977). The minimum of higher order gaps, Australian Journal of Statistics, 19, 132–143.
Dembo, A. and Karlin, S. (1992). Poisson approximations for r-scan processes, Annals of Applied Probability, 2, 329–357.
Doolittle, R. F. (Ed.) (1990). Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, Methods in Enzymology, 183, San Diego: Academic Press.
Farrell, P. J. (1993). Epstein-Barr virus, In Genetic Maps, Sixth Edition, Book 1: Viruses (Ed., S. J. O’Brien), pp. 1.120–1.133, Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.
Glaz, J. (1989). Approximations and bounds for the distribution of the scan statistic, Journal of the American Statistical Association, 84, 560–566.
Glaz, J. (1992). Approximations for tail probabilities and moments of the scan statistic, Computational Statistics & Data Analysis, 14, 213–227.
Glaz, J. and Naus, J. (1991). Tight bounds and approximations for scan statistic probabilities for discrete data, Annals of Applied Probability, 1, 306–318.
Glaz, J., Naus, J., Roos, M. and Wallenstein, S. (1994). Poisson approximations for the distribution and moments of ordered m-spacings, Journal of Applied Probability, 31, 271–281.
Huffer, F. E. and Lin, C.-T. (1998a). Computing the exact distribution of the extremes of sums of consecutive spacings, Computational Statistics & Data Analysis (to appear).
Huffer, F. E. and Lin, C.-T. (1998b). Approximating the distribution of the scan statistic using moments of the number of clumps, Journal of the American Statistical Association (to appear).
Karlin, S., Blaisdell, B. E., Sapolsky, R. J., Cardon, L. and Burge, C. (1993). Assessments of DNA inhomogeneities in yeast chromosome III, Nucleic Acids Research, 21, 703–711.
Karlin, S. and Brendel, V. (1992). Chance and statistical significance in protein and DNA sequence analysis, Science, 257, 39–49.
Karlin, S. and Cardon, L. R. (1994). Computational DNA sequence analysis, Annual Reviews of Microbiology, 48, 619–654.
Karlin, S., Mrázek, J. and Campbell, A. M. (1996). Frequent oligonucleotides and peptides of the Haemophilus influenzae genome, Nucleic Acids Research, 24, 4263–4272.
Karlin, S., Mrázek, J. and Campbell, A. M. (1997). Compositional biases of bacterial genomes and evolutionary implications, Journal of Bacteriology, 179, 3899–3913.
Karlin, S. and Taylor, H. M. (1981). A Second Course in Stochastic Processes, Second edition, New York: Academic Press.
Labrecque, L. G., Barnes, D. M., Fentiman, I. S. and Griffin, B. E. (1995). Epstein-Barr virus in epithelial cell tumors: a breast cancer study, Cancer Research, 55, 39–45.
Leung, M. Y., Blaisdell, B. E., Burge, C. and Karlin, S. (1991). An efficient algorithm for identifying matches with errors in multiple long molecular sequences, Journal of Molecular Biology, 221, 1367–1378.
Leung, M. Y., Schachtel, G. A. and Yu, H. S. (1994). Scan statistics and DNA sequence analysis: the search for an origin of replication in a virus, Nonlinear World, 1, 445–471.
Leung, M. Y., Marsh, G. M. and Speed, T. P. (1996). Over- and under-representation of short DNA words in herpesvirus genomes, Journal of Computational Biology, 3, 345–360.
Masse, M. J., Karlin, S., Schachtel, G. A. and Mocarski, E. S. (1992). Human cytomegalovirus origin of DNA replication (oriLyt) resides within a highly complex repetitive region, Proceedings of the National Academy of Science USA, 89, 5246–5250.
McGeoch, D. J. and Schaffer, P. A. (1993). Herpes Simplex Virus, In Genetic Maps, Sixth Edition, Book 1: Viruses (Ed., S. J. O’Brien), pp. 1.147–1.156, Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.
Naus, J. I. and Sheng, K.-N. (1996). Screening for unusual matched segments in multiple protein sequences, Communications in Statistics - Simulation and Computation, 25, 937–952.
Roos, M. (1993). Compound Poisson approximations for the numbers of extreme spacings, Advances in Applied Probability, 25, 847–874.
Sheng, K. and Naus, J. (1994). Pattern matching between two nonaligned random sequences, Bulletin of Mathematical Biology, 56, 1143–1162.
Vital, C., Monlun, E., Vital, A., Martin-Negrier, M. L., Cales, V., Leger, F., Longy-Boursier, M., Le Bras, M. and Bloch, B. (1995). Concurrent herpes simplex type 1 necrotizing encephalitis, cytomegalovirus ventriculoencephalitis and cerebral lymphoma in an AIDS patient, Acta Pathologica, 89, 105–108.
Waterman, M. S. (Ed.) (1989). Mathematical Methods for DNA Sequences, Boca Raton: CRC Press.
Waterman, M. S. (1995). Introduction to Computational Biology, New York: Chapman and Hall.
Weston, K. (1988). An enhancer element in the short unique region of human cytomegalovirus regulates the production of a group of abundant immediate early transcripts, Virology, 162, 406–416.

Author information

Authors and Affiliations

University of Texas at San Antonio, San Antonio, TX, USA
Ming-Ying Leung & Traci E. Yamashita
Johns Hopkins School of Hygiene and Public Health, Baltimore, MD, USA
Ming-Ying Leung & Traci E. Yamashita

Editor information

Editors and Affiliations

Department of Statistics, University of Connecticut at Storrs, Storrs, CT, 06269-3120, USA
Joseph Glaz
Department of Mathematics and Statistics, McMaster University, Hamilton, Ontario, L8S 4K1, Canada
N. Balakrishnan

Cite this chapter

Leung, MY., Yamashita, T.E. (1999). Applications of the Scan Statistic in DNA Sequence Analysis. In: Glaz, J., Balakrishnan, N. (eds) Scan Statistics and Applications. Statistics for Industry and Technology. Birkhäuser, Boston, MA. https://doi.org/10.1007/978-1-4612-1578-3_12

DOI: https://doi.org/10.1007/978-1-4612-1578-3_12
Publisher Name: Birkhäuser, Boston, MA
Print ISBN: 978-1-4612-7201-4
Online ISBN: 978-1-4612-1578-3
eBook Packages: Springer Book Archive
