Abstract
Advances in biochemical techniques have made available large databases of long DNA sequences. These sequences reflect conglomerates of random and nonrandom letter strings from the nucleotide alphabet {A, C, G, T}. As the databases expand, mathematical methods play an increasingly important role in analyzing and interpreting the rapidly accumulating DNA data. In this chapter, we discuss a specific example of identifying nonrandom clusters of palindromes in a family of herpesvirus genomes using the r-scan statistic. Palindrome positions on the genome are modeled by i.i.d. random variables uniformly distributed on the unit interval (0,1). After a comparison of three Poisson-type approximations, the r-scan distribution is computed by a compound Poisson approximation proposed by Glaz et al. (1994). Some of the significant palindrome clusters are located at genome regions containing origins of replication and regulatory signals of the herpesviruses.
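As a rough illustration of the setup described in the abstract, the following Python sketch computes r-scans (distances spanned by r consecutive gaps between sorted positions) and estimates how unusual the smallest r-scan is under the uniform model. It is a sketch under stated assumptions, not the chapter's method: the tail probability is estimated by plain Monte Carlo simulation rather than by the compound Poisson approximation, and the function names, the example positions, and the genome length are all hypothetical.

```python
import random


def r_scans(positions, r):
    """Return the r-scans X_(i+r) - X_(i) of the sorted positions."""
    xs = sorted(positions)
    return [xs[i + r] - xs[i] for i in range(len(xs) - r)]


def min_r_scan(positions, r):
    """Smallest r-scan; a small value suggests a tight cluster of r+1 points."""
    scans = r_scans(positions, r)
    if not scans:
        raise ValueError("need more than r positions")
    return min(scans)


def mc_pvalue_min_r_scan(observed_min, n, r, trials=2000, seed=0):
    """Monte Carlo estimate of P(min r-scan <= observed_min) when the n
    positions are i.i.d. uniform on (0, 1).

    Assumption: simulation stands in for the Poisson-type approximations
    discussed in the chapter.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        if min_r_scan(sample, r) <= observed_min:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Hypothetical palindrome start positions (in bases) on a genome of
    # hypothetical length 100000; rescale to the unit interval first.
    genome_length = 100_000
    raw = [1200, 5300, 5350, 5410, 5480, 22000, 47000, 63000, 81000, 95000]
    scaled = [p / genome_length for p in raw]
    observed = min_r_scan(scaled, r=3)
    print(observed, mc_pvalue_min_r_scan(observed, n=len(scaled), r=3))
```

A small estimated probability would flag the tightest group of r+1 palindromes as a candidate nonrandom cluster. With many palindromes or very small tail probabilities, simulation becomes expensive, which is one practical reason to use analytic approximations instead.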
References
Aldous, D. (1989). Probability Approximations via the Poisson Clumping Heuristic, New York: Springer-Verlag.
Arratia, R., Goldstein, L. and Gordon, L. (1989). Two moments suffice for Poisson approximations: The Chen-Stein method, Annals of Probability, 17, 9–25.
Arratia, R., Goldstein, L. and Gordon, L. (1990). Poisson approximation and the Chen-Stein method, Statistical Science, 5, 403–434.
Barbour, A. D., Holst, L. and Janson, S. (1992). Poisson Approximation, Oxford: Clarendon Press.
Berman, M. and Eagleson, G. K. (1985). A useful upper bound for the tail probabilities of the scan statistic when the sample size is large, Journal of the American Statistical Association, 80, 886–889.
Chen, L. H. Y. (1975). Poisson approximation for dependent trials, Annals of Probability, 3, 534–545.
Cressie, N. (1977). The minimum of higher order gaps, Australian Journal of Statistics, 19, 132–143.
Dembo, A. and Karlin, S. (1992). Poisson approximations for r-scan processes, Annals of Applied Probability, 2, 329–357.
Doolittle, R. F. (Ed.) (1990). Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, Methods of Enzymology, 183, San Diego: Academic Press.
Farrell, P. J. (1993). Epstein-Barr virus, In Genetic Maps, Sixth Edition, Book 1: Viruses (Ed., S. J. O’Brien), pp. 1.120–1.133, Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.
Glaz, J. (1989). Approximations and bounds for the distribution of the scan statistic, Journal of the American Statistical Association, 84, 560–566.
Glaz, J. (1992). Approximations for tail probabilities and moments of the scan statistic, Computational Statistics & Data Analysis, 14, 213–227.
Glaz, J. and Naus, J. (1991). Tight bounds and approximations for scan statistic probabilities for discrete data, Annals of Applied Probability, 1, 306–318.
Glaz, J., Naus, J., Roos, M. and Wallenstein, S. (1994). Poisson approximations for the distribution and moments of ordered m-spacings, Journal of Applied Probability, 31, 271–281.
Huffer, F. E. and Lin, C.-T. (1998a). Computing the exact distribution of the extremes of sums of consecutive spacings, Computational Statistics & Data Analysis (to appear).
Huffer, F. E. and Lin, C.-T. (1998b). Approximating the distribution of the scan statistic using moments of the number of clumps, Journal of the American Statistical Association (to appear).
Karlin, S., Blaisdell, B. E., Sapolsky, R. J., Cardon, L. and Burge, C. (1993). Assessments of DNA inhomogeneities in yeast chromosome III, Nucleic Acids Research, 21, 703–711.
Karlin, S. and Brendel, V. (1992). Chance and statistical significance in protein and DNA sequence analysis, Science, 257, 39–49.
Karlin, S. and Cardon, L. R. (1994). Computational DNA sequence analysis, Annual Reviews of Microbiology, 48, 619–654.
Karlin, S., Mrázek, J. and Campbell, A. M. (1996). Frequent oligonucleotides and peptides of the Haemophilus influenzae genome, Nucleic Acids Research, 24, 4263–4272.
Karlin, S., Mrázek, J. and Campbell, A. M. (1997). Compositional biases of bacterial genomes and evolutionary implications, Journal of Bacteriology, 179, 3899–3913.
Karlin, S. and Taylor, H. M. (1981). A Second Course in Stochastic Processes, Second edition, New York: Academic Press.
Labrecque, L. G., Barnes, D. M., Fentiman, I. S. and Griffin, B. E. (1995). Epstein-Barr virus in epithelial cell tumors: a breast cancer study, Cancer Research, 55, 39–45.
Leung, M. Y., Blaisdell, B. E., Burge, C. and Karlin, S. (1991). An efficient algorithm for identifying matches with errors in multiple long molecular sequences, Journal of Molecular Biology, 221, 1367–1378.
Leung, M. Y., Schachtel, G. A. and Yu, H. S. (1994). Scan statistics and DNA sequence analysis: the search for an origin of replication in a virus, Nonlinear World, 1, 445–471.
Leung, M. Y., Marsh, G. M. and Speed, T. P. (1996). Over- and under-representation of short DNA words in herpesvirus genomes, Journal of Computational Biology, 3, 345–360.
Masse, M. J., Karlin, S., Schachtel, G. A. and Mocarski, E. S. (1992). Human cytomegalovirus origin of DNA replication (oriLyt) resides within a highly complex repetitive region, Proceedings of the National Academy of Science USA, 89, 5246–5250.
McGeoch, D. J. and Schaffer, P. A. (1993). Herpes Simplex Virus, In Genetic Maps, Sixth Edition, Book 1: Viruses (Ed., S. J. O’Brien), pp. 1.147–1.156, Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.
Naus, J. I. and Sheng, K.-N. (1996). Screening for unusual matched segments in multiple protein sequences, Communications in Statistics - Simulation and Computation, 25, 937–952.
Roos, M. (1993). Compound Poisson approximations for the numbers of extreme spacings, Advances in Applied Probability, 25, 847–874.
Sheng, K. and Naus, J. (1994). Pattern matching between two nonaligned random sequences, Bulletin of Mathematical Biology, 56, 1143–1162.
Vital, C., Monlun, E., Vital, A., Martin-Negrier, M. L., Cales, V., Leger, F., Longy-Boursier, M., Le Bras, M. and Bloch, B. (1995). Concurrent herpes simplex type 1 necrotizing encephalitis, cytomegalovirus ventriculoencephalitis and cerebral lymphoma in an AIDS patient, Acta Pathologica, 89, 105–108.
Waterman, M. S. (Ed.) (1989). Mathematical Methods for DNA Sequences, Boca Raton: CRC Press.
Waterman, M. S. (1995). Introduction to Computational Biology, New York: Chapman and Hall.
Weston, K. (1988). An enhancer element in the short unique region of human cytomegalovirus regulates the production of a group of abundant immediate early transcripts, Virology, 162, 406–416.
Copyright information
© 1999 Springer Science+Business Media New York
About this chapter
Cite this chapter
Leung, MY., Yamashita, T.E. (1999). Applications of the Scan Statistic in DNA Sequence Analysis. In: Glaz, J., Balakrishnan, N. (eds) Scan Statistics and Applications. Statistics for Industry and Technology. Birkhäuser, Boston, MA. https://doi.org/10.1007/978-1-4612-1578-3_12
DOI: https://doi.org/10.1007/978-1-4612-1578-3_12
Publisher Name: Birkhäuser, Boston, MA
Print ISBN: 978-1-4612-7201-4
Online ISBN: 978-1-4612-1578-3
eBook Packages: Springer Book Archive