Abstract
Mathematically, a DNA segment can be viewed as a sequence of four-state (A, C, G, T) trials, and a perfect match of size M occurs when two DNA sequences have at least one identical subsequence (or pattern) of length M. Pattern matching probabilities are crucial for statistically rigorous comparisons of DNA (and other) sequences, and many bounds and approximations of such probabilities have recently been developed. There are few results on exact probabilities, especially for trials with unequal state probabilities, and no exact analytical formulae for the pattern matching probability involving arbitrarily long nonaligned sequences. In this chapter, a simple and efficient method based on the finite Markov chain imbedding technique is developed to obtain the exact probability of perfect matching for i.i.d. four-state trials with either equal or unequal state probabilities. A large deviation approximation is derived for very long sequences, and numerical examples are given to illustrate the results.
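To make the definition of a perfect match concrete, the following is a minimal Python sketch, not the chapter's finite Markov chain imbedding method. It checks whether two sequences share an identical contiguous word of length M, and computes the exact matching probability by brute-force enumeration of all sequence pairs. The function names `has_perfect_match` and `exact_match_probability` and the probability dictionaries are assumptions made for illustration; "subsequence" is read here as a contiguous word, and the enumeration is feasible only for very short sequences.

```python
from itertools import product
from math import prod


def has_perfect_match(x, y, m):
    """Return True if x and y share at least one identical contiguous word of length m."""
    words_x = {x[i:i + m] for i in range(len(x) - m + 1)}
    return any(y[j:j + m] in words_x for j in range(len(y) - m + 1))


def exact_match_probability(n1, n2, m, probs):
    """Exact probability of a perfect match of size m between two independent
    i.i.d. sequences of lengths n1 and n2, by exhaustive enumeration.

    probs maps each letter to its state probability (equal or unequal).
    Illustrative only: the cost grows like 4 ** (n1 + n2).
    """
    letters = list(probs)

    def weighted(n):
        # Yield every sequence of length n together with its probability.
        for seq in product(letters, repeat=n):
            yield "".join(seq), prod(probs[c] for c in seq)

    seqs2 = list(weighted(n2))
    total = 0.0
    for x, px in weighted(n1):
        for y, py in seqs2:
            if has_perfect_match(x, y, m):
                total += px * py
    return total


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    unequal = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
    print(exact_match_probability(4, 4, 2, uniform))
    print(exact_match_probability(4, 4, 2, unequal))
```

Exhaustive enumeration is useful only as a check on small cases; its exponential cost is the reason an efficient exact method is needed for sequences of realistic length.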
Copyright information
© 1999 Springer Science+Business Media New York
About this chapter
Cite this chapter
Fu, J.C., Lou, W.Y.W., Chen, S.C. (1999). On the Probability of Pattern Matching in Nonaligned DNA Sequences: A Finite Markov Chain Imbedding Approach. In: Glaz, J., Balakrishnan, N. (eds) Scan Statistics and Applications. Statistics for Industry and Technology. Birkhäuser, Boston, MA. https://doi.org/10.1007/978-1-4612-1578-3_13
DOI: https://doi.org/10.1007/978-1-4612-1578-3_13
Publisher Name: Birkhäuser, Boston, MA
Print ISBN: 978-1-4612-7201-4
Online ISBN: 978-1-4612-1578-3
eBook Packages: Springer Book Archive