On the Probability of Pattern Matching in Nonaligned DNA Sequences: A Finite Markov Chain Imbedding Approach

Fu, James C.; Lou, W. Y. Wendy; Chen, S. C.

doi:10.1007/978-1-4612-1578-3_13


Part of the book series: Statistics for Industry and Technology (SIT)


Abstract

Mathematically, a DNA segment can be viewed as a sequence of four-state (A, C, G, T) trials, and a perfect match of size M occurs when two DNA sequences have at least one identical subsequence (or pattern) of length M. Pattern matching probabilities are crucial for statistically rigorous comparisons of DNA (and other) sequences, and many bounds and approximations of such probabilities have recently been developed. There are few results on exact probabilities, especially for trials with unequal state probabilities, and no exact analytical formulae for the pattern matching probability involving arbitrarily long nonaligned sequences. In this chapter, a simple and efficient method based on the finite Markov chain imbedding technique is developed to obtain the exact probability of perfect matching for i.i.d. four-state trials with either equal or unequal state probabilities. A large deviation approximation is derived for very long sequences, and numerical examples are given to illustrate the results.
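To give a feel for the finite Markov chain imbedding idea, the sketch below treats a deliberately simplified problem, not the chapter's two-sequence matching problem: the probability that one fixed pattern occurs at least once in a single i.i.d. four-state sequence of length n. The function names (`build_transitions`, `pattern_probability`, `brute_force`) and the example probabilities are illustrative assumptions and do not come from the chapter. Each chain state records how many leading letters of the pattern are currently matched, and the fully matched state is absorbing, so the probability mass in that state after n steps is the exact answer.

```python
from itertools import product

ALPHABET = "ACGT"


def build_transitions(pattern):
    """Transition table of the imbedded chain (illustrative helper).

    State k (0 <= k < len(pattern)) means: the longest prefix of `pattern`
    that is a suffix of the sequence read so far has length k.
    """
    m = len(pattern)
    table = []
    for k in range(m):
        row = {}
        for ch in ALPHABET:
            s = pattern[:k] + ch
            j = len(s)
            # Shrink j until the last j letters of s equal the first j of pattern.
            while j > 0 and s[-j:] != pattern[:j]:
                j -= 1
            row[ch] = j
        table.append(row)
    return table


def pattern_probability(pattern, n, probs):
    """Exact P(pattern occurs at least once in n i.i.d. trials)."""
    m = len(pattern)
    table = build_transitions(pattern)
    dist = [0.0] * (m + 1)
    dist[0] = 1.0  # nothing matched before the first trial
    for _ in range(n):
        new = [0.0] * (m + 1)
        new[m] = dist[m]  # state m (pattern seen) is absorbing
        for k in range(m):
            if dist[k] == 0.0:
                continue
            for ch in ALPHABET:
                new[table[k][ch]] += dist[k] * probs[ch]
        dist = new
    return dist[m]


def brute_force(pattern, n, probs):
    """Enumerate all 4**n sequences; only feasible for small n (sanity check)."""
    total = 0.0
    for letters in product(ALPHABET, repeat=n):
        seq = "".join(letters)
        if pattern in seq:
            p = 1.0
            for ch in seq:
                p *= probs[ch]
            total += p
    return total


if __name__ == "__main__":
    # Hypothetical unequal state probabilities, chosen only for illustration.
    probs = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
    print(pattern_probability("ACA", 6, probs))
    print(brute_force("ACA", 6, probs))
```

The chain has only M + 1 states for a pattern of length M, so the cost grows linearly in n, whereas direct enumeration grows as 4 to the power n. Handling matches between two nonaligned random sequences, as the chapter does, requires a richer state space than this single-pattern sketch.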



Author information

Authors and Affiliations

University of Manitoba, Winnipeg, Manitoba, Canada
James C. Fu
Mount Sinai School of Medicine, New York, NY, USA
W. Y. Wendy Lou
National Donghwa University, Hualian, Taiwan, R.O.C.
S. C. Chen


Editor information

Editors and Affiliations

Department of Statistics, University of Connecticut at Storrs, Storrs, CT, 06269-3120, USA
Joseph Glaz
Department of Mathematics and Statistics, McMaster University, Hamilton, Ontario, L8S 4K1, Canada
N. Balakrishnan


About this chapter

Cite this chapter

Fu, J.C., Lou, W.Y.W., Chen, S.C. (1999). On the Probability of Pattern Matching in Nonaligned DNA Sequences: A Finite Markov Chain Imbedding Approach. In: Glaz, J., Balakrishnan, N. (eds) Scan Statistics and Applications. Statistics for Industry and Technology. Birkhäuser, Boston, MA. https://doi.org/10.1007/978-1-4612-1578-3_13


DOI: https://doi.org/10.1007/978-1-4612-1578-3_13
Publisher Name: Birkhäuser, Boston, MA
Print ISBN: 978-1-4612-7201-4
Online ISBN: 978-1-4612-1578-3
eBook Packages: Springer Book Archive
