On the Probability of Pattern Matching in Nonaligned DNA Sequences: A Finite Markov Chain Imbedding Approach

  • Chapter
Scan Statistics and Applications

Part of the book series: Statistics for Industry and Technology (SIT)

Abstract

Mathematically, a DNA segment can be viewed as a sequence of four-state (A, C, G, T) trials, and a perfect match of size M occurs when two DNA sequences have at least one identical subsequence (or pattern) of length M. Pattern matching probabilities are crucial for statistically rigorous comparisons of DNA (and other) sequences, and many bounds and approximations of such probabilities have recently been developed. There are few results on exact probabilities, especially for trials with unequal state probabilities, and no exact analytical formulae for the pattern matching probability involving arbitrarily long nonaligned sequences. In this chapter, a simple and efficient method based on the finite Markov chain imbedding technique is developed to obtain the exact probability of perfect matching for i.i.d. four-state trials with either equal or unequal state probabilities. A large deviation approximation is derived for very long sequences, and numerical examples are given to illustrate the results.
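
As a rough illustration of the imbedding idea, the sketch below handles a much simpler problem than the one treated in the chapter: the exact probability that one fixed pattern occurs somewhere in a single i.i.d. four-state sequence of length n. It is not the chapter's construction for two nonaligned sequences. The function names, the state definition (number of pattern letters currently matched, with the full match as an absorbing state), and the example probabilities are all assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"


def build_transitions(pattern):
    """Return trans[state][letter] -> next state.

    A state is the number of leading pattern letters currently matched
    (hypothetical state definition chosen for this sketch).
    """
    m = len(pattern)
    trans = []
    for state in range(m):
        row = {}
        for ch in ALPHABET:
            candidate = pattern[:state] + ch
            # Longest suffix of the candidate that is also a prefix of the pattern.
            k = len(candidate)
            while k > 0 and candidate[-k:] != pattern[:k]:
                k -= 1
            row[ch] = k
        trans.append(row)
    return trans


def pattern_occurrence_probability(pattern, n, probs):
    """Exact P(pattern occurs in an i.i.d. sequence of length n)."""
    m = len(pattern)
    trans = build_transitions(pattern)
    dist = [0.0] * (m + 1)  # distribution over states 0..m
    dist[0] = 1.0
    for _ in range(n):
        new = [0.0] * (m + 1)
        new[m] = dist[m]  # state m is absorbing: the pattern has been seen
        for state in range(m):
            for ch in ALPHABET:
                new[trans[state][ch]] += dist[state] * probs[ch]
        dist = new
    return dist[m]


def brute_force_probability(pattern, n, probs):
    """Check by enumerating all 4**n sequences (feasible only for small n)."""
    total = 0.0
    for letters in product(ALPHABET, repeat=n):
        seq = "".join(letters)
        if pattern in seq:
            p = 1.0
            for ch in seq:
                p *= probs[ch]
            total += p
    return total


if __name__ == "__main__":
    # Unequal state probabilities, chosen arbitrarily for the example.
    probs = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
    print(pattern_occurrence_probability("ACA", 6, probs))
    print(brute_force_probability("ACA", 6, probs))
```

The chain has only M + 1 states, so the cost grows linearly in n, whereas direct enumeration grows as 4^n. The brute-force function is included only as a cross-check for short sequences.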

Copyright information

© 1999 Springer Science+Business Media New York

About this chapter

Cite this chapter

Fu, J.C., Lou, W.Y.W., Chen, S.C. (1999). On the Probability of Pattern Matching in Nonaligned DNA Sequences: A Finite Markov Chain Imbedding Approach. In: Glaz, J., Balakrishnan, N. (eds) Scan Statistics and Applications. Statistics for Industry and Technology. Birkhäuser, Boston, MA. https://doi.org/10.1007/978-1-4612-1578-3_13

  • DOI: https://doi.org/10.1007/978-1-4612-1578-3_13

  • Publisher Name: Birkhäuser, Boston, MA

  • Print ISBN: 978-1-4612-7201-4

  • Online ISBN: 978-1-4612-1578-3

  • eBook Packages: Springer Book Archive
