Journal of Mathematical Biology

, Volume 56, Issue 1–2, pp 51–92 | Cite as

Multiple pattern matching: a Markov chain approach

Article

Abstract

RNA motifs typically consist of short, modular patterns that include base pairs formed within and between modules. Estimating the abundance of these patterns is of fundamental importance for assessing the statistical significance of matches in genomewide searches, and for predicting whether a given function has evolved many times in different species or arose from a single common ancestor. In this manuscript, we review in an integrated and self-contained manner some basic concepts of automata theory, generating functions and transfer matrix methods that are relevant to pattern analysis in biological sequences. We formalize, in a general framework, the concept of Markov chain embedding to analyze patterns in random strings produced by a memoryless source. This conceptualization, together with the capability of automata to recognize complicated patterns, allows a systematic analysis of problems related to the occurrence and frequency of patterns in random strings. The applications we present focus on the concept of synchronization of automata, as well as automata used to search for a finite number of keywords (including sets of patterns generated according to base pairing rules) in a general text.

Keywords

Automata Exceptional words Functionals of Markov chains Generating functions Markov chain embedding technique Modular pattern Motifs Pattern matching Probability Random strings RNA Transfer matrix methods 

Mathematics Subject Classification (2000)

Primary 65C40 68Q45 92D20 Secondary 68R15 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Aho A.V. and Corasick M.J. (1975). Efficient string matching: an aid to bibliographic search. Commun. ACM 18(6): 333–340 MATHCrossRefMathSciNetGoogle Scholar
  2. 2.
    Aston J.A.D. and Martin D.E.K. (2005). Waiting time distributions of competing patterns in higher-order Markovian sequences. J. Appl. Prob. 42(4): 977–988 MATHCrossRefMathSciNetGoogle Scholar
  3. 3.
    Biggins J.D. and Cannings C. (1987). Markov renewal processes, counters and repeated sequences in Markov chains. Adv. Appl. Prob. 19: 521–545 MATHCrossRefMathSciNetGoogle Scholar
  4. 4.
    Benson, G.: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 573–580 (1999)Google Scholar
  5. 5.
    Bourdeau V., Ferbeyre G., Pageau M., Paquin B. and Cedergren R. (1999). The distribution of RNA motifs in natural sequences. Nucleic Acids Res. 27(22): 4457–4467 CrossRefGoogle Scholar
  6. 6.
    Bender E.A. and Kochman F. (1993). The distribution of subword counts is usually normal. Eur. J. Comb. 14(4): 265–275 MATHCrossRefMathSciNetGoogle Scholar
  7. 7.
    Buhler, J., Keich, U., Sun, Y.: Designing seeds for similarity search in genomic DNA. In: RECOMB ’03: Proceedings of the seventh annual international conference on Research in computational molecular biology, pp. 67–75 (2003)Google Scholar
  8. 8.
    Bussemaker H.J., Li H. and Siggia E.D. (2000). Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. Proc. Natl. Acad. Sci. USA 97(18): 10096–10100 CrossRefMathSciNetGoogle Scholar
  9. 9.
    Brémaud P. (1998). Markov Chains: Gibbs fields, Monte Carlo Simulation and Queues. Springer, Heidelberg Google Scholar
  10. 10.
    Bourdon, J., Vallée, B.: Generalized pattern matching statistics. In: Colloquium on Mathematics and Computer Science: Algorithms and Trees, Trends in Mathematics, pp. 249–265. Birkhauser, 2002Google Scholar
  11. 11.
    Bourdon, J., Vallée, B.: Pattern matching statistics on correlated sources. In: Proceedings of the seventh Latin American Symposium on Theoretical Informatics (LATIN’06), pp. 224–237, Valdivia, Chile (2006)Google Scholar
  12. 12.
    Breen S., Waterman M.S. and Zhang N. (1985). Renewal theory for several patterns. J. Appl. Prob. 22: 228–234 MATHCrossRefMathSciNetGoogle Scholar
  13. 13.
    Clément J., Flajolet P. and Vallée B. (2001). Dynamical sources in information theory: a general analysis of trie structures. Algorithmica 29(1): 307–369 MATHCrossRefMathSciNetGoogle Scholar
  14. 14.
    Chen, X.: Limit theorems for functional of ergodic Markov chains with general state space, vol. 139. Memoirs of the American Mathematical Society, 1999Google Scholar
  15. 15.
    Crochemore M. and Rytter W. (2002). Jewels of Stringology. World Scientific, Singapore Google Scholar
  16. 16.
    Cech T.R., Zaug A.J. and Grabowski P.J. (1981). In vitro splicing of the ribosomal RNA precursor of Tetrahymena: involvement of a guanosine nucleotide in the excision of the intervening sequence. Cell 27(3 Pt 2): 487–496 CrossRefGoogle Scholar
  17. 17.
    Durrett R. (1999). Essentials of Stochastic Processes. Springer, Heidelberg MATHGoogle Scholar
  18. 18.
    Durrett R. (2004). Probability: Theory and Examples, third edition. Duxbury Press, North Scituate Google Scholar
  19. 19.
    Eddy S.R. and Durbin R. (1994). RNA sequence analysis using covariance models. Nucleic Acids Res. 22(11): 2079–2088 CrossRefGoogle Scholar
  20. 20.
    Ferbeyre G., Bourdeau V., Pageau M., Miramontes P. and Cedergren R. (2000). Distribution of hammerhead and hammerhead-like RNA motifs through the GenBank. Genome Res. 10(7): 1011–1019 CrossRefGoogle Scholar
  21. 21.
    Fu J.C. and Chang Y.M. (2002). On probability generating functions for waiting time distributions of compound patterns in a sequence of multistate trials. J. Appl. Prob. 39(1): 70–80 MATHCrossRefMathSciNetGoogle Scholar
  22. 22.
    Fu J.C. and Chang Y.M. (2003). On ordered series and later waiting time distributions in a sequence of Markov dependent multistate trials. J. Appl. Prob. 40(3): 623–642 MATHCrossRefMathSciNetGoogle Scholar
  23. 23.
    Feller W. (1968). An Introduction to Probability Theory and Its Applications third edition. Wiley, New York Google Scholar
  24. 24.
    Felsenstein J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17(6): 368–376 CrossRefGoogle Scholar
  25. 25.
    Fu J.C. and Koutras M.V. (1994). Distribution theory of runs: a Markov chain approach. J. Am. Statist. Assoc. 89(427): 1050–1058 MATHCrossRefMathSciNetGoogle Scholar
  26. 26.
    Flajolet P., Kirschenhofer P. and Tichy R.F. (1988). Deviations from uniformity in random strings. Probab. Th. Rel. Fields 80(1): 139–150 MATHCrossRefMathSciNetGoogle Scholar
  27. 27.
    Fu, J.C., Lou, W.Y.W.: Distribution theory of runs and patterns and its applications. A finite Markov chain imbedding approach. World Scientific, Singapor (2003)Google Scholar
  28. 28.
    Flajolet, P., Sedgewick, R.: Analytic Combinatorics, 2006. Electronic version available online at http://algo.inria.fr/flajolet/Publications/book060418.pdfGoogle Scholar
  29. 29.
    Flajolet P., Szpankowski W. and Vallée B. (2006). Hidden word statistics. J. ACM 53(1): 147–183 CrossRefMathSciNetGoogle Scholar
  30. 30.
    Gani J. and Irle A. (1999). On patterns in sequences of random events. Mh. Math. 127: 295–309 MATHCrossRefMathSciNetGoogle Scholar
  31. 31.
    Goulden I.P. and Jackson D.M. (2004). Combinatorial Enumeration. Dover, New York MATHGoogle Scholar
  32. 32.
    Griffiths-Jones S., Moxon S., Marshall M., Khanna A., Eddy S.R. and Bateman A. (2005). RFAM: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 33(Database issue): 121–124 CrossRefGoogle Scholar
  33. 33.
    Gerber H.U. and Li S.-Y.R. (1981). The occurrence of sequence patterns in repeated experiments and hitting times in a Markov chain. Stoch. Process. Appl. 11(1): 101–108 MATHCrossRefMathSciNetGoogle Scholar
  34. 34.
    Guibas L.J. and Odlyzko A.M. (1978). Maximal prefix-synchronized codes. SIAM J. Appl. Math. 35(2): 401–418 MATHCrossRefMathSciNetGoogle Scholar
  35. 35.
    Guibas L.J. and Odlyzko A.M. (1981). Periods in strings. J. Combin. Theory Ser. A 30(1): 19–42 MATHCrossRefMathSciNetGoogle Scholar
  36. 36.
    Guibas L.J. and Odlyzko A.M. (1981). String overlaps, pattern matching and nontransitive games. J. Comb. Theory Ser. A 30(2): 183–208 MATHCrossRefMathSciNetGoogle Scholar
  37. 37.
    Guerrier-Takada C., Gardiner K., Marsh T., Pace N. and Altman S. (1983). The RNA moiety of ribonuclease P is the catalytic subunit of the enzyme. Cell 35(3 Pt 2): 849–857 CrossRefGoogle Scholar
  38. 38.
    Hentze M.W., Caughman S.W., Rouault T.A., Barriocanal J.G., Dancis A., Harford J.B. and Klausner R.D. (1987). Identification of the iron-responsive element for the translational regulation of human ferritin mRNA. Science 238(4833): 1570–1573 CrossRefGoogle Scholar
  39. 39.
    Han Q. and Hirano K. (2003). Sooner and later waiting time problems for patterns in Markov dependent trials. J. Appl. Prob. 40(1): 73–86 MATHCrossRefMathSciNetGoogle Scholar
  40. 40.
    Hopcroft J.E. and Ullman J.D. (1979). Introduction to automata theory, languages and computation. Addison-Wesley, Reading MATHGoogle Scholar
  41. 41.
    Jones G.L. (2004). On the Markov chain central limit theorem. Probab. Surv. 1: 299–320 CrossRefMathSciNetMATHGoogle Scholar
  42. 42.
    Knight R., De Sterck H., Markel R., Smit S., Oshmyansky A. and Yarus M. (2005). Abundance of correctly folded RNA motifs in sequence space, calculated on computational grids. Nucleic Acids Res. 33(18): 5924–5935 CrossRefGoogle Scholar
  43. 43.
    Klein R.J. and Eddy S.R. (2003). RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinform. 4: 44 CrossRefGoogle Scholar
  44. 44.
    Kimura M. (1981). Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA 78(1): 454–458 MATHCrossRefGoogle Scholar
  45. 45.
    Knuth D.E., Pratt V.R. and Morris J.H. (1977). Fast pattern matching in strings. SIAM J. Comput. 6(2): 323–350 MATHCrossRefMathSciNetGoogle Scholar
  46. 46.
    Kucherov, G., Noe, L., Roytberg, M.: A unifying framework for seed sensitivity and its application to subset seeds (extended abstract), (2006)Google Scholar
  47. 47.
    Knight R. and Yarus M. (2003). Finding specific RNA motifs: function in a zeptomole world?. RNA 9(2): 218–230 CrossRefGoogle Scholar
  48. 48.
    Lewis, B.P., Burge, C.B., Bartel, D.P.: Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120(1), 15–20, Jan 2005. LetterGoogle Scholar
  49. 49.
    Li S.-Y.R. (1980). A martingale approach to the study of occurrence of sequence patterns in repeated experiments. Ann. Probab. 8(6): 1171–1176 MATHMathSciNetGoogle Scholar
  50. 50.
    Lladser, M.: Minimal markov chain embeddings of pattern problems. In: Proceedings of the 2007 Information Theory and Applications Workshop, University of California, San Diego (2007)Google Scholar
  51. 51.
    Lothaire M., Rota G.-C., Doran B., Ismail M., Lam T.Y., Wutwak E., Flajolet P. and Lutwak E. (2005). Applied Combinatorics on Words (Encyclopedia of Mathematics and its Applications). Cambridge University Press, Cambridge Google Scholar
  52. 52.
    Lu C., Tej S.S., Luo S., Haudenschild C.D., Meyers B.C. and Green P.J. (2005). Elucidation of the small RNA component of the transcriptome. Science 309(5740): 1567–1569 CrossRefGoogle Scholar
  53. 53.
    Martin D. (2005). Distribution of the number of successes in success runs of length at least k in higher-order Markovian sequences. Methodol. Comput. Appl. Probab. 7(4): 543–554 MATHCrossRefMathSciNetGoogle Scholar
  54. 54.
    Nicodème P. (2003). Regexpcount, a symbolic package for counting problems on regular expressions and words. Fundamenta Informaticae 56(1-2): 71–88 MATHMathSciNetGoogle Scholar
  55. 55.
    Nicodème P., Salvy B. and Flajolet P. (2002). Motif statistics. Theoret. Comput. Sci. 287(2): 593–617 MATHCrossRefMathSciNetGoogle Scholar
  56. 56.
    Pozdnyakov V.I. and Kulldorff M. (2006). Waiting times for patterns and a method of gambling teams. Am. Math. Month. 113(2): 134–143 MathSciNetCrossRefMATHGoogle Scholar
  57. 57.
    Park Y. and Spouge J.L. (2004). Searching for multiple words in a Markov sequence. INFORMS J. Comput. 16(4): 341–347 CrossRefMathSciNetGoogle Scholar
  58. 58.
    Robin S.S. and Daudin J.J. (2001). Exact distribution of the distances between any occurrences of a set of words. Ann. Inst. Statist. Math. 53(4): 895–905 MATHCrossRefMathSciNetGoogle Scholar
  59. 59.
    Régnier M. and Denise A. (2004). Rare events and conditional events on random strings. DMTCS 6(2): 191–214 MATHMathSciNetGoogle Scholar
  60. 60.
    Rivas E. and Eddy S.R. (2000). The language of RNA: a formal grammar that includes pseudoknots. Bioinformatics 16(4): 334–340 CrossRefGoogle Scholar
  61. 61.
    Régnier M. (2000). A unified approach to word occurrences probabilities. Discrete Appl. Math. 104(1): 259–280 MATHCrossRefMathSciNetGoogle Scholar
  62. 62.
    Régnier, M., Lifanov, A., Makeev, V.: Three variations on word counting. In: Proceedings German Conference on Bioinformatics, GCB’00, Heidelberg, pp. 75–82. Logos-Verlag, 2000Google Scholar
  63. 63.
    Robin S., Rodolphe F. and Schbath S. (2005). DNA, Words and Models. Cambridge University Press, New York MATHGoogle Scholar
  64. 64.
    Régnier M. and Szpankowski W. (1998). On pattern frequency occurrences in a Markovian sequence. Algorithmica 22(4): 631–649 MATHCrossRefMathSciNetGoogle Scholar
  65. 65.
    Salehi-Ashtiani K. and Szostak J.W. (2001). In vitro evolution suggests multiple origins for the hammerhead ribozyme. Nature 414(6859): 82–84 CrossRefGoogle Scholar
  66. 66.
    Shao J. (2003). Mathematical Statistics, second edition. Springer, Heidelberg Google Scholar
  67. 67.
    Sipser, M.: Introduction to the Theory of Computation. International Thomson Publishing, (1996)Google Scholar
  68. 68.
    Singh R., Robida M.D. and Karimpour S. (2006). Building biological complexity with limited genes. Curr. Genom. 7: 97–114 CrossRefGoogle Scholar
  69. 69.
    Sabeti P.C., Unrau P.J. and Bartel D.P. (1997). Accessing rare activities from random RNA sequences: the importance of the length of molecules in the starting pool. Chem. Biol. 4(10): 767–774 CrossRefGoogle Scholar
  70. 70.
    Tang J. and Breaker R.R. (2000). Structural diversity of self-cleaving ribozymes. Proc. Natl. Acad. Sci. USA 97(11): 5784–5789 CrossRefGoogle Scholar
  71. 71.
    Vallée B. (2001). Dynamical sources in information theory: fundamental intervals and word prefixes. Algorithmica 29(1): 262–306 MATHCrossRefMathSciNetGoogle Scholar
  72. 72.
    Waterman, M.S.: Introduction to computational biology: maps, sequences and genomes. Chapman & Hall, WAT m 95:1 1.Ex (1995)Google Scholar
  73. 73.
    Wilf H.S. (1994). Generatingfunctiology, second edition. Academic, New York Google Scholar
  74. 74.
    Welch M., Majerfeld I. and Yarus M. (1997). 23S rRNA similarity from selection for peptidyl transferase mimicry. Biochemistry 36(22): 6614–6623 CrossRefGoogle Scholar
  75. 75.
    Winkler W., Nahvi A. and Breaker R.R. (2002). Thiamine derivatives bind messenger RNAs directly to regulate bacterial gene expression. Nature 419(6910): 952–956 CrossRefGoogle Scholar
  76. 76.
    Yarus M., Caporaso J.G. and Knight R. (2005). Origins of the genetic code: the escaped triplet theory. Annu. Rev. Biochem. 74: 179–198 CrossRefGoogle Scholar
  77. 77.
    Yarus M. and Welch M. (2000). Peptidyl transferase: ancient and exiguous. Chem. Biol. 7(10): 187–190 CrossRefGoogle Scholar

Copyright information

© Springer-Verlag 2007

Authors and Affiliations

  • Manuel E. Lladser
    • 1
  • M. D. Betterton
    • 2
  • Rob Knight
    • 3
  1. 1.Department of Applied MathematicsUniversity of Colorado at BoulderBoulderUSA
  2. 2.Department of PhysicsUniversity of Colorado at BoulderBoulderUSA
  3. 3.Department of Chemistry and BiochemistryUniversity of Colorado at BoulderBoulderUSA

Personalised recommendations