Common Substrings in Random Strings

  • Eric Blais
  • Mathieu Blanchette
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4009)

Abstract

In computational biology, an important problem is to identify a word of length k present in each of a given set of sequences. Here, we investigate the problem of calculating the probability that such a word exists in a set of r random strings. Existing methods to approximate this probability are either inaccurate when r > 2 or are restricted to Bernoulli models. We introduce two new methods for computing this probability under Bernoulli and Markov models. We present generalizations of the methods to compute the probability of finding a word of length k shared among q of r sequences, and to allow mismatches. We show through simulations that our approximations are significantly more accurate than methods previously published.
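
To make the quantity concrete, the probability that r random strings share a word of length k can be estimated by brute-force simulation under a Bernoulli (independent letters) model. The sketch below is only an illustration of the problem statement, not one of the approximation methods introduced in the paper. The function names, the string length `n`, the DNA alphabet, and the trial count are all assumptions made for the example.

```python
import random


def kmers(s, k):
    """Return the set of all length-k substrings (words) of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def has_common_word(strings, k):
    """Return True if some word of length k occurs in every string."""
    common = kmers(strings[0], k)
    for s in strings[1:]:
        common &= kmers(s, k)
        if not common:
            return False
    return bool(common)


def estimate_probability(r, n, k, alphabet="ACGT", weights=None,
                         trials=2000, seed=0):
    """Monte Carlo estimate of P(a word of length k occurs in all r strings).

    Each string has length n and its letters are drawn independently
    (Bernoulli model); `weights` gives optional letter frequencies.
    All parameter names here are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        strings = ["".join(rng.choices(alphabet, weights=weights, k=n))
                   for _ in range(r)]
        hits += has_common_word(strings, k)
    return hits / trials


if __name__ == "__main__":
    # Example: 3 uniform random DNA strings of length 100, words of length 6.
    print(estimate_probability(r=3, n=100, k=6))
```

Simulation of this kind is simple but slow when the probability is small, which is one reason analytical approximations are of interest.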

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Eric Blais (1)
  • Mathieu Blanchette (1)

  1. McGill Centre for Bioinformatics and School of Computer Science, McGill University, Montréal, Canada