Common Substrings in Random Strings
In computational biology, an important problem is to identify a word of length k present in each of a given set of sequences. Here, we investigate the problem of calculating the probability that such a word exists in a set of r random strings. Existing methods for approximating this probability are either inaccurate when r > 2 or restricted to Bernoulli models. We introduce two new methods for computing this probability under Bernoulli and Markov models. We also generalize the methods to compute the probability of finding a word of length k shared among q of the r sequences, and to allow mismatches. We show through simulations that our approximations are significantly more accurate than previously published methods.
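To make the quantity concrete, the probability that some word of length k occurs in all r random strings can be estimated by brute-force simulation. The sketch below is not one of the approximation methods described in the abstract; it is only a Monte Carlo baseline. It assumes the simplest Bernoulli setting, with letters drawn independently and uniformly from a DNA alphabet, and all strings of the same length n. The function names and the parameters `n`, `alphabet`, `trials`, and `seed` are illustrative choices, not taken from the paper.

```python
import random


def kmers(s, k):
    """Return the set of all length-k substrings of s (empty if k > len(s))."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def has_common_word(strings, k):
    """Return True if some word of length k occurs in every string."""
    common = kmers(strings[0], k)
    for s in strings[1:]:
        common &= kmers(s, k)
        if not common:
            return False  # no candidate word survives, stop early
    return bool(common)


def estimate_probability(r, n, k, alphabet="ACGT", trials=2000, seed=0):
    """Monte Carlo estimate of P(a common k-word exists in r random strings).

    Assumption: each string has length n, with letters drawn independently
    and uniformly from `alphabet` (a uniform Bernoulli model).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        strings = [
            "".join(rng.choice(alphabet) for _ in range(n)) for _ in range(r)
        ]
        if has_common_word(strings, k):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Example: three random DNA strings of length 100, shared word of length 6.
    print(estimate_probability(r=3, n=100, k=6))
```

A simulation like this is slow when the probability is small, because many trials are needed to observe even a few hits, which is one reason analytic approximations are of interest.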