Faster Variance Computation for Patterns with Gaps
Determining whether a pattern is statistically overrepresented or underrepresented in a string is a fundamental primitive in computational biology and in large-scale text mining. We study ways to speed up the computation of the expectation and variance of the number of occurrences of a pattern with rigid gaps in a random string. Our contributions are twofold. First, we focus on patterns in which groups of characters from an alphabet Σ can occur at each position. We describe a way to compute the exact expectation and variance of the number of occurrences of a pattern w in a random string generated by a Markov chain in O(|w|^2) time, improving a previous result that required O(2^|w|) time. We then consider the problem of computing expectation and variance of the motifs of a string s in an iid text. Motifs are rigid gapped patterns that occur at least twice in s, and in which at most one character from Σ occurs at each position. We study the case in which s is given offline, and an arbitrary motif w of s is queried online. We relate computational complexity to the structure of w and s, identifying sets of motifs that are amenable to o(|w| log |w|) time online computation after O(|s|^3) preprocessing of s. Our algorithms lend themselves to efficient implementations.
Keywords: gapped patterns, variance, convolution, tiling, motifs
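To make the quantities in the abstract concrete, the sketch below computes the expectation and variance of the number of occurrences of a rigid gapped pattern in an iid text by direct summation over overlap distances. It is a naive baseline that takes O(|w|^2) time per pattern, not the faster method the paper describes, and it covers only the iid case with at most one character per position. The don't-care symbol `.`, the function names, and the dictionary `p` of character probabilities are assumptions made for illustration.

```python
from math import prod

WILDCARD = "."  # assumed don't-care symbol; not taken from the paper


def pattern_prob(w, p):
    """Probability that w matches at one fixed position of an iid text."""
    return prod(p[c] for c in w if c != WILDCARD)


def joint_prob(w, d, p):
    """Probability that w matches at positions i and i + d, for 0 < d < len(w)."""
    m = len(w)
    result = 1.0
    for k in range(m + d):
        a = w[k] if k < m else WILDCARD       # constraint from the first copy
        b = w[k - d] if k >= d else WILDCARD  # constraint from the shifted copy
        if a != WILDCARD and b != WILDCARD and a != b:
            return 0.0                        # the two copies are incompatible
        c = a if a != WILDCARD else b
        if c != WILDCARD:
            result *= p[c]
    return result


def mean_and_variance(w, n, p):
    """Mean and variance of the occurrence count of w in an iid text of length n."""
    m = len(w)
    positions = n - m + 1
    if positions <= 0:
        return 0.0, 0.0
    pw = pattern_prob(w, p)
    mean = positions * pw
    # Sum of the variances of the per-position indicator variables.
    var = positions * pw * (1.0 - pw)
    # Covariances are nonzero only for overlapping copies (distance d < m).
    for d in range(1, min(m, positions)):
        var += 2 * (positions - d) * (joint_prob(w, d, p) - pw * pw)
    return mean, var


if __name__ == "__main__":
    probs = {"a": 0.3, "b": 0.7}
    print(mean_and_variance("a.a", 6, probs))
```

Each call to `joint_prob` scans the |w| + d positions covered by the two copies, and there are up to |w| - 1 overlap distances, which is where the quadratic cost of this baseline comes from.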