Abstract
Various criteria have been defined to evaluate the significance of sets of words, but computing them is often difficult. We provide explicit expressions for the waiting time in this context. In order to assess the significance of a cluster of potential binding sites, we extend these expressions to the co-occurrence problem. We point out that the values of these criteria depend on a few fundamental parameters. We provide efficient algorithms to compute them that rely on a combinatorial interpretation of the formulae. We show that our results are very tight in the so-called twilight zone and improve on previous rough approximations. The text is assumed to be generated by a stationary Markov process. These results are developed for an extended model of consensus.
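To make the notion of waiting time concrete, here is a minimal sketch for the simplest possible setting: a single word in a text of independent, identically distributed letters. This is a deliberate simplification and not the paper's method, which treats sets of words under a Markov model. The sketch uses the classical overlap identity: the expected position at which a word first ends is the sum, over every prefix of the word that is also a suffix, of the inverse probability of that prefix. The function name `expected_waiting_time` and the dictionary `uniform_dna` are hypothetical names chosen for illustration.

```python
from fractions import Fraction


def expected_waiting_time(word, probs):
    """Expected position of the end of the first occurrence of `word`
    in an i.i.d. text whose letter probabilities are given by `probs`.

    Assumption: letters are independent (Bernoulli model), not Markov.
    """
    total = Fraction(0)
    for k in range(1, len(word) + 1):
        # Keep only the lengths k for which the prefix of length k
        # equals the suffix of length k (the self-overlaps of the word).
        if word[:k] == word[-k:]:
            prefix_prob = Fraction(1)
            for letter in word[:k]:
                prefix_prob *= probs[letter]
            total += 1 / prefix_prob
    return total


# Hypothetical uniform model on the DNA alphabet.
uniform_dna = {letter: Fraction(1, 4) for letter in "ACGT"}

print(expected_waiting_time("AA", uniform_dna))    # self-overlapping word
print(expected_waiting_time("AC", uniform_dna))    # word with no self-overlap
```

Under this model a self-overlapping word such as AA takes longer to appear for the first time than a non-overlapping word of the same length such as AC, even though both have the same occurrence probability at any fixed position. Exact rational arithmetic via `Fraction` avoids rounding in the sum.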
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Boeva, V., Clément, J., Régnier, M., Vandenbogaert, M. (2005). Assessing the Significance of Sets of Words. In: Apostolico, A., Crochemore, M., Park, K. (eds) Combinatorial Pattern Matching. CPM 2005. Lecture Notes in Computer Science, vol 3537. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11496656_31
DOI: https://doi.org/10.1007/11496656_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26201-5
Online ISBN: 978-3-540-31562-9
eBook Packages: Computer Science, Computer Science (R0)