Assessing the Significance of Sets of Words

Boeva, Valentina; Clément, Julien; Régnier, Mireille; Vandenbogaert, Mathias

doi:10.1007/11496656_31

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3537)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

Various criteria have been defined to evaluate the significance of sets of words, but computing them is often difficult. We provide explicit expressions for the waiting time in this context. To assess the significance of a cluster of potential binding sites, we extend these expressions to the co-occurrence problem. We point out that the values of these criteria depend on a few fundamental parameters, and we provide efficient algorithms to compute them that rely on a combinatorial interpretation of the formulae. We show that our results are very tight in the so-called twilight zone and improve on previous rough approximations. The text is assumed to be generated by a stationary Markov process. These results are developed for an extended model of consensus.
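
To give a feel for the waiting-time quantity mentioned in the abstract, here is a minimal sketch of the classical result for a much simpler setting than the one treated in the paper: a single word in a text of independent, identically distributed letters (not a set of words, and not a Markov model). In that setting the expected waiting time is the sum, over every border length k of the word (every k for which the prefix of length k equals the suffix of length k, including k equal to the word length), of the inverse probability of that prefix. The function name `expected_waiting_time` and the `probs` dictionary are illustrative assumptions, not notation from the paper.

```python
from fractions import Fraction


def expected_waiting_time(word, probs):
    """Expected end position of the first occurrence of `word`.

    Assumes an i.i.d. (Bernoulli) text with letter probabilities `probs`;
    this is a simplification, not the Markov model used in the paper.
    """
    m = len(word)
    total = Fraction(0)
    for k in range(1, m + 1):
        # k is a border length when the prefix and suffix of length k agree
        if word[:k] == word[m - k:]:
            p_prefix = Fraction(1)
            for letter in word[:k]:
                p_prefix *= probs[letter]
            # each border contributes 1 / P(prefix of length k)
            total += 1 / p_prefix
    return total


# Uniform DNA alphabet (an assumption chosen for illustration only).
uniform = {c: Fraction(1, 4) for c in "ACGT"}
print(expected_waiting_time("ACGT", uniform))  # 256
print(expected_waiting_time("AAAA", uniform))  # 340
```

The two words have the same occurrence probability at any fixed position, yet the self-overlapping word AAAA takes longer to appear for the first time on average. This overlap effect is the reason autocorrelation structure matters when assessing the significance of words.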

Author information

Authors and Affiliations

Moscow State University, Vorob’evy Gory, Russia
Valentina Boeva
Igm, Université de Marne-la-Vallée, France
Julien Clément
Inria, 78153, Le Chesnay, France
Mireille Régnier
Biozentrum, Basel Universitat, Switzerland
Mathias Vandenbogaert

Editor information

Editors and Affiliations

Georgia Institute of Technology and Università di Padova,
Alberto Apostolico
Université Paris-Est, France
Maxime Crochemore
School of Computer Science and Engineering, Seoul National University, 151-742, Seoul, Korea
Kunsoo Park

Cite this paper

Boeva, V., Clément, J., Régnier, M., Vandenbogaert, M. (2005). Assessing the Significance of Sets of Words. In: Apostolico, A., Crochemore, M., Park, K. (eds) Combinatorial Pattern Matching. CPM 2005. Lecture Notes in Computer Science, vol 3537. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11496656_31

DOI: https://doi.org/10.1007/11496656_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26201-5
Online ISBN: 978-3-540-31562-9
eBook Packages: Computer Science, Computer Science (R0)
