Assessing the Significance of Sets of Words

  • Conference paper
Combinatorial Pattern Matching (CPM 2005)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3537)

Abstract

Various criteria have been defined to evaluate the significance of sets of words, but they are often difficult to compute. We provide explicit expressions for the waiting time in this setting. To assess the significance of a cluster of potential binding sites, we extend these expressions to the co-occurrence problem. We point out that the values of these criteria depend on a few fundamental parameters, and we provide efficient algorithms to compute them that rely on a combinatorial interpretation of the formulae. We show that our results are very tight in the so-called twilight zone and improve on previous rough approximations. The text is assumed to be generated by a stationary Markov process. These results are developed for an extended model of consensus.
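
As a rough illustration of what a waiting-time expression looks like, the sketch below treats a much simpler case than the one studied in the paper: a single word in a text of independent, identically distributed letters (not a set of words, and not a Markov model). In that simplified case, the expected position at which the word first ends is the sum of 1/P(w_1...w_k) over every length k for which the length-k prefix of the word equals its length-k suffix, a classical result based on the self-overlaps (autocorrelation) of the word. The function name `expected_waiting_time` and the `probs` mapping are illustrative assumptions, not the authors' notation or algorithm.

```python
from fractions import Fraction


def expected_waiting_time(word, probs):
    """Expected position of the end of the first occurrence of `word`.

    Assumes an i.i.d. text (a simplification of the Markov model in the
    abstract). `probs` maps each letter to its probability.
    """
    total = Fraction(0)
    for k in range(1, len(word) + 1):
        # k contributes only if the word overlaps itself on k letters,
        # i.e. its prefix of length k equals its suffix of length k.
        if word[:k] == word[-k:]:
            prefix_prob = Fraction(1)
            for letter in word[:k]:
                prefix_prob *= probs[letter]
            total += 1 / prefix_prob
    return total


# Uniform DNA alphabet, used here purely as an example distribution.
uniform_dna = {letter: Fraction(1, 4) for letter in "ACGT"}
print(expected_waiting_time("ACGT", uniform_dna))  # 256
print(expected_waiting_time("AAAA", uniform_dna))  # 340
```

With uniform letter probabilities, the self-overlapping word AAAA has a longer expected waiting time (340) than the non-overlapping word ACGT (256), even though both have the same probability at any fixed position. This dependence on overlaps is one reason the significance of words is harder to assess than a simple frequency count suggests.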

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Boeva, V., Clément, J., Régnier, M., Vandenbogaert, M. (2005). Assessing the Significance of Sets of Words. In: Apostolico, A., Crochemore, M., Park, K. (eds) Combinatorial Pattern Matching. CPM 2005. Lecture Notes in Computer Science, vol 3537. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11496656_31

  • DOI: https://doi.org/10.1007/11496656_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-26201-5

  • Online ISBN: 978-3-540-31562-9

  • eBook Packages: Computer Science, Computer Science (R0)
