How Can Pattern Statistics Be Useful for DNA Motif Discovery?

Part of the book series: Statistics for Industry and Technology (SIT)

Abstract

Statistics of motifs have been widely revisited in the last 15 years due to the increasing availability of genomic sequences. Identifying DNA motifs with biological functions remains a major challenge in genome analysis. Many functional and essential motifs are distinctive in being very frequent along the whole chromosome, concentrated in particular regions (e.g., in front of genes), or co-oriented with the direction of replication. The prediction of functional motifs therefore relies mostly on statistical properties of pattern occurrences in Markovian sequences. This chapter is primarily devoted to such properties, with a special focus on pattern frequency. How does one compute or approximate the count distribution to assess motif exceptionality? How can we test whether a motif is significantly unbalanced between two (sets of) sequences? How should one deal with degenerate patterns? How can we model occurrences to find regions significantly enriched with a given pattern? Examples of functional motifs will illustrate all these questions, and we will see how the Chi motif was identified in Staphylococcus aureus because of its statistical properties.
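
To make the first question of the abstract concrete, here is a minimal Python sketch of how the exceptionality of a word count might be assessed. It is not the chapter's own code. It assumes a first-order Markov model whose expected count is obtained with the usual plug-in estimator (the product of the observed dinucleotide counts along the word, divided by the product of the counts of its interior letters), and it uses a plain Poisson tail as a rough approximation that is only reasonable for rare words that cannot overlap themselves; self-overlapping words call for a compound Poisson approximation instead. The function names and the toy sequence are illustrative assumptions.

```python
import math
from collections import Counter


def count_overlapping(seq: str, word: str) -> int:
    """Count occurrences of `word` in `seq`, overlapping occurrences included."""
    h = len(word)
    return sum(1 for i in range(len(seq) - h + 1) if seq[i:i + h] == word)


def expected_count_m1(seq: str, word: str) -> float:
    """Plug-in estimate of the expected count of `word` under a first-order
    Markov model fitted on `seq` (assumption: order 1, not the chapter's choice).

    Estimator: N(w1w2) * ... * N(w_{h-1}w_h) / (N(w2) * ... * N(w_{h-1})).
    """
    if len(word) < 2:
        raise ValueError("word must contain at least two letters")
    letters = Counter(seq)
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    numerator = 1.0
    for i in range(len(word) - 1):
        numerator *= pairs[word[i:i + 2]]
    denominator = 1.0
    for letter in word[1:-1]:
        denominator *= letters[letter]
    return numerator / denominator if denominator > 0 else 0.0


def poisson_upper_tail(n_obs: int, lam: float) -> float:
    """Return P(X >= n_obs) for X ~ Poisson(lam).

    Used here as a crude p-value for over-representation; a simplification
    that ignores clumping of self-overlapping words.
    """
    if n_obs <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for k in range(n_obs):
        cdf += term
        term *= lam / (k + 1)  # P(X = k + 1) from P(X = k)
    return max(0.0, 1.0 - cdf)


# Toy usage on a made-up sequence (not real genomic data).
toy_seq = "acgacgacg"
observed = count_overlapping(toy_seq, "acg")
expected = expected_count_m1(toy_seq, "acg")
p_value = poisson_upper_tail(observed, expected)
print(observed, expected, p_value)
```

A small upper-tail probability would flag the word as more frequent than the model predicts. On real genomes the choice of Markov order and of the approximation (Gaussian, Poisson, compound Poisson, large deviations) matters considerably, which is the subject the abstract announces.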

Notes

  1. The backbone of a bacterial genome is composed of the genomic regions conserved in several strains of the bacterium. Here, we used the backbone obtained from the alignment of the three strains K12, O157:H7 and CFT, available at http://genome.jouy.inra.fr/mosaic/

  2. In contrast to Section 15.2.5, the backbone here is the one obtained from the alignment of two strains: K12 and O157:H7.

  3. a is the complement of t, whereas c is the complement of g.

Copyright information

© 2009 Birkhäuser Boston, a part of Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Schbath, S., Robin, S. (2009). How Can Pattern Statistics Be Useful for DNA Motif Discovery?. In: Glaz, J., Pozdnyakov, V., Wallenstein, S. (eds) Scan Statistics. Statistics for Industry and Technology. Birkhäuser Boston. https://doi.org/10.1007/978-0-8176-4749-0_15
