Abstract
The statistics of motifs have been widely revisited over the last 15 years owing to the increasing availability of genomic sequences. The identification of DNA motifs with biological functions remains a major challenge in genome analysis. Many functional and essential motifs are distinctive in being very frequent all along the chromosome, concentrated in particular regions (e.g. in front of genes), or co-oriented with the direction of replication. The prediction of functional motifs is therefore mostly based on statistical properties of pattern occurrences in Markovian sequences. This chapter is primarily devoted to such properties, with a special focus on pattern frequency. How does one compute or approximate the count distribution to assess motif exceptionality? How can we test whether a motif is significantly unbalanced between two (sets of) sequences? How should one deal with degenerate patterns? How can we model occurrences to find regions significantly enriched with a given pattern? Examples of functional motifs illustrate all these questions, and we will see how the Chi motif was identified in Staphylococcus aureus on the basis of its statistical properties.
Notes
1. The backbone of a bacterial genome is composed of the genomic regions conserved across several strains of the bacterium. Here, we used the backbone obtained from the alignment of the three strains K12, O157:H7 and CFT, available at http://genome.jouy.inra.fr/mosaic/
2. In contrast to Section 15.2.5, the backbone here is the one obtained from the alignment of two strains: K12 and O157:H7.
3. a is the complement of t, whereas c is the complement of g.
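The complement rule in Note 3 amounts to a four-entry lookup table. The following is a minimal Python sketch of it; the names `COMPLEMENT`, `complement` and `reverse_complement` are hypothetical and do not come from the chapter. Reversing the complemented word, so that it is read in the direction of the opposite strand, is the usual convention and is an assumption added here, not something the note itself states.

```python
# Complement rule from Note 3: a <-> t and c <-> g (lowercase DNA alphabet).
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def complement(word: str) -> str:
    """Replace every letter of the word by its complementary letter."""
    return "".join(COMPLEMENT[letter] for letter in word)


def reverse_complement(word: str) -> str:
    """Complement the word, then reverse it (assumed convention for the
    opposite strand; not stated in the note)."""
    return complement(word)[::-1]


if __name__ == "__main__":
    word = "gattaca"  # arbitrary illustrative word, not a motif from the chapter
    print(complement(word))
    print(reverse_complement(word))
```

Applying `reverse_complement` twice returns the original word, which is a quick sanity check on the table.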
Copyright information
© 2009 Birkhäuser Boston, a part of Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Schbath, S., Robin, S. (2009). How Can Pattern Statistics Be Useful for DNA Motif Discovery? In: Glaz, J., Pozdnyakov, V., Wallenstein, S. (eds) Scan Statistics. Statistics for Industry and Technology. Birkhäuser Boston. https://doi.org/10.1007/978-0-8176-4749-0_15
DOI: https://doi.org/10.1007/978-0-8176-4749-0_15
Publisher Name: Birkhäuser Boston
Print ISBN: 978-0-8176-4748-3
Online ISBN: 978-0-8176-4749-0
eBook Packages: Mathematics and Statistics (R0)