How Can Pattern Statistics Be Useful for DNA Motif Discovery?

Schbath, Sophie; Robin, Stéphane

doi:10.1007/978-0-8176-4749-0_15

Chapter
First Online: 01 January 2009

Part of the book series: Statistics for Industry and Technology (SIT)

Abstract

Motif statistics have been widely revisited over the last 15 years, driven by the increasing availability of genomic sequences. Identifying DNA motifs with biological functions remains a major challenge in genome analysis. Many functional and essential motifs are distinctive in that they are very frequent all along the chromosome, concentrated in particular regions (e.g. upstream of genes), or co-oriented with the direction of replication. The prediction of functional motifs is therefore mostly based on statistical properties of pattern occurrences in Markovian sequences. This chapter is primarily devoted to such properties, with a special focus on pattern frequency. How does one compute or approximate the count distribution to assess motif exceptionality? How can we test whether a motif is significantly unbalanced between two sequences (or two sets of sequences)? How should one deal with degenerate patterns? How can we model occurrences to find regions significantly enriched with a given pattern? Examples of functional motifs illustrate all these questions, and we will see how the Chi motif was identified in Staphylococcus aureus on the basis of its statistical properties.
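To make the first of these questions concrete, the following is a minimal sketch, not the chapter's own implementation, of how the exceptionality of a word might be assessed. It assumes an order-1 Markov model whose parameters are estimated from the sequence itself, uses the classical plug-in estimate of the expected count (product of dinucleotide counts divided by the product of inner letter counts), and approximates the count distribution by a Poisson law, which is only reasonable for rare, non-self-overlapping words. The function names and the toy sequence are illustrative assumptions.

```python
from collections import Counter
from math import exp


def count_overlapping(seq, word):
    """Count occurrences of `word` in `seq`, allowing overlaps."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)


def expected_count_m1(seq, word):
    """Plug-in estimate of the expected count of `word` (length >= 2)
    under an order-1 Markov model fitted on `seq`:
    prod_i N(w_i w_{i+1}) / prod_{inner i} N(w_i)."""
    letters = Counter(seq)
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    num = 1.0
    for i in range(len(word) - 1):
        num *= pairs[word[i:i + 2]]
    den = 1.0
    for c in word[1:-1]:  # inner letters only
        den *= letters[c]
    return num / den if den > 0 else 0.0


def poisson_upper_tail(lam, n):
    """P(X >= n) for X ~ Poisson(lam), by summing the first n terms."""
    term = exp(-lam)  # P(X = 0)
    cdf = 0.0
    for k in range(n):
        cdf += term
        term *= lam / (k + 1)
    return max(0.0, 1.0 - cdf)


if __name__ == "__main__":
    # Toy sequence and word, chosen only for illustration.
    seq = "acgtacgtggacgtttacgacgtacc"
    word = "acgt"
    observed = count_overlapping(seq, word)
    expected = expected_count_m1(seq, word)
    p_value = poisson_upper_tail(expected, observed)
    print(observed, expected, p_value)
```

A small upper-tail probability suggests over-representation relative to the fitted model. For self-overlapping words or frequent words, a compound Poisson or Gaussian approximation would be the more appropriate choice than the plain Poisson law assumed here.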

Notes

1. The backbone of a bacterial genome is composed of the genomic regions conserved in several strains of the bacterium. Here, we used the backbone obtained from the alignment of the three strains K12, O157:H7 and CFT, available at http://genome.jouy.inra.fr/mosaic/
2. In contrast to Section 15.2.5, the backbone here is the one obtained from the alignment of two strains: K12 and O157:H7.
3. a is the complement of t, whereas c is the complement of g.

Authors

Sophie Schbath
Stéphane Robin

Editor information

Editors and Affiliations

Joseph Glaz: Dept. Statistics, University of Connecticut, Glenbrook Rd. 215, Storrs, 06269-3120, U.S.A.
Vladimir Pozdnyakov: Dept. Statistics, University of Connecticut, Glenbrook Road 215, Storrs, 06269-4120, U.S.A.
Sylvan Wallenstein: Center for Biomathematical Sciences, Mount Sinai School of Medicine, Gustave L Levy Place 1, New York, 10029, U.S.A.

Cite this chapter

Schbath, S., Robin, S. (2009). How Can Pattern Statistics Be Useful for DNA Motif Discovery? In: Glaz, J., Pozdnyakov, V., Wallenstein, S. (eds) Scan Statistics. Statistics for Industry and Technology. Birkhäuser Boston. https://doi.org/10.1007/978-0-8176-4749-0_15

DOI: https://doi.org/10.1007/978-0-8176-4749-0_15
Published: 27 April 2009
Publisher Name: Birkhäuser Boston
Print ISBN: 978-0-8176-4748-3
Online ISBN: 978-0-8176-4749-0
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
