Motif Statistics

Nicodème, Pierre; Salvy, Bruno; Flajolet, Philippe

doi:10.1007/3-540-48481-7_18

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 1643))

Included in the following conference series:

European Symposium on Algorithms

Abstract

We present a complete analysis of the statistics of the number of occurrences of a regular expression pattern in a random text. This covers the “motifs” widely used in computational biology. Our approach is based on: (i) classical constructive results in theoretical computer science (automata and formal language theory); (ii) analytic combinatorics to compute asymptotic properties from generating functions; (iii) computer algebra to determine generating functions explicitly, analyse them, and extract coefficients efficiently. We provide constructions for overlapping or non-overlapping matches of a regular expression. A companion implementation produces: multivariate generating functions for the statistics under study; a fast computation of their Taylor coefficients, which yields exact values of the moments, with typical application to random texts of size 30,000; and precise asymptotic formulæ that allow predictions in texts of arbitrarily large sizes. Our implementation was tested by comparing predictions of the number of occurrences of motifs against the 7-megabyte amino acid database Prodom. We handled more than 88% of the standard collection of Prosite motifs with our programs. Such comparisons help detect which motifs are observed in real biological data more or less frequently than theoretically predicted.
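To make the automaton-based counting idea concrete, here is a minimal sketch in Python. It is not the authors' implementation, which relies on computer algebra and explicit generating functions. It assumes a two-letter alphabet with independent, equiprobable letters, a hand-built deterministic automaton for the illustrative motif `ab*a` (all names below are hypothetical), and the convention that an occurrence is a text position at which some match of the motif ends. Propagating exact probabilities through the automaton while tracking the running count amounts to extracting coefficients of a bivariate generating function, from which the mean and variance follow.

```python
from fractions import Fraction
from itertools import product
import re

# Hypothetical hand-built DFA recognising Sigma* . a b* a over {a, b}.
# State 3 is reached exactly when a match of the motif ends at the last letter.
DFA = {
    0: {"a": 1, "b": 0},  # no 'a' read yet
    1: {"a": 3, "b": 2},  # the first 'a' has just been read
    2: {"a": 3, "b": 2},  # inside a run of 'b' that follows an 'a'
    3: {"a": 3, "b": 2},  # a match has just ended
}
ACCEPTING = {3}
PROBS = {"a": Fraction(1, 2), "b": Fraction(1, 2)}  # assumed letter model


def count_distribution(n, dfa=DFA, accepting=ACCEPTING, probs=PROBS, start=0):
    """Return {k: P(exactly k match-end positions)} for a random text of length n."""
    # weights[(state, k)] = probability of being in `state` after k occurrences
    weights = {(start, 0): Fraction(1)}
    for _ in range(n):
        nxt = {}
        for (state, k), w in weights.items():
            for letter, p in probs.items():
                target = dfa[state][letter]
                key = (target, k + (1 if target in accepting else 0))
                nxt[key] = nxt.get(key, Fraction(0)) + w * p
        weights = nxt
    dist = {}
    for (_, k), w in weights.items():
        dist[k] = dist.get(k, Fraction(0)) + w
    return dist


def mean_and_variance(dist):
    """Exact first two moments of a finite distribution given as {value: prob}."""
    mean = sum(k * p for k, p in dist.items())
    second = sum(k * k * p for k, p in dist.items())
    return mean, second - mean * mean


def brute_force_distribution(n, pattern=r"ab*a\Z", alphabet="ab"):
    """Same distribution by enumerating all texts (uniform letters, small n only)."""
    weight = Fraction(1, len(alphabet) ** n)
    dist = {}
    for letters in product(alphabet, repeat=n):
        text = "".join(letters)
        # Count prefixes that end with a match of the motif.
        k = sum(1 for i in range(1, n + 1) if re.search(pattern, text[:i]))
        dist[k] = dist.get(k, Fraction(0)) + weight
    return dist


if __name__ == "__main__":
    dist = count_distribution(30)
    print(mean_and_variance(dist))
```

The dynamic programme touches at most (number of states) × (n + 1) pairs per step, so it stays practical for long texts, whereas the brute-force enumeration is only a sanity check for small n. A non-uniform letter model can be tried by changing `PROBS`, provided the probabilities sum to 1.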

This work has been partially supported by the Long Term Research Project Alcom-IT (#20244) of the European Union.

An extended version of this abstract is available as INRIA Research Report 3606.

Author information

Authors and Affiliations

DKFZ Theoretische Bioinformatik, Germany
Pierre Nicodème
Algorithms Project. Inria Rocquencourt, France
Bruno Salvy & Philippe Flajolet

Editor information

Editors and Affiliations

Department of Applied Mathematics and DIMATIA Centre, Charles University, Malostranská nám. 25, CZ-11800, Prague 1, Czech Republic
Jaroslav Nešetřil

Cite this paper

Nicodème, P., Salvy, B., Flajolet, P. (1999). Motif Statistics. In: Nešetřil, J. (ed.) Algorithms - ESA’ 99. ESA 1999. Lecture Notes in Computer Science, vol 1643. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48481-7_18

DOI: https://doi.org/10.1007/3-540-48481-7_18
Published: 14 January 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66251-8
Online ISBN: 978-3-540-48481-3
eBook Packages: Springer Book Archive
