Abstract
We present a complete analysis of the statistics of the number of occurrences of a regular expression pattern in a random text. This covers the “motifs” widely used in computational biology. Our approach is based on: (i) classical constructive results in theoretical computer science (automata and formal language theory); (ii) analytic combinatorics to compute asymptotic properties from generating functions; (iii) computer algebra to determine generating functions explicitly, analyse them, and extract their coefficients efficiently. We provide constructions for both overlapping and non-overlapping matches of a regular expression. A companion implementation produces: multivariate generating functions for the statistics under study; a fast computation of their Taylor coefficients, which yields exact values of the moments, with typical application to random texts of size 30,000; and precise asymptotic formulæ that allow predictions in texts of arbitrarily large size. Our implementation was tested by comparing predicted numbers of occurrences of motifs against the 7-megabyte amino acid database Prodom. Our programs handled more than 88% of the standard collection of Prosite motifs. Such comparisons help detect which motifs are observed in real biological data more or less frequently than theoretically predicted.
This work has been partially supported by the Long Term Research Project Alcom-IT (#20244) of the European Union.
An extended version of this abstract is available as INRIA Research Report 3606.
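To make the automaton-based counting described in the abstract concrete, here is a minimal sketch in Python. It is not the companion implementation, which works with generating functions in a computer algebra system. It rests on several simplifying assumptions: the pattern is a single word rather than a general regular expression, matches are counted with overlaps, and the text is memoryless with fixed letter probabilities. The function names (`word_automaton`, `occurrence_distribution`, `mean_variance`) are hypothetical. The dynamic programme tracks the pair (automaton state, number of matches so far), which yields numerically the same probabilities that the coefficients of a bivariate generating function would encode.

```python
from fractions import Fraction


def word_automaton(word, alphabet):
    """Build a DFA over `alphabet` for texts ending with `word`.

    State s means: the longest suffix of the text read so far that is
    also a prefix of `word` has length s. State len(word) is accepting.
    """
    m = len(word)
    delta = []
    for state in range(m + 1):
        row = {}
        for c in alphabet:
            s = word[:state] + c
            k = min(m, len(s))
            # Shrink k until the last k letters of s form a prefix of word.
            while k > 0 and s[-k:] != word[:k]:
                k -= 1
            row[c] = k
        delta.append(row)
    return delta, m


def occurrence_distribution(delta, accept, probs, n):
    """Exact distribution of the number of (overlapping) matches
    in a random text of length n with letter probabilities `probs`."""
    dist = {(0, 0): Fraction(1)}  # (state, matches so far) -> probability
    for _ in range(n):
        new = {}
        for (state, k), p in dist.items():
            for c, pc in probs.items():
                t = delta[state][c]
                key = (t, k + 1 if t == accept else k)
                new[key] = new.get(key, 0) + p * pc
        dist = new
    pmf = {}
    for (_, k), p in dist.items():
        pmf[k] = pmf.get(k, 0) + p
    return pmf


def mean_variance(pmf):
    """First two moments of a distribution given as {value: probability}."""
    mean = sum(k * p for k, p in pmf.items())
    second = sum(k * k * p for k, p in pmf.items())
    return mean, second - mean * mean


# Example: the word "aba" in a uniform random text of length 10 over {a, b}.
uniform = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
delta, accept = word_automaton("aba", "ab")
pmf = occurrence_distribution(delta, accept, uniform, 10)
mean, variance = mean_variance(pmf)
print("mean:", mean, "variance:", variance)
```

Exact rational arithmetic keeps the moments free of rounding error, at the cost of speed. The table has on the order of (number of states) × n entries, so this direct approach is practical only for moderate text lengths. Large texts are where asymptotic formulæ such as those mentioned in the abstract become useful.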
Copyright information
© 1999 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nicodème, P., Salvy, B., Flajolet, P. (1999). Motif Statistics. In: Nešetřil, J. (eds) Algorithms - ESA’ 99. ESA 1999. Lecture Notes in Computer Science, vol 1643. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48481-7_18
DOI: https://doi.org/10.1007/3-540-48481-7_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66251-8
Online ISBN: 978-3-540-48481-3
eBook Packages: Springer Book Archive