Motif Statistics

  • Conference paper
Algorithms - ESA’ 99 (ESA 1999)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 1643))

Abstract

We present a complete analysis of the statistics of the number of occurrences of a regular-expression pattern in a random text. This covers the “motifs” widely used in computational biology. Our approach is based on: (i) classical constructive results in theoretical computer science (automata and formal language theory); (ii) analytic combinatorics, to compute asymptotic properties from generating functions; (iii) computer algebra, to determine generating functions explicitly, analyse them, and extract coefficients efficiently. We provide constructions for overlapping or non-overlapping matches of a regular expression. A companion implementation produces: multivariate generating functions for the statistics under study; a fast computation of their Taylor coefficients, which yields exact values of the moments, with typical application to random texts of size 30,000; and precise asymptotic formulæ that allow predictions in texts of arbitrarily large size. Our implementation was tested by comparing predictions of the number of occurrences of motifs against the 7-megabyte amino acid database Prodom. We handled more than 88% of the standard collection of Prosite motifs with our programs. Such comparisons help detect which motifs are observed in real biological data more or less frequently than theoretically predicted.

This work has been partially supported by the Long Term Research Project Alcom-IT (#20244) of the European Union.

An extended version of this abstract is available as INRIA Research Report 3606.
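
The automaton-based counting outlined in the abstract can be illustrated with a small sketch. This is not the authors' implementation, which relies on computer algebra and generating functions; it is a plain dynamic programme that tracks the joint probability of (automaton state, number of matches so far), which amounts to computing the coefficient of z^n in the bivariate generating function numerically. The hand-built automaton for texts ending in the word "ab", the uniform two-letter alphabet, and the names `occurrence_distribution` and `mean_and_variance` are all assumptions made for illustration.

```python
from fractions import Fraction


def occurrence_distribution(delta, start, accepting, probs, n):
    """Exact distribution of the number of positions at which a match ends,
    for a random text of length n with independent letters (Bernoulli model).

    delta[state][letter] is the transition table of a deterministic automaton
    recognising every text that ends with a match of the pattern.
    """
    # dist[(state, k)] = probability of being in `state` with k matches seen
    dist = {(start, 0): Fraction(1)}
    for _ in range(n):
        nxt = {}
        for (state, k), p in dist.items():
            for letter, q in probs.items():
                target = delta[state][letter]
                # Entering an accepting state marks one more match end.
                key = (target, k + (1 if target in accepting else 0))
                nxt[key] = nxt.get(key, Fraction(0)) + p * q
        dist = nxt
    # Forget the final state, keep only the match count.
    counts = {}
    for (_, k), p in dist.items():
        counts[k] = counts.get(k, Fraction(0)) + p
    return counts


def mean_and_variance(counts):
    """Exact first two moments of a distribution given as {value: probability}."""
    mean = sum(k * p for k, p in counts.items())
    second = sum(k * k * p for k, p in counts.items())
    return mean, second - mean * mean


# Illustrative automaton (an assumption): state 2 means "the text ends in ab".
DELTA = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 0},
}
PROBS = {"a": Fraction(1, 2), "b": Fraction(1, 2)}

counts = occurrence_distribution(DELTA, 0, {2}, PROBS, 10)
mean, variance = mean_and_variance(counts)
print(mean, variance)
```

For this toy pattern the mean agrees with the direct count (n - 1)/4, since each of the n - 1 possible end positions carries a match with probability 1/4. The table of (state, count) pairs has up to n entries per state, so the cost grows quadratically with n. That growth is one reason to prefer generating functions and asymptotic formulae, as the abstract describes, for long texts.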

Copyright information

© 1999 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Nicodème, P., Salvy, B., Flajolet, P. (1999). Motif Statistics. In: Nešetřil, J. (eds) Algorithms - ESA’ 99. ESA 1999. Lecture Notes in Computer Science, vol 1643. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48481-7_18

  • DOI: https://doi.org/10.1007/3-540-48481-7_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-66251-8

  • Online ISBN: 978-3-540-48481-3

  • eBook Packages: Springer Book Archive
