Random generation of words of context-free languages according to the frequencies of letters

  • Alain Denise
  • Olivier Roques
  • Michel Termier
Conference paper
Part of the Trends in Mathematics book series (TM)


Let L be a context-free language on an alphabet X = {x1, x2, …, xk} and n a positive integer. We consider the problem of generating words of L at random with respect to a given distribution of the number of occurrences of the letters. We consider two variants of the problem. In the first, a vector of natural numbers (n1, n2, …, nk) such that n1 + n2 + … + nk = n is given, and words must be generated uniformly from the set of words of L that contain exactly ni letters xi (1 ≤ i ≤ k). In the second, given a vector v = (v1, …, vk) of positive real numbers such that v1 + … + vk = 1, words are generated at random from the whole set of words of L of length n, in such a way that the expected number of occurrences of each letter xi equals n·vi (1 ≤ i ≤ k), and two words having the same distribution of letters have the same probability of being generated. For this purpose, we design and study two variants of the recursive method, which is classically employed for the uniform generation of combinatorial structures. This type of “controlled” non-uniform generation can be applied in the field of statistical analysis of genomic sequences.
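
As an illustration of the first variant (exact letter counts), the following is a minimal sketch of the recursive method on a small example. It is not the authors' algorithm: the grammar S -> "" | "c" S | "a" S "b" S and the function names `count` and `generate` are assumptions chosen for illustration. The idea is to count, for every vector of letter counts, how many words each production can derive, and then to choose productions with probability proportional to those counts.

```python
import random
from functools import lru_cache

# Hypothetical example grammar (an assumption, not taken from the paper).
# It is unambiguous, so counting derivations is the same as counting words:
#   S -> "" | "c" S | "a" S "b" S


@lru_cache(maxsize=None)
def count(na, nb, nc):
    """Number of words derivable from S with exactly na a's, nb b's, nc c's."""
    if na == nb == nc == 0:
        return 1  # only the empty word
    total = 0
    if nc > 0:
        # Production S -> "c" S
        total += count(na, nb, nc - 1)
    if na > 0 and nb > 0:
        # Production S -> "a" S "b" S: split the remaining letters
        # between the first and the second inner S in every possible way.
        for i in range(na):
            for j in range(nb):
                for l in range(nc + 1):
                    total += (count(i, j, l)
                              * count(na - 1 - i, nb - 1 - j, nc - l))
    return total


def generate(na, nb, nc, rng=random):
    """Draw a word uniformly among those with the given letter counts."""
    total = count(na, nb, nc)
    if total == 0:
        raise ValueError("no word of the language has these letter counts")
    if na == nb == nc == 0:
        return ""
    # Pick a rank in [0, total) and locate the production and split it falls in.
    r = rng.randrange(total)
    if nc > 0:
        w = count(na, nb, nc - 1)
        if r < w:
            return "c" + generate(na, nb, nc - 1, rng)
        r -= w
    for i in range(na):
        for j in range(nb):
            for l in range(nc + 1):
                w = count(i, j, l) * count(na - 1 - i, nb - 1 - j, nc - l)
                if r < w:
                    left = generate(i, j, l, rng)
                    right = generate(na - 1 - i, nb - 1 - j, nc - l, rng)
                    return "a" + left + "b" + right
                r -= w
    raise AssertionError("unreachable: rank exceeded total count")


if __name__ == "__main__":
    rng = random.Random(2000)  # fixed seed only to make the demo repeatable
    print(count(3, 3, 2), generate(3, 3, 2, rng))
```

The sketch uses exact integer arithmetic and favours readability over efficiency; the nested loops over all splits are only practical for short words.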


Random Generation; Terminal Symbol; Exact Frequency; Discrete Apply Mathematic; Rational Language
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer Basel AG 2000

Authors and Affiliations

  • Alain Denise (1)
  • Olivier Roques (2)
  • Michel Termier (3)
  1. LRI, UMR CNRS, Université Paris-Sud XI, France
  2. LaBRI, UMR CNRS, Université Bordeaux I, France
  3. IGM, UMR CNRS, Université Paris-Sud XI, France
