Abstract
Let L be a context-free language on an alphabet X = {x1, x2, …, xk} and n a positive integer. We consider the problem of generating at random words of L with respect to a given distribution of the number of occurrences of the letters. We consider two variants of the problem. In the first one, a vector of natural numbers (n1, n2, …, nk) such that n1 + n2 + … + nk = n is given, and the words must be generated uniformly among the set of words of L which contain exactly ni letters xi (1 ≤ i ≤ k). The second variant consists, given a vector v = (v1, …, vk) of positive real numbers such that v1 + … + vk = 1, in generating at random words among the whole set of words of L of length n, in such a way that the expected number of occurrences of any letter xi equals n·vi (1 ≤ i ≤ k), and two words having the same distribution of letters have the same probability of being generated. For this purpose, we design and study two variants of the recursive method which is classically employed for the uniform generation of combinatorial structures. This type of “controlled” non-uniform generation can be applied in the field of statistical analysis of genomic sequences.
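The first variant (uniform generation with prescribed letter counts) can be illustrated by a minimal sketch of the recursive method. This is not the authors' algorithm; it is a toy instance under stated assumptions: the grammar S -> "" | "c" S | "a" S "b" S (Motzkin words) and the names `count`, `splits` and `generate` are hypothetical and chosen only for illustration. The idea shown is the classical one: first count the words for every vector of letter counts, then pick each grammar rule and each split of the counts with probability proportional to the number of words it leads to.

```python
import random
from functools import lru_cache
from itertools import product

# Hypothetical toy grammar (an assumption, not taken from the paper):
#   S -> ""  |  "c" S  |  "a" S "b" S
# It is unambiguous, so counting derivations is the same as counting words.

@lru_cache(maxsize=None)
def count(na, nb, nc):
    """Number of words of S with exactly na 'a', nb 'b' and nc 'c'."""
    if na != nb:
        return 0  # every 'a' is matched by exactly one 'b'
    total = 1 if (na, nb, nc) == (0, 0, 0) else 0   # rule S -> ""
    if nc > 0:
        total += count(na, nb, nc - 1)              # rule S -> "c" S
    if na > 0:
        total += sum(w for _, w in splits(na, nb, nc))  # rule S -> "a" S "b" S
    return total

def splits(na, nb, nc):
    """Ways to share the remaining letters between the two S of "a" S "b" S.

    Returns a list of ((left, right), weight) where weight is the number
    of word pairs compatible with that split.
    """
    out = []
    for i, k in product(range(na), range(nc + 1)):
        left = (i, i, k)
        right = (na - 1 - i, nb - 1 - i, nc - k)
        w = count(*left) * count(*right)
        if w:
            out.append(((left, right), w))
    return out

def generate(na, nb, nc, rng=random):
    """Draw a word uniformly among those with the given letter counts."""
    total = count(na, nb, nc)
    if total == 0:
        raise ValueError("no word with these letter counts")
    if (na, nb, nc) == (0, 0, 0):
        return ""
    r = rng.randrange(total)  # uniform index among all candidate words
    if nc > 0:
        w = count(na, nb, nc - 1)
        if r < w:
            return "c" + generate(na, nb, nc - 1, rng)
        r -= w
    for (left, right), w in splits(na, nb, nc):
        if r < w:
            return "a" + generate(*left, rng) + "b" + generate(*right, rng)
        r -= w
    raise AssertionError("unreachable: weights sum to count(na, nb, nc)")

if __name__ == "__main__":
    print(generate(3, 3, 2))
```

The sketch uses exact integer counts, so the draw is exactly uniform, at the cost of a table indexed by the whole vector of letter counts. The second variant described above would, in the same spirit, attach a weight to each letter instead of fixing its count; that is not shown here.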
Copyright information
© 2000 Springer Basel AG
About this paper
Cite this paper
Denise, A., Roques, O., Termier, M. (2000). Random generation of words of context-free languages according to the frequencies of letters. In: Gardy, D., Mokkadem, A. (eds) Mathematics and Computer Science. Trends in Mathematics. Birkhäuser, Basel. https://doi.org/10.1007/978-3-0348-8405-1_10
DOI: https://doi.org/10.1007/978-3-0348-8405-1_10
Publisher Name: Birkhäuser, Basel
Print ISBN: 978-3-0348-9553-8
Online ISBN: 978-3-0348-8405-1
eBook Packages: Springer Book Archive