LATIN 2018: Theoretical Informatics pp 413-426

# On the Biased Partial Word Collector Problem

• Philippe Duchon
• Cyril Nicaud
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10807)

## Abstract

In this article we consider the following question: N words of length L are generated using a biased memoryless source, i.e. each letter is taken independently according to some fixed distribution on the alphabet, and collected in a set (duplicates are removed); what are the frequencies of the letters in a typical element of this random set? We prove that the typical frequency distribution of such a word can be characterized by considering the parameter $$\ell = L/\log N$$. We exhibit two thresholds $$\ell _0<\ell _1$$ that only depend on the source, such that if $$\ell \le \ell _0$$, the distribution resembles the uniform distribution; if $$\ell \ge \ell _1$$ it resembles the distribution of the source; and for $$\ell _0\le \ell \le \ell _1$$ we characterize the distribution as an interpolation of the two extremal distributions.
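The collection process described in the abstract can be simulated directly. The following is a minimal sketch, not the authors' method: the function names `collect_words` and `mean_letter_frequencies`, the two-letter alphabet, and the probabilities are all assumptions chosen for illustration. It averages letter frequencies over the distinct words in the set, which corresponds to picking an element of the set uniformly at random.

```python
import math
import random
from collections import Counter


def collect_words(n_words, length, probs, seed=0):
    """Draw n_words words of the given length from a memoryless source
    and return the set of distinct words (duplicates are removed)."""
    rng = random.Random(seed)
    letters = list(probs)
    weights = [probs[a] for a in letters]
    words = set()
    for _ in range(n_words):
        # Each letter is drawn independently with the source distribution.
        words.add("".join(rng.choices(letters, weights=weights, k=length)))
    return words


def mean_letter_frequencies(words, alphabet):
    """Average letter frequencies over the distinct words of the set."""
    counts = Counter()
    total = 0
    for w in words:
        counts.update(w)
        total += len(w)
    return {a: counts[a] / total for a in alphabet}


if __name__ == "__main__":
    # Hypothetical biased source on a two-letter alphabet.
    probs = {"a": 0.8, "b": 0.2}
    n_words = 20000
    for length in (4, 60):
        ell = length / math.log(n_words)  # the parameter L / log N
        words = collect_words(n_words, length, probs)
        freqs = mean_letter_frequencies(words, probs)
        print(f"L={length}, ell={ell:.2f}, distinct={len(words)}, freqs={freqs}")
```

With these assumed parameters, short words should saturate the set (for L = 4 there are only 16 possible words, so nearly every run collects all of them and the letters appear equally often), while long words are almost never duplicated and the frequencies stay close to those of the source. Intermediate lengths are where the interpolation described above would show up.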

## Notes

### Acknowledgments

The authors are grateful to Arnaud Carayol for his invaluable help in preparing this article, and to an anonymous reviewer for suggesting the promising alternative $$\alpha$$-parametrization of the problem.
