Learning Poisson Binomial Distributions

Abstract

We consider a basic problem in unsupervised learning: learning an unknown Poisson binomial distribution. A Poisson binomial distribution (PBD) over \(\{0,1,\ldots ,n\}\) is the distribution of a sum of \(n\) independent Bernoulli random variables which may have arbitrary, potentially non-equal, expectations. These distributions were first studied by Poisson (Recherches sur la probabilité des jugements en matière criminelle et en matière civile. Bachelier, Paris, 1837) and are a natural \(n\)-parameter generalization of the familiar Binomial Distribution. Surprisingly, prior to our work this basic learning problem was poorly understood, and known results for it were far from optimal. We essentially settle the complexity of the learning problem for this basic class of distributions. As our first main result we give a highly efficient algorithm which learns to \(\epsilon \)-accuracy (with respect to the total variation distance) using \(\tilde{O}(1/ \epsilon ^{3})\) samples independent of \(n\). The running time of the algorithm is quasilinear in the size of its input data, i.e., \(\tilde{O}(\log (n)/\epsilon ^{3})\) bit-operations (we write \(\tilde{O}(\cdot )\) to hide factors which are polylogarithmic in the argument to \(\tilde{O}(\cdot )\); thus, for example, \(\tilde{O}(a \log b)\) denotes a quantity which is \(O(a \log b \cdot \log ^c(a \log b))\) for some absolute constant \(c\). Observe that each draw from the distribution is a \(\log (n)\)-bit string). Our second main result is a proper learning algorithm that learns to \(\epsilon \)-accuracy using \(\tilde{O}(1/\epsilon ^{2})\) samples, and runs in time \((1/\epsilon )^{\mathrm {poly}(\log (1/\epsilon ))} \cdot \log n\). This sample complexity is nearly optimal, since any algorithm for this problem must use \(\Omega (1/\epsilon ^{2})\) samples. We also give positive and negative results for some extensions of this learning problem to weighted sums of independent Bernoulli random variables.
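
To make the learning target concrete, here is a minimal Python sketch (ours, not from the paper) of a PBD with parameters \(p_{1},\ldots ,p_{n}\): its exact probability mass function by dynamic programming, and i.i.d. sampling from it. The helper names `pbd_pmf` and `pbd_sample` are illustrative.

```python
import random

def pbd_pmf(ps):
    """Exact pmf of X = X_1 + ... + X_n with independent X_i ~ Bernoulli(p_i), via O(n^2) convolution."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)      # X_i = 0
            nxt[k + 1] += mass * p        # X_i = 1
        pmf = nxt
    return pmf                            # pmf[k] = Pr[X = k] for k = 0, ..., n

def pbd_sample(ps, m, rng=random):
    """Draw m independent samples from the PBD with parameters ps."""
    return [sum(rng.random() < p for p in ps) for _ in range(m)]

ps = [0.1, 0.5, 0.9, 0.25]
print(pbd_pmf(ps))                        # distribution over {0, 1, 2, 3, 4}
print(pbd_sample(ps, 5))
```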

Notes

  1. We thank Yuval Peres and Sam Watson for this information [42].

  2. [35] used the Kullback-Leibler divergence as their distance measure, but we find it more natural to use variation distance.

  3. In particular, our algorithm will output a list of pointers, mapping every point in \([a,b]\) to some memory location where the probability assigned to that point by \(H_S\) is written.

References

  1. Acharya, J., Jafarpour, A., Orlitsky, A., Suresh, A.T.: Sorting with adversarial comparators and application to density estimation. In: IEEE International Symposium on Information Theory (ISIT) (2014)

  2. Berry, A.C.: The accuracy of the Gaussian approximation to the sum of independent variates. Trans. Am. Math. Soc. 49(1), 122–136 (1941)

  3. Barbour, A.D., Holst, L., Janson, S.: Poisson Approximation. Oxford University Press, New York (1992)

  4. Birgé, L.: Estimating a density under order restrictions: nonasymptotic minimax risk. Ann. Stat. 15(3), 995–1012 (1987)

  5. Birgé, L.: On the risk of histograms for estimating decreasing densities. Ann. Stat. 15(3), 1013–1022 (1987)

  6. Birgé, L.: Estimation of unimodal densities without smoothness assumptions. Ann. Stat. 25(3), 970–981 (1997)

  7. Barbour, A.D., Lindvall, T.: Translated Poisson approximation for Markov chains. J. Theor. Probab. 19(3), 609–630 (2006)

  8. Brent, R.P.: Multiple-precision zero-finding methods and the complexity of elementary function evaluation. In: Traub, J.F. (ed.) Analytic Computational Complexity, pp. 151–176. Academic Press, New York (1975)

  9. Brent, R.P.: Fast multiple-precision evaluation of elementary functions. J. ACM 23(2), 242–251 (1976)

  10. Belkin, M., Sinha, K.: Polynomial learning of distribution families. In: FOCS, pp. 103–112 (2010)

  11. Le Cam, L.: An approximation theorem for the Poisson binomial distribution. Pac. J. Math. 10, 1181–1197 (1960)

  12. Chan, S., Diakonikolas, I., Servedio, R., Sun, X.: Learning mixtures of structured distributions over discrete domains. In: SODA, pp. 1380–1394 (2013)

  13. Chernoff, H.: A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 23, 493–507 (1952)

  14. Chen, L.H.Y.: On the convergence of Poisson binomial to Poisson distributions. Ann. Probab. 2, 178–180 (1974)

  15. Chen, S.X., Liu, J.S.: Statistical applications of the Poisson-binomial and conditional Bernoulli distributions. Stat. Sinica 7, 875–892 (1997)

  16. Daskalakis, C.: An efficient PTAS for two-strategy anonymous games. In: WINE 2008, pp. 186–197. Full version available as arXiv report (2008)

  17. Daskalakis, C., Diakonikolas, I., O’Donnell, R., Servedio, R., Tan, L.-Y.: Learning sums of independent integer random variables. In: FOCS (2013)

  18. Daskalakis, C., Diakonikolas, I., Servedio, R.: Learning Poisson binomial distributions. In: STOC, pp. 709–728 (2012)

  19. Daskalakis, C., Diakonikolas, I., Servedio, R.A.: Learning \(k\)-modal distributions via testing. In: SODA, pp. 1371–1385 (2012)

  20. Daskalakis, C., Kamath, G.: Faster and sample near-optimal algorithms for proper learning mixtures of Gaussians. In: 27th Conference on Learning Theory (COLT) (2014)

  21. Devroye, L., Lugosi, G.: Nonasymptotic universal smoothing factors, kernel complexity and Yatracos classes. Ann. Stat. 25, 2626–2637 (1996)

  22. Devroye, L., Lugosi, G.: A universally acceptable smoothing factor for kernel density estimation. Ann. Stat. 24, 2499–2512 (1996)

  23. Devroye, L., Lugosi, G.: Combinatorial Methods in Density Estimation, Springer Series in Statistics. Springer, Berlin (2001)

  24. Deheuvels, P., Pfeifer, D.: A semigroup approach to Poisson approximation. Ann. Probab. 14, 663–676 (1986)

  25. Dubhashi, D., Panconesi, A.: Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, Cambridge (2009)

  26. Daskalakis, C., Papadimitriou, C.: On oblivious PTAS’s for Nash equilibrium. In: STOC 2009, pp. 75–84. Full version available as arXiv report (2011)

  27. Daskalakis, C., Papadimitriou, C.: Sparse covers for sums of indicators. arXiv report (2013). http://arxiv.org/abs/1306.1265

  28. Ehm, W.: Binomial approximation to the Poisson binomial distribution. Stat. Probab. Lett. 11, 7–16 (1991)

  29. Esseen, C.-G.: On the Liapunoff limit of error in the theory of probability. Arkiv för matematik, astronomi och fysik A, 1–19 (1942)

  30. Fillebrown, S.: Faster computation of Bernoulli numbers. J. Algorithms 13(3), 431–445 (1992)

  31. Hodges, S.L., Le Cam, L.: The Poisson approximation to the binomial distribution. Ann. Math. Stat. 31, 740–747 (1960)

  32. Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58, 13–30 (1963)

  33. Johnson, J.L.: Probability and Statistics for Computer Science. Wiley, New York (2003)

  34. Keilson, J., Gerber, H.: Some results for discrete unimodality. J. Am. Stat. Assoc. 66(334), 386–389 (1971)

  35. Kearns, M., Mansour, Y., Ron, D., Rubinfeld, R., Schapire, R., Sellie, L.: On the learnability of discrete distributions. In: Proceedings of the 26th Symposium on Theory of Computing, pp. 273–282 (1994)

  36. Kalai, A.T., Moitra, A., Valiant, G.: Efficiently learning mixtures of two Gaussians. In: STOC, pp. 553–562 (2010)

  37. Knuth, D.E.: The Art of Computer Programming, Volume II: Seminumerical Algorithms, 2nd edn. Addison-Wesley, Boston (1981)

  38. Mikhailov, V.G.: On a refinement of the central limit theorem for sums of independent random indicators. Theory Probab. Appl. 38, 479–489 (1993)

  39. Moitra, A., Valiant, G.: Settling the polynomial learnability of mixtures of Gaussians. In: FOCS, pp. 93–102 (2010)

  40. Kotz, S., Johnson, N.L., Kemp, A.W.: Univariate Discrete Distributions. Wiley, New York (2005)

  41. Poisson, S.D.: Recherches sur la probabilité des jugements en matière criminelle et en matière civile. Bachelier, Paris (1837)

  42. Peres, Y., Watson, S.: Personal communication (2011)

  43. Röllin, A.: Translated Poisson approximation using exchangeable pair couplings. Ann. Appl. Probab. 17(5/6), 1596–1614 (2007)

  44. Roos, B.: Binomial approximation to the Poisson binomial distribution: the Krawtchouk expansion. Theory Probab. Appl. 45, 328–344 (2000)

  45. Salamin, E.: Computation of pi using arithmetic-geometric mean. Math. Comput. 30(135), 565–570 (1976)

  46. Sato, K.: Convolution of unimodal distributions can produce any number of modes. Ann. Probab. 21(3), 1543–1549 (1993)

  47. Suresh, A.T., Orlitsky, A., Acharya, J., Jafarpour, A.: Near-optimal-sample estimators for spherical Gaussian mixtures. In: Annual Conference on Neural Information Processing Systems (NIPS) (2014)

  48. Soon, S.Y.T.: Binomial approximation for dependent indicators. Stat. Sin. 6, 703–714 (1996)

  49. Schönhage, A., Strassen, V.: Schnelle Multiplikation großer Zahlen. Computing 7, 281–292 (1971)

  50. Steele, J.M.: Le Cam’s inequality and Poisson approximation. Am. Math. Monthly 101, 48–54 (1994)

  51. Volkova, A.Yu.: A refinement of the central limit theorem for sums of independent random indicators. Theory Probab. Appl. 40, 791–794 (1995)

  52. Valiant, G., Valiant, P.: Estimating the unseen: an \(n/\log (n)\)-sample estimator for entropy and support size, shown optimal via new CLTs. In: STOC, pp. 685–694 (2011)

  53. Wang, Y.H.: On the number of successes in independent trials. Stat. Sin. 3, 295–312 (1993)

  54. Whittaker, E.T.: A Course of Modern Analysis. Cambridge University Press, Cambridge (1980)

  55. Yatracos, Y.G.: Rates of convergence of minimum distance estimators and Kolmogorov’s entropy. Ann. Stat. 13, 768–774 (1985)

Author information

Corresponding author

Correspondence to Ilias Diakonikolas.

Additional information

Constantinos Daskalakis's research was supported by a Sloan Foundation Fellowship, a Microsoft Research Faculty Fellowship, and NSF Awards CCF-0953960 (CAREER) and CCF-1101491. Ilias Diakonikolas's research was supported by a Simons Foundation Postdoctoral Fellowship; some of this work was done while at Columbia University, supported by NSF Grant CCF-0728736 and by an Alexander S. Onassis Foundation Fellowship. Rocco A. Servedio's research was supported by NSF Grants CNS-0716245, CCF-1115703, and CCF-1319788 and by DARPA award HR0011-08-1-0069.

Appendices

Appendix 1: Extension of the Cover Theorem: Proof of Theorem 4

Theorem 4 restates the main cover theorem (Theorem 1) of [27], except that it claims an additional property, namely what follows the word “finally” in the statement of the theorem (we will sometimes refer to this property as the last part of Theorem 4 in the following discussion). Our goal is to show that the cover of [27] already satisfies this property without any modifications, thereby establishing Theorem 4. To avoid reproducing the involved constructions of [27], we assume that the reader has some familiarity with them; still, our proof here is self-contained.

First, we note that the \(\epsilon \)-cover \({\mathcal {S}}_\epsilon \) of Theorem 1 of [27] is a subset of a larger \({\epsilon \over 2}\)-cover \({\mathcal {S}}'_{\epsilon /2}\) of size \(n^{2}+n\cdot (1/\epsilon )^{O(1/\epsilon ^{2})}\), which includes all the \(k\)-sparse and all the \(k\)-heavy Binomial PBDs (up to permutations of the underlying \(p_{i}\)’s), for some \(k=O(1/\epsilon )\). Let us call \({\mathcal {S}}'_{\epsilon /2}\) the “large \({\epsilon \over 2}\)-cover” to distinguish it from \({\mathcal {S}}_\epsilon \), which we will call the “small \(\epsilon \)-cover.” The reader is referred to Theorem 2 in [27] (and the discussion following that theorem) for a description of the large \(\epsilon \over 2\)-cover, and to Section 3.2 of [27] for how this cover is used to construct the small \(\epsilon \)-cover. In particular, the small \(\epsilon \)-cover is a subset of the large \({\epsilon /2}\)-cover, including only a subset of the sparse form distributions in the large \({\epsilon /2}\)-cover. Moreover, for every sparse form distribution in the large \(\epsilon /2\)-cover, the small \(\epsilon \)-cover includes at least one sparse form distribution that is \(\epsilon /2\)-close in total variation distance. Hence, if the large \(\epsilon /2\)-cover satisfies the last part of Theorem 4 (with \(\epsilon /2\) instead of \(\epsilon \) and \({\mathcal {S}}_{\epsilon /2}'\) instead of \({\mathcal {S}}_\epsilon \)), it follows that the small \(\epsilon \)-cover also satisfies the last part of Theorem 4.

So we proceed to argue that, for all \(\epsilon \), the large \(\epsilon \)-cover implied by Theorem 2 of [27] satisfies the last part of Theorem 4. Let us first review how the large cover is constructed (see Section 4 of [27] for the details). For every collection of indicators \(\{X_{i}\}_{i=1}^n\) with expectations \(\{\mathbf{E}[X_{i}]=p_{i}\}_{i}\), the collection is subjected to two filters, called the Stage 1 and Stage 2 filters, and described respectively in Sections 4.1 and 4.2 of [27]. Using the same notation as [27], let us denote by \(\{Z_{i}\}_{i}\) the collection output by the Stage 1 filter and by \(\{Y_{i}\}_{i}\) the collection output by the Stage 2 filter. The collection \(\{Y_{i}\}_{i}\) output by the Stage 2 filter satisfies \(d_{\mathrm{TV}}(\sum _{i} X_{i},\sum _{i} Y_{i})\le \epsilon \), and is included in the cover (possibly after permuting the \(Y_{i}\)’s). Moreover, it is in sparse or heavy Binomial form. This ensures that, for every \(\{X_{i}\}_{i}\), there exists some \(\{Y_{i}\}_{i}\) in the cover that is \(\epsilon \)-close and is in sparse or heavy Binomial form. We proceed to show that the cover thus defined satisfies the last part of Theorem 4.

For \(\{X_{i}\}_{i}\), \(\{Y_{i}\}_{i}\) and \(\{Z_{i}\}_{i}\) as above, let \((\mu , \sigma ^{2})\), \((\mu _Z, \sigma _Z^{2})\) and \((\mu _Y, \sigma _Y^{2})\) denote respectively the (mean, variance) pairs of the variables \(X=\sum _{i} X_{i}\), \(Z=\sum _{i} Z_{i}\) and \(Y=\sum _{i} Y_{i}\). We argue first that the pair \((\mu _Z, \sigma _Z^{2})\) satisfies \(|\mu - \mu _Z| = O(\epsilon )\) and \(|\sigma ^{2}-\sigma _Z^{2}| = O(\epsilon \cdot (1+\sigma ^{2}))\). Next we argue that, if the collection \(\{Y_{i}\}_{i}\) output by the Stage 2 filter is in heavy Binomial form, then \((\mu _Y, \sigma _Y^{2})\) satisfies \(|\mu - \mu _Y| = {O(1)}\) and \(|\sigma ^{2}-\sigma _Y^{2}| = O(1 + \epsilon \cdot (1+\sigma ^{2}))\), concluding the proof.

  • Proof for \((\mu _Z, \sigma _Z^{2})\): The Stage 1 filter only modifies the indicators \(X_{i}\) with \(p_{i} \in (0,1/k) \cup (1-1/k,1)\), for some well-chosen \(k = O(1/\epsilon )\). For convenience let us define \({\mathcal {L}}_{k}=\{i|p_{i}\in (0,1/k)\}\) and \({\mathcal {H}}_{k} =\{i|p_{i}\in (1-1/k,1)\}\) as in [27]. The filter of Stage 1 rounds the expectations of the indicators indexed by \({\mathcal {L}}_{k}\) to some value in \(\{0,1/k\}\) so that no single expectation is altered by more than an additive \(1/k\), and the sum of these expectations is not modified by more than an additive \(1/k\). Similarly, the expectations of the indicators indexed by \({\mathcal {H}}_{k}\) are rounded to some value in \(\{1-1/k,1\}\). See the details of how the rounding is performed in Section 4.1 of [27] (an illustrative rounding scheme with these properties is sketched in code after this proof). Let us then denote by \(\{p_{i}'\}_{i}\) the expectations of the indicators \(\{Z_{i}\}_{i}\) resulting from the rounding. We argue that the mean and variance of \(Z=\sum _{i} Z_{i}\) are close to the mean and variance of \(X\). Indeed,

    $$\begin{aligned} |\mu - \mu _Z|&= \left| \sum _{i} p_{i} - \sum _{i} p_{i}'\right| \nonumber \\&= \left| \sum _{i\in {\mathcal {L}}_{k} \cup {\mathcal {H}}_{k}} p_{i} - \sum _{i \in {\mathcal {L}}_{k} \cup {\mathcal {H}}_{k}} p_{i}'\right| \nonumber \\&\le O(1/k)=O(\epsilon ). \end{aligned}$$
    (11)

    Similarly,

    $$\begin{aligned} |\sigma ^{2} - \sigma _Z^{2}|&= \left| \sum _{i} p_{i}(1-p_{i}) - \sum _{i} p_{i}'(1-p_{i}')\right| \\&\le \left| \sum _{i\in {\mathcal {L}}_{k}} p_{i}(1-p_{i}) - \sum _{i \in {\mathcal {L}}_{k}} p_{i}' (1-p_{i}')\right| \\&+ \left| \sum _{i\in {\mathcal {H}}_{k}} p_{i}(1-p_{i}) - \sum _{i \in {\mathcal {H}}_{k}} p_{i}' (1-p_{i}')\right| . \end{aligned}$$

    We proceed to bound the two terms of the RHS separately. Since the argument is symmetric for \({\mathcal {L}}_{k}\) and \({\mathcal {H}}_{k}\) we only do \({\mathcal {L}}_{k}\). We have

    $$\begin{aligned} \left| \sum _{i\in {\mathcal {L}}_{k}} p_{i}(1-p_{i}) - \sum _{i \in {\mathcal {L}}_{k}} p_{i}' (1-p_{i}')\right|&= \left| \sum _{i\in {\mathcal {L}}_{k}} (p_{i}-p_{i}')(1-(p_{i}+p_{i}'))\right| \\&= \left| \sum _{i\in {\mathcal {L}}_{k}} (p_{i}-p_{i}')-\sum _{i\in {\mathcal {L}}_{k}}(p_{i}-p_{i}')(p_{i}+p_{i}')\right| \\&\le \left| \sum _{i\in {\mathcal {L}}_{k}}(p_{i}\!-\!p_{i}')\right| \!+\!\left| \sum _{i\in {\mathcal {L}}_{k}}(p_{i}\!-\!p_{i}')(p_{i}\!+\!p_{i}')\right| \\&\le {1\over k}+\sum _{i\in {\mathcal {L}}_{k}} |p_{i}-p_{i}'|(p_{i}+p_{i}')\\&\le {1\over k}+{1\over k} \sum _{i\in {\mathcal {L}}_{k}} (p_{i}+p_{i}')\\&\le {1\over k}+{1\over k} \left( 2 \sum _{i\in {\mathcal {L}}_{k}} p_{i}+1/k\right) \\&{=} {1\over k}+{1\over k} \left( {2 \over 1-1/k} \sum _{i\in {\mathcal {L}}_{k}} p_{i} (1-{1/k})+1/k\right) \\&\le {1\over k}+{1\over k} \left( {2 \over 1-1/k} \sum _{i\in {\mathcal {L}}_{k}} p_{i} (1-p_{i})+1/k\right) \\&\le {1\over k}+{1\over k^{2}}+{2\over k-1} \sum _{i\in {\mathcal {L}}_{k}} p_{i} (1-p_{i}). \end{aligned}$$

    Using the above (and a symmetric argument for index set \({\mathcal {H}}_{k}\)) we obtain:

    $$\begin{aligned} |\sigma ^{2} - \sigma _Z^{2}| \le {2\over k}+{2\over k^{2}}+{2\over k-1} \sigma ^{2} = O(\epsilon )(1+\sigma ^{2}). \end{aligned}$$
    (12)
  • Proof for \((\mu _Y, \sigma _Y^{2})\): After the Stage 1 filter is applied to the collection \(\{X_{i}\}_{i}\), the resulting collection of random variables \(\{Z_{i}\}_{i}\) has expectations \(p'_{i} \in \{0,1\} \cup [1/k,1-1/k]\), for all \(i\). The Stage 2 filter takes a different form depending on the cardinality of the set \({\mathcal {M}}=\{i~|~p_{i}' \in [1/k,1-1/k]\}\). In particular, if \(|{\mathcal {M}}| > k^{3}\) the output of the Stage 2 filter is in heavy Binomial form, while if \(|{\mathcal {M}}| \le k^{3}\) the output of the Stage 2 filter is in sparse form. As we only need to provide a guarantee for the distributions in heavy Binomial form, it suffices to consider only the former case.

    • \(|{\mathcal {M}}| > k^{3}\): Let \(\{Y_{i}\}_{i}\) be the collection produced by Stage 2 and let \(Y=\sum _{i} Y_{i}\). Then Lemma 4 of [27] implies that

      $$\begin{aligned} |\mu _Z- \mu _Y| = O(1)~~\text {and}~~|\sigma ^{2}_Z - \sigma ^{2}_Y| = O(1). \end{aligned}$$

      Combining this with (11) and (12) gives

      $$\begin{aligned} |\mu - \mu _Y| = O(1)~~\text {and}~~|\sigma ^{2} - \sigma ^{2}_Y| = O(1 + \epsilon \cdot (1+\sigma ^{2})). \end{aligned}$$

This concludes the proof of Theorem 4. \(\square \)
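
As referenced in the proof above, here is a minimal sketch of one rounding scheme with the two properties used for \((\mu _Z, \sigma _Z^{2})\): every expectation in \((0,1/k)\) moves by at most an additive \(1/k\), and the sum of these expectations moves by at most \(1/k\). It is illustrative only; the actual Stage 1 filter of [27] (and the symmetric treatment of \({\mathcal {H}}_{k}\)) may differ in its details.

```python
import math

def round_small_expectations(ps, k):
    """Snap every p in (0, 1/k) to 0 or 1/k, preserving the sum of those p's to within 1/k."""
    small = [i for i, p in enumerate(ps) if 0 < p < 1.0 / k]
    total = sum(ps[i] for i in small)
    m = math.floor(total * k)             # how many of them receive mass exactly 1/k
    rounded = list(ps)
    for count, i in enumerate(small):
        rounded[i] = 1.0 / k if count < m else 0.0
    return rounded                        # each entry moved by < 1/k; their sum moved by < 1/k

ps = [0.01, 0.02, 0.03, 0.6, 0.97]
print(round_small_expectations(ps, k=10))
```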

Appendix 2: Birgé’s Theorem: Learning Unimodal Distributions

Here we briefly explain how Theorem 5 follows from [6]. We assume that the reader is moderately familiar with the paper [6].

Birgé (see his Theorem 1 and Corollary 1) upper bounds the expected variation distance between the target distribution (which he denotes \(f\)) and the hypothesis distribution that is constructed by his algorithm (which he denotes \(\hat{f}_n\); it should be noted, though, that his “\(n\)” parameter denotes the number of samples used by the algorithm, while we will denote this by “\(m\)”, reserving “\(n\)” for the domain \(\{1,\dots ,n\}\) of the distribution). More precisely, [6] shows that this expected variation distance is at most that of the Grenander estimator (applied to learn a unimodal distribution when the mode is known) plus a lower-order term. For our Theorem 5 we take Birgé’s “\(\eta \)” parameter to be \(\epsilon \). With this choice of \(\eta ,\) by the results of [4, 5] bounding the expected error of the Grenander estimator, if \(m=O(\log (n)/\epsilon ^{3})\) samples are used in Birgé’s algorithm then the expected variation distance between the target distribution and his hypothesis distribution is at most \(O(\epsilon ).\) To go from expected error \({O(\epsilon )}\) to an \({O(\epsilon )}\)-accurate hypothesis with probability at least \(1-\delta \), we run the above-described algorithm \(O(\log (1/\delta ))\) times so that with probability at least \(1-\delta \) some hypothesis obtained is \({O(\epsilon )}\)-accurate. Then we use our hypothesis testing procedure of Lemma 8, or, more precisely, the extension provided in Lemma 10, to identify an \(O(\epsilon )\)-accurate hypothesis from within this pool of \(O(\log (1/\delta ))\) hypotheses. (The use of Lemma 10 is why the running time of Theorem 5 depends quadratically on \(\log (1/\delta )\) and why the sample complexity contains the second \({\frac{1}{\epsilon ^{2}}} \log {\frac{1}{\delta }} \log \log {\frac{1}{\delta }}\) term.)
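
The amplification step just described can be pictured with the generic sketch below: run the base learner \(O(\log (1/\delta ))\) times and select among the resulting hypotheses by a Scheffé-style pairwise tournament on a fresh sample. This is only a stand-in for the hypothesis testing of Lemmas 8 and 10, whose exact statements and guarantees are not reproduced here; `base_learner` and `draw_samples` are assumed callables, and hypotheses are assumed to be explicit dictionaries mapping points to probabilities.

```python
from collections import Counter

def scheffe_winner(p, q, sample):
    """Prefer whichever of p, q better predicts the empirical mass of W = {x : p(x) > q(x)}."""
    support = set(p) | set(q)
    W = {x for x in support if p.get(x, 0.0) > q.get(x, 0.0)}
    p_mass = sum(p[x] for x in W)
    q_mass = sum(q.get(x, 0.0) for x in W)
    tau = sum(x in W for x in sample) / len(sample)   # empirical mass of W
    return p if abs(p_mass - tau) <= abs(q_mass - tau) else q

def amplify(base_learner, draw_samples, reps, train_size, test_size):
    """Run the base learner `reps` times and return the candidate with the most pairwise wins."""
    candidates = [base_learner(draw_samples(train_size)) for _ in range(reps)]
    test_sample = draw_samples(test_size)
    wins = Counter()
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            winner = scheffe_winner(candidates[i], candidates[j], test_sample)
            wins[i if winner is candidates[i] else j] += 1
    best = max(range(len(candidates)), key=lambda i: wins[i])
    return candidates[best]
```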

It remains only to argue that a single run of Birgé’s algorithm on a sample of size \(m = O(\log (n)/\epsilon ^{3})\) can be carried out in \(\tilde{O}(\log ^{2}(n)/\epsilon ^{3})\) bit operations (recall that each sample is a \(\log (n)\)-bit string). His algorithm begins by locating an \(r \in [n]\) that approximately minimizes the value of his function \(d(r)\) (see Section 3 of [6]) to within an additive \(\eta = \epsilon \) (see Definition 3 of his paper); intuitively this \(r\) represents his algorithm’s “guess” at the true mode of the distribution. To locate such an \(r\), following Birgé’s suggestion in Section 3 of his paper, we begin by identifying two consecutive points in the sample such that \(r\) lies between those two sample points. This can be done using \(\log m\) stages of binary search over the (sorted) points in the sample, where at each stage of the binary search we compute the two functions \(d^-\) and \(d^+\) and proceed in the appropriate direction. To compute the function \(d^-(j)\) at a given point \(j\) (the computation of \(d^+\) is analogous), we recall that \(d^-(j)\) is defined as the maximum difference over \([1,j]\) between the empirical cdf and its convex minorant over \([1,j]\). The convex minorant of the empirical cdf (over \(m\) points) can be computed in \(\tilde{O}((\log n)m)\) bit-operations (where the \(\log n\) comes from the fact that each sample point is an element of \([n]\)), and then by enumerating over all points in the sample that lie in \([1,j]\) (in time \(O((\log n)m)\)) we can compute \(d^-(j).\) Thus it is possible to identify two adjacent points in the sample such that \(r\) lies between them in time \(\tilde{O}((\log n)m).\) Finally, as Birgé explains in the last paragraph of Section 3 of his paper, once two such points have been identified it is possible to again use binary search to find a point \(r\) in that interval where \(d(r)\) is minimized to within an additive \(\eta .\) Since the maximum difference between \(d^-\) and \(d^+\) can never exceed 1, at most \(\log (1/\eta )=\log (1/\epsilon )\) stages of binary search are required here to find the desired \(r\).
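
For concreteness, here is a sketch of the \(d^-(j)\) computation described above: form the empirical cdf on \(\{0,1,\ldots ,j\}\), take its greatest convex minorant (the lower convex hull, via a monotone-chain scan), and return the maximum gap. The function names are illustrative, and double-precision floats stand in for the exact arithmetic accounted for above.

```python
def empirical_cdf(sample, j):
    """Empirical cdf of the sample evaluated at 0, 1, ..., j."""
    m = len(sample)
    counts = [0] * (j + 1)
    for s in sample:
        if 0 <= s <= j:
            counts[s] += 1
    ys, acc = [], 0.0
    for c in counts:
        acc += c / m
        ys.append(acc)
    return ys

def convex_minorant(ys):
    """Greatest convex minorant of the points (x, ys[x]), evaluated at every x."""
    n = len(ys)
    hull = []                                  # indices of lower-hull vertices
    for i in range(n):
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            # drop b if the turn a -> b -> i is not strictly convex
            if (b - a) * (ys[i] - ys[a]) - (ys[b] - ys[a]) * (i - a) <= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    vals, h = [], 0
    for i in range(n):                         # linear interpolation between hull vertices
        while h + 1 < len(hull) and hull[h + 1] <= i:
            h += 1
        if h + 1 < len(hull):
            a, b = hull[h], hull[h + 1]
            vals.append(ys[a] + (ys[b] - ys[a]) * (i - a) / (b - a))
        else:
            vals.append(ys[hull[h]])
    return vals

def d_minus(sample, j):
    """Maximum difference on [0, j] between the empirical cdf and its convex minorant."""
    ys = empirical_cdf(sample, j)
    return max(f - g for f, g in zip(ys, convex_minorant(ys)))

print(d_minus([1, 1, 2, 5, 9, 9, 9], j=9))
```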

Finally, once the desired \(r\) has been obtained, it is straightforward to output the final hypothesis (which Birgé denotes \(\hat{f}_n\)). As explained in Definition 3, this hypothesis is the derivative of \(\tilde{F}^r_n\), which is essentially the convex minorant of the empirical cdf to the left of \(r\) and the convex majorant of the empirical cdf to the right of \(r\). As described above, given a value of \(r\) these convex majorants and minorants can be computed in \(\tilde{O}((\log n)m)\) time, and the derivative is simply a collection of uniform distributions as claimed. This concludes our sketch of how Theorem 5 follows from [6].

Appendix 3: Efficient Evaluation of the Poisson Distribution

In this section we provide an efficient algorithm to compute an additive approximation to the Poisson probability mass function. It seems that this should be a basic operation in numerical analysis, but we were not able to find it explicitly in the literature. Our main result for this section is the following.

Theorem 6

There is an algorithm that, on input a rational number \(\lambda >0\), and integers \(k \ge 0\) and \(t>0\), produces an estimate \(\widehat{p_{k}}\) such that

$$\begin{aligned} \left| \widehat{p_{k}} - p_{k}\right| \le {1 \over t}, \end{aligned}$$

where \(p_{k}={\lambda ^k e^{-\lambda } \over k!}\) is the probability that the Poisson distribution of parameter \(\lambda \) assigns to integer \(k\). The running time of the algorithm is \(\tilde{O}({\langle {t}\rangle }^{3} + {\langle {k}\rangle }\cdot {\langle {t}\rangle } +{\langle {\lambda }\rangle } \cdot {\langle {t}\rangle })\).

Proof

Clearly we cannot just compute \(e^{-\lambda }\), \(\lambda ^k\) and \(k!\) separately, as this will take time exponential in the description complexity of \(k\) and \(\lambda \). We follow instead an indirect approach. We start by rewriting the target probability as follows

$$\begin{aligned} p_{k} = e^{-\lambda + k \ln (\lambda )-\ln (k!)}. \end{aligned}$$

Motivated by this formula, let

$$\begin{aligned} E_{k}:=-\lambda + k \ln (\lambda )-\ln (k!). \end{aligned}$$

Note that \(E_{k} \le 0\). Our goal is to approximate \(E_{k}\) to within high enough accuracy and then use this approximation to approximate \(p_{k}\).
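
The reformulation can be checked numerically; the sketch below leans on mpmath's arbitrary-precision \(\log \Gamma \) instead of the bit-complexity-aware routine developed in the remainder of the proof, and the precision setting and example values of \(\lambda \) and \(k\) are arbitrary.

```python
from mpmath import mp, mpf, log, exp, loggamma

mp.dps = 50                                    # working precision in decimal digits
lam, k = mpf('1000.25'), 980
E_k = -lam + k * log(lam) - loggamma(k + 1)    # E_k = -lambda + k ln(lambda) - ln(k!)
p_k = exp(E_k)                                 # Poisson(lambda) mass at k, computed without overflow
print(E_k, p_k)
```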

In particular, the main part of the argument involves an efficient algorithm to compute an approximation \(\widehat{\widehat{E_{k}}}\) to \(E_{k}\) satisfying

$$\begin{aligned} \Big |\widehat{\widehat{{E}_{k}}}-E_{k} \Big | \le {1 \over 4t} \le {1 \over 2t} - {1\over 8 t^{2} }. \end{aligned}$$
(13)

This approximation will have bit complexity \(\tilde{O}({\langle {k}\rangle } +{\langle {\lambda }\rangle }+{\langle {t}\rangle })\) and be computable in time \(\tilde{O}({\langle {k}\rangle } \cdot {\langle {t}\rangle }+{\langle {\lambda }\rangle }+{\langle {t}\rangle }^{3})\).

We show that if we had such an approximation, then we would be able to complete the proof. For this, we claim that it suffices to approximate \(e^{\widehat{\widehat{{E_{k}}}}}\) to within an additive error \({1\over 2t}\). Indeed, if \(\widehat{p_{k}}\) were the result of this approximation, then we would have:

$$\begin{aligned} \widehat{p}_{k}&\le e^{\widehat{\widehat{E_{k}}}}+{1 \over 2t} \\&\le e^{{E}_{k} + {1 \over 2t} - {1\over 8 t^{2} }} + {1 \over 2t}\\&\le e^{{E}_{k} + \ln (1+ {1 \over 2t})} + {1 \over 2t}\\&\le e^{E_{k}}\left( 1+{1 \over 2t}\right) + {1 \over 2t} \le p_{k} +{1\over t}; \end{aligned}$$

and similarly

$$\begin{aligned} \widehat{p}_{k}&\ge e^{\widehat{\widehat{E_{k}}}}-{1 \over 2t} \\&\ge e^{{E}_{k} - ( {1 \over 2t} - {1\over 8 t^{2} })} - {1 \over 2t}\\&\ge e^{{E}_{k} - \ln (1+ {1 \over 2t})} - {1 \over 2t} \\&\ge e^{E_{k}}\Big /\left( 1+{1 \over 2t}\right) - {1 \over 2t}\\&\ge e^{E_{k}}\left( 1-{1 \over 2t}\right) - {1 \over 2t} \ge p_{k} -{1\over t}. \end{aligned}$$

To approximate \(e^{\widehat{\widehat{{E_{k}}}}}\) given \({\widehat{\widehat{{E_{k}}}}}\), we need the following lemma:

Lemma 17

Let \(\alpha \le 0\) be a rational number. There is an algorithm that computes an estimate \(\widehat{e^{\alpha }}\) such that

$$\begin{aligned} \left| \widehat{e^{\alpha }} - e^{\alpha } \right| \le {1 \over 2t} \end{aligned}$$

and has running time \(\tilde{O}({\langle {\alpha }\rangle }\cdot {\langle {t}\rangle }+{\langle {t}\rangle }^{2}).\)

Proof

Since \(e^{\alpha } \in [0,1]\), the point of the additive grid \(\{ {i \over 4t} \}_{i=1}^{4t}\) closest to \(e^{\alpha }\) achieves error at most \(1/(4t)\). Equivalently, in a logarithmic scale, consider the grid \(\{ \ln {i \over 4t} \}_{i=1}^{4t}\) and let \(j^{*}:=\arg \min _{j} \left\{ \Big |\alpha - {\ln ({j \over 4t})}\Big | \right\} \). Then, we have that

$$\begin{aligned} \left| {j^{*} \over (4t)} - e^{\alpha }\right| \le {1 \over 4t}. \end{aligned}$$

The idea of the algorithm is to approximately identify the point \(j^{*}\), by computing approximations to the points of the logarithmic grid combined with a binary search procedure. Indeed, consider the “rounded” grid \(\{ \widehat{\ln {i \over 4t}} \}_{i=1}^{4t}\) where each \(\widehat{\ln ({i \over 4t})}\) is an approximation to \(\ln ({i \over 4t})\) that is accurate to within an additive \({1 \over 16t}\). Notice that, for \(i=1,\ldots ,4t\):

$$\begin{aligned} \ln \left( {i +1 \over 4t}\right) -\ln \left( {i \over 4t}\right) = \ln \left( 1+{1 \over i}\right) \ge \ln \left( 1+{1 \over 4t}\right) > 1/8t. \end{aligned}$$

Given that our approximations are accurate to within an additive \(1/16t\), it follows that the rounded grid \(\{ \widehat{\ln {i \over 4t}} \}_{i=1}^{4t}\) is monotonic in \(i\).

The algorithm does not construct the points of this grid explicitly, but adaptively as it needs them. In particular, it performs a binary search in the set \(\{1,\ldots , 4t\}\) to find the point \(i^{*} := \arg \min _{i} \left\{ \Big |\alpha - \widehat{\ln ({i \over 4t})}\Big | \right\} \). In every iteration of the search, when the algorithm examines the point \(j\), it needs to compute the approximation \(g_{j} = \widehat{\ln ({j \over 4t})}\) and evaluate the distance \(|\alpha - g_{j} |\). It is known that the logarithm of a number \(x\) with a binary fraction of \(L\) bits and an exponent of \(o(L)\) bits can be computed to within a relative error \(O(2^{-L})\) in time \(\tilde{O}(L)\) [8]. It follows from this that \(g_{j}\) has \(O({\langle {t}\rangle })\) bits and can be computed in time \(\tilde{O}({\langle {t}\rangle })\). The subtraction takes linear time, i.e., it uses \(O({\langle {\alpha }\rangle }+{\langle {t}\rangle })\) bit operations. Therefore, each step of the binary search can be done in time \(O({\langle {\alpha }\rangle })+ \tilde{O}({\langle {t}\rangle })\) and thus the overall algorithm has \(O({\langle {\alpha }\rangle } \cdot {\langle {t}\rangle })+ \tilde{O}({\langle {t}\rangle }^{2})\) running time.

The algorithm outputs \(i^{*} \over 4t\) as its final approximation to \(e^{\alpha }\). We argue next that the achieved error is at most an additive \(1 \over 2t\). Since the distance between two consecutive points of the grid \(\{ \ln {i \over 4t} \}_{i=1}^{4t}\) is more than \(1/(8t)\) and our approximations are accurate to within an additive \(1/16t\), a little thought reveals that \(i^{*} \in \{j^{*}-1,j^{*},j^{*}+1\}\). This implies that \(i^{*} \over 4t\) is within an additive \(1/2t\) of \(e^{\alpha }\) as desired, and the proof of the lemma is complete. \(\square \)
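
A minimal sketch of this grid search follows; double-precision `math.log` stands in for the \(O({\langle {t}\rangle })\)-bit logarithm approximations used in the proof, so the bit-complexity accounting is not reproduced.

```python
import math

def approx_exp_nonpositive(alpha, t):
    """Return a multiple of 1/(4t) within an additive 1/(2t) of e^alpha, for alpha <= 0 and t >= 1."""
    assert alpha <= 0 and t >= 1
    n = 4 * t
    lo, hi = 0, n          # find the largest i in {1,...,n} with ln(i/n) <= alpha (0 if none)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if math.log(mid / n) <= alpha:
            lo = mid
        else:
            hi = mid - 1
    candidates = [i for i in (lo, lo + 1) if 1 <= i <= n]
    best = min(candidates, key=lambda i: abs(alpha - math.log(i / n)))
    return best / n

t = 100
for a in (0.0, -0.5, -3.0, -20.0):
    print(a, approx_exp_nonpositive(a, t), math.exp(a))
```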

Given Lemma 17, we describe how we could approximate \(e^{\widehat{\widehat{{E_{k}}}}}\) given \({\widehat{\widehat{{E_{k}}}}}\). Recall that we want to output an estimate \(\widehat{p_{k}}\) such that \(|\widehat{p_{k}} -e^{\widehat{\widehat{{E_{k}}}}}| \le 1/(2t)\). We distinguish the following cases:

  • If \(\widehat{\widehat{{E_{k}}}} \ge 0\), we output \(\widehat{p_{k}}:=1\). Indeed, given that \(\Big |\widehat{\widehat{{E}_{k}}}-E_{k} \Big | \le {1 \over 4t}\) and \(E_{k} \le 0\), if \(\widehat{\widehat{{E_{k}}}} \ge 0\) then \(\widehat{\widehat{{E_{k}}}} \in [0,{1 \over 4t}]\). Hence, because \(t\ge 1\), \(e^{\widehat{\widehat{{E_{k}}}}} \in [1,1+1/2t]\), so \(1\) is within an additive \(1/2t\) of the right answer.

  • Otherwise, \(\widehat{p_{k}}\) is defined to be the estimate obtained by applying Lemma 17 for \(\alpha := \widehat{\widehat{E_{k}}}\). Given the bit complexity of \(\widehat{\widehat{E_{k}}}\), the running time of this procedure will be \(\tilde{O}({\langle {k}\rangle } \cdot {\langle {t}\rangle } +{\langle {\lambda }\rangle }\cdot {\langle {t}\rangle } + {\langle {t}\rangle }^{2})\).

Hence, the overall running time is \(\tilde{O}({\langle {k}\rangle } \cdot {\langle {t}\rangle } + {\langle {\lambda }\rangle }\cdot {\langle {t}\rangle }+{\langle {t}\rangle }^{3})\).

In view of the above, we only need to show how to compute \(\widehat{\widehat{E_{k}}}\). There are several steps to our approximation:

  1.

    (Stirling’s Asymptotic Approximation): Recall Stirling’s asymptotic approximation (see e.g., [54] p.193), which says that \(\ln k!\) equals

    $$\begin{aligned} k \ln (k) - k + (1/2)\cdot \ln (2\pi k) +\sum _{j=2}^{m} { \frac{B_{j} \cdot (-1)^j}{j(j-1) \cdot k^{j-1}} }+O(1/k^m) \end{aligned}$$

    where \(B_{j}\) are the Bernoulli numbers. We define an approximation of \(\ln {k!}\) as follows (a numerical sanity check of this truncation is sketched in code after the proof):

    $$\begin{aligned} \widehat{\ln k!}:=k \ln (k) - k + (1/2)\cdot \ln (2\pi k) + \sum _{j=2}^{m_0} { \frac{B_{j} \cdot (-1)^j}{j(j-1) \cdot k^{j-1}}} \end{aligned}$$

    for \(m_0:= O\left( \left\lceil {{\langle {t}\rangle } \over {\langle {k}\rangle }} \right\rceil +1\right) .\)

  2.

    (Definition of an approximate exponent \(\widehat{E_{k}}\)): Define \(\widehat{E_{k}}:=-\lambda + k \ln (\lambda )-\widehat{\ln (k!)}\). Given the above discussion, we can calculate the distance of \(\widehat{E_{k}}\) to the true exponent \(E_{k}\) as follows:

    $$\begin{aligned} |E_{k} - \widehat{E_{k}}| \le |\ln (k!)-\widehat{\ln (k!)}|&\le O(1/k^{m_0})\end{aligned}$$
    (14)
    $$\begin{aligned}&\le {1 \over 10t}. \end{aligned}$$
    (15)

    So we can focus our attention on approximating \(\widehat{E_{k}}\). Note that \(\widehat{E_{k}}\) is the sum of \(m_0+2 = O({\log t \over \log k})\) terms. To approximate it within error \(1/(10t)\), it suffices to approximate each summand to within an additive error of \(O(1/(t \cdot \log t))\). Indeed, we approximate each summand to this accuracy, and our final approximation \(\widehat{\widehat{E_{k}}}\) is the sum of these approximations. We proceed with the analysis:

  3.

    (Estimating \(2\pi \)): Since \(2\pi \) shows up in the above expression, we should try to approximate it. It is known that the first \(\ell \) digits of \(\pi \) can be computed exactly in time \(O(\log \ell \cdot M(\ell ))\), where \(M(\ell )\) is the time to multiply two \(\ell \)-bit integers [9, 45]. For example, if we use the Schönhage-Strassen algorithm for multiplication [49], we get \(M(\ell )=O(\ell \cdot \log \ell \cdot \log \log \ell )\). Hence, choosing \(\ell :=\lceil \log _2(12t \cdot \log t)\rceil \), we can obtain in time \(\tilde{O}({\langle {t}\rangle })\) an approximation \(\widehat{2\pi }\) of \(2\pi \) that has a binary fraction of \(\ell \) bits and satisfies:

    $$\begin{aligned} |\widehat{2\pi }-2\pi | \le 2^{-\ell }~~\Rightarrow ~~ (1-2^{-\ell }) 2 \pi \le \widehat{2\pi } \le (1+2^{-\ell }) 2\pi . \end{aligned}$$

    Note that, with this approximation, we have

    $$\begin{aligned} \left| \ln (2\pi ) - \ln (\widehat{2\pi }) \right| \le \ln \left( {1 \over 1-2^{-\ell }}\right) \le 2\cdot 2^{-\ell } \le 1/(6t \cdot \log t). \end{aligned}$$
  4.

    (Floating-Point Representation): We will also need accurate approximations to \(\ln {\widehat{2\pi }}\), \(\ln k\) and \(\ln \lambda \). We think of \(\widehat{2\pi }\) and \(k\) as multiple-precision floating point numbers base \(2\). In particular,

    • \(\widehat{2\pi }\) can be described with a binary fraction of \(\ell +3\) bits and a constant size exponent; and

    • \(k \equiv 2^{\lceil \log k\rceil }\cdot {k \over 2^{\lceil \log k\rceil }}\) can be described with a binary fraction of \(\lceil \log k \rceil \), i.e., \({\langle {k}\rangle }\), bits and an exponent of length \(O( \log \log k)\), i.e., \(O(\log {\langle {k}\rangle })\).

    Also, since \(\lambda \) is a positive rational number, \(\lambda ={\lambda _1 \over \lambda _2}\), where \(\lambda _1\) and \(\lambda _2\) are positive integers of at most \({\langle {\lambda }\rangle }\) bits. Hence, for \(i=1,2\), we can think of \(\lambda _{i}\) as a multiple-precision floating point number base \(2\) with a binary fraction of \({\langle {\lambda }\rangle }\) bits and an exponent of length \(O(\log {\langle {\lambda }\rangle })\). Hence, if we choose \(L = \lceil \log _2(12(3k+1)t^{2} \cdot k \cdot \lambda _1 \cdot \lambda _2) \rceil = O({\langle {k}\rangle }+{\langle {\lambda }\rangle }+{\langle {t}\rangle })\), we can represent all numbers \(\widehat{2\pi }, \lambda _1,\lambda _2, k\) as multiple precision floating point numbers with a binary fraction of \(L\) bits and an exponent of \(O(\log L)\) bits.

  5.

    (Estimating the logs): It is known that the logarithm of a number \(x\) with a binary fraction of \(L\) bits and an exponent of \(o(L)\) bits can be computed to within a relative error \(O(2^{-L})\) in time \(\tilde{O}(L)\) [8]. Hence, in time \(\tilde{O}(L)\) we can obtain approximations \(\widehat{\ln \widehat{2\pi }}, \widehat{\ln k}, \widehat{\ln {\lambda _1}}, \widehat{\ln {\lambda _2}}\) such that:

    • \(|\widehat{\ln k} - {\ln k}| \le 2^{-L} {\ln k} \le {1 \over 12(3k+1)t^{2}}\); and similarly

    • \(|\widehat{\ln \lambda _{i}} - {\ln \lambda _{i}}| \le {1 \over 12(3k+1)t^{2}}\), for \(i=1,2\);

    • \(|\widehat{\ln \widehat{2\pi }} - {\ln \widehat{2\pi }}| \le {1 \over 12(3k+1)t^{2}}.\)

  6.

    (Estimating the terms of the series): To complete the analysis, we also need to approximate each term of the form \(c_{j} = \frac{B_{j}}{j(j-1) \cdot k^{j-1}}\) up to an additive error of \(O(1/(t \cdot \log t))\). We do this as follows: We compute the numbers \(B_{j}\) and \(k^{j-1}\) exactly, and we perform the division approximately. Clearly, the positive integer \(k^{j-1}\) has description complexity \(j \cdot {\langle {k}\rangle } = O(m_0 \cdot {\langle {k}\rangle }) = O({\langle {t}\rangle }+{\langle {k}\rangle })\), since \(j = O(m_0)\). We compute \(k^{j-1}\) exactly using repeated squaring in time \(\tilde{O}(j \cdot {\langle {k}\rangle }) = \tilde{O}({\langle {t}\rangle }+{\langle {k}\rangle })\). It is known [30] that the rational number \(B_{j}\) has \(\tilde{O}(j)\) bits and can be computed in \(\tilde{O}(j^{2}) = \tilde{O}({\langle {t}\rangle }^{2})\) time. Hence, the approximate evaluation of the term \(c_{j}\) (up to the desired additive error of \(1/(t \log t)\)) can be done in \(\tilde{O}({\langle {t}\rangle }^{2}+{\langle {k}\rangle })\), by a rational division operation (see e.g., [37]). The sum of all the approximate terms takes linear time, hence the approximate evaluation of the entire truncated series (comprising at most \(m_0 \le {\langle {t}\rangle }\) terms) can be done in \(\tilde{O}({\langle {t}\rangle }^{3}+{\langle {k}\rangle } \cdot {\langle {t}\rangle })\) time overall. Let \(\widehat{\widehat{E_{k}}}\) be the approximation arising if we use all the aforementioned approximations. It follows from the above computations that

    $$\begin{aligned} \Big |\widehat{\widehat{E_{k}}} - \widehat{E_{k}} \Big | \le {1 \over 10t}. \end{aligned}$$
  7.

    (Overall Error): Combining the above computations we get:

    $$\begin{aligned} \Big |\widehat{\widehat{E_{k}}} - {E_{k}} \Big | \le {1 \over 4t}. \end{aligned}$$

    The overall time needed to obtain \(\widehat{\widehat{E_{k}}}\) was \(\tilde{O}({\langle {k}\rangle } \cdot {\langle {t}\rangle }+{\langle {\lambda }\rangle }+{\langle {t}\rangle }^{3})\) and the proof of Theorem 6 is complete. \(\square \)
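
As promised in Step 1, here is a quick numerical sanity check of the truncated Stirling series used for \(\widehat{\ln k!}\), with mpmath's Bernoulli numbers and \(\log \Gamma \) as the reference; the choice of \(k\) and \(m_0\) is arbitrary.

```python
from mpmath import mp, mpf, bernoulli, log, pi, loggamma

def stirling_ln_factorial(k, m0):
    """Truncated Stirling series: k ln k - k + (1/2) ln(2 pi k) + sum_{j=2}^{m0} B_j (-1)^j / (j (j-1) k^(j-1))."""
    k = mpf(k)
    s = k * log(k) - k + log(2 * pi * k) / 2
    for j in range(2, m0 + 1):
        s += bernoulli(j) * (-1) ** j / (j * (j - 1) * k ** (j - 1))
    return s

mp.dps = 50
k, m0 = 25, 8
print(stirling_ln_factorial(k, m0))      # truncated series
print(loggamma(k + 1))                   # reference value of ln(k!)
```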

Cite this article

Daskalakis, C., Diakonikolas, I. & Servedio, R.A. Learning Poisson Binomial Distributions. Algorithmica 72, 316–357 (2015). https://doi.org/10.1007/s00453-015-9971-3
