Optimal Bounds for Estimating Entropy with PMF Queries

  • Cafer Caferov
  • Barış Kaya
  • Ryan O’Donnell
  • A. C. Cem Say
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9235)

Abstract

Let p be an unknown probability distribution on \([n] := \{1, 2, \dots, n\}\) that we can access via two kinds of queries: a SAMP query takes no input and returns \(x \in [n]\) with probability p[x]; a PMF query takes as input \(x \in [n]\) and returns the value p[x]. We consider the task of estimating the entropy of p to within \(\pm \varDelta\) (with high probability). For the usual Shannon entropy H(p), we show that \(\varOmega(\log^2 n / \varDelta^2)\) queries are necessary, matching a recent upper bound of Canonne and Rubinfeld. For the Rényi entropy \(H_\alpha(p)\), where \(\alpha > 1\), we show that \(\varTheta(n^{1-1/\alpha})\) queries are necessary and sufficient. This complements recent work of Acharya et al. in the \(\mathsf{SAMP}\)-only model, which showed that \(O(n^{1-1/\alpha})\) queries suffice when \(\alpha\) is an integer but \(\widetilde{\varOmega}(n)\) queries are necessary when \(\alpha\) is a noninteger. All of our lower bounds also extend easily to the model where CDF queries (given x, return \(\sum_{y \le x} p[y]\)) are allowed.
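
As a concrete illustration of the query model and of the quantities being estimated, the following is a minimal Python sketch (our own, not from the paper; the class and function names are illustrative). It wraps a distribution p on [n] behind SAMP, PMF, and CDF oracles and evaluates, directly from p, the Shannon entropy \(H(p) = -\sum_x p[x] \log_2 p[x]\) and the Rényi entropy \(H_\alpha(p) = \frac{1}{1-\alpha} \log_2 \sum_x p[x]^\alpha\), both measured in bits.

    import math
    import random

    class DistributionOracle:
        """Illustrative oracle for an unknown distribution p on [n] = {1, ..., n}."""

        def __init__(self, p):
            # p is a list of probabilities summing to 1; p[i] plays the role
            # of p[i + 1] in the paper's 1-indexed notation.
            self.p = p

        def samp(self):
            # SAMP query: no input, returns x in [n] with probability p[x].
            return random.choices(range(1, len(self.p) + 1), weights=self.p)[0]

        def pmf(self, x):
            # PMF query: given x in [n], returns p[x].
            return self.p[x - 1]

        def cdf(self, x):
            # CDF query: given x in [n], returns sum_{y <= x} p[y].
            return sum(self.p[:x])

    def shannon_entropy(p):
        # H(p) = -sum_x p[x] * log2(p[x]); terms with p[x] = 0 contribute 0.
        return -sum(px * math.log2(px) for px in p if px > 0)

    def renyi_entropy(p, alpha):
        # H_alpha(p) = (1 / (1 - alpha)) * log2(sum_x p[x]^alpha), for alpha != 1.
        return math.log2(sum(px ** alpha for px in p)) / (1 - alpha)

    # Example: for the uniform distribution on [8], every entropy equals 3 bits.
    oracle = DistributionOracle([1 / 8] * 8)
    print(oracle.pmf(3))                     # 0.125
    print(oracle.cdf(4))                     # 0.5
    print(shannon_entropy(oracle.p))         # 3.0
    print(renyi_entropy(oracle.p, alpha=2))  # 3.0

An estimator in this model only sees the outputs of samp, pmf, and cdf calls; the query complexities above count how many such calls are needed to approximate these entropies to within \(\pm \varDelta\).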

Keywords

Mutual Information · Shannon Entropy · Collision Probability · Unknown Distribution · Support Size

Notes

Acknowledgments

We thank Clément Canonne for his assistance with our questions about the literature, and an anonymous reviewer for helpful remarks on a previous version of this manuscript.

References

  1. Acharya, J., Orlitsky, A., Suresh, A.T., Tyagi, H.: The complexity of estimating Rényi entropy. In: Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms (2015)
  2. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999)
  3. Batu, T., Dasgupta, S., Kumar, R., Rubinfeld, R.: The complexity of approximating the entropy. SIAM J. Comput. 35(1), 132–150 (2005)
  4. Bhuvanagiri, L., Ganguly, S.: Estimating entropy over data streams. In: Azar, Y., Erlebach, T. (eds.) ESA 2006. LNCS, vol. 4168, pp. 148–159. Springer, Heidelberg (2006)
  5. Canonne, C.: A survey on distribution testing: Your data is big. But is it blue? Technical Report TR15-063, ECCC (2015)
  6. Canonne, C., Rubinfeld, R.: Testing probability distributions underlying aggregated data. Technical Report 1402.3835, arXiv (2014)
  7. Chakrabarti, A., Cormode, G., McGregor, A.: A near-optimal algorithm for computing the entropy of a stream, pp. 328–335 (2007)
  8. Chakrabarti, A., Ba, K.D., Muthukrishnan, S.: Estimating entropy and entropy norm on data streams. Internet Math. 3(1), 63–78 (2006)
  9. Guha, S., McGregor, A., Venkatasubramanian, S.: Streaming and sublinear approximation of entropy and information distances. In: Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 733–742. ACM (2006)
  10. Harvey, N., Nelson, J., Onak, K.: Sketching and streaming entropy via approximation theory. In: Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science, pp. 489–498 (2008)
  11. Kearns, M., Mansour, Y., Ron, D., Rubinfeld, R., Schapire, R., Sellie, L.: On the learnability of discrete distributions. In: Proceedings of the 26th Annual ACM Symposium on Theory of Computing, pp. 273–282 (1994)
  12. Lall, A., Sekar, V., Ogihara, M., Xu, J., Zhang, H.: Data streaming algorithms for estimating entropy of network traffic. In: Proceedings of ACM SIGMETRICS, pp. 145–156 (2006)
  13. Paninski, L.: Estimation of entropy and mutual information. Neural Comput. 15(6), 1191–1253 (2003)
  14. Paninski, L.: Estimating entropy on \(m\) bins given fewer than \(m\) samples. IEEE Trans. Inf. Theory 50(9), 2200–2203 (2004)
  15. Rubinfeld, R.: Taming big probability distributions. XRDS: Crossroads ACM Mag. Stud. 19(1), 24–28 (2012)
  16. Tsallis, C.: Possible generalization of Boltzmann-Gibbs statistics. Technical Report CBPF-NF-062/87, CBPF (1987)
  17. Valiant, G., Valiant, P.: A CLT and tight lower bounds for estimating entropy. Technical Report TR10-179, ECCC (2011)
  18. Valiant, G., Valiant, P.: Estimating the unseen: an \(n/\log(n)\)-sample estimator for entropy and support size, shown optimal via new CLTs. In: Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, pp. 685–694 (2011)
  19. Valiant, P.: Testing symmetric properties of distributions. SIAM J. Comput. 40(6), 1927–1968 (2011)

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Cafer Caferov (1)
  • Barış Kaya (1)
  • Ryan O’Donnell (2)
  • A. C. Cem Say (1)
  1. Computer Engineering Department, Boğaziçi University, Istanbul, Turkey
  2. Department of Computer Science, Carnegie Mellon University, Pittsburgh, USA
