Estimating Sum by Weighted Sampling

  • Rajeev Motwani
  • Rina Panigrahy
  • Ying Xu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4596)

Abstract

We study the classic problem of estimating the sum of n variables. The traditional uniform sampling approach requires a linear number of samples to provide any non-trivial guarantee on the estimated sum. In this paper we consider sampling methods other than uniform sampling, in particular sampling a variable with probability proportional to its value, referred to as linear weighted sampling. If only linear weighted sampling is allowed, we show an algorithm for estimating the sum with \(\tilde{O}(\sqrt{n})\) samples, which is almost optimal in the sense that \(\Omega(\sqrt{n})\) samples are necessary for any reasonable sum estimator. If both uniform sampling and linear weighted sampling are allowed, we show a sum estimator with \(\tilde{O}(\sqrt[3]{n})\) samples. More generally, we may allow general weighted sampling, where the probability of sampling a variable is proportional to any function of its value. We prove a lower bound of \(\Omega(\sqrt[3]{n})\) samples for any reasonable sum estimator using general weighted sampling, which implies that our algorithm combining uniform and linear weighted sampling is an almost optimal sum estimator.
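
To make linear weighted sampling concrete, the following is a minimal sketch of one way to estimate a sum from such samples alone. It is an illustration under stated assumptions, not necessarily the algorithm analyzed in the paper. It assumes all values are non-negative and that each sample reveals both the index and the value of the sampled variable. If variable i is drawn with probability \(x_i/S\), two independent draws both hit i with probability \((x_i/S)^2\), so weighting each such collision by \(1/x_i\) and summing over i gives expectation \(\sum_i x_i/S^2 = 1/S\). Averaging over all pairs of samples and inverting gives an estimate of S. The function names `linear_weighted_sample` and `estimate_sum_by_collisions` are hypothetical, and the sampler is simulated from a fully known list purely for demonstration.

```python
import random
from collections import Counter


def linear_weighted_sample(values, m, rng):
    """Simulate m independent draws of an index i with probability values[i] / sum(values).

    In the setting of the abstract the sampler is a black box; here it is
    simulated from the full list only so the sketch can be run.
    """
    if min(values) < 0:
        raise ValueError("linear weighted sampling needs non-negative values")
    return rng.choices(range(len(values)), weights=values, k=m)


def estimate_sum_by_collisions(values, m, rng):
    """Estimate sum(values) from m linear weighted samples via weighted collisions.

    Assumption (not taken from the source): each sample reveals its index and value.
    Returns None when no collision is observed, since the estimate is then undefined.
    """
    counts = Counter(linear_weighted_sample(values, m, rng))
    # An index seen c times yields c*(c-1)/2 colliding pairs, each weighted by 1/x_i.
    weighted_collisions = sum(c * (c - 1) / 2 / values[i] for i, c in counts.items())
    if weighted_collisions == 0:
        return None
    total_pairs = m * (m - 1) / 2
    # weighted_collisions / total_pairs is an unbiased estimate of 1/S; invert it.
    return total_pairs / weighted_collisions


if __name__ == "__main__":
    rng = random.Random(0)
    values = [1.0] * 100  # true sum is 100
    print(estimate_sum_by_collisions(values, 2000, rng))
```

Collisions become likely only once the number of samples is on the order of \(\sqrt{n}\) when the values are roughly equal, which is in line with the \(\sqrt{n}\) scale quoted in the abstract. The sketch makes no attempt to reproduce the paper's guarantees.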

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Rajeev Motwani (1)
  • Rina Panigrahy (2)
  • Ying Xu (1)

  1. Dept of Computer Science, Stanford University, USA
  2. Microsoft Research, Mountain View, CA, USA
