Sampling in Space Restricted Settings

Algorithmica

Abstract

Space-efficient algorithms play an important role in dealing with large amounts of data. In such settings, one would like to analyze the data using a small amount of “working space”. A key step in many algorithms for analyzing large data is to maintain one random sample (or a small number of random samples) from the data points. In this paper, we consider two space-restricted settings: (i) the streaming model, where data arrives over time and one can use only a small amount of storage, and (ii) the query model, where we can structure the data in low space and answer sampling queries. We prove the following results in these two settings:

  • In the streaming setting, we would like to maintain a random sample from the elements seen so far. We prove that one can maintain a random sample using \(O(\log n)\) random bits and \(O(\log n)\) bits of space, where n is the number of elements seen so far. We can extend this to the case when elements have weights as well.

  • In the query model, there are n elements with weights \(w_1, \ldots , w_n\) (which are w-bit integers) and one would like to sample a random element with probability proportional to its weight. Bringmann and Larsen (STOC 2013) showed how to sample such an element using \(nw+1\) bits of space (whereas the information-theoretic lower bound is \(nw\)). We consider the approximate sampling problem, where we are given an error parameter \(\varepsilon \), and the sampling probability of an element can be off by an \(\varepsilon \) factor. We give matching upper and lower bounds for this problem.
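
For orientation, the two settings can be sketched with textbook baselines. The Python below is a minimal illustration and not the algorithms of this paper: the streaming functions are the classic reservoir-style rule, which draws fresh randomness at every step and so uses more than \(O(\log n)\) random bits in total, and the query structure stores full prefix sums, which takes far more than \(nw+1\) bits. The names `stream_sample`, `weighted_stream_sample` and `PrefixSumSampler` are hypothetical, and weights are assumed to be non-negative integers with a positive total.

```python
import bisect
import itertools
import random


def stream_sample(stream, rng=random):
    """Return one element chosen uniformly at random from an iterable."""
    sample = None
    for t, item in enumerate(stream, start=1):
        # The t-th element replaces the current sample with probability 1/t.
        if rng.randrange(t) == 0:
            sample = item
    return sample


def weighted_stream_sample(stream, rng=random):
    """Stream yields (item, weight) pairs with positive integer weights.

    Returns an item with probability proportional to its weight.
    """
    sample, total = None, 0
    for item, weight in stream:
        total += weight
        # The new item replaces the sample with probability weight/total.
        if rng.randrange(total) < weight:
            sample = item
    return sample


class PrefixSumSampler:
    """Query-model baseline: prefix sums plus binary search per query."""

    def __init__(self, weights):
        self.prefix = list(itertools.accumulate(weights))

    def sample(self, rng=random):
        # Draw r uniformly from {0, ..., W-1}, where W is the total weight,
        # and return the first index whose prefix sum exceeds r.
        r = rng.randrange(self.prefix[-1])
        return bisect.bisect_right(self.prefix, r)


if __name__ == "__main__":
    print(stream_sample(range(10)))
    print(weighted_stream_sample([("a", 1), ("b", 3)]))
    print(PrefixSumSampler([0, 1, 3]).sample())
```

In the weighted streaming rule, element i survives with probability \(\frac{w_i}{W_i}\prod_{j>i}\frac{W_{j-1}}{W_j} = \frac{w_i}{W_n}\), where \(W_j = w_1 + \cdots + w_j\), which is the desired weight-proportional distribution.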

Notes

  1. More bits may be required when t is not a power of 2, but in expectation \(O(\log t)\) random bits suffice.

  2. We will later discuss a sampling algorithm by Vitter which can be adjusted to work with \(O(\log ^2 n)\) random bits.

  3. Note that this is not a trivial observation since \(x_1,x_2,\dots ,x_n\) and \(\frac{x_1}{2},\frac{x_2}{2},\dots ,\frac{x_n}{2}\) both represent the same probability distribution. See Lemma 5.1 in [2].

  4. A Python implementation of this pseudocode can be found at http://www.cse.iitd.ac.in/~rjaiswal/Research/Sampling/sampling.py.

  5. Note that this is under the common word RAM assumption that arithmetic operations on \(O(\log n)\)-bit words can be done in \(O(1)\) time.
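
The expectation bound in Note 1 can be illustrated with standard rejection sampling from fair coin flips. The sketch below is an assumption about the kind of procedure the note refers to, not code from the paper; `uniform_below` and its `random_bit` callback are hypothetical names. It draws \(k = \lceil \log_2 t \rceil\) bits per attempt and retries when the result is at least t. Each attempt succeeds with probability greater than 1/2, so fewer than 2k bits are used in expectation, and exactly k bits when t is a power of 2.

```python
import random


def uniform_below(t, random_bit):
    """Return (value, bits_used) with value uniform in {0, ..., t-1}.

    random_bit is a zero-argument callable returning a fair bit (0 or 1).
    """
    if t < 1:
        raise ValueError("t must be a positive integer")
    k = (t - 1).bit_length()  # smallest k with 2**k >= t
    bits_used = 0
    while True:
        # Assemble a k-bit integer from k fresh random bits.
        value = 0
        for _ in range(k):
            value = (value << 1) | random_bit()
        bits_used += k
        # Accept only values in range; otherwise discard and retry.
        if value < t:
            return value, bits_used


if __name__ == "__main__":
    rng = random.Random()
    print(uniform_below(5, lambda: rng.getrandbits(1)))
```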

References

  1. Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’02, pp. 633–634. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA (2002)

  2. Bringmann, K., Larsen, K.G.: Succinct sampling from discrete distributions. In: Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, STOC ’13, pp. 775–782. ACM, New York, NY, USA (2013)

  3. Bringmann, K., Panagiotou, K.: Efficient sampling methods for discrete distributions. In: Czumaj, A., Mehlhorn, K., Pitts, A., Wattenhofer, R. (eds.) Automata, Languages, and Programming, volume 7391 of Lecture Notes in Computer Science, pp. 133–144. Springer, Berlin, Heidelberg (2012)

  4. Efraimidis, P.S., Spirakis, P.G.: Weighted random sampling with a reservoir. Inf. Process. Lett. 97(5), 181–185 (2006)

  5. Jaiswal, R., Kumar, A., Sen, S.: A simple \({D}^2\)-sampling based PTAS for \(k\)-means and other clustering problems. Algorithmica 70(1), 22–46 (2014)

  6. Knuth, D.E.: The Art of Computer Programming, vol. 2. Addison-Wesley, Boston (1981)

  7. Kronmal, R.A., Peterson Jr., A.V.: On the alias method for generating random variables from a discrete distribution. Am. Stat. 33(4), 214–218 (1979)

  8. Li, K.-H.: Reservoir-sampling algorithms of time complexity \(o( n (1 + \log {N / n}))\). ACM Trans. Math. Softw. 20(4), 481–493 (1994)

  9. Park, B.-H., Ostrouchov, G., Samatova, N.F.: Sampling streaming data with replacement. Comput. Stat. Data Anal. 52(2), 750–762 (2007)

  10. Vitter, J.S.: Faster methods for random sampling. Commun. ACM 27(7), 703–718 (1984)

  11. Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985)

  12. Walker, A.J.: New fast method for generating discrete random numbers with arbitrary frequency distributions. Electron. Lett. 10(8), 127–128 (1974)

Acknowledgements

RJ and AK would like to thank Karl Bringmann for discussion on Succinct Sampling. They would also like to thank the Indo-German IMPECS program for making the interaction possible.

Author information

Corresponding author

Correspondence to Ragesh Jaiswal.

Additional information

Davis Issac: A major part of this work was done while the author was at IIT Delhi. Ragesh Jaiswal acknowledges the support of ISF-UGC India–Israel Joint Research Grant 2014.

About this article

Cite this article

Bhattacharya, A., Issac, D., Jaiswal, R. et al. Sampling in Space Restricted Settings. Algorithmica 80, 1439–1458 (2018). https://doi.org/10.1007/s00453-017-0335-z

  • DOI: https://doi.org/10.1007/s00453-017-0335-z
