Stratified random sampling from streaming and stored data

Distributed and Parallel Databases

Abstract

Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams and on statically stored data sets. We present a tight lower bound showing that any streaming algorithm for SRS over the entire stream must have, in the worst case, a variance that is a factor of \(\varOmega (r)\) away from the optimal, where r is the number of strata. We present S-VOILA, a practical streaming algorithm for SRS over the entire stream that is locally variance-optimal. We prove that any sliding window-based streaming SRS algorithm needs a workspace of \(\varOmega (rM\log W)\) in the worst case to maintain a variance-optimal SRS of size M, where W is the number of elements in the sliding window. Because of this inherently high workspace requirement, we present SW-VOILA, a practical multi-layer sampling algorithm that uses only O(M) workspace but can, in practice, maintain an SRS of size close to M over a sliding window. Experiments show that both S-VOILA and SW-VOILA typically yield a variance close to that of their optimal offline counterparts, which are given the entire input beforehand. We also present VOILA, a variance-optimal offline algorithm for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation, which is optimal only under the assumption that each stratum is abundant. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data.
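
To make the reference to Neyman allocation concrete, the following is a minimal Python sketch of the textbook rule, in which a sample budget M is split across strata in proportion to (stratum size) x (stratum standard deviation). It is not VOILA, S-VOILA, or SW-VOILA. The function names `neyman_allocation` and `estimator_variance` are illustrative, and the variance expression is the standard one for a stratified estimate of the population mean, assuming per-stratum standard deviations computed with an n_i - 1 denominator. The second example hints at what goes wrong when a stratum is not abundant: the rule can ask for more samples than the stratum contains.

```python
def neyman_allocation(sizes, stddevs, M):
    """Real-valued Neyman allocation: s_i proportional to n_i * sigma_i.

    sizes[i] is the number of elements in stratum i, stddevs[i] its
    standard deviation, and M the total sample budget (all illustrative
    names, not taken from the article).
    """
    weights = [n_i * sd for n_i, sd in zip(sizes, stddevs)]
    total = sum(weights)
    if total == 0:
        # All strata are constant; any split has zero variance.
        return [M / len(sizes)] * len(sizes)
    return [M * w / total for w in weights]


def estimator_variance(sizes, stddevs, alloc):
    """Variance of the stratified estimate of the population mean.

    Uses sum_i (n_i / n)^2 * sigma_i^2 / s_i * (1 - s_i / n_i),
    which requires 0 < s_i <= n_i for every stratum.
    """
    n = sum(sizes)
    var = 0.0
    for n_i, sd, s_i in zip(sizes, stddevs, alloc):
        if not 0 < s_i <= n_i:
            raise ValueError("each stratum needs 0 < s_i <= n_i")
        var += (n_i / n) ** 2 * sd ** 2 / s_i * (1 - s_i / n_i)
    return var


# Abundant strata: the allocation fits inside every stratum.
alloc = neyman_allocation([1000, 1000], [1.0, 3.0], 100)
print(alloc)  # [25.0, 75.0]

# A small, high-variance stratum: the rule requests about 66.7 samples
# from a stratum that holds only 50 elements.
print(neyman_allocation([10_000, 50], [1.0, 400.0], 100))
```

In the first example the Neyman split gives a lower estimator variance than an even 50/50 split of the same budget, which can be checked with `estimator_variance`. In the second example the requested allocation is infeasible, which is the kind of bounded-stratum case that the abundance assumption excludes.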

Notes

  1. Note that a query for the variance or standard deviation of data is distinct from the variance or standard deviation of an estimate.
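
As an illustration of this distinction, consider the simplest setting (an assumption made here for exposition, not the article's stratified setting): a simple random sample of size s drawn without replacement from n values \(v_1,\ldots ,v_n\) with mean \(\mu \). The symbols are chosen for this example only.

\[
\sigma ^2 = \frac{1}{n}\sum _{j=1}^{n}(v_j-\mu )^2,
\qquad
\mathrm {Var}(\bar{y}) = \frac{S^2}{s}\left( 1-\frac{s}{n}\right) ,
\quad
S^2 = \frac{1}{n-1}\sum _{j=1}^{n}(v_j-\mu )^2 .
\]

The first quantity is a property of the data and is what a variance query asks for. The second describes how much the sample mean \(\bar{y}\) fluctuates across different random samples; it shrinks as s grows and is zero when s = n.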

References

  1. Nguyen, T.D., Shih, M., Srivastava, D., Tirthapura, S., Xu, B.: Stratified random sampling over streaming and stored data. In: EDBT, pp. 25–36 (2019)

  2. Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: The Aqua approximate query answering system. In: Proceedings of SIGMOD, pp. 574–576 (1999)

  3. Agarwal, S., Mozafari, B., Panda, A., Milner, H., Madden, S., Stoica, I.: BlinkDB: Queries with bounded errors and bounded response times on very large data. In: Proceedings of EuroSys, pp. 29–42 (2013)

  4. Kandula, S., Shanbhag, A., Vitorovic, A., Olma, M., Grandl, R., Chaudhuri, S., Ding, B.: Quickr: lazily approximating complex adhoc queries in bigdata clusters. In: SIGMOD, pp. 631–646 (2016)

  5. Chaudhuri, S., Das, G., Narasayya, V.: Optimized stratified sampling for approximate query processing. ACM TODS (2007). https://doi.org/10.1145/1242524.1242526

  6. Johnson, T., Shkapenyuk, V.: Data stream warehousing in Tidalrace. In: Proceedings of CIDR (2015)

  7. Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S., Stoica, I.: Discretized streams: fault-tolerant streaming computation at scale. In: SOSP, pp. 423–438 (2013)

  8. Neyman, J.: On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. J. R. Stat. Soc. 97(4), 558–625 (1934)

  9. Al-Kateb, M., Lee, B.S.: Adaptive stratified reservoir sampling over heterogeneous data streams. Inf. Syst. 39, 199–216 (2014)

  10. Efraimidis, P.S., Spirakis, P.G.: Weighted random sampling with a reservoir. Inf. Process. Lett. 97(5), 181–185 (2006)

  11. Meng, X.: Scalable simple random sampling and stratified sampling. In: Proceedings of ICML, pp. 531–539 (2013)

  12. Al-Kateb, M., Lee, B.S.: Stratified reservoir sampling over heterogeneous data streams. In: Proceedings of SSDBM, pp. 621–639 (2010)

  13. Al-Kateb, M., Lee, B.S., Wang, X.S.: Adaptive-size reservoir sampling over data streams. In: Proceedings of SSDBM, p. 22 (2007)

  14. Bankier, M.D.: Power allocations: determining sample sizes for subnational areas. Am. Stat. 42(3), 174–177 (1988)

  15. Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985)

  16. Lang, K., Liberty, E., Shmakov, K.: Stratified sampling meets machine learning. In: Proceedings of ICML, pp. 2320–2329 (2016)

  17. Acharya, S., Gibbons, P., Poosala, V.: Congressional samples for approximate answering of group-by queries. In: Proceedings of SIGMOD, pp. 487–498 (2000)

  18. Babcock, B., Chaudhuri, S., Das, G.: Dynamic sample selection for approximate query processing. In: Proceedings of SIGMOD, pp. 539–550 (2003)

  19. Joshi, S., Jermaine, C.: Robust stratified sampling plans for low selectivity queries. In: Proceedings of ICDE, pp. 199–208 (2008)

  20. Ding, B., Huang, S., Chaudhuri, S., Chakrabarti, K., Wang, C.: Sample + seek: approximating aggregates with distribution precision guarantee. In: SIGMOD, pp. 679–694 (2016)

  21. Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: Proceedings of PODS, pp. 1–16 (2002)

  22. Cochran, W.G.: Sampling Techniques, 3rd edn. Wiley, New York (1977)

  23. Haas, P.J.: Data-stream sampling: basic techniques and results. Data Stream Management, pp. 13–44. Springer, Berlin (2016)

  24. Lohr, S.L.: Sampling: Design and Analysis, 2nd edn. Duxbury Press, London (2009)

  25. Thompson, S.K.: Sampling, 3rd edn. Wiley, New York (2012)

  26. Tillé, Y.: Sampling Algorithms, 1st edn. Springer, Berlin (2006)

  27. Mcleod, I., Bellhouse, D.: A convenient algorithm for drawing a simple random sample. J. R. Stat. Soc. Ser. C 32, 182–184 (1983)

  28. Vitter, J.S.: Optimum algorithms for two random sampling problems. In: Proceedings of FOCS, pp. 65–75 (1983)

  29. Braverman, V., Ostrovsky, R., Vorsanger, G.: Weighted sampling without replacement from data streams. Inf. Process. Lett. 115(12), 923–926 (2015)

  30. Gemulla, R., Lehner, W., Haas, P.J.: Maintaining bounded-size sample synopses of evolving datasets. VLDB J. 17(2), 173–201 (2008)

  31. Gibbons, P.B., Tirthapura, S.: Estimating simple functions on the union of data streams. In: Proceedings of SPAA, pp. 281–291 (2001)

  32. Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: SODA (2002)

  33. Braverman, V., Ostrovsky, R., Zaniolo, C.: Optimal sampling from sliding windows. In: Proceedings of PODS, pp. 147–156 (2009)

  34. Gemulla, R., Lehner, W.: Sampling time-based sliding windows in bounded space. In: SIGMOD (2008)

  35. Cormode, G., Shkapenyuk, V., Srivastava, D., Xu, B.: Forward decay: a practical time decay model for streaming systems. In: Proceedings of ICDE, pp. 138–149 (2009)

  36. Cormode, G., Tirthapura, S., Xu, B.: Time-decaying sketches for robust aggregation of sensor data. SIAM J. Comput. 39(4), 1309–1339 (2009)

  37. Chung, Y., Tirthapura, S.: Distinct random sampling from a distributed stream. In: IPDPS, pp. 532–541 (2015)

  38. Chung, Y., Tirthapura, S., Woodruff, D.: A simple message-optimal algorithm for random sampling from a distributed stream. IEEE TKDE 28(6), 1356–1368 (2016)

  39. Cormode, G., Muthukrishnan, S., Yi, K., Zhang, Q.: Continuous sampling from distributed streams. JACM (2012). https://doi.org/10.1145/0000000.0000000

  40. Tirthapura, S., Woodruff, D.P.: Optimal random sampling from distributed streams revisited. In: DISC, pp. 283–297 (2011)

  41. Datar, M., Gionis, A., Indyk, P., Motwani, R.: Maintaining stream statistics over sliding windows. SIAM J. Comput. 31(6), 1794–1813 (2002)

  42. Gibbons, P.B., Tirthapura, S.: Distributed streams algorithms for sliding windows. In: SPAA, pp. 63–72 (2002)

  43. Babcock, B., Datar, M., Motwani, R., O’Callaghan, L.: Maintaining variance and k-medians over data stream windows. In: Proceedings of 22nd ACM Symposium on Principles of Database Systems (PODS), pp. 234–243, June (2003)

  44. Zhang, L., Guan, Y.: Variance estimation over sliding windows. In: PODS, pp. 225–232 (2007)

  45. http://openaq.org

  46. https://www.divvybikes.com/system-data

Acknowledgements

Nguyen and Tirthapura were supported in part by NSF Grants 1527541 and 1725702.

Author information

Corresponding author

Correspondence to Trong Duc Nguyen.

Additional information

A preliminary version of this work appears in [1].

About this article

Cite this article

Nguyen, T.D., Shih, MH., Srivastava, D. et al. Stratified random sampling from streaming and stored data. Distrib Parallel Databases 39, 665–710 (2021). https://doi.org/10.1007/s10619-020-07315-w

  • DOI: https://doi.org/10.1007/s10619-020-07315-w
