Maintaining bounded-size sample synopses of evolving datasets

  • Special Issue Paper
The VLDB Journal

Abstract

Perhaps the most flexible synopsis of a database is a uniform random sample of the data; such samples are widely used to speed up processing of analytic queries and data-mining tasks, enhance query optimization, and facilitate information integration. The ability to bound the maximum size of a sample can be very convenient from a system-design point of view, because the task of memory management is simplified, especially when many samples are maintained simultaneously. In this paper, we study methods for incrementally maintaining a bounded-size uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions and deletions. For “stable” datasets whose size remains roughly constant over time, we provide a novel sampling scheme, called “random pairing” (RP), that maintains a bounded-size uniform sample by using newly inserted data items to compensate for previous deletions. The RP algorithm is the first extension of the 45-year-old reservoir sampling algorithm to handle deletions; RP reduces to the “passive” algorithm of Babcock et al. when the insertions and deletions correspond to a moving window over a data stream. Experiments show that, when dataset-size fluctuations over time are not too extreme, RP is the algorithm of choice with respect to speed and sample-size stability. For “growing” datasets, we consider algorithms for periodically resizing a bounded-size random sample upwards. We prove that no such algorithm can avoid accessing the base data, and provide a novel resizing algorithm that minimizes the time needed to increase the sample size. We also show how to merge uniform samples from disjoint datasets to obtain a uniform sample of the union of the datasets; the merged sample can be incrementally maintained. Our new RPMerge algorithm extends the HRMerge algorithm of Brown and Haas to effectively deal with deletions, thereby facilitating efficient parallel sampling.
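
The abstract describes random pairing only at a high level, so the following Python sketch is illustrative rather than the authors' implementation. It assumes the commonly described formulation of RP: two counters track deletions that have not yet been compensated (c1 for deleted items that were in the sample, c2 for deleted items that were not), and each new insertion is paired with one such deletion. When no deletions are outstanding, the code falls back to an ordinary reservoir-sampling step. The class name, method names, and the assumptions that items are distinct and that only items currently in the dataset are deleted are choices made for this example.

```python
import random


class RandomPairingSample:
    """Sketch of a bounded-size uniform sample under inserts and deletes."""

    def __init__(self, max_size, rng=None):
        self.max_size = max_size   # upper bound M on the sample size
        self.sample = []           # current sample (items assumed distinct)
        self.dataset_size = 0      # number of items currently in the dataset
        self.c1 = 0                # uncompensated deletions that were in the sample
        self.c2 = 0                # uncompensated deletions that were not
        self.rng = rng or random.Random()

    def insert(self, item):
        self.dataset_size += 1
        if self.c1 + self.c2 == 0:
            # Nothing to compensate: ordinary reservoir-sampling step.
            if len(self.sample) < self.max_size:
                self.sample.append(item)
            elif self.rng.random() < self.max_size / self.dataset_size:
                # Accept the new item and overwrite a random victim.
                victim = self.rng.randrange(len(self.sample))
                self.sample[victim] = item
        elif self.rng.random() < self.c1 / (self.c1 + self.c2):
            # Pair the insertion with a deletion that had hit the sample.
            self.sample.append(item)
            self.c1 -= 1
        else:
            # Pair it with a deletion that had missed the sample.
            self.c2 -= 1

    def delete(self, item):
        # Assumes the caller only deletes items currently in the dataset.
        self.dataset_size -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1
```

A caller would create one object per maintained sample, for example `RandomPairingSample(max_size=1000)`, and forward every insertion and deletion on the base data to `insert` and `delete`. Because each removal from the sample increments c1, the sample size plus c1 never exceeds the bound, which is what keeps memory use predictable. The list-based membership test is linear in the sample size; a production version would likely keep an index from item to position.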

References

  1. Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: Proc. SODA, pp. 633–634 (2002)

  2. Brown P., Haas P., Myllymaki J., Pirahesh H., Reinwald B. and Sismanis Y. (2005). Toward automated large-scale information integration and discovery. In: Härder, T. and Lehner, W. (eds) Data Management in a Connected World, pp 161–180. Springer, Heidelberg

  3. Brown, P., Haas, P.J.: BHUNT: automatic discovery of fuzzy algebraic constraints in relational data. In: Proc. VLDB, pp. 668–679 (2003)

  4. Brown, P.G., Haas, P.J.: Techniques for warehousing of sample data. In: Proc. ICDE (2006)

  5. Chaudhuri, S., Motwani, R., Narasayya, V.R.: On random sampling over joins. In: Proc. ACM SIGMOD, pp. 263–274 (1999)

  6. Colt Library: Open source libraries for high performance scientific and technical computing in Java. http://dsd.lbl.gov/~hoschek/colt/

  7. Cormode, G., Muthukrishnan, S., Rozenbaum, I.: Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In: Proc. VLDB, pp. 25–36 (2005)

  8. Fan C., Muller M. and Rezucha I. (1962). Development of sampling plans by using sequential (item by item) techniques and digital computers. J. Am. Statist. Assoc. 57: 387–402

  9. Frahling, G., Indyk, P., Sohler, C.: Sampling in dynamic data streams and applications. In: Proc. 21st Symp. Computat. Geom., pp. 142–149 (2005)

  10. Gemulla, R., Lehner, W.: Deferred maintenance of disk-based random samples. In: Proc. EDBT, pp. 423–441 (2006)

  11. Gemulla, R., Lehner, W., Haas, P.J.: A dip in the reservoir: Maintaining sample synopses of evolving datasets. In: Proc. VLDB, pp. 595–606 (2006)

  12. Gemulla, R., Lehner, W., Haas, P.J.: Maintaining Bernoulli samples over evolving multisets. In: Proc. ACM PODS, pp. 93–102 (2007)

  13. Gibbons P., Matias Y. and Poosala V. (1997). AQUA project white paper. Tech. rep., Bell Laboratories, Murray Hill

  14. Gibbons, P.B., Matias, Y.: New sampling-based summary statistics for improving approximate query answers. In: Proc. ACM SIGMOD, pp. 331–342 (1998)

  15. Gibbons P.B., Matias Y. and Poosala V. (2002). Fast incremental maintenance of approximate histograms. ACM Trans. Database Syst. 27(3): 261–298

  16. GSL: GNU Scientific Library. http://www.gnu.org/software/gsl/

  17. Haas, P., König, C.: A bi-level Bernoulli scheme for database sampling. In: Proc. ACM SIGMOD, pp. 275–286 (2004)

  18. Haas, P.J.: Data stream sampling: Basic techniques and results. In: Garofalakis, M., Gehrke, J., Rastogi, R. (eds.) Data Stream Management: Processing High Speed Data Streams, Springer, Heidelberg (2007)

  19. Halevy, A.Y., Etzioni, O., Doan, A., Ives, Z.G., Madhavan, J., McDowell, L., Tatarinov, I.: Join synopses for approximate query answering. In: Proc. CIDR (2003)

  20. Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: Proc. ACM SIGMOD, pp. 171–182 (1997)

  21. IBM Corporation: WebSphere Profile Stage User’s Manual (2005)

  22. Ilyas, I.F., Markl, V., Haas, P.J., Brown, P., Aboulnaga, A.: CORDS: automatic discovery of correlations and soft functional dependencies. In: Proc. ACM SIGMOD, pp. 647–658 (2004)

  23. Jermaine, C., Pol, A., Arumugam, S.: Online maintenance of very large random samples. In: Proc. ACM SIGMOD, pp. 299–310 (2004)

  24. John, G.H., Langley, P.: Static versus dynamic sampling for data mining. In: Proc. KDD, pp. 367–370 (1996)

  25. Johnson N.L., Kotz S. and Kemp A.W. (1992). Discrete Univariate Distributions, 2nd edn. Wiley, New York

  26. Kachitvichyanukul V. and Schmeiser B. (1985). Computer generation of hypergeometric random variables. J. Stat. Comput. Simul 22: 127–145

  27. Kivinen, J., Mannila, H.: The power of sampling in knowledge discovery. In: Proc. ACM PODS, pp. 77–85 (1994)

  28. Knuth, D.E.: The Art of Computer Programming, vol. 2: Seminumerical Algorithms, 1st edn. Addison-Wesley, Reading (1969)

  29. Law A.M. (2007). Simulation Modeling and Analysis, 4th edn. McGraw-Hill, New York

  30. L’Ecuyer P.  (2006). Uniform random number generation. In: Henderson, S.G. and Nelson, B.L. (eds) Simulation, pp 55–81. Elsevier, Amsterdam

  31. Leser, U., Naumann, F.: (Almost) hands-off information integration for the life sciences. In: Proc. CIDR, pp. 131–143 (2005)

  32. Matsumoto M. and Nishimura T. (1998). Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans. Model. Comput. Simul. 8(1): 3–30

  33. McLeod A.I. and Bellhouse D.R. (1983). A convenient algorithm for drawing a simple random sample. Appl. Statist. 32: 182–184

  34. Norris J.R. (1997). Markov Chains. Cambridge University Press, Cambridge

  35. Olken, F.: Random sampling from databases. Thesis LBL-32883, Information and Computing Sciences Division, Lawrence Berkeley National Laboratory (1993)

  36. Olken, F., Rotem, D.: Maintenance of materialized views of sampling queries. In: Proc. ICDE (1992)

  37. Poosala, V., Haas, P.J., Ioannidis, Y.E., Shekita, E.J.: Improved histograms for selectivity estimation of range predicates. In: Proc. ACM SIGMOD, pp. 294–305 (1996)

  38. Press W.H., Teukolsky S.A., Vetterling W.T. and Flannery B.P. (1992). Numerical Recipes in C, 2nd edn. Cambridge University Press, Cambridge

  39. Robbins H. and Monro S. (1951). A stochastic approximation method. Ann. Math. Statist. 22: 400–407

  40. Ross S.M. (1983). Stochastic Processes. Wiley, New York

  41. Särndal C.E., Swensson B. and Wretman J. (1992). Model Assisted Survey Sampling. Springer, Heidelberg

  42. Spall J.C. (2003). Introduction to Stochastic Search and Optimization. Wiley, New York

  43. Tatbul, N., Çetintemel, U., Zdonik, S.B., Cherniack, M., Stonebraker, M.: Load shedding in a data stream manager. In: Proc. VLDB, pp. 309–320 (2003)

  44. Vitter J.S. (1984). Faster methods for random sampling. Commun. ACM 27(7): 703–718

  45. Vitter J.S. (1985). Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1): 37–57

  46. Zechner, H.: Efficient sampling from continuous and discrete distributions. Ph.D. thesis, Technical University Graz (1997)

Author information

Correspondence to Rainer Gemulla.

About this article

Cite this article

Gemulla, R., Lehner, W. & Haas, P.J. Maintaining bounded-size sample synopses of evolving datasets. The VLDB Journal 17, 173–201 (2008). https://doi.org/10.1007/s00778-007-0065-y
