The VLDB Journal, Volume 17, Issue 2, pp 173–201

Maintaining bounded-size sample synopses of evolving datasets

Special Issue Paper

Abstract

Perhaps the most flexible synopsis of a database is a uniform random sample of the data; such samples are widely used to speed up processing of analytic queries and data-mining tasks, enhance query optimization, and facilitate information integration. The ability to bound the maximum size of a sample can be very convenient from a system-design point of view, because the task of memory management is simplified, especially when many samples are maintained simultaneously. In this paper, we study methods for incrementally maintaining a bounded-size uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions and deletions. For “stable” datasets whose size remains roughly constant over time, we provide a novel sampling scheme, called “random pairing” (RP), that maintains a bounded-size uniform sample by using newly inserted data items to compensate for previous deletions. The RP algorithm is the first extension of the 45-year-old reservoir sampling algorithm to handle deletions; RP reduces to the “passive” algorithm of Babcock et al. when the insertions and deletions correspond to a moving window over a data stream. Experiments show that, when dataset-size fluctuations over time are not too extreme, RP is the algorithm of choice with respect to speed and sample-size stability. For “growing” datasets, we consider algorithms for periodically resizing a bounded-size random sample upwards. We prove that any such algorithm cannot avoid accessing the base data, and provide a novel resizing algorithm that minimizes the time needed to increase the sample size. We also show how to merge uniform samples from disjoint datasets to obtain a uniform sample of the union of the datasets; the merged sample can be incrementally maintained. Our new RPMerge algorithm extends the HRMerge algorithm of Brown and Haas to effectively deal with deletions, thereby facilitating efficient parallel sampling.
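To make the idea of compensating deletions with later insertions concrete, here is a minimal Python sketch of a random-pairing style sampler. It is an illustration under stated assumptions, not the authors' implementation: the class name `RandomPairingSample`, the counter names `c1` and `c2`, and the use of a plain list for the sample are all choices made for this example. It also assumes that items are distinct and that only items currently in the dataset are deleted.

```python
import random


class RandomPairingSample:
    """Illustrative bounded-size sampler in the spirit of random pairing (RP).

    Assumptions of this sketch (not taken from the abstract): items are
    distinct, and delete() is only called for items present in the dataset.
    """

    def __init__(self, max_size, seed=None):
        self.max_size = max_size      # upper bound M on the sample size
        self.sample = []              # current sample
        self.dataset_size = 0         # current size of the underlying dataset
        self.c1 = 0                   # uncompensated deletions that hit the sample
        self.c2 = 0                   # uncompensated deletions that missed the sample
        self.rng = random.Random(seed)

    def insert(self, item):
        self.dataset_size += 1
        d = self.c1 + self.c2
        if d == 0:
            # No deletions to compensate: ordinary reservoir-sampling step.
            if len(self.sample) < self.max_size:
                self.sample.append(item)
            elif self.rng.random() < self.max_size / self.dataset_size:
                victim = self.rng.randrange(len(self.sample))
                self.sample[victim] = item
        else:
            # Pair the new item with one earlier, uncompensated deletion.
            if self.rng.random() < self.c1 / d:
                self.sample.append(item)   # the paired deletion had hit the sample
                self.c1 -= 1
            else:
                self.c2 -= 1               # the paired deletion had missed the sample

    def delete(self, item):
        self.dataset_size -= 1
        if item in self.sample:            # linear scan; fine for a sketch
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1
```

With no deletions outstanding, the sketch behaves like plain reservoir sampling. After a burst of deletions, the sample shrinks temporarily and is refilled by subsequent insertions until both counters return to zero, so the sample size never exceeds `max_size`.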

Keywords

Database sampling, Reservoir sampling, Sample maintenance, Synopsis

References

  1. Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: Proc. SODA, pp. 633–634 (2002)
  2. Brown, P., Haas, P., Myllymaki, J., Pirahesh, H., Reinwald, B., Sismanis, Y.: Toward automated large-scale information integration and discovery. In: Härder, T., Lehner, W. (eds.) Data Management in a Connected World, pp. 161–180. Springer, Heidelberg (2005)
  3. Brown, P., Haas, P.J.: BHUNT: automatic discovery of fuzzy algebraic constraints in relational data. In: Proc. VLDB, pp. 668–679 (2003)
  4. Brown, P.G., Haas, P.J.: Techniques for warehousing of sample data. In: Proc. ICDE (2006)
  5. Chaudhuri, S., Motwani, R., Narasayya, V.R.: On random sampling over joins. In: Proc. ACM SIGMOD, pp. 263–274 (1999)
  6. Colt Library: Open source libraries for high performance scientific and technical computing in Java. http://dsd.lbl.gov/~hoschek/colt/
  7. Cormode, G., Muthukrishnan, S., Rozenbaum, I.: Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In: Proc. VLDB, pp. 25–36 (2005)
  8. Fan, C., Muller, M., Rezucha, I.: Development of sampling plans by using sequential (item by item) techniques and digital computers. J. Am. Statist. Assoc. 57, 387–402 (1962)
  9. Frahling, G., Indyk, P., Sohler, C.: Sampling in dynamic data streams and applications. In: Proc. 21st Symp. Computat. Geom., pp. 142–149 (2005)
  10. Gemulla, R., Lehner, W.: Deferred maintenance of disk-based random samples. In: Proc. EDBT, pp. 423–441 (2006)
  11. Gemulla, R., Lehner, W., Haas, P.J.: A dip in the reservoir: Maintaining sample synopses of evolving datasets. In: Proc. VLDB, pp. 595–606 (2006)
  12. Gemulla, R., Lehner, W., Haas, P.J.: Maintaining Bernoulli samples over evolving multisets. In: Proc. ACM PODS, pp. 93–102 (2007)
  13. Gibbons, P., Matias, Y., Poosala, V.: AQUA project white paper. Tech. rep., Bell Laboratories, Murray Hill (1997)
  14. Gibbons, P.B., Matias, Y.: New sampling-based summary statistics for improving approximate query answers. In: Proc. ACM SIGMOD, pp. 331–342 (1998)
  15. Gibbons, P.B., Matias, Y., Poosala, V.: Fast incremental maintenance of approximate histograms. ACM Trans. Database Syst. 27(3), 261–298 (2002)
  16. GSL: GNU Scientific Library. http://www.gnu.org/software/gsl/
  17. Haas, P., König, C.: A bi-level Bernoulli scheme for database sampling. In: Proc. ACM SIGMOD, pp. 275–286 (2004)
  18. Haas, P.J.: Data stream sampling: Basic techniques and results. In: Garofalakis, M., Gehrke, J., Rastogi, R. (eds.) Data Stream Management: Processing High Speed Data Streams. Springer, Heidelberg (2007)
  19. Halevy, A.Y., Etzioni, O., Doan, A., Ives, Z.G., Madhavan, J., McDowell, L., Tatarinov, I.: Crossing the structure chasm. In: Proc. CIDR (2003)
  20. Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: Proc. ACM SIGMOD, pp. 171–182 (1997)
  21. IBM Corporation: WebSphere Profile Stage User’s Manual (2005)
  22. Ilyas, I.F., Markl, V., Haas, P.J., Brown, P., Aboulnaga, A.: CORDS: automatic discovery of correlations and soft functional dependencies. In: Proc. ACM SIGMOD, pp. 647–658 (2004)
  23. Jermaine, C., Pol, A., Arumugam, S.: Online maintenance of very large random samples. In: Proc. ACM SIGMOD, pp. 299–310 (2004)
  24. John, G.H., Langley, P.: Static versus dynamic sampling for data mining. In: Proc. KDD, pp. 367–370 (1996)
  25. Johnson, N.L., Kotz, S., Kemp, A.W.: Discrete Univariate Distributions, 2nd edn. Wiley, New York (1992)
  26. Kachitvichyanukul, V., Schmeiser, B.: Computer generation of hypergeometric random variables. J. Stat. Comput. Simul. 22, 127–145 (1985)
  27. Kivinen, J., Mannila, H.: The power of sampling in knowledge discovery. In: Proc. ACM PODS, pp. 77–85 (1994)
  28. Knuth, D.E.: The Art of Computer Programming, vol. 2: Seminumerical Algorithms, 1st edn. Addison-Wesley, Reading (1969)
  29. Law, A.M.: Simulation Modeling and Analysis, 4th edn. McGraw-Hill, New York (2007)
  30. L’Ecuyer, P.: Uniform random number generation. In: Henderson, S.G., Nelson, B.L. (eds.) Simulation, pp. 55–81. Elsevier, Amsterdam (2006)
  31. Leser, U., Naumann, F.: (Almost) hands-off information integration for the life sciences. In: Proc. CIDR, pp. 131–143 (2005)
  32. Matsumoto, M., Nishimura, T.: Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans. Model. Comput. Simul. 8(1), 3–30 (1998)
  33. McLeod, A.I., Bellhouse, D.R.: A convenient algorithm for drawing a simple random sample. Appl. Statist. 32, 182–184 (1983)
  34. Norris, J.R.: Markov Chains. Cambridge University Press, Cambridge (1997)
  35. Olken, F.: Random sampling from databases. Thesis LBL-32883, Information and Computing Sciences Division, Lawrence Berkeley National Laboratory (1993)
  36. Olken, F., Rotem, D.: Maintenance of materialized views of sampling queries. In: Proc. ICDE (1992)
  37. Poosala, V., Haas, P.J., Ioannidis, Y.E., Shekita, E.J.: Improved histograms for selectivity estimation of range predicates. In: Proc. ACM SIGMOD, pp. 294–305 (1996)
  38. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C, 2nd edn. Cambridge University Press, Cambridge (1992)
  39. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Statist. 22, 400–407 (1951)
  40. Ross, S.M.: Stochastic Processes. Wiley, New York (1983)
  41. Särndal, C.E., Swensson, B., Wretman, J.: Model Assisted Survey Sampling. Springer, Heidelberg (1992)
  42. Spall, J.C.: Introduction to Stochastic Search and Optimization. Wiley, New York (2003)
  43. Tatbul, N., Çetintemel, U., Zdonik, S.B., Cherniack, M., Stonebraker, M.: Load shedding in a data stream manager. In: Proc. VLDB, pp. 309–320 (2003)
  44. Vitter, J.S.: Faster methods for random sampling. Commun. ACM 27(7), 703–718 (1984)
  45. Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985)
  46. Zechner, H.: Efficient sampling from continuous and discrete distributions. Ph.D. thesis, Technical University Graz (1997)

Copyright information

© Springer-Verlag 2007

Authors and Affiliations

  • Rainer Gemulla (1)
  • Wolfgang Lehner (1)
  • Peter J. Haas (2)
  1. Technische Universität Dresden, Dresden, Germany
  2. IBM Almaden Research Center, San Jose, USA