Deferred Maintenance of Disk-Based Random Samples

  • Rainer Gemulla
  • Wolfgang Lehner
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3896)

Abstract

Random sampling is a well-known technique for approximate processing of large datasets. We introduce a set of algorithms for incremental maintenance of large random samples on secondary storage. We show that the sample maintenance cost can be reduced by refreshing the sample in a deferred manner. We introduce a novel type of log file which follows the intuition that only a “sample” of the operations on the base data has to be considered to maintain a random sample in a statistically correct way. Additionally, we develop a deferred refresh algorithm which updates the sample using only fast sequential disk access and which does not require any main memory. We conducted an extensive set of experiments and found that our algorithms reduce maintenance cost by several orders of magnitude.
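The intuition behind logging only a “sample” of the operations can be illustrated with a minimal sketch based on classic reservoir sampling. This is not the algorithm from the paper: the class name `DeferredReservoir`, its methods, and the in-memory lists standing in for the on-disk sample and the log file are all assumptions made for illustration, and the refresh step here uses random access rather than the sequential-only refresh the abstract describes.

```python
import random


class DeferredReservoir:
    """Illustrative reservoir sample whose updates are logged and applied later.

    Hypothetical sketch: the lists below stand in for on-disk structures.
    """

    def __init__(self, size, seed=None):
        self.size = size
        self.sample = []   # stands in for the sample on secondary storage
        self.log = []      # stands in for the log of accepted candidates
        self.seen = 0      # number of insertions into the base data so far
        self.rng = random.Random(seed)

    def insert(self, item):
        """Record an insertion into the base data without touching the sample."""
        self.seen += 1
        if self.seen <= self.size:
            # The first `size` items always enter the sample.
            self.log.append(item)
        elif self.rng.random() < self.size / self.seen:
            # Later items are accepted with probability size / seen;
            # rejected items are never logged.
            self.log.append(item)

    def refresh(self):
        """Apply the logged candidates to the sample in arrival order."""
        for item in self.log:
            if len(self.sample) < self.size:
                self.sample.append(item)
            else:
                # Each accepted candidate replaces a uniformly chosen victim.
                self.sample[self.rng.randrange(self.size)] = item
        self.log.clear()
```

Because the replaced slot is chosen independently of the items themselves, applying the accepted candidates later, in arrival order, yields the same sample distribution as applying them immediately. The log also stays much shorter than the stream of insertions, since the acceptance probability shrinks as the base data grows.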

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Rainer Gemulla, Dresden University of Technology, Dresden, Germany
  • Wolfgang Lehner, Dresden University of Technology, Dresden, Germany
