Deferred Maintenance of Disk-Based Random Samples

  • Rainer Gemulla
  • Wolfgang Lehner
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3896)


Random sampling is a well-known technique for approximate processing of large datasets. We introduce a set of algorithms for incremental maintenance of large random samples on secondary storage. We show that the cost of sample maintenance can be reduced by refreshing the sample in a deferred manner. We introduce a novel type of log file that follows the intuition that only a “sample” of the operations on the base data has to be considered in order to maintain a random sample in a statistically correct way. Additionally, we develop a deferred refresh algorithm that updates the sample using only fast sequential disk access and does not require any main memory. We conducted an extensive set of experiments and found that our algorithms reduce maintenance cost by several orders of magnitude.
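
To make the intuition behind a log that holds only a “sample” of the operations concrete, here is a minimal Python sketch. It is not the paper's algorithm. It assumes an insert-only base table and uses classic reservoir sampling as a stand-in: an insertion is logged only if reservoir sampling would have accepted it, and the logged candidates are applied to the sample later. The class name `DeferredReservoir` and its methods are hypothetical, the sample and log are plain in-memory lists standing in for disk files, and the refresh step replaces random slots rather than reproducing the sequential-access refresh described above.

```python
import random


class DeferredReservoir:
    """Fixed-size uniform sample over an insert-only stream, refreshed lazily.

    Illustrative only: names and structure are assumptions, not the paper's design.
    """

    def __init__(self, size, seed=None):
        self.size = size
        self.sample = []   # stands in for the on-disk sample
        self.log = []      # candidate log: accepted insertions only
        self.seen = 0      # number of insertions observed so far
        self.rng = random.Random(seed)

    def insert(self, item):
        """Record an insertion; only some insertions reach the log."""
        self.seen += 1
        if self.seen <= self.size:
            # The first `size` items always enter the sample.
            self.log.append(item)
        elif self.rng.random() < self.size / self.seen:
            # Later items are accepted with probability size / seen.
            self.log.append(item)

    def refresh(self):
        """Apply the logged candidates to the sample in arrival order."""
        for item in self.log:
            if len(self.sample) < self.size:
                self.sample.append(item)      # still filling the sample
            else:
                victim = self.rng.randrange(self.size)
                self.sample[victim] = item    # evict a uniformly chosen element
        self.log.clear()


if __name__ == "__main__":
    r = DeferredReservoir(size=100, seed=42)
    for i in range(10_000):
        r.insert(i)
    print("log entries before refresh:", len(r.log))
    r.refresh()
    print("sample size after refresh:", len(r.sample))
```

Because the acceptance decision is made at insertion time, the log grows much more slowly than the stream of insertions, and applying the candidates in order yields the same sample distribution as updating the reservoir immediately.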


Keywords (added by machine, not by the authors): Data Stream; Memory Consumption; Candidate Index; Large Random Sample; Refresh Period





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Rainer Gemulla (1)
  • Wolfgang Lehner (1)
  1. Dresden University of Technology, Dresden, Germany
