
Deferred Maintenance of Disk-Based Random Samples

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3896)

Abstract

Random sampling is a well-known technique for approximate processing of large datasets. We introduce a set of algorithms for incremental maintenance of large random samples on secondary storage. We show that the sample maintenance cost can be reduced by refreshing the sample in a deferred manner. We introduce a novel type of log file which follows the intuition that only a “sample” of the operations on the base data has to be considered to maintain a random sample in a statistically correct way. Additionally, we develop a deferred refresh algorithm which updates the sample using only fast sequential disk access and which does not require any main memory. We conducted an extensive set of experiments and found that our algorithms reduce maintenance cost by several orders of magnitude.
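The intuition behind logging only a “sample” of the operations can be illustrated with classic reservoir sampling. The following is a minimal sketch, not the algorithm from the paper: it assumes an insert-only stream, keeps the sample and the log in Python lists rather than on disk, and all names (`DeferredReservoir`, `insert`, `refresh`) are hypothetical. An inserted item is written to the log only if reservoir sampling would accept it, and the sample itself is touched only when `refresh` replays the log.

```python
import random


class DeferredReservoir:
    """Illustrative sketch: reservoir sample with a deferred refresh.

    Assumptions (not taken from the paper): insert-only workload,
    in-memory lists standing in for the on-disk sample and log file.
    """

    def __init__(self, sample_size, seed=None):
        self.sample_size = sample_size
        self.sample = []   # stands in for the disk-based sample
        self.log = []      # holds only the accepted candidates
        self.seen = 0      # number of items inserted so far
        self.rng = random.Random(seed)

    def insert(self, item):
        """Process one insertion on the base data."""
        self.seen += 1
        if self.seen <= self.sample_size:
            # Initial fill: the first sample_size items are always kept.
            self.sample.append(item)
        elif self.rng.random() < self.sample_size / self.seen:
            # Accepted with probability sample_size / seen, as in
            # reservoir sampling; rejected items are never logged.
            self.log.append(item)

    def refresh(self):
        """Replay the logged candidates in order and clear the log."""
        for item in self.log:
            # Each candidate overwrites a uniformly chosen slot.
            slot = self.rng.randrange(self.sample_size)
            self.sample[slot] = item
        applied = len(self.log)
        self.log.clear()
        return applied
```

Because the acceptance decision is made at insertion time and the candidates are replayed in their original order, the refreshed sample has the same distribution as one maintained eagerly by reservoir sampling. The log grows far more slowly than the base data, since the acceptance probability shrinks as more items arrive. This sketch does not attempt the sequential-access, memory-free refresh described in the abstract.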




Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gemulla, R., Lehner, W. (2006). Deferred Maintenance of Disk-Based Random Samples. In: Ioannidis, Y., et al. Advances in Database Technology - EDBT 2006. EDBT 2006. Lecture Notes in Computer Science, vol 3896. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11687238_27


  • DOI: https://doi.org/10.1007/11687238_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-32960-2

  • Online ISBN: 978-3-540-32961-9

  • eBook Packages: Computer Science, Computer Science (R0)
