Informatik - Forschung und Entwicklung, Volume 20, Issue 1–2, pp 45–56

Hierarchisches gruppenbasiertes Sampling (Hierarchical Group-Based Sampling)

  • Rainer Gemulla
  • Henrike Berthold
  • Wolfgang Lehner
Original Article

Abstract

As databases grow, approximate query evaluation becomes unavoidable when answers are needed quickly. This article surveys several methods for achieving this goal and then focuses on sampling, which quickly yields adequate results by evaluating a query on a sample rather than on the full data. When queries contain join or group-by operations, the accuracy of many sampling methods drops sharply; small groups in particular go undetected. The article is concerned with hierarchical group-based sampling, which combines sampling, grouping, and join operations.
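
The loss of small groups under plain sampling can be illustrated with a short sketch. It is not the article's method; it only shows uniform Bernoulli sampling, under the assumption of a hypothetical table with one large and two small groups and a 1% sampling rate. A group of g rows is absent from the sample with probability (1 - q)^g at rate q, so a group of 5 rows is missed with probability 0.99^5, roughly 0.951.

```python
import random
from collections import Counter


def miss_probability(group_size: int, rate: float) -> float:
    """Probability that Bernoulli sampling at `rate` keeps no row of a group."""
    return (1.0 - rate) ** group_size


def bernoulli_sample(rows, rate, rng):
    """Keep each row independently with probability `rate`."""
    return [row for row in rows if rng.random() < rate]


# Hypothetical table (an assumption for illustration): each entry is the
# group key of one row. Group "A" is large, groups "B" and "C" are small.
rows = ["A"] * 10_000 + ["B"] * 50 + ["C"] * 5

rate = 0.01                      # assumed sampling rate of 1%
rng = random.Random(0)           # fixed seed so the run is repeatable
sample_counts = Counter(bernoulli_sample(rows, rate, rng))

# Compare the true group sizes with what the sample retained, next to the
# analytic probability that the group disappears from the sample entirely.
for key, size in Counter(rows).items():
    kept = sample_counts.get(key, 0)
    print(key, size, kept, round(miss_probability(size, rate), 3))
```

A group-by query evaluated on such a sample cannot report a group whose rows were all dropped, which is the effect the abstract describes for small groups.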

Keywords

Database systems · Query processing · Approximation · Sampling



Copyright information

© Springer-Verlag 2005

Authors and Affiliations

  • Rainer Gemulla (1)
  • Henrike Berthold (1)
  • Wolfgang Lehner (1)
  1. Database Technology Group, Technische Universität Dresden, Dresden
