SUM-optimal histograms for approximate query processing

Zhang, Meifan; Wang, Hongzhi; Li, Jianzhong; Gao, Hong

doi:10.1007/s10115-020-01450-7

Regular Paper
Published: 06 March 2020

Volume 62, pages 3155–3180, (2020)

Knowledge and Information Systems

Abstract

In this paper, we study the problem of approximating SUM queries with histograms. We define a new kind of histogram, the SUM-optimal histogram, which provides more accurate estimates for SUM queries than the traditional equi-depth and V-optimal histograms. We propose three methods for constructing the histogram: the first is a dynamic programming method, and the other two are approximate methods. We use a greedy strategy to insert separators into a histogram and use stochastic gradient descent to improve the accuracy of the separators. The experimental results indicate that our method provides better estimates for SUM queries than the equi-depth and V-optimal histograms.
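To make the setting concrete, the following is a minimal sketch of how a range SUM query can be estimated from a histogram. It builds a plain equi-depth histogram, one of the baselines named in the abstract, and is not the SUM-optimal construction itself. The function names `build_equi_depth` and `estimate_sum`, the per-bucket layout, and the uniform-spread assumption inside partially covered buckets are illustrative assumptions, not taken from the paper.

```python
def build_equi_depth(values, num_buckets):
    """Split the sorted values into buckets of roughly equal count.

    Each bucket is a tuple (low, high, count, total), where total is the
    exact sum of the values that fall into the bucket.
    """
    data = sorted(values)
    n = len(data)
    buckets = []
    for b in range(num_buckets):
        chunk = data[b * n // num_buckets:(b + 1) * n // num_buckets]
        if chunk:
            buckets.append((chunk[0], chunk[-1], len(chunk), sum(chunk)))
    return buckets


def estimate_sum(buckets, lo, hi):
    """Estimate SUM(x) over lo <= x <= hi from the bucket summaries.

    Assumption (ours): values are spread uniformly inside a bucket, so a
    partially covered bucket contributes count * fraction * midpoint.
    """
    estimate = 0.0
    for low, high, count, total in buckets:
        if high < lo or low > hi:
            continue  # bucket lies entirely outside the query range
        if lo <= low and high <= hi:
            estimate += total  # fully covered: the stored sum is exact
            continue
        # Partially covered bucket; here high > low always holds.
        left, right = max(lo, low), min(hi, high)
        fraction = (right - left) / (high - low)
        estimate += count * fraction * (left + right) / 2
    return estimate


# Example: a histogram with 4 buckets over the integers 1..100.
buckets = build_equi_depth(range(1, 101), 4)
print(estimate_sum(buckets, 1, 13))
```

The estimate is exact whenever the query boundaries coincide with bucket separators, and the error comes only from the partially covered buckets. This is why the placement of separators matters for SUM queries.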

Notes

http://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption.

Acknowledgements

This paper was partially supported by NSFC Grants U1866602, 61602129, and 61772157, the CCF-Huawei Database System Innovation Research Plan DBIR2019005B, and Microsoft Research Asia.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Meifan Zhang, Hongzhi Wang, Jianzhong Li & Hong Gao

Corresponding author

Correspondence to Hongzhi Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, M., Wang, H., Li, J. et al. SUM-optimal histograms for approximate query processing. Knowl Inf Syst 62, 3155–3180 (2020). https://doi.org/10.1007/s10115-020-01450-7

Received: 27 October 2018
Revised: 02 February 2020
Accepted: 22 February 2020
Published: 06 March 2020
Issue Date: August 2020
DOI: https://doi.org/10.1007/s10115-020-01450-7
