Efficiently adapting graphical models for selectivity estimation

Tzoumas, Kostas; Deshpande, Amol; Jensen, Christian S.

doi:10.1007/s00778-012-0293-7

Efficiently adapting graphical models for selectivity estimation

Special Issue Paper
Published: 07 November 2012

Volume 22, pages 3–27, (2013)
Cite this article

The VLDB Journal Aims and scope Submit manuscript

Kostas Tzoumas¹,
Amol Deshpande² &
Christian S. Jensen³

1922 Accesses
25 Citations
3 Altmetric
Explore all metrics

Abstract

Query optimizers rely on statistical models that succinctly describe the underlying data. Models are used to derive cardinality estimates for intermediate relations, which in turn guide the optimizer to choose the best query execution plan. The quality of the resulting plan is highly dependent on the accuracy of the statistical model that represents the data. It is well known that small errors in the model estimates propagate exponentially through joins, and may result in the choice of a highly sub-optimal query execution plan. Most commercial query optimizers make the attribute value independence assumption: all attributes are assumed to be statistically independent. This reduces the statistical model of the data to a collection of one-dimensional synopses (typically in the form of histograms), and it permits the optimizer to estimate the selectivity of a predicate conjunction as the product of the selectivities of the constituent predicates. However, this independence assumption is more often than not wrong, and is considered to be the most common cause of sub-optimal query execution plans chosen by modern query optimizers. We take a step towards a principled and practical approach to performing cardinality estimation without making the independence assumption. By carefully using concepts from the field of graphical models, we are able to factor the joint probability distribution over all the attributes in the database into small, usually two-dimensional distributions, without a significant loss in estimation accuracy. We show how to efficiently construct such a graphical model from the database using only two-way join queries, and we show how to perform selectivity estimation in a highly efficient manner. We integrate our algorithms into the PostgreSQL DBMS. Experimental results indicate that estimation errors can be greatly reduced, leading to orders of magnitude more efficient query execution plans in many cases. Optimization time is kept in the range of tens of milliseconds, making this a practical approach for industrial-strength query optimizers.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

An Approach Based on Bayesian Networks for Query Selectivity Estimation

Dissociation and propagation for approximate lifted inference with standard relational database management systems

Article 16 July 2016

Wolfgang Gatterbauer & Dan Suciu

Join Sizes, Frequency Moments, and Applications

Notes

\(\bowtie _{\textit{HJ}}\) and \(\bowtie _{\textit{NLJ}}\) denote hash and nested loop joins. The right operand of \(\bowtie \) in the plans is the inner relation.
In the rest of this paper, and when the meaning is clear from the context, we will denote a descriptive attribute \(\mathbf X \) and the attribute \(X\) that it represents with the same symbol \(X\). Similarly, we will denote by \(J_{RS}\) both the random variable and the predicate that defines it.
See supplementary material associated with the online version of this article on the journal’s web site.

References

Aboulnaga, A., Chaudhuri, S.: Self-tuning histograms: Building histograms without looking at data. In: SIGMOD, pp. 181–192 (1999)
Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: Join synopses for approximate query answering. In: SIGMOD, pp. 275–286 (1999)
Berry, A., Blair, J.R.S., Heggernes, P., Peyton, B.W.: Maximum cardinality search for computing minimal triangulations of graphs. Algorithmica 39(4), 287–298 (2004)
Article MathSciNet MATH Google Scholar
Bravo, H.C., Ramakrishnan, R.: Optimizing MPF queries: Decision support and probabilistic inference. In: SIGMOD, pp. 701–712 (2007)
Bruno, N., Chaudhuri, S., Gravano, L.: STHoles: A multidimensional workload-aware histogram. In: SIGMOD, pp. 211–222 (2001)
Chaudhuri, S., Motwani, R., Narasayya, V.R.: Random sampling for histogram construction: How much is enough? In: SIGMOD, pp. 436–447 (1998)
Chaudhuri, S., Motwani, R., Narasayya, V.R.: On random sampling over joins. In: SIGMOD, pp. 263–274 (1999)
Chaudhuri, S., Shim, K.: Optimization of queries with user-defined predicates. ACM Trans. Database Syst. 24(2), 177–228 (1999)
Article Google Scholar
Chow, C.K., Liu, C.N.: Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 14(3), 462–V467 (1968)
Article MathSciNet MATH Google Scholar
Cowell, R.G., Dawid, P., Lauritzen, S.L., Spiegelhalter, D.J.: Probabilistic Networks and Expert Systems: Exact Computational Methods for Bayesian Networks. Springer, Berlin (1999)
Google Scholar
Deshpande, A., Garofalakis, M.N., Rastogi, R.: Independence is good: dependency-based histogram synopses for high-dimensional data. In: SIGMOD, pp. 199–210 (2001)
Friedman, N., Goldszmidt, M.: Sequential update of bayesian network structure. In: UAI, pp. 165–174 (1997)
Getoor, L.: Learning statistical models from relational data. Ph.D. thesis, Stanford University (2001)
Getoor, L., Taskar, B., Koller, D.: Selectivity estimation using probabilistic models. In: SIGMOD, pp. 461–472 (2001)
Gunopulos, D., Kollios, G., Tsotras, V.J., Domeniconi, C.: Approximating multi-dimensional aggregate range queries over real attributes. In: SIGMOD, pp. 463–474 (2000)
Haas, P.J., Ilyas, I.F., Lohman, G.M., Markl, V.: Discovering and exploiting statistical properties for query optimization in relational databases: a survey. Stat. Anal. Data Min. 1(4), 223–250 (2009)
Article MathSciNet Google Scholar
Hellerstein, J.M.: Optimization techniques for queries with expensive methods. ACM Trans. Database Syst. 23(2), 113–157 (1998)
Article MathSciNet Google Scholar
Ilyas, I.F., Markl, V., Haas, P.J., Brown, P., Aboulnaga, A.: CORDS: Automatic discovery of correlations and soft functional dependencies. In: SIGMOD, pp. 647–658 (2004)
Ioannidis, Y.E.: Universality of serial histograms. In: VLDB, pp. 256–267 (1993)
Ioannidis, Y.E.: The history of histograms (abridged). In: VLDB, pp. 19–30 (2003)
Ioannidis, Y.E., Christodoulakis, S.: On the propagation of errors in the size of join results. In: SIGMOD, pp. 268–277 (1991)
Ioannidis, Y.E., Christodoulakis, S.: Optimal histograms for limiting worst-case error propagation in the size of join results. ACM Trans. Database Syst. 18(4), 709–748 (1993)
Article Google Scholar
Ioannidis, Y.E., Poosala, V.: Balancing histogram optimality and practicality for query result size estimation. In: SIGMOD, pp. 233–244 (1995)
Jagadish, H.V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K.C., Suel, T.: Optimal histograms with quality guarantees. In: VLDB, pp. 275–286 (1998)
Jensen, F.V., Jensen, F.: Optimal junction trees. In: UAI, pp. 360–366 (1994)
Kanne, C.C., Moerkotte, G.: Histograms reloaded: The merits of bucket diversity. In: SIGMOD, pp. 663–674 (2010)
Markl, V., Haas, P.J., Kutsch, M., Megiddo, N., Srivastava, U., Tran, T.M.: Consistent selectivity estimation via maximum entropy. VLDB J. 16(1), 55–76 (2007)
Article Google Scholar
Markl, V., Raman, V., Simmen, D.E., Lohman, G.M., Pirahesh, H.: Robust query processing through progressive optimization. In: SIGMOD, pp. 659–670 (2004)
Matias, Y., Vitter, J.S., Wang, M.: Wavelet-based histograms for selectivity estimation. In: SIGMOD, pp. 448–459 (1998)
Moerkotte, G., Neumann, T., Steidl, G.: Preventing bad plans by bounding the impact of cardinality estimation errors. PVLDB 2(1), 982–993 (2009)
Google Scholar
Muralikrishna, M., DeWitt, D.J.: Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In: SIGMOD, pp. 28–36 (1988)
Muthukrishnan, S., Poosala, V., Suel, T.: On rectangular partitionings in two dimensions: algorithms, complexity, and applications. In: ICDT, pp. 236–256 (1999)
Poosala, V., Ioannidis, Y.E.: Selectivity estimation without the attribute value independence assumption. In: VLDB, pp. 486–495 (1997)
Ré, C., Suciu, D.: Understanding cardinality estimation using entropy maximization. In: PODS, pp. 53–64 (2010)
Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In: SIGMOD, pp. 23–34 (1979)
Spiegel, J., Polyzotis, N.: Graph-based synopses for relational selectivity estimation. In: SIGMOD, pp. 205–216 (2006)
Stillger, M., Lohman, G.M., Markl, V., Kandil, M.: Leo - db2’s learning optimizer. In: VLDB, pp. 19–28 (2001)
Tzoumas, K., Deshpande, A., Jensen, C.S.: Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB 4(11), 852–863 (2011)
Google Scholar

Download references

Acknowledgments

We thank the anonymous reviewers for comments that greatly improved the content and presentation of this paper.

Author information

Authors and Affiliations

Technische Universität Berlin, Berlin, Germany
Kostas Tzoumas
University of Maryland, College Park, MD, USA
Amol Deshpande
Aarhus University, Aarhus, Denmark
Christian S. Jensen

Authors

Kostas Tzoumas
View author publications
You can also search for this author in PubMed Google Scholar
Amol Deshpande
View author publications
You can also search for this author in PubMed Google Scholar
Christian S. Jensen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Kostas Tzoumas.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 46 KB)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Tzoumas, K., Deshpande, A. & Jensen, C.S. Efficiently adapting graphical models for selectivity estimation. The VLDB Journal 22, 3–27 (2013). https://doi.org/10.1007/s00778-012-0293-7

Download citation

Received: 14 December 2011
Revised: 29 July 2012
Accepted: 03 September 2012
Published: 07 November 2012
Issue Date: February 2013
DOI: https://doi.org/10.1007/s00778-012-0293-7

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Efficiently adapting graphical models for selectivity estimation

Abstract

Access this article

Similar content being viewed by others

An Approach Based on Bayesian Networks for Query Selectivity Estimation

Dissociation and propagation for approximate lifted inference with standard relational database management systems

Join Sizes, Frequency Moments, and Applications

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Electronic supplementary material

Supplementary material 1 (PDF 46 KB)

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Efficiently adapting graphical models for selectivity estimation

Abstract

Access this article

Similar content being viewed by others

An Approach Based on Bayesian Networks for Query Selectivity Estimation

Dissociation and propagation for approximate lifted inference with standard relational database management systems

Join Sizes, Frequency Moments, and Applications

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Electronic supplementary material

Supplementary material 1 (PDF 46 KB)

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation