Efficiently adapting graphical models for selectivity estimation

Abstract

Query optimizers rely on statistical models that succinctly describe the underlying data. Models are used to derive cardinality estimates for intermediate relations, which in turn guide the optimizer to choose the best query execution plan. The quality of the resulting plan is highly dependent on the accuracy of the statistical model that represents the data. It is well known that small errors in the model estimates propagate exponentially through joins, and may result in the choice of a highly sub-optimal query execution plan. Most commercial query optimizers make the attribute value independence assumption: all attributes are assumed to be statistically independent. This reduces the statistical model of the data to a collection of one-dimensional synopses (typically in the form of histograms), and it permits the optimizer to estimate the selectivity of a predicate conjunction as the product of the selectivities of the constituent predicates. However, this independence assumption is more often than not wrong, and is considered to be the most common cause of sub-optimal query execution plans chosen by modern query optimizers. We take a step towards a principled and practical approach to performing cardinality estimation without making the independence assumption. By carefully using concepts from the field of graphical models, we are able to factor the joint probability distribution over all the attributes in the database into small, usually two-dimensional distributions, without a significant loss in estimation accuracy. We show how to efficiently construct such a graphical model from the database using only two-way join queries, and we show how to perform selectivity estimation in a highly efficient manner. We integrate our algorithms into the PostgreSQL DBMS. Experimental results indicate that estimation errors can be greatly reduced, leading to orders of magnitude more efficient query execution plans in many cases. Optimization time is kept in the range of tens of milliseconds, making this a practical approach for industrial-strength query optimizers.

This is a preview of subscription content, access via your institution.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10

Notes

  1. 1.

    \(\bowtie _{\textit{HJ}}\) and \(\bowtie _{\textit{NLJ}}\) denote hash and nested loop joins. The right operand of \(\bowtie \) in the plans is the inner relation.

  2. 2.

    In the rest of this paper, and when the meaning is clear from the context, we will denote a descriptive attribute \(\mathbf X \) and the attribute \(X\) that it represents with the same symbol \(X\). Similarly, we will denote by \(J_{RS}\) both the random variable and the predicate that defines it.

  3. 3.

    See supplementary material associated with the online version of this article on the journal’s web site.

References

  1. 1.

    Aboulnaga, A., Chaudhuri, S.: Self-tuning histograms: Building histograms without looking at data. In: SIGMOD, pp. 181–192 (1999)

  2. 2.

    Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: Join synopses for approximate query answering. In: SIGMOD, pp. 275–286 (1999)

  3. 3.

    Berry, A., Blair, J.R.S., Heggernes, P., Peyton, B.W.: Maximum cardinality search for computing minimal triangulations of graphs. Algorithmica 39(4), 287–298 (2004)

    MathSciNet  MATH  Article  Google Scholar 

  4. 4.

    Bravo, H.C., Ramakrishnan, R.: Optimizing MPF queries: Decision support and probabilistic inference. In: SIGMOD, pp. 701–712 (2007)

  5. 5.

    Bruno, N., Chaudhuri, S., Gravano, L.: STHoles: A multidimensional workload-aware histogram. In: SIGMOD, pp. 211–222 (2001)

  6. 6.

    Chaudhuri, S., Motwani, R., Narasayya, V.R.: Random sampling for histogram construction: How much is enough? In: SIGMOD, pp. 436–447 (1998)

  7. 7.

    Chaudhuri, S., Motwani, R., Narasayya, V.R.: On random sampling over joins. In: SIGMOD, pp. 263–274 (1999)

  8. 8.

    Chaudhuri, S., Shim, K.: Optimization of queries with user-defined predicates. ACM Trans. Database Syst. 24(2), 177–228 (1999)

    Article  Google Scholar 

  9. 9.

    Chow, C.K., Liu, C.N.: Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 14(3), 462–V467 (1968)

    MathSciNet  MATH  Article  Google Scholar 

  10. 10.

    Cowell, R.G., Dawid, P., Lauritzen, S.L., Spiegelhalter, D.J.: Probabilistic Networks and Expert Systems: Exact Computational Methods for Bayesian Networks. Springer, Berlin (1999)

    Google Scholar 

  11. 11.

    Deshpande, A., Garofalakis, M.N., Rastogi, R.: Independence is good: dependency-based histogram synopses for high-dimensional data. In: SIGMOD, pp. 199–210 (2001)

  12. 12.

    Friedman, N., Goldszmidt, M.: Sequential update of bayesian network structure. In: UAI, pp. 165–174 (1997)

  13. 13.

    Getoor, L.: Learning statistical models from relational data. Ph.D. thesis, Stanford University (2001)

  14. 14.

    Getoor, L., Taskar, B., Koller, D.: Selectivity estimation using probabilistic models. In: SIGMOD, pp. 461–472 (2001)

  15. 15.

    Gunopulos, D., Kollios, G., Tsotras, V.J., Domeniconi, C.: Approximating multi-dimensional aggregate range queries over real attributes. In: SIGMOD, pp. 463–474 (2000)

  16. 16.

    Haas, P.J., Ilyas, I.F., Lohman, G.M., Markl, V.: Discovering and exploiting statistical properties for query optimization in relational databases: a survey. Stat. Anal. Data Min. 1(4), 223–250 (2009)

    MathSciNet  Article  Google Scholar 

  17. 17.

    Hellerstein, J.M.: Optimization techniques for queries with expensive methods. ACM Trans. Database Syst. 23(2), 113–157 (1998)

    MathSciNet  Article  Google Scholar 

  18. 18.

    Ilyas, I.F., Markl, V., Haas, P.J., Brown, P., Aboulnaga, A.: CORDS: Automatic discovery of correlations and soft functional dependencies. In: SIGMOD, pp. 647–658 (2004)

  19. 19.

    Ioannidis, Y.E.: Universality of serial histograms. In: VLDB, pp. 256–267 (1993)

  20. 20.

    Ioannidis, Y.E.: The history of histograms (abridged). In: VLDB, pp. 19–30 (2003)

  21. 21.

    Ioannidis, Y.E., Christodoulakis, S.: On the propagation of errors in the size of join results. In: SIGMOD, pp. 268–277 (1991)

  22. 22.

    Ioannidis, Y.E., Christodoulakis, S.: Optimal histograms for limiting worst-case error propagation in the size of join results. ACM Trans. Database Syst. 18(4), 709–748 (1993)

    Article  Google Scholar 

  23. 23.

    Ioannidis, Y.E., Poosala, V.: Balancing histogram optimality and practicality for query result size estimation. In: SIGMOD, pp. 233–244 (1995)

  24. 24.

    Jagadish, H.V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K.C., Suel, T.: Optimal histograms with quality guarantees. In: VLDB, pp. 275–286 (1998)

  25. 25.

    Jensen, F.V., Jensen, F.: Optimal junction trees. In: UAI, pp. 360–366 (1994)

  26. 26.

    Kanne, C.C., Moerkotte, G.: Histograms reloaded: The merits of bucket diversity. In: SIGMOD, pp. 663–674 (2010)

  27. 27.

    Markl, V., Haas, P.J., Kutsch, M., Megiddo, N., Srivastava, U., Tran, T.M.: Consistent selectivity estimation via maximum entropy. VLDB J. 16(1), 55–76 (2007)

    Article  Google Scholar 

  28. 28.

    Markl, V., Raman, V., Simmen, D.E., Lohman, G.M., Pirahesh, H.: Robust query processing through progressive optimization. In: SIGMOD, pp. 659–670 (2004)

  29. 29.

    Matias, Y., Vitter, J.S., Wang, M.: Wavelet-based histograms for selectivity estimation. In: SIGMOD, pp. 448–459 (1998)

  30. 30.

    Moerkotte, G., Neumann, T., Steidl, G.: Preventing bad plans by bounding the impact of cardinality estimation errors. PVLDB 2(1), 982–993 (2009)

    Google Scholar 

  31. 31.

    Muralikrishna, M., DeWitt, D.J.: Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In: SIGMOD, pp. 28–36 (1988)

  32. 32.

    Muthukrishnan, S., Poosala, V., Suel, T.: On rectangular partitionings in two dimensions: algorithms, complexity, and applications. In: ICDT, pp. 236–256 (1999)

  33. 33.

    Poosala, V., Ioannidis, Y.E.: Selectivity estimation without the attribute value independence assumption. In: VLDB, pp. 486–495 (1997)

  34. 34.

    Ré, C., Suciu, D.: Understanding cardinality estimation using entropy maximization. In: PODS, pp. 53–64 (2010)

  35. 35.

    Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In: SIGMOD, pp. 23–34 (1979)

  36. 36.

    Spiegel, J., Polyzotis, N.: Graph-based synopses for relational selectivity estimation. In: SIGMOD, pp. 205–216 (2006)

  37. 37.

    Stillger, M., Lohman, G.M., Markl, V., Kandil, M.: Leo - db2’s learning optimizer. In: VLDB, pp. 19–28 (2001)

  38. 38.

    Tzoumas, K., Deshpande, A., Jensen, C.S.: Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB 4(11), 852–863 (2011)

    Google Scholar 

Download references

Acknowledgments

We thank the anonymous reviewers for comments that greatly improved the content and presentation of this paper.

Author information

Affiliations

Authors

Corresponding author

Correspondence to Kostas Tzoumas.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 46 KB)

Rights and permissions

Reprints and Permissions

About this article

Cite this article

Tzoumas, K., Deshpande, A. & Jensen, C.S. Efficiently adapting graphical models for selectivity estimation. The VLDB Journal 22, 3–27 (2013). https://doi.org/10.1007/s00778-012-0293-7

Download citation

Keywords

  • Graphical models
  • Selectivity estimation
  • Query optimization