Cardinality estimation using normalizing flow

  • Regular Paper
The VLDB Journal

Abstract

Cardinality estimation is one of the most important problems in query optimization. Recently, machine learning-based techniques have been proposed to estimate cardinality effectively; they can be broadly classified into query-driven and data-driven approaches. Query-driven approaches learn a regression model that maps a query to its cardinality, while data-driven approaches learn a distribution of tuples, select some samples that satisfy a SQL query, and use the data distributions of these selected tuples to estimate the cardinality of the query. Because query-driven methods rely on training queries, their estimation quality is unreliable when no high-quality training queries are available, whereas data-driven methods have no such limitation and are highly adaptive. In this work, we focus on data-driven methods. A good data-driven model should achieve three optimization goals. First, the model needs to capture data dependencies between columns and support large domain sizes (achieving high accuracy). Second, the model should achieve high inference efficiency, because many data samples are needed to estimate the cardinality (achieving low inference latency). Third, the model should not be too large (achieving a small model size). However, existing data-driven methods cannot optimize all three goals simultaneously. To address these limitations, we propose a novel cardinality estimator \(\texttt{FACE}\), which leverages a normalizing flow-based model to learn a continuous joint distribution for relational data. \(\texttt{FACE}\) can transform a complex distribution over continuous random variables into a simple distribution (e.g., multivariate normal distribution) and use the probability density to estimate the cardinality for both sequential queries and parallel queries. First, we design a dequantization method to make data more “continuous.” Second, we propose encoding and indexing techniques to handle LIKE predicates for string data. Third, we propose a Monte Carlo method to estimate the cardinality based on the \(\texttt{FACE}\) model. Fourth, we propose a grouping technique to process parallel queries. Fifth, we discuss how to support join queries. Experimental results show that our method significantly outperforms existing approaches in terms of estimation accuracy while keeping similar latency and model size.
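
To make the density-based estimation idea concrete, the following is a minimal sketch under stated assumptions, not the \(\texttt{FACE}\) implementation. It assumes that a range query corresponds to an axis-aligned box in the continuous space, that the learned model exposes a density function, and that the cardinality is approximated as the table size times the integral of the density over the box, evaluated with plain uniform Monte Carlo sampling. A standard normal density stands in for a trained flow so that the result can be checked against a closed form. The uniform dequantization shown is a common baseline and may differ from the method designed in the paper. All function names are hypothetical.

```python
import math
import random


def dequantize(value, rng):
    """Uniform dequantization (a common baseline): spread integer code v over [v, v + 1)."""
    return value + rng.random()


def standard_normal_pdf(point):
    """Stand-in for a learned density: independent standard normal in each dimension."""
    d = len(point)
    squared_norm = sum(x * x for x in point)
    return math.exp(-0.5 * squared_norm) / (2.0 * math.pi) ** (d / 2.0)


def estimate_cardinality(density, box, n_rows, n_samples, rng):
    """Plain Monte Carlo: n_rows * volume(box) * mean density at uniform samples in the box."""
    volume = 1.0
    for lo, hi in box:
        volume *= hi - lo
    total = 0.0
    for _ in range(n_samples):
        point = [rng.uniform(lo, hi) for lo, hi in box]
        total += density(point)
    return n_rows * volume * total / n_samples


def normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def exact_normal_box_mass(box):
    """Closed-form probability mass of the stand-in density inside the box."""
    mass = 1.0
    for lo, hi in box:
        mass *= normal_cdf(hi) - normal_cdf(lo)
    return mass


rng = random.Random(0)
box = [(-1.0, 1.0), (0.0, 2.0)]   # hypothetical two-column range query
n_rows = 1_000_000                # hypothetical table size
estimate = estimate_cardinality(standard_normal_pdf, box, n_rows, 20_000, rng)
exact = n_rows * exact_normal_box_mass(box)
print(f"estimate={estimate:.0f} exact={exact:.0f}")
```

Under uniform dequantization, an equality predicate on integer code v becomes the range [v, v + 1) in the continuous space, which is why point and range predicates can share the same integration routine in this sketch.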

Notes

  1. If there are some strings in \(A_i\) that correspond to non-leaf nodes in the trie, we can easily add dummy leaf nodes to represent them.

  2. Substring-based predicates are likely to generate multiple ranges. Our solution can sample them simultaneously.

  3. We approximately assume that queries appearing within a small time window (e.g., 1 ms) arrive simultaneously.

  4. Our method can support multiple relations with joins. We use one relation here for ease of representation.

  5. When the domain size was large, we applied NeuroCard by factorizing the column.
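
Note 1 can be illustrated with a short sketch. It is an illustration under assumptions rather than the paper's encoding: the class `Node`, the key `DUMMY`, and the functions `add_dummy_leaves`, `leaf_codes`, and `prefix_range` are hypothetical names, and the depth-first leaf numbering is assumed here only to show why it helps for every string to end at a leaf. With that numbering, all strings sharing a prefix occupy one contiguous range of leaf codes.

```python
DUMMY = ""  # hypothetical key for a dummy leaf; sorts before every real character


class Node:
    def __init__(self):
        self.children = {}
        self.is_string = False


def build_trie(strings):
    root = Node()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, Node())
        node.is_string = True
    return root


def add_dummy_leaves(node):
    """Give every string that ends at a non-leaf node its own dummy leaf child."""
    for child in list(node.children.values()):
        add_dummy_leaves(child)
    if node.is_string and node.children:
        dummy = Node()
        dummy.is_string = True
        node.children[DUMMY] = dummy
        node.is_string = False


def leaf_codes(node, prefix="", codes=None):
    """Number leaves in depth-first, key-sorted order; returns {string: code}."""
    if codes is None:
        codes = {}
    if not node.children:
        codes[prefix] = len(codes)
        return codes
    for key in sorted(node.children):
        next_prefix = prefix if key == DUMMY else prefix + key
        leaf_codes(node.children[key], next_prefix, codes)
    return codes


def prefix_range(codes, prefix):
    """Smallest and largest leaf code among strings that start with the prefix."""
    matched = [code for s, code in codes.items() if s.startswith(prefix)]
    return (min(matched), max(matched)) if matched else None


trie = build_trie(["ab", "abc", "abd", "b"])  # "ab" ends at a non-leaf node
add_dummy_leaves(trie)
codes = leaf_codes(trie)
print(codes, prefix_range(codes, "ab"))
```

In this toy domain the string "ab" is a prefix of "abc" and "abd", so it ends at a non-leaf node and receives a dummy leaf. A substring predicate would in general match several disjoint code ranges, which is consistent with Note 2.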

References

  1. Beliakov, G.: Monotonicity preserving approximation of multivariate scattered data. BIT Numer. Math. 45, 653–677 (2005)

  2. Blanchette, M., Kim, E., Vetta, A.: Clique cover on sparse networks. In: Bader, D.A., Mutzel, P. (eds.) ALENEX, pp. 93–102. SIAM (2012)

  3. Cerioli, M.R., Faria, L., Ferreira, T.O., Martinhon, C.A.J., Protti, F., Reed, B.A.: Partition into cliques for cubic graphs: planar case, complexity and approximation. Discret. Appl. Math. 156(12), 2270–2278 (2008)

  4. Chalupa, D.: On the efficiency of an order-based representation in the clique covering problem. In: Soule, T., Moore, J.H. (eds.) GECCO, pp. 353–360. ACM (2012)

  5. Chalupa, D.: Construction of near-optimal vertex clique covering for real-world networks. Comput. Inf. 34(6), 1397–1417 (2015)

  6. Dinh, L., Krueger, D., Bengio, Y.: NICE: non-linear independent components estimation. In: Y. Bengio and Y. LeCun, (eds.) ICLR (2015)

  7. Dutt, A., Wang, C., Nazi, A., Kandula, S., Narasayya, V., Chaudhuri, S.: Selectivity estimation for range predicates using lightweight models. Proc. VLDB Endow. 12(9), 1044–1057 (2019)

  8. Dutt, A., Wang, C., Nazi, A., Kandula, S., Narasayya, V.R., Chaudhuri, S.: Selectivity estimation for range predicates using lightweight models. Proc. VLDB Endow. 12(9), 1044–1057 (2019)

  9. Fritsch, F.N., Carlson, R.E.: Monotone piecewise cubic interpolation. SIAM J. Numer. Anal. 17(2), 238–246 (1980)

  10. Garey, M.R., Johnson, D.S.: Computers and intractability: a guide to the theory of NP-completeness. W. H. Freeman (1979)

  11. Germain, M., Gregor, K., Murray, I., Larochelle, H.: MADE: masked autoencoder for distribution estimation. In: ICML, vol. 37, pp. 881–889 (2015)

  12. Gharibshah, Z., Zhu, X., Hainline, A., Conway, M.: Deep learning for user interest and response prediction in online display advertising. Data Sci. Eng. 5(1), 12–26 (2020)

  13. Goodfellow, I., Pouget-Abadie, J., et al.: Generative adversarial nets. NIPS 27, 2672–2680 (2014)

  14. Han, L., Schumaker, L.L.: Fitting monotone surfaces to scattered data using c1 piecewise cubics. SIAM J. Numer. Anal. 34(2), 569–585 (1997)

  15. Hasan, S., Thirumuruganathan, S., Augustine, J., Koudas, N., Das, G.: Deep learning models for selectivity estimation of multi-attribute queries. In: SIGMOD, pp. 1035–1050. ACM (2020)

  16. Heimel, M., Kiefer, M., Markl, V.: Self-tuning, GPU-accelerated kernel density models for multidimensional selectivity estimation. In: SIGMOD, pp. 1477–1492. ACM (2015)

  17. Hilprecht, B., Schmidt, A., Kulessa, M., Molina, A., Kersting, K., Binnig, C.: DeepDB: learn from data, not from queries! VLDB 13(7), 992–1005 (2020)

  18. Ho, J., Chen, X., Srinivas, A., Duan, Y., Abbeel, P.: Flow++: Improving flow-based generative models with variational dequantization and architecture design. In: ICML, vol. 97, pp. 2722–2730. PMLR (2019)

  19. Hoogeboom, E., Cohen, T.S., Tomczak, J.M.: Learning discrete distributions by dequantization. arXiv preprint arXiv:2001.11235 (2020)

  20. Individual household electric power consumption data set. https://github.com/gpapamak/maf (2021)

  21. Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W. (eds.) The IBM research symposia series, pp. 85–103. Plenum Press, New York (1972)

  22. Kiefer, M., Heimel, M., Breß, S., Markl, V.: Estimating join selectivities using bandwidth-optimized kernel density models. VLDB 10(13), 2085–2096 (2017)

  23. Kim, K., Jung, J., Seo, I., Han, W., Choi, K., Chong, J.: Learned cardinality estimation: an in-depth study. In: SIGMOD, pp. 1214–1227. ACM (2022)

  24. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: ICLR (2014)

  25. Kipf, A., Kipf, T., Radke, B., Leis, V., Boncz, P.A., Kemper, A.: Learned cardinalities: estimating correlated joins with deep learning. In: CIDR. www.cidrdb.org (2019)

  26. Kipf, A., Kipf, T., Radke, B., Leis, V., Boncz, P.A., Kemper, A.: Learned cardinalities: estimating correlated joins with deep learning. In: CIDR (2019)

  27. Kobyzev, I., Prince, S., Brubaker, S.: Normalizing flows: an introduction and review of current methods. IEEE Trans. Pattern Anal. Mach. Intell. 43, 3964–3979 (2020)

  28. Leis, V., Gubichev, A., Mirchev, A., Boncz, P.A., Kemper, A., Neumann, T.: How good are query optimizers, really? VLDB 9(3), 204–215 (2015)

  29. Leis, V., Radke, B., Gubichev, A., Kemper, A., Neumann, T.: Cardinality estimation done right: Index-based join sampling. In: CIDR. www.cidrdb.org (2017)

  30. Lepage, G.P.: Adaptive multidimensional integration: Vegas enhanced. J. Comput. Phys. 439, 110386 (2021)

  31. Li, G., Zhou, X., Cao, L.: AI meets database: AI4DB and DB4AI. In: SIGMOD, pp. 2859–2866 (2021)

  32. Li, G., Zhou, X., Cao, L.: Machine learning for databases. Proc. VLDB Endow. 14(12), 3190–3193 (2021)

  33. Li, G., Zhou, X., Chai, C.: AI meets database: a survey. In: TKDE (2021)

  34. Li, G., Zhou, X., Sun, J., Yu, X., Han, Y., Jin, L., Li, W., Wang, T., Li, S.: openGauss: an autonomous database system. Proc. VLDB Endow. 14(12), 3028–3041 (2021)

  35. Li, M., Wang, H., Li, J.: Mining conditional functional dependency rules on big data. Big Data Min. Anal. 03(01), 68 (2020)

  36. Müller, T., McWilliams, B., Rousselle, F., Gross, M., Novák, J.: Neural importance sampling. ACM Trans. Graph. 38(5), 1–19 (2019)

  37. Ortiz, J., Balazinska, M., Gehrke, J., Keerthi, S.S.: An empirical analysis of deep learning for cardinality estimation. arXiv preprint arXiv:1905.06425 (2019)

  38. Papamakarios, G., Nalisnick, E., Rezende, D.J., Mohamed, S., Lakshminarayanan, B.: Normalizing flows for probabilistic modeling and inference. J. Mach. Learn. Res. 22(57), 1–64 (2021)

  39. Lepage, G.P.: A new algorithm for adaptive multidimensional integration. J. Comput. Phys. 27(2), 192–203 (1978)

  40. Poosala, V., Ioannidis, Y.E., Haas, P.J., Shekita, E.J.: Improved histograms for selectivity estimation of range predicates. In: SIGMOD, pp. 294–305. ACM Press (1996)

  41. PostgreSQL. https://www.postgresql.org/ (2021). Accessed 14 Sep 2021

  42. Rezende, D.J., Mohamed, S.: Variational inference with normalizing flows. In: ICML, vol. 37, pp. 1530–1538 (2015)

  43. Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In: Bernstein, P.A. (ed), SIGMOD, pp. 23–34. ACM (1979)

  44. Beijing Multi-Site Air-Quality Data Data Set. https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data (2021). Accessed 14 Sep 2021

  45. Strash, D., Thompson, L.: Effective data reduction for the vertex clique cover problem. In: Phillips, C.A. and Speckmann, B. (eds), ALENEX, pp. 41–53. SIAM (2022)

  46. Sun, J., Li, G.: An end-to-end learning-based cost estimator. VLDB 13(3), 307–319 (2019)

  47. Sun, J., Li, G., Tang, N.: Learned cardinality estimation for similarity queries. In: SIGMOD, pp. 1745–1757 (2021)

  48. Sun, J., Zhang, J., Sun, Z., Li, G., Tang, N.: Learned cardinality estimation: a design space exploration and a comparative evaluation. VLDB (2021)

  49. Theis, L., van den Oord, A., Bethge, M.: A note on the evaluation of generative models. In: Bengio, Y., LeCun, Y. (eds.) ICLR (2016)

  50. Tian, S., Mo, S., Wang, L., Peng, Z.: Deep reinforcement learning-based approach to tackle topic-aware influence maximization. Data Sci. Eng. 5(1), 1–11 (2020)

  51. Uria, B., Murray, I., Larochelle, H.: RNADE: the real-valued neural autoregressive density-estimator. In: Burges, C.J.C., Bottou, L., Ghahramani, Z., and Weinberger, K.Q. (eds.) NIPS, pp. 2175–2183 (2013)

  52. Wang, J., Chai, C., Liu, J., Li, G.: FACE: a normalizing flow based cardinality estimator. Proc. VLDB Endow. 15(1), 72–84 (2021). https://doi.org/10.14778/3485450.3485458

  53. Wang, X., Qu, C., Wu, W., Wang, J., Zhou, Q.: Are we ready for learned cardinality estimation? Proc. VLDB Endow. 14(9), 1640–1654 (2021)

  54. Wu, P., Cong, G.: A unified deep model of learning from both data and queries for cardinality estimation. In: SIGMOD, pp. 2009–2022. ACM (2021)

  55. Yang, Z., Kamsetty, A., Luan, S., Liang, E., Duan, Y., Chen, X., Stoica, I.: NeuroCard: one cardinality estimator for all tables. Proc. VLDB Endow. 14(1), 61–73 (2020)

  56. Yang, Z., Liang, E., Kamsetty, A., Wu, C., Duan, Y., Chen, P., Abbeel, P., Hellerstein, J.M., Krishnan, S., Stoica, I.: Deep unsupervised cardinality estimation. VLDB 13(3), 279–292 (2019)

  57. Yu, X., Li, G., Chai, C., Tang, N.: Reinforcement learning with Tree-LSTM for join order selection. In: ICDE, pp. 1297–1308. IEEE (2020)

  58. Zhao, Z., Christensen, R., Li, F., Hu, X., Yi, K.: Random sampling over joins revisited. In: SIGMOD 2018, pp. 1525–1539. ACM (2018)

  59. Zhou, X., Sun, J., Li, G., Feng, J.: Query performance prediction for concurrent queries using graph embedding. Proc. VLDB Endow. 13(9), 1416–1428 (2020)

  60. Zhu, R., Wu, Z., Han, Y., Zeng, K., Pfadler, A., Qian, Z., Zhou, J., Cui, B.: FLAT: fast, lightweight and accurate method for cardinality estimation. VLDB 14(9), 1489–1502 (2021)

  61. Ziegler, Z.M., Rush, A.M.: Latent normalizing flows for discrete sequences. In: Chaudhuri, K. and Salakhutdinov, R. (eds.) ICML, vol. 97, pp. 7673–7682. PMLR (2019)

  62. Zuckerman, D.: Linear degree extractors and the inapproximability of max clique and chromatic number. In: Kleinberg, J.M. (ed.) STOC, pp. 681–690. ACM (2006)

Acknowledgements

We would like to thank the anonymous reviewers for their valuable comments. This work is supported by NSF of China (61925205, 62232009, 62102215), Huawei, TAL education, and Zhongguancun Lab.

Author information

Corresponding authors

Correspondence to Chengliang Chai or Guoliang Li.

About this article

Cite this article

Wang, J., Chai, C., Liu, J. et al. Cardinality estimation using normalizing flow. The VLDB Journal 33, 323–348 (2024). https://doi.org/10.1007/s00778-023-00808-x

  • DOI: https://doi.org/10.1007/s00778-023-00808-x
