Cardinality estimation using normalizing flow

Wang, Jiayi; Chai, Chengliang; Liu, Jiabin; Li, Guoliang

doi:10.1007/s00778-023-00808-x

Regular Paper
Published: 29 August 2023

The VLDB Journal, Volume 33, pages 323–348 (2024)

Abstract

Cardinality estimation is one of the most important problems in query optimization. Recently, machine learning-based techniques have been proposed to effectively estimate cardinality, which can be broadly classified into query-driven and data-driven approaches. Query-driven approaches learn a regression model from a query to its cardinality, while data-driven approaches learn a distribution of tuples, select some samples that satisfy a SQL query, and use the data distributions of these selected tuples to estimate the cardinality of the SQL query. As query-driven methods rely on training queries, the estimation quality is not reliable when there are no high-quality training queries, while data-driven methods have no such limitation and have high adaptivity. In this work, we focus on data-driven methods. A good data-driven model should achieve three optimization goals. First, the model needs to capture data dependencies between columns and support large domain sizes (achieving high accuracy). Second, the model should achieve high inference efficiency, because many data samples are needed to estimate the cardinality (achieving low inference latency). Third, the model should not be too large (achieving a small model size). However, existing data-driven methods cannot simultaneously optimize the three goals. To address the limitations, we propose a novel cardinality estimator \(\texttt{FACE}\), which leverages the normalizing flow-based model to learn a continuous joint distribution for relational data. \(\texttt{FACE}\) can transform a complex distribution over continuous random variables into a simple distribution (e.g., multivariate normal distribution) and use the probability density to estimate the cardinality for both sequential queries and parallel queries. First, we design a dequantization method to make data more “continuous.” Second, we propose encoding and indexing techniques to handle Like predicates for string data. Third, we propose a Monte Carlo method to estimate the cardinality based on the \(\texttt{FACE}\) model. Fourth, we propose a grouping technique to process parallel queries. Fifth, we discuss how to support join queries. Experimental results show that our method significantly outperforms existing approaches in terms of estimation accuracy while keeping similar latency and model size.
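To make the density-based estimation step concrete, the following is a minimal Python sketch, not the FACE implementation. It assumes plain uniform dequantization (adding noise in [0, 1) to integer codes), which is a common choice for flow models and may differ from the dequantization method designed in the paper. It also replaces the trained flow with a hypothetical stand-in density, `uniform_density`. The function names `dequantize`, `range_to_box`, and `mc_cardinality` are illustrative only. The estimate is the table size multiplied by a Monte Carlo integral of the density over the query box.

```python
import numpy as np


def dequantize(codes, rng):
    """Uniform dequantization (assumed here): map integer code k to a real in [k, k + 1)."""
    codes = np.asarray(codes, dtype=float)
    return codes + rng.uniform(0.0, 1.0, size=codes.shape)


def range_to_box(lo_codes, hi_codes):
    """Turn inclusive integer ranges lo <= A <= hi into continuous boxes [lo, hi + 1)."""
    lows = np.asarray(lo_codes, dtype=float)
    highs = np.asarray(hi_codes, dtype=float) + 1.0
    return lows, highs


def mc_cardinality(density, lows, highs, n_rows, n_samples, rng):
    """Estimate n_rows * (integral of density over the box) by uniform sampling."""
    lows = np.asarray(lows, dtype=float)
    highs = np.asarray(highs, dtype=float)
    volume = float(np.prod(highs - lows))
    # Draw points uniformly inside the query box, one row per sample.
    points = rng.uniform(lows, highs, size=(n_samples, lows.size))
    # volume * mean(density) approximates the probability mass of the box.
    return n_rows * volume * float(np.mean(density(points)))


def uniform_density(points, side=10.0):
    """Hypothetical stand-in for a learned flow: uniform density on [0, side)^d."""
    inside = np.all((points >= 0.0) & (points < side), axis=1)
    return inside / side ** points.shape[1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Query: 2 <= A1 <= 4 AND 0 <= A2 <= 3 on a table of 1000 rows.
    lows, highs = range_to_box([2, 0], [4, 3])
    print(mc_cardinality(uniform_density, lows, highs, 1000, 10_000, rng))
```

Because the stand-in density is constant inside the box, the estimate here is 120 up to floating-point rounding. With a learned model, `density` would instead evaluate the model's probability density at each sampled point, and the accuracy of the estimate would depend on `n_samples`.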

Notes

If there are some strings in \(A_i\) that correspond to non-leaf nodes in the trie, we can easily add dummy leaf nodes to represent them.
Substring-based predicates are likely to generate multiple ranges. Our solution can sample them simultaneously.
We approximately assume that the queries appearing within a small time window (e.g., 1ms) are coming simultaneously.
Our method can support multiple relations with joins. We use one relation here for ease of representation.
When the domain size was large, we applied NeuroCard by factorizing the column.
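The first note, on dummy leaf nodes, can be illustrated with a small Python sketch. It is an assumption-laden illustration rather than the paper's encoding: the `END` marker, the helper names, and the choice to number leaves in lexicographic order are all hypothetical. Under these assumptions every stored string, including one that is a prefix of another, gets its own leaf code, and a prefix predicate maps to one contiguous range of codes. Substring predicates, which the second note says may produce several ranges, are not covered here.

```python
END = ""  # hypothetical key for the dummy leaf; it sorts before any real character


def build_trie(strings):
    """Build a dict-based trie; each stored string ends in a dummy leaf child."""
    root = {}
    for s in set(strings):
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = None  # dummy leaf, needed when s is also a prefix of another string
    return root


def number_leaves(node, prefix="", codes=None):
    """Assign consecutive integer codes to leaves in lexicographic (DFS) order."""
    if codes is None:
        codes = {}
    for key in sorted(node):
        if key == END:
            codes[prefix] = len(codes)
        else:
            number_leaves(node[key], prefix + key, codes)
    return codes


def count_leaves(node):
    """Count the dummy leaves in the subtree rooted at node."""
    return sum(1 if key == END else count_leaves(child) for key, child in node.items())


def prefix_range(root, codes, prefix):
    """Return the inclusive code range matching LIKE 'prefix%', or None if empty."""
    node = root
    for ch in prefix:
        if ch not in node:
            return None
        node = node[ch]
    if not node:
        return None
    # Walk to the leftmost leaf below the prefix node to find the first code.
    first, cursor = prefix, node
    while True:
        key = min(cursor)
        if key == END:
            break
        first += key
        cursor = cursor[key]
    lo = codes[first]
    return lo, lo + count_leaves(node) - 1


if __name__ == "__main__":
    root = build_trie(["ab", "abc", "abd", "b"])
    codes = number_leaves(root)
    print(codes, prefix_range(root, codes, "ab"))
```

In this example "ab" corresponds to a non-leaf node because "abc" and "abd" extend it, so the dummy leaf is what gives it a code of its own.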

Acknowledgements

We would like to thank the anonymous reviewers for their valuable comments. This work is supported by NSF of China (61925205, 62232009, 62102215), Huawei, TAL education, and Zhongguancun Lab.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing, China
Jiayi Wang, Jiabin Liu & Guoliang Li
Department of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
Chengliang Chai

Corresponding authors

Correspondence to Chengliang Chai or Guoliang Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, J., Chai, C., Liu, J. et al. Cardinality estimation using normalizing flow. The VLDB Journal 33, 323–348 (2024). https://doi.org/10.1007/s00778-023-00808-x

Received: 20 December 2022
Revised: 04 June 2023
Accepted: 28 July 2023
Published: 29 August 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s00778-023-00808-x
