Abstract
Cardinality estimation is one of the most important problems in query optimization. Recently, machine learning-based techniques have been proposed to estimate cardinality; they can be broadly classified into query-driven and data-driven approaches. Query-driven approaches learn a regression model that maps a query to its cardinality, while data-driven approaches learn a distribution of tuples, select samples that satisfy a SQL query, and use the data distribution of the selected tuples to estimate the cardinality of the query. Because query-driven methods rely on training queries, their estimation quality is unreliable when high-quality training queries are unavailable, while data-driven methods have no such limitation and are highly adaptive. In this work, we focus on data-driven methods. A good data-driven model should achieve three optimization goals. First, the model needs to capture data dependencies between columns and support large domain sizes (achieving high accuracy). Second, the model should achieve high inference efficiency, because many data samples are needed to estimate the cardinality (achieving low inference latency). Third, the model should not be too large (achieving a small model size). However, existing data-driven methods cannot optimize the three goals simultaneously. To address these limitations, we propose a novel cardinality estimator \(\texttt{FACE}\), which leverages a normalizing flow-based model to learn a continuous joint distribution for relational data. \(\texttt{FACE}\) can transform a complex distribution over continuous random variables into a simple distribution (e.g., a multivariate normal distribution) and use the probability density to estimate the cardinality for both sequential queries and parallel queries. First, we design a dequantization method to make data more “continuous.” Second, we propose encoding and indexing techniques to handle Like predicates for string data. Third, we propose a Monte Carlo method to estimate the cardinality based on the \(\texttt{FACE}\) model. Fourth, we propose a grouping technique to process parallel queries. Fifth, we discuss how to support join queries. Experimental results show that our method significantly outperforms existing approaches in terms of estimation accuracy while keeping similar latency and model size.
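The dequantization and Monte Carlo steps named in the abstract can be illustrated with a small sketch. This is not the \(\texttt{FACE}\) implementation: the trained flow is replaced by a hypothetical stand-in density computed directly from the rows, the sampler is plain uniform Monte Carlo, and all function names (`make_density`, `estimate_cardinality`) are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def make_density(rows):
    """Stand-in for a trained flow (assumption, not the paper's model).

    Uniform dequantization spreads each integer tuple (v1, ..., vd) over the
    unit cube [v1, v1+1) x ... x [vd, vd+1), so the density inside that cube
    is count / N.
    """
    counts = Counter(rows)
    n = len(rows)

    def density(point):
        cell = tuple(math.floor(x) for x in point)
        return counts.get(cell, 0) / n

    return density


def estimate_cardinality(density, n_rows, ranges, n_samples=20000, seed=0):
    """Monte Carlo estimate of the number of rows with lo_i <= t_i <= hi_i.

    An inclusive integer range [lo, hi] corresponds to the continuous
    interval [lo, hi + 1) after dequantization.
    """
    rng = random.Random(seed)
    volume = 1.0
    for lo, hi in ranges:
        volume *= hi + 1 - lo
    total = 0.0
    for _ in range(n_samples):
        # Draw one point uniformly from the query region.
        point = [lo + rng.random() * (hi + 1 - lo) for lo, hi in ranges]
        total += density(point)
    # Integral of the density over the region, i.e. the selectivity.
    selectivity = volume * total / n_samples
    return n_rows * selectivity


rows = [(1, 10), (1, 11), (2, 10), (3, 12), (3, 12), (5, 15)]
density = make_density(rows)
# Query: 1 <= A <= 3 AND 10 <= B <= 12 (five rows match).
estimate = estimate_cardinality(density, len(rows), [(1, 3), (10, 12)])
print(round(estimate, 2))
```

With a real flow model, `density` would be the learned probability density, and the uniform sampler could be swapped for a more sample-efficient scheme; the estimate is still the row count times the integral of the density over the query region.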
Notes
If there are some strings in \(A_i\) that correspond to non-leaf nodes in the trie, we can easily add dummy leaf nodes to represent them.
Substring-based predicates are likely to generate multiple ranges. Our solution can sample them simultaneously.
We approximately assume that the queries appearing within a small time window (e.g., 1ms) are coming simultaneously.
Our method can support multiple relations with joins. We use one relation here for ease of representation.
When the domain size was large, we applied NeuroCard by factorizing the column.
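The first note, on dummy leaf nodes, can be sketched as follows. This is a minimal illustration under stated assumptions: the terminator symbol `END`, the class `TrieNode`, and the left-to-right leaf numbering in `leaf_codes` are hypothetical choices, not taken from the paper.

```python
END = "$"  # assumed terminator symbol; must not occur in the data


class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_string_end = False


def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.is_string_end = True
    return root


def add_dummy_leaves(node):
    """Give every string that ends at a non-leaf node its own dummy leaf."""
    for child in list(node.children.values()):
        add_dummy_leaves(child)
    if node.is_string_end and node.children:
        leaf = TrieNode()
        leaf.is_string_end = True
        node.children[END] = leaf
        node.is_string_end = False


def leaf_codes(root):
    """Number the leaves from left to right (illustrative encoding)."""
    codes = {}

    def visit(node, prefix):
        if not node.children:
            codes[prefix.rstrip(END)] = len(codes)
            return
        for ch in sorted(node.children):
            visit(node.children[ch], prefix + ch)

    visit(root, "")
    return codes


root = build_trie(["ab", "abc", "abd", "b"])
add_dummy_leaves(root)
codes = leaf_codes(root)
print(codes)  # {'ab': 0, 'abc': 1, 'abd': 2, 'b': 3}
```

In this sketch "ab" ends at an internal node, so it receives a dummy leaf; after that, every string corresponds to exactly one leaf, and strings sharing the prefix "ab" occupy the contiguous code range 0 to 2.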
References
Beliakov, G.: Monotonicity preserving approximation of multivariate scattered data. BIT Numer. Math. 45, 653–677 (2005)
Blanchette, M., Kim, E., Vetta, A.: Clique cover on sparse networks. In: Bader, D.A., Mutzel, P. (eds.) ALENEX, pp. 93–102. SIAM (2012)
Cerioli, M.R., Faria, L., Ferreira, T.O., Martinhon, C.A.J., Protti, F., Reed, B.A.: Partition into cliques for cubic graphs: planar case, complexity and approximation. Discret. Appl. Math. 156(12), 2270–2278 (2008)
Chalupa, D.: On the efficiency of an order-based representation in the clique covering problem. In: Soule, T., Moore, J.H. (eds.) GECCO, pp. 353–360. ACM (2012)
Chalupa, D.: Construction of near-optimal vertex clique covering for real-world networks. Comput. Inf. 34(6), 1397–1417 (2015)
Dinh, L., Krueger, D., Bengio, Y.: NICE: non-linear independent components estimation. In: Bengio, Y., LeCun, Y. (eds.) ICLR (2015)
Dutt, A., Wang, C., Nazi, A., Kandula, S., Narasayya, V.R., Chaudhuri, S.: Selectivity estimation for range predicates using lightweight models. Proc. VLDB Endow. 12(9), 1044–1057 (2019)
Fritsch, F.N., Carlson, R.E.: Monotone piecewise cubic interpolation. SIAM J. Numer. Anal. 17(2), 238–246 (1980)
Garey, M.R., Johnson, D.S.: Computers and intractability: a guide to the theory of NP-completeness. W. H. Freeman (1979)
Germain, M., Gregor, K., Murray, I., Larochelle, H.: MADE: masked autoencoder for distribution estimation. In: ICML, vol. 37, pp. 881–889 (2015)
Gharibshah, Z., Zhu, X., Hainline, A., Conway, M.: Deep learning for user interest and response prediction in online display advertising. Data Sci. Eng. 5(1), 12–26 (2020)
Goodfellow, I., Pouget-Abadie, J., et al.: Generative adversarial nets. NIPS 27, 2672–2680 (2014)
Han, L., Schumaker, L.L.: Fitting monotone surfaces to scattered data using C1 piecewise cubics. SIAM J. Numer. Anal. 34(2), 569–585 (1997)
Hasan, S., Thirumuruganathan, S., Augustine, J., Koudas, N., Das, G.: Deep learning models for selectivity estimation of multi-attribute queries. In: SIGMOD, pp. 1035–1050. ACM (2020)
Heimel, M., Kiefer, M., Markl, V.: Self-tuning, GPU-accelerated kernel density models for multidimensional selectivity estimation. In: SIGMOD, pp. 1477–1492. ACM (2015)
Hilprecht, B., Schmidt, A., Kulessa, M., Molina, A., Kersting, K., Binnig, C.: DeepDB: learn from data, not from queries! Proc. VLDB Endow. 13(7), 992–1005 (2020)
Ho, J., Chen, X., Srinivas, A., Duan, Y., Abbeel, P.: Flow++: Improving flow-based generative models with variational dequantization and architecture design. In: ICML, vol. 97, pp. 2722–2730. PMLR (2019)
Hoogeboom, E., Cohen, T.S., Tomczak, J.M.: Learning discrete distributions by dequantization. arXiv preprint arXiv:2001.11235 (2020)
Individual household electric power consumption data set. https://github.com/gpapamak/maf (2021)
Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W. (eds.) The IBM research symposia series, pp. 85–103. Plenum Press, New York (1972)
Kiefer, M., Heimel, M., Breß, S., Markl, V.: Estimating join selectivities using bandwidth-optimized kernel density models. Proc. VLDB Endow. 10(13), 2085–2096 (2017)
Kim, K., Jung, J., Seo, I., Han, W., Choi, K., Chong, J.: Learned cardinality estimation: an in-depth study. In: SIGMOD, pp. 1214–1227. ACM (2022)
Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: ICLR (2014)
Kipf, A., Kipf, T., Radke, B., Leis, V., Boncz, P.A., Kemper, A.: Learned cardinalities: estimating correlated joins with deep learning. In: CIDR. www.cidrdb.org (2019)
Kobyzev, I., Prince, S., Brubaker, S.: Normalizing flows: an introduction and review of current methods. IEEE Trans. Pattern Anal. Mach. Intell. 43, 3964–3979 (2020)
Leis, V., Gubichev, A., Mirchev, A., Boncz, P.A., Kemper, A., Neumann, T.: How good are query optimizers, really? Proc. VLDB Endow. 9(3), 204–215 (2015)
Leis, V., Radke, B., Gubichev, A., Kemper, A., Neumann, T.: Cardinality estimation done right: Index-based join sampling. In: CIDR. www.cidrdb.org (2017)
Lepage, G.P.: Adaptive multidimensional integration: Vegas enhanced. J. Comput. Phys. 439, 110386 (2021)
Li, G., Zhou, X., Cao, L.: AI meets database: AI4DB and DB4AI. In: SIGMOD, pp. 2859–2866 (2021)
Li, G., Zhou, X., Cao, L.: Machine learning for databases. Proc. VLDB Endow. 14(12), 3190–3193 (2021)
Li, G., Zhou, X., Chai, C.: AI meets database: a survey. In: TKDE (2021)
Li, G., Zhou, X., Sun, J., Yu, X., Han, Y., Jin, L., Li, W., Wang, T., Li, S.: openGauss: an autonomous database system. Proc. VLDB Endow. 14(12), 3028–3041 (2021)
Li, M., Wang, H., Li, J.: Mining conditional functional dependency rules on big data. Big Data Min. Anal. 03(01), 68 (2020)
Müller, T., McWilliams, B., Rousselle, F., Gross, M., Novák, J.: Neural importance sampling. ACM Trans. Graph. 38(5), 1–19 (2019)
Ortiz, J., Balazinska, M., Gehrke, J., Keerthi, S.S.: An empirical analysis of deep learning for cardinality estimation. arXiv preprint arXiv:1905.06425 (2019)
Papamakarios, G., Nalisnick, E., Rezende, D.J., Mohamed, S., Lakshminarayanan, B.: Normalizing flows for probabilistic modeling and inference. J. Mach. Learn. Res. 22(57), 1–64 (2021)
Lepage, G.P.: A new algorithm for adaptive multidimensional integration. J. Comput. Phys. 27(2), 192–203 (1978)
Poosala, V., Ioannidis, Y.E., Haas, P.J., Shekita, E.J.: Improved histograms for selectivity estimation of range predicates. In: SIGMOD, pp. 294–305. ACM Press (1996)
PostgreSQL. https://www.postgresql.org/ (2021). Accessed 14 Sep 2021
Rezende, D.J., Mohamed, S.: Variational inference with normalizing flows. In: ICML, vol. 37, pp. 1530–1538 (2015)
Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In: Bernstein, P.A. (ed.) SIGMOD, pp. 23–34. ACM (1979)
Beijing Multi-Site Air-Quality Data Set. https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data (2021). Accessed 14 Sep 2021
Strash, D., Thompson, L.: Effective data reduction for the vertex clique cover problem. In: Phillips, C.A., Speckmann, B. (eds.) ALENEX, pp. 41–53. SIAM (2022)
Sun, J., Li, G.: An end-to-end learning-based cost estimator. Proc. VLDB Endow. 13(3), 307–319 (2019)
Sun, J., Li, G., Tang, N.: Learned cardinality estimation for similarity queries. In: SIGMOD, pp. 1745–1757 (2021)
Sun, J., Zhang, J., Sun, Z., Li, G., Tang, N.: Learned cardinality estimation: a design space exploration and a comparative evaluation. Proc. VLDB Endow. (2021)
Theis, L., van den Oord, A., Bethge, M.: A note on the evaluation of generative models. In: Bengio, Y., LeCun, Y. (eds.) ICLR (2016)
Tian, S., Mo, S., Wang, L., Peng, Z.: Deep reinforcement learning-based approach to tackle topic-aware influence maximization. Data Sci. Eng. 5(1), 1–11 (2020)
Uria, B., Murray, I., Larochelle, H.: RNADE: the real-valued neural autoregressive density-estimator. In: Burges, C.J.C., Bottou, L., Ghahramani, Z., Weinberger, K.Q. (eds.) NIPS, pp. 2175–2183 (2013)
Wang, J., Chai, C., Liu, J., Li, G.: FACE: a normalizing flow based cardinality estimator. Proc. VLDB Endow. 15(1), 72–84 (2021). https://doi.org/10.14778/3485450.3485458
Wang, X., Qu, C., Wu, W., Wang, J., Zhou, Q.: Are we ready for learned cardinality estimation? Proc. VLDB Endow. 14(9), 1640–1654 (2021)
Wu, P., Cong, G.: A unified deep model of learning from both data and queries for cardinality estimation. In: SIGMOD, pp. 2009–2022. ACM (2021)
Yang, Z., Kamsetty, A., Luan, S., Liang, E., Duan, Y., Chen, X., Stoica, I.: NeuroCard: one cardinality estimator for all tables. Proc. VLDB Endow. 14(1), 61–73 (2020)
Yang, Z., Liang, E., Kamsetty, A., Wu, C., Duan, Y., Chen, P., Abbeel, P., Hellerstein, J.M., Krishnan, S., Stoica, I.: Deep unsupervised cardinality estimation. Proc. VLDB Endow. 13(3), 279–292 (2019)
Yu, X., Li, G., Chai, C., Tang, N.: Reinforcement learning with Tree-LSTM for join order selection. In: ICDE, pp. 1297–1308. IEEE (2020)
Zhao, Z., Christensen, R., Li, F., Hu, X., Yi, K.: Random sampling over joins revisited. In: SIGMOD, pp. 1525–1539. ACM (2018)
Zhou, X., Sun, J., Li, G., Feng, J.: Query performance prediction for concurrent queries using graph embedding. Proc. VLDB Endow. 13(9), 1416–1428 (2020)
Zhu, R., Wu, Z., Han, Y., Zeng, K., Pfadler, A., Qian, Z., Zhou, J., Cui, B.: FLAT: fast, lightweight and accurate method for cardinality estimation. Proc. VLDB Endow. 14(9), 1489–1502 (2021)
Ziegler, Z.M., Rush, A.M.: Latent normalizing flows for discrete sequences. In: Chaudhuri, K., Salakhutdinov, R. (eds.) ICML, vol. 97, pp. 7673–7682. PMLR (2019)
Zuckerman, D.: Linear degree extractors and the inapproximability of max clique and chromatic number. In: Kleinberg, J.M. (ed.) STOC, pp. 681–690. ACM (2006)
Acknowledgements
We would like to thank the anonymous reviewers for their valuable comments. This work is supported by the NSF of China (61925205, 62232009, 62102215), Huawei, TAL Education, and Zhongguancun Lab.
About this article
Cite this article
Wang, J., Chai, C., Liu, J. et al. Cardinality estimation using normalizing flow. The VLDB Journal 33, 323–348 (2024). https://doi.org/10.1007/s00778-023-00808-x
DOI: https://doi.org/10.1007/s00778-023-00808-x