Abstract
Cardinality estimation, which aims to accurately estimate the result size of queries, is a fundamental task in database query processing and optimization. One of the most recent and effective solutions to this problem is to use deep autoregressive models to learn joint probability distributions in an unsupervised manner. However, because of data sparsity, it is difficult for the estimator to capture the actual distribution, which reduces the accuracy of cardinality estimation. In addition, the progressive sampling used by autoregressive estimators is prone to error propagation, which is more evident in high-dimensional data. To reduce the error of autoregressive cardinality estimation and to obtain a better trade-off between estimation accuracy and latency, we propose a random smoothing autoregressive cardinality estimation model (SAM-CE), which combines a random smoothing technique with a deep autoregressive model to simplify the learning of joint probability distributions. We also design a smooth progressive sampling method suited to range queries, which improves estimator accuracy by improving sample quality. We conduct extensive experiments to demonstrate the effectiveness and performance of the proposed SAM-CE. The results show that SAM-CE achieves state-of-the-art effectiveness in cardinality estimation.
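As a rough illustration of the progressive sampling that autoregressive estimators rely on, the following Python sketch estimates the selectivity of a two-column range query from the factorization P(x1, x2) = P(x1) P(x2 | x1). It is a minimal sketch under stated assumptions, not the SAM-CE method: the toy table `ROWS`, the helper names `fit_conditionals`, `progressive_sample`, and `true_selectivity`, and the use of exact empirical conditionals in place of a learned deep model are all hypothetical, and no smoothing is applied.

```python
import random
from collections import Counter, defaultdict

# Toy two-column table (hypothetical data, for illustration only).
ROWS = [(1, 10), (1, 20), (2, 10), (2, 30), (3, 20), (3, 30), (3, 30), (4, 10)]


def fit_conditionals(rows):
    """Return empirical P(x1) and P(x2 | x1).

    These exact frequency tables stand in for the conditionals that a
    trained autoregressive model would output (an assumption of this sketch).
    """
    n = len(rows)
    p1 = {v: c / n for v, c in Counter(r[0] for r in rows).items()}
    groups = defaultdict(Counter)
    for x1, x2 in rows:
        groups[x1][x2] += 1
    p2 = {
        x1: {v: c / sum(cnt.values()) for v, c in cnt.items()}
        for x1, cnt in groups.items()
    }
    return p1, p2


def progressive_sample(p1, p2, range1, range2, num_samples=2000, seed=0):
    """Estimate P(x1 in range1, x2 in range2) by progressive sampling.

    Each sample draws x1 from P(x1) restricted to range1, then multiplies
    the in-range mass of column 1 by the in-range mass of P(x2 | x1).
    """
    rng = random.Random(seed)
    lo1, hi1 = range1
    lo2, hi2 = range2
    in_range1 = {v: p for v, p in p1.items() if lo1 <= v <= hi1}
    mass1 = sum(in_range1.values())
    if mass1 == 0.0:
        return 0.0  # no value of column 1 satisfies the predicate
    values, weights = zip(*in_range1.items())
    total = 0.0
    for _ in range(num_samples):
        x1 = rng.choices(values, weights=weights)[0]
        mass2 = sum(p for v, p in p2[x1].items() if lo2 <= v <= hi2)
        total += mass1 * mass2
    return total / num_samples


def true_selectivity(rows, range1, range2):
    """Exact fraction of rows that satisfy both range predicates."""
    hits = sum(
        1
        for x1, x2 in rows
        if range1[0] <= x1 <= range1[1] and range2[0] <= x2 <= range2[1]
    )
    return hits / len(rows)


if __name__ == "__main__":
    p1, p2 = fit_conditionals(ROWS)
    query = ((2, 3), (20, 30))
    print("estimate:", progressive_sample(p1, p2, *query))
    print("truth:   ", true_selectivity(ROWS, *query))
```

Multiplying the estimated selectivity by the table size gives the cardinality estimate. Because each later column is conditioned on values sampled for earlier columns, any inaccuracy in an early conditional carries forward into every subsequent factor, which is the error propagation the abstract refers to.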
Availability of data and materials
The datasets generated during the current study are available from the corresponding author on reasonable request.
Notes
catalog.data.gov/dataset/vehicle-snowmobile-and-boat-registrations.
Funding
This work was supported by the National Natural Science Foundation of China (Nos. 62062027 and U22A2099), the Innovation Project of GUET Graduate Education (No. 2022YCXS079), and the project of Guangxi Key Laboratory of Trusted Software.
Author information
Contributions
Yuming Lin: Methodology, Conceptualization, Investigation, Validation, Writing - original draft, Writing - review & editing, Funding acquisition. Zejun Xu: Methodology, Software, Writing - original draft, Writing - review & editing, Validation. Yinghao Zhang: Data preparation and maintenance, Validation. You Li: Methodology, Writing - review & editing, Funding acquisition. Jingwei Zhang: Resources, Supervision. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
About this article
Cite this article
Lin, Y., Xu, Z., Zhang, Y. et al. Cardinality estimation with smoothing autoregressive models. World Wide Web 26, 3441–3461 (2023). https://doi.org/10.1007/s11280-023-01195-7
DOI: https://doi.org/10.1007/s11280-023-01195-7