
Cardinality estimation with smoothing autoregressive models

World Wide Web

Abstract

Cardinality estimation, which aims to accurately estimate the result size of queries, is a fundamental task in database query processing and optimization. One of the most recent and effective solutions to this problem is to use deep autoregressive models to learn joint probability distributions through unsupervised learning. However, because of data sparsity, it is difficult for the estimator to capture the actual distribution, which reduces the accuracy of cardinality estimation. In addition, the progressive sampling used by autoregressive estimators is prone to error propagation, which is more evident in high-dimensional data. To reduce the error of autoregressive cardinality estimation and to obtain a better trade-off between estimation accuracy and latency, we propose a random smoothing autoregressive cardinality estimation model (SAM-CE), which combines a random smoothing technique with a deep autoregressive model to simplify the learning of joint probability distributions. We also design a smooth progressive sampling method suitable for range queries, which improves estimator accuracy by improving sample quality. We conduct extensive experiments to demonstrate the effectiveness and performance of the proposed SAM-CE. The results show that SAM-CE achieves state-of-the-art effectiveness in cardinality estimation.
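The progressive sampling mentioned in the abstract can be illustrated with a minimal sketch. This is not the SAM-CE method: the smoothing step and the deep autoregressive model are not reproduced, and exact empirical conditionals over a hypothetical two-column table stand in for the learned model. All names (`fit_conditionals`, `restrict`, `progressive_estimate`) and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def fit_conditionals(rows):
    """Empirical P(x1) and P(x2 | x1) for a two-column table.

    Assumption: these exact counts stand in for a learned autoregressive model.
    """
    n = len(rows)
    p_x1 = {v: c / n for v, c in Counter(r[0] for r in rows).items()}
    by_x1 = defaultdict(Counter)
    for x1, x2 in rows:
        by_x1[x1][x2] += 1
    p_x2_given_x1 = {}
    for x1, counts in by_x1.items():
        total = sum(counts.values())
        p_x2_given_x1[x1] = {v: c / total for v, c in counts.items()}
    return p_x1, p_x2_given_x1


def restrict(dist, lo, hi):
    """Keep values in [lo, hi]; return (probability mass in range, kept entries)."""
    kept = {v: p for v, p in dist.items() if lo <= v <= hi}
    return sum(kept.values()), kept


def progressive_estimate(rows, range1, range2, num_samples=200, seed=0):
    """Estimate how many rows satisfy x1 in range1 AND x2 in range2."""
    rng = random.Random(seed)
    p_x1, p_x2_given_x1 = fit_conditionals(rows)
    total = 0.0
    for _ in range(num_samples):
        # Column 1: record the in-range mass, then sample inside the range.
        mass1, kept1 = restrict(p_x1, *range1)
        if mass1 == 0.0:
            continue  # no value qualifies, so this sample contributes zero
        values = list(kept1)
        x1 = rng.choices(values, weights=[kept1[v] for v in values])[0]
        # Column 2: in-range mass conditioned on the sampled x1.
        mass2, _ = restrict(p_x2_given_x1[x1], *range2)
        total += mass1 * mass2
    # Average selectivity over the samples, scaled to a row count.
    return len(rows) * total / num_samples


# Hypothetical toy table; 4 rows satisfy 2 <= x1 <= 3 and 20 <= x2 <= 30.
TOY_ROWS = [(1, 10), (1, 20), (2, 10), (2, 30), (3, 20), (3, 30), (3, 30), (4, 10)]
estimate = progressive_estimate(TOY_ROWS, (2, 3), (20, 30))
```

Each sample multiplies the in-range probability mass of every column, conditioned on the values drawn for the earlier columns. An error in an early conditional therefore affects every later factor, which is the error propagation the abstract refers to.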


Availability of data and materials

The datasets generated during the current study are available from the corresponding author on reasonable request.

Notes

  1. catalog.data.gov/dataset/vehicle-snowmobile-and-boat-registrations.

  2. https://archive.ics.uci.edu/ml/index.php.


Funding

This work was supported by the National Natural Science Foundation of China (Nos. 62062027 and U22A2099), the Innovation Project of GUET Graduate Education (No. 2022YCXS079), and the project of Guangxi Key Laboratory of Trusted Software.

Author information

Contributions

Yuming Lin: Methodology, Conceptualization, Investigation, Validation, Writing - original draft, Writing - review & editing, Funding acquisition. Zejun Xu: Methodology, Software, Writing - original draft, Writing - review & editing, Validation. Yinghao Zhang: Data preparation and maintenance, Validation. You Li: Methodology, Writing - review & editing, Funding acquisition. Jingwei Zhang: Resources, Supervision. All authors reviewed the manuscript.

Corresponding author

Correspondence to You Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Lin, Y., Xu, Z., Zhang, Y. et al. Cardinality estimation with smoothing autoregressive models. World Wide Web 26, 3441–3461 (2023). https://doi.org/10.1007/s11280-023-01195-7


  • DOI: https://doi.org/10.1007/s11280-023-01195-7
