The VC-Dimension of SQL Queries and Selectivity Estimation through Sampling

  • Matteo Riondato
  • Mert Akdere
  • Uǧur Çetintemel
  • Stanley B. Zdonik
  • Eli Upfal
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6912)


We develop a novel method, based on the statistical concept of VC-dimension, for evaluating the selectivity (output cardinality) of SQL queries – a crucial step in optimizing the execution of large scale database and data-mining operations. The major theoretical contribution of this work, which is of independent interest, is an explicit bound on the VC-dimension of a range space defined by all possible outcomes of a collection (class) of queries. We prove that the VC-dimension is a function of the maximum number of Boolean operations in the selection predicate, and of the maximum number of select and join operations in any individual query in the collection, but it is neither a function of the number of queries in the collection nor of the size of the database. We develop a method based on this result: given a class of queries, it constructs a concise random sample of a database, such that with high probability the execution of any query in the class on the sample provides an accurate estimate for the selectivity of the query on the original large database. The error probability holds simultaneously for the selectivity estimates of all queries in the collection, thus the same sample can be used to evaluate the selectivity of multiple queries, and the sample needs to be refreshed only following major changes in the database. The sample representation computed by our method is typically sufficiently small to be stored in main memory. We present extensive experimental results, validating our theoretical analysis and demonstrating the advantage of our technique when compared to complex selectivity estimation techniques used in PostgreSQL and the Microsoft SQL Server.


Query Processing Range Space Execution Plan Query Answering Multiple Query 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


  1. 1.
    Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: Join Synopses for Approximate Query Answering. In: SIGMOD 1999 (1999)Google Scholar
  2. 2.
    Alon, N., Spencer, J.H.: The Probabilistic Method, 3rd edn. John Wiley & Sons, Chichester (2008)CrossRefzbMATHGoogle Scholar
  3. 3.
    Babcock, B., Chaudhuri, S., Das, G.: Dynamic Sample Selection for Approximate Query Processing. In: SIGMOD 2003 (2003)Google Scholar
  4. 4.
    Brown, P.G., Haas, P.J.: Techniques for Warehousing of Sample Data. In: ICDE 2006 (2006)Google Scholar
  5. 5.
    Bruno, N., Chaudhuri, S.: Conditional Selectivity for Statistics on Query Expressions. In: SIGMOD 2004 (2004)Google Scholar
  6. 6.
    Bruno, N., Chaudhuri, S., Gravano, L.: STHoles: a Multidimensional Workload-Aware Histogram. In: SIGMOD 2001 (2001)Google Scholar
  7. 7.
    Chaudhuri, S., Das, G., Narasayya, V.: Optimized Stratified Sampling for Approximate Query Processing. ACM Trans. Database Syst. 32(2), Art. 9 (2007)Google Scholar
  8. 8.
    Chaudhuri, S., Motwani, R., Narasayya, V.: Random Sampling for Histogram Construction: How Much is Enough. In: SIGMOD 1998 (1998)Google Scholar
  9. 9.
    Chaudhuri, S., Motwani, R., Narasayya, V.: On Random Sampling over Joins. In: SIGMOD 1999 (1999)Google Scholar
  10. 10.
    Chazelle, B.: The Discrepancy Method: Randomness and Complexity. Cambridge University Press, Cambridge (2000)CrossRefzbMATHGoogle Scholar
  11. 11.
    Chen, M.C., McNamee, L., Matloff, N.S.: Selectivity Estimation Using Homogeneity Measurement. In: ICDE 1990 (1990)Google Scholar
  12. 12.
    Das, G.: Sampling Methods in Approximate Query Answering Systems. In: Wang, J. (ed.) Encyclopedia of Data Warehousing and Mining. Information Science Publishing (2005)Google Scholar
  13. 13.
    Dobra, A.: Histograms Revisited: When are Histograms the Best Approximation Method for Aggregates over Joins? In: PODS 2005 (2005)Google Scholar
  14. 14.
    Estan, C., Naughton, J.F.: End-Biased Samples for Join Cardinality Estimation. In: ICDE 2006 (2006)Google Scholar
  15. 15.
    Ganguly, S., Gibbons, P.B., Matias, Y., Silberschatz, A.: Bifocal Sampling for Skew-Resistant Join Size Estimation. In: SIGMOD 1996 (1996)Google Scholar
  16. 16.
    Ganti, V., Lee, M.-L., Ramakrishnan, R.: ICICLES: Self-Tuning Samples for Approximate Query Answering. In: VLDB 2000 (2000)Google Scholar
  17. 17.
    Garcia-Molina, H., Ullman, J.D., Widom, J.: Database Systems: The Complete Book. Prentice-Hall, Englewood Cliffs (2002)Google Scholar
  18. 18.
    Gemulla, R., Lehner, W., Haas, P.J.: A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets. In: VLDB 2006 (2006)Google Scholar
  19. 19.
    Getoor, L., Taskar, B., Koller, D.: Selectivity Estimation using Probabilistic Models. In: SIGMOD 2001 (2001)Google Scholar
  20. 20.
    Gibbons, P.B., Matias, Y.: New Sampling-Based Summary Statistics for Improving Approximate Query Answers. In: SIGMOD 1998 (1998)Google Scholar
  21. 21.
    Gryz, J., Liang, D.: Query Selectivity Estimation via Data Mining. In: IIS 2004 (2004)Google Scholar
  22. 22.
    Haas, P.J., Naughton, J.F., Seshadri, S., Swami, A.N.: Fixed-Precision Estimation of Join Selectivity. In: PODS 1993 (1993)Google Scholar
  23. 23.
    Haas, P.J., Naughton, J.F., Swami, A.N.: On the Relative Cost of Sampling for Join Selectivity Estimation. In: PODS 1994 (1994)Google Scholar
  24. 24.
    Haas, P.J., Swami, A.N.: Sequential Sampling Procedures for Query Size Estimation. In: SIGMOD 1992 (1992)Google Scholar
  25. 25.
    Haas, P.J., Swami, A.N.: Sampling-Based Selectivity Estimation for Joins Using Augmented Frequent Value Statistics. In: ICDE 1995 (1995)Google Scholar
  26. 26.
    Haas, P.J., Naughton, J.F., Seshadri, S., Swami, A.N.: Selectivity and Cost Estimation for Joins Based on Random Sampling. Jour. of Comp. and Sys. Sci. 52, 550–569 (1996)MathSciNetCrossRefzbMATHGoogle Scholar
  27. 27.
    Haas, P.J.: Hoeffding Inequalities for Join-Selectivity Estimation and Online Aggregation. IBM Research Report RJ 10040 (2000)Google Scholar
  28. 28.
    Harangsri, B., Shepherd, J., Ngu, A.H.H.: Query Size Estimation using Machine Learning. In: DASFAA 1997 (1997)Google Scholar
  29. 29.
    Har-Peled, S., Sharir, M.: Relative (p,ε)-Approximations in Geometry. Discrete & Computational Geometry 45(3), 462–496 (2011)MathSciNetCrossRefzbMATHGoogle Scholar
  30. 30.
    Hou, W.-C., Ozsoyoglu, G., Dogdu, E.: Error-Constraint Count Query Evaluation in Relational Databases. In: SIGMOD 1991 (1991)Google Scholar
  31. 31.
    Hou, W.-C., Ozsoyoglu, G., Taneja, B.K.: Statistical Estimators for Relational Algebra Expressions. In: PODS 1988 (1988)Google Scholar
  32. 32.
    Ioannidis, Y.E., Poosala, V.: Balancing Histogram Optimality and Practicality for Query Result Size Estimation. In: SIGMOD 1995 (1995)Google Scholar
  33. 33.
    Jagadish, H.V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K.C., Suel, T.: Optimal Histograms with Quality Guarantees. In: VLDB 1998 (1998)Google Scholar
  34. 34.
    Jin, R., Glimcher, L., Jermaine, C., Agrawal, G.: New Sampling-Based Estimators for OLAP Queries. In: ICDE 2006 (2006)Google Scholar
  35. 35.
    Joshi, S., Jermaine, C.: Robust Stratified Sampling Plans for Low Selectivity Queries. In: ICDE 2008 (2008)Google Scholar
  36. 36.
    Kaushik, R., Naughton, J.F., Ramakrishnan, R., Chakravarthy, V.T.: Synopses for Query Optimization: a Space-Complexity Perspective. ACM Trans. Database Syst. 30(4), 1102–1127 (2005)CrossRefGoogle Scholar
  37. 37.
    Larson, P.-A., Lehner, W., Zhou, J., Zabback, P.: Cardinality Estimation Using Sample Views with Quality Assurance. In: SIGMOD 2007 (2007)Google Scholar
  38. 38.
    Li, Y., Long, P.M., Srinivasan, A.: Improved Bounds on the Sample Complexity of Learning. Jour. of Comp. and Sys. Sci. 62, 516–527 (2001)MathSciNetCrossRefzbMATHGoogle Scholar
  39. 39.
    Lipton, R.J., Naughton, J.F.: Query Size Estimation by Adaptive Sampling. J. Comput. Syst. Sci. 51(1), 18–25 (1995)MathSciNetCrossRefzbMATHGoogle Scholar
  40. 40.
    Lipton, R.J., Naughton, J.F., Schneider, D.A.: Practical Selectivity Estimation through Adaptive Sampling. In: SIGMOD 1990 (1990)Google Scholar
  41. 41.
    Löffler, M., Phillips, J.M.: Shape fitting on point sets with probability distributions. In: Fiat, A., Sanders, P. (eds.) ESA 2009. LNCS, vol. 5757, pp. 313–324. Springer, Heidelberg (2009)CrossRefGoogle Scholar
  42. 42.
    Markl, V., Haas, M., Peter, J., Kutsch, Megiddo, N., Srivastava, U., Tran, T.M.: Consistent Selectivity Estimation via Maximum Entropy. The VLDB Journal 16(1), 55–76 (2007)CrossRefGoogle Scholar
  43. 43.
    Matias, Y., Vitter, J.S., Wang, M.: Wavelet-Based Histograms for Selectivity Estimation. In: SIGMOD 1998 (1998)Google Scholar
  44. 44.
    Matoušek, J.: Lectures on Discrete Geometry. Springer, Heidelberg (2002)CrossRefzbMATHGoogle Scholar
  45. 45.
    Miller, R.J.: Simultaneous Statistical Inference, 2nd edn. Springer, Heidelberg (1981)CrossRefzbMATHGoogle Scholar
  46. 46.
    Ngu, A.H., Harangsri, B., Shepherd, J.: Query Size Estimation for Joins Using Systematic Sampling. Distributed and Parallel Databases 15, 237–275 (2004)CrossRefGoogle Scholar
  47. 47.
    Olken, F.: Random Sampling from Databases. Ph.D. dissertation, LBL Tech. Report LBL-32883 (1993)Google Scholar
  48. 48.
    Poosala, V., Haas, P.J., Ioannidis, Y.E., Shekita, E.J.: Improved Histograms for Selectivity Estimation of Range Predicates. In: SIGMOD 1996 (1996)Google Scholar
  49. 49.
    Poosala, V., Ioannidis, Y.E.: Selectivity Estimation without the Attribute Value Independence Assumption. In: VLDB 1997 (1997)Google Scholar
  50. 50.
    Ré, C., Suciu, D.: Understanding Cardinality Estimation Using Entropy Maximization. In: PODS 2010 (2010)Google Scholar
  51. 51.
    Riondato, M., Akdere, M., Çetintemel, U., Zdonik, S.B., Upfal, E.: The VC-Dimension of SQL Queries and Selectivity Estimation Through Sampling. CoRR abs/1101.5805 (2011)Google Scholar
  52. 52.
    Srivastava, U., Haas, P.J., Markl, V., Kutsch, M., Tran, T.: ISOMER: Consistent Histogram Construction Using Query Feedback. In: ICDE 2006 (2006)Google Scholar
  53. 53.
    Vapnik, V.N., Chervonenkis, A.Y.: On the Uniform Convergence of Relative Frequencies of Events to their Probabilities. Theory of Probability and its Applications 16(2), 264–280 (1971)CrossRefzbMATHGoogle Scholar
  54. 54.
    Wang, H., Sevcik, K.C.: A Multi-Dimensional Histogram for Selectivity Estimation and Fast Approximate Query Answering. In: CASCON 2003 (2003)Google Scholar
  55. 55.
    Wang, M., Vitter, J.S., Iyer, B.R.: Selectivity Estimation in the Presence of Alphanumeric Correlations.In: ICDE 1997 (1997)Google Scholar
  56. 56.
    Wu, Y.-L., Agrawal, D., El Abbadi, A.: Applying the Golden Rule of Sampling for Query Estimation. In: SIGMOD 2001 (2001)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Matteo Riondato
    • 1
  • Mert Akdere
    • 1
  • Uǧur Çetintemel
    • 1
  • Stanley B. Zdonik
    • 1
  • Eli Upfal
    • 1
  1. 1.Department of Computer ScienceBrown UniversityProvidenceUSA

Personalised recommendations