HASE: A Hybrid Approach to Selectivity Estimation for Conjunctive Predicates

Yu, Xiaohui; Koudas, Nick; Zuzarte, Calisto

doi:10.1007/11687238_29

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3896)

Included in the following conference series:

International Conference on Extending Database Technology

Abstract

Current methods for selectivity estimation fall into two broad categories, synopsis-based and sampling-based. Synopsis-based methods, such as histograms, incur minimal overhead at query optimization time and thus are widely used in commercial database systems. Sampling-based methods are more suited for ad-hoc queries, but often involve high I/O cost because of random access to the underlying data. Though both methods serve the same purpose of selectivity estimation, their interaction in the case of selectivity estimation for conjuncts of predicates on multiple attributes is largely unexplored. Our work aims at taking the best of both worlds, by making consistent use of synopses and sample information when they are both present. To achieve this goal, we propose HASE, a novel estimation scheme based on a powerful mechanism called generalized raking. We formalize selectivity estimation in the presence of single attribute synopses and sample information as a constrained optimization problem. By solving this problem, we obtain a new set of weights associated with the sampled tuples, which has the nice property of reproducing the known selectivities when applied to individual predicates. We discuss different variants of the optimization problem and provide algorithms for solving it. We also provide asymptotic error bounds on the estimate. Extensive experiments are performed on both synthetic and real data, and the results show that HASE significantly outperforms both synopsis-based and sampling-based methods.
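
The reweighting idea in the abstract can be illustrated with a minimal sketch. It uses iterative proportional fitting, the classical raking procedure, as a stand-in; it is not necessarily the algorithm the paper develops, and the function names, the toy sample, and the "known" selectivities are all assumptions made for illustration. Each sampled tuple is represented by 0/1 indicators, one per predicate, and every known selectivity is assumed to lie strictly between 0 and 1.

```python
def rake_weights(indicators, known_sel, max_sweeps=1000, tol=1e-12):
    """Reweight sampled tuples so that each predicate's weighted selectivity
    matches its known (synopsis-derived) selectivity.

    indicators: one tuple of 0/1 flags per sampled row, one flag per predicate.
    known_sel:  known single-predicate selectivities (hypothetical inputs).
    """
    n = len(indicators)
    weights = [1.0 / n] * n  # uniform sampling weights that sum to 1
    for _ in range(max_sweeps):
        worst_gap = 0.0
        for j, target in enumerate(known_sel):
            # Weighted mass of rows that satisfy / do not satisfy predicate j.
            mass_in = sum(w for w, row in zip(weights, indicators) if row[j])
            mass_out = sum(weights) - mass_in
            if mass_in <= 0.0 or mass_out <= 0.0:
                raise ValueError("predicate %d is constant on the sample" % j)
            worst_gap = max(worst_gap, abs(mass_in - target))
            # Rescale both groups so predicate j's margin equals its target.
            scale_in = target / mass_in
            scale_out = (1.0 - target) / mass_out
            weights = [w * (scale_in if row[j] else scale_out)
                       for w, row in zip(weights, indicators)]
        if worst_gap < tol:  # every margin already matched before rescaling
            break
    return weights


def conjunctive_selectivity(indicators, weights):
    """Weighted fraction of rows that satisfy all predicates at once."""
    return sum(w for w, row in zip(weights, indicators) if all(row))


# Toy sample of 10 rows over two predicates (P1, P2); purely illustrative.
sample = [(1, 1)] * 3 + [(1, 0)] * 2 + [(0, 1)] * 1 + [(0, 0)] * 4
known = [0.6, 0.5]  # assumed selectivities of P1 and P2 from synopses

weights = rake_weights(sample, known)
estimate = conjunctive_selectivity(sample, weights)
print(estimate)
```

The adjusted weights reproduce the two assumed single-predicate selectivities on the sample, while the correlation between the predicates still comes from the sampled rows rather than from an independence assumption.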

Author information

Authors and Affiliations

Department of Computer Science, University of Toronto, Toronto, ON, M5S 3G4, Canada
Xiaohui Yu & Nick Koudas
IBM Toronto Lab, 8200 Warden Avenue, Markham, ON, L6G 1C7, Canada
Calisto Zuzarte

Editor information

Editors and Affiliations

University of Athens, Greece
Yannis Ioannidis
University of Konstanz, P.O.Box D188, 78457, Konstanz, Germany
Marc H. Scholl
Sustainable Content Logistics Centre, Hamburg, Germany
Joachim W. Schmidt
Chair of Software Engineering for Business Information Systems, Technische Universität München, Boltzmannstraße 3, 85748, Garching b. München
Florian Matthes
Department of Informatics, University of Athens Panepistimiopolis, 15771, Athens, Greece
Mike Hatzopoulos
IPD, Universität Karlsruhe, Am Fasanengarten 5, 76131, Karlsruhe
Klemens Boehm
TU München, D-85748, Garching, Germany
Alfons Kemper
Technische Universität München, Germany
Torsten Grust
Institute for Computer Science, Ludwig-Maximilians Universität München
Christian Boehm

About this paper

Cite this paper

Yu, X., Koudas, N., Zuzarte, C. (2006). HASE: A Hybrid Approach to Selectivity Estimation for Conjunctive Predicates. In: Ioannidis, Y., et al. Advances in Database Technology - EDBT 2006. EDBT 2006. Lecture Notes in Computer Science, vol 3896. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11687238_29

DOI: https://doi.org/10.1007/11687238_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32960-2
Online ISBN: 978-3-540-32961-9
eBook Packages: Computer Science, Computer Science (R0)
