The VLDB Journal, Volume 14, Issue 2, pp 137–154

Selectivity estimators for multidimensional range queries over real attributes

  • Dimitrios Gunopulos
  • George Kollios
  • Vassilis J. Tsotras
  • Carlotta Domeniconi

Abstract.

Estimating the selectivity of multidimensional range queries over real-valued attributes has significant applications in data exploration and database query optimization. In this paper, we consider the following problem: given a table of d attributes whose domain is the real numbers and a query that specifies a range in each dimension, find a good approximation of the number of records in the table that satisfy the query. The simplest approach to this problem is to assume that the attributes are independent. More accurate estimators try to capture the joint data distribution of the attributes. In databases, such estimators include multidimensional histograms, random sampling, and the wavelet transform. In statistics, kernel estimation techniques are used. Many traditional approaches assume that attribute values come from discrete, finite domains, where different values have high frequencies. However, in many novel applications (such as temporal, spatial, and multimedia databases), attribute values come from the infinite domain of real numbers. Consequently, each value appears very infrequently, a characteristic that affects the behavior and effectiveness of the estimator. Moreover, real-life data exhibit attribute correlations that also affect the estimator. We present a new histogram technique designed to approximate the density of multidimensional datasets with real attributes. Our technique defines buckets of variable size and allows the buckets to overlap. The size of the cells is based on the local density of the data. The use of overlapping buckets allows a more compact approximation of the data distribution. We also show how to generalize kernel density estimators and how to apply them to the multidimensional query approximation problem. Finally, we compare the accuracy of the proposed techniques with existing techniques using real and synthetic datasets. The experimental results show that the proposed techniques are more accurate in high dimensions than previous approaches.
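
To make the contrast between the independence assumption and a kernel-based estimator concrete, the following is a minimal Python sketch. It is not the paper's algorithm: the Epanechnikov product kernel, the rule-of-thumb bandwidth, the synthetic correlated dataset, and all function names are assumptions chosen for illustration. The kernel estimate averages, over a random sample, the kernel mass that each sample point places inside the query box.

```python
import random


def in_box(point, lo, hi):
    """True if lo[j] <= point[j] <= hi[j] holds in every dimension."""
    return all(l <= x <= h for x, l, h in zip(point, lo, hi))


def true_selectivity(data, lo, hi):
    """Exact fraction of records that fall inside the query box."""
    return sum(in_box(p, lo, hi) for p in data) / len(data)


def independence_estimate(data, lo, hi):
    """Product of per-attribute selectivities (attribute independence)."""
    est = 1.0
    for j in range(len(lo)):
        est *= sum(lo[j] <= p[j] <= hi[j] for p in data) / len(data)
    return est


def epanechnikov_cdf(u):
    """CDF of the kernel K(u) = 0.75 * (1 - u^2) supported on [-1, 1]."""
    u = max(-1.0, min(1.0, u))
    return 0.5 + 0.75 * u - 0.25 * u ** 3


def rule_of_thumb_bandwidth(sample):
    """Per-dimension bandwidth std_j * m^(-1/(d+4)); an assumed heuristic."""
    m, d = len(sample), len(sample[0])
    widths = []
    for j in range(d):
        col = [p[j] for p in sample]
        mean = sum(col) / m
        std = (sum((v - mean) ** 2 for v in col) / m) ** 0.5
        widths.append(std * m ** (-1.0 / (d + 4)))
    return widths


def kernel_estimate(sample, lo, hi, bandwidth):
    """Average kernel mass inside the box, using a product kernel."""
    total = 0.0
    for p in sample:
        mass = 1.0
        for x, l, h, bw in zip(p, lo, hi, bandwidth):
            # Mass of the 1-D kernel centred at x that lies within [l, h].
            mass *= epanechnikov_cdf((h - x) / bw) - epanechnikov_cdf((l - x) / bw)
        total += mass
    return total / len(sample)


def demo(seed=0, n=20000, m=500):
    """Compare the estimators on strongly correlated synthetic 2-D data."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        data.append((x, x + rng.gauss(0.0, 0.01)))  # y is almost equal to x
    sample = rng.sample(data, m)
    lo, hi = (0.0, 0.6), (0.4, 1.0)  # box far from the diagonal y = x
    return (
        true_selectivity(data, lo, hi),
        independence_estimate(data, lo, hi),
        kernel_estimate(sample, lo, hi, rule_of_thumb_bandwidth(sample)),
    )
```

Calling `demo()` returns the exact selectivity, the independence estimate, and the kernel estimate for a query box that lies away from the diagonal on which the synthetic data concentrate. Because the two attributes are strongly correlated, multiplying per-attribute selectivities overestimates the result, while the sample-based kernel estimate follows the joint distribution.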



Copyright information

© Springer-Verlag Berlin/Heidelberg 2005

Authors and Affiliations

  • Dimitrios Gunopulos (1)
  • George Kollios (2)
  • Vassilis J. Tsotras (1)
  • Carlotta Domeniconi (3)
  1. Department of Computer Science and Engineering, Bourns College of Engineering, University of California, Riverside, Riverside, USA
  2. Department of Computer Science, Boston University, Boston, USA
  3. Department of Information and Software Engineering, George Mason University, Fairfax, USA
