Abstract
Hierarchical binary partitions of multi-dimensional data are investigated as a basis for the construction of effective histograms. Specifically, we study how representing the histogram with lossless compression techniques affects both the accuracy and the efficiency of query answering. Compression is obtained by exploiting the hierarchical partition scheme underlying the histogram, and then introducing further restrictions on the partitioning which enable a more compact representation of bucket boundaries. These restrictions constrain the splits of the partition to lie on regular grids defined on the buckets. Several heuristics guiding the histogram construction are also proposed, and a thorough experimental analysis is provided, comparing the accuracy of histograms obtained by combining different heuristics with different representation models (both the new compression-based ones and the traditional ones). The best accuracy is obtained by combining our grid-constrained partitioning scheme with one of the new heuristics. Histograms resulting from this combination are compared with state-of-the-art summarization techniques, showing that the proposed approach yields lower error rates and is much less sensitive to dimensionality, and that adopting our compression scheme improves the efficiency of query estimation.
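The idea of a grid-constrained hierarchical binary partition can be illustrated with a minimal two-dimensional sketch. It is not the construction algorithm or any of the heuristics from the article. The greedy criterion (largest reduction in sum of squared errors), the grid parameter `k`, and all names (`Bucket`, `grid_splits`, `build`, `estimate`) are assumptions introduced here for illustration, and the compressed encoding of bucket boundaries is not implemented. Each bucket may only be split at positions lying on a regular `k`-division grid over its own extent, and range-sum queries are estimated by assuming values are spread uniformly inside each leaf bucket.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Bucket:
    # Half-open row range [r0, r1) and column range [c0, c1).
    r0: int
    r1: int
    c0: int
    c1: int
    total: float = 0.0
    left: Optional["Bucket"] = None
    right: Optional["Bucket"] = None


def block_sum(data, r0, r1, c0, c1) -> float:
    return float(sum(sum(row[c0:c1]) for row in data[r0:r1]))


def sse(data, r0, r1, c0, c1) -> float:
    """Sum of squared deviations from the block mean (0 for an empty block)."""
    n = (r1 - r0) * (c1 - c0)
    if n == 0:
        return 0.0
    mean = block_sum(data, r0, r1, c0, c1) / n
    return sum((v - mean) ** 2 for row in data[r0:r1] for v in row[c0:c1])


def grid_splits(b: Bucket, k: int) -> List[Tuple[int, int]]:
    """Candidate (dimension, position) splits restricted to a k-division grid."""
    out = []
    for dim, (lo, hi) in enumerate(((b.r0, b.r1), (b.c0, b.c1))):
        positions = {lo + (hi - lo) * i // k for i in range(1, k)}
        out.extend((dim, p) for p in sorted(positions) if lo < p < hi)
    return out


def children(b: Bucket, dim: int, pos: int):
    if dim == 0:
        return (b.r0, pos, b.c0, b.c1), (pos, b.r1, b.c0, b.c1)
    return (b.r0, b.r1, b.c0, pos), (b.r0, b.r1, pos, b.c1)


def build(data, max_buckets: int, k: int = 4) -> Bucket:
    """Greedily split the leaf/grid position with the largest SSE reduction."""
    root = Bucket(0, len(data), 0, len(data[0]))
    root.total = block_sum(data, root.r0, root.r1, root.c0, root.c1)
    leaves = [root]
    while len(leaves) < max_buckets:
        best = None  # (gain, leaf, dim, pos)
        for leaf in leaves:
            base = sse(data, leaf.r0, leaf.r1, leaf.c0, leaf.c1)
            for dim, pos in grid_splits(leaf, k):
                a, c = children(leaf, dim, pos)
                gain = base - sse(data, *a) - sse(data, *c)
                if best is None or gain > best[0]:
                    best = (gain, leaf, dim, pos)
        if best is None or best[0] <= 0:
            break  # no split improves the approximation
        _, leaf, dim, pos = best
        a, c = children(leaf, dim, pos)
        leaf.left = Bucket(*a, total=block_sum(data, *a))
        leaf.right = Bucket(*c, total=block_sum(data, *c))
        leaves.remove(leaf)
        leaves += [leaf.left, leaf.right]
    return root


def count_leaves(b: Bucket) -> int:
    return 1 if b.left is None else count_leaves(b.left) + count_leaves(b.right)


def estimate(b: Bucket, r0: int, r1: int, c0: int, c1: int) -> float:
    """Estimate a range sum, assuming uniform spread inside each leaf bucket."""
    rows = min(b.r1, r1) - max(b.r0, r0)
    cols = min(b.c1, c1) - max(b.c0, c0)
    if rows <= 0 or cols <= 0:
        return 0.0  # query does not overlap this subtree
    if b.left is None:
        return b.total * rows * cols / ((b.r1 - b.r0) * (b.c1 - b.c0))
    return estimate(b.left, r0, r1, c0, c1) + estimate(b.right, r0, r1, c0, c1)
```

Because every split position is one of at most `k - 1` grid points per dimension of the parent bucket, a split can in principle be described by a dimension and a small grid index rather than by full coordinates, which is the kind of compact boundary representation the abstract refers to.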
Cite this article
Furfaro, F., Mazzeo, G.M., Saccà, D. et al. Compressed hierarchical binary histograms for summarizing multi-dimensional data. Knowl Inf Syst 15, 335–380 (2008). https://doi.org/10.1007/s10115-007-0087-1
DOI: https://doi.org/10.1007/s10115-007-0087-1