Compressed hierarchical binary histograms for summarizing multi-dimensional data

Abstract

Hierarchical binary partitions of multi-dimensional data are investigated as a basis for constructing effective histograms. Specifically, the impact of adopting lossless compression techniques for representing the histogram on both the accuracy and the efficiency of query answering is investigated. Compression is obtained by exploiting the hierarchical partition scheme underlying the histogram, and then by introducing further restrictions on the partitioning that enable a more compact representation of bucket boundaries. These restrictions constrain the splits of the partition to lie on regular grids defined on the buckets. Several heuristics guiding the histogram construction are also proposed, and a thorough experimental analysis is provided, comparing the accuracy of histograms obtained by combining different heuristics with different representation models (both the new compression-based ones and the traditional ones). The best accuracy is obtained by combining our grid-constrained partitioning scheme with one of the new heuristics. Histograms resulting from this combination are compared with state-of-the-art summarization techniques, showing that the proposed approach yields lower error rates and is much less sensitive to dimensionality, and that adopting our compression scheme improves the efficiency of query estimation.
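To make the idea of grid-constrained splits concrete, the following is a minimal sketch, not the encoding used in the paper (the abstract does not specify it). It assumes that each bucket is an axis-aligned box with integer coordinates, that a split may only fall on one of the k - 1 interior lines of a grid dividing the bucket into k equal parts along the chosen dimension, and that a split is stored as a (dimension, grid line) pair. The names `grid_split` and `split_bits` and the parameter `k` are hypothetical.

```python
import math


def grid_split(lo, hi, dim, pos, k):
    """Split the box [lo, hi) along `dim` at the pos-th interior line
    of a k-cell regular grid laid over the box (assumed scheme)."""
    if not 1 <= pos < k:
        raise ValueError("pos must be an interior grid line: 1 <= pos < k")
    # Coordinate of the chosen grid line (integer arithmetic).
    cut = lo[dim] + (hi[dim] - lo[dim]) * pos // k
    left_hi = list(hi)
    left_hi[dim] = cut
    right_lo = list(lo)
    right_lo[dim] = cut
    return (tuple(lo), tuple(left_hi)), (tuple(right_lo), tuple(hi))


def split_bits(num_dims, k):
    """Bits needed to store one split as (dimension, grid line)
    under the assumptions stated above."""
    if num_dims < 1 or k < 2:
        raise ValueError("need num_dims >= 1 and k >= 2")
    return math.ceil(math.log2(num_dims)) + math.ceil(math.log2(k - 1))


# Usage: split a 64 x 64 bucket along dimension 0 at grid line 3 of 8.
left, right = grid_split((0, 0), (64, 64), dim=0, pos=3, k=8)
print(left, right)
print(split_bits(2, 8))  # 1 bit for the dimension + 3 bits for the line = 4
```

Under these assumptions, a split costs a handful of bits instead of a full coordinate value, which is the kind of saving that a more compact representation of bucket boundaries refers to. The trade-off is that a coarser grid (smaller k) leaves fewer candidate split positions.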

Author information

Corresponding author

Correspondence to Filippo Furfaro.

About this article

Cite this article

Furfaro, F., Mazzeo, G.M., Saccà, D. et al. Compressed hierarchical binary histograms for summarizing multi-dimensional data. Knowl Inf Syst 15, 335–380 (2008). https://doi.org/10.1007/s10115-007-0087-1

Keywords

  • Data reduction
  • Multi-dimensional data
  • Histograms
  • Query processing