Knowledge and Information Systems, Volume 15, Issue 3, pp 335–380

Compressed hierarchical binary histograms for summarizing multi-dimensional data

  • Filippo Furfaro
  • Giuseppe M. Mazzeo
  • Domenico Saccà
  • Cristina Sirangelo
Regular Paper

Abstract

Hierarchical binary partitions of multi-dimensional data are investigated as a basis for constructing effective histograms. Specifically, the paper studies how adopting lossless compression techniques to represent the histogram affects both the accuracy and the efficiency of query answering. Compression is obtained by exploiting the hierarchical partition scheme underlying the histogram, and then by introducing further restrictions on the partitioning that enable a more compact representation of bucket boundaries. These restrictions constrain the splits of the partition to lie on regular grids defined on the buckets. Several heuristics for guiding histogram construction are also proposed, together with a thorough experimental analysis comparing the accuracy of histograms obtained by combining different heuristics with different representation models (both the new compression-based ones and the traditional ones). The best accuracy is obtained by combining our grid-constrained partitioning scheme with one of the new heuristics. Histograms resulting from this combination are compared with state-of-the-art summarization techniques; the comparison shows that the proposed approach yields lower error rates and is much less sensitive to dimensionality, and that adopting our compression scheme improves the efficiency of query estimation.
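To make the idea of grid-constrained splits concrete, the following is a minimal Python sketch for two-dimensional data. It is not the authors' algorithm: the greedy criterion (largest reduction in sum of squared errors), the grid degree `k`, and all names (`Node`, `grid_splits`, `build`, `estimate`) are assumptions made for illustration. The point it shows is that, when every split must lie on a `k`-way grid over its bucket, a split can be stored as a small pair (dimension, grid index) instead of a full coordinate.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # (row_lo, row_hi, col_lo, col_hi), half-open

@dataclass(eq=False)
class Node:
    box: Box
    total: float                              # sum of the values inside the bucket
    split: Optional[Tuple[int, int]] = None   # (dimension, grid index) if split
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def box_sum(data, box):
    r0, r1, c0, c1 = box
    return sum(data[r][c] for r in range(r0, r1) for c in range(c0, c1))

def box_sse(data, box):
    # Sum of squared errors of the bucket around its mean value.
    r0, r1, c0, c1 = box
    mean = box_sum(data, box) / ((r1 - r0) * (c1 - c0))
    return sum((data[r][c] - mean) ** 2
               for r in range(r0, r1) for c in range(c0, c1))

def grid_splits(box, k):
    # Yield every binary split of the box that lies on a k-way grid.
    r0, r1, c0, c1 = box
    for dim, (lo, hi) in enumerate(((r0, r1), (c0, c1))):
        seen = set()
        for g in range(1, k):
            pos = lo + (hi - lo) * g // k
            if pos <= lo or pos >= hi or pos in seen:
                continue
            seen.add(pos)
            if dim == 0:
                yield dim, g, (r0, pos, c0, c1), (pos, r1, c0, c1)
            else:
                yield dim, g, (r0, r1, c0, pos), (r0, r1, pos, c1)

def build(data, max_buckets, k=4):
    # Greedy construction (an assumed heuristic): repeatedly apply the
    # grid-aligned split that reduces the total SSE the most.
    full = (0, len(data), 0, len(data[0]))
    root = Node(full, box_sum(data, full))
    leaves = [root]
    while len(leaves) < max_buckets:
        best = None
        for leaf in leaves:
            base = box_sse(data, leaf.box)
            for dim, g, lb, rb in grid_splits(leaf.box, k):
                gain = base - box_sse(data, lb) - box_sse(data, rb)
                if best is None or gain > best[0]:
                    best = (gain, leaf, dim, g, lb, rb)
        if best is None:  # no leaf can be split any further
            break
        _, leaf, dim, g, lb, rb = best
        leaf.split = (dim, g)
        leaf.left = Node(lb, box_sum(data, lb))
        leaf.right = Node(rb, box_sum(data, rb))
        leaves.remove(leaf)
        leaves += [leaf.left, leaf.right]
    return root

def estimate(node, query):
    # Estimate a range sum, assuming values are spread uniformly in each leaf.
    r0, r1, c0, c1 = node.box
    q0, q1, p0, p1 = query
    rows = min(r1, q1) - max(r0, q0)
    cols = min(c1, p1) - max(c0, p0)
    if rows <= 0 or cols <= 0:
        return 0.0
    if node.left is None:
        return node.total * rows * cols / ((r1 - r0) * (c1 - c0))
    return estimate(node.left, query) + estimate(node.right, query)

# Toy input: an 8x8 array with a dense 4x4 block in one corner.
data = [[10 if r < 4 and c < 4 else 0 for c in range(8)] for r in range(8)]
root = build(data, max_buckets=3, k=4)
print(root.split, estimate(root, (0, 2, 0, 2)))
```

With `k = 4` in two dimensions, each stored split needs only one bit for the dimension and two bits for the grid index, which illustrates why restricting splits to a grid permits a more compact encoding of bucket boundaries than storing arbitrary split coordinates.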

Keywords

Data reduction, Multi-dimensional data, Histograms, Query processing

Copyright information

© Springer-Verlag London Limited 2007

Authors and Affiliations

  • Filippo Furfaro (1)
  • Giuseppe M. Mazzeo (1)
  • Domenico Saccà (1, 2)
  • Cristina Sirangelo (1)

  1. DEIS, Università della Calabria, Rende (CS), Italy
  2. ICAR, CNR, Rende (CS), Italy
