Efficient Selectivity Estimation by Histogram Construction Based on Subspace Clustering

  • Andranik Khachatryan
  • Emmanuel Müller
  • Klemens Böhm
  • Jonida Kopper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6809)


Modern databases have to cope with multi-dimensional queries. For efficient processing of these queries, query optimization relies on multi-dimensional selectivity estimation techniques. These techniques in turn typically rely on histograms. A core challenge of histogram construction is the detection of regions with a density higher than the ones of their surroundings. In this paper, we show that subspace clustering algorithms, which detect such regions, can be used to build high quality histograms in multi-dimensional spaces. The clusters are transformed into a memory-efficient histogram representation, while preserving most of the information for the selectivity estimation. We derive a formal criterion for our transformation of clusters into buckets that minimizes the introduced estimation error. In practice, finding optimal buckets is hard, so we propose a heuristic. Our experiments show that our approach is efficient in terms of both runtime and memory usage. Overall, we demonstrate that subspace clustering enables multi-dimensional selectivity estimation with low estimation errors.


Data Space Subspace Cluster Query Optimization Selectivity Estimation Query Workload 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Aboulnaga, A., Chaudhuri, S.: Self-tuning histograms: building histograms without looking at data. In: SIGMOD 1999 (1999)Google Scholar
  2. 2.
    Aggarwal, C.C., Wolf, J.L., Yu, P.S., Procopiuc, C., Park, J.S.: Fast algorithms for projected clustering. SIGMOD Record 28(2) (1999)Google Scholar
  3. 3.
    Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Record 27(2) (1998)Google Scholar
  4. 4.
    Amenta, N.: Bounded boxes, hausdorff distance, and a new proof of an interesting helly-type theorem. In: SCG 1994 (1994)Google Scholar
  5. 5.
    Assent, I., Krieger, R., Müller, E., Seidl, T.: DUSC: Dimensionality unbiased subspace clustering. In: ICDM 2007 (2007)Google Scholar
  6. 6.
    Assent, I., Krieger, R., Müller, E., Seidl, T.: INSCY: Indexing subspace clusters with in-process-removal of redundancy. In: ICDM 2008 (2008)Google Scholar
  7. 7.
    Baltrunas, L., Mazeika, A., Bohlen, M.: Multi-dimensional histograms with tight bounds for the errorGoogle Scholar
  8. 8.
    Beyer, K.S, Goldstein, J., Ramakrishnan, R., Shaft, U.: When is nearest neighbor meaningful? In: Beeri, C., Bruneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 217–235. Springer, Heidelberg (1998)Google Scholar
  9. 9.
    Bruno, N., Chaudhuri, S., Gravano, L.: STHoles: a multidimensional workload-aware histogram. SIGMOD Record (2001)Google Scholar
  10. 10.
    Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to Algorithms. MIT Press, Cambridge (2009)zbMATHGoogle Scholar
  11. 11.
    Fuchs, D., He, Z., Lee, B.S.: Compressed histograms with arbitrary bucket layouts for selectivity estimation. Inf. Sci. 177, 680–702 (2007)CrossRefzbMATHGoogle Scholar
  12. 12.
    Gunopulos, D., Kollios, G., Tsotras, V.J., Domeniconi, C.: Approximating multi-dimensional aggregate range queries over real attributes. SIGMOD Record (2000)Google Scholar
  13. 13.
    Halkidi, M., Vazirgiannis, M.: Clustering validity assessment: finding the optimal partitioning of a data set. In: ICDM 2001 (2001)Google Scholar
  14. 14.
    Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco (2001)zbMATHGoogle Scholar
  15. 15.
    Ioannidis, Y.: The history of histograms (abridged). In: VLDB 2003 (2003)Google Scholar
  16. 16.
    Jolliffe, I.: Principal Component Analysis. Springer, New York (1986)CrossRefzbMATHGoogle Scholar
  17. 17.
    Kröger, P., Kriegel, H.P., Kailing, K.: Density-connected subspace clustering for high-dimensional data. In: SDM 2004 (2004)Google Scholar
  18. 18.
    Liu, Y., Li, Z., Xiong, H., Gao, X., Wu, J.: Understanding of internal clustering validation measuresGoogle Scholar
  19. 19.
    Luo, J., Zhou, X., Zhang, Y., Shen, H.T., Li, J.: Selectivity estimation by batch-query based histogram and parametric method. In: ADC 2007 (2007)Google Scholar
  20. 20.
    Müller, E., Assent, I., Günnemann, S., Krieger, R., Seidl, T.: Relevant Subspace Clustering: mining the most interesting non-redundant concepts in high dimensional data. In: ICDM 2009 (2009)Google Scholar
  21. 21.
    Müller, E., Günnemann, S., Assent, I., Seidl, T.: Evaluating clustering in subspace projections of high dimensional data. In: PVLDB, vol. 2(1) (2009)Google Scholar
  22. 22.
    Muthukrishnan, S., Poosala, V., Suel, T.: On rectangular partitionings in two dimensions: Algorithms, complexity, and applicationsGoogle Scholar
  23. 23.
    Poosala, V., Ioannidis, Y.E.: Selectivity estimation without the attribute value independence assumption. In: VLDB 1997 (1997)Google Scholar
  24. 24.
    Procopiuc, C.M., Jones, M., Agarwal, P.K., Murali, T.M.: A monte carlo algorithm for fast projective clustering. In: SIGMOD 2002 (2002)Google Scholar
  25. 25.
    Sequeira, K., Zaki, M.J.: Schism: A new approach for interesting subspace mining. In: ICDM 2004 (2004)Google Scholar
  26. 26.
    Srivastava, U., Haas, P., Markl, V., Kutsch, M., Tran, T.: ISOMER: Consistent histogram construction using query feedback. In: ICDE 2006 (2006)Google Scholar
  27. 27.
    Wang, H., Sevcik, K.C.: A multi-dimensional histogram for selectivity estimation and fast approximate query answering. In: CASCON 2003 (2003)Google Scholar
  28. 28.
    Yiu, M.L., Mamoulis, N.: Frequent-pattern based iterative projected clustering. In: ICDM 2003 (2003)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Andranik Khachatryan
    • 1
  • Emmanuel Müller
    • 1
  • Klemens Böhm
    • 1
  • Jonida Kopper
    • 1
  1. 1.Institute for Program Structures and Data Organization (IPD)Karlsruhe Institute of Technology (KIT)Germany

Personalised recommendations