Workload-Aware Self-Tuning Histograms of String Data

  • Nickolas Zoulis
  • Effrosyni Mavroudi
  • Anna Lykoura
  • Angelos Charalambidis
  • Stasinos Konstantopoulos
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9261)


In this paper we extend STHoles, a very successful algorithm that uses query results to build and maintain multi-dimensional histograms of numerical data. Our contribution is the formal definition of extensions of all relevant concepts; such that they are independent of the domain of the data, but subsume STHoles concepts as their numerical specialization. At the same time, we also derive specializations for the string domain and implement these into a prototype that we use to empirically validate our approach. Our current implementation uses string prefixes as the machinery for describing string ranges. Although weaker than regular expressions, prefixes can be very efficiently applied and can capture interesting ranges in hierarchically structured string domains, such as those of filesystem pathnames and URIs. In fact, we base the empirical validation of the approach on existing, publicly available Semantic Web data where we demonstrate convergence to accurate and efficient histograms.


Resource Description Framework Query Result Query Execution Resource Description Framework Data Resource Description Framework Triple 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



The work described here was partially carried out at the 2014 edition of the International Research-Centred Summer School, held at NCSR ‘Demokritos’, Athens, Greece, 3–30 July 2014. For more details please see

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007–2013) under grant agreement No. 318497. More details at


  1. 1.
    Aboulnaga, A., Chaudhuri, S.: Self-tuning histograms: Building histograms without looking at data. In: Proceedings of the 1999 ACM International Conference on Management of Data (SIGMOD 1999), pp. 181–192. ACM (1999)Google Scholar
  2. 2.
    Bruno, N., Chaudhuri, S., Gravano, L.: STHoles: a multidimensional workload-aware histogram. In: Proceedings of 2001 ACM International Conference on Management of Data (SIGMOD 2001), pp. 211–222. ACM (2001)Google Scholar
  3. 3.
    Srivastava, U., Haas, P.J., Markl, V., Kutsch, M., Tran, T.M.: ISOMER: Consistent histogram construction using query feedback. In: Proceedings of the 22nd International Conference on Data Engineering (ICDE 2006). IEEE Computer Society (2006)Google Scholar
  4. 4.
    Roh, Y.J., Kim, J.H., Chung, Y.D., Son, J.H., Kim, M.H.: Hierarchically organized skew-tolerant histograms for geographic data objects. In: Proceedings of 2010 ACM International Conference on Management of Data (SIGMOD 2010), pp. 627–638. ACM (2010)Google Scholar
  5. 5.
    Chaudhuri, S., Ganti, V., Gravano, L.: Selectivity estimation for string predicates: Overcoming the underestimation problem. In: Proceedings of 20th International Conference on Data Engineering (ICDE 2004). IEEE Computer Society (2004)Google Scholar
  6. 6.
    Lim, L., Wang, M., Vitter, J.S.: CXHist: An on-line classification-based histogram for XML string selectivity estimation. In: Proceedings of the 31st International Conference on Very Large Data Bases (VLDB 2005), Trondheim, Norway, 30 August – 2 September 2005, pp. 1187–1198 (2005)Google Scholar
  7. 7.
    Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R.S., Peng, Y., Reddivari, P., Doshi, V., Sachs, J.: Swoogle: A search and metadata engine for the Semantic Web. In: Proceedings of the 13th ACM International Conference on Information and Knowledge Management (CIKM 2004), pp. 652–659. ACM (2004)Google Scholar
  8. 8.
    Auer, S., Demter, J., Martin, M., Lehmann, J.: LODStats – an extensible framework for high-performance dataset analytics. In: ten Teije, A., Völker, J., Handschuh, S., Stuckenschmidt, H., d’Acquin, M., Nikolov, A., Aussenac-Gilles, N., Hernandez, N. (eds.) EKAW 2012. LNCS, vol. 7603, pp. 353–362. Springer, Heidelberg (2012) Google Scholar
  9. 9.
    Langegger, A., Wöss, W.: RDFStats - an extensible RDF statistics generator and library. In: Proceedings of DEXA 2009, pp. 79–83 (2009)Google Scholar
  10. 10.
    Harth, A., Hose, K., Karnstedt, M., Polleres, A., Sattler, K.U., Umbrich, J.: Data summaries for on-demand queries over linked data. In: Proceedings of 19th International World Wide Web Conference (WWW 2010), Raleigh, NC, USA, 26–30 April 2010 (2010)Google Scholar
  11. 11.
    Charalambidis, A., Konstantopoulos, S., Karkaletsis, V.: Dataset descriptions for optimizing federated querying. In: Poster Track, Companion Volume to the Procedings of the 24th Intl World Wide Web Conference (WWW 2015), Florence, Italy, 18–22 May 2015. ACM (2015)Google Scholar

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Nickolas Zoulis
    • 1
  • Effrosyni Mavroudi
    • 2
  • Anna Lykoura
    • 3
  • Angelos Charalambidis
    • 4
  • Stasinos Konstantopoulos
    • 4
  1. 1.Computer ScienceAthens University of Economics and BusinessAthensGreece
  2. 2.Electrical and Computer EngineeringNational Technical University of AthensKesarianiGreece
  3. 3.Applied Mathematical and Physical ScienceNational Technical University of AthensKesarianiGreece
  4. 4.Institute of Informatics and TelecommunicationsNCSR ‘Demokritos’AthensGreece

Personalised recommendations