JCC-H: Adding Join Crossing Correlations with Skew to TPC-H
We introduce JCC-H, a drop-in replacement for the data and query generator of TPC-H that introduces Join-Crossing-Correlations (JCC) and skew into its dataset and query workload. These correlations are carefully designed so that the filter predicates on table columns in the existing TPC-H queries can now affect the value, frequency and join-fan-out distributions experienced by operators in the query plan. The query generator of JCC-H can generate parameter bindings for the 22 query templates in two different equivalence classes: query templates that receive “normal” parameters do not experience skew and behave very similarly to default TPC-H queries, whereas query templates expanded with “skewed” parameters experience strong join-crossing-correlations and skew in filter, aggregation and join operations. In this paper we discuss the goals of JCC-H and its detailed design, and we present initial experiments on both a single-server and an MPP database system that confirm that our design goals were largely met. In all, JCC-H provides a convenient way for any system that is already tested with TPC-H to examine how well it handles skew and correlations. We hope the community can use it to make progress on issues such as skew mitigation and the detection and exploitation of join-crossing-correlations in query optimizers and data storage.
This paper is a result of the “Parallelism and Skew” working group at Dagstuhl seminar 17222 (Robust Performance in Database Query Processing). We would like to thank group members Johann-Christoph Freytag (HU Berlin), Alfons Kemper (TU Munich), Glenn Paulley (SAP Canada) and Kai-Uwe Sattler (TU Ilmenau) for their contributions. The research of A.C. Anadiotis was partially funded by the Swiss National Science Foundation, Project No.: 200021_146407/1 (FN–X–Core).