Holistic Shuffler for the Parallel Processing of SQL Window Functions
Window functions are a sub-class of analytical operators that allow data to be handled in a derived view of a given relation, while taking into account their neighboring tuples. Currently, systems bypass parallelization opportunities which become especially relevant when considering Big Data as data is naturally partitioned. We present a shuffling technique to improve the parallel execution of window functions when data is naturally partitioned when the query holds a partitioning clause that does not match the natural partitioning of the relation. We evaluated this technique with a non-cumulative ranking function and we were able to reduce data transfer among parallel workers in 85 % when compared to a naive approach.
KeywordsWindow Function PARTITION Clause Naturally Partitioned Reduce Data Transfer Neighboring Tuples
In this paper we propose an Holistic shuffler which according to the partitioning considered by the ongoing window function will instruct workers to handle specific partitions according to the data sizes they hold, minimizing data transfer among workers. We present the design and action of the shuffler which is based on prior knowledge of the data size distribution of each column in the relation, reflecting the data size held in each partition rather than considering the actual data value as seen in common use of database indexes. The preliminary evaluation of our mechanism shows that our approach is able reduce data transfer by 85 % when compared with a naive approach.
Roadmap: In the remainder of this paper, Sect. 2 introduces the distribution considered and Sect. 3 presents the design and architecture of the Holistic Shuffler we propose. Section 4 accesses our approach. Section 5 briefly reviews related work and overviews our contributions.
2 Data Transfer Statistics
Histograms are commonly used by query optimizers as they provide a fairly accurate estimate on the data distribution, which is crucial for a query planner. An histogram allows to map keys to their observed frequencies. Database systems use these structures to measure the cardinality of keys or key ranges. Without histograms, the query planner would have to assume uniform distribution of data, leading to incorrect partitioning, particularly with skewed data , a common characteristic of non synthetic workloads.
When a query engine has to generate parallel query execution plans to be dispatched to distinct workers, each one holding a partition of data; having histograms like the aforementioned ones is an asset, but it does not completely present an heuristic that could be used to enhance how parallel workers would share preliminary and final results. This is so as such histograms only introduce and insight about the cardinality of each partition key. In order to improve bandwidth usage, thus reducing the amount of traded information, the histogram also needs to reflect the volume of data existing in each node, instead of just considering the row cardinality.
2.1 Histogram Construction
Algorithm 1 governs how each partition histogram \((hist\_P_n)\) should be built. Briefly, each attribute (attr) is traversed and for each key, the total number of distinct occurrences of that key is computed, together with its size. The pair of values is then added to the histogram. The tables in Fig. 2(a), (b) and (c) present the resulting histograms for each partition according to Fig. 1.
When all workers have completed computing the histogram regarding its own physical partition, they need to share it with a designated worker, so that the global histogram is also computed. The global histogram will traverse each physical partition histogram and evaluate, for each key, which is the physical partition that holds the largest volume (in size, evaluating the \(cardinality \times average\_row\_size\)). The table in Fig. 2(d) depicts the final result of the global histogram. The global histogram will have the same number of attributes of each partition histogram. Please note that the keys for both the physical partition and the global histograms are not the primary keys of the relation, but rather the distinct values found in each attribute during the construction of each partition histogram. Therefore, we provide a brief example on how to read this histogram. Consider that a given query requires data to be partitioned according to attribute A. Then, the histogram informs that key 1 and 2 have the largest volume of data respectively in partitions p1,p3 and, and regarding key 3, partitions p2 and p3 both hold the same volume.
3 Holistic Shuffler
The Holistic Shuffler leverages on the data distribution data collected by the Global Histogram in order to expedite shuffling operations. The Shuffle operator can be translated into a SEND primitive that forwards a bounded piece of data to a given destination, considering the underlying network to be reliable. During the workflow for processing a window operator, there are two different moments where data needs to be shuffled. The first moment occurs immediately after the operator starts, and its goal is to reunite partitions, thus fulfilling the locality requirement. The second moment occurs in the end of the operator and it is intended to reconcile partial results in order to produce the final result.
The Global Shuffle will forward all aggregated rows to the worker that will hold the overall largest data volume, which will from now on designate as master worker. By instructing the workers that hold the least volume of data, we are promoting the minimal usage of bandwidth possible. Algorithm 3 reflects the behavior for the Global Shuffle operator. The input data considered by the Global Shuffler is composed by the ordered and aggregated rows, both produced by earlier stages of the worker work flow. Such rows will now have to be reconciled by a common node, which for this case will be dictated by the master node, as previously stated. Upon start, the Global Shuffle operator will interrogate the histogram regarding the identity of the master node. Afterwards, as each aggregated row is handled by the operator, it is forwarded to the master worker, if the master node is not the current one.
For comparison purposes, we report the results by using a naive approach and our Holistic Shuffler. The naive approach, instead of using any knowledge to forward data, disseminates all data among all participating workers. The results in both pictures are depicted according to a logarithmic scale, in the average of 5 independent tests for each configuration.
The Holistic technique we propose required in average only 14.7 % of the rows required for the Naive approach to reunite all the partition in each computing node, as depicted in Fig. 3(a). The large difference is justified by the fact that the naive approach reunites partitions by forwarding data among all participating nodes, which intrinsically creates duplicates in each node, growing in proportion to the number of nodes. The Local Shuffling stage is depicted in Fig. 3(b), in which we varied the number of computing nodes that participate in the computation of the ranking query, verifying the number of rows that were forwarded according to each technique.
5 Related Work and Conclusion
Despite its relevance, optimizations considering this operator are scarce in the literature. The work by  or  are some of the exceptions. Respectively, the first overcomes optimization challenges related with having multiple window functions in the same query, while the second presents a more broad use of window functions, showing that it is possible to use them as a way to avoid sub-queries and reducing execution time down from quadratic time.
In this paper we proposed an Holistic Shuffler, tailored to be used for the efficient parallel processing of queries with non-cumulative window functions. The design is based on a statistical method that can be used to reduce the amount of data transfered among computing nodes of a distributed query engine, where data is naturally partitioned. Moreover, the preliminary evaluation we present shows that by applying this methodology we were to reduce in average 85 % of data transferred among computing nodes. As future work, we plan to translate this approach to a real query engine.
This work was part-funded by project LeanBigData: Ultra-Scalable and Ultra-Efficient Integrated and Visual Big Data Analytics (FP7-619606), and by the ERDF – European Regional Development Fund through the Operational Programme for Competitiveness and Internationalisation - COMPETE 2020 Programme within project \(\ll \)POCI-01-0145-FEDER-006961\(\gg \), and by National Funds through the FCT – Fundação para a Ciẽncia e a Tecnologia (Portuguese Foundation for Science and Technology) as part of project UID/EEA/50014/2013.
- 1.Reactive programming (2015). http://reactivex.io
- 2.Reactive programming for java (2015). https://github.com/ReactiveX/RxJava
- 4.Chen, G., Vo, H.T., Wu, S., Ooi, B.C., Özsu, M.T.: A framework for supporting DBMS-like indexes in the cloud. Proc. VLDB Endowment 4(11), 702–713 (2011)Google Scholar
- 5.Garcia-Molina, H.: Database Systems: The Complete Book. Pearson Education, India (2008)Google Scholar
- 6.Poosala, V., Ganti, V., Ioannidis, Y.E.: Approximate query answering using histograms. IEEE Data Eng. Bull. 22(4), 5–14 (1999)Google Scholar
- 8.Zuzarte, C., Pirahesh, H., Ma, W., Cheng, Q., Liu, L., Wong, K.: Winmagic: subquery elimination using window aggregation. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 652–656. ACM (2003)Google Scholar
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.