Our optimizer is part of Rheem, an open-source cross-platform system. For the sake of simplicity, we henceforth refer to our optimizer simply as Rheem. We have carried out several experiments to evaluate the effectiveness and efficiency of our optimizer. As our work is the first to provide a complete cross-platform optimization framework, we compared it with individual platforms and common practices; for a system-level comparison, refer to [4]. Note that we did not compare our optimizer with a rule-based optimization approach for two main reasons. First, defining simple rules based on the input dataset size, as in SystemML [15], does not always work: There are non-obvious cases where, even if the input is small (e.g., 30 MB), it is better to use a big data platform, such as Spark, as we will see in the following. Thus, rules need to be more complex and descriptive. Second, defining complex rules requires substantial expertise and results in a huge rule base. For example, Myria requires hundreds of rules for only three platforms [63]. This is not only time-consuming, but also hard to extend and maintain when new platforms are added.
Table 1 Tasks and datasets

We evaluate our optimizer by answering the following main questions. Can our optimizer enable Rheem to: (i) choose the best platform for a given task (Sect. 9.2); (ii) spot hidden opportunities for cross-platform processing that improve performance (Sect. 9.3); and (iii) effectively re-optimize an execution plan on the fly (Sect. 9.4)? Last but not least, we also evaluate the scalability (Sect. 9.5) and design choices (Sect. 9.6) of our optimizer.
General setup
Hardware We ran all our experiments on a cluster of 10 machines, each with a 2 GHz quad-core Xeon processor, 32 GB main memory, a 500 GB SATA hard disk, a 1 Gigabit network card, and 64-bit Ubuntu Linux 14.04.05.
Processing and storage platforms We considered the following platforms: Java’s Streams (JavaStreams), PostgreSQL 9.6.2 (PSQL), Spark 2.4.0 (Spark), Flink 1.7.1 (Flink), GraphX 1.6.0 (GraphX), Giraph 1.2.0 (Giraph), a self-written Java graph library (JGraph), and HDFS 2.6.5 to store files. We used all of them with their default settings and set the RAM of each platform to 20 GB. We disabled the progressive optimization feature of our optimizer so as to first study its upfront optimization techniques in isolation; we study the effect of progressive optimization in Sect. 9.4.
Tasks and datasets We considered a broad range of data analytics tasks from different areas, namely text mining (TM), relational analytics (RA), machine learning (ML), and graph mining (GM). Details on the datasets and tasks are shown in Table 1. These tasks and datasets individually highlight different features of our optimizer and together demonstrate its general applicability. To challenge Rheem and allow it to choose among most of the available platforms, most tasks’ input datasets are stored on HDFS (except when specified otherwise). We also considered a polystore case where data is dispersed among different stores (PolyJoin), which we evaluate separately in the polystore experiment (Sect. 9.3).
Cost model To learn the parameters required for the operators’ cost functions, we first generated a number of execution logs using a set of 10 training tasks (Grep, InvertedIndex, SetDifference, SetIntersection, TPC-H Q1 and Q2, PageRank, SVM, Knn, and InclusionDependency) with synthetic datasets of varying sizes. We then used a genetic algorithm to learn the cost function parameters from these logs. Last, as estimating UDF selectivities is beyond the scope of this paper, we assume accurate selectivities for the first set of experiments, which study the upfront optimization. This gives us a better view of how Rheem performs when not affected by wrong cardinality estimates. In Sect. 9.4, we study the progressive optimization and use estimated selectivities computed as discussed in Sect. 4.2.3.
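To make this calibration step concrete, the following is a minimal, self-contained sketch of fitting cost-function parameters to execution logs with a genetic algorithm. The linear cost template (runtime ≈ a·in + b·out + c), the log entries, and all identifiers are illustrative assumptions only; Rheem’s actual cost functions are richer than this template.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

/** Minimal sketch: fit cost-function parameters to execution logs with a genetic algorithm. */
public class CostModelFitting {
    // Hypothetical log entries: (input cardinality, output cardinality, observed runtime in ms).
    static final double[][] LOGS = {
        {1_000, 800, 12}, {100_000, 70_000, 950}, {1_000_000, 900_000, 9_800}, {5_000_000, 3_000_000, 51_000}
    };

    // Assumed cost template: runtime ~ a * in + b * out + c.
    static double cost(double[] p, double in, double out) { return p[0] * in + p[1] * out + p[2]; }

    // Fitness: mean relative error over all logged executions (lower is better).
    static double fitness(double[] p) {
        double err = 0;
        for (double[] log : LOGS) err += Math.abs(cost(p, log[0], log[1]) - log[2]) / log[2];
        return err / LOGS.length;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int popSize = 100;
        double[][] pop = new double[popSize][];
        for (int i = 0; i < popSize; i++)
            pop[i] = new double[]{rnd.nextDouble() * 1e-2, rnd.nextDouble() * 1e-2, rnd.nextDouble() * 100};
        for (int gen = 0; gen < 500; gen++) {
            Arrays.sort(pop, Comparator.comparingDouble(CostModelFitting::fitness));
            // Keep the fittest half; refill the rest with crossover + mutation.
            for (int i = popSize / 2; i < popSize; i++) {
                double[] a = pop[rnd.nextInt(popSize / 2)], b = pop[rnd.nextInt(popSize / 2)];
                double[] child = new double[3];
                for (int j = 0; j < 3; j++) {
                    child[j] = rnd.nextBoolean() ? a[j] : b[j];                // crossover
                    if (rnd.nextDouble() < 0.1) child[j] *= 0.5 + rnd.nextDouble(); // mutation
                }
                pop[i] = child;
            }
        }
        Arrays.sort(pop, Comparator.comparingDouble(CostModelFitting::fitness));
        System.out.printf("a=%.6f b=%.6f c=%.2f (mean rel. error %.3f)%n",
                pop[0][0], pop[0][1], pop[0][2], fitness(pop[0]));
    }
}
```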
Repeatability All the numbers we report are the average of three runs on the datasets of Table 1. To ensure repeatability, we will provide the code of all our experimental tasks, SQL queries, datasets, and a detailed guideline on how to reproduce our experiments.
Single-platform optimization
Applications might need to switch platforms depending on the input datasets and/or tasks in order to achieve better performance. We call this use case platform independence [40]. We thus start our experiments by evaluating how well Rheem selects a single platform to execute a task.
Experiment setup For Rheem, we forced our optimizer to use a single platform throughout a task and checked whether it chose the one with the best runtime. We ran all the tasks of Table 1 with increasing dataset sizes. Note that we did not run PolyJoin because it requires using several platforms. For the real-world datasets, we took samples of increasing size from the initial datasets. We also increased the input datasets up to 1 TB for most tasks in order to further stress the optimizer. Note that, due to their complexity, we do not report the 1 TB numbers for Word2NVec and SimWords: None of the platforms managed to finish in reasonable time. The number of iterations for CrocoPR, K-means, and SGD is 10, 100, and 1,000, respectively.
Experiment results Figure 7 shows the execution times for all our tasks with increasing dataset sizes. The stars denote the platform selected by our optimizer. First of all, let us stress that the results show significant differences in the runtimes of the different platforms, even between Spark and Flink, which are competing big data platforms. For example, Flink can be up to 2.4\(\times\) faster than Spark, and Spark can be up to 2\(\times\) faster than Flink. Thus, it is crucial to prevent tasks from falling into such non-obvious worst cases.
The results in Fig. 7 show that our optimizer indeed makes robust platform choices whenever runtimes differ substantially. This effectiveness in choosing the right platform transparently prevents applications from using suboptimal platforms. For instance, it prevents running: (i) Word2NVec on Spark for 5% and 100% of its input dataset, where Spark performs worse than Flink because it employs only two compute nodes (one for each input data partition), while Flink uses all of them; (ii) SimWords on JavaStreams for 1% of its input dataset (\(\sim 30\) MB): as SimWords performs many CPU-intensive vector operations, using JavaStreams (i.e., a single compute node) simply slows down the entire process; (iii) WordCount on Flink for 800% of its input dataset (i.e., 24 GB) and for 1 TB, where, in contrast to Spark, Flink suffers from a slower data reduce mechanism; (iv) Aggregate on Flink for scale factors higher than 200, because Flink tends to write to disk often when dealing with large groups (formed by the \(\mathsf {GroupBy}\) operator); and (v) CrocoPR on JGraph for more than 10% of its input dataset, as JGraph simply cannot process large datasets efficiently, as well as on Spark and Flink for 1 TB, where performance deteriorates due to the large number of created objects. Thus, our optimizer is capable of discovering non-obvious cases: For example, for Word2NVec and SimWords, a simple rule-based optimizer based on input cardinalities would choose JavaStreams for the small input of 30 MB (i.e., \(1\%\) of the dataset). However, JavaStreams is 7\(\times\) to 12\(\times\) slower than Spark and Flink in these two cases.
We also observe that Rheem generally chooses the right platform even in difficult cases where the execution times are quite similar on different platforms. For example, it always selects the right platform for Aggregate and Join even though the execution times on Spark and Flink are quite similar in most cases. Only in a few of these difficult cases does the optimizer fail to choose the best platform, e.g., for Word2NVec and SimWords on 0.1% of the input data: The accuracy of our optimizer is sensitive to uncertainty factors, such as cost and cardinality estimates. Still, all these results allow us to conclude that our optimizer chooses the best platform for almost all tasks and prevents tasks from falling into worst-case executions.
Multi-platform optimization
Table 2 Opportunistic cross-platform breakdown
We now study the efficiency of our optimizer when using multiple platforms for a single task. We evaluate whether our optimizer: (i) allows Rheem to spot hidden opportunities for using multiple platforms to improve performance (the opportunistic experiment); (ii) performs well in a data lake setting (the polystore experiment); and (iii) efficiently complements the functionalities of one platform with another to perform a given task (the complementary-platforms experiment).
Opportunistic experiment We re-enabled Rheem to use any platform combination. We used the same tasks and datasets as before, with three differences: We ran (i) Kmeans on 10\(\times\) its entire dataset with a varying number of centroids, (ii) SGD on its entire dataset with increasing batch sizes, and (iii) CrocoPR on 10% of its input dataset with a varying number of iterations.
Figure 8 shows the results. Overall, we find that, in the worst case, Rheem matches the performance of any single-platform execution, while in several cases it considerably improves over single-platform executions. Table 2 illustrates the platform choices that our optimizer made as well as the cross-platform data transfer per iteration for all our tasks. We observe Rheem to be up to 20\(\times\) faster than Spark, up to 15\(\times\) faster than Flink, up to 22\(\times\) faster than JavaStreams, and up to 2\(\times\) faster than Giraph. There are several reasons for these large improvements. For SGD, Rheem decided to handle the model parameters, which are typically tiny (\(\sim 0.1\) KB for our input dataset), with JavaStreams, while it processed the data points (typically a large dataset) with Spark. For CrocoPR, our optimizer surprisingly uses a combination of Flink, JGraph, and JavaStreams, even though Giraph is the fastest baseline platform (for 10 iterations). This is because, after the preparation phase of this task, the input dataset for the \(\mathsf {PageRank}\) operation on JGraph is only \(\sim 544\) MB. For WordCount, Rheem surprisingly detected that moving the result data (\(\sim 82\) MB) from Spark to JavaStreams and afterward shipping it to the driver application is slightly faster than using Spark only. This is because, when moving data to JavaStreams, Rheem uses the \(\mathsf {Rdd.collect()}\) action, which is more efficient than the \(\mathsf {Rdd.toLocalIterator()}\) operation that Spark offers to move data to the driver. For Aggregate, our optimizer selects Flink and Spark, which allows it to run this task slightly faster than the fastest baseline platform. It achieves this improvement by (i) exploiting Flink’s native fast stream data processing mechanism for the projection and selection operations, and (ii) avoiding Flink’s slow data reduce mechanism by using Spark for the \(\mathsf {ReduceBy}\) operation. In contrast to all previous tasks, Rheem can afford to transfer \(\sim 23\%\) of the input data because it uses two big data platforms for processing this task. All these are surprising results per se. They show not only that Rheem outperforms state-of-the-art platforms, but also that it can spot hidden opportunities to improve performance and become much more robust.
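To illustrate why the SGD split pays off, below is a minimal, self-contained sketch in plain Spark (toy data and hypothetical names, not Rheem’s actual operator code): the large set of data points lives in Spark, while the tiny model is held and updated on the driver, mimicking the JavaStreams side of Rheem’s plan.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

/** Minimal sketch of the SGD split: large data points on Spark, tiny model on the driver. */
public class HybridSgdSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("hybrid-sgd").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Toy 1-d data points (x, y); in Rheem's setting these would come from HDFS.
            JavaRDD<double[]> points = sc.parallelize(java.util.Arrays.asList(
                    new double[]{1, 2}, new double[]{2, 4}, new double[]{3, 6})).cache();
            double w = 0.0;   // the ~0.1 KB model, kept on the driver
            double lr = 0.01;
            for (int i = 0; i < 100; i++) {
                final double wNow = w;
                // Gradients of the squared loss are computed in parallel on Spark ...
                double grad = points.map(p -> 2 * (wNow * p[0] - p[1]) * p[0])
                                    .reduce(Double::sum);
                // ... while the cheap model update happens locally, avoiding a distributed job.
                w = wNow - lr * grad / points.count();
            }
            System.out.println("w = " + w); // converges to ~2.0 for the toy data
        }
    }
}
```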
To further stress the importance of finding hidden cross-platform execution opportunities, we ran a subtask (JoinX) of PolyJoin. This task computes the account balance ratio between a supplier and all customers in the same nation and then the average ratio per nation. For this, it joins the relations SUPPLIER and CUSTOMER (which are stored on Postgres) on the attribute nationkey and aggregates the join results on the same attribute. For this additional experiment, we compare Rheem with the execution of JoinX on Postgres, which is the obvious platform to run this kind of query. The results are displayed in Fig. 9. Remarkably, we observe that Rheem significantly outperforms Postgres, even though the input data is stored there. In fact, Rheem is 2.5\(\times\) faster than Postgres for a scale factor of 10. This is because it simply pushes the projection operation down into Postgres and then moves the data into Spark to perform the join and aggregation operations, thereby leveraging Spark’s parallelism. We thus confirm that our optimizer indeed identifies hidden opportunities to improve performance and becomes more robust by using multiple platforms.
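The following sketch illustrates this execution strategy using Spark SQL (this is not Rheem’s generated code): the projections are pushed into Postgres via the JDBC query option (available as of Spark 2.4), while the join and aggregation run in parallel on Spark. The connection URL and schema details are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

/** Minimal sketch of the JoinX strategy: push projections into Postgres, join/aggregate in Spark. */
public class JoinXSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("joinx").master("local[*]").getOrCreate();
        String url = "jdbc:postgresql://dbhost/tpch";   // hypothetical connection string
        // Only the needed columns are fetched, i.e., the projection runs inside Postgres.
        Dataset<Row> supplier = spark.read().format("jdbc").option("url", url)
                .option("query", "SELECT s_nationkey, s_acctbal FROM supplier").load();
        Dataset<Row> customer = spark.read().format("jdbc").option("url", url)
                .option("query", "SELECT c_nationkey, c_acctbal FROM customer").load();
        // Join and aggregation run in parallel on Spark.
        supplier.join(customer, col("s_nationkey").equalTo(col("c_nationkey")))
                .withColumn("ratio", col("s_acctbal").divide(col("c_acctbal")))
                .groupBy("s_nationkey").agg(avg("ratio"))
                .show();
        spark.stop();
    }
}
```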
Finally, we demonstrate that our optimizer can cope with heterogeneity in the underlying cluster. To illustrate this, we emulated 2 straggler nodes (leaving 8 unaffected workers) by running background applications that slow these machines down. We also modified the cost model to take the straggler nodes into account. Figure 10 shows the results for one task of each type. We observe that Spark, Flink, and Giraph are affected by the straggler nodes, which slightly decrease their performance.
However, even in such a case, Rheem manages to choose the best platform(s), as such information can be incorporated into its UDF-based cost model.
Polystore experiment We now consider the PolyJoin task, which takes the CUSTOMER, LINEITEM, NATION, ORDERS, REGION, and SUPPLIER TPC-H tables as input. We assume that the large LINEITEM and ORDERS tables are stored on HDFS, the medium-size tables CUSTOMER, REGION, and SUPPLIER on Postgres, and the small NATION table on a local file system (LFS). In this scenario, the common practice is either to move the data into a relational database and run the queries inside the database [24, 59], or to move the data entirely to HDFS and use Spark. We consider these two cases as the baselines and measure the total runtime as the sum of data migration and task execution times. Rheem, in contrast, processes the input datasets directly on the data stores where they reside and moves data only if necessary. For a fair comparison, we set the parallel query and effective IO concurrency features of Postgres to 4.
Figure 11a shows the results for this experiment. The results are clear: Rheem is significantly faster, up to 5\(\times\), than moving the data into Postgres and running the query there. In particular, we observed that, even if we discard the data migration times, Rheem performs quite similarly to Postgres. This is because Rheem can parallelize most of the task execution by using Spark. We also observe that our optimizer has negligible overhead compared to the case where the developer writes ad-hoc scripts to move the data to HDFS and runs the task on Spark. In fact, Rheem is 3\(\times\) faster than Spark for scale factor 1, because it moves less data from Postgres to Spark. As soon as the Postgres tables get larger, reading them from HDFS rather than directly from Postgres becomes more beneficial thanks to HDFS’s parallel reads. This shows the substantial benefits of our optimizer not only in terms of performance but also in terms of ease of use: Users do not need to write ad-hoc scripts to integrate different platforms.
Complementary-platforms experiment To evaluate this feature, we consider the CrocoPR and Kmeans tasks. In contrast to previous experiments, we assume both input datasets (\(\mathsf {DBpedia}\) and \(\mathsf {USCensus1990}\)) to be stored on Postgres. As implementing these tasks inside Postgres would be highly impractical and yield inferior performance, it is important to move the computation to a different processing platform. In this experiment, we consider the ideal case as the baseline, i.e., the case where the data is already on a platform able to perform the task: We assume that the data is on HDFS and that Rheem uses either JavaStreams or Spark to run the tasks.
Figure 11b shows the results. We observe that Rheem achieves performance similar to the ideal case in almost all scenarios. This is a remarkable result because, in contrast to the ideal case, Rheem needs to move data out of Postgres to a different processing platform. These results clearly show that our optimizer frees users from the burden of complementing the functionalities of diverse platforms, without sacrificing performance.
Progressive optimization
We proceed to evaluate the utility of the progressive optimization feature of our optimizer in the presence of incorrect estimates.
Experiment setup We enabled the progressive optimization (PO) feature of our optimizer and considered the Join task for this experiment. We extended the Join task with a selection predicate of low selectivity on the names of the suppliers and customers. To simulate the usual case where users cannot provide accurate selectivity estimates, we provided a high-selectivity hint to Rheem for this filter operator.
Experiment results Figure 12 shows the results for this experiment. We clearly observe the benefits of our progressive optimizer. In more detail, our optimizer first generates an execution plan using Spark and JavaStreams. It uses JavaStreams for all the operators after the \(\mathsf {Filter}\) because it expects the \(\mathsf {Filter}\) to have a very high selectivity. However, Rheem then figures out that the \(\mathsf {Filter}\) in fact has a low selectivity. Thus, it runs the re-optimization process and changes all JavaStreams operators to Spark operators on the fly. This allows it to speed up performance by almost 4\(\times\). Last but not least, we observed during our experiment that the PO feature of Rheem incurs a negligible overhead (less than 2%) over using platforms natively.
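A minimal sketch of this mechanism is shown below. The types, the checkpoint API, and the 2\(\times\) tolerance threshold are hypothetical stand-ins, not Rheem’s actual interfaces: the point is merely that observed cardinalities are compared against the estimates and, upon a large mismatch, the not-yet-executed part of the plan is re-enumerated.

```java
/** Hypothetical stand-ins for the optimizer's internal types. */
interface Plan {}

final class Optimizer {
    static Plan reoptimize(Plan remainingPlan, long observedCardinality) {
        // Re-enumerate platform choices for the not-yet-executed operators,
        // now using the observed (instead of the estimated) cardinality.
        return remainingPlan; // placeholder
    }
}

final class ProgressiveCheck {
    private static final double TOLERANCE = 2.0; // assumed threshold: re-optimize if off by more than 2x

    /** Called at a checkpoint, e.g., right after the Filter operator has produced its output. */
    static Plan onCheckpoint(Plan remainingPlan, long estimatedCard, long observedCard) {
        double ratio = (double) Math.max(observedCard, 1) / Math.max(estimatedCard, 1);
        if (ratio > TOLERANCE || ratio < 1.0 / TOLERANCE) {
            // E.g., a high-selectivity hint turned out wrong: switch the remaining
            // JavaStreams operators to Spark operators on the fly.
            return Optimizer.reoptimize(remainingPlan, observedCard);
        }
        return remainingPlan; // estimates were good enough; keep the current plan
    }
}
```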
Optimizer scalability
We continue our experimental study by evaluating the scalability of our optimizer to determine whether it operates efficiently on large Rheem plans and with a large number of platforms.
Experiment setup We start by evaluating our optimizer’s scalability in terms of the number of supported platforms and then proceed to evaluate it in terms of the number of operators in a Rheem plan. For the former, we considered hypothetical platforms with full Rheem operator coverage and three communication channels each. For the latter, we generated Rheem plans with two basic topologies that we found to be at the core of many data analytic tasks: pipeline and tree.
Experiment results Figure 13a shows the optimization time of our optimizer for Kmeans when increasing the number of supported platforms. The results for the other tasks are similar. As expected, the optimization time increases with the number of platforms. This is because (i) the CCG gets larger, challenging our MCT algorithm, and (ii) our lossless pruning has to retain more alternative subplans. Still, we observe that our optimizer (the no top-k series in Fig. 13a) performs well for a practical number of platforms: It takes less than 10 s with 5 different platforms. Moreover, one can leverage our algebraic formulation of the plan enumeration problem to easily augment our optimizer with a simple top-k pruning strategy, which retains only the k best subplans when applied to an enumeration. To do so, we just have to specify an additional rule for the Prune operator (see Sect. 6.1) to obtain a pruning strategy that combines the lossless pruning with a top-k one, as sketched below: While the former keeps intermediate subplans diverse, the latter removes the worst plans. Doing so allows our optimizer to scale gracefully with the number of platforms: For example, for \(k=8\), it takes less than 10 s for 10 different platforms (the top-8 series in Fig. 13a). Figure 13b shows the results regarding the scalability of our optimizer with respect to the number of operators in a task. We observe that our optimizer scales to very large plans for both topologies. In practice, we do not expect situations with more than five platforms and plans with more than a hundred operators; in fact, the tasks in our workload contain 15 operators on average. All these numbers show the high scalability of our optimizer.
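The sketch below illustrates such a combined Prune rule. All types are hypothetical: in particular, the key we group on stands in for whatever the lossless pruning uses to deem subplans interchangeable (e.g., their boundary execution operators, Sect. 6).

```java
import java.util.*;
import java.util.stream.Collectors;

/** Minimal sketch of a Prune rule combining lossless pruning with a top-k cap (hypothetical types). */
final class TopKLosslessPrune {
    record SubPlan(String boundaryConfig, double estimatedCost) {}

    static List<SubPlan> prune(List<SubPlan> enumeration, int k) {
        // Lossless step: keep only the cheapest subplan per boundary configuration,
        // so that every way of connecting the subplan to its neighbors survives.
        Collection<SubPlan> lossless = enumeration.stream()
                .collect(Collectors.toMap(
                        SubPlan::boundaryConfig,
                        p -> p,
                        (a, b) -> a.estimatedCost() <= b.estimatedCost() ? a : b))
                .values();
        // Top-k step: additionally retain only the k cheapest survivors.
        return lossless.stream()
                .sorted(Comparator.comparingDouble(SubPlan::estimatedCost))
                .limit(k)
                .collect(Collectors.toList());
    }
}
```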
Optimizer internals
We finally conducted several experiments to further evaluate the efficiency of our optimizer. We study five different aspects: (i) how well our pruning technique reduces the search space; (ii) how important the order in which our enumeration algorithm processes join groups is; (iii) how effective our channel conversion graph (CCG) is; (iv) how accurate our cost model is; and (v) where the time is spent in the entire optimization process.
Lossless pruning experiment We first compare our lossless pruning strategy (Sect. 6) with several alternatives, namely no pruning at all and a pure top-k pruning. In contrast to Sect. 9.5, where we used the top-k pruning to augment our lossless pruning, we now consider it independently. Figure 14 shows the efficiency of all pruning strategies (on the left) as well as their effectiveness (on the right), i.e., the estimated execution times of their optimized plans. Note that we did not use the actual plan execution times to assess the effectiveness of our enumeration strategy, in order to eliminate the influence of the calibration of the cost functions. As a first observation, we see that pruning is crucial overall: An exhaustive enumeration was not possible for SimWords and CrocoPR because of the large number of possible execution operators in these plans. We also found that the top-1 strategy, which merely selects the best alternative for each inflated operator, prunes too aggressively and fails to detect the optimal execution plan in 3 out of 7 cases. While the numbers seem to suggest that the remaining lossless and top-10 pruning strategies are of equal value, there is a subtle difference: The lossless strategy guarantees to find the optimal plan (w.r.t. the cost estimates) and is thus superior.
Join group ordering experiment We next analyze the importance of the join group order (see Sect. 6.3) by comparing it with a random order. Figure 15a shows that ordering the join groups is indeed crucial for the tree topology. This is not the case for the pipeline topology, where ordering the join groups does not seem to exert any measurable influence on the optimization time.
CCG experiment Next, we evaluate the effectiveness of our channel conversion graph (CCG) approach for data movement. For this experiment, we compare our CCG approach with an HDFS-based data movement approach, i.e., moving data only through writing to and reading from HDFS files. Figure 15b shows the results in terms of runtime. We observe that for Kmeans, Rheem can be more than one order of magnitude faster when using the CCG than when using only HDFS files for data movement. For SGD and CrocoPR, it is always more than one order of magnitude faster. This shows the importance of well-planned data movement.
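To give the intuition, below is a minimal sketch of finding a cheapest conversion between two channel types on a CCG with Dijkstra’s algorithm. All channel names and costs are made up, and this covers only the single-consumer special case; the actual MCT problem generalizes it to trees, since an intermediate result may have to reach several consumers at once.

```java
import java.util.*;

/** Minimal sketch: cheapest channel conversion on a CCG via Dijkstra (hypothetical channels/costs). */
final class ChannelConversionSketch {
    record Edge(String target, double cost) {}

    // Hypothetical channel conversion graph: edges are converters with estimated costs.
    static final Map<String, List<Edge>> CCG = Map.of(
            "Rdd", List.of(new Edge("HdfsFile", 10.0), new Edge("JavaCollection", 2.0)),
            "JavaCollection", List.of(new Edge("JavaStream", 0.5)),
            "HdfsFile", List.of(new Edge("FlinkDataSet", 5.0), new Edge("JavaStream", 8.0)),
            "JavaStream", List.of(),
            "FlinkDataSet", List.of());

    static double cheapest(String source, String target) {
        Set<String> settled = new HashSet<>();
        PriorityQueue<Map.Entry<String, Double>> pq = new PriorityQueue<>(Map.Entry.comparingByValue());
        pq.add(Map.entry(source, 0.0));
        while (!pq.isEmpty()) {
            Map.Entry<String, Double> e = pq.poll();
            if (!settled.add(e.getKey())) continue;          // already settled with a cheaper cost
            if (e.getKey().equals(target)) return e.getValue();
            for (Edge edge : CCG.getOrDefault(e.getKey(), List.of()))
                pq.add(Map.entry(edge.target(), e.getValue() + edge.cost()));
        }
        return Double.POSITIVE_INFINITY;                      // no conversion path exists
    }

    public static void main(String[] args) {
        // Rdd -> JavaStream: the collect-based route (2.0 + 0.5) beats the detour via HDFS (10.0 + 8.0).
        System.out.println(cheapest("Rdd", "JavaStream"));    // prints 2.5
    }
}
```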
Cost model experiment We now validate the accuracy of our cost model. Note that similarly to traditional cost-based optimizers in databases, our cost model aims at enabling the optimizer to choose a good plan while avoiding worst cases. That is, it does not aim at precisely estimating the running time of each plan.
Thus, we evaluate the accuracy of our cost model by determining which plan of the search space our optimizer chooses. The ideal evaluation would be to exhaustively run all possible execution plans and validate that our optimizer chooses the best plan or one close to it. However, running all plans is infeasible: It would already take several weeks for the small WordCount task with only 6 operators. For this reason, in Fig. 16a, we plot for SGD and WordCount: (i) the real execution times of the three plans with the minimum estimated runtime; and (ii) the minimum, maximum, and average of the real execution times of 100 randomly chosen plans.
We make the following observations: First, the plan with the lowest estimated runtime indeed has the minimum real execution time among all plans we ran (including the second- and third-best plans). Second, the first three plans have better runtimes not only than the average real execution time of the randomly chosen plans, but also than the minimum execution time of the randomly chosen plans. Based on these observations, we conclude that our cost model is sufficient for our optimizer to choose a near-optimal plan.
Breakdown experiment Last, we analyze where the time is spent throughout the entire optimization process. Figure 16b shows the breakdown of our optimizer’s runtime across its several phases for several tasks. At first, we note that the average optimization time amounts to slightly more than a second, which is several orders of magnitude smaller than the time savings reported in the previous experiments. The lion’s share of the runtime goes to source inspection, which obtains cardinality estimates for the source operators of a Rheem plan (e.g., by inspecting an input file). This could be improved, e.g., with a metadata repository or caches. In contrast, the enumeration and the MCT discovery finish in the order of tens of milliseconds, even though they are of exponential complexity. The pruning technique is key to keeping the enumeration time low, while the MCT discovery works satisfactorily for the moderate number of underlying platforms used in our experiments.