In this section, we summarize our main results so far and provide a brief outlook toward future work.
Runtime Data Sharing Through Open and Decentralized Repositories
First, we present strategies that enable users to benefit from runtime data that was generated outside of their own execution environment.
In several of our prior works [16, 20, 31, 32, 33], we discussed the idea of exploiting similarities between different jobs and their executions, cultivating runtime data in a collaborative manner among numerous users and thereby improving the prediction capabilities of individual users. This includes decentralized system architectures for sharing context-aware runtime metrics, as well as similarity matching between jobs. An abstract depiction of this idea can be seen in Fig. 2.
Worth highlighting is C3O, where we took first steps toward system architectures that organize the sharing of context-aware runtime data and performance models. For this, we use repositories to share the source code of jobs together with runtime data on previous executions and their context. Further, we developed performance models that account for the individual execution contexts of different, globally distributed users.
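To make this concrete, the sketch below shows what a single shared runtime record in such a repository could look like. The field names and the JSON serialization are illustrative assumptions for this article, not the actual C3O data format.

```python
# Illustrative sketch of a shared runtime record: one measurement of a job
# execution together with its context, as a repository could store it next to
# the job's source code. Field names are assumptions, not the C3O schema.
from dataclasses import dataclass, asdict
import json

@dataclass
class RuntimeRecord:
    job_name: str            # e.g. "kmeans" or "grep"
    machine_type: str        # e.g. "c5.xlarge"
    scale_out: int           # number of worker VMs
    dataset_size_gb: float   # key dataset characteristic
    algorithm_params: dict   # e.g. {"k": 10} for K-Means
    runtime_s: float         # measured job runtime in seconds

record = RuntimeRecord("kmeans", "c5.xlarge", 8, 64.0, {"k": 10}, 412.7)
print(json.dumps(asdict(record), indent=2))  # serialized for sharing
```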
The performance of the C3O runtime predictor is presented in Fig. 3, where it is compared to Ernest. Both predictors were evaluated on a dataset of 930 unique Spark 2.4 jobs that were executed on Amazon EMR clusters consisting of different numbers of AWS machines of categories c, m, and r, and sizes large, xlarge, and 2xlarge, which represent different allocations of memory and vCPUs per VM. The jobs in this dataset further cover a variety of algorithm parameters (e.g. k in K-Means) and key dataset characteristics (e.g. the number of features and observations in Linear Regression). The full experimental setup is documented in the corresponding publication.
The C3O predictor already outperforms Ernest when trained only on data stemming from the same local context. This difference becomes even more pronounced when the predictors are trained with global runtime data, since C3O is context-aware and can make good use of the additional information, while Ernest cannot. Hence, our collaborative approach with context-aware runtime prediction models outperforms traditional single-user approaches, especially when shared runtime metrics are available.
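The following minimal sketch illustrates, on synthetic data, why pooled runtime data benefits a context-aware model more than a purely scale-out-based one. The assumed runtime behavior, the feature choices, and the use of gradient boosting are simplifications for illustration; they do not reproduce the actual C3O or Ernest models.

```python
# Sketch: the same regressor trained once on scale-out features only and once
# with additional context features (dataset size, algorithm parameter).
# Synthetic data with an assumed runtime behavior, for illustration only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 500
scale_out = rng.integers(2, 32, n)                         # number of worker VMs
data_gb = rng.uniform(10, 200, n)                          # dataset size (context)
k = rng.integers(2, 20, n)                                 # algorithm parameter (context)
runtime = data_gb * k / scale_out + rng.normal(0, 2, n)    # assumed behavior

X_scaleout_only = np.c_[1 / scale_out, scale_out]          # context-agnostic features
X_context_aware = np.c_[1 / scale_out, scale_out, data_gb, k]

for name, X in [("scale-out only", X_scaleout_only),
                ("context-aware", X_context_aware)]:
    model = GradientBoostingRegressor().fit(X[:400], runtime[:400])
    err = np.abs(model.predict(X[400:]) - runtime[400:]).mean()
    print(f"{name}: mean absolute error ~ {err:.1f} s")
```

With pooled data from many contexts, the context-agnostic variant has no way to distinguish executions that differ only in dataset size or algorithm parameters, which is the effect visible in Fig. 3.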
In another work extending the C3O system, we presented a way to minimize storage and transfer costs by reducing the training data while retaining model accuracy.
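As an illustration of this idea, the sketch below greedily thins a set of runtime records while monitoring cross-validated prediction error. The random-thinning strategy, the random forest model, and the stopping threshold are assumptions made for this sketch and are not the reduction technique of that work.

```python
# Hedged sketch: keep removing random fractions of the training data as long
# as cross-validated prediction error stays within a tolerance of the baseline.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def reduce_training_data(X, y, max_loss=0.02, step=0.1, seed=0):
    """Return indices of a reduced training set with near-baseline accuracy."""
    rng = np.random.default_rng(seed)
    def cv_error(idx):
        scores = cross_val_score(RandomForestRegressor(random_state=0),
                                 X[idx], y[idx],
                                 scoring="neg_mean_absolute_error", cv=5)
        return -scores.mean()

    keep = np.arange(len(X))
    base = cv_error(keep)                       # accuracy on the full data
    while len(keep) > 50:
        candidate = rng.choice(keep, size=int(len(keep) * (1 - step)),
                               replace=False)
        if cv_error(candidate) > base * (1 + max_loss):
            break                               # accuracy dropped too far: stop
        keep = candidate
    return keep                                 # indices of retained records
```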
Limitations and Future Work
A major limitation of the approach we presented in C3O is that it only works for well-established jobs (like Grep or K‑Means). We will therefore work on performance models and cluster configuration methods that do not rely on the particular job having been previously executed elsewhere. To solidify the sharing angle of our methods, we will work on approaches for data validation and establishing trust among users of collaborative systems that rely on shared runtime data.
Performance Prediction with Context-Aware and Reusable Models
In the following, we present first results on performance models that are able to detect and leverage differences in the execution context of jobs and are thus more reusable.
The previous subsection underlined the benefits of data and model sharing across users. It also demonstrated that a proper representation of a job execution in the form of runtime measurements has advantages. To this end, we achieved initial results on performance prediction with context-aware and reusable models [21, 29]. More specifically, we developed first models that are able to differentiate between and leverage different execution contexts, going beyond the fairly general performance models and automatic model selection of our earliest work on the topic, Bell. This idea is depicted in Fig. 4.
Our first efforts [25, 29] employed various regression models with comparatively few parameters. While fast to train, these models are too simple to sufficiently capture the execution context of dataflow jobs, leading to large estimation errors. More recently, we therefore investigated the applicability of neural networks for performance estimation and found that they can improve upon previous results by a significant margin. Specifically, we trained a multi-component neural network on data originating from various execution contexts. At its core, our neural network architecture Bellamy utilizes an auto-encoder for encoding and exploiting descriptive properties of the enclosing execution context. This approach effectively enables the reuse of data from various contexts and a better approximation of a job's scale-out behavior, leading to improved prediction results. For any encoded resource configuration as input, Bellamy predicts a runtime value, which can in turn be used to select the best candidate configuration according to user-defined objectives and runtime constraints.
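The simplified PyTorch sketch below captures the gist of such a multi-component model: an auto-encoder embeds a descriptive context property, and a prediction head maps the embedding together with an encoded resource configuration to a runtime estimate. Layer sizes, input encodings, and the exact component wiring are assumptions for illustration, not the published Bellamy architecture.

```python
# Simplified sketch of a context-aware runtime model with an auto-encoder
# component. In training, a reconstruction loss on `recon` would be combined
# with a regression loss on the predicted runtime.
import torch
import torch.nn as nn

class ContextAwareRuntimeModel(nn.Module):
    def __init__(self, ctx_dim=64, latent_dim=8, conf_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(ctx_dim, 32), nn.ReLU(),
                                     nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                     nn.Linear(32, ctx_dim))
        self.head = nn.Sequential(nn.Linear(latent_dim + conf_dim, 32),
                                  nn.ReLU(), nn.Linear(32, 1))

    def forward(self, ctx, conf):
        z = self.encoder(ctx)                   # latent context embedding
        recon = self.decoder(z)                 # reconstruction for the AE loss
        runtime = self.head(torch.cat([z, conf], dim=-1))
        return runtime.squeeze(-1), recon

model = ContextAwareRuntimeModel()
ctx = torch.rand(16, 64)    # e.g. encoded textual context properties (assumed)
conf = torch.rand(16, 4)    # e.g. scale-out, memory, cores, dataset size
pred_runtime, recon = model(ctx, conf)
```

Given such a model, candidate resource configurations can be encoded, their runtimes predicted, and the cheapest configuration satisfying a runtime constraint selected.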
The full experimental setup can be found in the corresponding publication. In short, we utilized the C3O dataset and investigated, among other things, the interpolation and extrapolation capabilities of our approach under varying data availability for model pre-training, using random sub-sampling cross-validation and various concrete model configurations. Fig. 5 shows the interpolation results for various dataflow jobs, where the advantage of our approach over comparative methods is especially evident for jobs with presumably non-linear scale-out behavior. It can be seen that even loosely related data can improve the prediction performance for the execution context at hand. We also find that this generally mitigates the cold-start problem through the incorporation of knowledge from historical workload executions. Moreover, such neural network models can be pre-trained and fine-tuned, and are thus reusable across contexts.
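This reuse can be sketched as a simple pre-train/fine-tune loop: a model is first fitted on pooled data from other execution contexts and then adapted with the few samples available in the local context. The tiny MLP, the synthetic tensors, and the training schedule below are placeholders chosen for brevity, not the procedure used in our experiments.

```python
# Hedged sketch of pre-training on pooled data from other contexts and
# fine-tuning on scarce local data (cold-start mitigation through reuse).
import torch
import torch.nn as nn

def fit(model, X, y, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(-1), y)
        loss.backward()
        opt.step()

model = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 1))

X_global, y_global = torch.rand(2000, 6), torch.rand(2000)   # other contexts
X_local, y_local = torch.rand(20, 6), torch.rand(20)         # few local samples

fit(model, X_global, y_global, epochs=200, lr=1e-3)   # pre-training
fit(model, X_local, y_local, epochs=50, lr=1e-4)      # fine-tuning
```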
Limitations and Future Work
Currently, our approach mainly leverages textual properties, which are often difficult to interpret and compare. Thus, a promising direction is the concretization of textual properties of a job execution, e.g. by measuring and including appropriate metrics. Moreover, there is room for improvement with regard to efficient model re-training from scratch, as well as appropriate stopping criteria in the case of limited available training data. We also plan to systematically incorporate further aspects of the execution context (such as explicit information about the underlying infrastructure) into prediction models in the future.
Runtime Adjustments of Performance Models and Resource Configurations
Next, we present concrete techniques to realize dynamic changes to cluster configurations for data-parallel distributed processing jobs.
The aforementioned results optimize initial resource allocations. Going beyond this, we worked on mitigating dynamic effects that cannot be foreseen accurately, such as changing input data distributions or interference with co-located jobs. To continuously maintain efficiency while meeting provided runtime targets, we have developed approaches for dynamic adjustments of performance models and resource configurations [22, 26]. The general idea is depicted in Fig. 6.
Worth discussing here is that we implement dynamic adjustments along the structure of iterative jobs, namely at the synchronization barriers between subsequent job iterations, to re-assess and, if needed, change configurations. In particular, we build upon the idea initially proposed with Ellis, where we introduced a dynamic horizontal scaling method for distributed dataflow jobs with runtime targets, leveraging the fact that iterative jobs can be logically dissected into many individual stages. Yet, in contrast to training an ensemble of stage-specific, specialized performance prediction models, with Enel we employ a single global graph model that is trained on the entire available execution data. The directed acyclic graph of tasks in individual stages, as well as the graph of all stages on a meta-level, is annotated with descriptive properties and collected monitoring metrics and exploited by our global model for rescaling recommendations via runtime prediction and subsequent prediction-based configuration ranking. Moreover, in contrast to Ellis, our newer method Enel does not simply employ heuristics to assess the trade-off between rescaling and its overhead, but learns the expected overhead in a data-driven manner and incorporates it into the decision-making process through consideration of the aforementioned properties and graph structures.
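A strongly simplified sketch of such a prediction-based rescaling decision at a synchronization barrier is shown below. A plain callable stands in for the learned graph model, and the overhead term, candidate range, and selection rule are illustrative assumptions rather than Enel's actual decision logic.

```python
# Sketch: at a synchronization barrier, rank candidate scale-outs by predicted
# remaining runtime plus a (learned) rescaling overhead and pick the smallest
# scale-out that still meets the remaining runtime target.
def choose_scale_out(current, candidates, remaining_target_s,
                     predict_remaining_runtime, predict_rescaling_overhead):
    """Return the smallest candidate scale-out expected to meet the target."""
    for scale_out in sorted(candidates):
        expected = predict_remaining_runtime(scale_out)
        if scale_out != current:                      # account for rescaling cost
            expected += predict_rescaling_overhead(current, scale_out)
        if expected <= remaining_target_s:
            return scale_out
    return max(candidates)                            # fall back to maximum scale-out

# Example with toy stand-in predictors (not learned models):
pick = choose_scale_out(
    current=8, candidates=range(4, 37), remaining_target_s=600,
    predict_remaining_runtime=lambda s: 3000 / s,     # assumed scaling behavior
    predict_rescaling_overhead=lambda a, b: 20 + 2 * abs(a - b),
)
print(pick)
```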
The full experimental setup can be found in the corresponding publication, where we compare Enel against Ellis on a selection of four commonly employed iterative Spark jobs running in a commodity cluster of 50 machines. Each job was executed 65 times, where for some executions we simulated anomalous behavior by randomly injecting failures into Spark executors. Moreover, we investigated scale-outs ranging from 4 to 36 Spark executors. We found that this single model is more robust and reusable across the stages of dataflow jobs. Though requiring a sufficient amount of data, which manifests in a longer profiling phase, our global graph model tends to better capture the enclosing execution context in the long run. It is superior in detecting and mitigating anomalous execution behavior, as shown in Fig. 7, and requires only a single generalized model instead of a multitude of specialized ones.
Limitations and Future Work
A major limitation of our current approach is the assumption of jobs being executed in isolation, ignoring potential interference with co-located jobs. Moreover, the selection and appropriate representation of monitoring metrics remains a challenge. In the future, we further intend to integrate forecasting methods into our approaches to enable proactive dynamic resource configuration, and we will work on more accurately identifying points in time that are especially suitable for performance model updates. In addition, we want to investigate how our approach to cluster resource configuration can be used in combination with similar ideas for indexing or query processing.