Predictable Performance for QoS-Sensitive, Scalable, Multi-tenant Function-as-a-Service Deployments

In this paper we present the results of our studies on enabling predictable performance for functions executing in scalable, multi-tenant Function-as-a-Service environments. We start by analyzing QoS and performance requirements and use cases from the point of view of End-Users, Developers and Infrastructure Owners. We then take a closer look at functions’ resource utilization patterns and investigate functions’ sensitivity to those resources. We focus specifically on CPU microarchitecture resources, as they have a significant impact on functions’ overall performance. As part of our studies we conducted experiments to examine the effect of co-locating different functions on the compute nodes. We discuss the results and describe how we modified the scheduling logic of our container orchestrator (Kubernetes), and how that affected functions’ execution times and performance variation. In doing so we leveraged low-level telemetry data, mostly exposed by the Intel® Resource Director Technology (Intel® RDT) [1]. Finally, we outline our future studies, which will center on node-level resource allocations aimed at further improving function performance, and conclude with key takeaways.


Introduction
The general Cloud Computing model relies on centralizing computing power and then redistributing it among multiple users and tenants. The benefits of this approach include inherent scalability and, from the end user's perspective, simplified resource management.
Additional layers built on top of Cloud Computing, such as Function-as-a-Service deployments, relieve service developers of even more of the burden of managing hardware and software resources. At the same time, however, resource providers must ensure that the performance of a service is stable and independent of the performance and resource utilization of other services running at the same time on the same set of resources.
In this paper we investigate the methods for improving services' performance stability, which we view as an important aspect of overall Quality of Service.

The Importance of Predictable Functions Performance
The predictability of function execution performance (most often measured as function execution time) is important for several reasons, and Users, Developers/Application Owners and Infrastructure Owners/Admins each have their own QoS and performance expectations.
In this study we define predictable performance in terms of the Coefficient of Variation (CV) of function execution time. The CV is defined as [2]:
$$c_v = \frac{\sigma}{\mu}$$
where $c_v$ is the coefficient of variation, $\sigma$ is the standard deviation and $\mu$ is the mean. We consider a function to have predictable performance when its CV is less than or equal to 15%; otherwise, we consider its performance unpredictable. When the average function execution time is 1 s and resource utilization is billed at 100 ms granularity, a 15% variation in execution time (150 ms) corresponds to up to 2 billing cycles, which we consider tolerable from the function owner's perspective.
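The CV criterion and the billing arithmetic above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say whether the population or the sample estimator was used.

```python
import math
import statistics

CV_THRESHOLD = 0.15  # the 15% predictability threshold stated in the text


def coefficient_of_variation(times_s):
    """Return c_v = sigma / mu for a list of execution times in seconds."""
    mean = statistics.fmean(times_s)
    # Assumption: population standard deviation; the paper does not specify.
    sigma = statistics.pstdev(times_s)
    return sigma / mean


def is_predictable(times_s, threshold=CV_THRESHOLD):
    """A function is treated as predictable when its CV is <= threshold."""
    return coefficient_of_variation(times_s) <= threshold


def billing_cycles_for_variation(mean_time_s, cv, granularity_s=0.1):
    """Number of billing cycles covered by a variation of cv * mean."""
    return math.ceil(mean_time_s * cv / granularity_s)
```

For example, `billing_cycles_for_variation(1.0, 0.15)` reproduces the case in the text: a 150 ms variation billed at 100 ms granularity spans up to 2 billing cycles.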
FaaS deployments are intrinsically multi-tenant and are expected to scale rapidly, on demand. To enable such scaling without sacrificing performance, we propose paying special attention to the utilization of CPU microarchitecture resources, as it directly correlates with functions' performance. A high-level view of the resources of an Intel Xeon processor is shown in Fig. 1.
Shared resources in particular, such as memory bandwidth to DRAM (controlled by the Integrated Memory Controller) and the Last Level Cache (i.e. the Third Level Cache), should be closely monitored, as minimizing contention on those resources improves overall function performance. In addition, on multi-socket platforms, crossing a socket or NUMA node boundary may incur a performance penalty (due to narrower remote memory bandwidth). A study of the impact of memory latency and memory bandwidth on workload performance is described in [3].
In general, the pool of CPU cores is also a constrained resource on which contention may occur. However, we left the task of assigning software threads to CPU cores to the Linux scheduler and did not interfere with it in our study.

Analyzing Functions Performance and Performance Predictability
In the following sections we describe how we analyzed functions' performance.
We started by gathering information about the functions' characteristics, especially their resource sensitivity patterns. That enabled us to analyze performance-related problems further and to propose solutions.

Test Stack and Test Functions
We conducted our experiments on a 4-node Kubernetes cluster, with 1 master and 3 worker nodes. In our experiments we use the following functions:
• Incept, which uses Tensorflow for image recognition
• Nmt, which uses Tensorflow for English to German translation
• Sgemm, which performs single-precision floating-point General Matrix Multiply
• Stream, the STREAM benchmark [5].

Introduction to Top-Down Microarchitecture Analysis Methodology
Our test functions were profiled using the Top-Down Microarchitecture Analysis methodology [6,7]. This approach helps identify the categories of platform resources, and the individual resources, that are most critical to the workload (e.g. a function) and can limit performance when not available. The results of the profiling, at high CPU utilization (ranging from 95 to 100%), are presented in Table 1. This knowledge can be leveraged to optimize scheduling and load balancing logic, so that functions' performance is not hampered by a lack of critical platform resources. This is especially important in large-scale, multi-tenant deployments, where noisy neighbor effects are most prominent.

Platform Resources Utilization Monitoring
During functions' execution we collect telemetry data to better understand resource utilization patterns. For each function instance, for each call, we collect the following:
• Memory bandwidth utilization: exposed by the Linux 'resctrl' filesystem; the source of the data is the Intel RDT Memory Bandwidth Monitoring technology
• Last Level Cache Occupancy: exposed by the Linux 'resctrl' filesystem; the source of the data is the Intel RDT Cache Monitoring Technology
• Last Level Cache Misses Per Kilo Instructions: exposed by the platform as a CPU architectural performance monitoring event; can be collected, for example, via the Linux perf tool
• CPU utilization: exposed by the Linux CGroup filesystem
We also record function execution times as an indicator of a function's performance.
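As an illustration of how such per-function counters might be turned into the metrics listed above, here is a minimal Python sketch. It assumes the usual 'resctrl' layout, in which each monitoring group directory contains `mon_data/mon_L3_*/` subdirectories with `mbm_total_bytes` and `llc_occupancy` files; the group directory passed in and all function names are hypothetical, and this is not the collection code used in the study.

```python
from pathlib import Path


def read_group_counter(group_dir, counter):
    """Sum one resctrl counter (e.g. 'mbm_total_bytes' or 'llc_occupancy')
    over all L3 domains of a monitoring group.

    Non-numeric contents (such as 'Unavailable') are skipped.
    """
    total = 0
    for path in sorted(Path(group_dir).glob(f"mon_data/mon_L3_*/{counter}")):
        text = path.read_text().strip()
        if text.isdigit():
            total += int(text)
    return total


def bandwidth_bytes_per_s(bytes_before, bytes_after, interval_s):
    """Memory bandwidth from two samples of a cumulative byte counter."""
    return (bytes_after - bytes_before) / interval_s


def mpki(llc_misses, instructions):
    """Last Level Cache misses per kilo (1000) retired instructions."""
    return llc_misses * 1000 / instructions
```

Because `mbm_total_bytes` is a cumulative counter, bandwidth has to be derived from two samples taken a known interval apart, whereas `llc_occupancy` can be read directly as an instantaneous value.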
Insight into nodes' resource utilization and availability is critical for improving the placement of functions on the nodes. The most important telemetry data that we collect for each compute node are:
• CPU utilization: exposed by the Linux /proc/stat file
• Memory bandwidth utilization: exposed by the CPU Performance Monitoring Unit (of the Integrated Memory Controller); can be calculated from events collected, for example, via the Linux perf tool
• Average memory latency: exposed by the CPU Performance Monitoring Unit (of the Integrated Memory Controller); can be calculated from events collected, for example, via the Linux perf tool.
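The node-level CPU utilization can be derived from two snapshots of the aggregate "cpu" line in /proc/stat. The sketch below is one common way to do this, not necessarily the method used in the study; the function names are hypothetical, and counting iowait as idle time is an assumption.

```python
def parse_cpu_line(stat_text):
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line.

    Only the first eight fields (user, nice, system, idle, iowait, irq,
    softirq, steal) are summed, because guest time is already included
    in user time.
    """
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:9]]
            idle = sum(values[3:5])  # assumption: idle + iowait count as idle
            return idle, sum(values)
    raise ValueError("no aggregate 'cpu' line found")


def cpu_utilization(stat_before, stat_after):
    """Fraction of non-idle CPU time between two /proc/stat snapshots."""
    idle0, total0 = parse_cpu_line(stat_before)
    idle1, total1 = parse_cpu_line(stat_after)
    delta_total = total1 - total0
    if delta_total <= 0:
        return 0.0
    return (delta_total - (idle1 - idle0)) / delta_total
```

In practice the two snapshots would be the contents of /proc/stat read a fixed interval apart, for example one second.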

Analyzing Functions' Co-location Cases
In this experiment we use "hey" [8] to stress the test functions. We start from a light load (low Requests-Per-Second, RPS, values) and keep increasing the load up to the point where all cores on the platform are utilized (36 in total across 2 sockets, 18 cores per function), which corresponds to high RPS values. Theoretically, functions with moderate memory bandwidth consumption should co-exist better on the same node than functions with high memory bandwidth requirements, because there is less contention on the resource required by both functions. We should also see improved function execution times and lower resource utilization when functions are not competing for the same shared resource.
The results for the "Incept" function scheduled along with other functions are depicted in Fig. 2. We can observe that when Incept is co-located with Sgemm it achieves the best performance predictability (lowest CV values across the RPS range) and the best throughput (lowest average function execution time). An optimal scheduler should co-locate Incept with Sgemm, rather than with Nmt or Stream. The worst co-location case is placing Incept and Nmt on the same node, which an optimal scheduler should avoid. Incept and Nmt are poor candidates for co-location because both are heavy memory bandwidth users and therefore natural contenders for this resource.
Table 2 compares average node resource utilization when Incept is co-located with Nmt (sub-optimal placement) and when Incept is co-located with Sgemm (optimal placement), in the case where all CPU cores on the platform are utilized.
Sub-optimal placement results in almost 20% higher memory bandwidth utilization, increased memory latency, and around 20% higher CPU utilization. As shown above, a wrong placement decision ultimately affects function execution time and execution time variability.

Scheduling Improvements
By leveraging per-container telemetry (especially memory bandwidth utilization) and per-node resource availability, we tried to improve the scheduling logic. In Kubernetes, which we use as our container orchestrator, scheduling is a two-stage process. In the first step (filtering) we exclude any nodes without enough available memory bandwidth. In the second step (prioritization) we assign scores to the nodes and select the node with the highest score. The scoring categories are:
• Available memory bandwidth: nodes are sorted by available memory bandwidth in descending order. The node with the most available memory bandwidth is assigned the highest score, and the node with the least available memory bandwidth is assigned the lowest score.
• Memory latency: nodes are sorted and assigned scores based on memory latency (lower values are preferred over higher values)
• CPU utilization: nodes are sorted based on available CPU (more available CPU means a higher score)
Scores from all categories are summed per node and the node with the highest overall score is selected.
The graph in Fig. 3 compares Incept's CV under the default scheduling logic with its CV under scheduling logic that takes memory bandwidth and memory latency into account. The scheduling enhancements were implemented using the Kubernetes scheduler extender mechanism [9].
For lower RPS (up to around 7), the scheduler extender reduces CV to an acceptable level (15%). Execution time also improves slightly, which can result in improved cluster throughput. These results can be improved further with the Intel RDT Memory Bandwidth Allocation feature, which we plan to leverage in future experiments.
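The two-stage logic described above can be sketched in Python as follows. This is only an illustration of the filtering and rank-based scoring idea, not the scheduler extender itself: the `Node` fields, their units, the rank-as-score scheme (worst node gets 0, best gets N-1) and the equal weighting of the three categories are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    available_mem_bw: float  # available memory bandwidth (unit assumed, e.g. GB/s)
    mem_latency: float       # average memory latency (unit assumed, e.g. ns)
    available_cpu: float     # fraction of CPU not in use, 0.0 to 1.0


def filter_nodes(nodes, required_mem_bw):
    """Filtering step: drop nodes without enough available memory bandwidth."""
    return [n for n in nodes if n.available_mem_bw >= required_mem_bw]


def rank_scores(nodes, key, higher_is_better):
    """Sort nodes from worst to best for one category and use the rank as score."""
    ordered = sorted(nodes, key=key, reverse=not higher_is_better)
    return {n.name: score for score, n in enumerate(ordered)}


def select_node(nodes, required_mem_bw):
    """Prioritization step: sum per-category scores and pick the best node."""
    candidates = filter_nodes(nodes, required_mem_bw)
    if not candidates:
        return None
    categories = [
        rank_scores(candidates, lambda n: n.available_mem_bw, True),
        rank_scores(candidates, lambda n: n.mem_latency, False),
        rank_scores(candidates, lambda n: n.available_cpu, True),
    ]
    totals = {n.name: sum(c[n.name] for c in categories) for n in candidates}
    return max(candidates, key=lambda n: totals[n.name])
```

With three hypothetical nodes, a node that fails the bandwidth filter is never selected even if it has the lowest latency and the most free CPU, which mirrors the order of the two steps in the text.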

Future Work
As a next step we plan to research how node-level allocation of resources (e.g. using Intel RDT Memory Bandwidth Allocation and Cache Allocation Technology) affects functions' performance.
We would also like to deepen our studies on differentiated performance for QoS-sensitive workloads. Service Level Agreements (SLAs) are commonly used for managing QoS. In its simplest form, an SLA can be expressed as a two-level function prioritization agreement, distinguishing between high and low priority tasks (e.g. functions). We would like to research how high-level SLAs can be mapped to resource allocations and how enforcing those allocations can be used to improve performance predictability even further.