Provisioning Input and Output Data Rates in Data Processing Frameworks

This paper is motivated by the need of deadline-bounded applications in live mobile network environments for a guaranteed and appropriately shared input/output (I/O) data rate. However, data processing frameworks currently support only requests for memory and computing capacity. In this paper, we propose a solution that enables the control of disk I/O and network I/O for data processing applications in the YARN and Mesos frameworks. Experimental results show that our tool can provision the I/O data rate shares of competing data processing applications.


Introduction and Motivation
When an application submits a job, a data processing framework such as Apache Hadoop [1,26], Hadoop YARN [25], or Mesos [2] reserves and allocates the computing resources necessary for the execution of the job. Most resource models include the amount of memory and the number of virtual CPU cores, but contain no information on the disk I/O capability of the commodity servers or on the network I/O rate. Since a computing cluster can be built up of heterogeneous hardware and software components, the I/O data rate perceived by applications is unpredictable due to contention for the resources of the physical servers, and there is a need to monitor applications running on these platforms as well [12]. As shown in [11,24], the I/O contention of applications leads to the degradation of quality of service.
Nowadays, telecommunication operators often apply such frameworks in their computing clusters to regularly process big data sets with specific deadlines. Therefore, guaranteeing the I/O data rate and controlling the sharing of the available data rate among applications are critical issues in the environment of telecommunication operators. When applications compete for a resource at the hardware or network level, which is hidden from programmers (and therefore from applications), they may suffer I/O performance degradation.
Motivated by this need, we design a complete solution that can be applied to provision the I/O data rate of applications in both the Mesos and YARN frameworks. Note that this result (code available at https://github.com/dohona/hadoop) has gradually been improved over the years based on our previous experiences [10,11,24]. We demonstrate that the proposed functionalities can be integrated into two popular data processing frameworks, Mesos and YARN, to control the I/O data rates (disk I/O and network I/O) of applications, which may relieve service providers of the burden of integrating schedulers into existing frameworks. We also show that our solution can be applied to provision the share of the I/O data rate of competing, deadline-bounded applications.
The rest of this paper is organised as follows. A summary of related works is provided in Section 2. The design and the implementation are described in Section 3. Experimental results are presented in Section 4. Section 5 concludes the paper.

Related Work
In this section, we provide a short overview of papers that dealt with issues related to our work. Kc and Freeh [15] proposed a feedback-based dynamic approach for controlling the parallelism of concurrent containers at the node level. This method takes into account the percentage of time spent by the CPU in user mode, the number of processes blocked on I/O, and the total number of context switches of each worker node to suit an application and avoid bottlenecks for all types of MapReduce applications in a YARN cluster. The characteristics of MapReduce workloads (such as CPU-bound and I/O-bound) are considered in scheduling algorithms for improving the performance of the Hadoop cluster [17]. Ko et al. [16] addressed the overhead of I/O data block processing in virtualized Hadoop clusters and proposed a large segment scheme for the I/O ring (the structure between the frontend and backend driver for transferring I/O requests) to improve the performance perceived by applications. Spivak and Nasonov [23] suggested a distributed cache to preload data before its computation in Hadoop systems. They demonstrated that their method effectively reduces the execution time of MapReduce jobs, particularly when the I/O time is lower than the CPU-intensive phase. Jung et al. [21] proposed to integrate the Hadoop Distributed File System (HDFS) with a lightweight adaptive I/O system that supports various I/O methods. The study shows that by choosing the optimal I/O method for a particular platform, the reduction of I/O time increases the system performance. Malensek et al. [19] studied the disk I/O contention between virtual machines and proposed a scheduler based on Linux tools to provide the sharing of disk I/O. Enes et al. [13] investigated resource contention and its effects on PaaS services.
They proposed a Platform as a Service (PaaS) architecture where disk I/O is offered as a resource. Liao et al. [18] presented a new I/O scheduling scheme to yield better I/O performance in the servers of distributed/parallel file systems. Amamou et al. [9] proposed a dynamic bandwidth allocator that limits the network bandwidth of virtual machines in a cloud environment.
The idea of co-locating different scheduler frameworks in the same data center has been discussed for the benefit of operators and service providers. For example, shaping only the HDFS read traffic in a YARN cluster using the Linux Traffic Control (LTC) mechanism was proposed in [11], while the disk I/O problem of Hadoop MapReduce and Spark applications [3,28] was investigated in [10,24]. Xu and Zhao [27] presented an interposed big-data I/O scheduler to provide I/O performance differentiation for competing applications in a shared big-data system. It is worth mentioning that the work by Xu and Zhao [27] is close to our previous solution [11]; however, they did not present the details of their solution. In this paper, we propose a tool with a set of functionalities to monitor and control the competition of applications in production environments, which is the result of improving the works presented in [10,11,24]. The main contribution of this paper is the systematic design of a tool based on exploring the functionality needed in both the Mesos and YARN frameworks. Furthermore, we show the provisioning capability of our tool in a real use case with competing, deadline-bounded applications.

Functionalities to Control the I/O Data Throughput of Applications
In this section, we provide an overview of the design and implementation of our tool.

Design
Our aim is a set of functionalities that control and enforce the I/O data rate for applications in data processing frameworks. Following the ITU-T G.1000 recommendation [22], a procedure to guarantee resources should therefore include:
- the specification of the resource requirements of users and their applications,
- information about the shared compute cluster,
- the maximum capacity of the resource (e.g. the disk I/O capacity) of the cluster,
- the amount of resources occupied by containers and applications,
- a unified resource management strategy,
- a set of policies to schedule and allocate the resources for applications,
- a set of mechanisms to isolate the granted amount of the resource given to a container,
- a set of features and tools to enhance the ability to perform and automate operational tasks such as configuration, monitoring, auditing, etc.
As a consequence, we propose software components for I/O monitoring and enforcement in Fig. 1.
The Controller keeps information about the capabilities of the physical servers and also maintains the status information of the agents. The agents running on the physical servers must report the capacity of the new resource type of the corresponding worker node, monitor the amount of the new resource type occupied by containers, and apply an isolation mechanism to enforce the resource usage of containers. The agent executed on each physical machine contains two components: (i) the ResourceMonitor, which monitors the resource usage of the running containers and of the worker machine, and (ii) the IOController, which performs resource enforcement.
The tool forms a layer between a big data processing framework and an underlying operating system. The tool maps the abstract resource quantities to the real capabilities of commodity servers and networks. The tool interacts with the underlying operating system to monitor and enforce the use of resources in the computing cluster and the data flow of HDFS. Furthermore, it provides monitoring data, such as the resource usage of the containers, Data Nodes, etc. as well as the free resources for the resource management frameworks. For the coordination of the controller with agents and data frameworks (i.e., the exchange of resource reservation, allocation, monitoring and enforcement information), the following interfaces are defined.
- Interface I1 is used to configure the Controller and the agents.
- Interface I2 is used to exchange data on I/O resource usage between our tool and the data processing frameworks.
- Interface I3 is used to obtain information about containers and their I/O data rate requirements.
- Interface Ca1 is used to exchange data between the agents and the Controller.

The ResourceMonitor Component
The software modules of the ResourceMonitor component are plotted in Fig. 2. The ResourceMonitor is responsible for
- registering and maintaining the information of containers requested by the computing frameworks,
- monitoring the resource usage of the containers that have been started,
- collecting all processes belonging to each container,
- observing all established HDFS connections and collecting the host:port information of each container,
- forming and sending enforcement requests to the IOController component in order to modify and perform the appropriate enforcement,
- unregistering a container and stopping the monitoring of its resource usage.
When a new container is launched by YARN (or Mesos), the NodeManager of YARN (or the Mesos slave) sends the information through an interface to the ResourceMonitor component residing on each machine. A container registration message includes
- the unique string identifier (containerId) of the container,
- the identifier (pid) of the process spawned to handle the container, and
- the resource requirements (with the resource type, the HDFS read limit rate, and the disk I/O read throttling rate).
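As an illustration, such a registration message could be represented as follows; the field names and units are assumptions of this sketch, not the exact wire format of our tool:

```python
from dataclasses import dataclass

# Sketch of the container registration message described above.
# Field names are illustrative assumptions, not the tool's exact format.
@dataclass
class ContainerRegistration:
    container_id: str          # unique string identifier of the container
    pid: int                   # process spawned to handle the container
    resource_type: str         # e.g. "hdfs_read" or "disk_read"
    hdfs_read_limit_bps: int   # HDFS read limit rate (bytes/s)
    disk_read_limit_bps: int   # disk I/O read throttling rate (bytes/s)

reg = ContainerRegistration(
    container_id="container_1500000000001_0001_01_000002",
    pid=12345,
    resource_type="hdfs_read",
    hdfs_read_limit_bps=10 * 2**20,   # 10 MB/s
    disk_read_limit_bps=20 * 2**20,   # 20 MB/s
)
```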
The details of the registration process when a new container is launched by YARN or Mesos are presented in Sections 3.2.1 and 3.2.2. The ResourceMonitor has the following main software modules.
- ContainerManager provides an interface for registering the containers reported by the slaves, i.e. the YARN NodeManager or the Mesos slave.
- Proc Monitor detects the establishment of TCP connections and collects the incoming HDFS connections of a given container cumulatively, by periodically walking the /proc/net directory (which contains the networking parameters and statistics of a Linux machine) and parsing the output of the Linux built-in ss utility, which displays socket statistics. Proc Monitor also monitors a container's resource usage and periodically collects the child processes of the reported process. It then either delivers these data to the IOController component or submits them to the Controller.
The cooperating frameworks report the establishment of the launched containers through an interface with the container data (e.g. pid, resource requirements). After the registration of a container, the Proc Monitor periodically monitors the container to collect the list of processes belonging to the container as well as the list of its connections to HDFS, besides other resource usage. These are used to form enforcement requests that are sent to the adjacent IOController for enforcing the local disk I/O or for shaping the local HDFS read traffic. If an HDFS read connection involves a remote DataNode, the appropriate enforcement request is submitted to the remote agent. It is worth mentioning that in the case of YARN MapReduce, one container is spawned to handle exactly one map/reduce task, whereas in the case of Spark on Mesos, one container is spawned to handle a Spark executor, which in turn may run many tasks in parallel. Hence the requirement specifications (e.g. the HDFS read limit rate) should be determined based on the number of current tasks in the container and must be adjusted during the lifetime of the container.
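The connection-collection step of the Proc Monitor can be sketched as a pure function that extracts the established HDFS read connections of a container's processes from ss-style output (ss -tnp). The line format and the DataNode transfer port (50010, the Hadoop 2 default) are assumptions of this sketch:

```python
import re

HDFS_DATA_PORT = 50010  # default DataNode transfer port in Hadoop 2 (assumption)

def hdfs_connections(ss_output, pids):
    """Extract the established HDFS connections belonging to the given
    container processes from `ss -tnp`-style output."""
    conns = []
    for line in ss_output.splitlines():
        fields = line.split()
        # expected fields: State Recv-Q Send-Q Local:Port Peer:Port Process
        if len(fields) < 6 or fields[0] != "ESTAB":
            continue
        local, peer, proc = fields[3], fields[4], fields[5]
        m = re.search(r"pid=(\d+)", proc)
        if not m or int(m.group(1)) not in pids:
            continue  # connection does not belong to the container
        if peer.rsplit(":", 1)[1] == str(HDFS_DATA_PORT):
            conns.append((local, peer))
    return conns
```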

The IOController Component
The modules of the IOController component of the agent are illustrated in Fig. 3. The IOController performs enforcement based on the information received from the ResourceMonitor component and the Controller. The IOController component has the following modules.
-Forwarder receives and routes the enforcement requests to the appropriate handler modules.
- NetworkTrafficController enforces the data rate of the TCP/IP connections between the DataNodes and a particular container. TCP flows are grouped into classes specified by filters with various conditions (e.g., priority, interface, or originating program). The TCP/IP flows in one class share the bandwidth equally. The classes are enforced by the class-based queueing (CBQ) scheduler [6]. LTCEventBuilder builds the appropriate filters to classify the data flows. The information is then passed to TrafficControlDeviceExecutor, which enforces the I/O data rate on the flows using the Linux tc program. Note that there may be multiple instances of TrafficControlDeviceExecutor on one machine; each instance corresponds to one network interface card (NIC). TcFilterExecutor creates and specifies the class a given flow belongs to, while TcClassExecutor issues the enforcement instructions for the CBQ scheduler. It is worth emphasizing that the control traffic of HDFS belongs to a class of the highest priority.
- DiskIOController performs the disk I/O enforcement for the requested containers. DiskIOController has two submodules to perform a disk I/O enforcement.
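As a rough illustration of what TcClassExecutor and TcFilterExecutor issue, the following sketch builds the tc commands that place the HDFS read traffic coming from a given DataNode into a rate-limited CBQ class. It assumes a root CBQ qdisc (handle 1:) has already been created on the device; the class numbering, link bandwidth, and option values are illustrative, and the DataNode port defaults to the Hadoop 2 value:

```python
def cbq_enforcement_commands(dev, classid, rate_mbit, datanode_ip,
                             datanode_port=50010):
    """Build the tc commands for one rate-limited CBQ class and the u32
    filter that steers a DataNode's flow into it. Values are illustrative."""
    return [
        # class giving the flow its bandwidth share (issued by TcClassExecutor)
        f"tc class add dev {dev} parent 1: classid 1:{classid} cbq "
        f"bandwidth 1000mbit rate {rate_mbit}mbit allot 1514 prio 5 bounded",
        # filter mapping the DataNode flow into the class (TcFilterExecutor)
        f"tc filter add dev {dev} parent 1: protocol ip prio 2 u32 "
        f"match ip src {datanode_ip}/32 match ip sport {datanode_port} 0xffff "
        f"flowid 1:{classid}",
    ]
```

In a real agent these strings would be executed by the TrafficControlDeviceExecutor instance bound to the given NIC.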
- IOMapper provides the mapping of the abstract resource quantities in an enforcement request (used by a cluster resource management framework) to the real capabilities of the commodity servers (used by the underlying operating systems).
The IOController is also responsible for cleaning up the settings of the CGroups Block I/O Controller and of the LTC-based TCP/IP Controller when a container is removed. For example, when HDFS traffic is enforced, the ResourceMonitor submits an empty list of connections for the container; then all LTC settings related to this container are deleted on each related DataNode machine. In the initialization phase, the TrafficControlDeviceExecutor collects all the old LTC settings. These data are then merged with the up-to-date information to generate the new LTC settings for the enforcement. The old settings are deleted, and the new LTC settings are applied.
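The mapping performed for the CGroups Block I/O Controller can be sketched as follows: a disk read-rate request is translated into the "major:minor bytes" line accepted by the blkio.throttle.read_bps_device file of the container's cgroup. The injectable stat parameter is a convenience of this sketch, not part of the tool:

```python
import os

def blkio_throttle_entry(device_path, read_bps, stat=os.stat):
    """Map an abstract read-rate request (bytes/s) to the 'major:minor bytes'
    line expected by blkio.throttle.read_bps_device. Enforcement would write
    this line into the container's blkio cgroup directory (path assumed)."""
    st = stat(device_path)
    major, minor = os.major(st.st_rdev), os.minor(st.st_rdev)
    return f"{major}:{minor} {read_bps}"
```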

Implementation
The software components (Fig. 4) have been implemented. The Controller uses ZooKeeper to store data, to communicate with the agents, and to route enforcement requests to the corresponding agents. The data structure used by the Controller was presented in [11]. The agents use Linux built-in utilities to monitor and enforce the I/O data rates.
The interface I3 was implemented as a file-based plugin as follows.
- In a machine, an agent is configured to monitor a specific folder periodically. When changes are detected (e.g., new files are created, or some files are modified or deleted), it parses the content of the files in the folder and uses them to synchronize its persistent data, as well as to construct and submit enforcement requests to the Controller.
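A minimal sketch of one polling pass of such a file-based plugin is shown below; detecting changes by modification time and storing the container data as JSON are assumptions of this sketch:

```python
import json
import os

def scan_plugin_folder(folder, seen):
    """One polling pass of the file-based I3 plugin: detect created, modified,
    and deleted files via mtime. `seen` maps filename -> last observed mtime
    and is updated in place. Returns (parsed data of changed files,
    names of removed files)."""
    changed, current = [], {}
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        mtime = os.path.getmtime(path)
        current[name] = mtime
        if seen.get(name) != mtime:          # new or modified file
            with open(path) as f:
                changed.append(json.load(f))
    removed = [name for name in seen if name not in current]
    seen.clear()
    seen.update(current)
    return changed, removed
```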

Cooperation with YARN
The following steps are required to cooperate with YARN.
(i) The NodeManager communicates with the agent to add, modify, or remove containers. (ii) After receiving the resource monitoring and enforcement requests from the NodeManager (which contain the container ID, the enforcement type, and the resource requirements of each container), the agent starts the procedures to monitor and enforce the resources of the requested containers. To perform resource enforcement, the agents share data through the Controller. (iii) The NodeManager may modify existing enforcement requests during the operation.
When a YARN client submits a job to the ResourceManager, the submission contains the resource requirements of the container that hosts the ApplicationMaster (AM) of the client [20,26]. The AM is responsible for requesting the resources of the containers that run the application tasks. An instance of the Resource class conveys the resource requirements of the containers. Therefore, to support a new type of resource, the Resource class should be extended to include the requirements of the new resource type (e.g., the IOPS, the reads per second, the writes per second, or the throughput). Since the NodeManager is responsible for notifying the agent about the creation and deletion of containers as well as about the enforcement data, a modification is needed in the code of the NodeManager.
The execution of the MapReduce WordCount example (which reads files stored in HDFS and counts the occurrences of words) with the proposed components is depicted in Fig. 5.

S) Applications specify their I/O rate requirements when negotiating resources with the ResourceManager (RM) to start the AM and the containers. The HDFS bandwidth rate is configured as a resource. The ApplicationMaster sends resource requests, including the HDFS read/write limit rate requirements. These resource requirements are then passed to the appropriate NodeManagers and agents. When an NM launches a container, it also submits the request for the HDFS read limit rate to the adjacent agent.
N) Based on the list of tasks and the I/O data rate requirements (per task), a NodeManager (NM) calculates the total amount of resources for a given container and submits a resource enforcement request to the adjacent agent.
E) The IOController then translates the request into the appropriate enforcement settings and applies LTC rules to shape the requested HDFS downstream traffic. The enforcement is performed during the lifetime of the container.
M) The ResourceMonitor monitors the HDFS read connections of the given container and may submit the collected connection information to the Controller for a remote IOController.
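The per-container calculation in step N) can be sketched as follows; the function name and parameters are ours:

```python
def container_limit_bps(per_task_limit_bps, running_tasks=1):
    """Sketch of the per-container limit calculation described above.
    For a YARN MapReduce container, running_tasks is always 1 (one map or
    reduce task per container); for a Spark executor on Mesos it is the
    number of tasks currently running in the executor, so the limit is
    re-adjusted whenever that number changes."""
    return per_task_limit_bps * max(running_tasks, 1)

# 10 MB/s per task; a Spark executor currently running 3 tasks
limit = container_limit_bps(10 * 2**20, 3)
```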

Cooperation with Mesos
Fig. 5 The execution flow of the MapReduce WordCount example with the proposed approach.
The following aspects need to be considered to cooperate with Mesos:
- Mesos slaves communicate with the agent to add, modify, or remove containers.
- After receiving the resource monitoring and enforcement requests from the Mesos slave (which contain the container ID, the enforcement type, and the resource requirements of each container), the agent starts the procedures to monitor and enforce an I/O rate for the requested containers. To perform resource enforcement, the agent may share data with other agents through the Controller.
- Mesos slaves may request the monitored data from the agent to report the data usage to the Mesos master.
In Fig. 6, the execution flow of the Spark WordCount example with the proposed components is depicted.

S) A Mesos slave retrieves information from the ResourceMonitor and reports the amount of the usable resource (e.g. the I/O data rate) to the Mesos master, which in turn advertises it to a specific Spark driver. The Spark driver must specify its I/O data rate requirements when a resource offer is accepted. To shape the HDFS downstream traffic of the containers, the Mesos slaves must report the HDFS read limit bandwidth to the Mesos master. The application driver should specify its requirement for the resource when consuming the offers from the Mesos master. The resource model of Mesos assures that the requested HDFS read limit of a container is passed to the corresponding Mesos slave, which in turn launches the container and the assigned tasks.
N) Based on the list of tasks and the I/O data rate requirements (per task), the Mesos slave calculates the total amount of the resource for the given container and notifies the agent with a resource enforcement request. As a part of the resource enforcement procedure, the Mesos slave calculates the HDFS read limit rate of the given container, based on its running tasks, and submits the request to the adjacent agent. For example, in the case of the mentioned file-based plugin, it must export the container data into a flag file (the name of which is the ContainerId string of the message) in a pre-configured folder processed by the adjacent agent.
M) The ResourceMonitor monitors the container, then forms and sends enforcement requests to the IOController. It may share data with a remote IOController via the Controller.
E) The IOController in turn translates the request into the appropriate enforcement settings, and the enforcement is performed during the lifetime of the given executor. The ResourceMonitor monitors the HDFS read connections of the given container and sends the collected connection data to the IOController through the Controller. The IOController component then applies LTC rules to shape the requested HDFS downstream traffic. The setting of the HDFS read limit of the container can be adjusted during the container's lifetime based on its current tasks.
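The flag-file export of step N) could look like the following sketch; the JSON layout of the file content is an assumption, while the file being named after the ContainerId follows the description above:

```python
import json
import os

def export_container_flag_file(folder, container_id, per_task_limit_bps,
                               running_tasks):
    """Sketch of the file-based plugin export described above: the Mesos
    slave writes the container data into a flag file named after the
    ContainerId, in the folder watched by the adjacent agent."""
    data = {
        "containerId": container_id,
        # aggregate limit for the executor, based on its running tasks
        "hdfsReadLimitBps": per_task_limit_bps * running_tasks,
    }
    path = os.path.join(folder, container_id)
    with open(path, "w") as f:
        json.dump(data, f)
    return path
```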

Experimental Results with Controlling I/O Activities
In this section, we present numerical results to show that I/O activities can be controlled with the proposed framework.

Testbed and Performance Metrics
Experiments were performed in a shared cluster. For each scenario, the average value of ten repeated measurements is calculated. The buffer cache was dropped before each measurement and was removed at the end.

Controlling I/O Activities in Mesos
Disk I/O contention may occur when multiple tasks concurrently read or write data on the same disk. Figure 7 shows the captured read data rates of Spark TeraSort and FkmerG on Mesos when running successively or in parallel. It is worth mentioning that the FkmerG executors can run up to two tasks in parallel, while the TeraSort executors can run up to three; the HDFS block size is 256 MB. There is a significant degradation in the disk I/O performance of FkmerG when TeraSort runs in parallel, as shown in Fig. 7b. Figure 8 depicts the HDFS read rates when FkmerG and TeraSort are executed in parallel. Figure 8a presents the data rates when a rate limit of 10 MB/s per task is set for both FkmerG and TeraSort. Similarly, Fig. 8b plots the read data rates when FkmerG is prioritized. Clearly, the I/O data rate performance of the FkmerG application can be provisioned with our solution. Figures 9 and 10 plot the measured I/O performance metrics of the Spark FkmerG and TeraSort applications with and without an I/O-greedy application. Note that in this case the FkmerG and TeraSort executors were configured to run up to four tasks in parallel, and the HDFS data block size is 256 MB. When a specific workload (e.g., FkmerG or TeraSort) needs to be executed within a specific timeframe, administrators can use our solution to limit the I/O contention caused by I/O-greedy applications. In this experiment, the maximum throughput of the I/O-greedy applications is limited to 20 MB/s. For example, FkmerG achieves a read rate of 7.48 MB/s when an HDFS-writer runs in parallel with a throttled I/O rate (for comparison, FkmerG achieves a read rate of only 1.75 MB/s when the HDFS-writer runs in an uncontrolled environment, and of 8.3 MB/s when there is no contention).
The experimental results show that controlling I/O-greedy applications is beneficial in terms of the average delay of a disk I/O block for typical applications. As CGroups does not support the control of asynchronous data writes to disks, the I/O throughput of Fio-writer cannot be limited. Therefore, it is recommended to locate the HDFS storage on separate disks when the Hadoop cluster is shared with other applications [24]. Another contention scenario is presented in Fig. 13a, which plots the captured throughput of the containers of the MapReduce TestDFSIO and MapReduce Grep applications that were launched at the same time.
In this experiment, TestDFSIO (processing two files of 2 GB each) and Grep (processing 1 GB of data with a block size of 256 MB) were configured to execute up to two map tasks in parallel. Thus, four containers may try to access the same DataNode at any moment. The HDFS traffic of these applications was shaped in the same way as described in Section 4.2. Figure 13b plots the captured results when the limit rate per task of TestDFSIO and Grep was set to 10 MB/s and 40 MB/s, respectively, while Fig. 13c plots the captured data when TestDFSIO was preferred, with its data rate controlled at 40 MB/s and Grep's rate set to 10 MB/s. This demonstrates that the HDFS read data rate of MapReduce applications can be appropriately shaped using our approach. Furthermore, the throughput of a container is more predictable compared to the case of rate limiting alone.

A Typical Use Case
In this section, we present a typical use case of mobile network operators where our tool can provision the data rate of deadline-bounded applications. In the computing cloud of a mobile network operator, data processing jobs consist of a data ingestion phase (accepting data from sources and storing them in HDFS) and processing phases, as illustrated in Fig. 14. These applications are executed periodically, and each execution must finish by a deadline. Two applications are assumed.
- In the streaming application, events are continuously streamed from an external source through a Kafka message bus and a Flume agent into HDFS. A MapReduce Join job is started at the beginning of each period to process the streaming data of the previous period. Thus, both the data ingestion and the Join job must finish within 3 minutes. Note that the ingestion rate of the streaming application (around 5 MB/s) is very small compared to the full copy rate of the HDFS-writer (around 100 MB/s).
- In the batch application, the data must be copied into HDFS with an HDFS-writer tool from an external machine that is not part of the Hadoop cluster. A MapReduce Grep job is started after the data ingestion phase is finished.
The run times of the two applications without competition are presented in Table 1. The streaming application is executed with a periodicity of 3 minutes. For each period, the streaming application must process around 900 MB of event data; thus, there are 2 Map tasks in each MapReduce Join job. The batch application must process 20 GB of data, generated by the HDFS-writer with a block size of 512 MB; thus, there are 40 Map tasks in each MapReduce Grep job. The Capacity scheduler of YARN is configured so that the batch application can run up to 5 Map tasks in parallel, while the streaming application can run up to 2 Map tasks in parallel. When MapReduce Grep and Join run in parallel, the Join job has a higher priority to run its Map tasks.
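The sizing figures above are consistent with each other, as a simple arithmetic cross-check shows:

```python
# Pure-arithmetic cross-check of the workload sizing quoted above.
period_s = 3 * 60                              # periodicity of the streaming app
stream_rate_mb_s = 5                           # ingestion rate (around 5 MB/s)
stream_data_mb = stream_rate_mb_s * period_s   # 900 MB of events per period

batch_data_mb = 20 * 1024                      # 20 GB batch input
block_mb = 512                                 # block size of the batch data
grep_map_tasks = batch_data_mb // block_mb     # 40 Map tasks per Grep job
```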
As both MapReduce Join and the HDFS-writer are write-intensive, they have a significant impact on the processing of the streaming data. It can be observed in Table 2 that the requirements of the streaming application cannot be guaranteed when the HDFS-writer copies the data to HDFS at the full rate.
The proposed framework was used to enforce the HDFS read/write rates of the HDFS-writer and of the MapReduce jobs. The writing rate of the HDFS-writer is limited to 60 MB/s. The HDFS write/read rate of each Grep and Join map task is limited to 20 MB/s and 40 MB/s, respectively. The HDFS write/read rate of the reduce tasks is 50 MB/s for both jobs. Therefore, we can organise the processing pipelines of the two applications as in Fig. 15 to guarantee the deadlines.

Conclusion
In this study, we have demonstrated that free competition profoundly degrades the performance of applications in computing frameworks. We have proposed software components (available at https://github.com/dohona/hadoop) providing a set of functionalities to monitor the I/O capability (disk and network) of the physical infrastructure and to control the I/O competition. Our experimental results have shown that our controlling tool, which can easily be integrated into YARN and Mesos environments, can minimize contention situations and the I/O data rate overload of applications in shared clusters. It is worth emphasizing that our tool can be used to provision the disk I/O and the network bandwidth of containers in a cloud as well.