
1 Introduction

Container technology has recently experienced rapid development with support from industry players such as Google and Alibaba, and is widely used in large-scale production environments [1]. Container technology, also called operating-system-level virtualization, allows multiple isolated user-space instances to share the same operating system kernel and uses cgroups [2] to control resources on the host. This provides functionality similar to a virtual machine (VM) but with a lighter footprint.

The Docker container is one of the mainstream containers, as it solves many of the challenges of Internet services. Microservices packaged as Docker containers can be executed on any supported container platform, which solves the problem of application portability. The emergence of containers and of microservices based on container technology has led to a paradigm shift in the development and deployment of software applications. The granularity of application functionality becomes finer, and scalability and resiliency improve, which brings additional challenges to traditional monitoring solutions. It is important to understand the interdependencies between containers and to maintain dependability in large-scale container-based cloud environments. A monitoring tool should provide graphical interfaces for the application and resource metrics of each container, as well as an overview of resource usage at the image level. In particular, system monitoring is also the foundation of many resource management solutions.

In order to enhance the stability of container-based clouds and detect suspected abnormal events or operations, it is necessary to provide a monitoring and alarming mechanism for the container system. In this paper, we design and implement a monitoring and alarming platform, CMonitor, for container-based clouds. It is built on interfaces provided by the current Docker platform, to which we add the following new functions:

  • Integrated monitoring services. CMonitor not only monitors the basic resource usages of each container but also provides hardware-level and application-level monitoring services.

  • Global topology view. CMonitor generates a global topology structure for containers by parsing network traffic among containers.

  • Intelligent alarming mechanism. CMonitor contains several anomaly detection algorithms to identify abnormal behaviors in containers and notifies users of alarms.

  • Rich visualization functions. CMonitor records the runtime log for both system resources and application behaviors and generates tables and figures with advanced data visualization techniques.

The rest of this paper is organized as follows. In Sect. 2, we introduce the related work. In Sect. 3, we describe the system architecture of CMonitor. In Sect. 4, we perform the performance evaluation of CMonitor. Section 5 concludes the whole paper.

2 Related Work

Currently, there are already several online monitoring tools developed for containers, such as Docker stats [3], cAdvisor [4], and Scout [5]. Docker provides built-in command-line monitoring for Docker hosts via the docker stats command. Administrators can query the Docker daemon and get detailed real-time information about container resource consumption, including CPU and memory usage, disk and network I/O bandwidth, and the number of running processes. However, Docker stats can only monitor a single host, and there is no graphical interface to collect data from multiple hosts. cAdvisor was originally developed by Google as a monitoring tool that collects, aggregates, processes, and exports information about running containers. cAdvisor has a web interface that generates multiple charts, but it can likewise only monitor one host. cAdvisor by itself is not a complete monitoring solution, but it is often used as part of other monitoring solutions. Scout provides comprehensive data collection, filtering, and monitoring capabilities, but its commercial license is expensive: the standard package price ranges from $99 to $299 per month.
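
As an illustration, the per-container data exposed by docker stats can be sampled and parsed along these lines (a sketch using the command's Go-template format option; the helper names are our own):

```python
import subprocess

def parse_stats_line(line):
    """Parse one formatted `docker stats` line into a dict.

    Expects tab-separated output produced by the format string
    "{{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}".
    """
    name, cpu, mem = line.split("\t")
    return {
        "name": name,
        "cpu_percent": float(cpu.rstrip("%")),   # e.g. "12.34%" -> 12.34
        "mem_usage": mem.split(" / ")[0],        # e.g. "21.5MiB / 2GiB" -> "21.5MiB"
    }

def sample_containers():
    """Query the Docker daemon once and return per-container metrics."""
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format",
         "{{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"],
        capture_output=True, text=True, check=True).stdout
    return [parse_stats_line(l) for l in out.splitlines() if l]
```

Note that this still inherits the single-host limitation discussed above: the command only sees containers managed by the local daemon.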

There have been many works focusing on cloud monitoring. Fatema et al. [6] surveyed monitoring tools revealing common characteristics and distinctions for clouds. Alhamazani et al. [7] also did a survey on commercial cloud monitoring tools and discussed the major research dimensions and design issues related to the development of cloud monitoring tools. Aceto et al. [8] analyzed and discussed properties of a monitoring system for the cloud, and described both commercial and open source platforms for cloud monitoring. Shao et al. [9] proposed a runtime model for cloud monitoring (RMCM), which gives an intuitive representation of a running cloud by focusing on common monitoring concerns. Zou et al. [10] designed a trusted monitoring framework, which provides a chain of trust that excludes the untrusted privileged domain, by deploying an independent guest domain for the monitoring purpose, as well as using the trusted computing technology to ensure the integrity of the monitoring environment.

Most recently, cloud anomalies and faults have also attracted much attention. Sharma et al. [11] proposed CloudPD, a fault management framework for clouds that leverages a canonical representation of the operating environment to quantify the impact of resource sharing; an online learning process to tackle dynamism; a correlation-based performance model for higher detection accuracy; and an integrated end-to-end feedback loop to synergize with a cloud management ecosystem. Gunawi et al. conducted a comprehensive study of bugs in six popular cloud systems [12]. CoMA [13] is a container monitoring agent that oversees resource consumption of operating-system-level virtualization platforms, primarily targeting container-based platforms such as Docker.

In contrast, our work CMonitor supports both monitoring and alarming functions for container-based clouds. It is also one of the earliest works on monitoring systems for containers.

3 CMonitor Architecture

3.1 Design Challenges

LXC (Linux Containers) is a kernel virtualization technology that realizes resource virtualization at the Linux operating system level. Docker encapsulates the underlying LXC technology, implementing resource isolation through namespaces and resource restrictions through cgroups. The running mechanism of applications in Docker containers differs from that on hosts: multiple containers may run many applications while sharing the resources of one or more underlying hosts.

We first analyze the special challenges of monitoring containers in real time. In a traditional environment, most of the servers and applications to be monitored are relatively static, while in a container-based cloud environment the containers keep changing and are subject to more interference. Real-time monitoring of containerized environments is therefore more difficult. It is not possible to accurately understand what is happening inside a container by simply running a monitoring command such as top or ps on the host.
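
For instance, assuming the cgroup-v1 layout that Docker uses, a host-side tool must consult `/proc/<pid>/cgroup` just to tell which container a process belongs to, as in this sketch:

```python
def container_id_of(cgroup_text):
    """Extract the Docker container ID from the contents of /proc/<pid>/cgroup.

    Under cgroup v1, Docker places each containerized process in a hierarchy
    path like "/docker/<64-hex-id>"; host processes are not in such a path.
    Returns None for processes running directly on the host.
    """
    for line in cgroup_text.splitlines():
        # Each line looks like "4:memory:/docker/0123abcd..."
        _, _, path = line.split(":", 2)
        if path.startswith("/docker/"):
            return path.split("/")[2]
    return None
```

Tools like top and ps do not perform this mapping, which is one reason they cannot attribute resource usage to individual containers.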

3.2 Performance Metrics

To obtain detailed monitoring information about containers, we collect low-level resource metrics covering CPU utilization, memory usage, block I/O, and network I/O. As shown in Table 1, the metrics include fine-grained monitoring information.

Table 1. Description of each metric

CPU utilization includes total CPU usage, CPU usage by the user, CPU usage by the system, and throttled CPU time. Knowing the CPU usage of containers helps maximize the resource utilization of hosts: lowering the CPU time of highly loaded containers can effectively ensure that other services get the necessary resources.

Container memory usage includes total memory usage, RSS memory usage, cache memory usage, and swap memory usage. Understanding the memory usage of each container is important for current operation and capacity planning. Based on changes in Docker memory usage, we can improve resource utilization by dynamically adjusting the capacity of containers.

Block I/O usage includes total I/O bytes, read/written bytes, and sync/async bytes. Docker container images consume additional host disk space and perform corresponding file reads and writes; persistent Docker volumes also consume host disk space. Proper use of cleanup tools is important for the continued operation of Docker containers.

Network I/O includes Rx/Tx bytes, Rx/Tx packets, Rx/Tx errors, and Rx/Tx dropped packets. Docker containers share a LAN within the same host. Tracking container network errors and lost packets can reveal specific network failures of the host system. Especially for containers such as load balancers, the throughput of virtual networks can be a major bottleneck.
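
These low-level metrics are exposed by the kernel as plain files in the cgroup filesystem. The sketch below assumes the cgroup-v1 layout that Docker uses (`memory.stat`, `cpuacct.usage` are real cgroup-v1 interface files); it illustrates how a collector can read them and is not CMonitor's actual code:

```python
import os

def parse_memory_stat(text):
    """Parse the contents of a cgroup-v1 memory.stat file into {metric: bytes}.

    Covers the fields discussed above, e.g. rss, cache, and swap.
    """
    stats = {}
    for line in text.splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

def read_cpu_usage_ns(base="/sys/fs/cgroup/cpuacct/docker"):
    """Read total CPU time (nanoseconds) consumed by each container,
    keyed by container ID (the per-container directory name)."""
    usage = {}
    for cid in os.listdir(base):
        with open(os.path.join(base, cid, "cpuacct.usage")) as f:
            usage[cid] = int(f.read())
    return usage
```

CPU utilization percentages are then derived by differencing two `cpuacct.usage` readings over the sampling interval.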

3.3 Core Modules

Container monitoring is an extremely important part of a container management platform. Monitoring not only needs to obtain the running status of containers in real time but also the dynamic changes of the resources they occupy. We propose an architecture for CMonitor comprising agents, a server, and a client. The agent collects various monitoring data of containers on the host machines. Since there are many monitoring metrics and the data volume is huge, the data collection performed by agents must be efficient. Moreover, the same host machine can run dozens or even hundreds of containers, so the data collection, sorting, and reporting process must have low overhead. The agent is deployed on each host running containers, and the monitored data is transferred to the server through the HTTP protocol.

The overall architecture design of CMonitor is shown in Fig. 1. The core modules are described as follows:

Fig. 1.
figure 1

The overall architecture and core modules of CMonitor

  • Monitoring Agent Module. There are multiple nodes in a container cluster, and each node runs multiple containers. The monitoring agent of CMonitor is deployed on each physical node and is responsible for collecting performance monitoring data such as CPU, memory, network, and disk usage. We develop the agent monitoring module based on libcontainer [14], which collects container data from the host's proc filesystem [15], cgroups, and other system interfaces. The agent is encapsulated as a Docker image, exposes a port for external access to transmit monitoring data, and sends the collected data to the server in JSON format. The data collection and reporting process does not go through the Docker daemon [16], so it places no burden on the daemon. The agent can use different configuration files for different host environments, and sends performance monitoring data to the server according to a predefined data transmission mechanism. This design allows the agent to customize the type of monitoring service and adapt flexibly to monitoring scenarios in different host environments.

  • Monitoring Server Module. The monitoring server module is responsible for collecting, processing and analyzing the monitoring data. It includes three sub-modules.

    • Processing Sub-Module. The data processing sub-module pulls the data collected by the agents through a specific port. Preprocessing preserves important container operational status information by categorizing and aggregating the data and removing meaningless noise. As the number of cluster nodes increases, the amount of monitoring data grows, which requires a concurrent processing mechanism.

    • Topology Generation Sub-Module. From the perspective of network traffic analysis, the entire system can be visually observed by generating a topology of access relationships between containers. The topology generation sub-module updates node and edge information by querying the processing sub-module. We use a dedicated data structure to store the topological relationships between containers together with their operational performance metrics. The sub-module then converts this data structure into JSON format and transmits it to the real-time monitoring module through the HTTP protocol.

    • Anomaly Detection Sub-Module. This module is responsible for monitoring normal and abnormal status. The simplest method for anomaly detection is threshold-based detection: an alert is issued if resource usage exceeds the threshold. The abnormal information covers abnormal application behaviors, system faults, and external attacks. The data is preprocessed by the processing sub-module, and the anomaly detection sub-module can also be configured with other machine learning detection algorithms for further anomaly detection.

  • Client Machine Module. Users can access the monitoring platform through a Web interface. This module includes three sub-modules.

    • Real-time Monitoring Sub-Module. The real-time monitoring sub-module is responsible for further processing the data pulled from the server. Then it forwards data to the visualization sub-module and alarming sub-module. CMonitor records the runtime log for both system resources and application performance.

    • Report Visualization Sub-Module. The visualization sub-module collects the data sent by the real-time monitoring sub-module through synchronous refresh. It then visualizes the various container topological relationships and operational metrics, using Echarts [17] to draw various types of charts.

    • Alarming Sub-Module. This module is responsible for triggering alerts and notifying users with animation and sound. If the CPU or memory usage exceeds the pre-set threshold value within a certain sampling period, the overload reminder function is triggered and the system sends a notification to the administrator.
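
The agent's reporting path described above (collect, encode as JSON, POST to the server over HTTP) can be sketched as follows; the payload fields and server URL are hypothetical illustrations, not CMonitor's actual wire format:

```python
import json
import urllib.request

def build_payload(host, samples):
    """Assemble one agent report: host identity plus per-container samples."""
    return json.dumps({"host": host, "containers": samples})

def report(server_url, host, samples):
    """POST the JSON payload to the monitoring server over HTTP."""
    req = urllib.request.Request(
        server_url,
        data=build_payload(host, samples).encode(),
        headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req)
```

Because the payload is plain JSON over HTTP, the server can ingest reports from heterogeneous hosts without daemon-specific coupling.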
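
The threshold-based detection used by the anomaly detection and alarming sub-modules can be sketched like this (the threshold and window length are illustrative): an alert fires only when the metric stays above its threshold for a whole sampling window, which filters out transient spikes:

```python
from collections import deque

class ThresholdDetector:
    """Flag a metric as anomalous when it exceeds its threshold
    for `window` consecutive samples."""

    def __init__(self, threshold, window=3):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def observe(self, value):
        """Record one sample; return True when an alert should fire."""
        self.samples.append(value)
        return (len(self.samples) == self.samples.maxlen
                and all(v > self.threshold for v in self.samples))
```

One detector instance per (container, metric) pair suffices; more sophisticated machine-learning detectors can replace `observe` behind the same interface.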

4 CMonitor Evaluation

The primary objective of the evaluation is to assess the validity of the metrics collected by CMonitor from various Docker containers. Validity means that the monitored data should reflect the real values. Given the numerous metrics collected regarding CPU, memory, network I/O, and block I/O, a subset of them is selected for evaluation. We evaluate the validity of user CPU utilization, system CPU utilization, memory utilization, and the number of bytes read from and written to disk, where user CPU utilization and system CPU utilization refer to the percentage of CPU time spent in user mode and system mode, respectively. All evaluations use Ubuntu 18.04 LTS and the Docker platform 1.35. The physical machine runs an Intel i5-8250U processor at 1.60 GHz with 2 GB of 2400 MHz RAM.

Fig. 2.
figure 2

Scenario 1, no workload is running in the host environment

Fig. 3.
figure 3

Scenario 2, workload on host OS

Fig. 4.
figure 4

Scenario 3, workload in container

Fig. 5.
figure 5

CPU utilization reported at the host OS level by NetData in each scenario

To assess the accuracy of memory and CPU utilization metrics, three different scenarios were set up, each run for five rounds. In scenario 1 (see Fig. 2), the Docker platform runs an empty container without any workload, which establishes the baseline CPU and memory usage of Docker execution. Scenario 2 (see Fig. 3) is similar to scenario 1, except that a workload generator is executed in the host OS. Stress-ng [18] is used to generate load on both CPU and memory: a CPU worker executes a prime-number computation to generate load, and to stress the memory, five workers are started, each sized at 400 MB. The layout of scenario 3 (see Fig. 4) is exactly the same as that of scenario 2; the only difference is that the Stress-ng process runs in a container on the Docker platform.
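
The workload setup above can be reproduced with a stress-ng invocation along the following lines (a sketch; `--cpu`, `--vm`, and `--vm-bytes` are standard stress-ng options, and the timeout value is illustrative):

```python
import subprocess

def stress_cmd(cpu_workers=1, vm_workers=5, vm_bytes="400M", timeout="60s"):
    """Build the stress-ng command line matching the scenario setup:
    one CPU worker plus five 400 MB memory workers."""
    return ["stress-ng",
            "--cpu", str(cpu_workers),
            "--vm", str(vm_workers),
            "--vm-bytes", vm_bytes,
            "--timeout", timeout]

def run_stress(**kwargs):
    """Launch the workload generator. For scenario 2 this runs on the
    host OS; for scenario 3 the same command runs inside a container."""
    return subprocess.run(stress_cmd(**kwargs), check=True)
```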

Table 2. CPU utilization reported by NetData in each scenario for Fig. 5. Total CPU is the aggregation of user CPU and system CPU. SD means standard deviation.

4.1 CPU Workload Test on Single Container

We obtain host-level data for each scenario by using NetData [19]. Figure 5 shows the user CPU, system CPU, and total CPU utilization reported by NetData for each scenario. To determine whether the metrics collected by CMonitor are valid, we compare the data collected by NetData with that collected by CMonitor in the three scenarios, as shown in Figs. 5 and 6. The total CPU in scenario 1 (see Table 2) aggregated with the total CPU of the container reported by CMonitor (see Table 3) should be similar to the total CPU reported in scenario 3 (see Table 2). As shown in Table 2, aggregating the two values (30.52% and 69.26%) yields a total CPU utilization of 99.78%. In scenario 3 (see Table 2), the total CPU utilization of the whole host is 99.92% with a standard deviation of 14.89, which verifies that the CPU metric values collected by CMonitor are valid. From Fig. 5 and Table 2, it can be determined that there is a 0.68% difference between the host and the container running the same Stress-ng workload.
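
The validity check reduces to a simple aggregation: the scenario-1 baseline plus the container usage reported by CMonitor should approximately reproduce the scenario-3 host total. With the percentages from Tables 2 and 3:

```python
def aggregation_gap(baseline, container, host_total):
    """Difference between (scenario-1 baseline + container CPU reported
    by CMonitor) and the host total measured by NetData in scenario 3."""
    return abs((baseline + container) - host_total)

# 30.52% (baseline) + 69.26% (container) = 99.78%, vs. 99.92% host total.
gap = aggregation_gap(30.52, 69.26, 99.92)
```

The smaller this gap, the more faithfully CMonitor's per-container figure accounts for the load added on top of the baseline.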

Fig. 6.
figure 6

Container CPU utilization reported by CMonitor

Table 3. CPU utilization reported by CMonitor in each scenario for Fig. 6. No workload is running in scenario 1.
Fig. 7.
figure 7

Each of the 10 containers runs the same workload, generating the same CPU load; CMonitor reports the total CPU utilization of the 10 containers and of the Docker platform, while NetData reports the value for the host system.

Table 4. Total CPU utilization and standard deviation for Fig. 7.

From Fig. 5 and Table 2, there is a slight difference when running the same Stress-ng workload on the local host and in a container. This difference may be due to errors in the collected data. Moreover, the test results of scenario 3 show that containerizing the same Stress-ng workload increases the system CPU utilization, while the opposite holds for user CPU utilization.

4.2 CPU Workload Test on Multiple Containers

The above scenario shows that the single-container CPU utilization collected by CMonitor is valid. However, we still need to evaluate whether CMonitor can correctly report CPU utilization metrics for multiple containers. In the first evaluation, 10 containers with the same initialization configuration are used to generate exactly the same Stress-ng workload. According to the default process scheduling principle in Linux hosts, the expected result is that each container obtains the same CPU utilization on average. As can be observed from Fig. 7 and Table 4, each container accounts for approximately 9% of the CPU. The total CPU utilization across all containers is 89.85%, approximately matching the total CPU utilization of 89.9% reported by CMonitor for the Docker platform.

Fig. 8.
figure 8

The total CPU utilization of two containers running different Stress-ng workload configurations. CMonitor reports metrics for the containers and the Docker platform, respectively; NetData reports the values for the host OS.

Table 5. Total CPU utilization and standard deviation for Fig. 8

This result also shows that CMonitor can independently collect the CPU utilization of each container on the Docker platform. We conduct a second evaluation to further verify the correctness of the collected metrics by running different Stress-ng workloads in identically initialized containers. As shown in Fig. 8 and Table 5, the metrics collected by CMonitor show that one container uses 24.56% of the CPU and the other 62.62%. The total of 87.18% for both containers is similar to the 87.35% CPU usage reported for the Docker platform.

Fig. 9.
figure 9

Memory usage of the host OS and container in each scenario. Container memory usage is reported by CMonitor and host OS memory usage is reported by NetData.

Table 6. Memory usage and standard deviation for Fig. 9

4.3 Memory Workload Test on Single Container

We analyze memory usage in the three scenarios. The values collected for each scenario are shown in Fig. 9 and Table 6, which show that the reported memory values fluctuate more than the CPU values. The host memory usage in scenario 1 (417.26 MB) aggregated with the container memory usage reported by CMonitor in scenario 3 (1456.28 MB) should be similar to the host memory usage in scenario 3 (1755.39 MB). There is a difference of approximately 118 MB in this example, which may be due to Linux memory management behaving differently under different usage scenarios.

4.4 Memory Workload Test on Multiple Containers

The memory workload test for multiple containers follows the same procedure as the CPU test, and we again use CMonitor to evaluate the metrics of multiple concurrently running containers. As can be seen from Fig. 10 and Table 7, the memory usage of each container is close to 150 MB, and the total memory usage of all 10 containers is 1524.25 MB. The memory usage of the host system is 1821.32 MB. The difference between the two values, 297.07 MB, represents the memory of operating system processes other than the containers. In the second evaluation, we set two containers to run different workloads. As shown in Fig. 11 and Table 8, CMonitor reports the two containers using 431.35 MB and 1148.41 MB of memory, respectively, with host memory usage of 1764.85 MB. The difference between these metrics is the memory occupied by non-container processes on the host.

4.5 Validity of Block I/O Test

To evaluate whether the disk I/O measurements collected by CMonitor are reliable, containers run the Sysbench [20] file workload to test disk reads and writes. For this purpose, three containers running simultaneously write to disk, with Sysbench configured with different numbers of load workers and file sizes for each container. The first container runs two file load workers, each writing 16 files with a total size of 300 MB to a shared folder on the host disk in random-write mode. The second container starts a workload in which two workers each write 500 MB to disk. The third container starts one worker that writes 1000 MB to the disk folder. For each container, the number of written bytes reported by CMonitor is exactly correct. The first container takes around 90 s to write 600 MB to disk; the second and third containers take around 115 s to write 1000 MB each. The time each container takes to write its files to disk is related to the number of load workers in the container.
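
Verifying the block-I/O numbers amounts to comparing the bytes reported by CMonitor against the volume each Sysbench configuration is expected to write. A small sketch of that check for the three container configurations above:

```python
def expected_write_mb(workers, file_size_mb):
    """Total data volume a container's Sysbench file workload writes:
    each worker writes its full configured file size."""
    return workers * file_size_mb

# The three container configurations from the experiment:
# (2 workers x 300 MB), (2 workers x 500 MB), (1 worker x 1000 MB).
configs = [(2, 300), (2, 500), (1, 1000)]
totals = [expected_write_mb(w, s) for w, s in configs]  # 600, 1000, 1000 MB
```

These expected totals are exactly the write volumes against which CMonitor's per-container byte counts are validated.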

Fig. 10.
figure 10

Memory usage of 10 containers running the same process. CMonitor collects metrics for containers and NetData reports host values.

Table 7. Memory usage and standard deviations for Fig. 10
Fig. 11.
figure 11

Memory usage of two containers, where each container runs a different process. CMonitor collects metrics for containers and NetData reports host values.

Table 8. Memory usage and standard deviations for Fig. 11

5 Conclusion

As an emerging platform for building and deploying applications, Docker has been widely recognized by industry, and more and more applications use Docker as the underlying resource abstraction platform. When users deploy their applications on the Docker platform, they want their containers to stay in good running condition. Based on the basic functions of the current Docker platform, this research designs a container monitoring application for distributed systems. The contribution of this work is a Docker container monitoring and alarming platform, CMonitor, which helps operation and maintenance managers control the security and performance of containers. In the experiments, we evaluate CMonitor's effectiveness in collecting metrics including CPU utilization, memory usage, and block I/O for containers in different scenarios. The evaluation shows that CMonitor can accurately collect the running information of containers and is robust.