A multilayered architecture is often the chosen strategy when designing a software system. Each layer is dedicated to a specific role, independently of the other layers, and is hosted by a tier, or physical layout, that can contain multiple layers at once.
In this paper, we focus on a tier, or physical machine, hosting multiple layers of virtualized systems (VS), also called virtualized execution environments. A virtualized system will be considered to be a virtual machine or a container. Figure 1 shows how the different layers can be organized in practical cases. Without multiple layers of virtual environments, the system is reduced to a single layer, the host, also called the physical machine. This layer will be called \(L_0\). Virtual machines adding a layer above the host will be labeled \(L_1\) VMs and, recursively, any VM above an \(L_n\) VM will be an \(L_{n+1}\) VM. Containers will not be labeled but will be associated with the machine directly hosting them. Containers can run directly in \(L_0\) but, for security reasons [12], they are most often used within virtual machines.
The idea we introduce here is to erase the boundaries between \(L_0\), its VMs in \(L_1\) and \(L_2\), and every container, to simplify the analysis and understanding of complex multilayer architectures. Methods for detecting performance degradations already exist for single-layer architectures. To reuse some of these techniques on multilayer architectures, one may remodel such systems as if all the activity involved only one layer.
Architecture
The architecture of this work is as follows: first, we trace the host and the virtual machines; then, because of clock drift [13], we synchronize those traces. After this phase, a data analyzer fuses all the data available from the different traces into a data model. Finally, we provide an efficient tool to visualize the model, allowing the user to easily distinguish the different layers and their interactions. Those steps are summarized in Fig. 2.
A trace consists of a chronologically ordered list of events, each characterized by a name, a time stamp and a payload. The name identifies the type of the event, the payload provides information relative to the event, and the time stamp specifies when the event occurred.
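As a minimal illustration, such an event can be modeled as follows (a sketch; the field layout is our own, not LTTng's on-disk format, although the event names and payload keys shown are real kernel tracepoint examples):

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass(order=True)
class TraceEvent:
    # Events compare, and therefore sort, by time stamp only.
    timestamp: int                                # time of occurrence (ns)
    name: str = field(compare=False)              # event type, e.g. "sched_switch"
    payload: Dict[str, Any] = field(compare=False, default_factory=dict)

# A trace is a chronologically ordered list of such events:
trace = sorted([
    TraceEvent(1050, "kvm_entry", {"vcpu_id": 0}),
    TraceEvent(1000, "sched_switch", {"prev_tid": 42, "next_tid": 57}),
])
```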
In this study, we use the Linux Trace Toolkit Next Generation (LTTng) [14] to trace the machine kernels. This low-impact tracing framework suits our needs, although other tracing methods could also be adopted. By tracing the kernel, there is no need to instrument applications; therefore, even a program using proprietary code can be analyzed. However, some events from the hypervisors managing the VMs are needed for the fused analysis to be effective: the analysis must know when the hypervisor lets a VM run its own code and when the VM is stopped. Since we are using KVM [15], merged in the Linux kernel since version 2.6.20 [16], the required trace points already exist and there is no need to further instrument the hypervisor. In our case, with KVM using the Intel x86 virtualization extensions, VMX [17], the event indicating a return to VM running mode will always be recorded on \(L_0\) and will be generically called a VMEntry. The opposite event will be called a VMExit.
Synchronization is an essential part of the analysis. Since traces are generated on multiple machines by different instances of tracers, we have no guarantee that a time stamp for an event in one trace is meaningful in the context of another. Each machine may have its own timing sources, from the software interrupt timer to the cycle counter. When tracing the operating system kernel, each system instance (i.e., host, VM, container, etc.) uses its own internal clock to time stamp the events. In order to get a common view of all systems behaviors, recorded separately as trace events in each system, it is thus essential to properly measure the offsets and drifts between these machines.
Figure 3 shows that, without synchronization, two traces recorded at the same time may seem to have been created at two different times. The correct ordering of events, even coming from different traces, is crucial because, when fusing the traces of a VM with its host, the events of the VM must fall exactly between the VMEntry and the VMExit of \(L_0\) relative to this specific VM. An imperfect synchronization can lead to incoherent observations that would impede the fused analysis. Figure 4 shows the difference between an analysis done on two pairs of traces with an accurate and an inaccurate synchronization, respectively. The inexact synchronization can lead to false conclusions: in this case, a process from the VM seems to continue using the processor while, in reality, the VM has been preempted by the host.
There are different possible solutions to synchronize the trace events between the host kernel and the VMs. One is to use the TSC (Time Stamp Counter), a 64-bit register built into the processor that counts CPU cycles since the system booted. It can be read by a single assembly instruction (rdtsc) and could therefore be considered a time reference anywhere in the system (kernel, hypervisor and application alike). However, using the TSC for timekeeping in a virtual machine has several drawbacks. The TSC_OFFSET field of a VM can change, especially during VM migration, which forces the tracer to keep track of this field in the VMCS; if that event is lost, or the tracer is not started at that time, the computed time will no longer be correct. Furthermore, some processors stop the TSC in their lower-power halt states, which causes time shifts in the VM. Finally, this form of timekeeping is not possible for full virtualization, since TSC_OFFSET is part of the Intel and AMD virtualization extensions.
Because VMs can be seen as nodes spread across a network, a trace synchronization method for distributed systems [18] can be adapted. As in [7], we use hypercalls from the VMs to generate events on the host that are related to the event recorded on the VM just before triggering the hypercall. With a set of matching events, it is possible to use the fully incremental convex hull synchronization algorithm [19] to achieve trace synchronization. Because of clock drift, a simple offset applied to the time stamps of a trace's events is not enough to synchronize the traces. To solve this issue, the fully incremental convex hull algorithm generates two coefficients, a and b, for each VM trace, while the host's trace is taken as the time reference. Each event \(e_i\) will have its time stamp \(t_{e_{i}}\) transformed to \(t'_{e_{i}}\) with the formula:
$$t'_{e_{i}} = a t_{e_{i}} + b $$
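Once the convex hull algorithm has produced a and b for a given VM trace, applying the correction is a simple linear pass, as sketched below (the coefficient computation itself is the subject of [19]; `events` is assumed to hold objects with a mutable `timestamp`):

```python
def synchronize(events, a, b):
    """Map a VM trace onto the host's time reference: t' = a*t + b.

    `a` corrects the clock drift (rate difference) and `b` the offset,
    with the host's trace taken as reference. Since a > 0, the
    chronological order of the events is preserved.
    """
    for e in events:
        e.timestamp = a * e.timestamp + b
    return events
```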
Gebai et al. [7] used the hypercall only between \(L_0\) and \(L_1\). However, the method also applies between \(L_n\) and \(L_{n+1}\), since a hypercall generated in \(L_{n+1}\) will necessarily be handled by \(L_n\). In our case, synchronization events will be generated between \(L_0\) and all its machines in \(L_1\), and between machines of \(L_1\) and their hosted machines. Consequently, a machine in \(L_2\) will be synchronized with its host, which will have previously been synchronized with \(L_0\).
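Since the corrections are linear, synchronizing \(L_2\) against \(L_0\) amounts to composing two transformations, as this small worked illustration of the scheme shows:

```python
def compose(a1, b1, a2, b2):
    """Compose two linear clock corrections.

    If t' = a2*t + b2 maps L2 time stamps onto L1's clock, and
    t'' = a1*t' + b1 maps L1 time stamps onto L0's clock, then
    t'' = (a1*a2)*t + (a1*b2 + b1) maps L2 directly onto L0.
    """
    return a1 * a2, a1 * b2 + b1
```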
The purpose of the data analyzer is to extract all relevant data from the synchronized traces and to add them to a data model. Besides analyzing events specific to VMs and containers, our data analyzer should handle events generally related to kernel activity. For this reason, the fused analysis is based on a preexisting kernel analysis used in Trace Compass [20], a trace analysis and visualization framework. Therefore, the fused analysis will by default handle events from the scheduler, the creation, destruction and waking up of processes, the modification of a thread's priority, and even the beginning and end of system calls.
Unlike a basic kernel analysis, the fused analysis will not consider each trace independently but as a whole. Consequently, the core of our analysis is to recreate the full hierarchy of containers and VMs, and to consider events coming from VMs as if they were happening directly in \(L_0\). As shown in Fig. 5, for the simple case of single-layered VMs (SLVMs), the main objective is to construct one execution flow by fusing those occurring in \(L_0\) and its VMs. The result is a unique structure encompassing all the execution layers at the same time, replacing what was seen as the hypervisor's execution, from the point of view of \(L_0\), by what was really happening inside \(L_1\) and \(L_2\).
KVM works in such a way that each vCPU of a VM is represented by a single thread on its host. Therefore, to complete the fused analysis, we need to map every VM's vCPU to its respective thread. This mapping is achieved by using the payloads of both synchronization and VMEntry events. On the one hand, a synchronization event recorded on the host contains the identification number of the VM, so we can match the thread generating the event with the machine. On the other hand, a VMEntry gives the ID of the vCPU going to run, which allows the association of the host thread with its corresponding vCPU.
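A minimal sketch of this two-step mapping follows; the dictionaries and the payload key `vm_id` are our own stand-ins for the structures described above, while `vcpu_id` follows the kvm_entry payload:

```python
# Hypothetical bookkeeping, built while parsing the host trace:
thread_to_vm = {}     # host TID -> VM identifier (from synchronization events)
thread_to_vcpu = {}   # host TID -> vCPU id       (from VMEntry events)

def on_sync_event(host_tid, payload):
    # The hypercall-generated event carries the VM's identification
    # number, tying the emitting host thread to a machine.
    thread_to_vm[host_tid] = payload["vm_id"]

def on_vmentry(host_tid, payload):
    # The VMEntry payload carries the id of the vCPU about to run,
    # tying the host thread to that vCPU.
    thread_to_vcpu[host_tid] = payload["vcpu_id"]
```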
Data model
The data analysis needs an adapted structure as its data model. This structure must satisfy multiple criteria. Fast access to the data is preferred, so that a view can dynamically display information to users and keep the visualizer responsive. The structure also needs to provide a way to store and organize the state of the whole system, while keeping information relative to the different layers. For this reason, we need a design that can store information about diverse aspects of the system.
As seen in Fig. 6, the structure contains information on the state of the different threads, but also of the numerous CPUs, VMs and containers. Each CPU of \(L_0\) will contain information about the layer currently using it, such as the name of the running VM and which thread and virtual CPU are involved. The Machine node will contain basic information about the VMs and \(L_0\), like the list of physical CPUs they have been using, their number of vCPUs or their list of containers. This node is fundamental since it is used to recreate the full hierarchy of the traced systems, in addition to the hierarchy of all the containers inside each machine.
Finally, the data model provides a time dimension aspect, since the state of each object attribute in the structure is relevant for a time interval. Those intervals introduce the need for a scalable model, able to record information valid from a few nanoseconds to the full trace duration.
In this study, we chose to work with a State History Tree (SHT) [21]. An SHT is a disk-based data structure designed to manage large streams of interval data. Furthermore, it provides an efficient way to retrieve, in logarithmic access time, the intervals stored within this tree organization [22].
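The attribute/interval semantics can be conveyed with a naive in-memory stand-in for the SHT (a sketch only; the real structure is disk-based and answers such queries in logarithmic time [22]):

```python
class NaiveStateHistory:
    """Per attribute, a list of (start, end, value) state intervals."""

    def __init__(self):
        self.intervals = {}   # attribute path -> [(start, end, value), ...]

    def insert(self, attribute, start, end, value):
        """Record that `attribute` held `value` over [start, end]."""
        self.intervals.setdefault(attribute, []).append((start, end, value))

    def query(self, attribute, t):
        """Return the value of `attribute` at time t, or None.
        Linear scan here; the SHT does this logarithmically."""
        for start, end, value in self.intervals.get(attribute, []):
            if start <= t <= end:
                return value
        return None
```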
Algorithm 1 constructs the SHT by parsing the events in the traces. If the event was generated by the host, then the CPU that created the event is directly used to handle it. However, if the event was generated by a virtual machine, we need to recursively find the CPU of the machine's parent harboring the virtual CPU that created the event, until the parent is \(L_0\). Only then is the right pCPU recovered and the event handled. This process is presented in Algorithm 2.
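The recursive resolution can be sketched as follows (a simplification of Algorithm 2; `parent` and `cpu_hosting_vcpu` are hypothetical helpers standing in for SHT queries):

```python
def resolve_pcpu(machine, cpu):
    """Find the physical CPU behind an event recorded on `machine`.

    Walk up the virtualization hierarchy: at each level, ask the
    parent machine which of its CPUs currently harbors the vCPU,
    stopping once the parent is L0 (whose own parent is None).
    """
    while machine.parent is not None:
        cpu = machine.parent.cpu_hosting_vcpu(machine, cpu)  # SHT lookup
        machine = machine.parent
    return cpu   # a pCPU of L0; the event can now be handled
```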
The fundamental aspect of the construction of the SHT is the detection of the frontiers between the executions of the different machines and containers. This detection is achieved by handling specific events and applying multiple strategies.
Single-layered VMs detection
In the case of SLVMs, the strategy is straightforward, as the mapping between the vCPUs of a VM in \(L_1\) and its threads in \(L_0\) is direct: a VM will be running its vCPU immediately after the recording of a VMEntry on its corresponding thread. Conversely, \(L_0\) stops a vCPU immediately before the recording of a VMExit.
Algorithm 3 describes the handling of a VMEntry event for the construction of the SHT. In this case, we query the virtual CPU that is going to run on the physical CPU. Then, we restore the state of the virtual CPU in the SHT, while saving the state of the physical CPU. The exact opposite treatment is applied when handling a VMExit event.
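In the spirit of Algorithm 3, the handling amounts to a state swap between the physical and the virtual CPU, as sketched below (the `pcpu` attributes and the `sht` helper are hypothetical, matching the naive stand-in shown earlier):

```python
def on_vmentry(sht, pcpu, t):
    """Sketch of Algorithm 3: a vCPU resumes on `pcpu` at time t."""
    vcpu = pcpu.entering_vcpu                       # queried from the payload
    # Close the pCPU's current interval, save its host-side state...
    sht.insert(pcpu.attribute, pcpu.since, t, pcpu.state)
    pcpu.saved_state, pcpu.since = pcpu.state, t
    # ...and restore the vCPU's saved state as the pCPU's new state.
    pcpu.state = vcpu.saved_state

def on_vmexit(sht, pcpu, t):
    """Exact opposite treatment: save the vCPU state, restore the pCPU's."""
    vcpu = pcpu.entering_vcpu
    sht.insert(pcpu.attribute, pcpu.since, t, pcpu.state)
    vcpu.saved_state, pcpu.since = pcpu.state, t
    pcpu.state = pcpu.saved_state
```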
Nested VMs detection
For VMs in \(L_2\), the previous strategy needs to be extended. Being a single-level virtualization architecture [23], the Intel x86 architecture has only a single hypervisor mode. Consequently, any VMEntry or VMExit happening at any layer higher than or equal to \(L_1\) is trapped to \(L_0\). Figure 7 shows an example of the sequence of events and hypervisor executions occurring on a pCPU when a VM in \(L_1\) wants to let its guest execute its own code, and when \(L_2\) is stopped by \(L_1\). The dotted line represents the different hypervisors executing, while the plain line shows when \(L_2\) uses the physical CPU.
This architecture invalidates the previous strategy used for SLVMs. A VMEntry recorded in \(L_1\) does not imply that a vCPU of a VM in \(L_2\) is going to run immediately after. Likewise, \(L_2\) does not yield a pCPU shortly before an occurrence of a VMExit in \(L_1\), but rather when the hypervisor in \(L_0\) is running, preceded by its own VMExit.
The challenge we overcome here is to distinguish whether a VMEntry in \(L_0\) is meant for a VM in \(L_1\) or in \(L_2\). Once this distinction is made, knowing that a VM of \(L_2\) is stopped is straightforward: if a thread of \(L_0\) resumes a vCPU of \(L_1\) or \(L_2\) with a VMEntry, then a VMExit from this same thread means that the vCPU was stopped.
We created two lists of threads in \(L_0\): the waiting list and the ready list. If a thread is in the ready list, the next VMEntry generated by this thread is meant to run a vCPU of a VM in \(L_2\). The second part of Algorithm 4 shows that we retrieve the vCPU of \(L_2\) going to run by querying it from the vCPU of \(L_1\) associated with the thread. The pairing between the vCPUs of \(L_1\) and \(L_2\) is done in the first part of the algorithm, during the previous VMEntry recorded on \(L_1\). It is also at this moment that the thread of \(L_0\) is put in the waiting list.
Algorithm 5 shows that the same principle is used for handling a VMExit in \(L_0\). If the thread was ready, then we again need to query the vCPU of \(L_2\) before modifying the SHT.
When a thread of \(L_0\) is put in the waiting list, it means that a vCPU of \(L_2\) is going to be resumed. However, at this point, we do not know for sure which VMEntry will resume the vCPU. The kvm_mmu_get_page event resolves this uncertainty by indicating that the next VMEntry of a waiting thread will be for \(L_2\). Algorithm 6 shows the handling of this event and the shifting of the thread from the waiting list to the ready list.
As seen in Fig. 7, it is possible to have multiple entries and exits between \(L_0\) and \(L_2\) without going back to \(L_1\). This means that a VMExit recorded on \(L_0\) does not necessarily imply that the thread stopped being ready. In fact, the thread stops being ready when \(L_1\) needs to handle the VMExit. To do so, \(L_0\) must inject the VMExit into \(L_1\), an action recorded by the kvm_nested_vmexit_inject event. Algorithm 7 shows that the handling of this event consists in removing the thread from the ready list.
The process will repeat itself with the next occurrence of a VMEntry in \(L_1\).
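Putting Algorithms 4 to 7 together, the detection reduces to a small state machine over the two thread lists, sketched below. The kvm_mmu_get_page and kvm_nested_vmexit_inject event names are the actual tracepoints named above; the `nested_vcpu` attribute and the `resume`/`stop` helpers, which would update the SHT, are our own placeholders:

```python
waiting, ready = set(), set()   # L0 threads, keyed by TID

def on_vmentry_in_L1(tid, vcpu_l1, vcpu_l2):
    # Algorithm 4, first part: the VMEntry recorded in L1 pairs the
    # L1 vCPU with the L2 vCPU it wants to run; the backing L0
    # thread starts waiting.
    vcpu_l1.nested_vcpu = vcpu_l2
    waiting.add(tid)

def on_kvm_mmu_get_page(tid):
    # Algorithm 6: the next VMEntry of a waiting thread is for L2.
    if tid in waiting:
        waiting.discard(tid)
        ready.add(tid)

def on_vmentry_in_L0(tid, vcpu_l1):
    # Algorithm 4, second part: a ready thread resumes the paired
    # L2 vCPU; otherwise this VMEntry targets L1 itself.
    resume(vcpu_l1.nested_vcpu if tid in ready else vcpu_l1)

def on_vmexit_in_L0(tid, vcpu_l1):
    # Algorithm 5: symmetrically, a ready thread stops the L2 vCPU.
    stop(vcpu_l1.nested_vcpu if tid in ready else vcpu_l1)

def on_kvm_nested_vmexit_inject(tid):
    # Algorithm 7: L1 must now handle the exit; the thread stops
    # being ready until the next VMEntry recorded in L1.
    ready.discard(tid)
```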
Containers detection
The main difference between a container and a VM is that the container shares its kernel with its host, while a VM has its own. As a consequence, there is no need to trace a container, since the kernel trace of the host suffices. Furthermore, all the processes in containers are also processes of the host. Knowing if a container is currently running comes down to checking whether the currently running process belongs to that container.
The strategy we propose here is to handle specific events from the kernel traces to detect all the PID namespaces inside a machine. Then, we find the virtual ID (vTID) of each thread contained in a PID namespace.
A kernel trace generated with LTTng contains at least one state dump of the processes. An lttng_statedump_process_state event is created for each thread, and for each of its instances in PID namespaces. Furthermore, as seen in Fig. 8, the payload of the event contains the vTID and the namespace ID (NSID) of the namespace containing the thread.
Figure 9 shows how this information is added to the SHT. The full hierarchy of NSIDs and vTIDs is stored inside the thread's node, to be retrieved later by the view. Moreover, each NSID and its contained threads are stored under its host node. This makes it possible to quickly know in which namespaces a thread is contained and, reciprocally, which threads belong to a namespace.
The analysis also needs to handle the process fork events to detect the creation of a new namespace, or of a new thread inside a namespace. In LTTng, the payload of this event provides the list of vTIDs of the new thread, as well as the NSID of the namespace containing it. Because the new thread's parent process was already handled by a previous process fork or a state dump, the payload combined with the SHT contains enough information to identify all the namespaces and vTIDs of a new thread.
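A sketch of the two handlers follows; the payload key names are illustrative approximations of the actual event fields, and the flat dictionaries simplify the nested-namespace bookkeeping described above:

```python
namespaces = {}   # NSID -> set of host TIDs it contains
vtids = {}        # (host TID, NSID) -> vTID in that namespace

def on_statedump_process_state(payload):
    # One such event exists for each thread and for each instance
    # of it in a PID namespace.
    tid, nsid, vtid = payload["tid"], payload["nsid"], payload["vtid"]
    namespaces.setdefault(nsid, set()).add(tid)
    vtids[(tid, nsid)] = vtid

def on_process_fork(payload):
    # The fork payload lists the child's vTIDs together with the
    # NSIDs of the enclosing namespaces; the parent was already
    # placed in the hierarchy by a previous fork or state dump.
    child = payload["child_tid"]
    for nsid, vtid in zip(payload["nsids"], payload["child_vtids"]):
        namespaces.setdefault(nsid, set()).add(child)
        vtids[(child, nsid)] = vtid
```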
Visualization
After the fused analysis phase, we obtain a structure containing state information about threads, physical CPUs, virtual CPUs, VMs and containers throughout the trace duration. Our intention at this step is to create a view made especially for kernel analysis, able to manipulate all the information about the multiple layers contained inside our SHT. The objective is also to allow the user to see the complete hierarchy of virtualized systems. This view is called the Fused Virtualized Systems (FVS) view.
This view first shows a machine entry representing \(L_0\). Each machine entry of the FVS view can have at most three nodes: a PCPUs node, displaying the physical CPUs used by the machine; a Virtual Machines node, containing an entry for each of the machine's VMs; and a Containers node, displaying one entry for each container. Because VMs are considered as machines, their nodes can contain the three previously mentioned nodes. However, a container will at most contain the PCPUs and Containers nodes. Even if it is possible to launch a VM from a container, we decided to regroup the VMs only under their host's node.
Figure 10 is a high level representation of a multilayered virtualized system. When traced and visualized in the FVS view, the hierarchy can directly be observed, as seen in Fig. 11.
The PCPUs entries display the state of each physical CPU during a tracing session. This state can be idle, running in user space, or running in kernel space, represented respectively in gray, green and blue. However, there is technically no restriction on the number of CPU states, should an extension of the view be needed.
The Resources view is a time graph view in Trace Compass that is also used to analyze a kernel trace. It normally manages different traces separately and doesn’t take into account the multiple layers of virtual execution. Figure 12 shows the difference between the FVS view and the Resources view displaying respectively a fused analysis and a kernel analysis coming from the same set of traces.
In this set, servers 1, 2 and 3 are VMs running on the host. All VMs are trying to take some CPU resources. As intended, the FVS view shows all the traces as a whole, instead of creating separate displays as in the Resources view. The first advantage of this configuration is that we only need to display the physical CPU rows, instead of one row for each CPU, physical or virtual. With this structure, we gain in visibility: the information from the multiple layers is condensed within the rows of the physical CPUs.
To display information about virtual CPUs, VMs and containers, the FVS view asks the data analyzer to extract some information from the SHT. Consequently, for a given time stamp, it is possible to know which process was running on a physical CPU, and on which virtual CPU and VM or container it was running, if the process was not directly executed on the host. Figure 13 shows the displayed tooltip when the cursor is placed on a PCPU entry. These are part of the information used to populate the entry.
We noticed that, in the Resources view, the information is often too condensed. For instance, if several processes are using the CPUs, it can become tedious to distinguish them. This situation is even worse in the FVS view, because more layers come into play. For this reason, we developed a new filter system in Trace Compass that allows developers of time graph views to highlight any part of their view, depending on information contained in their data model.
Using this filter, it is possible to highlight one or more physical or virtual machines, containers, physical or virtual CPUs, or specifically selected processes. In practice, the filter dims what the user does not want to see, as if it were covered with a semi-opaque white band; the selected areas appear highlighted by contrast. Consequently, it is possible to see the execution of a specific machine, container, CPU or process directly in the view.
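Conceptually, such a filter is a predicate over the attributes of the model's state intervals, as the sketch below suggests (the actual Trace Compass implementation operates on time graph entries; the interval attributes here are hypothetical):

```python
def make_filter(machines=None, cpus=None, tids=None):
    """Build a predicate deciding whether a state interval stays fully
    visible (True) or is dimmed behind the white band (False)."""
    def keep(interval):
        return ((machines is None or interval.machine in machines) and
                (cpus is None or interval.cpu in cpus) and
                (tids is None or interval.tid in tids))
    return keep

# Example: highlight only the VM "server1" on physical CPUs 0 and 1.
visible = make_filter(machines={"server1"}, cpus={0, 1})
```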
Figure 14 shows the real execution location of a virtual machine on its host. With this filter, we can distinctively see when the CPU was used by another machine, instead of the highlighted one.
In the FVS view, the states in the PCPUs entries of a virtualized system are a subset of the states visible in the PCPUs entries of the VS's parent. Only the physical host's PCPUs display the full state history. The other entries can be considered as permanent filters dedicated to displaying only a VS and its virtualized subsystems. Figure 15 shows a magnified part of Fig. 11, with all PCPUs nodes expanded. We can see that, combined, they equal the physical PCPUs entries.