Fine-grained preemption analysis for latency investigation across virtual machines
- Cite this article as:
- Gebai, M., Giraldeau, F. & Dagenais, M.R. J Cloud Comp (2014) 3: 23. doi:10.1186/s13677-014-0023-3
This paper studies the preemption between programs running in different virtual machines on the same computer. One of the current monitoring methods consists of updating the average steal time through collaboration with the hypervisor. However, the average is insufficient to diagnose abnormal latencies in time-sensitive applications. Moreover, the added latency is not directly visible from the virtual machine point of view. The main challenge is to recover the cause of preemption of a task running in a virtual machine, whether it is a task on the host computer or in another virtual machine.
We propose a new method to study thread preemption crossing virtual machines boundaries using kernel tracing. The host computer and each monitored virtual machine are traced simultaneously. We developed an efficient and portable trace synchronization method, which is required to account for time offset and drift that occur within each virtual machine. We then devised an algorithm to recover the root cause of preemption between threads at every level. The algorithm successfully detected interactions between multiple competing threads in distinct virtual machines on a multi-core machine.
Keywords: Virtual machine, Tracing, KVM, LTTng, Performance, CPU
Cloud environments present the advantages of increased flexibility and reduced maintenance cost through resource sharing and server consolidation. However, virtual machines (VMs, or guests) on the same host computer may compete for shared resources, introducing undesirable latency. A previous study found that jitter impacts the response time of programs in popular commercial cloud environments. In cloud environments, virtual machines have the illusion of absolute and exclusive control over the physical resources. However, the host's resources are more often than not overcommitted, while they appear to guest operating systems as being more available than they actually are. As a result, virtual machines on the same host computer may interfere with each other without their knowledge, inducing invisible yet real latency.
The diagnosis is more complex when the guest is isolated from its external environment and an additional virtualization layer is introduced. It is therefore necessary to have powerful and efficient tools to diagnose the root cause of unexpected delays at a fine granularity when they occur in a virtualized environment. To our knowledge, no such tool was available.
This study focuses on processor multiplexing across virtual machines. In particular, we are interested in automatically identifying the root cause of task preemption crossing virtual machine boundaries. The approach we propose is based on kernel tracing, which is an effective and efficient way to investigate latency problems. The method we propose consists of aggregating kernel traces recorded simultaneously on the host and each virtual machine. However, more often than not, timekeeping is a task left to each of the operating systems. In such cases, timestamps from different traces are not issued using the same clock reference. As a result, trace merging without an appropriate synchronization method to account for clock differences would produce incoherent results.
The challenge is to consider the system as a whole, while preserving virtual machine isolation. Flexibility and portability constraints are also important for practical considerations. The approach should be independent from the underlying architecture and the operating system to account for portability, whereas flexibility requires independence from the hypervisor and the tracer.
Three main contributions are presented in this paper. First, we propose an approach for trace synchronization. At the trace merging step, we propose an algorithm that modifies timestamps of the guests’ traces to bring them back to the same timespan as the host. Secondly, we implemented an analysis program that transforms aggregated kernel traces to a graphical view that shows the states of the virtual machines and their respective virtual CPUs (vCPUs) through time while taking into consideration virtualization and its impact. Thirdly, we implemented an additional analysis program that presents the interactions of threads across different systems. Such an analysis can be performed by recovering the execution flow centered around a particular thread.
The rest of this paper is structured as follows: Section ‘Related work’ goes through different approaches currently used for virtual machine monitoring. Section ‘Problem statement and definitions’ introduces the required concepts in virtualization and tracing, and states the problem addressed by this paper. Section ‘Trace synchronization’ explains our approach for trace synchronization at the aggregation step. Each of sections ‘Multi-level trace analysis’ and ‘Execution flow recovery’ introduces an analysis module and its inner workings. Section ‘Use cases’ shows some representative use cases and their analysis results. Section ‘Flexibility and portability’ revisits flexibility and portability. Section ‘Conclusion’ concludes.
On Linux kernels supporting paravirtualization, top reports a metric specific to virtual machines, named steal time. This metric shows the percentage of time for which a vCPU of the VM is preempted on the host. While this information can give a general idea or a hint of overcommitment of the CPU, it does not report the actual impact on the running threads nor the root cause of preemption. Additionally, this approach is specific to Linux paravirtualized systems and thus limits portability. Moreover, top adds significant overhead as it gathers information by reading entries in the proc pseudo-filesystem, and offline analysis or replay of the execution flow is not possible.
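To illustrate why an averaged steal-time percentage can hide latency problems, the following sketch computes steal time over a sampling window the way such a metric is typically defined. This is not top's actual implementation; the function and interval representation are illustrative assumptions.

```python
def steal_percent(preempted_intervals, window_start, window_end):
    """Fraction (in %) of a sampling window during which a vCPU was preempted.

    preempted_intervals: list of (start, end) times the vCPU was preempted
    on the host, in the same unit as the window bounds.
    """
    window = window_end - window_start
    stolen = sum(min(end, window_end) - max(start, window_start)
                 for start, end in preempted_intervals
                 if end > window_start and start < window_end)
    return 100.0 * stolen / window

# A single 10 ms preemption burst inside a 1 s window averages out to
# roughly 1% steal time, even though a latency-sensitive task saw a
# 10 ms stall -- which is why the average alone is insufficient.
assert abs(steal_percent([(0.200, 0.210)], 0.0, 1.0) - 1.0) < 1e-6
```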
Perf has been extended to support profiling and tracing specifically for KVM. Using its “kvm” subcommand, one can use Perf to get runtime statistics and metrics about each virtual machine. Common metrics include the number of traps caught by the hypervisor, their cause and the time taken to process each of them. The information reported by Perf also includes CPU time for the guest kernel, host kernel and the hypervisor, which are good indicators of the overhead introduced by virtualization. Perf also reports information from the Performance Monitoring Unit (PMU), a set of counters that keeps track of particular events such as cache misses, TLB misses, CPU cycles, etc. However, these performance counters are not available inside virtual machines. An approach for PMU virtualization has been proposed, which can then be used to monitor the runtime behavior of virtual machines in more detail. Other authors also use Perf for virtual machine profiling and resource utilization analysis. Such methods may also require exporting the symbol table of the guest kernel to the host for symbol resolution. While it is possible to detect performance degradation due to resource sharing among virtual machines, the analysis does not cover detailed fine-grained information about the root causes of preemption. However, the interactions between the virtual machines through the usage of shared resources are essential to understand performance degradations and easily pinpoint their cause in order to remedy them. Finally, the approach using perf kvm depends on both the operating system and the hypervisor, which does not meet the portability requirement.
Shao et al. use an approach based on tracing within Xen to generate useful metrics for virtual machines. Based on scheduling events, latency due to virtual CPU preemption can be easily calculated. Other metrics of interest are also presented, such as the wake-to-schedule time. However, these metrics are mostly useful for analyzing Xen’s scheduler itself. Such an analysis would be less relevant in the case of KVM (or some other hypervisors), as it is an “extension” to the Linux kernel via loadable kernel modules and thus uses its scheduler. Moreover, the impact of the applications running inside the virtual machines on the system as a whole cannot be retrieved from Xen traces. Put differently, perturbations caused by userspace applications across different virtual machines cannot be analyzed or quantified using solely Xen traces.
As for trace synchronization, previous studies have used the TSC (TimeStamp Counter) as a common time reference to address timekeeping and clock drift issues among VMs. The TSC is a CPU register on x86 architectures which counts CPU cycles since the boot of the system (uptime). When read from a virtualized system, the TSC is usually automatically offset in hardware to reflect the uptime of the guest operating system. The value of the offset is specified by the TSC_OFFSET field in the Virtual Machine Control Structure (VMCS). Each VM has its own TSC_OFFSET value, and reading the TSC from different systems always returns a coherent value with respect to their respective uptime. Once traces are recorded on different systems, converting guest TSC values to host TSC values comes down to subtracting the value of TSC_OFFSET from each timestamp. However, the TSC offset may have to be adjusted during the execution of the VM upon certain events, such as virtual machine migration. As a result, TSC_OFFSET adjustments have to be tracked by the tracer at runtime. If tracing is not enabled before the creation of the virtual machine, the initial value of the TSC offset cannot be obtained, unless explicitly requested by the tracer. Additionally, this approach is not robust against lost events, since a TSC adjustment event could be dropped. In any case, even if the TSC were not virtualized and were unique across all systems, synchronization using the TSC does not meet our requirement of portability, as it is an x86-specific register. Moreover, TSC offsetting is specific to hardware-assisted virtualization, thus it cannot be used with other virtualization methods, which does not meet our flexibility requirement.
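The guest-to-host conversion described above reduces to simple arithmetic, sketched below. Since the guest reads host TSC plus TSC_OFFSET, the host value is recovered by subtracting the per-VM offset; the numeric values are hypothetical, and the sketch assumes the offset is known and constant (which, as noted, is not guaranteed in practice).

```python
def guest_tsc_to_host(guest_tsc, tsc_offset):
    """Convert a guest TSC reading back to the host's TSC timeline.

    Hardware offsetting makes the guest observe host_tsc + tsc_offset,
    so subtracting tsc_offset recovers the host value.
    """
    return guest_tsc - tsc_offset

# Hypothetical values: the VM booted 5e9 cycles after the host did,
# so its VMCS holds a negative TSC_OFFSET of that magnitude.
TSC_OFFSET = -5_000_000_000
host_tsc = 12_000_000_000
guest_tsc = host_tsc + TSC_OFFSET        # what the guest actually reads
assert guest_tsc_to_host(guest_tsc, TSC_OFFSET) == host_tsc
```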
Finally, the TSC register only counts CPU cycles since boot time, which is not as meaningful as an absolute wall clock time, especially on computers with a non-constant TSC, where the conversion from TSC to real time would be an additional challenge (the CPU flag constant_tsc can be queried to verify this property).
Problem statement and definitions
Addressed problem and motivation
We noticed that one of the main limitations of current approaches for virtual machine monitoring is the lack of a general approach, which takes into account in-depth analysis of all the involved systems. Most of the monitoring tools are designed to be centered either around the hypervisor or the guest OS. In the former case, only an analysis from the host point of view, abstracted by the virtualization layer, is possible. In the latter case, the analysis is too restricted inside the guest OS and doesn’t consider the outside environment for detecting the root causes of performance degradations.
As mentioned in section ‘Introduction’, investigating latency problems in virtualized systems is a non-trivial task. The isolation of virtual machines from their environments imposes limits on the scope of traditional analysis tools. Moreover, the virtualization layer itself adds overhead due to the involvement of the hypervisor for privileged operations. Furthermore, the assumption of exclusive access to the hardware layer by each virtual machine inevitably induces hidden latency due to the overcommitment of resources, particularly the CPU. As a result, the CPU becomes a scarce resource, which has to be shared among running VMs. As we presented in the previous section, there is no obvious way for a VM to detect runtime perturbation caused by the “outside world”. While a guest OS may perceive one of its processes making full use of the CPU for a certain amount of time, this might not be effectively the case on the actual hardware. Indeed, when a process is allocated a limited amount of vCPU time in a guest OS, it might get deprived of this resource by the host’s scheduler, which might elect a different VM for execution at any moment. Analyzing preemption across virtual machine boundaries (inter-VM) allows the user to detect such perturbations and take actions to remedy them.
In this paper, we explain how we used kernel traces recorded in each VM and on the host simultaneously to investigate such problems. As we present in section ‘Use cases’, the tools resulting from our study help users easily find the cause of latency due to CPU sharing among virtual machines, as well as the actual threads that affect the completion time of a certain workload. However, merging distributed traces is a problem in itself, as each operating system is solely responsible for its own timekeeping. The next sections present the prerequisites needed to understand all of the parts used in our final solution.
CPU vendors introduced extensions at the hardware level which allow for efficient architecture virtualization and overcome the design issues of x86 in this regard. On Intel hardware, this CPU extension is called VMX (Virtual Machine eXtension), while AMD-V is its counterpart from AMD. With hardware-assisted virtualization, the CPU transitions between non-root and root modes. On Intel CPUs, these modes are respectively called VMX non-root and VMX-root. The former is entered when the hypervisor executes the vmentry instruction, giving control to the VM’s native code. The latter is reached when the VM executes an instruction that triggers a trap, called a vmexit. A trap is usually caused by a sensitive instruction, such as writing to a privileged register, and allows the hypervisor to take control of the execution and react to the trapped instruction, usually through emulation. A vmexit can be thought of as a reaction, as opposed to vmentry, which is an actual instruction. Moreover, a data structure called the VMCS (Virtual Machine Control Structure) contains runtime information about a virtual machine. This data structure is used as an interaction mechanism between the VM and the hypervisor (i.e., between non-root and root modes), as well as a way to define behavioral elements, such as enabling or disabling hardware TSC offsetting.
The software that interacts with these hardware extensions is called a hypervisor. KVM is an example of such software and is included in Linux as a loadable kernel module. Its role is to exploit and manage the virtualization capabilities of the hardware, and provide easy access to these capabilities to any userspace component via the ioctl interface. As a result, many userspace emulators can be built atop KVM without reimplementing hardware-specific functionalities. We use QEMU as this userspace component that interacts with KVM to take advantage of hardware assistance. Moreover, as KVM is an extension to the Linux kernel, it can take advantage of its basic functionalities, such as scheduling, NUMA node management, and even its tracing infrastructure. Thus, KVM is instrumented with tracepoints which can be traced using any kernel tracer. In QEMU/KVM, each virtual machine is a QEMU process, and each of its virtual CPUs (vCPUs) is emulated by a separate thread that belongs to that process.
In this article, the terms hypervisor and VMM (Virtual Machine Monitor) will be used interchangeably. The same applies for the terms VM, guest system and virtualized system.
We implemented our trace analysis algorithm using the Trace Compass trace viewer (previously TMF - Tracing and Monitoring Framework). Trace Compass is a free and open-source tool for viewing traces in different graphical views. Views are usually designed for a specific kind of analysis. The most common views in Trace Compass are the Control Flow view and the Resources view. The former shows the states of all threads on a system throughout the tracing session (Running, Idle, Preempted, Blocked), whereas the latter shows the states of different resources such as the CPUs and IRQ lines. This project resulted in two additional views integrated into Trace Compass, which can be used for virtual machine runtime analysis of inter-VM preemption.
Trace Compass indexes the trace using a State History Tree (SHT). The SHT represents the state of the whole system, and is updated at each event to define time intervals. This index allows efficient stabbing queries, returning the complete state of the system at a given time. A node of the tree is a key-value pair, where the key is a path component, and the value is an attribute associated with a duration, which gets updated as the trace is being processed. The rules by which attributes are updated are established by our algorithm presented in section ‘Multi-level trace analysis’.
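The query contract of the SHT can be sketched with a flat interval list. The real SHT is a disk-backed tree optimized for large traces, but a stabbing query has the same semantics: given a time t, return the value of every attribute at t. The attribute paths below follow the naming used later in this paper; the data is illustrative.

```python
# Each record: (start, end, attribute path, value), as produced while
# processing the trace. A minimal in-memory stand-in for the SHT.
intervals = [
    (0, 10, "/Host/Threads/1234/Status", "RUNNING"),
    (10, 25, "/Host/Threads/1234/Status", "PREEMPTED"),
    (0, 25, "/Virtual Machines/vm0/vCPUs/0/Status", "VMM"),
]

def stabbing_query(intervals, t):
    """Return {attribute: value} for every interval containing time t."""
    return {attr: val for start, end, attr, val in intervals
            if start <= t < end}

state = stabbing_query(intervals, 12)
assert state["/Host/Threads/1234/Status"] == "PREEMPTED"
assert state["/Virtual Machines/vm0/vCPUs/0/Status"] == "VMM"
```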
In this study, we use LTTng as a kernel tracer. LTTng was designed for high-throughput tracing while reducing as much as possible its impact on the traced system. We now introduce the key tracepoints for our analysis. We present their significance, as well as the content of their respective payload. This section is complementary to section ‘Multi-level trace analysis’ which explains how these tracepoints are used to update the SHT.
The sched_switch tracepoint indicates a context switch on the CPU which recorded the event. Useful payload fields are the names and the TIDs (Thread Identifiers) of the former and new threads involved in the context switch. Since all events are timestamped using the system’s time at the nanosecond scale, the amount of time spent on each CPU by a specific thread is easily computed by subtracting the timestamps of the sched_switch events involving that thread.
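The per-thread CPU time computation just described can be sketched as follows, for a single CPU. The event tuples are a simplified stand-in for the sched_switch payload (timestamp, outgoing TID, incoming TID); a thread's CPU time is the sum of the gaps between each switch-in and the following switch-out.

```python
from collections import defaultdict

def cpu_time_per_thread(events):
    """Accumulate CPU time per TID from time-ordered sched_switch events.

    events: list of (timestamp_ns, prev_tid, next_tid) for one CPU.
    """
    cpu_time = defaultdict(int)
    last_switch_in = {}
    for ts, prev_tid, next_tid in events:
        if prev_tid in last_switch_in:
            # prev_tid ran from its last switch-in until this event.
            cpu_time[prev_tid] += ts - last_switch_in.pop(prev_tid)
        last_switch_in[next_tid] = ts
    return dict(cpu_time)

# TID 42 runs 100->350 and 500->900; TID 7 runs 350->500 (ns).
events = [(100, 0, 42), (350, 42, 7), (500, 7, 42), (900, 42, 0)]
assert cpu_time_per_thread(events) == {42: 650, 7: 150}
```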
Tracepoint sched_migrate_task indicates the migration of a thread from one CPU to another. Its payload holds the TID of the migrated task, as well as the origin and the destination CPU identifiers. Tracepoint sched_process_fork indicates the creation of a new process, and exposes the names, PIDs (Process Identifiers) and TIDs of the newly created process as well as its parent’s. Its complementary event, sched_process_exit, records the end of life of a thread. The payload contains the name and TID of the process.
VMX mode transitions by KVM can be tracked by enabling the kvm_entry and kvm_exit events. Tracepoint kvm_entry indicates a transition from root to non-root mode, and thus the beginning of the execution of the VM’s native code. On the other hand, tracepoint kvm_exit indicates the opposite transition, which interrupts the execution of the VM and gives control to KVM. Elapsed time between consecutive kvm_exit and kvm_entry events represents overhead introduced by the hypervisor.
The system’s monotonic clock can thus be expressed as t = T + f(TSC), where T is a coarse-grained value updated on system timer interrupts. For finer timekeeping, T needs to be adjusted using the TSC to account for the time elapsed since the last update (last timer interrupt). This is done using the function f(), which translates the TSC value into an actual time value that can be used for fine-grained timekeeping.
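The timekeeping scheme just described can be sketched in a few lines. The 3 GHz frequency, the integer nanosecond arithmetic, and the function names are illustrative assumptions, not the kernel's actual implementation.

```python
# Hypothetical constant-rate TSC frequency for the sketch.
TSC_HZ = 3_000_000_000

def f(cycles):
    """Translate elapsed TSC cycles into nanoseconds."""
    return cycles * 1_000_000_000 // TSC_HZ

def monotonic_ns(T_ns, tsc_now, tsc_at_last_tick):
    """t = T + f(TSC elapsed since the last timer interrupt)."""
    return T_ns + f(tsc_now - tsc_at_last_tick)

# 6,000,000 cycles at 3 GHz is 2 ms past the last tick, so a coarse
# value T of 1 s yields a fine-grained reading of 1.002 s.
assert monotonic_ns(1_000_000_000, 9_006_000_000, 9_000_000_000) == 1_002_000_000
```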
In addition, the monotonic clock guarantees total ordering, even in the case of modification of the system’s wall clock time while tracing, and therefore is an ideal source for event timestamps.
Although the TSC is paced at the same rate across the different virtual systems, the offset values T of each system are not, and are thus subject to drifting apart as time goes by. In fact, modern tickless operating systems disable timer interrupts on idle processors to reduce energy consumption. As a result, the update period of T is variable, which may contribute to increasing the time difference between systems. Furthermore, virtual machines may be set to different timezones, introducing even more incoherent timestamping when traces are merged together, which would make them appear as being recorded at different moments. As a result, high precision timestamping and clock drifting do not allow for simple clock offsetting to ensure coherency between traces, and require a specific synchronization method. The next section presents our approach to ensure coherent trace merging.
- Causality: a must (quickly) trigger b;
- Bijection: a and b must share a common and unique key k in their payloads;
- Every event b must be matched to at most one event a (one-to-one); unmatched events b are ignored.
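The matching requirements above can be sketched as follows. This is only the pairing step that feeds the convex hull algorithm, not the algorithm itself; the event representation is an illustrative assumption.

```python
def match_events(a_events, b_events):
    """Pair events a with events b by their shared key k.

    a_events, b_events: lists of (timestamp, key) tuples.
    Enforces causality (a must precede b) and a one-to-one mapping;
    events b with no matching a are ignored.
    """
    pending = {}
    for ts, key in a_events:
        pending.setdefault(key, ts)        # at most one a per key
    pairs = []
    for ts, key in b_events:
        a_ts = pending.pop(key, None)      # each b matches at most one a
        if a_ts is not None and a_ts <= ts:
            pairs.append((a_ts, ts, key))
    return pairs

a = [(10, "k1"), (30, "k2")]
b = [(12, "k1"), (31, "k2"), (40, "k3")]   # k3 has no matching a: ignored
assert match_events(a, b) == [(10, 12, "k1"), (30, 31, "k2")]
```

Each resulting (a, b) pair bounds the clock offset from one direction; the fully incremental convex hull algorithm then derives the synchronization function from these bounds.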
Implementation in virtualized systems
Previous section ‘Event matching’ explained the theory of the fully incremental convex hull algorithm for trace synchronization. However, the requirements for this algorithm are not directly met in the case of virtual machines. Originally, the algorithm was built based on TCP packet exchange events, where send and receive events are respectively a and b: send triggers receive and the TCP packet number is the key k. In Cloud environments, virtual machines do not necessarily exchange TCP packets with each other or with the host, as VMs are usually provided to different clients. As a result, we need to customize the setup of the virtual machines to generate events both in the VMs and on the host that would respect the requirements established earlier. This section introduces our approach to obtain events that can be used to achieve trace synchronization.
We added tracepoints to the kernel through a loadable module for flexibility, so no modification to kernel code would be required to perform trace synchronization. Upon loading, this module registers a probe to the system timer’s interrupt. In other words, every time the system timer issues an interrupt to the CPU, our synchronization routine will be invoked. The synchronization routine can be summed up as follows:
Guest: Trigger hypercall (event a)
Host: Acknowledge hypercall (event b)
Host: Give control back to the VM (event c)
Guest: Acknowledge control (event d)
The first pair of events (a,b) can be simulated by issuing a hypercall. When executing the vmcall instruction from the guest OS (event a), a trap is generated by the CPU and control is given to the hypervisor, which in turn acknowledges the trap (event b). A counter X is passed to the host OS as a parameter to the hypercall. This parameter will serve as the shared key required by the synchronization algorithm. As a result, events a and b are both recorded in a short period of time on the guest and host OS respectively, both holding the same value X as their payload.
Simulating the pair of events (c,d) is not as trivial since different constraints are imposed on the host-to-guest communication, as no mechanism of parameter transmission is easily accessible. Implementing shared memory between the guest and host is too intrusive as it would add too much complexity to both systems, and would probably require modification to both kernels. However, the trap generated by the hypercall is virtually invisible to the guest OS, which continues execution “normally” after involvement from the hypervisor. We can take advantage of this property to simulate a parameter transmission when the hypercall handling returns. Event c is recorded on the host right before it finishes the synchronization routine and gives control back to the VM. Event d is recorded on the guest right after the hypercall, which effectively is as soon as the guest OS resumes execution. This model simulates property (1) as c indicates that the host is giving control to the VM and d represents its acknowledgement. Both these events hold X+1 in their payloads to respect the one-to-one relationship.
Overhead induced by the hypercall
One synchronization tracepoint
Once traces are generated on both systems, the fully incremental convex hull algorithm is applied, which derives a synchronization function applied on all of the guest’s timestamps. This approach is resistant to clock drifts as the convex hull algorithm considers this issue and compensates for it in the generated formula. Additionally, it does not require TSC_OFFSET tracking or any other architecture-specific configuration.
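Applying the derived synchronization function can be sketched as below: the convex hull algorithm yields a linear relation between the guest and host clocks, and every guest timestamp is rewritten into the host's timespan. The drift (slope) and offset values here are illustrative, not derived from real traces.

```python
def make_sync_fn(drift, offset_ns):
    """Return f(t_guest) = drift * t_guest + offset_ns (host time, ns).

    drift compensates the clock rate difference; offset_ns compensates
    the initial clock offset between guest and host.
    """
    return lambda t: int(round(drift * t)) + offset_ns

# Hypothetical output of the convex hull algorithm: the guest clock runs
# 1 ppm fast and started 5 s behind the host.
sync = make_sync_fn(drift=1.000001, offset_ns=5_000_000_000)

guest_timestamps = [1_000_000_000, 2_000_000_000]
host_timestamps = [sync(t) for t in guest_timestamps]
assert host_timestamps[0] == 6_000_001_000
assert host_timestamps[1] == 7_000_002_000
```

Because the function is monotonic, rewriting timestamps preserves the event order within the guest trace while making it comparable with the host trace.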
To show the results of our trace synchronization algorithm, we traced simultaneously a running virtual machine and its host. We then merged the traces recorded from both systems and used Trace Compass to view the result.
(Figure: color legend for the Virtual Machine Analysis view.)
Multi-level trace analysis
For clarity, we introduce the term pCPU, which designates a physical CPU, as opposed to a vCPU, which is in reality a QEMU thread emulating the CPU of a VM.
Virtual CPU states
State VMM represents the state where a QEMU thread is running hypervisor code instead of virtual machine code. In other words, it represents participation or involvement from the VMM, such as providing emulation, injecting an interrupt into the guest’s OS, or any other operation requiring the external help of KVM. As explained in section ‘Hypervisor’, CPU transitions between non-root and root modes are instrumented with tracepoints kvm_entry and kvm_exit respectively. When a kvm_exit event is encountered, the vCPU’s state is set to VMM (transition 2). On the other hand, it leaves this state on a kvm_entry event, returning to the state it was in prior to the hypervisor’s involvement (transition 1). This state serves as an intermediate between any two states, as hypervisor cooperation is required for QEMU thread scheduling.
We also noticed that this state is reached every time a QEMU thread is involved in a context switch, i.e., when a vCPU is scheduled out of a pCPU. Interestingly, when a QEMU thread is selected by the scheduler to run again, it first executes in the VMM state before explicitly invoking the vmentry instruction to give control to the guest’s OS. This procedure is required because KVM needs to execute specific operations related to the Virtual Machine Control Structure of the VM. KVM uses Linux’s notifier chains to “register” on context switches involving a vCPU. When a vCPU reaches this state, its current thread’s status is set to PROCESS_VIRT_PREEMPTED, which designates wasted time due to the virtualization layer (we see it as preemption to execute hypervisor code; the thread is marked as “virtually preempted”).
RUNNING shows execution of the VM’s code. When in this state, a virtual CPU is considered as running without any involvement from the hypervisor, and instructions dedicated to a specific vCPU are running directly on one of the host’s pCPUs. For this state to be reached, two conditions must be satisfied. First, the QEMU thread emulating a vCPU must be in a running state on the host operating system. Secondly, in the guest operating system, the vCPU associated with the specified QEMU thread must be in the running state as well, meaning that a process other than the idle task (swapper) is executing on the CPU.
IDLE represents a state where a vCPU is not executing any code, and thus voluntarily yields the physical CPU. This state is reached when the QEMU thread emulating a vCPU is scheduled out of a pCPU, and no thread other than the idle task is scheduled to run on this vCPU in the guest OS (transition 4). On Linux, the purpose of swapper (the idle task) is to invoke the scheduler to choose potential threads ready for execution, or to halt the CPU in case no thread is ready to run. The vCPU goes out of this state as soon as the thread emulating it gets scheduled back on the host (transition 3).
PREEMPTED is the state that indicates direct latency to the execution of a virtual machine. This state is reached when a vCPU is scheduled out of the pCPU by the host’s scheduler (transition 5), while the vCPU was effectively serving a thread. Note that the running process on the vCPU stays in the PROCESS_VIRT_PREEMPTED state, which indicates that the vCPU on which the thread is running was preempted on the host operating system. Usually, this kind of information is not visible to a virtual machine, though it directly impacts the completion time of a task by introducing delays throughout the execution. As a result, a task may appear to take much longer than the effective time during which it was running. When scheduled back in (transition 6), the vCPU passes by the VMM state again to finally reach the RUNNING state and resume VM code execution.
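The state transitions described above can be condensed into a toy state machine. The event names and the simplified transition logic are illustrative; the actual analysis derives these transitions from the combined host and guest traces via the SHT.

```python
class VCpuState:
    """Toy vCPU state machine: RUNNING, VMM, IDLE, PREEMPTED."""

    def __init__(self):
        self.state = "IDLE"
        self.saved = "RUNNING"   # state to restore on the next kvm_entry

    def on_event(self, name, guest_busy=True):
        if name == "kvm_exit":       # transition 2: VM code -> hypervisor
            self.saved, self.state = self.state, "VMM"
        elif name == "kvm_entry":    # transition 1: back to the prior state
            self.state = self.saved
        elif name == "sched_out":    # transitions 4/5: host schedules out
            self.state = "IDLE" if not guest_busy else "PREEMPTED"
        elif name == "sched_in":     # transitions 3/6: resumes through VMM
            self.state = "VMM"

v = VCpuState()
for ev in ("sched_in", "kvm_entry", "kvm_exit"):
    v.on_event(ev)
assert v.state == "VMM"              # hypervisor involvement
v.on_event("sched_out", guest_busy=True)
assert v.state == "PREEMPTED"        # host scheduled the busy vCPU out
```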
The reason top reported a 0% vCPU preemption (steal time) before starting the Fibonacci task is that the vCPU was mostly idle. As a result, when it asks for CPU time, its request is immediately granted by the host’s scheduler, as it has the “highest priority” due to its idle nature. We can see that using such a tool to measure resource availability can actually be misleading. The only way to detect vCPU preemption using top would be to actively monitor the steal time while running the Fibonacci task.
Execution flow recovery
We now reach the second part of the analysis, which is to reconstruct the execution flow for a specific task of one of the virtual machines. The execution flow with regard to a certain task A is defined as the ordered set of execution intervals of all the tasks affecting the completion time of A. The purpose of the execution flow is to show detailed information about the execution of a certain thread as well as its interactions with other threads.
For a task executing inside a virtual machine, the computation of the execution flow should be adjusted to take into consideration interactions between different operating systems through the usage of shared resources. The objective of such an analysis is to provide detailed information not only about the execution of a certain task, but also about its interactions with other threads, whether they belong to the same VM, the host, or even a different VM. With such information, major causes of overhead can be easily tracked down by the host’s administrator, and adjustments can be made to resolve the issues. Recovering the execution flow comes down to tracking all preemption events involving A. Causes of preemption can be within the same operating system, and thus easier to investigate, or from a different system, making them almost completely hidden. In this section, we show that the execution flow recovery can be computed simply by querying the state system for key attribute modifications, without having to read the trace again.
In this section, we call A the thread around which we want to recover the execution flow. The first step of the algorithm is to find all the entries involved in the execution flow according to task A. In Figure 8, each of tasks A, B and C represents an entry. First, all the threads of all the systems are inserted as entries in the execution flow. This list of all the threads across systems can be recovered by parsing through attributes “/Virtual Machines/*/Threads/*” and “/Host/Threads/*” in the SHT. The second step of the algorithm is to compute the execution intervals of each entry with regard to task A. As a final step, we remove the entries that have minimal or no impact on the analyzed thread according to a minimal impact threshold. The selection of the threshold doesn’t affect the computing time of the algorithm, as it is performed only after the whole algorithm has executed and all the durations have been computed. The impact of each thread can be measured using Equation 3, as explained in section ‘Investigation of execution anomalies’.
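The final filtering step can be sketched as follows. The threshold semantics here (a fraction of the largest entry's total duration) are an assumption for illustration; the paper's actual impact measure is given by Equation 3.

```python
def filter_entries(durations, threshold=0.05):
    """Drop entries with negligible impact on the analyzed thread.

    durations: {thread: total time (ns) impacting A's completion}.
    Keeps entries whose duration is at least `threshold` times the
    largest entry (illustrative criterion).
    """
    if not durations:
        return {}
    top = max(durations.values())
    return {t: d for t, d in durations.items() if d >= threshold * top}

durations = {"A": 900_000, "B": 400_000, "C": 2_000}   # C: negligible
assert filter_entries(durations) == {"A": 900_000, "B": 400_000}
```

Since filtering runs once on the final duration table, varying the threshold to explore the result costs nothing compared to recomputing the flow.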
To respect the relationship of affiliation between a thread and its system (host or VM), entries are stored in a tree-like structure with a depth of 2, where each node on the first level represents a system, and its children on the second level represent its threads.
As mentioned earlier, the execution flow can be represented with an ordered list of intervals, where each interval contains a start time, end time, a state, and the TID of the thread executing for the said interval. Algorithms 1 and 2 explain how this list can be built, recovering the execution flow with regard to task A.
Algorithm 1 is used to insert all intervals of A holding the “RUNNING” state. As a first step, we query the SHT to retrieve all modifications to the “Status” attribute of thread A. The SHT returns a list of intervals for different values of this attribute. Algorithm 1 parses this list, and each interval holding the “RUNNING” state is directly inserted in the result list. However, for each interval holding the “PREEMPTED” value, a separate function is invoked to find which thread is preempting A. This function is shown in Algorithm 2.
The resolve function requires an interval as well as a pCPU Id as input values. The work done by this routine is to find which threads are running on the pCPU for the duration of the interval. In the case where the running thread is the vCPU of another VM, this function will then query the state system to get the running thread inside this VM. Once the thread preempting A is deduced, it is returned to Algorithm 1, which inserts it in the result list.
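The two algorithms can be condensed into the following sketch. The state-system interface here (`query_intervals`, `running_on_pcpu`, `is_vcpu`, `running_in_vm`) is a hypothetical in-memory stand-in for the SHT queries, not the actual Trace Compass API:

```python
from collections import namedtuple

# (start, end, state, tid): one interval of the recovered execution flow.
Interval = namedtuple("Interval", "start end state tid")
StatusIv = namedtuple("StatusIv", "start end state pcpu")
RunIv = namedtuple("RunIv", "start end tid")

class FakeStateSystem:
    """Hypothetical stand-in for the SHT query interface."""
    def __init__(self, status, pcpu_runs, vcpu_guests):
        self._status = status        # tid -> [StatusIv] of its Status attribute
        self._pcpu = pcpu_runs       # pCPU id -> [RunIv] of threads running on it
        self._guests = vcpu_guests   # vCPU tid -> guest thread running in that VM

    def query_intervals(self, tid, attribute):
        return self._status[tid]

    def running_on_pcpu(self, pcpu, start, end):
        return [r for r in self._pcpu[pcpu] if r.start < end and r.end > start]

    def is_vcpu(self, tid):
        return str(tid).startswith("vcpu")

    def running_in_vm(self, vcpu_tid, start, end):
        return self._guests[vcpu_tid]

def recover_flow(ss, a_tid):
    """Algorithm 1: keep A's RUNNING intervals, resolve its PREEMPTED ones."""
    result = []
    for iv in ss.query_intervals(a_tid, "Status"):
        if iv.state == "RUNNING":
            result.append(Interval(iv.start, iv.end, "RUNNING", a_tid))
        elif iv.state == "PREEMPTED":
            result.extend(resolve(ss, iv, iv.pcpu))
    return result

def resolve(ss, interval, pcpu):
    """Algorithm 2: find which thread(s) occupied A's pCPU during the interval."""
    out = []
    for run in ss.running_on_pcpu(pcpu, interval.start, interval.end):
        tid = run.tid
        if ss.is_vcpu(tid):
            # The preemptor is another VM's vCPU: query that VM's state
            # to name the guest thread actually running inside it.
            tid = ss.running_in_vm(tid, run.start, run.end)
        out.append(Interval(run.start, run.end, "PREEMPTED", tid))
    return out

ss = FakeStateSystem(
    status={"A": [StatusIv(0, 5, "RUNNING", 0),
                  StatusIv(5, 9, "PREEMPTED", 0),
                  StatusIv(9, 12, "RUNNING", 0)]},
    pcpu_runs={0: [RunIv(5, 9, "vcpu1")]},
    vcpu_guests={"vcpu1": "B"},
)
flow = recover_flow(ss, "A")
# A runs from 0 to 5, is preempted from 5 to 9 by thread B of another VM
# (through its vCPU), then runs again from 9 to 12
```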
Here, prevInterval and nextInterval are respectively the previous and next intervals of interval in the ordered list result from Algorithm 1.
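One plausible use of these neighbouring intervals (an assumption on our part, not necessarily the paper’s exact rule) is coalescing contiguous intervals of the result list that share the same state and thread:

```python
# Coalesce contiguous intervals in the ordered result list that share the
# same state and TID, using each interval's predecessor in the list.

def coalesce(result):
    """result: [(start, end, state, tid)] sorted by start time."""
    merged = []
    for iv in result:
        if merged and merged[-1][1] == iv[0] and merged[-1][2:] == iv[2:]:
            prev = merged.pop()
            merged.append((prev[0], iv[1], iv[2], iv[3]))
        else:
            merged.append(iv)
    return merged

print(coalesce([(0, 3, "RUNNING", "A"),
                (3, 7, "RUNNING", "A"),
                (7, 9, "PREEMPTED", "B")]))
# [(0, 7, 'RUNNING', 'A'), (7, 9, 'PREEMPTED', 'B')]
```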
This section shows how our work can be used in real-life cases to investigate latency in virtual machines. We start with a follow-up on the Fibonacci example introduced earlier (section ‘Follow-up on the Fibonacci Case’). We then present different use cases that show either how to investigate a known issue (section ‘Investigation of execution anomalies’), or a general analysis to verify normal execution of the system as a whole (section ‘Investigating a residual timer’).
Follow-up on the Fibonacci Case
Investigation of execution anomalies
We can see that the critical task ran for 274 ms (as expected, although over a longer period), the Ubuntu virtual machine ran for 270 ms, and the host ran for 260 ms. These numbers indicate approximately a 33% usage of the CPU for each system, which shows that the pCPU is strongly shared amongst them. Moreover, we see that process irq/46-iwlwifi executes for 296 us, indicating heavy network usage and packet processing.
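A quick sanity check of the shares quoted above, assuming the three runtimes together approximate the observation window (about 804 ms here):

```python
# Verify that each system's runtime is roughly a third of the combined
# duration. The "window = sum of runtimes" approximation is ours.

durations_ms = {"critical task": 274, "Ubuntu VM": 270, "host": 260}
window_ms = sum(durations_ms.values())            # 804 ms
shares = {k: 100 * v / window_ms for k, v in durations_ms.items()}
for name, pct in shares.items():
    print(f"{name}: {pct:.0f}%")                  # each roughly a third
```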
Investigating a residual timer
Flexibility and portability
Throughout this project, we set different constraints to ensure a portable and flexible solution to our initial problem. First, we used the State History Tree as an abstraction for the traces. The SHT not only delivers performance enhancement for event querying in the trace, but also allows dissociating the analysis step from the trace itself. In other words, multiple trace parsers can be used to handle the kernel traces, regardless of the operating system on which they were recorded, or the tracer used, as long as the trace format is open. As long as the backend used for trace representation is the SHT, and given that the required events are reported in the trace, our proposed algorithms will produce the expected results, which accounts for both portability (independent from the OS) and flexibility (independent from the tracers and trace formats). It is worth mentioning that although we used LTTng as a kernel tracer for this project, any other kernel tracer could potentially have been used, as long as a trace parser is available.
Moreover, as we explained in previous sections, although a TSC-based approach for trace synchronization is potentially simpler to use under certain conditions, the TSC is an x86-specific CPU register. Using a higher-level algorithm such as the fully incremental convex hull algorithm provides portability to the synchronization solution. And, although using hypercalls as a communication mechanism between the host and its guests is specific to hardware virtualization, any pair of events across systems with a causality relation can be a potential replacement, such as a network packet exchange.
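The clock model that synchronization must estimate can be illustrated with a simpler sketch. This is not the fully incremental convex hull algorithm itself (which bounds the transformation using causality inequalities); it is a least-squares fit of the same linear drift-and-offset model over matched event pairs, e.g. a hypercall emitted in the guest and handled on the host:

```python
# Fit host = drift * guest + offset over matched (guest_ts, host_ts) event
# pairs. A simplification of the convex hull synchronization, for intuition.

def fit_clock(pairs):
    """pairs: [(guest_ts, host_ts)] for causally matched events."""
    n = len(pairs)
    sx = sum(g for g, _ in pairs)
    sy = sum(h for _, h in pairs)
    sxx = sum(g * g for g, _ in pairs)
    sxy = sum(g * h for g, h in pairs)
    drift = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    offset = (sy - drift * sx) / n
    return drift, offset

drift, offset = fit_clock([(0, 1000), (100, 1102), (200, 1204)])
# here the guest clock runs ~2% slow and starts 1000 units behind the host
```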
Finally, it is worth mentioning that the only KVM-specific tracepoints are kvm_entry and kvm_exit, which represent VMX mode transitions. Since these transitions are common to all hypervisors supporting hardware-assisted virtualization, our approach is therefore not specific to KVM. And, although these tracepoints are already included in Linux’s source tree, they can be added in any hypervisor by simply instrumenting all calls to the vmentry and vmexit instructions, which requires very little effort, thus allowing this model to be used with any hypervisor. Moreover, if the administrator chooses not to instrument these transitions, little information would be lost, as the only state lost in Figure 5 would be VMM. Preemption and execution recovery would still be possible with little analysis precision lost (hypervisor involvement would account as effective CPU time instead of overhead due to virtualization). Furthermore, kernel traces generated from other operating systems can be used as well with minimal effort. As long as the events required to cover the FSM presented in section ‘Multi-level trace analysis’ are available, the model can be ported by simply specifying the names of these events. Moreover, in the case of microcomputers without hardware virtualization, the synchronization approach could potentially be extended to any other type of communication between the guest and the host, such as a TCP packet exchange. The rest of the analysis is based on the state system built, and thus does not depend on the details of the underlying traces.
Cloud computing and virtualization are evolving at a rapid pace. These emerging technologies created a need for analysis tools that can live up to the technological advance. In this paper, we showed that kernel tracing can be used to analyze the execution of virtual machines under such conditions. We first proposed an approach to resolve the problem of clock drift and offset between operating systems. We then showed how the merged traces can be processed to rebuild the state of the virtual machines, as well as their vCPUs, throughout the trace. Finally, we explained how the execution flow with regard to a certain thread can be rebuilt for an in-depth analysis of its execution and interactions with other systems. All the solutions proposed in this paper were designed with requirements of portability and flexibility in mind. As a result, all the approaches explained are portable across operating systems, computer architectures, and complementary software (tracer and hypervisor).
The authors would like to thank Ericsson for their input to this study as well as for funding this research project. We also thank Geneviève Bastien for her help in the Open Source project Trace Compass and Naser Ezzati Jivan for reviewing this paper.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.