1 Introduction

Cloud computing has revolutionized the way services are delivered and managed by enabling the dynamic allocation of computing resources. Live migration [1, 2] has emerged as a vital mechanism in cloud computing, allowing the dynamic relocation of virtual machines (VMs) between physical hosts. It plays a pivotal role in resource management, load balancing, and system maintenance. The traditional method of live migration, known as pre-copy migration [2], has been widely adopted due to its effectiveness in preserving the integrity of the VM’s state during transfer. However, during periods of high workload, pre-copy migration faces significant challenges. The increased rate of dirty pages [3], caused by frequent memory modifications, necessitates additional iterations to ensure a complete and accurate transfer. As a consequence, downtime and total migration time are prolonged, adversely affecting the services operating on the cloud platform.

Recognizing the negative impact of extended downtime and service delays, numerous research efforts have focused on reducing the drawbacks associated with pre-copy migration. A variety of techniques have been proposed to improve the efficiency of live migration, but challenges persist. Traditional approaches, such as those implemented in the KVM hypervisor [4, 5], employ static stopping conditions for the pre-copy phase. Unfortunately, these static conditions often result in prolonged total migration times when VMs are running memory-intensive workloads.

To address this challenge, we previously proposed a comprehensive three-stage approach that integrates feature selection, machine learning (ML) model generation, and the application of the ML model to pre-copy migration [6]. This methodology leverages the power of ML algorithms to analyze various system metrics and make informed predictions about the optimal time to initiate the migration process. Our earlier work [6] successfully demonstrated the efficacy of our approach through simulation-based experiments using CloudSim [7], a widely used cloud simulation framework.

The necessity of our proposed approach lies in the evolving nature of cloud workloads and the critical need to minimize downtime and migration time for seamless service delivery [8,9,10]. Recent research challenges in live VM migration [11, 12] highlight the complexities faced in achieving continuous services. By integrating ML into the live migration process, we aim to offer a more adaptive solution to the dynamic demands of modern cloud computing. These considerations emphasize the importance of minimizing downtime and migration time, aligning with our system’s goal of providing a seamless and adaptive solution to the evolving demands of cloud computing.

Building upon our previous work [6], we implemented and validated our proposed approach in real-world scenarios. By transitioning from simulation to actual hardware setups, we assessed the practical feasibility and performance of our approach. Furthermore, we conducted an in-depth comparative analysis, benchmarking our methodology against existing techniques to demonstrate its efficiency and effectiveness. This paper provides a detailed evaluation study of our approach in real-life settings, demonstrating its efficacy and performance compared to existing works.

The main contributions of this paper are as follows:

  • KNN-based model for optimal migration time prediction: We introduced a machine learning model based on the K-nearest neighbors (KNN) algorithm, where we leveraged identified features to predict optimal migration times for pre-copy migration. Our model achieved high accuracy and adaptability, representing a novel contribution that enhanced migration time prediction.

  • Deployment of ML model in pre-copy migration: We integrated our algorithm into traditional migration methods, resulting in minimized service delays, downtime, and total migration time. This deployment showcased the practical effectiveness of our machine learning model, representing a valuable enhancement to existing migration techniques.

  • Validation and performance evaluation: Based on a real-life KVM testbed and data, we carried out thorough experiments and tests to validate our solution. We demonstrated the added value and significance of our feature selection algorithm and thoroughly evaluated our pre-copy migration approach, demonstrating its superiority over the traditional pre-copy method in reducing downtime and migration time across various RAM configurations, particularly in high-write-intensive scenarios. Additionally, we assessed scalability across different RAM sizes (1GB, 2GB, and 4GB) and compared our results with the traditional pre-copy migration algorithm, showcasing the practicality and adaptability of our approach. The obtained results show that our method consistently outperformed traditional KVM pre-copy migration, offering notable improvements in migration efficiency and flexibility across different workloads and virtual machine configurations.

The remainder of this paper is organized as follows: Sect. 2 provides a brief overview of pre-copy migration in KVM and related work in the field of pre-copy migration. Section 3 outlines the three-stage approach proposed in our previous work and elaborates on our novel implementation of applying the ML model in a real hardware setup, discussing the identified challenges, experimental design, and measurement metrics. Section 4 describes the experimental testbed and presents the results and analysis, showcasing the performance of our approach compared to the traditional pre-copy method. Section 5 discusses the observations made during the experiment. Finally, Sect. 6 concludes the paper, summarizing the key findings and outlining future directions for research in pre-copy migration optimization.

2 Background and related works

2.1 Virtualization technologies

Various virtualization technologies, each distinguished by unique features and advantages, play an important role in cloud computing environments. This section explains KVM, QEMU, Xen, and VMware, shedding light on their characteristics and contributions to virtualization.

KVM, or Kernel-based Virtual Machine [13], stands out as a Linux kernel module that transforms the host OS into a Type 1 hypervisor, supporting the concurrent execution of multiple VMs. A Type 1 hypervisor operates directly on the hardware, providing optimal performance and resource utilization for virtual machines. Leveraging hardware virtualization extensions such as Intel VT or AMD-V, KVM enables efficient and secure virtualization. Noteworthy features include live migration, facilitated by tools like virt-manager and virsh, making it suitable for dynamic environments.

QEMU, the Quick Emulator [14], is an open-source emulator and virtualizer frequently paired with KVM on Linux systems. Capable of emulating diverse hardware components, QEMU complements KVM’s capabilities and enhances virtualization performance. The combination of KVM and QEMU ensures low overhead and high performance, making it popular in data centers and cloud environments.

In addition to KVM and QEMU, the virtualization landscape includes Xen and VMware. Xen [13, 15], operating as a lightweight Type 1 hypervisor, is recognized for its distinctive paravirtualization approach, in which the guest OS is modified to replace privileged instructions with direct calls (hypercalls) into the hypervisor. The Xen hypervisor comprises two components: the core hypervisor, which handles CPU scheduling, memory, and power management, and Domain0 (dom0), a privileged virtual machine with direct hardware access. Unprivileged virtual machines (domU) run modified Linux kernels, communicating with the Xen hypervisor as the hardware interface. CPU and memory access are managed directly by the hypervisor, while I/O is handled by dom0.

VMware [16], a leading virtualization software vendor, offers a range of products such as workstations, players, servers, and virtual desktop infrastructure. VMware’s Type 1 hypervisor, ESXi, runs directly on hardware, while products such as VMware Workstation, compatible with x64 systems, run on top of a host operating system. These platforms support multiple virtual machines, each running its own operating system, and facilitate efficient testing in diverse environments. Notably, VMware provides vMotion, allowing seamless migration of virtual machines across physical systems without downtime. This service enhances disaster recovery and enables load balancing, optimizing virtual machine performance and resource utilization. vMotion supports live migration of virtual machines across ESXi hosts, ensuring zero downtime and uninterrupted access during the migration process. VMware’s vMotion stands out for its reliability, minimal downtime during live migrations, versatility across hardware generations, and user-friendly management through migration wizards. These features collectively contribute to VMware’s reputation as a secure, efficient, and widely adopted virtualization platform.

Table 1 summarizes the virtualization technologies.

Table 1 Summary of virtualization technologies

In this research, we mainly focus on the pre-copy live migration in KVM. The following subsection will provide more details about pre-copy migration in KVM/QEMU.

Fig. 1 Pre-copy migration stages

2.2 Pre-copy migration in KVM/QEMU

Pre-copy migration [10, 17,18,19,20] is an important feature of KVM/QEMU virtualization [4, 21, 22], allowing seamless transfer of virtual machines between physical hosts without service disruption. Pre-copy migration encompasses several essential stages that contribute to a well-executed and efficient migration process. The primary stages of pre-copy migration can be categorized as follows: pre-migration stage, reservation stage, memory copy phase, iterative phase, stop and copy phase, and activation stage. These stages are shown in Fig. 1.

  • Pre-migration stage: The migration process starts at this stage. To start the migration process, communication is established between the source host (the original physical host) and the destination host (the target physical host). In order to prepare for migration, the source host begins collecting information on the virtual machine’s memory and CPU usage.

  • Reservation stage: During this stage, the destination host reserves the resources required for the migrated virtual machine. It ensures that the target host has sufficient space to accommodate the memory, CPU, and other requirements of the virtual machine.

  • Memory copy phase: The initial memory state of the VM is transmitted from the source host to the destination host during the memory copy phase. This process involves copying the VM’s memory pages over the network or through shared storage. The source host keeps track of the modified memory pages during the copy phase, which are referred to as dirty pages. The memory copy phase is an important step in pre-copy migration as it sets the initial state of the VM on the destination host. By copying the memory pages, the destination host can start executing the VM and minimize the downtime experienced by the VM during migration.

  • Iterative phase: The iterative phase is the core of the pre-copy migration process. It involves iterative cycles of memory copying and transferring the dirty pages from the source host to the destination host. During each iteration, the source host identifies the dirty pages that have changed since the last iteration and transfers them to the destination host. This process continues until the number of dirty pages converges to a minimal value or falls below a predefined threshold.

  • Stop and copy phase: When the number of dirty pages reaches the specified threshold or the time limit for memory copying is exceeded, the source host pauses the virtual machine, allowing the final dirty memory pages to be copied to the destination host. This phase ensures that all remaining dirty pages are transferred to the target host before the virtual machine is fully migrated.

  • Activation stage: In this final stage, the destination host activates the migrated virtual machine, taking over its execution from the source host. The virtual machine resumes its operation on the destination host using the copied memory pages. The virtual machine’s networking and I/O connections are reestablished and fully functional on the target host.

These stages collectively ensure a smooth and efficient migration of a running virtual machine from one physical host to another using the pre-copy migration technique.
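
To make the interplay between these stages concrete, the following is a minimal, purely conceptual sketch of the pre-copy control loop; the VM, host, and page-tracking primitives are hypothetical placeholders rather than QEMU/KVM APIs.

```python
# Conceptual sketch of traditional pre-copy migration (not QEMU's actual code);
# vm, src, and dst are hypothetical objects exposing page-tracking primitives.

def precopy_migrate(vm, dst, dirty_threshold, max_iterations):
    dst.reserve_resources(vm)                      # reservation stage
    pages_to_send = vm.all_memory_pages()          # initial memory copy
    iteration = 0
    while True:
        for page in pages_to_send:                 # transfer current page set
            dst.receive(page)
        pages_to_send = vm.dirty_pages_since_last_sync()
        iteration += 1
        # Iterative phase ends when dirty pages converge or a static limit hits.
        if len(pages_to_send) <= dirty_threshold or iteration >= max_iterations:
            break
    vm.pause()                                     # stop and copy phase
    for page in pages_to_send:
        dst.receive(page)
    dst.activate(vm)                               # activation stage
```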

2.3 Related work

Live virtual machine migration is crucial in virtualized environments, enabling seamless relocation of virtual machines (VMs) between physical hosts without interrupting execution. In the context of Kernel-based Virtual Machines (KVM), pre-copy live migration has gained significant attention due to its potential to minimize downtime and total migration time. The main challenge when transferring large volumes of data during live migration is the high rate at which memory pages are dirtied, which can significantly prolong migration time and downtime. If the rate of dirty page generation exceeds the available transfer bandwidth of pre-copy migration, it may even lead to migration failure.

To address these issues, Jin et al. [23] proposed an optimized scheme for live migration by controlling the CPU scheduler of the VM monitor to limit the speed of memory changes. This approach is particularly applicable when there is a high memory writing speed or the pre-copy speed is slow. They proved experimentally that the optimized scheme could reduce application downtime significantly compared to the original live migration, with acceptable overhead. However, its effectiveness may vary depending on workload characteristics and network conditions, and integrating additional parameters could introduce complexity and scalability challenges.

Similarly, Sharma et al. [24] proposed a three-phase optimization (TPO) method for pre-copy migration to address the issue of highly dirty memory page generation. The method operates in three sequential phases: initially reducing the transfer of memory pages, followed by minimizing the transfer of duplicate pages by classifying frequently and non-frequently updated pages, and finally, reducing the data sent in the last iteration of migration through the application of the simple RLE compression technique. As a result, each phase significantly reduces the total pages transferred, migration time, and downtime, respectively. Experimentally, it was shown that the method significantly reduced total pages transferred, total migration time, and downtime for higher workloads, and it does not impose significant overhead compared to the traditional pre-copy method. However, the main limitations of this approach are the static number of iterations, difficulty in managing concurrent migrations of multiple virtual machines with performance preservation and minimal overhead, and limited suitability for wide area network (WAN) environments.

Machine learning techniques have emerged as a valuable tool for optimizing migration processes in data center environments. Administrators can efficiently manage total migration time and minimize downtime by accurately predicting live migration costs, thereby enhancing operational efficiency and reliability. Elsaid et al. [25] introduced a machine-learning approach in the VMware environment, enabling the prediction of live migration cost, network overhead, and power consumption based on the VM’s active memory size. Experimentally, they demonstrated the feasibility and accuracy of the model in predicting migration costs. However, the main limitation is the scalability issues when this application deals with large-scale VM environments, and the suggested approach is focused solely on VMware setups, making it unclear how well the concept would apply to other systems.

Amani et al. [26] introduced the Scheduler Credit algorithm, optimizing live migration scheduling and achieving substantial reductions in migration time and downtime. Experimentally, they showed improved efficiency and effectiveness in live migration. However, the proposed study failed to address the fundamental challenges associated with dynamic workloads and lacked further discussion on essential aspects such as potential network overhead, migration speed, and efficiency. In addition, scalability details were not thoroughly addressed, posing challenges for larger systems.

Addressing the dirty memory challenges, Yong et al. [27] proposed a new algorithm called Context-Based Prediction (CBP) to forecast future dirty pages based on historical statistics of the dirty page bitmap. Compared to KVM’s default approach, experimental findings show that CBP significantly decreases total migration time, downtime, and the number of pages transmitted. However, the main drawback of the approach is the dependence of the prediction accuracy on the gathered context parameters and the used prediction models. Also, the method’s reliance on stable conditions may limit its effectiveness in constantly changing cloud environments.

Patel et al. [28] proposed a time series-based prediction technique to predict dirty pages using historical analysis of past data. They developed two regression-based models for time series prediction. The first model uses a statistical probability-based regression approach based on the ARIMA (autoregressive integrated moving average) model. The second model is developed using a statistical learning-based regression model based on the SVR (support vector regression) model. Evaluating these models on a real Xen dataset, they computed downtime, the total number of pages transferred, and migration time. Results indicate that the ARIMA model accurately predicts dirty pages, while the SVR model outperforms, demonstrating its superiority over ARIMA. While their SVR model exhibited high accuracy, challenges such as potential overfitting or underfitting need consideration.

Instead of predicting only dirty memory page rate, Motaki et al. [29] proposed a prediction-based model for managing the live migration of virtual machines (VMs), aiming to optimize performance metrics and reduce SLA violations. The model employs an ensemble learning strategy, incorporating linear and non-parametric regression methods to predict key metrics for various live migration algorithms. Experimental results demonstrate significant improvements in SLA violations and decreased CPU time for prediction. However, the generalizability of their approach to complex data center environments remains limited.

Hummaida et al. [30] also proposed a reinforcement learning (RL) management policy that seeks to optimize performance metrics, particularly in terms of Service Level Agreement (SLA) violations, by dynamically allocating cloud resources in response to changing workloads. Unlike traditional heuristic approaches, the RL policy operates in a decentralized fashion, facilitating fast convergence towards efficient resource allocation strategies. The RL policy enhances decision-making efficiency through parallel learning and state/action space reduction, leading to lower SLA violations and reduced downtime. Additionally, by integrating multi-level RL agent cooperation, the policy improves performance metrics by effectively coordinating resource allocation between network controllers and local nodes. Therefore, while the primary focus is on minimizing SLA violations, the improved resource allocation facilitated by the RL policy can also reduce downtime and total migration time. However, accurately modeling dynamic cloud environments poses significant challenges.

Along with optimizing VM migration, Talwani et al. [31] focused on reducing energy consumption. They proposed an energy-efficient VM allocation and migration approach using an ABC algorithm. Through simulation analysis, the proposed Energy Aware-ABC algorithm demonstrates superior performance compared to existing methods, effectively reducing total energy consumption and minimizing the number of migrations. However, constraints related to resource availability in cloud data centers must be considered for practical implementation.

Mangalampalli et al. [32] introduced WBATimeNet, a deep learning network to predict which Virtual Machines (VMs) should undergo Live Migration (LM) in Azure cloud environments to optimize resource utilization and maintain VM availability. WBATimeNet utilizes Multivariate Time Series data of Memory, CPU, and Disk, employing a stacked CNN and LSTM architecture enhanced by White-box Adversarial Training to handle high variability and uncertainty in time series data. Experimental results demonstrate WBATimeNet’s superior performance over baseline models, ensuring minimal impact on running workloads during LM. The limitation of WBATimeNet lies in its reliance on Multivariate Time Series data, which may struggle with highly dynamic or unpredictable workload patterns.

A summary of existing work on live VM migration is shown in Table 2.

Table 2 Summary of related works

The literature review highlights the growing importance of machine learning-based VM live migration modeling in improving the efficiency of pre-copy migration techniques in cloud computing. Existing models vary in their objectives, migration algorithms, and relevant parameters. In response to the identified challenges, we proposed a novel methodology that distinguishes itself by incorporating a machine learning model to predict downtime during each iterative migration phase. By dynamically estimating downtime based on critical features such as VM size, page dirty rate, bandwidth, and workload, our approach enables more accurate scheduling of the stop and copy phase, thereby minimizing downtime and total migration time. This enhances client performance and contributes to higher customer satisfaction with cloud services. By reducing service delays and enhancing efficiency, users can experience optimized performance without disruptions, benefiting cloud service providers and end-users.

3 Proposed live virtual machine migration using machine learning model

We propose a three-stage approach [6] for determining the optimal time for a pre-copy migration, which is a technique used in virtual machine (VM) migration to minimize downtime or service delay during the migration process. This approach is depicted in Fig. 2.

Fig. 2 Improved pre-copy migration using the machine learning model

The first stage of our approach is feature selection. This stage is crucial as it involves selecting a set of relevant and important features from the available feature sets. The selection of input features requires domain knowledge and expertise to identify the factors that impact the performance of the migration process. By simulating a pre-copy migration and observing the impact of each feature on the output metrics, we can determine which features are most influential. The feature selection stage outputs a set of selected features that are used in the subsequent stages.

The second stage involves generating machine learning (ML) models. Using the selected features from the previous stage, we generate multiple ML models and evaluate their prediction accuracy with performance metrics such as downtime and total migration time. Our evaluation demonstrates that the proposed model outperforms others, achieving an error rate of less than 5%. This iterative process allows us to refine the models and identify the most accurate ML model with the minimum number of features. It is important to note that the details of the feature selection and ML model generation stages are described in our prior work. However, this paper primarily focuses on the deployment of the proposed ML model in a real hardware environment.

In the final stage, we apply the generated ML model in the pre-copy migration process to determine the optimal time for migrating the VM from the source to the destination with minimal impact on downtime or service delay. By leveraging the insights provided by the ML model, we can make informed decisions about the timing of the migration to ensure the smoothest transition possible. It is worth mentioning that while our prior work applied the ML model in simulations, this paper specifically focuses on deploying the model in a real operational environment with different workload scenarios.

In our previous work, we introduced a comprehensive feature selection methodology designed specifically for pre-copy migration in virtualized environments. The primary objective of this methodology is to pinpoint key features that exert a significant influence on crucial performance parameters, such as total migration time and downtime. To achieve this, the feature selection process comprises a series of simulations aimed at understanding the impact of different input features on performance metrics. The algorithm systematically iterates through each feature, simulating pre-copy migration scenarios and recording performance metrics for subsequent analysis.

After the simulation experiments conducted using the CloudSim simulator, the selected features undergo validation through well-established techniques like ANOVA and Chi-square tests. This validation process ensures that the identified features are not only relevant but also influential in predicting pre-copy migration outcomes. The validated features then serve as the foundation for developing various machine learning models, including linear regression, support vector regression (SVR), SVR with bootstrap aggregation, artificial neural networks (ANN), and k-nearest Neighbors (KNN). The performance of these models is rigorously assessed in terms of accuracy for predicting both total migration time and downtime.

Remarkably, our models consistently outperform those utilizing a larger set of features (fourteen and twenty), underscoring the efficacy of our systematic feature selection approach. The chosen features undergo further validation through the coefficient of determination (\(\hbox {R}^2\)), indicating their suitability for ensuring accurate predictions. In summary, our feature selection methodology, succinctly summarized here, provides a robust foundation for developing precise machine learning models tailored specifically for pre-copy migration scenarios in virtualized environments.
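
As an illustration of the model-generation stage, the sketch below trains a KNN regressor on the four selected features using scikit-learn; the synthetic data, neighbor count, and train/test split are illustrative assumptions and not the exact configuration from our prior work [6].

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

# Feature columns: vm_size (KB), page_dirty_rate, bandwidth, working_set_size.
# In practice X and y come from migration traces; random data keeps the sketch
# self-contained.
rng = np.random.default_rng(0)
X = rng.random((500, 4))
y = rng.random(500)          # observed downtime for each trace (placeholder)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

knn = KNeighborsRegressor(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)

predicted_downtime = knn.predict(X_test)
print("R^2:", r2_score(y_test, predicted_downtime))
```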

For a more in-depth understanding of the feature selection process, the relationships between different features, and the intricacies of ML model generation, we recommend referring to our prior work for comprehensive details. This current paper builds upon these established foundations and accentuates the practical deployment of the proposed ML model in pre-copy migration scenarios, utilizing real hardware setups.

3.1 Algorithm to set up proposed machine learning model in QEMU–KVM VM migration environment

Algorithm 1 outlines a proposed machine learning-based approach for pre-copy migration in a KVM (Kernel-based Virtual Machine) environment. The algorithm aims to migrate a virtual machine (VM) from a source host (SH) to a destination host (DH) while minimizing downtime. To assess the performance of the VMs, cloud workload benchmarks are executed. A threshold value for downtime is set, typically at 0.05 ms. The threshold downtime of 0.05 ms is a commonly used value in the research community for pre-copy migration, as it provides a good balance between migration time and downtime. The selection of a suitable threshold value for the pre-copy migration technique is an important consideration to ensure the successful completion of the migration process. In general, a lower threshold value means that more frequent memory updates are sent from the source host to the destination host, resulting in a longer total migration time but reduced downtime. Conversely, a higher threshold value means that fewer memory updates are sent, resulting in a shorter total migration time but increased downtime. Many research studies [33,34,35,36,37] have used this threshold value to evaluate the performance of different pre-copy migration techniques.

Algorithm 1 Proposed ML-based pre-copy migration in KVM

The implementation of Algorithm 1 begins by setting up the live migration environment, ensuring compatibility between hardware, shared storage, network connectivity, hypervisor, and operating system. A secure SSH link is established between the source and target servers. The virtual machine (VM) is started at the source host, and the channel bandwidth parameter is configured for the QEMU/KVM hypervisor. The algorithm then proceeds with destination server selection and resource reservation at the destination. The migration process is divided into three phases: initial memory copy, iterative phase, and stop and copy phase.

The algorithm then enters the iterative phase (Phase 2), where it repeatedly estimates the predicted downtime of the VM migration until the predicted downtime falls below a predefined threshold. The downtime is predicted using a machine learning model, which takes various input parameters such as VM size, page dirty rate (the rate at which memory pages are modified), bandwidth, and working set size. The model predicts the expected downtime based on these parameters. This iterative phase allows the algorithm to optimize the migration process by dynamically adjusting the migration pace based on the predicted downtime.

Once the predicted downtime falls below the threshold, the algorithm enters the stop and copy phase (Phase 3). In this phase, the algorithm again iterates through the memory pages of the VM. Each memory page is copied from the source host to the destination host. If there are no more memory pages remaining, the migration is considered complete. By utilizing machine learning and predictive modeling, the proposed algorithm aims to minimize downtime during the pre-copy migration process. Overall, the algorithm combines traditional pre-copy migration techniques with ML-based prediction to minimize VM downtime during migration. By copying the modified memory pages initially and iteratively estimating downtime using an ML model, it optimizes the migration process and ensures a smooth and efficient transfer of the VM from the SH to the DH.
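
The following minimal sketch, written against hypothetical helper functions (get_migration_stats, finalize_stop_and_copy) and a pre-trained regressor, illustrates how Phase 2 and the hand-off to Phase 3 could be wired together; it is a simplification of Algorithm 1, not the exact implementation.

```python
# Assumed unit: the model predicts downtime in milliseconds, matching the
# 0.05 ms threshold used in Algorithm 1.
DOWNTIME_THRESHOLD_MS = 0.05

def ml_precopy_migration(dom, model, get_migration_stats, finalize_stop_and_copy):
    """Phase 2: keep iterating while the predicted downtime is too high.

    `dom` is the migrating domain, `model` a trained regressor (e.g. KNN), and
    the two callables are hypothetical helpers wrapping the hypervisor's
    migration statistics and the final stop-and-copy trigger.
    """
    while True:
        stats = get_migration_stats(dom)          # data totals, dirty rate, ...
        features = [[stats["vm_size"],
                     stats["page_dirty_rate"],
                     stats["bandwidth"],
                     stats["working_set_size"]]]
        predicted_downtime = model.predict(features)[0]
        if predicted_downtime < DOWNTIME_THRESHOLD_MS:
            break                                  # switch to Phase 3
    finalize_stop_and_copy(dom)                    # pause VM, copy remaining pages
```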

4 Setting up a testbed for QEMU–KVM live VM migration experiments

We have developed our own testbed to conduct experiments with our algorithms and the datasets generated from the hardware environment. The testbed consists of three Dell servers running Ubuntu 22.04 with kernel version 5.19.0-41-generic as shown in Fig. 3.

Fig. 3 Live virtual machine migration in KVM/QEMU

The experiment aimed to enhance the performance of the pre-copy virtual machine migration technique using the QEMU–KVM v6.2.0 hypervisor and libvirt 8.0.0 on an Intel Core i7-3770S (3rd Gen) 3.10 GHz processor with 8 GB of RAM. To manage the networking complexity, VMs were connected through a virtual bridge interface, while physical servers were connected to the Ethernet Local Area Network (LAN) interface [38] for VM migration. To avoid storage synchronization overhead during the migration process, we installed a Network File System (NFS) server on one of the servers to share the migrated VM’s disk image. The migrated VM was configured with Ubuntu 22.04, Linux kernel v3.8.0-29, two logical processors, and 4 GB of RAM. We generated data traffic and simulated cloud user application benchmarks by running Idle VM OS pages, MemTester [39, 40], Sysbench [41], and Stress [42] on the VMs installed on the other two servers.

4.1 Performance evaluation with traditional pre-copy VM migration techniques

To assess the performance of the proposed ML-based pre-copy migration in comparison with the traditional pre-copy migration technique, various test workloads were executed across VMs with different sizes (1GB, 2GB, and 4GB). The VM migration process was initiated using KVM and an NFS server with a 1000MB bandwidth, and machine communication was confirmed through ping commands.
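
For reference, a migration such as the one used in these tests could be started through the libvirt Python bindings roughly as sketched below; the VM name, host URIs, and bandwidth cap are illustrative assumptions, and the bandwidth argument of migrate() is interpreted by libvirt in MiB/s.

```python
import libvirt

# Hedged sketch of initiating a live migration via the libvirt Python API;
# names and values below are placeholders, not the exact testbed identifiers.
SRC_URI = "qemu:///system"
DST_URI = "qemu+ssh://destination-host/system"   # SSH link between the servers
VM_NAME = "migrated-vm"
BANDWIDTH_MIB_S = 125                            # assumed cap in MiB/s

src_conn = libvirt.open(SRC_URI)
dst_conn = libvirt.open(DST_URI)
dom = src_conn.lookupByName(VM_NAME)

flags = libvirt.VIR_MIGRATE_LIVE | libvirt.VIR_MIGRATE_PERSIST_DEST
# The disk image sits on shared NFS storage, so only memory state is transferred.
dom.migrate(dst_conn, flags, None, None, BANDWIDTH_MIB_S)
```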

4.1.1 Calculation of input features for ML model

In the iterative phase of our proposed algorithm (Algorithm 1), the dynamic calculation of input features for the machine learning (ML) model plays an important role in predicting the optimal time for the migration process. The following equations detail the computation of key input features, namely VM size, page dirty rate, and working set size:

$$\begin{aligned} \text {vm}\_\text {size} = (\text {data}\_\text {total} \times 1024) - \text {data}\_\text {remaining} \end{aligned}$$
(1)

The VM size is determined by subtracting the remaining data to be migrated (data_remaining) from the total amount of data (data_total). This calculation is performed in kilobytes.

$$\begin{aligned} \text {page}\_\text {dirty}\_\text {rate} = \frac{{\text {dirty}\_\text {rate} \times 4096}}{{\text {memory}\_\text {total} \times 1024 \times 1024 \times 1024}} \end{aligned}$$
(2)

The page dirty rate [43,44,45] represents the rate at which memory pages are modified within a virtual machine (VM) during its operation. As shown in Eq. (2), it is calculated by multiplying the dirty page count reported per second by the page size (4096 bytes) and normalizing by the total memory size (converted from gigabytes to bytes).

$$\begin{aligned} \text {working}\_\text {set}\_\text {size} = \frac{{(\text {memory}\_\text {processed} \times 1024) - \text {memory}\_\text {remaining}}}{{\text {time}\_\text {elapsed}}} \end{aligned}$$
(3)

The working set size represents the rate at which memory is being processed. It is calculated by subtracting the remaining memory (memory_remaining) from the processed memory (memory_processed, multiplied by 1024 for unit consistency) and then dividing by the elapsed time (in seconds).

For more details on the parameters used in our experiments, refer to the parameter definitions in Table 3.

Table 3 Parameters definition

These calculations are integral to providing the ML model with accurate and dynamic input features, enabling it to predict the optimal timing for transitioning to the stop and copy phase during the live migration process.
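
The three equations above can be folded into a small helper that turns raw migration statistics into the ML model's feature vector; the dictionary keys below are placeholders for whichever statistics source is used (for example, the hypervisor's migration job information) and follow the units assumed in Eqs. (1)-(3) and Table 3.

```python
def compute_ml_features(stats, time_elapsed_s):
    """Compute the ML input features of Eqs. (1)-(3) from a dict of
    migration statistics; the key names are assumed placeholders."""
    # Eq. (1): remaining data subtracted from total data, in kilobytes
    vm_size = stats["data_total"] * 1024 - stats["data_remaining"]

    # Eq. (2): dirty pages per second times the 4096-byte page size,
    # normalized by total memory (GB converted to bytes)
    page_dirty_rate = (stats["dirty_rate"] * 4096) / (
        stats["memory_total"] * 1024 ** 3)

    # Eq. (3): processed minus remaining memory, divided by elapsed time
    working_set_size = (stats["memory_processed"] * 1024
                        - stats["memory_remaining"]) / time_elapsed_s

    return vm_size, page_dirty_rate, working_set_size
```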

4.1.2 Downtime calculation

Downtime [46] is the period during which services are unavailable to the clients due to the suspension of the VM on the source server for migration purposes and resuming it on the destination server. Calculating downtime during live migration is crucial for evaluating system performance.

During the live migration process, communication between the source and destination servers is essential for transferring the virtual machine’s state and memory. This communication involves sending packets of data from the source server to the destination server. When a packet is sent from the source server to the destination server, it traverses various network devices and links. Ideally, each packet should reach its destination without any loss. However, due to network congestion, hardware failures, or other factors, packets may get lost or dropped along the way. In the context of live migration, if packets are lost during transmission, it indicates a disruption in the communication between the source and destination servers. This disruption can lead to delays in transferring the virtual machine’s state, potentially causing the virtual machine to go offline temporarily. As a result, clients accessing services provided by the virtual machine may experience interruptions or delays in service availability.

Monitoring packet loss during live migration using tools like My Traceroute (MTR) [47, 48] allows us to quantify the extent of communication disruptions between the source and destination servers. By analyzing packet loss rates, we can assess the impact on service availability and calculate the downtime experienced during the migration process. MTR is a diagnostic tool that combines the functionalities of traceroute and ping tools, serving as a network diagnostic tool. It investigates the routers along the path by sending limited-hop packets and monitoring their expiration responses. The tool typically performs these probes once per second and records the response times of each hop on the path. Unlike traditional traceroute tools, MTR offers dynamic insights by continuously updating data on latency and packet loss throughout the network path to the destination. This real-time information proves invaluable in troubleshooting network issues, providing users with a comprehensive view of ongoing events along the route. MTR identifies the network path like a traceroute and consistently dispatches packets to gather up-to-date details, ensuring a constantly refreshed perspective on the network’s performance.
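
In our setting, an MTR report can be collected and reduced to the two quantities needed here, packet loss and average round-trip time, with a small wrapper like the one below; the column layout of the --report output (Loss%, Snt, Last, Avg, ...) is assumed from common MTR versions and may need adjusting locally.

```python
import subprocess

def mtr_loss_and_rtt(host, cycles=34):
    """Run `mtr --report` against `host` and return (loss_percent, avg_rtt_ms)
    for the final hop."""
    out = subprocess.run(
        ["mtr", "--report", "--report-cycles", str(cycles), host],
        capture_output=True, text=True, check=True).stdout
    last_hop = out.strip().splitlines()[-1].split()
    loss_percent = float(last_hop[2].rstrip("%"))   # "Loss%" column, e.g. "30.3%"
    avg_rtt_ms = float(last_hop[5])                 # "Avg" column in ms
    return loss_percent, avg_rtt_ms
```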

Packet loss and round-trip time are fundamental metrics in network performance measurement and analysis, [46, 49,50,51,52] where packet loss reflects the reliability of data transmission. In contrast, round-trip time provides insights into the latency experienced by packets during migration. The packet loss will likely increase if a virtual machine goes offline during migration.

We set a baseline packet loss rate by observing network behavior during normal operation, where services are known to be available and functioning properly. Based on our analysis, we determined the baseline packet loss rate to be 0%. Additionally, to detect abnormal increases in packet loss during migration scenarios, we set a conservative threshold slightly above the baseline level. The threshold value is set at 1% to 5%, accounting for potential network variations and occasional packet loss during normal operation.

We considered two approximation methods to calculate the downtime based on a packet loss-based approach and a timestamp-based approach.

Packet loss-based approach

The packet loss-based approach, as described by the equation below, calculates downtime by multiplying the number of lost packets by the average round-trip time:

$$\begin{aligned} \text {Downtime} = (\text {Packet loss} \times \text {Total number of packets}) \times \text {Average round-trip time} \end{aligned}$$
(4)

where:

  • Packet loss: The percentage of lost packets during the migration process.

  • Total number of packets: The total number of packets sent during the migration process.

  • Average round-trip time: The average time it takes for a packet to travel from the source to the destination and back.

It is worthwhile to mention that the downtime calculation above is approximative (not exact) and is based on the intuition that packet loss is directly related to service unavailability. While acknowledging the limitations of using packet loss as a metric, we justify its usage based on the assumption that increased packet loss correlates with periods of service unavailability. We also presume that packet loss due to network issues is minimal and does not significantly disrupt service availability. By establishing a baseline and defining thresholds, we aim to detect deviations from normal network behavior during migration scenarios.

The MTR test results, illustrated in Fig. 4, offer a practical insight into our methodology for calculating downtime during a pre-copy live migration. This specific evaluation was conducted in an idle workload scenario with 1 GB of RAM.

Fig. 4 MTR test result for pre-copy with idle workload

Example of packet loss calculation: In Fig. 4, the 30.3% packet loss indicates that around 30.3% of the 34 packets dispatched during migration were lost in transit. While seemingly high, several factors contribute to this phenomenon. Live migration inherently introduces additional network traffic and computational overhead. The limited network resources, potential bandwidth constraints, and dynamic network conditions amplify the likelihood of packet loss.

$$\begin{aligned} \text {Packets lost}&\approx 30.3\% \times 34 \approx 10 \text { packets}\\ \text {Downtime}&\approx 10 \text { packets lost} \times 0.5 \, \text {ms (average round-trip time)} = 5 \, \text {milliseconds} \end{aligned}$$

In the specific scenario presented, the calculated downtime amounted to approximately 5 milliseconds. This process enabled us to carefully evaluate the performance of live migration under various conditions and workloads.
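
The worked example corresponds to the following one-line implementation of Eq. (4); the 0.5 ms average round-trip time is the assumed value from the example above.

```python
def downtime_from_packet_loss(loss_fraction, total_packets, avg_rtt_ms):
    """Approximate downtime (Eq. 4): lost packets times average round-trip time."""
    packets_lost = round(loss_fraction * total_packets)
    return packets_lost * avg_rtt_ms

# Values from the idle-workload MTR run in Fig. 4, with the 0.5 ms RTT assumed above.
print(downtime_from_packet_loss(0.303, 34, 0.5))   # ~5 ms
```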

Timestamp-based approach

An alternative method was also used, involving recording timestamps for when the service transitions from being available to unavailable and back again during migration. The equation for this approach is:

$$\begin{aligned} {\text {Downtime} = T_{\text {end}} - T_{\text {start}}} \end{aligned}$$
(5)

where:

  • Downtime is the total time the service was unavailable.

  • \(T_{\text {start}}\) is the timestamp when the service became unavailable.

  • \(T_{\text {end}}\) is the timestamp when the service became available again.

The timestamp-based approach is a more direct method of measuring the phenomena of service availability as it records the exact times when the service goes from available to unavailable and then back to available. It provides precise information about the downtime duration during the migration process. It focuses on specific events (service becoming unavailable and then available again), which can offer a clear understanding of the migration’s impact on service availability. The granularity of this method is determined by the precision of the timestamp recording mechanism. High-resolution timestamps can offer very detailed insights on downtime.
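
A rough sketch of how such timestamps could be captured in practice is shown below; the probing mechanism (a single short-deadline ping) and the probe interval are assumptions, and the achievable granularity is bounded by that interval.

```python
import subprocess
import time

def measure_downtime_by_timestamps(host, probe_interval_s=0.05, timeout_s=120):
    """Record when `host` stops and resumes answering pings; return the gap in
    seconds, or None if no unavailability window was observed."""
    t_start = t_end = None
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        reachable = subprocess.run(
            ["ping", "-c", "1", "-W", "1", host],
            stdout=subprocess.DEVNULL).returncode == 0
        now = time.monotonic()
        if not reachable and t_start is None:
            t_start = now                 # service became unavailable (T_start)
        elif reachable and t_start is not None:
            t_end = now                   # service available again (T_end)
            break
        time.sleep(probe_interval_s)
    return None if t_start is None or t_end is None else t_end - t_start
```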

Comparison between the downtime measurement methods

Upon evaluating both approaches, including their practical implications and alignment with industry standards, the packet loss-based equation was retained in the methodology. This decision was based on several key considerations. Firstly, the use of packet loss as an indicator aligns with established practices in network performance measurement and analysis. Packet loss is widely recognized as a fundamental metric for assessing the reliability of data transmission, particularly in the context of live migration [46, 49,50,51,52] where communication disruptions can occur between source and destination servers. Secondly, while it is acknowledged that packet loss may not exclusively result from service unavailability during migration, we have assumed that network-related packet loss is minimal and does not significantly impact service availability. By establishing a baseline packet loss rate during regular operation and defining thresholds for abnormal increases in packet loss, we aim to detect any degradation in service quality indicative of service unavailability during migration.

Thirdly, practical implementation considerations played a significant role in the decision-making process. While the timestamp-based approach may appear conceptually straightforward, its execution introduces practical complexities in instrumentation and monitoring. Attaining accuracy in timestamp recording for service unavailability requires meticulous attention to detail and access to advanced monitoring tools. Nevertheless, in numerous practical deployment situations, like those involving small-to-medium-sized enterprises (SMEs) with constrained resources, acquiring or implementing such tools may pose challenges due to their limited availability or feasibility. For example, consider an SME tasked with migrating virtual machines between servers within their data center. Despite recognizing the potential benefits of the timestamp-based approach, the SME may lack the budget and expertise required to deploy and maintain sophisticated monitoring infrastructure. In such cases, the packet loss-based approach offers a more accessible and cost-effective solution for estimating downtime during migration scenarios. SMEs can utilize readily available tools and technologies to monitor and analyze communication disruptions between servers using existing network performance metrics, such as packet loss and round-trip time. This approach minimizes the burden on IT teams by reducing the need for specialized monitoring infrastructure, making it a more feasible option for organizations facing resource constraints.

Conversely, the packet loss-based approach leverages established network performance metrics, streamlining implementation efforts and reducing the dependency on specialized monitoring infrastructure. This practicality makes it a more feasible option for estimating downtime in diverse migration scenarios.

In summary, practical considerations regarding implementation complexity and resource constraints underscored the advantages of the packet loss-based approach, affirming its suitability for estimating downtime during live migration scenarios.

4.1.3 Total migration time calculation

In addition to evaluating downtime, another important metric for assessing the performance of VM migration is the total migration time. We measured this time by recording the start and end timestamps of the migration process. The total migration time is calculated using the following equation:

$$\begin{aligned} \text {total}\_\text {migration}\_\text {time} = \text {end}\_\text {time} - \text {start}\_\text {time} \end{aligned}$$
(6)

Here \(\text {start}\_\text {time}\) represents the timestamp before initiating the VM migration. \(\text {end}\_\text {time}\) denotes the timestamp after the completion of the VM migration.

The difference between these two timestamps, denoted as \(\text {total}\_\text {migration}\_\text {time}\), quantifies the overall time taken for the complete migration process.
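
In code, Eq. (6) amounts to wrapping the blocking migration call with two timestamps, as in the small helper below; run_migration stands for any callable performing the migration (for instance, a wrapper around the libvirt migrate() call sketched in Sect. 4.1).

```python
import time

def timed_migration(run_migration):
    """Eq. (6): record timestamps immediately before and after the migration."""
    start_time = time.monotonic()
    run_migration()
    end_time = time.monotonic()
    return end_time - start_time   # total_migration_time in seconds
```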

4.1.4 Experimental results

The results of our experiments, including downtime and total migration time, are presented in Fig. 5. These graphical representations depict the comparative performance of the proposed ML-based pre-copy migration technique and traditional pre-copy migration across different workloads and VM sizes.

Fig. 5 Downtime and total migration time results for pre-copy and the proposed VM migration technique

The analysis of the test results is presented in Sect. 4.2.

4.2 Analysis of the experimental result

In this section, we present the results of our experiment on enhancing the performance of traditional pre-copy migration in KVM using a machine learning (ML) model. We compare our proposed approach with the existing KVM pre-copy technique and discuss the implications of our findings.

In the experiment, we executed diverse memory-intensive workloads on virtual machines with varying RAM sizes. The outcomes of these tests are depicted in Fig. 5. Our experimental results, as illustrated in Fig. 5, exhibit a significant enhancement in the efficiency of live migration when utilizing our proposed machine learning model. In comparison, the traditional KVM approach showed longer migration times and increased downtime, especially for virtual machines (VMs) hosting heavy workloads such as Memtester, Sysbench, and Stress. Additionally, we analyze the test results and calculate the percentage improvement in total migration time and downtime for each case as well as overall. The following comparison presents the analysis results.

  • Performance for 1GB RAM

    For 1GB RAM, the proposed pre-copy method demonstrates significant improvements in migration time across different workloads. In an idle state, the method shows a 15.4% improvement in migration time and a 20.71% reduction in downtime compared to the traditional pre-copy method. Under stress conditions, the method offers a substantial 74.2% improvement in migration time and a 38.47% reduction in downtime. During Memtester activities, the proposed pre-copy method achieves an impressive 86.5% improvement in migration time and a downtime reduction of 61.18%. When subjected to SysCPU, the method results in a 7.1% improvement in migration time and a 39.91% reduction in downtime. Lastly, for SysMemory, the proposed pre-copy method records a 6.3% improvement in migration time and a substantial 56.19% reduction in downtime.

    Overall, for 1GB RAM, the proposed pre-copy method shows an average improvement of approximately 36.7% in migration time and 43.1% in downtime compared to the traditional pre-copy method.

  • Performance for 2GB RAM

    In the case of 2GB RAM, the proposed pre-copy method yields impressive results. In an idle state, it achieves a 21.7% improvement in migration time and a significant 68.8% reduction in downtime. Under stress conditions, the method showcases a 79.6% improvement in migration time and a remarkable 83.27% reduction in downtime. When subjected to Memtester activities, the proposed pre-copy method demonstrates a remarkable 92.4% improvement in migration time and a significant 64.72% reduction in downtime. During SysCPU activities, the method yields a substantial 56% improvement in migration time and a noteworthy 78.42% reduction in downtime. Lastly, in SysMemory scenarios, the proposed pre-copy method records a 38% improvement in migration time and a notable 41.48% reduction in downtime.

    For 2GB RAM, the proposed pre-copy method shows an average improvement of approximately 40.6% in migration time and 67.34% in downtime compared to the traditional pre-copy method.

  • Performance for 4GB RAM

    With 4GB of RAM, the proposed pre-copy method continues to deliver significant improvements. In an idle state, it achieves a 35.1% improvement in migration time and a substantial 78.63% reduction in downtime. Under stress conditions, the method exhibits a 49.5% improvement in migration time and a considerable 74.48% reduction in downtime. During Memtester activities, the proposed pre-copy method shows an impressive 78.5% improvement in migration time and a significant 68.89% reduction in downtime. In SysCPU scenarios, the method records a 71.4% improvement in migration time and a noteworthy 42.94% reduction in downtime. Lastly, for SysMemory, the proposed pre-copy method yields a 31.8% improvement in migration time and a 2.24% reduction in downtime. Overall, for 4GB RAM, the proposed pre-copy method shows an average improvement of approximately 53.4% in migration time and 52.22% in downtime compared to the traditional pre-copy method.

In summary, the proposed pre-copy method consistently demonstrates significant improvements in migration time and downtime across various workloads and RAM sizes. Remarkably, it reduces downtime by approximately 61.13% for 1GB RAM, 64.74% for 2GB RAM, and 68.85% for 4GB RAM compared to traditional methods in high-write-intensive workloads. Additionally, the method achieves impressive reductions in total migration time, such as approximately 86.45% for 1GB RAM, 92.42% for 2GB RAM, and 78.57% for 4GB RAM, when handling high write-intensive workloads compared to the traditional method.

4.3 Comparison with existing methods

We could not make direct quantitative comparisons between our results and those of existing algorithms due to variations in hardware configurations across different studies, such as RAM size, operating system, network speed, etc. This diversity made it challenging to ensure fair comparisons. Additionally, due to the lack of detailed information on the implementation of existing algorithms, we could not accurately reproduce them in our setup.

To overcome these challenges and to ensure fairness in our evaluation, we compared our approach with the traditional KVM pre-copy method, which is commonly implemented and reproducible across various environments. This approach allowed us to establish a benchmark for evaluating the effectiveness of different algorithms.

POF-SVLM [53] achieves an overall 55–60% reduction in migration time, data transfer, and application downtime compared to the traditional algorithm. In comparison with the default migration approach, the iMIG [54] approach achieves a 40% reduction in migration latency and a 45% reduction in energy consumption.

Li et al. [54] demonstrated a 41.63% increase in downtime, signaling challenges in specific scenarios. In contrast, Deshpande et al.’s [55] inter-rack migration approach managed to reduce total migration time by 26%, indicating efficiency improvements. Singh et al.’s [56] geometric programming-based method significantly decreased both total migration time and downtime, showcasing reductions of 37% and 28% in the worst-case scenarios. While Chen et al.’s [57] prediction-based optimization focused on minimizing downtime without reporting specific figures, Elsaid et al.’s [16] live migration timing optimization saved 50% in migration time and averaged 32% savings in VM migration time for memory-intensive workloads, reaching up to a 27% reduction for work-intensive applications, with an average of 21% savings. Lu et al. [58] introduced vGPU optimization with significant downtime reductions. Meanwhile, MigVisor by Zhang et al. [59] accurately predicted migration behavior, enhancing resource management without reporting specific downtime or total migration time comparisons. Gilesh et al.’s [45] opportunistic approach demonstrated time savings of up to 25% in VM migration. Kumar et al.’s [60] performance-upsurge approach emphasized improved bandwidth usage, migration time, and downtime without specific figures.

Our proposed algorithm outperforms the traditional pre-copy method with an average of 64.91% reduction in downtime for different RAM configurations in high-write-intensive workloads, along with an average reduction in total migration time of approximately 85.81%. Focusing on a dynamic stopping condition rather than a static one, we leverage our ML model to accurately predict the stopping condition based on factors like VM size, working set size, bandwidth, and dirty page rate. This intelligent approach significantly reduces the page dirty rate in each iteration, minimizing downtime by precisely determining the stop-and-copy time, thus reducing overall page data transfer during static iterations.

In conclusion, the comparisons and percentage reductions highlight the efficacy of our proposed method in consistently reducing downtime and total migration time, enhancing overall efficiency across diverse workloads and VM sizes. These findings establish our method as a promising solution for VM migration optimization. Further real-world validation and testing are necessary to solidify these results and evaluate the practicality and scalability of our algorithm in real-world scenarios.

5 Observations and discussion

This section presents the findings and implications of our study, focusing on predicting two critical performance metrics during pre-copy live migration: downtime and total migration time (TMT). Machine learning techniques were employed, utilizing a carefully constructed dataset to develop predictive models. A noteworthy aspect of our approach is the deliberate selection of four key features for prediction: VM size, Page Dirty Rate (PDR), Available Network Bandwidth (PTR), and Working Set Size (WSS). This contrasts with prior work [61], which employed a larger set of 20 features. The motivation behind this choice is to optimize model efficiency and interpretability. By focusing on a concise set of features, we aim to enhance computational efficiency and facilitate the practical deployment of our models across diverse cloud environments.

Other research works [61], as stated earlier, employed a comprehensive set of 20 features for their predictive models. They developed multiple models tailored to different migration algorithms and metrics, acknowledging that varying features are needed for accuracy across this broad spectrum. Their approach was indeed comprehensive and valuable for addressing the complexities of different migration scenarios. In contrast, our study took a more targeted approach. We focused specifically on pre-copy live migration and honed in on predicting downtime and TMT. Given this narrower scope, our feature selection was optimized for the unique characteristics of pre-copy migration.

We found that by concentrating on these four key features, namely VM size, Page Dirty Rate (PDR), Available Network Bandwidth (PTR), and Working Set Size (WSS), we could achieve a high degree of accuracy in predicting downtime and TMT. This focused feature selection simplifies the model and enhances interpretability, making it well-suited for scenarios where pre-copy migration with downtime and TMT are the primary concerns. It is important to note that the feature selection for our study was carefully tailored to the specific migration scenario under investigation, and we acknowledge that different migration algorithms and metrics may require a more extensive set of features, as demonstrated by the prior work.

Our study provides valuable insights into the predictive capabilities of a streamlined feature set for pre-copy live migration, emphasizing accuracy in forecasting downtime and TMT. While we tailor our approach to this specific scenario, the broader field of migration modeling acknowledges the need for diverse features to address the complexities of different migration algorithms and metrics. Our research contributes by showcasing the effectiveness of a focused feature selection for specific migration scenarios, offering a practical and interpretable solution for practitioners in pre-copy migration contexts.

Throughout the experiment, we observed a substantial difference in total migration time and downtime due to the number of iterations and the page dirty rate in each iteration. To delve deeper into the specifics of the page dirty rate in each iteration and the number of iterations in both the traditional and proposed methods, we collected data at each iteration and plotted a graph showcasing the relationship between the page dirty rate (i.e., the number of dirty pages per second) and the number of iterations. This graph, displayed in Fig. 6, provides valuable insights into the improvements achieved by our methods for different workloads and memory sizes.

Fig. 6: Relationship between the number of iterations and dirty rate for traditional pre-copy with different workloads

In Fig. 6, we plotted the number of iterations vs. dirty page rates for Memtester and Stress workloads with the traditional pre-copy method and the proposed ML-based pre-copy method.
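For readers who wish to reproduce this kind of per-iteration analysis, the following is a small sketch of how logged dirty-page-rate values can be plotted against the iteration number with matplotlib; the arrays are placeholders, not the measured values behind Fig. 6.

    # Sketch: plotting per-iteration dirty page rate for two migration runs.
    # The arrays are placeholders, not the measured values behind Fig. 6.
    import matplotlib.pyplot as plt

    traditional = [5200, 4900, 4800, 4700, 4600]  # dirty pages/s per iteration (placeholder)
    proposed = [5200, 1800, 900, 0]               # stop-and-copy reached early (placeholder)

    plt.plot(range(1, len(traditional) + 1), traditional, marker="o", label="Traditional pre-copy")
    plt.plot(range(1, len(proposed) + 1), proposed, marker="s", label="Proposed ML-based pre-copy")
    plt.xlabel("Iteration number")
    plt.ylabel("Dirty page rate (pages/s)")
    plt.legend()
    plt.show()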

In Fig. 6a, we examined the Memtester workload using 1GB of RAM and the conventional pre-copy method. We noticed that after the 75th iteration, the iterative phase transitioned into the stop-and-copy phase. At this point, the page dirty rate reached zero because the running workload had finished, and there were no more dirty pages to be copied. Consequently, there was no need for further iterations. We repeated the experiment with a Stress workload, as depicted in Fig. 6c. Similarly, with the traditional method, the stop-and-copy phase could only be reached when the page dirty rate became zero, indicating that the running workload had completed. This behavior is unsuitable when the VM is running heavy workloads and needs to be migrated.

In contrast, when employing our prediction method with the same workloads, as shown in Fig. 6b, d, the iterative stage transitioned to the stop-and-copy phase after the second iteration. During this stage, we also observed a lower dirty page rate than in the other iterations. The migration process completed at the end of the fourth iteration, resulting in significantly reduced downtime and total migration time. Moreover, our ML model enabled us to migrate the VMs safely while the workload was still running, whereas with the traditional method the VM could not be moved until the running workload had finished. We observed similar patterns when running different workloads with varying RAM sizes using both the traditional and proposed methods.

During the experimental process, we also noted the significance of bandwidth. We used a 1000 Mbps network link and observed slight variations in the data transfer rate during each iteration, ranging from 75 MiB/s to 125 MiB/s. These fluctuations in transfer rate also affect the observed dirty page rate. By introducing the proposed machine learning model, we achieved a substantial reduction in downtime during live migration. The model’s ability to predict downtime accurately enabled us to halt the iterative phase early whenever the predicted downtime fell below the predefined threshold. This prevented unnecessary iterations and significantly minimized service disruptions. Moreover, even when migration was allowed to proceed, the predicted downtime closely matched the actual downtime, indicating the model’s reliability.
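As an illustration of this stopping rule (a simplified sketch rather than our exact implementation), the check below queries a trained downtime regressor, such as the one sketched earlier, after each pre-copy iteration; the helper names, hypervisor hooks, and threshold value are hypothetical.

    # Sketch of the dynamic stopping rule: after each pre-copy iteration, predict
    # the downtime from current migration statistics and enter stop-and-copy once
    # the prediction drops below the threshold. Names and values are illustrative.

    DOWNTIME_THRESHOLD_MS = 300  # assumed acceptable downtime budget

    def should_stop(model, vm_size_mb, pdr_pages_per_s, ptr_mib_per_s, wss_mb):
        """Return True when the predicted downtime is within the allowed budget."""
        features = [[vm_size_mb, pdr_pages_per_s, ptr_mib_per_s, wss_mb]]
        predicted_downtime_ms = model.predict(features)[0]
        return predicted_downtime_ms <= DOWNTIME_THRESHOLD_MS

    # Intended use inside the migration loop (hypothetical hypervisor hooks):
    # for stats in iterate_precopy(vm):
    #     if should_stop(models["downtime_ms"], stats.vm_size_mb, stats.pdr,
    #                    stats.ptr, stats.wss_mb):
    #         enter_stop_and_copy(vm)
    #         break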

In addition to reducing downtime, our proposed model also yielded improvements in total migration time. By accurately predicting the downtime, we were able to optimize the migration process and avoid unnecessary iterations. As a result, the total time required to complete the migration was significantly reduced, leading to a faster and more efficient VM relocation. The results of our experiments highlight the advantages of employing our proposed machine-learning model in live migration scenarios. The traditional KVM approach lacks the ability to adapt to heavy workloads and often prolongs the total migration time. The introduction of our model mitigates this issue by dynamically predicting the downtime and intelligently halting the iteration phase, leading to reduced downtime and improved migration efficiency.

To evaluate the performance of our proposed method, we analyzed CPU overhead during the migration process. Figure 7a illustrates the CPU usage when using the traditional KVM algorithm, while Fig. 7b represents the CPU usage with our proposed algorithm. Both evaluations were conducted with a Memtester workload, a highly memory-intensive task.

Fig. 7: Combined CPU usage for the pre-copy and proposed algorithms with the Memtester workload

Figure 7a and b demonstrate the advantages of our proposed method over the KVM algorithm. With the KVM method, the time axis extends to 175 s, indicating that the migration took this long to complete, and CPU usage exceeded 50% during this interval. In contrast, our proposed method completed the migration in half the time, and CPU usage remained below 10%.

It is important to note that during the initial 20 s, both methods exhibited similar CPU usage, primarily due to the initial preparations for the migration process. From the collected data, the CPU overhead of the KVM approach was approximately 27%; that is, during migration, CPU utilization increased by 27% compared to the normal workload state. This increase can be attributed to factors such as the migration process itself, data transfer, and other related tasks. In contrast, our proposed method significantly reduced this overhead, resulting in a 12% reduction in CPU load during migration. This reduction implies that our algorithm is more efficient, incurring lower CPU utilization during the migration process.
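The overhead figures above follow directly from sampled utilization. As a small sketch, assuming CPU utilization is sampled at regular intervals before and during migration (the sample values are placeholders, not our measurements):

    # Sketch: estimating CPU overhead as the increase in mean CPU utilization
    # during migration relative to the pre-migration baseline. Samples are placeholders.
    from statistics import mean

    baseline_cpu = [24.0, 26.5, 25.0, 23.8]    # % CPU before migration (placeholder)
    migration_cpu = [51.2, 55.7, 52.3, 49.9]   # % CPU during migration (placeholder)

    overhead = mean(migration_cpu) - mean(baseline_cpu)
    print(f"CPU overhead during migration: {overhead:.1f} percentage points")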

Our experimental evaluation also considered the impact of VM size on migration performance. As the VM size increased, the number of pages to be transferred also increased, consequently prolonging the migration process. However, even with larger VM sizes, our proposed methodology outperformed the traditional KVM approach, showcasing its advantages across different workload scales. The proposed machine learning model can be easily adapted to different environments and can accommodate various workload types. Moreover, as the model is trained on historical migration data, it can continually improve its predictive capabilities as more data becomes available. This adaptability and scalability make our approach suitable for real-world deployment in diverse virtualized environments.

While our proposed machine learning-based approach significantly reduces downtime and total migration time, certain limitations should be acknowledged. One limitation is the reliance on prediction techniques for estimating downtime. Although our experimental evaluation demonstrated the superiority of our approach, the accuracy of predictions may vary depending on the specific workload and environmental factors. Furthermore, the availability of real data for training the prediction model can impact its efficiency. Obtaining a diverse and extensive dataset that accurately represents various scenarios and workloads is essential for improving the model’s accuracy and generalization capabilities. Therefore, acquiring more real-world data and continuously refining the prediction model are important steps in enhancing the efficiency and reliability of our approach.

6 Conclusion and future works

In this paper, we introduced a novel machine learning-based approach to enhance the efficiency of live migration in a KVM environment. By dynamically predicting downtime and halting the iteration phase when the predicted downtime falls below a threshold, our proposed model significantly reduced downtime and total migration time. Our experimental evaluation demonstrated the superiority of our approach compared to the traditional KVM methodology, particularly when heavy workloads were involved. Key insights from our evaluation include an average reduction in downtime of 64.91% for different RAM configurations in high-write-intensive workloads, along with an average reduction in total migration time of approximately 85.81%. These results underscore the practical advantages of our method in minimizing service disruptions during live virtual machine migration, highlighting its potential for enhancing the efficiency and reliability of virtualized environments.

One important area for future work is to address the security concerns associated with live migration in virtualized environments. Currently, the assumption is that the virtualized environment and the common shared NFS server are trusted. However, it is crucial to find techniques that ensure the secure migration of machines, especially when sensitive or critical data is involved. The focus should be on preventing data tampering or unauthorized access during the migration process.

To achieve this, one approach could be to identify the most sensitive files involved in a migration and alert users to their presence. This would raise awareness and allow users to take the necessary precautions. Additionally, employing selective encryption techniques can provide an extra layer of security: by encrypting only the sensitive files during migration, critical data can be protected from potential attacks or unauthorized access without the cost of encrypting the entire VM state.
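As a purely illustrative sketch of this future direction (not an implemented component of our system), sensitive files could be flagged and encrypted before transfer, for example with a symmetric scheme such as Fernet from the Python cryptography package; the sensitivity policy and paths below are hypothetical.

    # Illustrative sketch (future work, not part of the evaluated system): encrypt
    # only files flagged as sensitive before they are transferred, and alert the
    # user. The sensitivity policy and paths are hypothetical placeholders.
    from pathlib import Path
    from cryptography.fernet import Fernet

    def is_sensitive(path: Path) -> bool:
        # Placeholder policy, e.g. a user-supplied list of extensions or patterns.
        return path.suffix in {".key", ".pem", ".db"}

    def encrypt_sensitive_files(root: str, key: bytes) -> None:
        fernet = Fernet(key)
        for path in Path(root).rglob("*"):
            if path.is_file() and is_sensitive(path):
                path.write_bytes(fernet.encrypt(path.read_bytes()))
                print(f"Alert: sensitive file encrypted before transfer: {path}")

    # Example usage:
    # key = Fernet.generate_key()
    # encrypt_sensitive_files("/srv/nfs/vm-data", key)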