1 Introduction

Software performance assurance activities play a vital role in the development of large software systems. These activities ensure that the software meets the desired performance requirements (Woodside et al., 2007). However, failures in large software systems are often due to performance issues rather than functional bugs (Dean and Barroso 2013; Foo et al., 2010). Such failures lead to an eventual decline in the quality of the system, with reputational and monetary consequences (CA Technologies 2011). For instance, Amazon estimates that a one-second page-load slowdown can cost up to $1.6 billion (Eaton 2012).

In order to mitigate performance issues and ensure software reliability, practitioners often conduct performance tests (Woodside et al., 2007). Performance tests apply a workload (e.g., mimicking users’ behavior in the field) on the software system (Jain 1990; Syer et al., 2017), and monitor performance metrics, such as CPU usage, that are generated during the tests. Practitioners use such metrics to gauge the performance of the software system and identify potential performance issues (such as memory leaks (Syer et al., 2013) and throughput bottlenecks (Malik et al., 2010a)).

Since performance tests are often performed on large-scale software systems, they typically require many resources (Jain 1990). Moreover, performance tests often need to run for a long period of time in order to build statistical confidence in the results (Jain 1990). Testing environments also need to be easily configurable, so that a specific environment can be mimicked and false performance issues (e.g., issues that are merely artifacts of the environment) are reduced. To address these challenges, virtual environments (VMs) are often leveraged for performance testing (Chen and Noble 2001; VMWare 2016): their flexibility enables practitioners to easily prepare, customize, use and update performance testing environments in an efficient manner. The use of VMs in performance testing is widely discussed (Dee 2014; Kearon 2012; Tintin 2011), and even well documented (Merrill 2009) by practitioners. In addition, many software systems are released both on-premise (physical) and in cloud (virtual) environments (e.g., SugarCRM 2017 and BlackBerry Enterprise Server 2014). Hence, it is important to conduct performance testing in both virtual (for cloud deployment) and physical (for on-premise deployment) environments.

Prior studies show that virtual environments are widely exploited in practice (Cito et al., 2015; Nguyen et al., 2012; Xiong et al., 2013), and the overhead that is associated with virtual environments has been investigated (Menon et al., 2005). However, such overhead may not necessarily alter the outcome of performance tests carried out in physical and virtual environments. For example, if the performance (e.g., throughput) of the system follows the same trend (or distribution) in both the physical and virtual environments, the overhead would not significantly impact the conclusions drawn by practitioners who examine the performance testing results. Our work is one of the first to examine such discrepancies between performance testing results in virtual and physical environments. Exploring, identifying and minimizing such discrepancies will help practitioners and researchers understand and leverage performance testing results from virtual and physical environments. Without knowing whether there exists a discrepancy between the performance testing results from the two environments, practitioners cannot rely on performance assurance activities carried out in the virtual environment, or vice versa. Once the discrepancy is identified, the performance results can be evaluated more accurately.

In this paper, we perform a study on two open-source systems, DS2 (Jaffe and Muirhead 2011) and CloudStore (CloudScale-Project 2014), where performance tests are conducted in virtual and physical environments. Our study focuses on the discrepancy between the two environments and the impact of this discrepancy on analyzing performance testing results, and highlights potential opportunities to minimize the discrepancy. In particular, we compare performance testing results from virtual and physical environments based on three widely examined aspects:

  • single performance metric: the trends and distributions of each performance metric

  • the relationship between the performance metrics: the correlations between every two performance metrics

  • statistical performance models: the models that are built using performance metrics to predict the overall performance of the system

We find that 1) performance metrics have different shapes of distributions and trends in virtual environments compared to physical environments, 2) there are large differences in correlations among performance metrics measured in virtual and physical environments, and 3) statistical models built using performance metrics from virtual environments do not apply to physical environments (i.e., they produce high prediction error) and vice versa. We then examine the feasibility of using normalization to help alleviate the discrepancy between performance metrics. We find that, in some cases, normalizing performance metrics based on deviance may reduce the prediction error when a model built using performance metrics collected from one environment is applied to another. Our findings show that practitioners cannot assume that the performance test results observed in one environment will necessarily apply to another environment. The overhead from virtual environments does not only impact the scale of the performance metrics, but also the relationships among performance metrics, i.e., the correlation values change. On the other hand, we find that practitioners who leverage both virtual and physical environments may be able to reduce the discrepancy that arises due to the environment (i.e., virtual vs. physical) by applying normalization techniques.

The rest of the paper is organized as follows. Section 2 presents the background and related work. Section 3 presents the case study setup. Section 4 presents the results of our case study, followed by a discussion of our results in Section 5. Section 6 discusses the threats to validity of our findings. Finally, Section 7 concludes this paper.

2 Background and Related Work

In this section, we discuss the motivation for and work related to this paper in three broad subsections: 1) analyzing performance metrics from performance testing, 2) analysis of VM overhead, and 3) performance testing and bug detection.

2.1 Analyzing Performance Metrics from Performance Testing

Prior research has proposed a slew of techniques to analyze performance testing results, i.e. performance metrics. Such techniques typically examine three different aspects of the metrics: 1) single performance metric, 2) the relationship between performance metrics, and 3) statistical modeling based on performance metrics.

2.1.1 Single Performance Metric

Nguyen et al. (2012) introduce the concept of using control charts (Shewhart 1931) in order to detect performance regressions. Control charts use a predefined threshold to detect performance anomalies. However, control charts assume that the output follows a uni-modal distribution, which may be an inappropriate assumption for performance data. Nguyen et al. propose an approach to normalize performance metrics between heterogeneous environments and workloads in order to build robust control charts.

Malik et al. (2010b, 2013) propose approaches that cluster performance metrics using Principal Component Analysis (PCA). Each component generated by PCA is mapped to performance metrics by a weight value. The weight value measures how much a metric contributes to the component. For every performance metric, a comparison is performed on the weight value of each component to detect performance regressions.

Heger et al. (2013) present an approach that uses software development history and unit tests to diagnose the root cause of performance regressions. In the first step of their approach, they leverage Analysis of Variance (ANOVA) to compare the response time of the system to detect performance regressions. Similarly, Jiang et al. (2009) extract response time from system logs. Instead of conducting statistical tests, Jiang et al. visualize the trend of response time during performance tests, in order to identify performance issues.

2.1.2 Relationship Between Performance Metrics

Malik et al. (2010a) leverage Spearman’s rank correlation to capture the relationship between performance metrics. The deviance of the correlations is calculated in order to pinpoint which subsystem is responsible for a performance deviation.

Foo et al. (2010) propose an approach that leverages association rules in order to address the limitations of manually detecting performance regressions in large scale software systems. Association rules capture the historical relationship among performance metrics and generate rules based on the results of prior performance tests. Deviations in the association rules are considered signs of performance regressions.

Jiang et al. (2009a) use normalized mutual information as a similarity measure to cluster correlated performance metrics. Since metrics in one cluster are highly correlated, the uncertainty among metrics in the cluster should be low. Jiang et al. leverage entropy from information theory to monitor the uncertainty of each cluster. A significant change in the entropy is considered as a sign of a performance fault.

2.1.3 Statistical Modeling Based on Performance Metrics

Xiong et al. (2013) propose a model-driven approach named vPerfGuard to detect software performance regressions in a cloud environment. The approach builds models between workload metrics and a performance metric, such as CPU usage. The models can be used to detect workload changes and assist in identifying performance bottlenecks. Since vPerfGuard is typically used in a virtual environment, our study may help the future evaluation of vPerfGuard. Similarly, Shang et al. (2015) propose an approach that includes only a limited number of performance metrics for building statistical models. The approach leverages an automatic clustering technique in order to determine the number of models to be built for the performance testing results. By building a statistical model for each cluster, their approach is able to detect injected performance regressions.

Cohen et al. (2004) propose an approach that builds probabilistic models, such as Tree-Augmented Bayesian Networks, to examine the causes of changes in a system’s response time. Cohen et al. (2005) also propose that system faults can be detected by building statistical models based on performance metrics. The approaches of Cohen et al. (2004, 2005) were later improved by Bodík et al. (2008) using logistic regression models.

Jiang et al. (2009b) propose an approach that improves the Ordinary Least Squares regression models that are built from performance metrics and uses the models to detect faults in a system. The authors conclude that their approach detects the injected faults more efficiently than the existing linear-model approach.

On the one hand, none of the prior research discusses the impact of virtual versus physical environments on the results of their approaches, which motivates the empirical study that is conducted in this paper. On the other hand, since there are hardly two identical sets of performance testing results, we do not compare the raw data of performance testing results from virtual and physical environments. Instead, we conduct our case study in the context of all three of the above types of analyses, in order to observe the impact when practitioners apply such analyses to performance testing results. Our findings can help better evaluate and understand the findings from the aforementioned research.

2.2 Analysis of VM Overhead

Kraft et al. (2011) discuss the issues related to disk I/O in virtual environments. They propose a trace-driven approach to examine the performance degradation of disk request response time. Kraft et al. emphasize the latencies present in virtual machine requests for disk I/O due to the increase in time associated with request queues.

Menon et al. (2005) audit the performance overhead in Xen virtual machines. They uncover the origins of overhead that might exist in the network I/O, causing peculiar system behavior. However, their study is limited to the Xen virtual machine and mainly focuses on network-related performance overhead.

Brosig et al. (2013) predict the performance overhead of virtualized environments using Petri nets on a Xen server. The authors focus on the virtualization overhead with respect to queuing networks only. They are able to accurately predict server utilization, but observe significant errors for multiple VMs.

Huber et al. (2011) present a study on cloud-like environments. The authors compare the performance of virtual environments and study the degradation between the environments. Huber et al. further categorize factors that influence the overhead and use regression-based models to evaluate the overhead. However, their modeling only considers CPU and memory.

Luo et al. (2016) search for the combinations of inputs that may cause performance regressions, applying genetic algorithms to detect such combinations. Netto et al. (2011) present a study similar to ours that compares performance metrics generated via load tests between the two environments. However, the authors did not analyze the results from a statistical perspective.

Prior research focused on the overhead of virtual environments without considering the impact of such overhead on performance testing and assurance activities. In this paper, we evaluate the discrepancy between virtual and physical environments by focusing on its impact on the analysis of performance testing results, and we investigate whether such impact can be reduced in practice.

2.3 Performance Testing and Bug Detection

There exists much research on performance testing and bug detection. Nistor et al. (2013b) detect functional and loop-related performance bugs with the help of a tool that they developed. Jin et al. (2012) present a study on a wide range of performance bugs. The authors examine real-world performance bugs and develop rule-based performance bug detection tools. In another study, Nistor et al. (2013a) highlight that automated, tool-based performance bug detection is limited. The authors also note that performance bugs are mostly detected by code reasoning rather than by end users observing the effects on the system. Tsakiltsidis et al. (2016) use prediction models to detect and predict performance bugs based on information extracted from source code repositories. Malik et al. (2010c) present a study to uncover functional bugs via load testing. The authors propose an approach that reduces the large number of performance metrics collected at the end of a load test using principal component analysis. Zaman et al. (2012) study the tracking and fixing of performance bugs.

However, none of the above-mentioned performance bug detection approaches has been applied in different environments. In most cases, the environment is not explicitly mentioned. Hence, generalizing the findings across environments remains an open topic.

3 Case Study Setup

The goal of our case study is to evaluate the discrepancy between performance testing results from virtual and physical environments. We deploy our subject systems in two environments (physical and virtual) on identical hardware. A load driver is used to exercise our subject systems. After collecting and preprocessing the performance metrics, we analyze and draw conclusions based on: 1) single performance metrics, 2) the relationship between performance metrics, and 3) statistical models built from the performance metrics. An overview of our case study setup is shown in Fig. 1.

Fig. 1 An overview of our case study setup

3.1 Subject Systems

Dell DVD Store (DS2) (Jaffe and Muirhead 2011) is an online multi-tier e-commerce web application that is widely used in performance testing and prior performance engineering research (Shang et al., 2015; Nguyen et al., 2012; Jiang et al., 2009). We deploy DS2 (SLOC > 3,200) on an Apache (Version 3.0.0) web application server with MySQL 5.6 database server (Oracle 1998). CloudStore (CloudScale-Project 2014), our second subject system, is an open source application based on the TPC-W benchmark (TPC 2001). CloudStore (SLOC > 7,600) is widely used to evaluate the performance of cloud computing infrastructure when hosting web-based software systems and is leveraged in prior research (Ahmed et al., 2016). We deploy CloudStore on Apache Tomcat (Apache 2007) (version 7.0.65) with MySQL 5.6 database server (Oracle 1998).

3.2 Environmental Setup

The performance tests of the two subject systems are conducted on three machines in a lab environment. Each machine has an Intel i5 4690 Haswell quad-core 3.50 GHz CPU, 8 GB of memory, and 100 GB of SATA storage, and is connected to a local gigabit Ethernet network. The first machine hosts the application servers (Apache and Tomcat). The second machine hosts the MySQL 5.6 database server. The load drivers are deployed on the third machine. We separate the load driver, the web/application server and the database server onto different machines in order to mimic a real-world scenario and avoid interference among these processes. For example, isolating the load driver from the application and database servers ensures that the servers’ processors are not overused by workload generation. The operating system on all three machines is Windows 7. We disable all other processes and unrelated system services to minimize their performance impact. Since our goal is to compare performance metrics in virtual and physical environments, we set up the two environments as follows:

Virtual Environment

We install Virtual Box (version 5.0.16) and create only one virtual machine per physical machine, to avoid interference between virtual machines. For each virtual machine, we allocate two cores and three gigabytes of memory, which is well below the capacity of the host, to ensure that we do not saturate the host and produce unrealistic results. Virtual machines typically have an option of using disk pass-through (Costantini 2015). However, disk pass-through prevents the quick deployment of an existing virtual machine image that is designed for performance testing and the quick execution of performance tests (Srion 2015). Hence, we opt to disable disk pass-through, since enabling it is unlikely in practice. The network of the virtual machine is set up using a network address translation (NAT) configuration (Tyson 2001). The network traffic of the workload is generated on a dedicated load machine to keep our experiments as close to the real world as possible.

Physical Environment

We use the same hardware as in the virtual environment to set up our physical environment. To make the physical environment comparable to the virtual environment, we only enable two cores and three gigabytes of memory on each machine in the physical environment.

3.3 Performance Tests

DS2 is released with a dedicated load driver program that is designed to exercise DS2 for performance testing; we use this load driver to conduct the performance tests on DS2. We use Apache JMeter (Apache 2008) to generate the workload for the performance tests on CloudStore. For both subject systems, the workload of the performance tests is varied randomly and periodically in order to avoid bias from a consistent workload, and the variation is identical across environments. The workload variation is introduced by varying the number of threads, where a higher number of threads represents a higher number of users accessing the system. Each performance test runs for 9 hours after a 15-minute warm-up period of the system. We chose to run each test for 9 hours to ensure that our samples have enough data points for our results to be statistically significant. The nature of our performance tests is based on the related studies mentioned in Section 2.2. To ensure consistency between the performance tests, we restore the environments and restart the systems after every test.

3.4 Data Collection and Preprocessing

Performance Metrics

We use PerfMon (Microsoft Technet 2007) to record the values of performance metrics. PerfMon is a performance monitoring tool used to observe and record performance metrics such as CPU utilization, memory usage and disk I/O. We run PerfMon on both the application server and the database server machines. We record all the available performance metrics that PerfMon can monitor for a single process. In order to minimize the influence of PerfMon itself, we monitor only the two processes of the application server and the database server on the two dedicated machines. We record the performance metrics at an interval of 10 seconds. In total, we record 44 performance metrics.

System Throughput

We use the application server’s access logs from Apache and Tomcat to calculate the throughput of the system by measuring the number of requests per minute. The two datasets are then concatenated and mapped against the requests using their respective timestamps.

Since an end user considers the system as a whole, we combine the performance datasets from our application and database servers. In order to combine the datasets of performance metrics with the system throughput, and to minimize noise in the performance metric recording, we calculate the mean value of each performance metric for every minute. Then, we combine the datasets of performance metrics and system throughput based on their timestamps, on a per-minute basis. A similar approach has been applied to address challenges in mining performance metrics (Foo et al., 2010).
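
To illustrate this preprocessing step, the sketch below shows how the per-minute aggregation and the merge of metrics and throughput could be implemented with pandas. It is not the exact tooling that we used; the file and column names are hypothetical and only assume time-stamped PerfMon samples and access-log entries.

```python
import pandas as pd

# Hypothetical inputs: PerfMon samples (one column per metric, every 10 seconds)
# and access-log entries (one row per request), both indexed by timestamp.
metrics = pd.read_csv("perfmon_metrics.csv",
                      parse_dates=["timestamp"], index_col="timestamp")
requests = pd.read_csv("access_log.csv",
                       parse_dates=["timestamp"], index_col="timestamp")

# Average the 10-second metric samples per minute to reduce recording noise.
metrics_per_minute = metrics.resample("1min").mean()

# System throughput: number of requests observed per minute.
throughput_per_minute = requests.resample("1min").size().rename("throughput")

# Combine metrics and throughput on the per-minute timestamps.
combined = metrics_per_minute.join(throughput_per_minute, how="inner")
```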

4 Case Study Results

The goal of our study is to evaluate the discrepancy between performance testing results from virtual and physical environments, particularly considering the impact of such discrepancy on the analysis of these results. Our experiments are set in the context of analyzing performance testing data, based on the related work. As shown in Section 2, prior research and practitioners examine performance testing results with three types of approaches: 1) examining a single performance metric, 2) examining the relationship between performance metrics, and 3) building statistical models using performance metrics. Therefore, our experiments are designed to answer three research questions, where each question corresponds to one of the above types of analysis.

4.1 Are the Trend and Distribution of a Single Performance Metric Similar Across Environments?

Motivation

The most intuitive approach to examining performance testing results is to examine every single performance metric. As shown in Section 2.1.1, prior studies propose different approaches that typically compare the distribution or trend of each performance metric from different tests. Due to influences from the testing environments, performance testing results are not expected to be identical in raw values. However, the shape of the distribution and the trend of the metrics should be similar. For example, if we observe an increasing trend in memory usage in one environment but not in the other, we observe a discrepancy. In addition, the distribution differences between two test results should not be statistically significant. Therefore, we use quantile-quantile (Q-Q) plots and Kolmogorov-Smirnov (KS) tests on normalized metrics to examine the differences in trends and the shapes of the distributions.

Approach

After running the performance tests and collecting the performance metrics, we compare every single performance metric between the virtual and physical environments. Since the performance tests are conducted in different environments, the scales of the performance metrics are intuitively not the same. For example, the virtual environment may have higher CPU usage than the physical environment. Therefore, instead of comparing the raw values of each performance metric in the two environments, we study whether the performance metric follows the same shape of distribution and the same trend in the virtual and physical environments.

First, we plot a quantile-quantile (Q-Q) plot (NIST/SEMATECH 2003) for every performance metric in the two environments. A Q-Q plot is a plot of the quantiles of the first data set against the quantiles of the second data set. We also plot a 45-degree reference line on the Q-Q plots. If the performance metric in both environments follows the same shape of distribution, the points on the Q-Q plot should fall approximately along this reference (i.e., 45-degree) line. A large departure from the reference line indicates that the performance metrics in the virtual and physical environments come from populations with different shapes of distributions, which can lead to a different set of conclusions. For example, the virtual environment may have a CPU utilization spike at a certain time, while the spike is absent in the physical environment.
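
As an illustration, a Q-Q plot for a single metric could be produced as in the sketch below, assuming the two metric samples (one per environment) are available as numeric arrays; this is not our exact plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt

def qq_plot(metric_physical, metric_virtual, metric_name):
    """Plot the quantiles of a metric in the physical environment against the
    quantiles of the same metric in the virtual environment, together with a
    45-degree reference line."""
    probabilities = np.linspace(0.01, 0.99, 99)
    q_physical = np.quantile(metric_physical, probabilities)
    q_virtual = np.quantile(metric_virtual, probabilities)

    plt.scatter(q_physical, q_virtual, s=10)
    # Points close to the 45-degree line indicate similar distribution shapes.
    limits = [min(q_physical.min(), q_virtual.min()),
              max(q_physical.max(), q_virtual.max())]
    plt.plot(limits, limits, linestyle="--")
    plt.xlabel(metric_name + " quantiles (physical)")
    plt.ylabel(metric_name + " quantiles (virtual)")
    plt.show()
```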

Second, to quantitatively measure the discrepancy, we perform a Kolmogorov-Smirnov test (Stapleton 2008) between every performance metric in the virtual and physical environments. Since the scales of each performance metric in both environments are not the same, we first normalize the metrics based on their median values and their median absolute deviation:

$$ M_{\textit{normalized}}=\frac{M-\tilde{M}}{MAD(M)} $$
(1)

where \(M_{\textit{normalized}}\) is the normalized value of the metric, \(M\) is the original value of the metric, \(\tilde{M}\) is the median value of the metric and \(MAD(M)\) is the median absolute deviation of the metric (Walker 1929). The Kolmogorov-Smirnov test gives a p-value as the test outcome. A p-value ≤ 0.05 means that the result is statistically significant, and we may reject the null hypothesis (i.e., that the two populations are from the same distribution). By rejecting the null hypothesis, we accept the alternative hypothesis, which tells us that the performance metrics in the virtual and physical environments do not have the same distribution. We choose the Kolmogorov-Smirnov test since it does not make any assumption about the distribution of the metrics.

Finally, we calculate Spearman’s rank correlation between every performance metric in the virtual environment and the corresponding performance metric in the physical environment, in order to assess whether the same performance metric follows the same trend during the test in the two environments. Intuitively, two sets of performance testing results without discrepancy should show a similar trend, i.e., when memory keeps increasing in the physical environment (as in a memory leak), the memory should also increase in the virtual environment. We choose Spearman’s rank correlation since it does not make any assumption about the distribution of the metrics.
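
The per-metric comparison can be sketched with scipy as follows, assuming the two series of one metric are time-aligned arrays of equal length; the helper combines the normalization of (1), the Kolmogorov-Smirnov test and the Spearman trend correlation described above.

```python
import numpy as np
from scipy import stats

def normalize_by_deviance(metric):
    """Normalize a metric by its median and median absolute deviation, as in (1).
    Note: the MAD can be 0 for near-constant metrics (see Section 5.2), in which
    case this normalization is undefined."""
    metric = np.asarray(metric, dtype=float)
    median = np.median(metric)
    mad = np.median(np.abs(metric - median))
    return (metric - median) / mad

def compare_single_metric(metric_physical, metric_virtual):
    # Shape of the distribution: two-sample Kolmogorov-Smirnov test on the
    # normalized values of the metric from the two environments.
    _, ks_p = stats.ks_2samp(normalize_by_deviance(metric_physical),
                             normalize_by_deviance(metric_virtual))
    # Trend: Spearman's rank correlation between the two time-aligned series.
    rho, spearman_p = stats.spearmanr(metric_physical, metric_virtual)
    return {"ks_p_value": ks_p,
            "spearman_rho": rho,
            "spearman_p_value": spearman_p}
```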

Results

Most performance metrics do not follow the same shape of distribution in virtual and physical environments. Figures 2 and 3 show the Q-Q plots comparing the quantiles of performance metrics from the virtual and physical environments. Due to limited space, we only present the Q-Q plots for CPU user time, I/O data operations/sec and memory working set, for both the application server and the database server. The results show that the lines on the Q-Q plots are not close to the 45-degree reference line. By looking closely at the Q-Q plots, we find that the patterns of each performance metric differ between the subject systems. For example, the application (web) server’s CPU user time for DS2 in the virtual environment shows higher values than in the physical environment at the median to high range of the distribution, while the Q-Q plot of CloudStore shows higher values of the application (web) server’s CPU user time at the low range of the distribution. In addition, the lines of the Q-Q plots for the database memory working set have completely different shapes in DS2 and in CloudStore. These results imply that the discrepancies between virtual and physical environments differ between the subject systems. The impact of the subject systems warrants its own study.

Fig. 2 Q-Q plots for DS2

Fig. 3 Q-Q plots for CloudStore

The majority of the performance metrics have statistically significantly different distributions in the two environments (p-values lower than 0.05 in the Kolmogorov-Smirnov tests). Only 13 and 12 metrics (out of 44 for each environment) have p-values higher than 0.05, for DS2 and CloudStore, respectively, showing statistically insignificant differences between the distributions in the virtual and physical environments. By looking closely at these metrics, we find that they either do not relate closely to the execution of the subject system (e.g., application server CPU privileged time in DS2), or relate closely to the workload. Since the workload in the two environments is similar, it is expected that the metrics related to the workload follow the same shape of distribution. For example, the I/O operations are highly related to the workload. The metrics related to I/O operations may show statistically insignificant differences between the distributions in the virtual and physical environments (e.g., application server I/O write operations per second in DS2).

Most performance metrics do not have the same trend in virtual and physical environments. Table 1 shows the Spearman’s rank correlation coefficients and corresponding p-values between the virtual and physical environments for the selected performance metrics for which we shared the Q-Q plots. We find that for the application server memory working set in CloudStore and the database server memory working set in DS2, there exists a strong (0.69) and a moderate (0.46) correlation between the virtual and physical environments, respectively. By examining these metrics, we find that both have an increasing trend that may be caused by a memory leak. Such an increasing trend may be the cause of the moderate to strong correlation. Rather than covering only the selected metrics shown in the Q-Q plots, Table 2 presents a summary of the Spearman’s rank correlations of all the performance metrics. Most of the correlations have an absolute value of 0 to 0.3 (low correlation), or the correlation is not statistically significant (p-value > 0.05).

Table 1 Spearman’s rank correlation coefficients and p-values of the highlighted performance metrics for which we shared the Q-Q plots, in virtual and physical environments
Table 2 Summary of Spearman’s rank correlation p-values and absolute coefficients of all the performance metrics, in virtual and physical environments

Impact on the interpretation of examining a single performance metric. Practitioners often plot the trend of each important performance metric, identify the presence of outliers, or calculate the median or mean value of the metric to understand the performance of the system in general. However, based on our findings in this RQ, such analysis results may not be useful if they are from a virtual environment. For example, as shown in Figs. 2 and 3, many differences between the two distributions are in the lower and higher ends of the plots, which correspond to the low and high values of the metrics. Such values are often treated as outliers to be examined. However, if such outliers are due to the virtual environment rather than the system itself, the results may be misleading. In addition, since the distributions of the metrics are statistically different, the mean and median values of the metrics may also be misleading.


4.2 To What Extent Does the Relationship Between the Performance Metrics Change Across Environments?

Motivation

The relationship between two performance metrics may significantly change between two environments, which may be a hint of performance issues or system regressions. As found by Cohen et al. (2004), combinations of performance metrics are significantly more predictive of performance issues than a single metric. A change in these combinations can reflect a discrepancy in performance and can help a practitioner identify behavioral changes of a system between the two environments. For instance, in one release of the system, the CPU may be highly correlated with I/O (e.g., when I/O is high, CPU is also high), while in a new release of the system, the correlation between CPU and I/O may become low. Such a change in the correlation may expose a performance issue (e.g., high CPU without I/O operations may be due to a performance bug). However, if there is a significant difference in correlations simply due to the platform being used, i.e., virtual vs. physical, then practitioners need to be warned that a correlation discrepancy may be false. Therefore, we examine whether the relationship among performance metrics exhibits a discrepancy between the virtual and physical environments.

Approach

We calculate Spearman’s rank correlation coefficients among all the metrics from each performance test in each environment. Then we study whether such correlation coefficients are different between the virtual and physical environments.

First, we compare the changes in correlation between the performance metrics and the system throughput. For example, in one environment, the system throughput may be highly correlated with CPU usage, while in the other environment such correlation is low. In such a case, we consider there to be a discrepancy in the correlation coefficient between CPU and the system throughput. Second, for every pair of metrics, we calculate the absolute difference between the correlations in the two environments. For example, if CPU and memory have a correlation of 0.3 in the virtual environment and 0.5 in the physical environment, we report the absolute difference in correlation as 0.2 (|0.3 − 0.5|). Since we have 44 metrics in total, we plot a heatmap in order to visualize the 1,936 absolute difference values between every pair of performance metrics. The lighter the color of a block in the heatmap, the larger the absolute difference in correlation between that pair of performance metrics. With the heatmap, we can quickly spot the metrics that have a large discrepancy in correlation coefficients.
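
A sketch of this comparison follows, assuming the metrics of each environment are available as a pandas data frame with one column per metric (the same 44 columns in both); the exact plotting details are an implementation choice.

```python
import matplotlib.pyplot as plt

def correlation_difference_heatmap(metrics_virtual, metrics_physical):
    """Compute the absolute difference of the pairwise Spearman correlations in
    the two environments and show it as a heatmap (lighter = larger difference)."""
    corr_virtual = metrics_virtual.corr(method="spearman")
    corr_physical = metrics_physical.corr(method="spearman")
    difference = (corr_virtual - corr_physical).abs()

    # Differences range from 0 to 2; with a gray colormap, larger values appear lighter.
    plt.imshow(difference.values, cmap="gray", vmin=0, vmax=2)
    plt.colorbar(label="|corr(virtual) - corr(physical)|")
    plt.title("Correlation differences between environments")
    plt.show()
    return difference
```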

Results

The correlations between system throughput and performance metrics change between virtual and physical environments. Tables 3 and 4 present the top ten metrics with the highest correlations to system throughput in the physical environment for DS2 and CloudStore, respectively. We chose system throughput as our criterion since the workload was kept identical between the environments. We find that for these top ten metric sets, the difference in correlation coefficients between the virtual and physical environments is up to 0.78, and the rank changes from #9 to #40 in DS2 and from #1 to #10 in CloudStore.

Table 3 Top ten metrics with highest correlation coefficient to system throughput in the physical environment for DS2
Table 4 Top ten metrics with highest correlation coefficient to system throughput in the physical environment for CloudStore

There exist differences in correlation among the performance metrics from virtual and physical environments. Figures 4 and 5 present heatmaps showing the changes in correlation coefficients among the performance metrics between the virtual and physical environments. By looking at the heatmaps, we find hotspots (with lighter color) that have larger correlation differences. For the sake of brevity, we do not show all the metric names in our heatmaps. Instead, we enlarge the heatmaps by showing one of the hotspots for each subject system in Figs. 4 and 5. We find that the hotspots correspond to changes in correlation among I/O related metrics. Prior research on virtual machines has similar findings about I/O overhead in virtual machines (Menon et al., 2005; Kraft et al., 2011). In such a situation, when practitioners observe that the relationship between I/O metrics and other metrics changes, the change may not indicate a performance regression; rather, it may be due to the use of a virtual environment.

Fig. 4 Heatmap of correlation changes for DS2

Fig. 5 Heatmap of correlation changes for CloudStore

Impact on the interpretation of examining correlations between performance metrics. When a system is reported to have performance issues, correlations between metrics are often used in practice, as described in the motivation of this RQ. However, since such correlations can be inconsistent between virtual and physical environments, previously known correlations may no longer hold, or new correlations may emerge, due to the use of a virtual environment. For example, practitioners of a database-centric system may know that I/O traffic is correlated with CPU usage and system throughput. Examining these three metrics together can help diagnose performance issues; if no such correlation exists in the virtual environment, these three metrics together may not be as useful in performance issue diagnosis.


4.3 Can Statistical Performance Models Be Applied Across Virtual and Physical Environments?

Motivation

As discussed in the previous research question (see Section 4.2), the relationship among performance metrics is critical for examining performance testing results (see Section 2.1.2). However, thus far we have only examined the relationships between two performance metrics at a time. In order to capture the relationship among a large number of performance metrics, more complex modeling techniques are needed. Hence, we use statistical modeling techniques to examine the relationship among a set of performance metrics (Xiong et al., 2013; Cohen et al., 2004). Moreover, some performance metrics that have no impact on system performance are still examined. For example, for a software system that is CPU intensive, I/O operations may be irrelevant. Such performance metrics may expose large discrepancies between virtual and physical environments while not contributing to the examination of performance testing results. It is necessary to remove such performance metrics that do not contribute to, or impact, the results of the performance analysis. To address the above issues, modeling techniques have been proposed to examine performance testing results (see Section 2.1.3). In this step, we examine whether the models built from performance metrics can be applied across virtual and physical environments and whether we can minimize the discrepancy between such performance models.

Approach

We follow a model building approach that is similar to the approach used in prior research (Shang et al., 2015; Cohen et al., 2005; Xiong et al., 2013). We first build statistical models using performance metrics from one environment, then we test the accuracy of our performance model with the metric values from the same environment and also from the other environment. For example, if the model was built in the physical environment, it is tested in both the physical and virtual environments.

4.3.1 B-1: Reducing Metrics

Performance metrics that show little or no variation do not contribute to the statistical models; hence, we first remove performance metrics that have constant values in the test results. We then perform a correlation analysis on the performance metrics to remove multicollinearity based on statistical analysis (Kuhn 2008). We use Spearman’s rank correlation coefficient among all performance metrics from one environment. We find the pairs of performance metrics that have a correlation higher than 0.75, as 0.75 is considered to be a high correlation (Syer et al., 2017). From such a pair of performance metrics, we remove the metric that has a higher average correlation with all other metrics. We repeat this step until no correlation higher than 0.75 remains.
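
A minimal sketch of this reduction step, assuming the performance metrics of one environment are in a pandas data frame with one column per metric; the threshold follows the text above.

```python
import numpy as np
import pandas as pd

def remove_correlated_metrics(metrics: pd.DataFrame, threshold: float = 0.75) -> pd.DataFrame:
    """Drop constant metrics, then repeatedly remove one metric of the most
    correlated pair (|Spearman rho| > threshold), choosing the metric with the
    higher average correlation to all other metrics."""
    metrics = metrics.loc[:, metrics.std() > 0]              # constant metrics carry no signal
    while True:
        corr = metrics.corr(method="spearman").abs()
        corr = corr.mask(np.eye(len(corr), dtype=bool), 0)   # ignore self-correlation
        if corr.values.max() <= threshold:
            return metrics
        row, col = np.unravel_index(corr.values.argmax(), corr.shape)
        pair = [corr.columns[row], corr.columns[col]]
        drop = max(pair, key=lambda metric: corr[metric].mean())
        metrics = metrics.drop(columns=drop)
```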

We then perform a redundancy analysis on the performance metrics. The redundancy analysis considers a performance metric redundant if it can be predicted from a combination of the other metrics (Harrell 2001). We use each performance metric as a dependent variable and the rest of the metrics as independent variables to build a regression model, and we calculate the R² of each model. R², the coefficient of determination, measures how much of the variation in the dependent variable (the metric under test) can be explained by the other metrics (Andale 2012); a high R² therefore indicates that the metric under test carries little information beyond what the other metrics already provide. If the R² is larger than a threshold (0.9) (Syer et al., 2017), the current dependent variable (i.e., performance metric) is considered redundant. We then remove the performance metric with the highest R² and repeat the process until no remaining performance metric can be predicted with an R² higher than the threshold. For example, if CPU usage can be linearly modeled by the rest of the performance metrics with R² > 0.9, we remove the CPU metric.
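
A simplified sketch of this redundancy analysis is shown below; the original analysis follows Harrell (2001), so this Python version is only an approximation under the stated 0.9 threshold.

```python
from sklearn.linear_model import LinearRegression

def remove_redundant_metrics(metrics, threshold=0.9):
    """Repeatedly drop the metric that is best explained (highest R-squared)
    by a linear model of the remaining metrics, while that R-squared exceeds
    the threshold."""
    metrics = metrics.copy()
    while True:
        r_squared = {}
        for column in metrics.columns:
            X = metrics.drop(columns=column).values   # all other metrics
            y = metrics[column].values                # the metric under test
            r_squared[column] = LinearRegression().fit(X, y).score(X, y)
        most_redundant = max(r_squared, key=r_squared.get)
        if r_squared[most_redundant] <= threshold:
            return metrics
        metrics = metrics.drop(columns=most_redundant)
```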

Not all the remaining metrics are statistically significant contributors to the model. Therefore, in this step, we only keep the metrics that have a statistically significant contribution to the model. We leverage a stepwise function that adds the independent variables one by one to the model and excludes any metrics that do not contribute to the model (Kabacoff 2011).

4.3.2 B-2: Building Statistical Models

In the second step, we build a linear regression model (Freedman 2009) using the performance metrics that remain after the reduction and the removal of statistically insignificant metrics in the previous step as independent variables, and the system throughput as our dependent variable. We chose a linear regression model over other models because it is simple to interpret, which makes it easier to understand the discrepancy that is illustrated by the model. Similar models have been built in prior research (Cohen et al., 2005; Xiong et al., 2013; Shang et al., 2015).

After removing all the insignificant metrics, we have all the metrics that significantly contribute to the model. We use these metrics as independent variables to build the final model.
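
A sketch of this modeling step using ordinary least squares is shown below; we use statsmodels here for its readable summary, but the exact library is an implementation detail.

```python
import statsmodels.api as sm

def build_throughput_model(reduced_metrics, throughput):
    """Fit a linear regression with the reduced, significant performance metrics
    as independent variables and the system throughput as the dependent variable."""
    X = sm.add_constant(reduced_metrics)   # add the intercept term
    model = sm.OLS(throughput, X).fit()
    print(model.summary())                 # coefficients, p-values and R-squared
    return model
```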

4.3.3 V-1: Validating Model Fit

Before we validate the model with internal and external data, we first examine how well the model fits the data. If the model has a poor fit, then our findings from the model may be biased by noise from the poor model quality. We calculate the R² of each model to measure fit. If the model perfectly fits the data, the R² of the model is 1, while an R² of zero indicates that the model does not explain the variability of the response data. We also estimate the impact that each independent variable has on the model fit. We follow a “drop one” approach (Chambers et al., 1990), which measures the impact of an independent variable on a model by measuring the difference in the performance of models built using: (1) all independent variables (the full model), and (2) all independent variables except for the one under test (the dropped model). A Wald statistic is reported by comparing the performance of these two models (Harrell 2001). A larger Wald statistic indicates that an independent variable has a larger impact on the model’s performance, i.e., model fit. A similar approach has been leveraged in prior research (McIntosh et al., 2016). We then rank the independent variables by their impact on model fit.

4.3.4 V-2: Internal Validation

We validate our models with the performance testing data from the same environment. We leverage a standard 10-fold cross-validation process, which starts by partitioning the performance data into 10 partitions. We take one partition (fold) at a time as the test set and train on the remaining nine partitions (Refaeilzadeh et al., 2009; Kohavi 1995), similar to prior research (Malik et al., 2013). For every data point in the testing data, we calculate the absolute percentage error. For example, for a data point with a throughput value of 100 requests per minute, if our predicted value is 110 requests per minute, the absolute percentage error is 0.1 (\(\frac {|110-100|}{100}\)). After the ten-fold cross validation, we obtain a distribution of absolute percentage errors over all the data records.
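
The internal validation can be sketched as follows, assuming the reduced metrics X and throughput y are numpy arrays; the median of the returned error distribution is the kind of summary that Table 7 later reports.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def internal_validation_errors(X, y, folds=10):
    """10-fold cross validation inside one environment: train on nine folds,
    predict the held-out fold, and collect the absolute percentage error of
    every predicted throughput value."""
    errors = []
    for train_index, test_index in KFold(n_splits=folds, shuffle=True).split(X):
        model = LinearRegression().fit(X[train_index], y[train_index])
        predicted = model.predict(X[test_index])
        errors.extend(np.abs(predicted - y[test_index]) / y[test_index])
    return np.array(errors)   # distribution of absolute percentage errors
```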

4.3.5 V-3: External Validation

To evaluate whether the model built using performance testing data in one environment (e.g., virtual environment) can apply to another environment (e.g., physical environment), we test the model using the data from the other environment.

Since the performance testing data is generated from different environments, directly applying the data to the model would intuitively generate large errors. We adopt two approaches in order to normalize the data across environments: (1) Normalization by deviance. The first approach is the same as the one used when we compare the distributions of each single performance metric, shown in (1) in Section 4.1, i.e., calculating the relative deviance of a metric value from its median value. (2) Normalization by load. The second approach is the one proposed by Nguyen et al. (2012), which uses the load of the system to normalize the performance metric values across different environments. As there are varying inputs for the performance tests that we carried out, normalization by load helps to normalize the multi-modal distributions that may be caused by trivial tasks such as background (bookkeeping) processes.

To normalize a metric, we first build a linear regression model in each environment, with the metric as the independent variable and the throughput of the system as the dependent variable. With the linear regression model from one environment, the metric values can be represented by the system throughput. We then normalize the metric value using the linear regression model from the other environment. The details of the metric transformation are shown as follows:

$$\textit{throughput}_{p}= \alpha_{p} \times M_{p} + \beta_{p} $$
$$\textit{throughput}_{v}= \alpha_{v} \times M_{v} + \beta_{v} $$
$$M_{\textit{normalized}} = \frac{(\alpha_{v} \times M_{v})+\beta_{v}-\beta_{p}}{\alpha_{p}} $$

where \(\textit{throughput}_{p}\) and \(\textit{throughput}_{v}\) are the system throughput in the physical and virtual environment, respectively. \(M_{p}\) and \(M_{v}\) are the performance metrics from both environments, while \(M_{\textit{normalized}}\) is the metric after normalization. \(\alpha\) and \(\beta\) are the coefficient and intercept values of the linear regression models. After normalization, we calculate the absolute percentage error for every data record in the testing data.
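
A sketch of this transformation, assuming the metric and throughput series of each environment are numeric arrays:

```python
import numpy as np

def normalize_by_load(metric_virtual, throughput_virtual,
                      metric_physical, throughput_physical):
    """Map a virtual-environment metric into the scale of the physical
    environment using the two single-metric linear models defined above."""
    # Fit throughput = alpha * metric + beta in each environment.
    alpha_v, beta_v = np.polyfit(metric_virtual, throughput_virtual, 1)
    alpha_p, beta_p = np.polyfit(metric_physical, throughput_physical, 1)
    # Predict throughput from the virtual metric, then invert the physical model.
    return (alpha_v * np.asarray(metric_virtual) + beta_v - beta_p) / alpha_p
```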

4.3.6 Identifying Model Discrepancy

In order to identify the discrepancy between the models built using data from the virtual and physical environments, we compare the two distributions of absolute percentage error from our internal and external validation. If the two distributions are significantly different (e.g., the absolute percentage error from the internal validation is much lower than that from the external validation), the two models are considered to have a discrepancy. To be concrete, for each subject system we end up with four distributions of absolute percentage error: 1) modeling using the virtual environment and testing internally (on data from the virtual environment), 2) modeling using the virtual environment and testing externally (on data from the physical environment), 3) modeling using the physical environment and testing internally (on data from the physical environment), and 4) modeling using the physical environment and testing externally (on data from the virtual environment). We compare distributions 1) and 2), and we compare distributions 3) and 4). Since normalization based on deviance changes a metric value to a negative value when the metric value is lower than its median, such negative values cannot be used to calculate the absolute percentage error. We therefore perform a min-max normalization on the values before calculating the absolute percentage error. In addition, if the observed throughput value after normalization is zero (i.e., when the observed throughput value is the minimum of both the observed and predicted throughput values), we cannot calculate the absolute percentage error for that particular data record. Therefore, we remove a data record if its throughput value after normalization is zero. In our case study, we only removed one data record, when performing external validation with the model built in the physical environment.
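
One way the error computation described above could look, assuming the predicted and observed throughput values of the external validation are numeric arrays:

```python
import numpy as np

def external_validation_errors(predicted_throughput, observed_throughput):
    """Min-max normalize predicted and observed throughput together (so that
    possibly negative, deviance-normalized values become non-negative), then
    compute the absolute percentage error per data record, dropping records
    whose normalized observed value is zero."""
    predicted = np.asarray(predicted_throughput, dtype=float)
    observed = np.asarray(observed_throughput, dtype=float)
    low = min(predicted.min(), observed.min())
    high = max(predicted.max(), observed.max())
    predicted = (predicted - low) / (high - low)
    observed = (observed - low) / (high - low)
    keep = observed != 0                 # cannot divide by a zero observed value
    return np.abs(predicted[keep] - observed[keep]) / observed[keep]
```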

Results

The statistically significant performance metrics leveraged by the models in virtual and physical environments are different. Tables 5 and 6 show the summary of the statistical models built for the virtual and physical environments for the two subject systems. We find that all the models have a good fit (R² values of 66.9 to 94.6%). However, some statistically significant independent variables in one model do not appear in the other model. For example, Web Server Virtual Bytes ranks #4 in the model built from the physical environment data of CloudStore, while the metric is not significant in the model built from the virtual environment data. In fact, none of the significant variables in the model built from the virtual environment are related to the application server’s memory (see Table 6). We do observe some performance metrics that are significant in both models, even with the same ranking. For example, Web Server IO Other Bytes/sec is the #1 significant metric in both models built from the virtual and physical environment data of DS2 (see Table 5).

Table 5 Summary of statistical models built for DS2
Table 6 Summary of statistical models built for CloudStore

The prediction errors illustrate discrepancies between models built in virtual and physical environments. Although the statistically significant independent variables in the models built from the performance testing results in the virtual and physical environments are different, the models may still produce similar predictions due to correlations between metrics. However, we find that the external prediction errors are higher than the internal prediction errors for all four models from the virtual and physical environments for the two subject systems. In particular, Table 7 shows that the prediction error when using normalization by load is always higher than that of the internal validation. For example, the median absolute percentage error for CloudStore using normalization by load is 632% and 483% for the models built in the physical environment and the virtual environment, respectively, while the median absolute percentage error in internal validation is only 2% and 10% for the models built in the physical and virtual environments, respectively. However, in some cases, normalization by deviance can produce a low absolute percentage error in external validation. For example, the median absolute percentage error for CloudStore can be reduced to 9% using normalization by deviance.

Table 7 Internal and external prediction errors for both subject systems

One possible reason is that normalization by load, even though it has been shown to be effective in prior research (Nguyen et al., 2012), assumes a linear relationship between the performance metric and the system load. However, such an assumption may not hold in some performance testing results. For example, Table 3 shows that some I/O related metrics have a low correlation with the system load in virtual environments. On the other hand, normalization by deviance shows a much lower prediction error. We believe the reason is that virtual environments may introduce metric values with high variance; normalizing by deviance controls such variance, leading to lower prediction errors.

Impact on the interpretation of examining statistical performance models. Statistical performance models are often used to interpret relationships among many system performance metrics, e.g., which metrics are significantly associated with the system load and which performance metrics are redundant. Since the statistical performance models exhibit a large discrepancy, even after applying the normalization techniques proposed by prior research, we cannot directly apply the performance models built in the virtual environment to the physical environment. Even though our results show that normalizing by deviance can reduce the discrepancy, practitioners should still be aware of it when examining the performance models.


5 Discussion

In the previous section, we find that there is a discrepancy between performance testing results from the virtual and physical environments. However, such a discrepancy may also be due to other factors, such as 1) the instability of the virtual environments, 2) the particular virtual machine software that we used, or 3) the hardware resources allocated to the virtual environments. Therefore, in this section, we examine the impact of these factors to better understand our results.

5.1 Investigating the Stability of Virtual Environments

Thus far, we have performed our case studies in one virtual environment and compared the performance metrics to the physical environment. However, the stability of the results obtained from the virtual environment needs to be validated, in particular since VMs tend to be highly sensitive to the environment that they run in (Leitner and Cito 2016).

In order to study whether the virtual environment is stable, we repeat the same performance tests in the virtual environments for both subject systems. We perform the data analysis from Section 4.3 by building statistical models using the performance metrics. Following the previously mentioned approach, we build a model based on one of the runs, which serves as our training data, and test it on another run. In this case, external validation means that a model is trained on a different run than the one it is tested on. We validate our model by predicting the throughput of a different run.

Prediction error values (see Section 4.3.5) closer to 0 indicate that our model is able to successfully explain the variation of the throughput of a different run, whereas an external validation error closer to 1 or higher indicates instability of the environment. We find the external validation error to be 0.04 and 0.13 for CloudStore and DS2, respectively. The internal validation error is 0.03 and 0.09 for CloudStore and DS2, respectively. Such low error values show that the performance testing results from the virtual environments are rather stable.

5.2 Investigating the Impact of Specific Virtual Machine Software

In all of our experiments, we used Virtual Box to set up our virtual environments. However, there exists a plethora of VM software, and it can be argued that our chosen subject systems may behave differently with another virtualization product. The question that arises, then, is whether the choice of VM software impacts our findings. In order to address this question, we set up another virtual environment using VMWare (version 12) with the same allocated computing resources as in our Virtual Box setup.

To investigate this, we repeat the performance tests for both subject systems in this environment. We train statistical models on the performance testing results from VMWare and test them on both the results from the original virtual environment (Virtual Box) and the results from the physical environment. We could not apply normalization by deviance to the data from VMWare, since some of the significant metrics in the model have a median absolute deviation of 0, which makes the normalized metric values infinite (see (1)). We therefore only apply normalization by load.

Table 8 shows that the performance testing results from the two different virtual machine software products are similar, as indicated by the low percentage error when our model is tested on the Virtual Box data. In addition, the high error when predicting for the physical environment agrees with the results obtained when testing with the performance testing results from Virtual Box (see Table 7). These results show that the discrepancy observed in our experiments also exists for virtual environments that are set up with VMWare.

Table 8 Median absolute percentage error from building a model using VMWare data

5.3 Investigating the Impact of Allocated Resources

Another aspect that may impact our results is the amount of resources allocated to, and the configuration of, the virtual environment. We did not decrease the allocated resources, since decreasing the resources may lead to crashes in the testing environment.

To investigate the impact of the allocated resources, we increase the computing resources allocated to the virtual environments by increasing the number of CPU cores to 3 and the memory to 5 GB. We cannot allocate more resources to the virtual environment, since we need to reserve resources for the hosting OS. We train statistical models on the new performance testing results and test them on the performance testing results from the physical environment.

Similar to the results shown in Table 7, the prediction error is high when we normalize by load as per (1) (1.57 for DS2 and 1.25 for CloudStore), while normalizing by deviance significantly reduces the error (0.09 for DS2 and 0.07 for CloudStore). We conclude that our findings still hold when the allocated resources are changed and that this change has minimal impact on the results of our case studies.

6 Threats to Validity

6.1 External Validity

We chose two subject systems, CloudStore and DS2, and two virtual machine software products, VirtualBox and VMWare, for our study. The two subject systems have years of history, and prior performance engineering research has studied both systems (Jiang et al., 2009; Nguyen et al., 2012; Ahmed et al., 2016). The virtual machine software that we used is widely used in practice. Nevertheless, more case studies on other subject systems in other domains, with other virtual machine software, are needed to evaluate our findings. We also present our results based on our subject systems only and do not generalize to all virtual machine software.

6.2 Internal Validity

Our empirical study is based on the performance testing results of our subject systems. The quality of the performance tests and the way they are conducted may introduce threats to the validity of our findings. In particular, our approach is based on the recorded performance metrics, and the quality of the recorded metrics can impact the internal validity of our study. We followed approaches from prior research to control the workload and to introduce workload variation on our subject systems. However, we acknowledge that there exist other ways to control and vary the workload. Our performance tests all last for 9 hours, while the length of the performance tests may impact the findings of the case study. Replicating our study using other performance monitoring tools, such as psutil (Rodola 2009), using other approaches to control and vary the workload of the system, and running the performance tests for a longer period of time (for example, 72 hours), may address this threat.

Even though we build statistical models using performance metrics and system throughput, we do not assume that there is a causal relationship. The use of statistical models merely aims to capture the relationship among multiple metrics. Similar approaches have been used in prior studies (Cohen et al., 2005; Shang et al., 2015; Xiong et al., 2013).

6.3 Construct Validity

We monitor performance by recording performance metrics every 10 seconds and combine the performance metrics of every minute into an average value. There may exist unfinished system requests when we record the system performance, leading to noise in our data. We choose a time interval (10 seconds) that is much larger than the response time of the requests (less than 0.1 seconds) in order to minimize this noise. Repeating our study with other time interval sizes would address this threat. We exploit two approaches to normalize performance data from different environments. We also observe that our R² values are high. Although a high R² indicates that our model fits the data well, it may also be an indication of overfitting. There may exist other, more advanced approaches to normalize performance data from heterogeneous environments; we plan to extend our study with other possible normalization approaches. There may also exist other ways of examining performance testing results. We plan to extend our study by evaluating the discrepancy using other ways of examining performance testing results in virtual and physical environments.
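As a small illustration of the aggregation step described above, 10-second samples can be averaged per minute as sketched below; the file name and column names are hypothetical.

    import pandas as pd

    # Hypothetical raw samples: one row every 10 seconds, with a timestamp column.
    samples = pd.read_csv("perfmon_samples.csv", parse_dates=["timestamp"])
    per_minute = (
        samples.set_index("timestamp")
               .resample("1min")
               .mean(numeric_only=True)
    )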

In our performance tests, we consider the subject systems as a whole from the users’ point of view. We did not conduct isolated performance testing for each feature or component of the system. Isolated performance testing may unveil more discrepancies than our results. Future work may consider such isolated performance tests to address this threat.

In practice, the system performance may be affected by interference from other environmental factors. However, in our experiments, we opt for a more controlled environment in order to better understand the discrepancy without any environmental interference; hence, we limit the possibility that the discrepancy stems from handling interference rather than from the environments themselves. Future work can investigate the performance impact of different environments in the presence of such interference.

We recorded 44 performance metrics that are readily available from PerfMon, along with the calculated throughput of the subject systems. However, there may exist other valuable performance metrics, such as system load. A prior study shows that most performance metrics are often correlated with each other (Malik et al., 2010b). Future work may expand our list of performance metrics to address this threat.

7 Conclusion

Performance assurance activities are vital in ensuring software reliability. Virtual environments are often used to conduct performance tests, yet the discrepancy between performance testing results in virtual and physical environments has not previously been evaluated. We aim to highlight whether a discrepancy between physical and virtual environments impacts the performance testing and analysis activities carried out in practice. In this paper, we evaluate such discrepancy by conducting performance tests on two open source systems (DS2 and CloudStore) in both virtual and physical environments. By examining the performance testing results, we find that there exists a discrepancy between performance testing results in virtual and physical environments when examining single performance metrics, the relationships among performance metrics, and statistical models built from performance metrics, even after we normalize performance metrics across the different environments. The major contributions of this paper include:

  • Our paper is one of the first studies that evaluates the discrepancy in the context of analyzing performance testing results in virtual and physical environments.

  • We find that the relationships among I/O-related metrics show large differences between virtual and physical environments. Developers cannot assume a straightforward overhead from the virtual environment (such as a simple increase in CPU usage).

  • Prior approaches that were proposed to normalize performance testing results across heterogeneous environments and workloads may not work between physical and virtual environments. We find that normalizing performance metrics based on deviance may reduce the discrepancy. Practitioners may exploit such normalization techniques when analyzing performance testing results from virtual environments.

Our results highlight the need to be aware of and to reduce the discrepancy between performance testing results in virtual and physical environments, for both practitioners and researchers.

Future Work

This paper is a first step toward a deeper understanding of the discrepancy between performance testing results in virtual and physical environments, and of the impact of such discrepancy on detecting performance issues. With the knowledge of this discrepancy, we can, in the future, better understand the existence and magnitude of its impact on detecting real-world performance bugs. Moreover, future research can focus on generating comparable performance testing results from different environments with different workloads.