1 Introduction

Microbenchmarking is a form of lightweight performance testing, widely used to assess the execution time of Java software (Leitner and Bezemer 2017). Although less demanding than other performance testing techniques (e.g., load tests (Jiang and Hassan 2015)), Java microbenchmarking requires careful design (Costa et al. 2021; Georges et al. 2007; Kalibera and Jones 2013) to enable a reliable performance assessment. A key challenge is the inherent non-linearity of Java performance: The Java Virtual Machine (JVM) uses just-in-time compilation to translate “hot” parts of the Java code into efficient machine code at run-time (Barrett et al. 2017), leading to (often severe) performance fluctuations and potentially unstable results.

To tackle this problem, practitioners rely on the assumption that microbenchmarking is characterized by two distinct phases. During an initial warmup phase, the JVM determines which parts of the software under test would most benefit from dynamic compilation, then, in a subsequent phase, the benchmark reaches a steady state of performance. Based on that, benchmarks are typically designed to discard measurements of the warmup phase and focus on steady state performance (Barrett et al. 2017; Georges et al. 2007; Kalibera and Jones 2013). Java Microbenchmark Harness (JMH), i.e., the most popular Java microbenchmarking framework (Leitner and Bezemer 2017), leverages this concept and enables developers to manually configure the expected warmup time of a benchmark. Once launched, a JMH benchmark continuously executes the software under test for the configured warmup time, and, only after that, starts to collect steady state performance measurements.

Although considered the cornerstone of most of the current Java microbenchmarking practice, the two-phase assumption has not yet been confirmed by empirical studies. Quite the opposite: there are indications that it oversimplifies actual microbenchmarking behavior. In a recent study, Barrett et al. (2017) studied a set of small and deterministic benchmarks (Bagley et al. 2004; Bolz and Tratt 2015) across different types of VMs (including the JVM), and found that a relevant portion of benchmarks never hit the steady state. It is worth noticing that these benchmarks significantly differ from “software testing oriented” benchmarks (e.g., JMH benchmarks). Indeed, they are not aimed at assessing specific software; rather, they are typically used as optimization targets by VM authors. Moreover, they are generally more effectively optimized by VMs than average software (Ratanaworabhan et al. 2009). Despite the peculiarity of these benchmarks, the finding of Barrett et al. (2017) raises concerns about the current Java microbenchmarking practice and calls for further empirical investigation.

Nevertheless, even when benchmarks consistently reach a steady state, performance assessment remains far from trivial. A key challenge is to effectively estimate the warmup time. An overestimated warmup time wastes execution time, thereby potentially hampering the adoption of benchmarks in the Continuous Integration (CI) pipeline (Laaber et al. 2020; Traini 2022). On the other hand, an underestimated warmup time may easily mislead steady state performance assessment (Georges et al. 2007; Kalibera and Jones 2013).

The current state-of-practice mostly relies on software developers’ knowledge to determine warmup time. Software developers estimate warmup time based on their expertise, and statically configure benchmark execution according to this estimation. Unfortunately, no empirical studies so far have directly investigated the effectiveness of this practice for steady state performance assessment.

Recently, an alternative approach to static benchmark configuration has been proposed by Laaber et al. (2020). This approach, called dynamic reconfiguration, leverages stability criteria (He et al. 2019; Kalibera and Jones 2013) to automatically determine the end of the warmup phase at run-time. According to their results, when compared to JMH default configurations, dynamic reconfiguration can significantly reduce execution time with low impact on result quality. Despite these promising results, there is still little knowledge about the effectiveness of dynamic reconfiguration for steady state performance assessment. Moreover, it is still unclear whether such techniques can improve on the current practice, i.e., developer static configurations.

In the past years, Java microbenchmarks have been widely studied in the literature. Leitner and Bezemer (2017) found that JMH was one of the predominant microbenchmarking frameworks in the Java community. Costa et al. (2021) empirically studied five JMH bad practices. Laaber et al. (2019) performed an exploratory study on software microbenchmarking in the cloud. Samoaa and Leitner (2021) studied the impact of parameterization in JMH microbenchmarks.

Despite these efforts, there is still a lack of knowledge on the effectiveness of modern Java microbenchmarking for steady state performance assessment. In this paper, we aim to fill this gap by presenting the first comprehensive study that investigates steady state performance assessment in Java microbenchmarking. After extensive experimentation with 586 JMH benchmarks from 30 Java systems, involving \(\sim \)9.056 billion benchmark invocations for an overall execution time of \(\sim \)93 days, we determined whether and when each benchmark reaches a steady state using an automated statistical approach by Barrett et al. (2017) based on changepoint analysis (Killick et al. 2012). At the time of writing, Barrett et al.’s approach represents one of the most advanced automated techniques to determine steady state execution in Java benchmarks. Besides investigating whether benchmarks ever reach a steady state, we also comprehensively evaluated the effectiveness of the current state-of-practice and state-of-the-art. In particular, we investigated to what extent statically defined developer configurations (i.e., the state-of-practice) and dynamic reconfiguration techniques (i.e., the state-of-the-art) are effective in ensuring a reliable and time-efficient assessment of steady state performance in JMH benchmarks. Moreover, we quantified the potential side effects of inaccurate warmup estimation, both in terms of execution time waste and misleading performance measurements.

Laaber et al. (2020) have already investigated the effectiveness of dynamic reconfiguration. However, their study was mainly concerned with a particular aspect of Java microbenchmarking, i.e., reducing execution time. In this paper, instead, we aim to provide a comprehensive investigation of the effectiveness of dynamic reconfiguration (and developer static configurations) for steady state performance assessment. Given our goal, we do not use JMH defaults as baselines, as done by Laaber et al., since there is no guarantee that they can effectively capture steady state performance. Indeed, although JMH defaults are a reasonable baseline for comparison, their use may have downsides when studying steady state performance. Several studies have shown that benchmarks often reach their steady states in different numbers of iterations (Georges et al. 2007; Kalibera and Jones 2013), thus it may be misleading to use a single (though reasonable) configuration as baseline, i.e., JMH defaults may be effective for some benchmarks and suboptimal for others. To avoid this problem, we leverage a different (and more rigorous) approach to assess dynamic reconfiguration effectiveness. We first use a state-of-the-art steady state detection technique to determine if/when a benchmark reaches a steady state of performance, and then rely on its outcome to assess dynamic reconfiguration effectiveness. Besides this evaluation of dynamic reconfiguration techniques, this paper presents the following novel investigations. First, we present the first study that investigates if/when JMH benchmarks reach a steady state of performance. Second, we perform the first evaluation of the effectiveness of statically defined developer configurations. Third, we introduce the first comparison between the effectiveness of developer configurations (i.e., the current state-of-practice) and dynamic reconfiguration techniques (i.e., the current state-of-the-art).

Our results show that JMH benchmarks do not always reach a steady state of performance, thereby challenging the current cornerstone of Java microbenchmarking, i.e., the two-phase assumption. This finding implies that practitioners may rely on measurements that are not representative of “actual” steady state performance. In addition, our results suggest that developer static configurations are often ineffective for warmup estimation, and may cause either overly long execution times or misleading performance assessment. On the other hand, dynamic reconfiguration techniques show significant improvement over the current state-of-practice, but they still produce inaccurate estimates of the warmup time, hence causing time-consuming benchmark executions and distorted results. This finding highlights room for improvement for dynamic reconfiguration, and calls for further research on this topic.

The main contributions of this paper are:

  • a statistically rigorous investigation of steady state performance in JMH microbenchmarks.

  • an empirical evaluation of developer static configurations in JMH microbenchmarks.

  • a comprehensive comparison between the effectiveness of developer static configurations and state-of-the-art dynamic reconfiguration techniques.

  • a large dataset of labeled benchmark executions to facilitate future research on Java steady state performance assessment, and foster further innovations in dynamic reconfiguration.

The remainder of this paper is organized as follows. Section 2 introduces steady state performance assessment and JMH microbenchmarks. Section 3 describes our research questions and Section 4 explains the experimental design. Section 5 reports the results. Section 6 discusses some implications of our findings. Section 7 describes threats to validity. Section 8 presents related work, and Section 9 concludes this paper.

2 Background

2.1 Steady State Performance

At the beginning of its execution, a Java microbenchmark is typically executed slowly by the JVM. In a subsequent phase, the JVM detects “hot” (i.e., frequently executed) loops or methods, and dynamically compiles them into optimized machine code. As a consequence, subsequent executions of those loops or methods (usually) become faster. Once dynamic compilation is completed, the JVM is said to have finished warming up, and the benchmark is said to be executing at a steady state of performance.

Java microbenchmarking aims at assessing the steady state performance (e.g., execution time) of Java software. The typical approach to collect steady state measurements is straightforward: a microbenchmark is executed a certain number of times, and the first n benchmark executions (i.e., those related to the warmup phase) are discarded to prevent potentially misleading results. Unfortunately, a fixed number n of executions does not guarantee that warmup has ended. In order to investigate this issue, researchers started to develop data-driven methodologies to identify the end of the warmup phase. Prominent works in this regard are the methodologies proposed by Georges et al. (2007) and Kalibera and Jones (2013). The former uses preset thresholds on the coefficient of variation to determine whether the warmup phase has finished, while the latter leverages data visualization techniques (i.e., auto-correlation function plots, lag plots and run-sequence plots). Unfortunately, each of these methodologies has its own drawbacks. Kalibera and Jones showed that the heuristic of Georges et al. (2007) often fails to accurately determine the end of the warmup phase (Kalibera and Jones 2013). On the other hand, the methodology proposed by Kalibera and Jones (2013) is mostly based on a manual process, which implies some major limitations: (i) humans are prone to error and disagreement, and (ii) manual analysis cannot be automated, and therefore does not scale.

To overcome these limitations, Barrett et al. (2017) recently proposed a novel automated technique based on changepoint detection (Eckley et al. 2011). The main advantage of this technique is that it provides a more rigorous approach than the simple heuristic of Georges et al., while still enabling a fully automated process (unlike Kalibera and Jones’ approach). Barrett et al.’s technique leverages a standard changepoint detection algorithm, namely PELT (Killick et al. 2012), to determine shifts in benchmark execution time. The identified shifts (i.e., changepoints) are then post-processed (e.g., by removing negligible performance shifts) to determine if/when a benchmark reaches a steady state of performance. To the best of our knowledge, this technique currently represents the state-of-the-art for steady state detection.

Figure 1a and b show two examples of benchmark executions along with the performance shifts identified by the PELT algorithm. The former consistently reaches a steady state of performance, while the latter doesn’t.

Fig. 1

Two examples of benchmark executions from our results. The grey line represents the execution times of each benchmark iteration visualized as a time-series. The x-axis represents the benchmark iteration number, while the y-axis represents the mean execution time within the iteration. Shifts in performance behaviour (i.e., changepoints) are indicated by dashed vertical red lines. The height of each dashed horizontal red line denotes the mean execution time within the changepoint segment. The plot titles show: the system that the benchmark belongs to (e.g., h2o-3), the benchmark name (e.g., water.util.IcedHashMapBench.writeMap), and the JMH fork number

2.2 Java Microbenchmark Harness (JMH)

JMH is the de-facto standard framework for writing and executing microbenchmarks for Java software. It enables software developers to easily develop and execute microbenchmarks that measure the fine-grained performance of specific units of Java code (e.g., methods). JMH supports steady state performance assessment by providing facilities that enable developers to statically configure the number of times each benchmark execution will be repeated (without compromising the reliability of results).

Figure 2 depicts a typical JMH benchmark execution. JMH supports three different levels of repetition: forks, iterations and invocations. Invocations (i.e., the lowest level of repetition) are nominal benchmark executions that are continuously performed within a predefined amount of time, namely an iteration. In turn, a fork consists of a sequence of iterations performed on a freshly instantiated JVM. Indeed, as suggested by best practices (Barrett et al. 2017; Costa et al. 2021; Georges et al. 2007; Kalibera and Jones 2013), iterations should be repeated multiple times on fresh JVM instantiations (i.e., forks) to mitigate the contextual effects of confounding factors.

Fig. 2

The JMH microbenchmark life cycle

Each fork is usually composed of two distinct types of iterations: warmup and measurement iterations. Warmup iterations are intended to bring the fork (i.e., the fresh JVM) into a steady state of performance, while measurement iterations are the ones in which performance measurements are actually collected. Each measurement iteration typically returns a set of performance measurements (e.g., a sample of benchmark invocation execution times) or a performance statistic (e.g., average execution time or throughput).

JMH provides a set of configuration parameters to define the different levels of repetition involved in microbenchmarking. These parameters include: warmup iteration time w, measurement iteration time r, warmup iterations wi, measurement iterations i, and forks f. Iteration time parameters (w and r) define the minimum time spent within an iteration. Given an iteration time w (resp. r), a warmup (resp. measurement) iteration will continuously perform benchmark invocations until the iteration time expires. Warmup and measurement iterations (i.e., wi and i), instead, define the number of iterations performed within each fork. Finally, the fork parameter f defines the number of fresh JVM instantiations, i.e., the highest level of benchmark repetition.

Typically, Java developers set JMH configuration parameters directly in benchmark code through Java annotations. Nonetheless, when launching a benchmark, JMH allows developer configurations to be overridden via Command Line Interface (CLI) arguments.
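For illustration, the following is a minimal sketch of how these parameters map to JMH annotations; the class, method, and parameter values are made up for the example and do not come from our subject systems:

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)       // wi = 5, w = 1s
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS) // i = 10, r = 1s
@Fork(3)                                                             // f = 3
@State(Scope.Benchmark)
public class StringConcatBenchmark {

    private final String left = "foo";
    private final String right = "bar";

    @Benchmark
    public String concat() {
        // Method under test: JMH invokes it continuously within each iteration.
        return left + right;
    }
}
```

When launching this benchmark, a CLI argument such as -f 5 would override the @Fork(3) annotation.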

2.3 Dynamic Reconfiguration

JMH allows developers to statically define the expected length of the warmup phase using configuration parameters such as the warmup iteration time w and the number of warmup iterations wi. Such estimation is typically performed on the basis of developer expertise and/or the nature of the benchmark. Previous studies have shown that a static definition of the warmup time can be quite detrimental for steady state performance assessment (Barrett et al. 2017; Georges et al. 2007; Kalibera and Jones 2013). An alternative to JMH static configuration can be found in a recent approach called dynamic reconfiguration (Laaber et al. 2020). This approach determines, during a JMH benchmark execution, whether the measurements appear to be stable and further executions are unlikely to improve their accuracy. The rationale behind dynamic reconfiguration is that, by using automated stability criteria to halt the benchmark execution, developers can save some of the time dedicated to performance testing while keeping an acceptable level of accuracy. To achieve this, the dynamic reconfiguration approach checks the most recent iterations against a stability criterion over a sliding window. Laaber et al. (2020) propose and evaluate three stability criteria to dynamically estimate the end of the warmup phase:

  • Coefficient of variation (CV): CV is the ratio of the standard deviation to the mean, and it can be used to compare normally distributed data. Even when data is not normally distributed, as is often the case for benchmark data, CV can still provide an estimate of measurement variability. A fork is considered stable when the difference between the largest and the smallest value of CV computed on the sliding window is within a fixed threshold (a sketch of this check is given after this list).

  • Relative confidence interval width (RCIW): in this case, the variability in measurement data is estimated using a technique by Kalibera and Jones (2013, 2020) that employs hierarchical bootstrapping to compute the RCIW for the mean. The hierarchical levels are invocations, iterations, and forks.

  • Kullback-Leibler divergence (KLD): a technique described by He et al. (2019) to compute the probability that two distributions are similar, based on the Kullback-Leibler divergence (KLD) (Kullback and Leibler 1951). In this case, the first distribution contains all the measurements in the sliding window excluding the last one, while the second distribution also includes the last measurement. Stability is reached when the mean of the computed similarity probabilities is above a given threshold.
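The following is a minimal sketch of the CV-based criterion described in the first item above, under one possible reading of it: a CV value is computed over a sliding window of per-iteration mean execution times, and the fork is deemed stable once the spread of the recent CV values falls below a threshold. The window size and threshold are illustrative and are not the values used by Laaber et al. (2020):

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class CvStabilityCheck {
    private static final int WINDOW = 5;          // illustrative sliding window size
    private static final double THRESHOLD = 0.01; // illustrative stability threshold

    private final Deque<Double> iterationMeans = new ArrayDeque<>();
    private final Deque<Double> recentCvs = new ArrayDeque<>();

    /** Feeds the mean execution time of the latest iteration; returns true once stability is reached. */
    boolean addIterationMean(double mean) {
        iterationMeans.addLast(mean);
        if (iterationMeans.size() > WINDOW) iterationMeans.removeFirst();
        if (iterationMeans.size() < WINDOW) return false;

        recentCvs.addLast(coefficientOfVariation(iterationMeans));
        if (recentCvs.size() > WINDOW) recentCvs.removeFirst();
        if (recentCvs.size() < WINDOW) return false;

        // Stable when the largest and smallest CV over the window are close enough.
        double max = recentCvs.stream().mapToDouble(Double::doubleValue).max().orElse(0);
        double min = recentCvs.stream().mapToDouble(Double::doubleValue).min().orElse(0);
        return max - min <= THRESHOLD;
    }

    private static double coefficientOfVariation(Deque<Double> values) {
        double mean = values.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double variance = values.stream().mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0);
        return Math.sqrt(variance) / mean;
    }
}
```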

In order to further reduce benchmark execution time, dynamic reconfiguration leverages the same stability criteria also to dynamically determine whether to execute the next fork or not.

3 Research Questions

In this work we aim to answer the following Research Questions (RQs):

  • RQ1 Do Java microbenchmarks reach a steady state of performance?

  • RQ2 How does steady state impact microbenchmark performance?

  • RQ3 How effective are developer configurations in assessing Java steady state performance?

  • RQ4 How effective is dynamic reconfiguration in assessing Java steady state performance?

  • RQ5 Does dynamic reconfiguration provide more effective warmup estimates than developers do?

In the following subsections, we discuss in detail the motivation for each of the RQs and the methodology used to gather the answers. In Section 4, we describe the experimental setup along with the benchmarks used in our empirical study.

3.1 RQ1—Steady State Assessment

With this research question, we aim to evaluate whether Java microbenchmarks reach a steady state of performance. We first use a state-of-the-art steady state detection technique (Barrett et al. 2017) to determine whether and when each fork reaches a steady state of performance (see Section 4.3 for details). Based on these results, we then classify each benchmark as either (i) steady state, if all the forks reach a steady state of performance, (ii) no steady state, if none of the forks reach a steady state or (iii) inconsistent, if the execution involves both steady state and no steady state forks.

We report the classification shares for both benchmarks and forks. Additionally, we report the percentages of no steady state forks for each benchmark.

3.2 RQ2—Steady State Impact

With this research question, we want to investigate to what extent the attainment of a steady state impacts benchmark performance. To do so, we compare the measurements collected during steady state phases against those collected in non-steady phases of benchmark execution. We perform two different analyses: the first one investigates the difference between steady and non-steady performance within the same fork, the second one assesses the same aspect across different forks. Besides these two analyses, we also investigate two potential countermeasures to mitigate performance deviations in non-steady phases of benchmark execution.

All the aforementioned analyses involve a comparison between a set of steady measurements Mstable and a set of non-steady measurements Munstable. In order to assess to what extent non-steady measurements differ from steady measurements, we use relative performance deviation (RPD).

In the following, we first explain the process we use to compute RPD. Then, we describe in detail the four analyses we use to gather the answer.

3.2.1 Relative Performance Deviation

In order to quantify the relative performance deviation of Munstable compared to Mstable, we use the technique proposed by Kalibera and Jones (2013). The main benefit of this technique is that it provides a clear and rigorous account of the relative performance change and the uncertainty involved. For example, it can indicate that a set of execution time measurements is higher than another by X% ± Y% with 95% confidence. Following the guidelines of Kalibera and Jones (2013, 2020), we build each confidence interval using bootstrapping with random re-sampling with replacement (Davison and Hinkley 1997), with a confidence level of 95%. We run 10,000 bootstrap iterations. At each iteration, new realizations \(\hat {M}^{unstable}\) and \(\hat {M}^{stable}\) (of Munstable and Mstable, respectively) are simulated and the relative performance change is computed. The realization \(\hat {M}^{unstable}\) is simulated by randomly sampling a subset of real data from Munstable with replacement. Similarly, \(\hat {M}^{stable}\) is simulated by randomly sampling Mstable. The two means (μunstable and μstable) and the relative performance change (ρ) for the simulated measurements are computed as follows:

$$ \mu^{unstable}=\frac{{\sum}_{i=1}^{n}{\hat{M}^{unstable}_{i}}}{n} $$
$$ \mu^{stable}=\frac{{\sum}_{i=1}^{m}{\hat{M}^{stable}_{i}}}{m} $$
$$ \rho=\frac{\mu^{unstable}-\mu^{stable}}{\mu^{stable}} $$

where n is the number of measurements in \(\hat {M}^{unstable}\), m the number of measurements in \(\hat {M}^{stable}\). After the termination of all iterations, we collect a set of simulated realizations of the relative performance change P = {ρi∣1 ≤ i ≤ 10,000} and estimate the 0.025 and 0.975 quantiles on it, for a 95% confidence interval. We consider a relative performance change as statistically significant if the confidence interval does not contain 0. For example, given a confidence interval of (0.05,0.07), we can say that the mean execution time in Munstable is higher than the one in Mstable with a relative performance change that ranges between 5% and 7% with 95% confidence.

We leverage the confidence interval of the mean relative performance change (lb,ub) to compute the relative performance deviation:

$$ RPD=\begin{cases} 0 & \text{if } lb \leq 0 \leq ub \\ \left|\frac{lb + ub}{2}\right| & \text{otherwise} \end{cases} $$

In other words, RPD is the absolute value of the center of the 95% confidence interval of the mean relative performance change when the interval does not contain zero; it evaluates to 0 when the interval contains zero, i.e., when no statistically significant performance change is reported. Higher values of RPD indicate that Munstable strongly deviates from Mstable.
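For concreteness, the following sketch shows a flat (non-hierarchical) version of this bootstrap procedure; the study follows Kalibera and Jones’ technique, so treat this as a simplification with illustrative names rather than the exact computation used in the paper:

```java
import java.util.Arrays;
import java.util.Random;

final class Rpd {

    private static final int BOOTSTRAP_ITERATIONS = 10_000;

    /** Relative performance deviation of unstable vs. stable measurements (0 if the 95% CI contains 0). */
    static double rpd(double[] unstable, double[] stable, Random rnd) {
        double[] changes = new double[BOOTSTRAP_ITERATIONS];
        for (int b = 0; b < BOOTSTRAP_ITERATIONS; b++) {
            double muUnstable = mean(resample(unstable, rnd));
            double muStable = mean(resample(stable, rnd));
            changes[b] = (muUnstable - muStable) / muStable; // relative performance change rho
        }
        Arrays.sort(changes);
        double lb = quantile(changes, 0.025);
        double ub = quantile(changes, 0.975);
        return (lb <= 0 && 0 <= ub) ? 0.0 : Math.abs((lb + ub) / 2.0);
    }

    /** Random re-sampling with replacement. */
    private static double[] resample(double[] data, Random rnd) {
        double[] sample = new double[data.length];
        for (int i = 0; i < data.length; i++) {
            sample[i] = data[rnd.nextInt(data.length)];
        }
        return sample;
    }

    private static double mean(double[] data) {
        double sum = 0;
        for (double d : data) sum += d;
        return sum / data.length;
    }

    private static double quantile(double[] sorted, double q) {
        int idx = (int) Math.ceil(q * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(sorted.length - 1, idx))];
    }
}
```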

3.2.2 Analyses

In order to analyze performance deviation within forks, we consider only forks that have reached a steady state of performance, and we analyze how performance changes when the steady state is reached. In particular, we partition the set of measurements of each fork into two distinct sets, namely Mstable and Munstable, and we compare them to quantify the relative performance deviation (RPD). Mstable contains the measurements collected during steady state execution, i.e., those gathered after the steady state starting time st, while Munstable contains the measurements collected before st. Figure 3a provides a graphical representation of this process for a given steady fork.

Fig. 3

RQ2 analyses

To investigate performance deviation across forks, instead, we assess how performance differs between steady and non-steady forks. To do that, we exclusively consider inconsistent benchmarks (i.e., the ones that contain both steady and non-steady forks), and we randomly pick from each benchmark a pair of forks (f, \(\hat {f}\)): one that has reached a steady state of performance (f ) and one that has not (\(\hat {f}\)). We then use each pair to compare the measurements collected in the steady fork (Mstable) against those collected in the non-steady fork (Munstable), and we quantify RPD. In order to enable a fair comparison, we use the same measurement window for both steady and non-steady forks. Namely, given a pair of forks (f, \(\hat {f}\)), we define Munstable (resp., Mstable) as the set of measurements collected in \(\hat {f}\) (resp., f ) after st, where st denotes the steady state starting time of the steady fork f. Figure 3b depicts this process for one inconsistent benchmark.

Besides the aforementioned analyses, we investigate potential countermeasures to mitigate performance deviations of non-steady measurements. In particular, we analyze two specific aspects: number of warmup iterations and number of forks. With our analysis on warmup iterations, we aim to tackle performance deviations of non-steady forks, i.e., forks that never reach a steady state of performance. In particular, we want to assess to what extent an increase in the number of warmup iterations can mitigate performance deviations of non-steady forks. With our analysis on forks, instead, we aim to tackle deviations of non-steady measurements in forks that consistently reach a steady state of performance. That is, we investigate to what extent an increase in the number of forks mitigates performance deviations of measurements gathered during non-steady phases of benchmark execution.

In order to investigate the impact of warmup iterations, we first randomly sample one steady fork f and one non-steady fork \(\hat {f}\) from each inconsistent benchmark. Then, for each pair (f, \(\hat {f}\)), we use a sliding window of 50 consecutive measurements (Munstable) in \(\hat {f}\), and we assess the deviation from the set of steady state measurements collected in f, namely Mstable. In particular, we partition the sequence of measurements gathered from \(\hat {f}\) into 60 segments of equal size (i.e., 50 measurements each), and we compute the relative performance deviation of each segment from Mstable. By doing so, we can assess whether increasing the number of warmup iterations can mitigate performance deviations of non-steady forks. Figure 3c shows the process used to compute the RPD for one specific segment (Munstable) of \(\hat {f}\), i.e., for a specific number of warmup iterations.

In order to investigate the influence of forks, instead, we assess how the performance deviation of non-steady measurements changes when using different numbers of forks. Given the goal of this analysis, we consider only steady benchmarks, i.e., benchmarks that exclusively involve steady forks. For each benchmark, we progressively increment the number of considered forks and, at each increment i, we form a set \({\mathscr{M}}^{unstable}_{i}\) composed of all the non-steady measurements gathered from these forks. Then, we compute, for each set \({\mathscr{M}}^{unstable}_{i}\), the deviation from the entire set of steady measurements gathered from all the forks (\({\mathscr{M}}^{stable}\)), as shown in Fig. 3d. In particular, we start by comparing the set of non-steady measurements collected from the first fork (\({\mathscr{M}}^{unstable}_{1}\)) to the whole set of steady measurements \({\mathscr{M}}^{stable}\). Subsequently, we consider the set of non-steady measurements from the first two forks together (\({\mathscr{M}}^{unstable}_{2}\)) and, again, we compare it to \({\mathscr{M}}^{stable}\). We proceed in this way for each set of non-steady measurements \({\mathscr{M}}^{unstable}_{i}\), with i ranging from 1 to 10. At the end of this process, we obtain one result (i.e., a relative performance deviation) for each pair (b, i), where b denotes a steady benchmark and i denotes the number of considered forks. Through this analysis, we can assess if/how increases in the number of forks mitigate the inherent deviations of non-steady measurements. Note that we perform this analysis in an extreme situation, i.e., by considering exclusively non-steady measurements. We decided to do so because, if we can demonstrate that increases in the number of forks mitigate performance deviations even in such an extreme case, then there are strong indications that they can effectively mitigate the impact of non-steady measurements.

3.3 RQ3—Developer Configuration Assessment

With this research question, we aim to evaluate the effectiveness of developer configurations for steady state performance assessment. To do so, we assess how well software developers capture the steady state starting time (st), i.e., the execution time required to reach a steady state of performance in a fork. Specifically, in each fork, we compare the estimated warmup time (wt) defined by software developers with the steady state starting time st detected by the technique of Barrett et al. (2017). We measure the error of wt using the warmup estimation error (WEE), i.e., how far wt is from the steady state starting time st:

$$ WEE = |wt - st| $$

Additionally, we report the proportion of underestimated and overestimated forks (i.e., wt is respectively smaller or larger than st by at least 5 s), and we investigate potential side effects.
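A minimal sketch of this fork-level assessment, assuming wt and st are expressed in seconds and using the 5-second tolerance mentioned above (names are our own):

```java
enum WarmupEstimate { ACCURATE, UNDERESTIMATED, OVERESTIMATED }

final class WarmupAssessment {
    private static final double TOLERANCE_SECONDS = 5.0;

    /** Warmup estimation error: distance between the estimated and the detected warmup time. */
    static double wee(double wt, double st) {
        return Math.abs(wt - st);
    }

    /** Classifies a fork's warmup estimate using the 5-second tolerance. */
    static WarmupEstimate classify(double wt, double st) {
        if (wt < st - TOLERANCE_SECONDS) return WarmupEstimate.UNDERESTIMATED;
        if (wt > st + TOLERANCE_SECONDS) return WarmupEstimate.OVERESTIMATED;
        return WarmupEstimate.ACCURATE;
    }
}
```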

An overestimated warmup phase wastes execution time, since it leads to a surplus of warmup iterations, which, in turn, delays the beginning of measurement iterations. This inevitably increases benchmark execution time and causes potentially harmful practical implications. For example, the “time effort” of a performance test is often considered a critical factor when selecting the tests to execute before a software release (Traini 2022; Chen and Shang 2017). As such, an overestimated warmup phase can hamper the adoption of microbenchmarks for continuous performance assurance, especially when software releases happen frequently (Rubin and Rinard 2016) and tests are executed as part of a Continuous Integration (CI) pipeline (Fowler 2006). To investigate the side effects of overestimated warmup time, we use time waste, i.e., the overestimation error induced by developer configurations. We quantify time waste as the time difference between wt and st. In other words, time waste represents the execution time that could potentially be saved in a fork.

On the other hand, an underestimated warmup phase may easily mislead steady state performance assessment. Indeed, unstable measurements (i.e., measurements gathered before st) may distort performance results, thereby leading to potentially wrong conclusions (Georges et al. 2007; Kalibera and Jones 2013). We assess the impact of underestimated warmup time using relative performance deviation (RPD), i.e., the magnitude of performance deviation compared to steady state measurements. That is, for each underestimated fork, we compare performance measurements in the steady state (Mstable) with those in the measurement time window defined by software developers (Mconf) using RPD. A high RPD indicates that performance measurements used by software developers strongly deviate from those collected in the steady state.

Besides investigating the effectiveness of developer configurations at the fork level, we also analyze their effectiveness at the benchmark level, i.e., across multiple forks. Through this analysis, we aim to evaluate developer configurations not only in terms of warmup and measurement iterations, but also based on the number of configured forks. To do that, we consider the entire set of measurements gathered through the developer configuration (i.e., across the configured forks), and we compare it to the whole set of steady measurements gathered across all the steady forks of the benchmark. In particular, for each benchmark, we assess the extent to which the set \({\mathscr{M}}^{conf}\) of developer measurements deviates from the set \({\mathscr{M}}^{stable}\) of steady measurements. In addition, we report the time effort of benchmarks based on developer configurations by computing their overall execution time.

3.4 RQ4—Dynamic Reconfiguration Assessment

With this research question, we aim to evaluate the effectiveness of dynamic reconfiguration approaches (Laaber et al. 2020) for steady state performance assessment. These approaches dynamically determine the warmup time during benchmark execution using stability criteria. In our evaluation, we consider three stability criteria proposed by Laaber et al. (2020): (i) Coefficient of variation (CV), (ii) Relative confidence interval width (RCIW), and (iii) Kullback-Leibler divergence (KLD).

In order to assess the effectiveness of each dynamic reconfiguration variant (i.e., stability criteria), we use the same methodology adopted in RQ3. We first report the warmup estimation error (WEE) of forks, along with the proportion of underestimated and overestimated warmup time. Then, we consider the potential side effects of wrong warmup estimation using time waste and relative performance deviation (RPD). Finally, we investigate the effectiveness of dynamic reconfiguration at benchmark level (i.e., across forks) by assessing their RPDs and the overall execution time.

3.5 RQ5—Dynamic vs Developer Configurations

To answer RQ5, we compare dynamic reconfiguration techniques against developer configurations using three evaluation metrics: (i) warmup estimation error, (ii) estimated warmup time, and (iii) relative performance deviation.

Warmup estimation error (WEE) measures the accuracy of wt, i.e., how close wt is to the steady state starting time st. Lower values of WEE indicate better estimates of the warmup time.

Estimated warmup time (wt) measures the time spent to warm up a fork under a specific configuration. Higher values of wt increase the time effort devoted to performance testing and, therefore, can potentially hamper the adoption of benchmarks for continuous performance assessment.

Relative performance deviation (RPD), instead, measures in each fork the magnitude of performance deviation of Mconf compared to steady state performance measurements (Mstable). Higher values of RPD indicate that Mconf strongly deviates from Mstable, thus implying that performance measurements determined by the configuration may potentially mislead steady state performance assessment. As a complementary analysis, we further use RPD to measure performance deviations at benchmark level. That is, we quantify (for each benchmark) the magnitude of deviation of the entire set of measurements gathered through developers/dynamic configurations (\({\mathscr{M}}^{conf}\)) when compared to the whole set of steady measurements collected through the entire benchmark execution (\({\mathscr{M}}^{stable}\)).

The results of WEE, wt, and RPD are compared using the Wilcoxon Rank-Sum test (Cohen 2013), a non-parametric test that makes no assumption about the underlying data distribution, hence raising the bar for significance for both normally and non-normally distributed data. Additionally, a standardized non-parametric effect size measure, namely the Vargha-Delaney \(\hat {A}_{12}\) statistic (Vargha and Delaney 2000), is used to assess the effect size. Given a dynamic reconfiguration technique D, \(\hat {A}_{12}\) measures the probability of D performing better than developer configurations with reference to a specific evaluation metric. \(\hat {A}_{12}\) is computed using (1), where R1 is the rank sum of the first data group being compared, and m and n are the numbers of observations in the first and second data sample, respectively.

$$ \hat{A}_{12} = \frac{\left( \frac{R_{1}}{m} - \frac{m+1}{2}\right)}{n} $$
(1)

We interpret \(\hat {A}_{12}\) using the thresholds provided by Vargha and Delaney (2000). Based on (1), if dynamic configurations and developer configurations are equally good, \(\hat {A}_{12}=0.5\). An \(\hat {A}_{12}\) higher than 0.5 means that dynamic reconfiguration is more likely to produce better results. The effect size is considered small for \(0.56 \leq \hat {A}_{12} < 0.64\), medium for \(0.64 \leq \hat {A}_{12} < 0.71\), and large for \(\hat {A}_{12} \geq 0.71\). On the other hand, an \(\hat {A}_{12}\) smaller than 0.5 means that developer configurations provide better results. In this case, the effect size is considered small for \(0.34 < \hat {A}_{12} \leq 0.44\), medium for \(0.29 < \hat {A}_{12} \leq 0.34\), and large for \(\hat {A}_{12} \leq 0.29\).

In order to avoid misleading interpretations, we apply a transformation to the \(\hat {A}_{12}\) effect size (Neumann et al. 2015), since we consider values of WEE smaller than 5 s and values of RPD smaller than 5% as negligible (Maricq et al. 2018) and, therefore, “equally good”. In particular, we apply Pre-Transforming Data (Neumann et al. 2015) by replacing each value of WEE < 5 s and RPD < 0.05 with zero. When it comes to the estimated warmup time (wt), no transformation is performed on the \(\hat {A}_{12}\) effect size, since we are interested in any improvement (Neumann et al. 2015; Sarro et al. 2016).
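As an illustration of Equation (1) and the pre-transformation step, the following sketch computes \(\hat{A}_{12}\) from rank sums with average ranks for ties; the names and tie-handling details are our own, and for metrics where lower values are better the two groups must be ordered so that a higher \(\hat{A}_{12}\) indeed corresponds to better dynamic reconfiguration results:

```java
import java.util.Arrays;

final class VarghaDelaney {

    /** Replaces negligible values (below the given threshold) with zero before comparison. */
    static double[] preTransform(double[] values, double threshold) {
        double[] out = values.clone();
        for (int i = 0; i < out.length; i++) {
            if (out[i] < threshold) out[i] = 0.0;
        }
        return out;
    }

    /** A12: probability that a value drawn from 'first' is larger than one drawn from 'second'. */
    static double a12(double[] first, double[] second) {
        int m = first.length, n = second.length;
        double[] all = new double[m + n];
        System.arraycopy(first, 0, all, 0, m);
        System.arraycopy(second, 0, all, m, n);
        double[] ranks = midRanks(all);
        double r1 = 0.0;                      // rank sum of the first group
        for (int i = 0; i < m; i++) r1 += ranks[i];
        return (r1 / m - (m + 1) / 2.0) / n;  // Equation (1)
    }

    /** Assigns 1-based ranks, giving tied values the average of their ranks. */
    private static double[] midRanks(double[] values) {
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));
        double[] ranks = new double[values.length];
        int i = 0;
        while (i < order.length) {
            int j = i;
            while (j + 1 < order.length && values[order[j + 1]] == values[order[i]]) j++;
            double avgRank = (i + j) / 2.0 + 1.0; // average of positions i..j, 1-based
            for (int k = i; k <= j; k++) ranks[order[k]] = avgRank;
            i = j + 1;
        }
        return ranks;
    }
}
```

For example, preTransform could be applied to both WEE samples with a threshold of 5 seconds (and to RPD samples with a threshold of 0.05) before computing a12.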

4 Experimental Design

In order to answer our research questions, we first collect performance measurements from the execution of 586 microbenchmarks across 30 systems. Then, we analyze the collected measurements to determine whether and when each fork reaches a steady state of performance. Finally, we perform post-hoc analyses on the collected measurements to assess developer and dynamic configurations.

In this section, we first describe the microbenchmarking setup we use to collect performance measurements and the benchmark subjects. Then, we describe in detail the steady state detection technique used in our empirical study. Finally, we present the process we use to extract both developer and dynamic configurations.

4.1 Microbenchmarking Setup

Following the methodology used in the study of Barrett et al. (2017), we execute each benchmark for a substantially longer time than “usual” JMH configurations (on average 171 times longer than developer configurations). We perform 10 JMH forks for each benchmark (as suggested by Barrett et al. (2017)), where each fork involves an overall execution time of at least 300 s and 3000 benchmark invocations. To do so, we configure the execution of each benchmark via JMH CLI arguments. Specifically, we configure 3000 measurement iterations (-i 3000) and 0 warmup iterations (-wi 0) to collect all the measurements along the fork. Each iteration continuously executes the benchmark method for 100ms (-r 100ms). The number of forks is set to 10 (-f 10). As benchmarking mode, we use sample (-bm sample), which returns nominal execution times for a sample of benchmark invocations within each measurement iteration.
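For illustration, the same configuration can be expressed programmatically through the JMH Runner API; the following sketch is an assumed equivalent of the CLI arguments above (the include pattern is illustrative), not the exact harness used in the study:

```java
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.TimeValue;

public class LongRunHarness {
    public static void main(String[] args) throws RunnerException {
        Options opts = new OptionsBuilder()
                .include(".*IcedHashMapBench.*")              // illustrative benchmark filter
                .warmupIterations(0)                          // -wi 0
                .measurementIterations(3000)                  // -i 3000
                .measurementTime(TimeValue.milliseconds(100)) // -r 100ms
                .forks(10)                                    // -f 10
                .mode(Mode.SampleTime)                        // -bm sample
                .build();
        new Runner(opts).run();
    }
}
```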

The execution environment and external events occurring during the benchmark runs have a remarkable influence on the accuracy of results. This is especially true when executing microbenchmarks, as they tend to measure small portions of code that may last less than a microsecond and are, therefore, more prone to be affected by even small changes in the environment. Hence, we tried to control as many sources of variability as possible in order to obtain more reliable measurements.

We disabled Intel Turbo Boost, i.e., a feature that automatically raises the CPU operating frequency when demanding tasks are running (Suchanek et al. 2017). We also disabled hyper-threading, i.e., a feature in modern processors that executes two threads simultaneously on the same physical core (Suchanek et al. 2017). This is achieved by replicating the architectural state but sharing execution resources such as ALUs and caches. For this reason, hyper-threading may lead to contention patterns that continuously vary during the execution.

Another potential cause of variability among repeated runs is represented by Address Space Layout Randomization (ASLR), which is a security technique to randomly arrange the address space positions at each execution. We disabled ASLR as it may cause variability in the measurements from one fork to another.

The amount of available memory can also affect execution times. We fixed (through the -Xmx flag) the total amount of heap memory available to the JVM to 8GB, because this is the most important factor affecting garbage collection performance. In fact, the throughput of garbage collections is inversely proportional to the amount of memory available, since collections occur when memory fills up (Oaks 2014).

A large variety of operating system events may have a noticeable impact on execution times because they increase context switching in most cases. For this reason, we tried to keep the events that are not related to the benchmarks to a minimum. We disabled any Unix daemon that is not strictly necessary. We also disabled SSH logins for the entire duration of the experiments. To further reduce context switching, we used priority scheduling and increased the niceness of the JMH process running the benchmarks and all its children.

Finally, we ensured that the state of the system was consistent at each run by monitoring the dmesg log and the systemd journal for anomalies, as well as the shell environment of the process for changes in size (Mytkowicz et al. 2009a).

The benchmarks were executed on a bare metal server running Linux Ubuntu 18.04.2 LTS on a dual Intel Xeon E5-2650v3 CPU at 2.30GHz, with a total of 40 cores and 80GiB of RAM.

4.2 Subject Benchmarks

Table 1 reports the list of the 30 Java open source systems we use in this study. We selected these systems because they are relatively popular (i.e., they have more than 100 GitHub stars), have non-trivial JMH suites (i.e., at least 20 benchmarks), and span different domains (e.g., application servers, logging libraries, databases). Given the large size of the benchmark suites, we randomly sample 20 benchmarks for each system. 14 out of the 600 sampled benchmarks failed in our experimental setup.

Table 1 This table reports the name of each subject system, the Github organization, the number of Github stars, and the number of benchmarks in the performance testing suite

Overall, in our empirical study, we assess the behavior of 586 randomly sampled benchmarks across 30 Java systems.

In order to investigate the correctness of the benchmarks selected for this study, we ran the SpotJMHBugs tool by Costa et al. (2021) on the systems. The tool detected only one potential bad practice, of type LOOP, in the netty project. Therefore, we consider our selection of benchmarks suitable for the study.

4.3 Steady State Detection

Detecting the end of the warmup phase, and consequently the start of the steady state, is no trivial task, as the notion of “steady” rests on how much stability of results one wants to achieve during performance testing or, to put it another way, how much variability one is willing to tolerate. Nonetheless, any analyst who wants to study steady state performance must establish the length of the warmup phase. In our study, we are interested in automatically detecting the length of this phase to determine whether and when a benchmark reached a steady state.

For this task, we build on the steady state detection approach proposed by Barrett et al. (2017), which we adapted to the purposes of our study. The approach is fully automated, and it is based on changepoint analysis (Eckley et al. 2011), which is a statistical technique to detect shifts in timeseries data. In our experimental setup, each datapoint of the timeseries represents the average execution time within a JMH iteration, and the whole timeseries represents a fork.

An overview of the approach is shown in Fig. 4. As a first step, we identify and remove potential outliers, since we are only interested in detecting shifts in execution time that stay consistent for a period of time. As done by Barrett et al. (2017), we use the method by Tukey (1977) to identify as outliers the datapoints that lie outside the median ± 3 × (99%ile − 1%ile) in a 200-datapoint window. Out of the \(1.8 \times 10^{7}\) datapoints in the study, 0.27% are classified as outliers, with the highest share in any single fork being 1.3%.
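A sketch of this windowed filter is shown below; it assumes a window centered on each datapoint, since the exact window alignment is not specified here, and the percentile computation is a simple nearest-rank approximation:

```java
import java.util.Arrays;

final class OutlierFilter {
    private static final int WINDOW = 200;

    /** Returns a boolean mask marking the datapoints considered outliers. */
    static boolean[] outliers(double[] series) {
        boolean[] mask = new boolean[series.length];
        for (int i = 0; i < series.length; i++) {
            int from = Math.max(0, i - WINDOW / 2);
            int to = Math.min(series.length, from + WINDOW);
            double[] window = Arrays.copyOfRange(series, from, to);
            Arrays.sort(window);
            double median = percentile(window, 0.50);
            double spread = percentile(window, 0.99) - percentile(window, 0.01);
            // Outliers lie outside median +/- 3 * (99%ile - 1%ile) within the window.
            mask[i] = series[i] < median - 3 * spread || series[i] > median + 3 * spread;
        }
        return mask;
    }

    private static double percentile(double[] sorted, double p) {
        int idx = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(sorted.length - 1, idx))];
    }
}
```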

Fig. 4

Steady state detection process

After filtering outliers, we apply a changepoint algorithm to detect shifts in execution time. Changepoint algorithms are designed to divide the entire timeseries into segments, within which the behavior of the timeseries is considered to remain unchanged. On the basis of the segmentation of the timeseries we can detect if and when a benchmark execution reached a steady state.

The specific changepoint algorithm we used is called PELT (Killick et al. 2012). We applied the algorithm to timeseries data gathered from individual forks in order to detect changes in both the mean and the variance of execution time. An important parameter of the PELT algorithm is the penalty, an argument designed to avoid under/over-fitting and, therefore, directly impacting the number of changepoints the algorithm will detect. The higher the penalty value, the more difficult it is for the algorithm to detect changepoints. Conversely, lower penalty values result in more changepoints. Barrett et al. set this parameter to \(15\log (n)\), where n is the number of datapoints in the timeseries after discarding the outliers. We concluded that a single penalty value for all the timeseries (i.e., all the forks in the experiment) was not suitable for tuning the algorithm to the differences we found among benchmark execution data. As a consequence, we decided to employ a method to derive an appropriate penalty value for each timeseries: we used the CROPS algorithm (Haynes et al. 2014) to efficiently generate, for each timeseries, optimal changepoint segmentations for all penalty values in a continuous range (\([4, 10^{5}]\), in our case). The number of changepoints in the alternative segmentations and the corresponding penalty values can be used to derive an optimization curve (sometimes referred to as an elbow diagram). Figure 5 shows an example of such a curve computed on the timeseries in Fig. 1b. As suggested by Lavielle (2005), this diagram can be visually inspected to find suitably parsimonious penalty values in the area of the elbow. A penalty value in the elbow area can be automatically selected using the Kneedle algorithm (Satopaa et al. 2011), a method to find the point of maximum curvature in the continuous approximation of an optimization curve. A red x in Fig. 5 marks the point in the curve chosen by the Kneedle algorithm in that case. Using this procedure, we were able to automatically derive a different penalty value to guide the segmentation of each fork.

Fig. 5

Example of the selection procedure of a penalty value from an elbow diagram

Once we obtain a segmentation of the timeseries, we can proceed to detect a possible steady state. A steady state should be detected when the execution time is reasonably stable after the end of the warmup phase. It is a matter of interpretation how much the execution time is allowed to vary but, following the approach of Barrett et al., we consider a fork to have reached a steady state if the last 500 measurements are contained in a single segment (i.e., no changepoint was detected within the last 500 datapoints). In general, to find when the steady state was first reached we could simply take the start of the last segment. However, this would not cover practical cases in which the smallest variation in mean or variance generates different segments, even if, from a performance evaluation point of view, the benchmark has completed its warmup phase. Therefore, we need to establish a suitable tolerance that allows the steady state period to span multiple segments whose variation is not meaningful for the execution time of the benchmark. In Barrett et al. the tolerance was provided by combining the maximum number of consecutive segments from the last one, such that a segment si is equivalent to the final segment sf if mean(si) is within mean(sf) ± max(variance(sf), 0.001s). We did not intend to apply a fixed threshold in units of time (like 0.001 s), because our benchmarks vary from tens of nanoseconds to a few seconds, nor did we want to use the variance of the last segment as a threshold, because it leads to an extremely low tolerance for most of the benchmarks. Hence, we preferred to compare the segments by applying a 5% tolerance on the confidence interval for the mean relative performance change, using the approach of Kalibera and Jones (2013) (see Section 3.2.1 for details).

On the basis of this information, we can classify individual forks as steady state if we were able to detect a steady state, or no steady state when the opposite occurred. Moreover, we classify a benchmark as steady state if all its forks reached a steady state, no steady state if all its forks did not reach a steady state, and inconsistent if at least one fork was classified as steady state while at least another one was classified as no steady state. The same information is used to derive the steady state starting time (st), which is the beginning of the first segment si that is considered to be steady. Consequently, for a given fork, the set of measurements in the range between st and the end of the timeseries is the set of the steady state performance measurements (Mstable) of our experiment. Based on that, given a specific benchmark, we can further define the entire set of steady measurements \({\mathscr{M}}^{stable}\) as the union of the steady measurements sets Mstable collected across all the steady forks of the benchmark.
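To summarize these post-processing steps, the following simplified sketch classifies a single fork given its changepoint segments; it applies the 500-iteration rule literally, but approximates the 5% tolerance with a plain relative difference between segment means instead of the bootstrap confidence interval actually used in the study:

```java
import java.util.List;

final class SteadyStateDetector {

    /** One changepoint segment: first iteration index (0-based) and mean execution time. */
    record Segment(int startIteration, double mean) {}

    private static final int LAST_WINDOW = 500;
    private static final double TOLERANCE = 0.05;

    /** Returns the steady state starting iteration st, or -1 if the fork never reaches a steady state. */
    static int steadyStateStart(List<Segment> segments, int totalIterations) {
        Segment last = segments.get(segments.size() - 1);
        // No steady state if a changepoint falls within the last 500 iterations.
        if (last.startIteration() > totalIterations - LAST_WINDOW) return -1;
        // Walk backwards, merging trailing segments whose means stay within the tolerance of the last one.
        int start = last.startIteration();
        for (int i = segments.size() - 2; i >= 0; i--) {
            Segment s = segments.get(i);
            double relDiff = Math.abs(s.mean() - last.mean()) / last.mean();
            if (relDiff > TOLERANCE) break;
            start = s.startIteration();
        }
        return start;
    }
}
```

A benchmark is then classified as steady state, no steady state, or inconsistent depending on whether all, none, or only some of its ten forks yield a non-negative st.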

4.4 Benchmark Configurations

Each benchmark configuration determines for each fork two relevant pieces of information: the estimated warmup time (wt) and a set of performance measurements Mconf.

We use this information to (i) compute warmup estimation error (WEE), (ii) determine whether a configuration underestimates or overestimates the steady state starting time (st), and (iii) assess potential side effects due to wrong estimation (i.e., time waste or relative performance deviation).

In the following, we describe the process we use to derive wt and Mconf for each benchmark fork, for both developer and dynamic configurations.

Algorithm 1

Estimate overall warmup time.

Algorithm 2

Select performance measurements.

4.4.1 Software Developer Configurations

Software developers define JMH configurations in benchmark code through Java annotations. When a benchmark is launched, JMH executes it according to the developer configuration (e.g., number of measurement and warmup iterations). A trivial approach to obtain both wt and the set of performance measurements Mconf for each fork would be to simply run the benchmark and extract this information from the execution logs and JSON result files produced by JMH. Unfortunately, this approach would be extremely expensive in terms of time and would not fit our needs. Instead, we use post-hoc analysis: we first obtain the JMH configurations as defined by developers, then we use this information to compute both wt and Mconf based on the performance measurements collected in our microbenchmarking setup (see Section 4.1).

In order to obtain developer configurations, we leverage a JMH feature that allows configurations to be overridden on the fly via CLI arguments. We exploit this capability to reduce benchmark execution time and speed up the retrieval of developer configurations. We first execute each benchmark twice while reducing execution time through JMH CLI arguments. Then, we retrieve developer configurations from the JSON result files of each individual execution. Specifically, we first obtain the number of warmup and measurement iterations (i.e., wi and i) and the number of forks f by executing each benchmark while setting the measurement and warmup time of each iteration to 1 nanosecond (-w 1ns -r 1ns). Then, we retrieve the measurement and warmup time (i.e., w and r) of each iteration by running each benchmark while setting the number of forks, warmup and measurement iterations to 1 (-f 1 -wi 1 -i 1). At the end of this process, we obtain a tuple (w,wi,r,i,f), where w denotes the time of a warmup iteration, wi the number of warmup iterations, r the time of a measurement iteration, i the number of measurement iterations and f the number of forks. We exploit w, wi, r and i along with the fork measurements M (as collected in our microbenchmarking setup) to compute, for each fork, the estimated warmup time (wt) and the set of performance measurements Mconf. Specifically, we estimate the time spent in each warmup/measurement iteration using the average execution time of each iteration as observed in our microbenchmarking setup; then, we derive wt and Mconf based on the JMH configuration. We report the detailed process to obtain wt and Mconf in Algorithm 1 and Algorithm 2, respectively.
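As a rough illustration only (not a reproduction of Algorithms 1 and 2), the following sketch shows one way the configured warmup budget of wi × w seconds and measurement budget of i × r seconds can be mapped onto the sequence of iterations measured in our setup; all names are our own:

```java
import java.util.ArrayList;
import java.util.List;

final class ConfigMapping {

    /** One measured iteration: its observed duration and the measurements it produced. */
    record Iteration(double durationSeconds, List<Double> measurements) {}

    record Result(double wt, List<Double> mConf) {}

    static Result map(List<Iteration> fork, int wi, double w, int i, double r) {
        double warmupBudget = wi * w;      // total configured warmup time
        double measureBudget = i * r;      // total configured measurement time
        double elapsed = 0.0;
        double wt = 0.0;
        List<Double> mConf = new ArrayList<>();
        for (Iteration it : fork) {
            if (elapsed < warmupBudget) {
                wt = elapsed + it.durationSeconds();   // still within the configured warmup
            } else if (elapsed < warmupBudget + measureBudget) {
                mConf.addAll(it.measurements());       // inside the configured measurement window
            } else {
                break;                                 // configured execution would have ended here
            }
            elapsed += it.durationSeconds();
        }
        return new Result(wt, mConf);
    }
}
```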

Based on the above, we can exploit the number of configured forks f defined by software developers to obtain the whole set of performance measurements gathered from the entire benchmark execution, namely \({\mathscr{M}}^{conf}\). In other words, we derive \({\mathscr{M}}^{conf}\) by joining all the sets of measurements Mconf gathered from the first f forks of the benchmark.

4.4.2 Dynamic Configurations

In order to obtain (w,wi,r,i,f) for each dynamic reconfiguration variant, we leverage the replication package provided by Laaber et al. (2020). Specifically, we use the scripts provided for post-hoc analysis, which take as input JMH result JSON files, and return, for each fork, the number of warmup iterations (wi) according to stability criteria. The warmup iteration time w, the measurement time r, and the number of measurement iterations i are fixed to w = 1s, r = 1s and i = 10, respectively, according to the experimental setup defined in Laaber et al. (2020).

We then obtain wt and Mconf using the same approach adopted for developer configurations.

5 Results

This section presents the results of the experiments and provides answers to the RQs formulated in Section 3.

5.1 RQ1—Steady State Assessment

As described in Section 4.3, we classify the benchmarks in our study on the basis of their ability to eventually reach a steady state. Such classification is first performed at fork level, and then at benchmark level by combining results from the steady state detection on forks.

In order to provide an overview of how many forks reached a steady state, Fig. 6a reports the percentage of forks classified as steady state or not, grouped by system. The percentage of forks that reached a steady state varies between 69% (JCTools) and 98.9% (cantaloupe). Even if there is some variability among systems, 28 of the 30 systems in our study show a percentage of steady state forks above 80%, with 18 of them above 90%. Globally, in most cases (89.1%, last row of Fig. 6a), individual forks were able to reach a steady state according to our detection technique.

Fig. 6 RQ1. Steady state classification

When we examine how the classified forks are distributed among the benchmarks, we get a less obvious picture. In Fig. 6b, we report the percentage of benchmarks classified as steady state or inconsistent in the respective systems (we do not report percentages for the “no steady state” classification, since no benchmark was classified as such). The first clear result is that there is no case in which all the forks of a benchmark failed to reach a steady state, since all benchmarks are distributed between the steady state and inconsistent columns. On the one hand, this might encourage the assumption that, in the vast majority of cases, benchmarks reach and measure steady state performance. On the other hand, the percentage of benchmarks in which all the forks reached a steady state is subject to large variability depending on the specific system. In fact, the percentage of steady state benchmarks varies between 20% (JCTools) and 95% (rdf4j). Only 5 systems exceed 80% and, as opposed to what one would expect, no steady state forks are scattered across many different benchmarks rather than concentrated in a few, causing most systems to have a low percentage of steady state benchmarks even though their percentage of steady state forks is higher. Since there are no benchmarks in which all the forks failed to reach a steady state, all the remaining benchmarks are classified as inconsistent (43.5%), which means that their forks showed mixed behavior.

Another viewpoint on the classification of benchmarks is provided in Fig. 7, where we report the distribution of the percentages of no steady state forks within benchmarks, grouped by system; this further clarifies what contributes to the percentages of inconsistent benchmarks. We can notice that, in most cases (with very few exceptions, such as JCTools), systems tend to exhibit inconsistent benchmarks with small percentages of no steady state forks. It is worth recalling that a single fork (i.e., 10% of a benchmark) is enough to flip the classification from steady state to inconsistent. This is, in fact, the most common case, as we can see in the distribution computed over all the systems (i.e., Total in Fig. 7), which shows a mean around 10%.

Fig. 7 RQ1. Percentages of no steady state forks within each benchmark, grouped by subject system

From a practical perspective, it is also important to estimate how long it takes to reach a steady state, in the cases in which it is reached. This provides a better view of how the time budget could be spent when executing the benchmarks. Figure 8 shows the distributions of the time needed to reach a steady state, grouped by system and in total. We can observe that this time varies considerably, even within a single system, and can therefore hardly be generalized. This result is not surprising, because the attainment of a steady state inherently depends on the nature of the benchmark. As most of the instability during the warmup is due to JIT activity, we can imagine that, besides the size of the benchmark method itself, the number of loaded classes also plays a crucial role, since it induces a different amount of compilation.

Fig. 8 RQ1. Time to reach the steady states, in seconds. The y-axis uses a logarithmic scale

Figure 8 also shows how the time spent to reach a steady state compares to the JMH default of 50 s for the warmup phase (dashed horizontal line in the figure). We can observe that the difference between the detected end of the warmup phase and the JMH default varies largely from one system to the other, with extreme values for eclipse-collections and jgrapht. The clear picture that emerges is that, in most cases, using the JMH defaults would overestimate the time needed to warm the benchmark up, therefore wasting a considerable amount of time dedicated to performance testing. While the amount of time that would be wasted varies considerably from one system to another, the percentage of overestimated forks is quite consistent across all the systems. This leads to the conclusion that, more often than not, the JMH defaults for the warmup time should not be used; rather, one should rely on techniques that assess the actual amount of time a specific benchmark requires to reach a steady state.


5.2 RQ2—Steady State Impact

In this subsection, we present results of our analysis on the impact of steady state on performance.

Figure 9 and Table 2 report results of the analysis that investigates how performance changes (within each fork) when the steady state is reached. In particular, Fig. 9 depicts the distribution of the relative performance deviation (RPD) across all the steady forks of our study. The figure highlights strong performance deviations when the steady state is reached, with an average RPD of 123,937% and a median of 41% (IQR 14–195%).
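For clarity, the sketch below shows one way such a deviation could be computed; we assume here that RPD is the percentage deviation of the mean of non-steady measurements from the mean of steady measurements, which may differ in detail from the definition given in Section 4.

```java
// Hedged sketch: relative performance deviation (in %) between two sets of
// measurements, assumed here to be the absolute relative difference of their means.
final class Rpd {

    static double rpd(double[] nonSteady, double[] steady) {
        return Math.abs(mean(nonSteady) - mean(steady)) / mean(steady) * 100.0;
    }

    private static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }
}
```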

Fig. 9 RQ2. Steady state impact within forks. The box plot reports the distribution of RPD between steady and non-steady phases of the execution across all the steady forks. The y-axis is in logarithmic scale

Table 2 RQ2. Steady state impact within forks grouped by project. The first and second columns report, respectively, the name of the project and number of steady forks per project. The last three columns report RPD statistics (i.e., mean, standard deviation and median)

By looking at the detailed results reported in Table 2, we can observe that the large mean is highly influenced by some specific projects (e.g., camel, crate, h2o-3 and presto), which report extremely high RPD (up to 3.5 billion %). Nonetheless, even when considering projects with smaller RPD (e.g., jgrapht, RoaringBitmap and SquidLib), we can observe considerable performance deviations between steady and non-steady measurements (respectively, 36%, 40% and 57% on average). Besides the diversity across projects, Table 2 also highlights a substantial diversity within each project. Indeed, performance deviations substantially differ across benchmarks of the same project, as projects report extremely high standard deviations (maximum standard deviation of \(\sim \)16 billion (presto), and minimum of 56 (RoaringBitmap)).

The above results suggest that performance substantially changes when forks reach a steady state of performance, and provide empirical evidence on the danger of using non-steady measurements during performance assessment. Indeed, given these large magnitudes, even a tiny portion of non-steady measurements can substantially distort performance indices, with significant implications on performance assessment.

Figure 10 and Table 3 report the results of our second analysis, which investigates performance deviations between steady and non-steady forks. By observing Fig. 10, we can notice that the reported performance deviations are significantly smaller than those within forks (see Fig. 9). The average RPD between steady and non-steady forks is 5%, while the median RPD is 2% (IQR 0–5%). Although these deviations may appear negligible at first glance, they are still significant in the context of Java microbenchmarking. Indeed, in this context, even relatively small performance regressions (e.g., 5%) may lead to rejections of code revisions,Footnote 8 as they can have a significant impact at system level. Moreover, if we look at some specific projects, such as byte-buddy, JCTools and RxJava, the reported deviations are even more conspicuous (average RPDs of 14%, 19% and 13%, respectively, see Table 3). Still, the deviations between steady and non-steady forks are substantially smaller than those within forks. It is worth remarking that we deliberately compare steady and non-steady forks using the same measurement window (i.e., we discard all measurements collected before st, where st denotes the steady state starting time of the paired steady fork). This methodology may discard measurements related to the most unstable phases of the (non-steady) fork execution, and can consequently smooth deviations from steady measurements. Nonetheless, this fact may also suggest that performance deviations tend to decrease over time during benchmark execution, even in forks that never reach a steady state of performance.

Fig. 10 RQ2. Steady state impact across forks. The box plot reports the distribution of RPDs between steady and non-steady forks across all the inconsistent benchmarks

Table 3 RQ2. Steady state impact across forks grouped by project. The first and second columns report, respectively, the name of the project and the number of inconsistent benchmarks per project. The last three columns report RPD statistics for each project (i.e., mean, standard deviation and median)

To further investigate this aspect, we assess how warmup iterations impact performance deviation. In particular, we investigate to what extent warmup iterations mitigate performance deviations (RPD) of non-steady forks. Figure 11 reports the results of our analysis. As can be seen from the figure, RPDs seem to be considerably influenced by the number of warmup iterations, especially at the early stages of benchmark execution. Indeed, we can see a substantial drop in the first 50 iterations, where the mean RPD decreases from 578% to 17% and the median from 30% to 12%. Moreover, we found a clear trend toward RPD reduction in the first 300 iterations. After this point, the decreasing trend disappears, and RPDs begin to fluctuate, with means ranging from a minimum of 6% to a maximum of 9%, and medians ranging from 1% to 3%.

Fig. 11 RQ2. The impact of warmup iterations on RPD. In each plot the x-axis denotes the number of warmup iterations, while the y-axis reports RPD statistics. The dark shadow around the line represents the 95% confidence interval. The plots in the first column report the mean RPD, while those in the second column report the median RPD. Plots in the first row present the overall results, while those in the second row zoom in on the first 500 iterations

The above results suggest that performance deviations of non-steady forks may be substantially mitigated through a reasonable amount of warmup iterations. In particular, a significant reduction of performance deviations can be obtained (i.e., − 561% on average and − 18% on median) by using at least 50 warmup iterations, which correspond to 5 s of continuous benchmark execution and no less than 50 invocations. These deviations can be further reduced by using a higher number of warmup iterations, up to 300 (i.e., 30 s of continuous execution and at least 300 invocations). After this point, increasing the number of warmup iterations barely affects RPDs.

Besides investigating the impact of warmup iterations, we also analyze the impact of forks. Specifically, we study whether using a higher number of forks reduces performance deviations of non-steady measurements. As can be observed in Fig. 12, the analysis shows an overall trend toward RPD reduction when increasing the number of forks. The average RPD is reduced by each additional fork, while the median fluctuates only over the first two forks and then decreases steadily. If we compare RPDs of individual forks to those of 5 forks, we observe a significant RPD reduction, i.e., − 244,237% on the mean and − 21% on the median. This trend leads to an RPD reduction of − 307,555% (mean) and − 31% (median) when using 10 forks. Still, the reported RPDs remain extremely high, i.e., a mean of 35,797% and a median of 40%. The latter results are not surprising if we consider that our analysis deliberately targets non-steady measurements, which are typically subject to severe performance deviations (as we have shown in the first analysis of this RQ, see Fig. 9). In fact, in this analysis we were not particularly interested in the absolute RPD; rather, we wanted to assess how RPDs change with respect to the number of forks. In this regard, our results suggest that increasing the number of forks can effectively mitigate the impact of non-steady measurements.

Fig. 12 RQ2. The impact of forks on RPD. In each plot the x-axis denotes the number of non-steady forks considered in the comparison, while the y-axis reports RPD statistics. The dark shadow around the line represents the 95% confidence interval. The left plot reports the mean RPD, while the right plot reports the median RPD

5.3 RQ3—Developer Configuration Assessment

In this subsection, we first present the results of the assessment of developer static configurations, and then provide an answer to RQ3.

5.3.1 Warmup Estimation Accuracy

In the first part of this subsection, we investigate to what extent developers accurately estimate steady state starting time (st), i.e., the end of the warmup phase.

Figure 13b depicts the distribution of warmup estimation error (WEE) across all benchmark forks. Developer configurations lead to a WEE between 9 and 50 s in half of the cases (i.e., the interquartile range (IQR)), and to a median and mean WEE of 28 and 90 s, respectively. Interestingly, we found that the estimation error is approximately as large as the steady state starting time (or larger) in half of the forks, i.e., the median of the ratios between WEE and st is 0.997 (IQR: 0.79–43.5).

Fig. 13 RQ3. Developer configurations—Warmup estimation accuracy. The left plot reports the percentages of overestimated, underestimated, and correctly estimated forks. The right plot depicts the distribution of WEE across all benchmark forks. n is the total amount of data points, and the number on the top of the plot is the amount of outliers not drawn in the figure

For some systems, assessing the time required to reach a steady state might be easier than for others. With respect to the steady state we detected, most systems (27 out of 30) show an estimation error of less than 100 s. However, the developer configurations in eclipse-collections, presto, and RxJava seem to be less effective in estimating the warmup phase. The most evident case is eclipse-collections, with an average error of around 5 minutes.

Overall, these results suggest that, in most of the cases, software developers fail to accurately estimate the end of the warmup phase, and often with a non-trivial estimation error.

Besides investigating the estimation accuracy of developers, we also check whether they provide more accurate estimates than JMH defaultsFootnote 9 (i.e., the default configuration provided by JMH developers). To do so, we compare the WEE s provided by developer configurations against those provided by JMH defaults using the Wilcoxon Rank-Sum test (Cohen 2013), and the Vargha Delaney’s effect size measure (Vargha and Delaney 2000). We found that developer configurations outperform JMH defaults with statistical significance (p < 0.001) and small effect size (\(\hat {A}_{12}=0.64\)). The median and the mean WEE provided by JMH defaults are larger than those provided by developer configurations, i.e., mean of 97 s and median of 50 s (IQR: 36–50 s). These results suggest that the estimates provided by developers are more accurate than those provided by JMH defaults.
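The comparison methodology used here (and throughout RQ5) can be sketched as follows, assuming Apache Commons Math 3 is available for the rank-sum (Mann-Whitney U) test; the Vargha-Delaney \(\hat {A}_{12}\) statistic is computed with its pairwise definition, and variable names are illustrative.

```java
import org.apache.commons.math3.stat.inference.MannWhitneyUTest;

final class WeeComparison {

    /** Two-sided p-value of the rank-sum (Mann-Whitney U) test between two WEE samples. */
    static double pValue(double[] weeA, double[] weeB) {
        return new MannWhitneyUTest().mannWhitneyUTest(weeA, weeB);
    }

    /** Vargha-Delaney A12: probability that a value drawn from xs exceeds one drawn
     *  from ys (ties count 0.5). Values above 0.5 mean xs tends to be larger than ys. */
    static double a12(double[] xs, double[] ys) {
        double wins = 0.0;
        for (double x : xs) {
            for (double y : ys) {
                if (x > y) {
                    wins += 1.0;
                } else if (x == y) {
                    wins += 0.5;
                }
            }
        }
        return wins / (xs.length * (double) ys.length);
    }
}
```

For instance, a12(weeDefaults, weeDevelopers) > 0.5 would indicate that the JMH defaults tend to produce larger estimation errors than developer configurations.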

Figure 13a reports the percentages of overestimated, underestimated, and correctly estimated forks across all forks (i.e., the estimated warmup time wt is, respectively, larger or smaller than st by at least 5 s). The bar chart shows that overestimation is more common than underestimation. Developers overestimate the end of the warmup phase in 48% of the forks (median WEE: 33 s, IQR: 19–50 s), whereas underestimation occurs in 32% of the cases (median WEE: 150 s, IQR: 36–240 s). Developers accurately estimate the end of the warmup phase in only 19% of the cases.
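A sketch of this per-fork classification, using the 5-second tolerance mentioned above (method and type names are ours):

```java
enum WarmupEstimate { OVERESTIMATED, UNDERESTIMATED, CORRECT }

final class WarmupAccuracy {

    /** Warmup estimation error: absolute difference between estimated and detected warmup end. */
    static double wee(double wt, double st) {
        return Math.abs(wt - st);
    }

    /** Classify a fork given wt and st (both in seconds), with a 5 s tolerance. */
    static WarmupEstimate classify(double wt, double st) {
        if (wt > st + 5.0) {
            return WarmupEstimate.OVERESTIMATED;  // warmup configured longer than needed
        }
        if (wt < st - 5.0) {
            return WarmupEstimate.UNDERESTIMATED; // measurements start before the steady state
        }
        return WarmupEstimate.CORRECT;
    }
}
```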

In the following subsections, we investigate side effects in both overestimated and underestimated forks.

5.3.2 Overestimation Side Effects

Figure 14 shows the time waste due to overestimation. 87% of the overestimated forks waste more than 10 s (i.e., 37% of all the forks). As can be observed from Fig. 14a, overestimation leads to a time waste between 10 and 25 s in 30% of the cases, between 25 and 50 s in 38% of the cases, and to a time waste higher than 50 s in 19% of the cases. Figure 14b depicts the distribution of time waste across overestimated forks. The box plot shows that the average time waste is 37 s, and the median is 33 s (IQR: 19–49 s).

Fig. 14 RQ3. Developer configurations—Overestimation side effects (Time waste). The left plot reports the percentages of overestimated forks where Time waste(sec) ∈{[10,25),[25,50),[50,100),[100,inf)}. The right plot depicts the distribution of Time waste across all overestimated forks. n is the total amount of data points, and the number on the top of the plot is the amount of outliers not drawn in the figure

The amount of time wasted on warmup, after reaching a steady state, varies considerably from one system to another. In most systems, using the developer configurations, each fork wastes, on average, more than 20 s. More extreme behavior can be found, for instance, in jdbi, netty, and rdf4j, with an average wasted time of more than 50 s per fork. On the contrary, cantaloupe, JCTools, r2dbc-h2, and zipkin might waste just a few seconds, and their configurations probably do not need any adjustment.

Such absolute time wastes, when contextualized within concrete performance assurance processes (that involve multiple benchmarks), can have substantial effects on the overall execution time and, as a consequence, can hamper microbenchmark adoption for continuous performance assessment. For example, a time waste of 33 s in a relatively small performance testing suite (e.g., r2dbc-h2), which involves a typical number of 5 forks,Footnote 10 could lead to an overall time waste of approximately one hour.Footnote 11 In larger testing suites, such as RxJava, the same time waste could lead to an overall waste of about two and a half days. Besides this, wt mostly consists of time waste in a large number of cases. In fact, we measured that the median of the ratios between the time waste and the estimated warmup time wt is approximately 0.97 (IQR: 0.77–0.99), i.e., in half of the overestimated forks, at least 97% of the estimated warmup time wt consists of time waste.

The reported results highlight a substantial portion of time wasted during microbenchmarking, and stress the need for better microbenchmark configuration approaches that reduce their execution time. These findings further motivate prior efforts in reducing execution time through dynamic reconfiguration (Laaber et al. 2020), and highlight huge opportunities for execution time reduction in microbenchmarks.

5.3.3 Underestimation Side Effects

Although less frequent than overestimation, underestimation can have relevant side effects on microbenchmarking. Indeed, it can lead developers to consider performance measurements that significantly differ from steady state performance, as in practice they fall within the warmup phase.

Figure 15 reports the distribution of the relative performance deviation (RPD) across underestimated forks. Underestimation leads to an RPD of at least 5% in 57% of the cases (i.e., 22% of all the forks). Figure 15a shows that underestimation induces an RPD between 5% and 10% in 15% of the cases, between 10% and 25% in 20% of the cases, and an RPD greater than 50% in 9% of the cases. The box plot (Fig. 15b) shows a mean RPD of 17%, and a median of 7% (IQR: 1–21%).

Fig. 15 RQ3. Developer configurations—Underestimation side effects (performance change). The left plot reports the percentages of underestimated forks where performance change(%) ∈{[5,10),[10,25),[25,50),[50,inf)}. The right plot depicts the distribution of performance change across all underestimated forks. n is the total amount of data points, and the number on the top of the plot is the amount of outliers not drawn in the figure

Some systems show a large performance deviation when underestimating the warmup time. In nine systems, developers will observe measurements that deviate by at least 20% from the performance reached in the steady state. Such large deviations prevent the benchmarks from spotting smaller performance changes, effectively defeating their purpose in practical performance testing scenarios.

These results highlight significant performance deviations due to underestimation. In Java systems, even relatively small performance regressions (e.g., 5%) may lead to rejections of code revisions. In fact, a microbenchmark regression, for example due to software refactoring (Traini et al. 2021), can have a huge impact at system level, as microbenchmarks measure performance at a fine-grained level (Laaber and Leitner 2018; Leitner and Bezemer 2017). For this reason, the reported RPDs can have severe consequences on steady state performance assessment, as they can easily lead to faulty judgments of code revisions. According to our results, developers may rely on measurements that significantly differ from those collected during steady state execution. This finding sheds light on the perils of underestimation, and on how such inaccuracy can disrupt performance assessment. Indeed, the reported results show that underestimation is not rare in the current developer practice (32% of forks), and it often leads to potentially misleading results (57% of underestimated forks lead to a performance deviation ≥ 5%).

5.3.4 Benchmark Level Assessment

Here, we discuss the results of the analysis of developer configurations at benchmark level. The left plot of Fig. 16 reports the distribution of performance deviations of measurements gathered by software developers when compared to steady state measurements. The first aspect we can notice is that the impact of underestimation seems to be considerably mitigated when considering aggregated measurements. This finding is in line with our previous analysis on forks (see RQ2), in which we have shown that the number of forks can substantially mitigate deviations of non-steady measurements. Performance deviations are not statistically significant in about half of the benchmarks (i.e., 53% of the cases): the median RPD is 0% and the IQR is 0–6%. Nonetheless, 47% of benchmarks report statistically significant performance deviations (i.e., the confidence interval does not contain zero), the mean RPD is 8%, and a quarter of the benchmarks report a deviation higher than 6%. As already discussed in our prior analysis on underestimation, these magnitudes of deviation can be harmful in Java microbenchmarking, as they can mislead performance assessment and lead to wrong judgements of software revisions.

Fig. 16 RQ3. Developer configurations: Benchmark level assessment. The left plot depicts the RPD distribution across all benchmarks. The right plot depicts the execution time distribution

The right plot of Fig. 16 reports the distribution of benchmark execution times based on developer configurations. Developer configurations lead to extremely different execution times, with durations ranging from a minimum of 1 s to a maximum of 893 s. The average execution time is 210 s, and the median is 100 s (IQR 30–403 s). Our previous analysis on overestimation has already highlighted large opportunities for execution time reduction. In light of the above results, these opportunities appear even more significant. For example, our prior analysis reported a median time waste of 33 s in overestimated forks. If we compare this result with the median execution time of a benchmark, i.e., 100 s, the time waste appears extremely relevant, i.e., approximately one third of the entire benchmark execution. It is worth remarking that time wastes are measured on individual forks, while the execution times (reported in Fig. 16) measure the entire duration of a benchmark, which typically involves multiple forks (3 on average in the case of developer configurations). This finding further underlines the need for better techniques to reduce benchmark execution time without affecting result quality.


5.4 RQ4—Dynamic Reconfiguration Assessment

In this subsection, we first present the results of the assessment of dynamic reconfiguration techniques, and then answer RQ4.

5.4.1 Warmup Estimation Accuracy

Figure 17b depicts the distributions of warmup estimation errors (WEE) per dynamic reconfiguration technique across all benchmark forks. As can be seen from the figure, RCIW leads to a higher WEE (median: 48 s, IQR: 30–50 s) when compared to other techniques, whereas CV and KLD report similar distributions. CV leads to a median WEE of 19 s (IQR: 10–41 s), while KLD reports a median of 17 s (IQR: 9–47 s). Also, we measured that, in half of the forks, all dynamic techniques report a WEE approximately as large as the steady state starting time st (or larger), i.e., the medians of the ratios between WEE and st for CV, RCIW and KLD are respectively 0.97 (IQR: 0.79–45.7), 2.04 (IQR: 0.78–75) and 0.97 (IQR: 0.79–36.1). These results indicate that all dynamic reconfiguration techniques lead to a substantial error in the estimate of the steady state starting time. Nevertheless, CV and KLD clearly provide more accurate estimates than RCIW.

Fig. 17 RQ4. Dynamic reconfiguration—Warmup estimation accuracy. The left plot reports the percentages of overestimated, underestimated, and correctly estimated forks per dynamic reconfiguration technique. The right plot depicts the distribution of WEE across all benchmark forks per dynamic reconfiguration technique. n is the total amount of data points, and the number on the top of the plot is the amount of outliers not drawn in the figure

The bar chart in Fig. 17a confirms the specificity of RCIW. As can be observed from the bar chart, RCIW reports a strong tendency toward overestimation, while CV and KLD show similar frequencies in terms of both underestimation and overestimation. RCIW overestimates 72% of forks (median WEE: 46 s, IQR: 31–50 s), and it underestimates only 25% of forks (median WEE: 181 s, IQR: 86–236 s). On the other hand, CV reports 50% of overestimations (median WEE: 17 s, IQR: 12–24 s) and 39% of underestimations (median WEE: 131 s, IQR: 29–249 s). Similarly, KLD reports 55% overestimated forks (median WEE: 13 s, IQR: 9–21 s) and 37% underestimated forks (median WEE: 145 s, IQR: 29–255 s).

5.4.2 Overestimation Side Effects

Besides the large number of overestimated forks, RCIW also causes higher time wastes when compared to other techniques. As can be observed in Fig. 18, RCIW reports a mean time waste of 45 s and a median of 46 s (IQR: 31–55 s). 95% of overestimated forks lead to a time waste of at least 10 s, 81% lead to a time waste of at least 25 s, and 13% to a time waste of at least 50 s. On the other hand, CV and KLD report a median time waste of 17 s (IQR: 12–24 s) and 14 s (IQR: 9–21 s), respectively (see Fig. 18b). Nonetheless, we measured that, in half of the overestimated forks, all techniques lead to an estimated warmup time that mostly consists of time waste, i.e., the medians of the ratios between the time waste and wt for CV, RCIW, and KLD are respectively 0.98 (IQR: 0.88–0.99), 0.96 (IQR: 0.7–0.99), and 0.97 (IQR: 0.88–0.99). Overall, these results indicate that different techniques lead to diverse outcomes in terms of overestimation. For example, RCIW induces more frequent overestimations and higher time wastes when compared to other techniques (see Fig. 18a). Despite this diversity, overestimation is frequent across all dynamic reconfiguration techniques, and it often leads to a non-trivial time waste, which can hamper continuous performance assessment.

Fig. 18 RQ4. Dynamic reconfiguration—Overestimation side effects (time waste). The left plot reports the percentages of overestimated forks where time waste(sec) ∈{[10,25),[25,50),[50,100),[100,inf)}. The right plot depicts the distribution of time waste across all overestimated forks per dynamic reconfiguration technique. n is the total amount of data points, and the number on the top of the plot is the amount of outliers not drawn in the figure

5.4.3 Underestimation Side Effects

RCIW is less prone to underestimation than other techniques, and it also has less marked side effects due to underestimation. As can be observed in Fig. 19, CV and KLD lead to higher performance changes when compared to RCIW. RCIW reports a median RPD of 3% (IQR: 0–11%), whereas CV and KLD report medians of 10% (IQR: 5–20%) and 9% (IQR: 3–17%), respectively (see Fig. 19b). Additionally, CV and KLD cause performance changes of at least 5% in 76% and 69% of the cases, respectively, while RCIW achieves a similar deviation in only 43% of forks (see Fig. 19a). These results suggest that underestimation side effects vary depending on the dynamic reconfiguration technique. Some dynamic reconfiguration techniques (i.e., CV and KLD) are more prone to induce underestimation, and they often lead to a non-trivial performance deviation which can potentially mislead steady state performance assessment. Other techniques, such as RCIW, are instead “safer” in terms of underestimation, and ensure higher results quality in terms of performance assessment.

Fig. 19 RQ4. Dynamic reconfiguration—Underestimation side effects (performance change). The left plot reports the percentages of underestimated forks where performance change(%) ∈{[5,10),[10,25),[25,50),[50,inf)}. The right plot depicts the distribution of performance change across all underestimated forks per dynamic reconfiguration technique. n is the total amount of data points, and the number on the top of the plot is the amount of outliers not drawn in the figure

5.4.4 Benchmark Level Assessment

The analysis at benchmark level confirms the trend observed in our prior analysis. Figure 20 reports the deviations (RPD) of measurements gathered through dynamic reconfiguration techniques when compared to steady measurements (left plot), along with the distributions of benchmark execution times (right plot). As can be observed, RCIW is by far the most reliable technique in terms of performance deviation. It reports an average RPD of 2%, a median of 0% and an IQR of 0–0%. About 91% of the benchmarks report an RPD smaller than 5%. On the other hand, RCIW also provides the most time-consuming process, with a mean execution time of 283 s and a median of 300 s (IQR 265–300 s). This is not surprising, as our prior analysis has highlighted a large overestimation rate (72%) and relevant time wastes (45 s on average) for RCIW.

Fig. 20 RQ4. Dynamic reconfiguration: Benchmark level assessment. The left plot depicts the RPD distribution across all benchmarks. The right plot depicts the execution time distribution

Conversely, CV and KLD are less reliable in terms of results quality, reporting a mean RPD of 8% (median: 1%, IQR: 0–10%) and 7% (median: 0.5%, IQR: 0–8%), respectively, but they are also less demanding in terms of execution time. CV reports a mean execution time of 93 s (median: 75 s, IQR: 51–113 s), while KLD reports a mean of 154 s (median: 121 s, IQR: 94–156 s). CV and KLD show similar behaviors in terms of execution times and performance deviations; however, as emerges from the box plots in Fig. 20, they show two opposite tendencies. In particular, CV shows a slight tendency towards faster execution times and less reliable results, while KLD shows the opposite behavior, i.e., better results quality and more time-consuming executions.


5.5 RQ5—Dynamic Reconfiguration vs Developer Configuration

In this subsection, we first present the results of the comparison between dynamic reconfiguration techniques and developer static configurations for each considered metric: (i) warmup estimation error (WEE), (ii) estimated warmup time (wt), and (iii) relative performance deviation (RPD). Then, we answer RQ5.

5.5.1 Warmup Estimation Error

We performed the Wilcoxon test to check the significance of the difference between the WEE of dynamic reconfiguration techniques and that of developer static configurations. The detailed results of the comparison within and across systems are reported in Table 4. As can be observed from the last row of the table, the differences are statistically significant for all dynamic reconfiguration techniques (p ≤ 0.05), with two techniques reporting extremely small p-values (p < 0.001). The comparison leads to a medium effect size for RCIW (\(\hat {A}_{12}\) ≥ 0.64), and to a large effect size for CV and KLD (\(\hat {A}_{12}\) ≥ 0.71). These results indicate that, when compared to developer static configurations, dynamic reconfiguration techniques overall provide more accurate estimates of the end of the warmup phase.

Table 4 RQ5. WEE comparison. Results of the Wilcoxon test (with \(\hat {A}_{12}\) effect sizes in brackets) that compare WEE of dynamic reconfiguration techniques to the ones obtained using developer configurations. Systems where dynamic configurations perform better than developer configurations (p ≤ 0.05 and \(\hat {A}_{12}\) > 0.5) are highlighted in bold. Asterisks denote the interpretation of the \(\hat {A}_{12}\) effect size: small (∗), medium (∗∗), large (∗∗∗)

Nevertheless, Table 4 also shows that the difference between the WEE of dynamic reconfiguration techniques and that of developer configurations varies across systems. Figure 21 summarizes these results: each cell reports the number of projects whose comparison leads to a statistically significant difference (p ≤ 0.05) within a specific \(\hat {A}_{12}\) effect size range.

As can be observed from the figure, CV provides better warmup estimates than developers in 15 of the 30 systems (13 with large effect sizes, and 2 with medium effect sizes). Conversely, developer configurations provide lower estimation errors than CV in only 2 systems (with negligible and small effect sizes, respectively).

Fig. 21 RQ5. WEE comparison summary. Each cell reports the number of systems whose comparison leads to a statistically significant change (p ≤ 0.05) within a specific \(\hat {A}_{12}\) effect size range: negligible (N), small (S), medium (M) and large (L). \(\hat {A}_{12}\) > 0.5 indicates that dynamic configurations perform better than developer configurations

KLD shows a similar trend to that observed for CV. KLD provides better warmup estimates than developers in 13 out of the 30 systems (11 with large effect sizes, 1 with medium effect size, and 1 with negligible effect size), whereas developers outperform KLD in only 2 systems (with small and medium effect sizes, respectively).

RCIW also shows improvement over static configurations in a considerable number of systems (10 out of 30), though with lower effect sizes (4 large, 2 medium, and 4 small). Again, developer configurations provide better estimations in only 2 systems.Footnote 12

Overall, we can observe that dynamic reconfiguration techniques provide more accurate warmup estimates than those of software developers. In particular, CV and KLD outperform developer configurations in terms of WEE on a considerable number of projects with high effect sizes.

5.5.2 Estimated Warmup Time

In this subsection, we investigate the difference between the estimated warmup time (wt) provided by dynamic reconfiguration techniques and developer static configurations. Table 5 reports results of Wilcoxon tests for each system, and across all systems.

Table 5 RQ5. Estimated warmup time (wt) comparison. Results of the Wilcoxon test (with \(\hat {A}_{12}\) effect sizes in brackets) that compare the wt of dynamic reconfiguration techniques to the ones obtained using developer configurations. Systems where dynamic configurations report shorter wt than developer configurations (p ≤ 0.05 and \(\hat {A}_{12}\) > 0.5) are highlighted in bold. Asterisks denote the interpretation of the \(\hat {A}_{12}\) effect size: small (∗), medium (∗∗), large (∗∗∗)

As can be observed from the last row of the table, CV and KLD report a statistically significant difference (p < 0.001) with a tendency toward improvement (i.e., smaller wt values), but with negligible effect size. RCIW also reports a statistically significant difference (p < 0.001), but with the opposite tendency, i.e., larger wt (\(\hat {A}_{12}\) < 0.5), and a small effect size.

If we look at project-level results, we can observe a remarkable diversity among projects for CV and KLD. As can be seen in Fig. 22, CV leads to higher wt than those defined by developers (\(\hat {A}_{12}\) < 0.5) in 10 of the 30 systems, and it reports lower wt (\(\hat {A}_{12}\) > 0.5) in 17 systems, thus not showing a clear trend. A similar behavior can be observed for KLD: 10 of the 28 systems that report a statistically significant difference (p ≤ 0.05) have \(\hat {A}_{12}<0.5\), while 17 of them have \(\hat {A}_{12}>0.5\).

Fig. 22 RQ5. Summary wt comparison. Each cell reports the number of systems whose comparison leads to a statistically significant change (p ≤ 0.05) within a specific \(\hat {A}_{12}\) effect size range: negligible (N), small (S), medium (M) and large (L). \(\hat {A}_{12}\) > 0.5 indicates that dynamic configurations lead to shorter wt than developer configurations

On the other hand, RCIW performs worse than developer configurations in most of the systems. RCIW leads to higher wt than developers in 22 of the 30 systems: 12 with large, 4 with medium, 4 with small, and 2 with negligible effect sizes. It leads to shorter wt in only 5 projects (1 with negligible, 3 with small, and 1 with medium effect size).

Furthermore, if we compare overestimation frequency/side effect of RCIW (see Figs. 17a and 18) and developer configurations (see Figs. 13 and 14), we can observe that RCIW reports more frequent overestimations, namely 72% vs 48%, and higher time wastes, namely median of 46 s (IQR: 31–55 s) vs median of 33 s (IQR: 19–49 s).

Overall, our results indicate that CV and KLD can lead to different behaviors based on the context, that is, they can provide either higher or lower wt than developers depending on the system. On the other hand, RCIW induces higher estimates of the warmup time wt when compared to developer configurations, thus increasing microbenchmark execution time.

5.5.3 Relative Performance Deviation

In this subsection, we assess the difference between the RPD of dynamic reconfiguration techniques and that of developer static configurations. Overall, we can observe that dynamic reconfiguration techniques slightly improve on developer configurations, as shown in the last row of Table 6. The comparisons report statistically significant differences for all techniques (p < 0.001), with small (CV and RCIW) and negligible (KLD) effect sizes, respectively.

Table 6 RQ5. RPD fork level comparison. Results of the Wilcoxon test (with \(\hat {A}_{12}\) effect sizes in brackets) that compare the RPD of dynamic reconfiguration techniques to the ones obtained using developer configurations. Projects where dynamic configurations perform better than developer configurations (p ≤ 0.05 and \(\hat {A}_{12}\) > 0.5) are highlighted in bold. Asterisks denote the interpretation of the \(\hat {A}_{12}\) effect size: small (∗), medium (∗∗), large (∗∗∗)

If we look at project-level results (see Table 6 and Fig. 23), we can observe that RCIW leads to statistically significant improvements over developer configurations in a large number of projects. In particular, RCIW improves on developer configurations in 18 of the 30 systems (5 with large, 3 with medium, 5 with small, and 5 with negligible effect sizes), while it degrades RPD in only 2 systems (both with negligible effect sizes). By comparing the RPD distributions of RCIW (Fig. 19) and developer configurations (Fig. 15), we can observe that the former produces lower deviations with respect to steady state measurements. For example, using developer configurations, about 57% of the forks lead to an RPD of at least 5%, while the same performance deviation is reached in only 43% of forks when using RCIW. Moreover, RCIW provides a median RPD of 3% (IQR: 0–11%), whereas developer configurations lead to a deviation of 7% (IQR: 1–21%).

Fig. 23 RQ5. Summary RPD fork level comparison. Each cell reports the number of systems whose comparison leads to a statistically significant change (p ≤ 0.05) within a specific \(\hat {A}_{12}\) effect size range: negligible (N), small (S), medium (M) and large (L). \(\hat {A}_{12}\) > 0.5 indicates that dynamic configurations perform better than developer configurations

These results demonstrate that performance measurements gathered through RCIW deviate less from steady state measurements than those collected through developer static configurations, thereby ensuring better results quality.

CV and KLD also report statistically significant improvements over static configurations in 16 and 17 systems, respectively. Nonetheless, developer configurations perform better than CV and KLD in 10 and 12 systems, respectively. Despite this, by looking at Fig. 23, we can observe that the systems where developer configurations perform better than CV and KLD tend to have lower effect sizes than those where they provide improvement. For example, if we exclude negligible effect sizes, we can observe that CV still leads to improvement (\(\hat {A}_{12}\) ≥ 0.56) in 7 systems (5 large and 2 medium), while it performs worse than developer configurations (\(\hat {A}_{12}\) ≤ 0.44) in only one case, with a small effect size. Similarly, KLD leads to an effect size \(\hat {A}_{12}\) ≥ 0.56 in 10 systems (5 large, 2 medium, 3 small), while it reports a small effect size \(\hat {A}_{12}\) ≤ 0.44 in 5 systems.

These results suggest that the comparison between the RPD of CV/KLD and that of developer configurations leads to different outcomes depending on the system, though with a slight overall tendency towards improvement. The results across benchmarks of all systems confirm this tendency. As shown in the last row of Table 6, CV and KLD report statistically significant improvements (p < 0.001), with small and negligible effect sizes, respectively. Given these results, we can safely state that CV and KLD only provide marginal improvements over developer static configurations in terms of performance deviation.

5.5.4 Benchmark Level Assessment

Interestingly, when we look at benchmark level results, we can observe significant differences (see Fig. 24 and Table 7). CV notably shifts from a tendency towards improvement to a tendency towards regression (p < 0.001 and \(\hat {A}_{12}=0.45\)). KLD, which reported statistically significant differences and a tendency towards improvement at fork level, reports neither improvement nor regression (p > 0.05) at benchmark level. By analyzing project level results, we can further appreciate this shift. At fork level, CV reports worse RPDs than developers in only one project (with non-negligible effect size). At benchmark level, instead, if we exclude negligible effect sizes, it reports worse performance deviations than developer configurations in 11 projects (3 with small effect sizes, 2 with medium, and 6 with large). Likewise, KLD reports worse RPDs in 9 projects at benchmark level (2 small, 2 medium and 5 large effect sizes), while, at fork level, it reports worse RPDs in only 5 projects with small effect sizes. The only technique that seems to provide improvement over developer configurations both at benchmark and fork level is RCIW, which reports a statistically significant improvement (p < 0.001) and a small effect size (\(\hat {A}_{12}=0.64\)). RCIW outperforms developer configurations in 17 projects (10 with large effect sizes, 5 with medium and 2 with small), and the effect sizes provided at benchmark level are even better than those provided at fork level (i.e., higher \(\hat {A}_{12}\)) in 15 out of the 30 projects. These results suggest that the capability of RCIW to dynamically stop forks at run-time may further reduce performance deviations when compared to those of developer configurations. To further investigate this aspect, we analyzed the results of these 15 projects, and we found that, in 80% of the benchmarks, RCIW involves 5 fork executions, i.e., the maximum number of forks for dynamic reconfiguration techniques (based on the original parameterization provided in Laaber et al. (2020)). That is, forks are not halted by stability criteria; rather, they are stopped because the technique has reached the maximum number of allowed forks. This is effectively equivalent to statically fixing the number of forks to 5.

Fig. 24 RQ5. Summary RPD benchmark level comparison. Each cell reports the number of systems whose comparison leads to a statistically significant change (p ≤ 0.05) within a specific \(\hat {A}_{12}\) effect size range: negligible (N), small (S), medium (M) and large (L). \(\hat {A}_{12}\) > 0.5 indicates that dynamic configurations perform better than developer configurations

Table 7 RQ5. RPD benchmark level comparison. Results of the Wilcoxon test (with \(\hat {A}_{12}\) effect sizes in brackets) that compare RPD of dynamic reconfiguration techniques to the ones obtained using developer configurations. Systems where dynamic configurations perform better than developer configurations (p ≤ 0.05 and \(\hat {A}_{12}\) > 0.5) are highlighted in bold. Asterisks denote the interpretation of the \(\hat {A}_{12}\) effect size: small (∗), medium (∗∗), large (∗∗∗)

Our analysis at fork level has shown that dynamic reconfiguration outperforms developer configurations in terms of performance deviation due to its capability to dynamically stop warmup iterations. The same cannot be said when stability criteria are applied to halt forks. Indeed, the analysis at benchmark level has shown that this capability of dynamic reconfiguration has only neutral or negative effects on performance deviations, and it is hard to perceive any improvement brought by this specific feature.


6 Discussion

Overall, we can observe that Java microbenchmarking is still subject to some flaws. The results for RQ1 provide evidence that benchmarks do not always reach a steady state of performance. About 11% of benchmark forks never reach a steady state, and 43% of benchmark executions involve at least one fork that does not reach the steady state. These results are consistent with the seminal study on VM microbenchmarking of Barrett et al. (2017), thus showing, on a larger corpus of benchmarks and in the more defined scope of “testing-oriented” Java microbenchmarks, that the “two-phase assumption” does not always hold. With this finding, we aim to raise awareness among developers (and researchers) that deal with Java microbenchmarking. An important lesson here is that some benchmark forks (mean \(\sim \)10%) may not be representative of “actual” steady state performance, since their performance may continuously fluctuate over time, with a non-negligible deviation from steady state performance. Unfortunately, the only way to avoid this issue is to execute each fork for a large number of iterations, and then run the Barrett et al. (2017) technique to determine whether the steady state of performance is reached or not. While this methodology may be appropriate in a research context (like ours), it may be impractical in real-world performance assurance processes, where benchmarks are repeatedly executed against software evolution, and time/resources are subject to constraints (Traini 2022). Nonetheless, there are certain measures that can be put in place to (partially) mitigate this problem. For example, the analyses performed for RQ2 showed that performance deviations of non-steady forks can be significantly reduced by using a minimum of 50 warmup iterations which, based on our microbenchmarking setup, correspond to 5 s of continuous benchmark execution and no less than 50 invocations. The performance deviations can be further mitigated by increasing the number of warmup iterations up to 300 (i.e., 30 s of continuous benchmark execution and no less than 300 invocations). It is worth noting that these values differ considerably from the JMH defaults, which prescribe 50 s of continuous benchmark execution and no less than 5 invocations for warmup. On the basis of our results, our practical suggestion is to never execute a benchmark for less than 5 s (and less than 50 invocations) before starting to collect measurements. When time is not a major concern, warmup should last for at least 30 s of continuous benchmark execution, and no less than 300 invocations. A configuration following this suggestion is sketched below.
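As a purely illustrative example, a JMH configuration following this suggestion might look as follows; the 100 ms iteration length mirrors the one we infer from our setup (50 iterations ≈ 5 s), the measurement settings are arbitrary, and the 5 forks reflect the JMH default discussed later in this section.

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@BenchmarkMode(Mode.AverageTime)
// At least 5 s of warmup spread over 50 iterations of 100 ms; since every JMH
// iteration invokes the benchmark at least once, this also guarantees >= 50 invocations.
// Use iterations = 300 (i.e., 30 s) when execution time is not a major concern.
@Warmup(iterations = 50, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 100, time = 100, timeUnit = TimeUnit.MILLISECONDS) // illustrative
@Fork(5) // at least 5 forks to mitigate the impact of non-steady measurements
@State(Scope.Benchmark)
public class SuggestedWarmupBenchmark {

    @Benchmark
    public double compute() {
        return Math.sqrt(42.0); // placeholder workload
    }
}
```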

Microbenchmarking is far from trivial even when benchmarks consistently reach a steady state of performance. The results of RQ2 shed light on the potential pitfalls of using non-steady measurements. Performance measurements gathered in non-steady phases of benchmark execution substantially deviate from those collected during steady phases (\(\sim \)124k% on average). Hence, relying on them can significantly mislead performance assessment. To deal with this problem, the current practice mostly relies on developers’ guesses to estimate the end of the warmup phase and discard measurements subject to performance fluctuations. Based on the results for RQ3, this approach seems to drastically mitigate these large deviations, i.e., developer configurations lead to an average deviation from steady measurements of 8%. Nonetheless, warmup estimation remains challenging and subject to (large) errors. The results for RQ3 show that developer static configurations fail to accurately estimate the end of the warmup phase, often with a non-trivial estimation error (median: 28 s). Developers tend to overestimate warmup time more frequently than they underestimate it (48% vs 32%). Nonetheless, both kinds of estimation errors produce relevant (though diverse) side effects. For example, we showed that overestimation produces severe time wastes (median: 33 s), thereby hampering the adoption of benchmarks for continuous performance assessment. On the other hand, underestimation often leads to performance measurements that significantly deviate from those collected in the steady state (median: 7%), thus leading to poor results quality and potentially wrong judgements. The latter side effect can be partially mitigated by running an adequate number of forks (e.g., 5). Indeed, as we have shown in RQ2, forks play a significant role in reducing performance deviations of non-steady measurements. Unfortunately, they also largely increase benchmark execution time, and this may be impractical in real-world contexts. Another option is to leverage automated techniques that can effectively estimate the end of the warmup phase at run-time. Prior work tried to address this challenge through dynamic reconfiguration (Laaber et al. 2020).

Based on the results for RQ5, dynamic reconfiguration techniques significantly improve the effectiveness of the state of practice. The achieved results show that dynamic reconfiguration techniques outperform developer static configurations in terms of warmup estimation error with statistical significance (p ≤ 0.05) and large/medium effect sizes (see Section 5.5.1). Nevertheless, this improvement may come at the expense of an increased microbenchmark execution time. For example, RCIW produces higher estimates of the warmup time with non-negligible effect size in 20 out of the 30 systems (see Section 5.5.2). On the other hand, CV and KLD have more heterogeneous behaviors depending on the system, but they still report higher warmup estimates than those of developers in 10 systems. Further empirical studies are needed to assess whether such a time increase is acceptable for practitioners.

The results for RQ4 also highlight a substantial diversity among dynamic reconfiguration techniques. One peculiar example is RCIW which, on the one hand, induces the highest increase in microbenchmark execution time but, on the other hand, provides the most reliable set of performance measurements. Microbenchmark practitioners who do not have specific time concerns (e.g., small benchmark suites) should adopt RCIW for a reliable steady state performance assessment. In the other cases, KLD and CV represent the best alternatives.

Despite the promising results highlighted in RQ5, our findings suggest room for improvement in dynamic reconfiguration. As shown for RQ4, all dynamic reconfiguration techniques lead to a substantial estimation error, with RCIW providing by far the largest error (median: 48 s), and CV and KLD producing smaller, but still relevant, estimation errors (median of 19 and 17 s, respectively). These errors induce significant, though diverse, side effects depending on the technique. For example, RCIW induces substantial side effects in terms of time waste (median: 46 s), while CV and KLD induce more frequent and impactful side effects on the reliability of performance measurements (median performance deviation of 10% and 9%, respectively). Nonetheless, half of the warmup estimates of dynamic reconfiguration techniques can be reduced by at least 96%, when only considering overestimated forks. These results highlight a large space for improvement in dynamic reconfiguration techniques, and call for further research on designing and developing more effective dynamic reconfiguration techniques. For example, future research may explore the use of other stability metrics (e.g., autocorrelation metrics, other confidence interval metrics (Fieller 1954)), or combinations of them, to more effectively determine the end of the warmup phase. Another suggestion is to focus more on improving warmup estimation accuracy, rather than finding stability criteria that are suitable for stopping both forks and warmup iterations. Indeed, based on our results, dynamically stopping forks does not produce any tangible improvement over developer static configurations. In this regard, our suggestion is to let practitioners manually configure forks based on their own needs and time constraints. Nonetheless, we recommend always running at least 5 forks (i.e., the default in JMH) to mitigate the impact of non-steady measurements. Practitioners may decide to run fewer forks when time represents a major concern; however, they should be aware of the potential implications on results quality.

7 Threats to Validity

Our study may be affected by several threats, spanning from how we collected performance data to the domains of the subject projects; we describe them in the following.

Construct Validity

We assume that the benchmarks that reach a steady state are able to do so within the execution time we defined. There may be benchmarks that need more time to show some stability. However, we chose the execution time to be considerably longer than the time we found in developer configurations (171 times longer on average). Also, as done in other studies (Laaber et al. 2020), the measurement time, the number of iterations, and the number of forks are fixed for every benchmark. The consequence is that the number of invocations varies from one benchmark to another. Nonetheless, no benchmark is invoked less than 3000 times per fork, i.e., 1000 invocations more than in Barrett et al. (2017). The number of forks is fixed at 10, as recommended in Barrett et al. (2017).

Our experiment was performed in an environment where we tried our best to reduce measurement noise and external influencing factors (Papadopoulos et al. 2021; Mytkowicz et al. 2009b). Such settings are effective in improving the accuracy of results, but may not represent the most common environment in which developers execute the benchmarks. However, general reference environments can hardly exist for benchmarks. In fact, different developers can potentially execute the same microbenchmark on a wide range of different machines/environments, given the inherently distributed nature of open-source software development. On top of that, there is an increasing interest in promoting the adoption of microbenchmarks in CI (Laaber and Leitner 2018; Laaber et al. 2019, 2020, 2021), which most likely involves uncontrolled/noisy environments. All these aspects make it impractical to identify a reference environment for each benchmark/system. For this reason, we have deliberately chosen to execute benchmarks on the same bare-metal server, using precautionary measures to mitigate measurement noise and external influencing factors (see Section 4.1 for details). In this respect, we preferred to control the confounding variables rather than lose accuracy in noisier and less reproducible settings. This consideration was especially motivated by the first goal of our study (RQ1), i.e., checking whether benchmarks reach a steady state of performance. Indeed, we prefer to run benchmarks in an environment that is considerably less subject to noise than those of developers, rather than possibly causing non-steady executions due to noise that is specific to our environment. Nonetheless, we do believe that future research should further strengthen our investigation beyond controlled environments, i.e., by studying steady state performance also in uncontrolled and noisy environments. For example, future studies may replicate our experiments in cloud contexts to assess how such virtualized environments affect steady state performance. We envision that this latter step is crucial to foster the adoption of Java microbenchmarking in CI.

Developer configurations for warmup iterations wi and measurement iterations i are derived from the JSON reports generated by JMH (details in Section 4.4). We do not consider other ways in which developers might execute benchmarks, such as passing different command line arguments at launch time or configuring additional parameters in build automation pipelines. Nonetheless, to the best of our knowledge, no study so far has shown evidence of microbenchmarks being used as part of build automation pipelines (Beller et al. 2017; Rausch et al. 2017; Vassallo et al. 2017).
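As an illustration of this extraction step, the sketch below reads wi and i from a JMH JSON report using Jackson. The file name is a placeholder, and the field names reflect the report format of recent JMH versions; they should be treated as an assumption rather than a guaranteed schema.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;

public class DeveloperConfigExtractor {
    public static void main(String[] args) throws Exception {
        // A JMH JSON report is an array with one entry per benchmark run.
        JsonNode report = new ObjectMapper().readTree(new File("results.json")); // placeholder path
        for (JsonNode run : report) {
            String benchmark = run.get("benchmark").asText();
            int wi    = run.get("warmupIterations").asInt();      // developer warmup iterations
            int i     = run.get("measurementIterations").asInt(); // developer measurement iterations
            int forks = run.get("forks").asInt();
            System.out.printf("%s: wi=%d, i=%d, forks=%d%n", benchmark, wi, i, forks);
        }
    }
}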

Following the methodology of Laaber et al. (2020), we performed post-hoc analysis to derive measurements for developers and dynamic configurations. Indeed, a comprehensive execution of all benchmarks across all developer static configurations and dynamic reconfiguration alternatives would have made our experimental evaluation impractical due to extremely long execution times (the execution of all 586 benchmarks using our single setup took about 93 days). In order to mitigate the impact of post-hoc analysis on the validity of our study, we employed a carefully defined process to derive measurements. We first estimated the time spent in each warmup/measurement iteration using the average execution time of each iteration as observed in our microbenchmarking setup. Then, we derived the estimated warmup time wt and the set of measurements Mconf based on the JMH configurations provided by software developers or dynamic reconfiguration techniques. The detailed process to estimate wt and derive Mconf can be found in Section 4.4. Nevertheless, our results may be marginally affected by approximations in converting measurement samples from our configuration to the others.
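For clarity, the following is a simplified sketch of this derivation under the assumption stated above, namely that each iteration lasts, on average, a fixed amount of time in our setup: the warmup time wt is approximated as the product of the configured warmup iterations and the average iteration time, and Mconf is then obtained from the first measurement iterations falling after wt. All names are ours, and the full procedure is the one described in Section 4.4.

import java.util.ArrayList;
import java.util.List;

public class PostHocDerivation {

    /**
     * Approximates the warmup time wt (in seconds) implied by a configuration of
     * wi warmup iterations, each lasting (on average) avgIterationTime seconds.
     */
    static double estimateWarmupTime(int wi, double avgIterationTime) {
        return wi * avgIterationTime;
    }

    /**
     * Derives Mconf: the measurements the configuration would have collected,
     * i.e., the first `i` iterations whose end timestamp falls after wt.
     */
    static List<Double> deriveMconf(List<Double> measurements, List<Double> endTimestamps,
                                    double wt, int i) {
        List<Double> mConf = new ArrayList<>();
        for (int k = 0; k < measurements.size() && mConf.size() < i; k++) {
            if (endTimestamps.get(k) >= wt) {   // iteration ends after the warmup budget
                mConf.add(measurements.get(k));
            }
        }
        return mConf;
    }
}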

Internal Validity

The steady state is detected on the basis of execution time measurements. We do not consider event-based stability criteria, such as JIT activity, as these are not part of any JMH report and, therefore, are not something we can compare against when examining developer configurations. However, considering such additional criteria may lead to a different steady state classification.

Before performing the change point detection, we filtered outliers using Tukey’s fences on a sliding window, as described in Section 4.3. The parametrization of this procedure might affect the detection of the steady state. However, we tried to select parameters that result in very conservative outlier filtering: in fact, we only filter 0.27% of the data points. Nonetheless, outlier filtering is a necessary step when employing change point detection algorithms, as the vast majority of such algorithms cannot distinguish between actual changes and outliers (Fearnhead and Rigaill 2019), thus leading to an overestimation of changes.
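For reference, a minimal sketch of outlier filtering with Tukey’s fences on a sliding window is shown below. The window size and the fence factor k are illustrative, not the exact parameters of our procedure (which are given in Section 4.3).

import java.util.Arrays;

public class TukeyFilter {

    /** Returns the q-th quantile of a window (simple nearest-rank estimate). */
    static double quantile(double[] window, double q) {
        double[] sorted = window.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(q * sorted.length) - 1;
        return sorted[Math.max(0, idx)];
    }

    /**
     * Flags as outliers the points falling outside Tukey's fences
     * [Q1 - k*IQR, Q3 + k*IQR], computed on a sliding window centered on each point.
     */
    static boolean[] flagOutliers(double[] series, int windowSize, double k) {
        boolean[] outlier = new boolean[series.length];
        int half = windowSize / 2;
        for (int t = 0; t < series.length; t++) {
            int from = Math.max(0, t - half);
            int to = Math.min(series.length, t + half + 1);
            double[] window = Arrays.copyOfRange(series, from, to);
            double q1 = quantile(window, 0.25);
            double q3 = quantile(window, 0.75);
            double iqr = q3 - q1;
            outlier[t] = series[t] < q1 - k * iqr || series[t] > q3 + k * iqr;
        }
        return outlier;
    }
}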

External Validity

We only focused on GitHub repositories. Therefore, it is unclear whether our findings are valid for other open-source hosting platforms or for industrial software. Nonetheless, we conducted our experiment on 30 systems, a number larger than in most recent empirical studies on performance (e.g., Laaber and Leitner 2018; Laaber et al. 2020; Ding et al. 2020; Reichelt et al. 2019).

We chose to limit the scope of the experiment to JMH microbenchmarks, because JMH is a mature and widely adopted Java microbenchmark harness. We ran the benchmarks on JVM 8 (JDK 1.8.0 update 241) or 11 (JDK 11.0.6), depending on the requirements of the specific system. Using other JVM versions or JVMs from other vendors may change the results. All benchmarks were executed on the same bare-metal server running Linux; our findings may not generalize to other hardware configurations or operating systems.

Conclusion Validity

The changepoint detection method we used assumes independence in the time series. Even when the data contain some dependence, changepoint methods can still be applied, provided that larger penalty values are used (Antoch et al. 1997; Barrett et al. 2017). The penalty values we dynamically compute for each fork, as explained in Section 4.3, tend to be larger (the average penalty is 504.54) than the value used, for example, by Barrett et al. (2017). Also, we manually inspected some of the time series to ensure that the segmentation was reasonable given the goal of the experiment.
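To make the role of the penalty explicit, recall the standard penalized changepoint objective (written here in our notation; the exact cost function and penalty computation we adopt are described in Section 4.3). For a fork’s time series $y_{1:n}$, changepoint detection selects the number $m$ and the locations $\tau_1 < \dots < \tau_m$ of changepoints (with $\tau_0 = 0$ and $\tau_{m+1} = n$) that minimize

$$\sum_{j=0}^{m} \mathcal{C}\big(y_{(\tau_j + 1):\tau_{j+1}}\big) + \beta\, m,$$

where $\mathcal{C}(\cdot)$ is the cost of a segment and $\beta$ is the penalty. Larger values of $\beta$ admit fewer changepoints, which is why dependence in the data can be compensated for by increasing the penalty.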

Wherever possible, we used appropriate statistical procedures, reporting p-values and effect sizes to assess both the significance and the magnitude of differences.

8 Related Work

Performance analysis of software systems can be tackled from different perspectives, e.g., through models at runtime (Cortellessa et al. 2022; Giese et al. 2020) or by means of benchmarking. Recently, benchmarking has played a key role in discovering potential performance flaws (Stefan et al. 2017).

One of the main problems with benchmarking results is the reliability of the data. Recently, different approaches have defined rigorous processes to interpret such data. For example, some approaches rely on statistical inference to identify and measure the reliability of benchmarking results (Kalibera and Jones 2013, 2020). Other approaches, instead, have presented performance analysis methodologies to extract data in a more reliable way (Georges et al. 2007). Barrett et al. (2017) introduced a fully automated statistical approach based on changepoint analysis. They studied a set of small and deterministic VM benchmarks across different types of VMs, including the JVM, and found that such benchmarks may not always reach a steady state of performance.

On the other hand, performance benchmarking is a time-demanding process. Recently, some approaches have investigated solutions to reduce the time needed for performance analysis while preserving reliable results (AlGhamdi et al. 2020; Mostafa et al. 2017; He et al. 2019). He et al. (2019) studied how to reduce performance testing time in the cloud. They introduced a statistical tool, namely PT4Cloud, that provides stop conditions to obtain reliable performance indices. Another way to reduce testing time is to reuse the “functional” unit tests, which are likely available and maintained. For example, Bulej et al. (2017) extended “functional” unit tests with performance knowledge by equipping them with stochastic performance logic.

Java Microbenchmark Harness (JMH) is a popular benchmarking framework for Java software. JMH allows developers to define performance tests while reducing measurement variability and the contribution of external factors during the microbenchmarking phase. Hence, several studies have investigated different aspects of JMH (Costa et al. 2021; Samoaa and Leitner 2021; Laaber and Leitner 2018; Laaber et al. 2019). A JMH microbenchmark might be affected by bad practices that degrade the quality of performance results. Costa et al. (2021) studied such bad practices by analyzing a corpus of 123 open-source systems, and identified those most likely to lead to misleading performance indicators. Laaber et al. (2020) focused their study on reducing the execution time of microbenchmarks through dynamic reconfiguration. They defined three stability criteria to dynamically estimate the end of the warmup phase and halt warmup iterations accordingly. Samoaa and Leitner (2021) studied, instead, the impact of benchmark parameters and how they affect performance results.

In this work, we studied the effectiveness of modern Java microbenchmarking for steady state performance assessment. Similarly to Barrett et al. (2017), we investigated whether microbenchmarks reach a steady state of performance. However, unlike them, we studied this aspect in the more defined scope of “testing-oriented” Java benchmarks, i.e., JMH benchmarks specifically designed to assess performance of a particular software. Our results are consistent with those gathered by Barrett et al. (2017), thus confirming that, even in a different context, Java benchmarks may not always reach a steady state of performance.

Costa et al. (2021) broadly studied bad practices in JMH benchmarks; we, instead, specifically investigated the effectiveness of developer configurations for steady state performance assessment. Laaber et al. (2020) presented dynamic reconfiguration as a viable alternative to developer static configurations. In their study, they compared dynamic reconfiguration to JMH default configurations, and observed a significant reduction in execution time with a negligible loss of result quality. In our study, instead, we evaluated the effectiveness of dynamic reconfiguration for steady state performance assessment. Furthermore, we showed, through a rigorous comparison, that dynamic reconfiguration is significantly more effective than developer configurations and, as such, produces less pronounced side effects.

9 Conclusion

This paper presents a comprehensive investigation of Java steady state performance assessment. Through a rigorous analysis, we showed that Java microbenchmarks do not always reach a steady state of performance, thus confirming the finding of Barrett et al. (2017) in the more defined scope of “testing-oriented” Java microbenchmarks.

Even when microbenchmarks consistently reach a steady state of performance, a reliable assessment remains far from trivial. According to our results, the current state-of-practice, which mostly relies on developer static configurations, shows poor effectiveness for steady state performance assessment. Developers often fail to accurately estimate the end of the warmup phase, thereby causing either large time waste or poor result quality. Dynamic reconfiguration represents a significant leap forward over developer static configurations, as it provides more accurate warmup estimates and less pronounced side effects. Still, the achieved results highlight non-trivial estimation errors, large time waste, and distorted performance measurements.

The findings of our work have implications for both practitioners and researchers.

For the former, it is important to be aware that benchmark forks may not always reach a steady state of performance. The recommendation here is to perform an adequate number of forks (e.g., 10) to mitigate the noise introduced by “non-steady” forks. Another important lesson for practitioners is to favor dynamic reconfiguration over static configuration when possible. Indeed, when compared to developer configurations, dynamic reconfiguration techniques provide more accurate estimates of warmup time, though this improvement may (sometimes) come at the expense of a more time-consuming performance assessment process. Further empirical studies are needed to assess whether this “cost” is acceptable for practitioners. Nonetheless, the achieved results are also helpful for suggesting which technique to use depending on the practitioner’s need.

On the researchers’ side, given the promising results of dynamic reconfiguration and the large room for improvement suggested by our investigation, we envision research aimed at designing novel and more effective dynamic reconfiguration techniques to (i) reduce the time effort devoted to performance assessment and (ii) strengthen the reliability of performance measurements. This is a direction we aim to investigate in future work.

We have made the code and the data used in our study publicly available to encourage further research on this topic.