Sample size calculations for the experimental comparison of multiple algorithms on multiple problem instances

This work presents a statistically principled method for estimating the required number of instances in the experimental comparison of multiple algorithms on a given problem class of interest. This approach generalises earlier results by allowing researchers to design experiments based on the desired best, worst, mean or median-case statistical power to detect differences between algorithms larger than a certain threshold. Holm's step-down procedure is used to maintain the overall significance level controlled at desired levels, without resulting in overly conservative experiments. This paper also presents an approach for sampling each algorithm on each instance, based on optimal sample size ratios that minimise the total required number of runs subject to a desired accuracy in the estimation of paired differences. A case study investigating the effect of 21 variants of a custom-tailored Simulated Annealing for a class of scheduling problems is used to illustrate the application of the proposed methods for sample size calculations in the experimental comparison of algorithms.


Introduction
Experimental comparison of algorithms has long been recognised as an essential aspect of research on meta-heuristics [2,36,47,18]. Despite well-earned criticisms of some disconnect between theory and experimentation [35,17] and a few methodological issues that still require adequate attention [24,6], experimental evaluation remains a central aspect and a major component of the development and understanding of EAs.
This reliance on experimental assessment and comparison of algorithms is evidenced by the continuing effort of researchers in devising better experimental protocols for performance assessment and comparison of algorithms. While many of the most important points were presented as far back as the late 1990s [2,47,36], research into adequate protocols and tools for comparing algorithms has continued in the past two decades, with several statistical approaches being proposed and employed for comparing the performance of algorithms [18,39,59,20,60,9,11,5,7,30,29,22,16,21,8,40,33,15,13]. This increased prevalence of more statistically sound experiments in the field of optimisation heuristics can be seen as part of the transition of the area into what has been called the scientific period of research on metaheuristics [57].
Despite of this area-wide effort, a few important aspects of experimental comparisons of algorithms remain largely unexplored in the literature. One of those topics is the question of how to adequately determine the relevant sample sizes for the comparisons of algorithms -number of problem instances to use, and number of runs to employ for each algorithm on each instance. The standard approach has been that of maximizing the number of instances, limited only by the computational budget available [2,30,28,22,1], and of running arbitrarily-set numbers of repeated runs, usually 30 or 50. While it is indeed true that the sensitivity of comparisons increases with the number of instances, this does not mean that a large sample size can substitute a well-designed experiment [44,46,15]. Also, it is important to be aware that arbitrarily large sample sizes allow tests to detect even minuscule differences at arbitrarily strict significance levels, which may lead to the wrongful interpretation that effects of no practical consequence are strongly significant [46,4] if certain methodological safeguards are not put into place when designing the experiment [15].
The dual issues of statistical power and sample size have been touched by a few authors in the past, albeit only superficially [38,18,19,53,4,5,7]. Until recently, the best works on the subject were those by Mauro Birattari [9,10] and Chiarandini and Goegebeur [7,Ch. 10]. The former correctly advocated for a greater focus on the number of instances than on repetitions, showing that the optimal allocation of computational effort, in terms of accuracy in the estimation of mean performance for a given problem class, is to maximize the amount of test instances, running each algorithm a single time on each instance if needed. Birattari's approach, however, was focussed only on accuracy of parameter estimation for the mean performance across a problem class, yielding very little information on the specific behaviour of each algorithm on each instance, as well as not considering questions of desired statistical power or sample size calculations. The work by Chiarandini and Goegebeur [7, Ch. 10] provided a good discussion on statistical power and sample size in the context of nested linear statistical models, as well as some guidelines on the choice of the number of instances and number of repeated runs based on the graphical analyses of power curves. It was, however, limited to nested models, and required manual inspection of power curves by the user, which may have precluded its broader adoption in the literature.
More recently, we have proposed a principled approach for calculating both the number of instances and number of repeated runs when comparing the performance of two algorithms for a problem class of interest [15]. That approach calculates the number of instances based on the desired sensitivity to detect some minimally relevant effect size (MRES), i.e., the smallest difference between algorithms that is considered to have some degree of practical relevance. Alternatively, the number of instances can be fixed a priori (e.g., when using standard benchmark sets) and the sensitivity curves can be derived instead. As for the number of repeated runs per instance, the approach proposed in [15] was based on minimising the number of algorithm runs required to achieved a the desired accuracy in the estimation of the paired differences in performance for each instance. While statistically sound, the fact that the results presented in that work applied only to comparisons of two algorithms limited their applicability to the general case of experimental comparisons of algorithms, which often aim to investigate the relative performances of multiple algorithms, or multiple variants of a given algorithmic framework.
In this paper we generalise the results from [15] to calculate the required sample sizes for the comparison of an arbitrary number of algorithms. The calculation of the number of instances is based on considerations regarding the test of multiple hypotheses that commonly occurs in these contexts, correcting the significance level of each test so as to maintain the familywise error rate controlled at a given desired level [55]. Optimality-based ratios of sample sizes are derived for sampling algorithms on each individual instance, both for the simple difference of means and for two types of percent differences. A simple sampling heuristic is also provided, which can easily generalise the sampling approach presented here for cases in which general performance statistics are of interest. 1 The remainder of this paper is organised as follows: Section 2 explicitly defines the algorithm comparison problem considered in this work, states the hypotheses to be tested, and provides the rationale for the choices that justify the proposed approach for sample size calculation. Section 3 describes the proposed approach for sampling an arbitrary number of algorithms on any given problem instance, and presents the derivation of optimal sample size ratios for three common cases. Section 4 describes the general concepts behind the estimation of the required number of instances for an experiment to achieve desired statistical properties. Section 5 presents an example of application of the proposed methodology for the evaluation of 21 algorithmic variants of a stateof-the-art, custom-tailored simulated annealing approach [54] for a class of scheduling problems [58]. Finally, Section 6 presents final considerations and conclusions.

Problem Definition
Before proceeding to investigate the required sample sizes for the experimental comparison of algorithms, it is important to formally define the questions one is trying to answer when those comparisons are performed. While there are several different scientific questions that can be investigated by experimentation, arguably the most general case for experimental meta-heuristic comparisons is: given a finite subset of problem instances and a finite number of runs that can be performed by each algorithm being tested, what can be said regarding the relative performance of those algorithms (according to a given set of performance indicators) for the problem class from which the test instances were drawn? Notice that this question is broad enough to encompass the most common cases in the experimental comparison of algorithms, and even several comparisons of scientific relevance that are not routinely performed [24].
Aiming to highlight the specific aspects that are considered in the present work, we refine this problem definition as follows: Let Γ S = {γ : ∈ [1, N ]} ⊂ Γ represent a finite sample of problem instances drawn from a given problem class of interest, Γ; and let A = {a 1 , a 2 , . . . , a A } denote a set of algorithms 2 that we want to compare. We define the algorithm comparison problem in the context of this work as that of obtaining a (partial) ordering of the algorithms according to their mean performance on the problem class of interest, based on the observed performance values across the set of test instances used. We assume here that (i) all algorithms can be run on the same sample of instances; (ii) any run of an algorithm returns some tentative solution, which can be used to estimate the performance of that algorithm for the problem instance; and (iii) the researcher is interested primarily in comparing the performance of the algorithms for the problem class Γ, instead of for individual instances. Each of these three experimental assumptions can be associated with a specific aspect when comparing algorithms: • Assumption (i) indicates that the variability due to instance effects can be modelled out of our analysis by pairing or blocking [15,50], which is the underlying approach of methods such as ANOVA with blocking, or Friedman's test [20].
• Assumption (ii) essentially prevents us from having to deal with the problem of missing values in our analysis, since every run will return some valid performance observation. The question on how to deal with missing values in the comparative analysis of algorithms -e.g., if a given algorithm fails to converge in a study where performance is measured by time-untilconvergence -is a very relevant one, but it falls outside the scope of this particular work.
• Assumption (iii) is associated with the fact that comparative testing of algorithms should, in the vast majority of cases, be focused on generalising the performance observed on limited test set Γ S to the wider class of problems Γ. This contrasts with a somewhat common practice in the heuristics literature of performing individual Rank Sum tests on each instance and tallying up the wins/losses/ties, without any further inference being performed on these summary quantities. The problem with this approach for the comparison of algorithms on multiple problem instances has been recognised at least since 2006 [20], but despite of this the practice still persists in the literature.
Focusing on testing hypotheses regarding the (relative) performance of algorithms for a problem class of interest also determines two other aspects of the design and analysis of comparative experiments. The first is the issue of what is considered the effective sample size for these experiments. As argued in earlier works [15,9,7], the effective sample size to be considered when testing the hypotheses of interest is the number of instances, rather than the amount of repeated runs (or even worse, the product of these two quantities). Failure to account for this when performing inference results in pseudoreplication [37,48,43], that is, the (often implicit) use of artificially inflated degrees-of-freedom in statistical tests. This results in violation of the independence assumption underlying these tests, and results in actual confidence levels that can be arbitrarily inferior to the nominal ones.
The second aspect, which is a consequence of the first, is that the analysis of data generated by the experiments can be done using either a hierarchical model [31] or, alternatively, a block design applied on summarised observations, in which the performance of each algorithm on each instance is summarised as a single value, often the mean or median of multiple runs. This second approach is equivalent to the first under the assumption (iii) stated above, and results in the application of the well-known and widely-used methods for comparisons of the average performance of multiple algorithms on multiple problem instances: omnibus tests such as Complete Block Design (CBD) ANOVA or Friedman's test [20]. Those tests are used to investigate whether at least one algorithm has an average performance different from at least one other. If that is the case, then multiple comparison procedures are employed to answer the more specific statistical questions of which algorithms differ from which in terms of their average performance [56,50].
In the following, we detail the hypotheses of interest that are tested by these statistical procedures when performing comparative experiments with algorithms, and suggest that a greater focus on the multiple comparison procedures can provide experimenters with better tools for designing and analysing their experiments.

Test Hypotheses
Let Y k| denote the performance of algorithm a k on instance γ . Based on the definitions from the preceding section, we can decompose the performance according to the usual model of the complete block design [49]: in which µ k denotes the mean performance of algorithm a k for the problem class Γ after the (additive) effect of each instance, θ , is blocked out; and k is the residual relative to that particular observation, i.e., the term that accounts for all other unmodelled effects that may be present (e.g., uncertainties in the estimation of Y k| ). Notice that the algorithm mean µ k can be further decomposed into a grand mean of all algorithms for the problem class of interest, µ, and the effect of algorithm a k on this grand mean, represented by τ k . By construction, A k=1 τ k = 0. As mentioned earlier, an usual approach for comparing multiple optimisation algorithms on multiple problem instances (which is an analogous problem to that of testing multiple classifiers using multiple data sets [20]) employs the usual 2-step approach provided in most introductory texts in statistics. It starts by using omnibus tests, such as CBD ANOVA or Friedman's test, to investigate the existence of some effect by testing hypotheses of the form: which is equivalent to testing whether all algorithms have the same mean value versus the existence of at least one algorithm with a different mean performance for the problem class of interest. If the omnibus test returns a statistically significant result at some particular confidence level, pairwise comparisons are then performed as a second step of inference, to pinpoint the significant difference(s). These are usually performed using common tests for the difference of means of two matched populations, such as variations of the paired t-test for post-ANOVA analysis, or Wilcoxon's Signed Ranks test (or the Binomial Sign Test) for post-Friedman analysis.
There are several different sets of pairwise comparisons that can be executed in this step, but the two most commonly used ones are all vs. all and the all vs. one. In the former, A(A − 1)/2 tests are performed on the paired difference of means (again, with instance as the pairing factor) for all pairs of algorithms a i = a j : in which µ (ij) = τ i − τ j represents the mean of the paired differences in performance between a i and a j . All vs. all pairwise comparisons are the default approach in the meta-heuristics literature, but in practice are only required when one is really interested in obtaining a complete "ordering" of all algorithms in a given experiment.
The second type of pairwise comparisons is the all vs. one, in which there is a reference algorithm (e.g. a new proposal one is interested in evaluating). Assuming that the first algorithm, a 1 , is always set as the reference one, all vs. one pairwise comparisons are defined as: for all j = 1. This approach results in only (A − 1) hypotheses to be tested and, as we will discuss later, results in inferential procedures with higher statistical power or, if sample sizes are being estimated, in experiments requiring fewer instances to achieve a given desired power [46]. 3 It is important to highlight at this point that the ANOVA or Friedman tests are not strictly required when performing the comparison of multiple algorithms. From an inferential perspective, it is equally valid to perform only the pairwise comparisons, as long as the rejection threshold of each hypothesis tested is adequately adjusted to prevent the inflation of Type-I error rates resulting from multiple hypothesis testing [55]. While it is true that post-ANOVA pairwise tests can usually benefit from increased sensitivity (by using the more accurate estimate of the common residual variance that results from the ANOVA model), this improvement of statistical power is only achieved under the assumption of equality of variances, which is rarely true in algorithm comparisons. We argue that this potential (albeit occasional) disadvantage is more than compensated by a range of possible benefits that emerge when we focus on the pairwise comparisons: • Tests for the mean of paired differences are generally much simpler than their corresponding omnibus counterparts, and generally require fewer assumptions. For instance, unlike the CBD ANOVA, the paired t-test does not require equality of variances between algorithms, nor does it assume that the data from each algorithm is normally distributed -only the paired differences need be approximately normal, and not even that if the sample size is large enough. These tests also yield much more easily interpretable results, and are considerably simpler to design and analyse.
• In the context of experimental comparison of algorithms one is commonly interested in relatively few tests: the number of algorithms compared is usually small, and in several cases tests can be performed using the all vs. one approach (e.g., when assessing the performance of a proposed method in comparison to several existing ones). This alleviates the main issue with testing multiple hypotheses, namely the loss of statistical power (or, alternatively, the increase in the sample size required to achieve a certain sensitivity) that results from adjusting the significance levels to control the Type-I error rate. [55].
• From the perspective of estimating the number of required instances for algorithm comparisons, one of the most challenging quantities that need to be defined in the design stage of these experiments is a minimally relevant effect size (MRES) [15]. The definition of MRES in the context of omnibus tests is very counter-intuitive, since it must be defined in terms, e.g., of the ratio of between-groups to within-groups variances. While still challenging, the definition and interpretation of MRES in the context of comparing a pair of algorithms is incomparably simpler: it essentially involves defining the smallest difference in mean performance between two algorithms (either in terms of raw difference or normalised by the standard deviation) that would have some practical consequence in the context of the research being performed [15]. 3 Statistical power (or sample sizes) can be further improved if the alternative hypotheses in (4) are directional, which can be employed if the researcher is specifically interested in knowing whether or not the reference algorithm is superior (in terms of mean performance) to the others (i.e., if equality or inferiority are considered equally undesirable outcomes). Of course it makes no sense trying to use directional alternative hypotheses in the all vs. all case, since in that case there is no reference algorithm common to all tests.
Given the considerations above we argue that, while it is common practice to perform the experimental analysis as the 2-step procedure outlined earlier, it is often more practical to focus on the pairwise tests, particularly from the perspective of sample size calculations. In the following sections we detail how the two sample sizes involved in the experimental comparison of algorithms -number of instances, number of runs of each algorithm on each instance -can be estimated by focusing on the pairwise comparisons. This of course does not preclude the use of omnibus tests as part of the analysis, or of any other type of analytical tool for investigating and characterising the behaviour and performance of the algorithms being tested. In fact, sample sizes estimated by focusing on the pairwise comparisons result in omnibus tests with at least the same statistical power for which the pairwise tests were planned, assuming that their assumptions, which are often more restrictive, are satisfied. However, the use of omnibus tests becomes optional, and the analyst can choose to proceed directly with the pairwise tests knowing that the statistical properties of their experiment are guaranteed.
Finally, based on the statistical model from (1) we can define the two kinds of paired differences in performance, which we will examine in this work. The first one is the simple paired difference in performance, D (ij)| = Y i| − Y j| , which can be estimated from the data as in whichX i| andX j| are the mean (or median) observed performance of algorithms a i and a j on instance γ , computed from n i and n j runs, respectively. An alternative approach is to use percent differences, which can be estimated for the all vs. one case as and for the all vs. all case as in whichX •| is the estimate of the grand mean of all algorithms on instance , 3 Calculating the number of repetitions for comparisons of multiple algorithms

Preliminary Statistical Concepts
Before we present the derivation of sample size formulas, there are a few statistical concepts that need to be clarified. More information and detailed discussions of these concepts are available in our previous work [15], and here we will only present a very brief review of these definitions. The minimally relevant effect size (MRES), which is an essential concept for determining the smallest sample size in the comparison of algorithm, is defined in the context of this work as the smallest difference between two algorithms that the researcher considers as having some practical effect. There exists several different effect size estimators that are relevant for a variety of experimental questions [25], but in the context of the hypotheses considered in this work we focus on Cohen's d coefficient, which is the (real) value of the mean of the paired differences 4 normalised by σ, the (real) standard deviation of the paired differences. In the context of this work, the value in the numerator of (9) is estimated as the mean of the | D| values calculated using (5) (for the simple paired difference) or (6)-(7) (for the percent paired differences). Based on this definition, the MRES is defined as i.e., as the smallest magnitude of the standardised mean of paired differences between two algorithms that would actually result in some effect of practical relevance. 5 Let X t k| denote the observation of performance of algorithm a k on the t-th run on a given instance γ , andX (11) denote the sample estimator of the mean performance of a k on that instance, based on n k| independent runs. Under relatively mild assumptions, we know that the sampling distribution of means converges to a Normal distribution even for reasonably small sample sizes, with where the σ 2 k| / n k| term is the squared standard error of the sample mean. Given a single problem instance γ and the A algorithms one wishes to compare, the approach for calculating the number of repetitions is similar to the one outlined in [15], namely finding the smallest total number of runs such that the standard errors of estimation -which can be interpreted as the measurement errors of the pairwise differences in performance -can be controlled at a predefined level. This can be expressed as an optimisation problem, (13) in which n k| is the number of runs of algorithm a k on instance γ , se (ij)| is the standard error of estimation of the relevant statistic (e.g., simple or percent difference of means) for a pair of It is worth noting that the most adequate formulation for this problem is an integer one, but in this work we employ a relaxed version of the formulation that accepts continuous values. Since this formulation is used only to derive reference values for the optimal ratio of sample sizes (which can be continuous even under the discrete sample size constraint) this relaxation does not result in any inconsistencies. Also notice that there is a second set of constraints that is omitted from formulation (13), namely n k| ≥ 2, ∀k ∈ [1, A], which are needed to guarantee that standard deviations can be estimated for each algorithm. We did not explicitly include these constraints since the proposed method assumes that all algorithms will be initially run a number n 0 > 2 times before iterative sample size estimations start, so these constraints will always be satisfied.
The specific calculation of the standard errors se ij depends on the type of differences being considered: simple differentes or percent differences and, in the latter case, whether the planned comparisons are an all vs. one or all vs. all case. The results for the comparison of two algorithms were presented in our previous work [15], and in this section we generalise those results to the case of comparisons of multiple algorithms. We also introduce a heuristic that can be used to estimate the numbers of repetitions in the case of more general statistics that may be of interest in the context of experimental comparison of algorithms.
3.2 Derivation of optimal ratios n i| /n j| While it was possible to derive an analytic solution to the problem of determining the optimal ratio of sample sizes in the two algorithm case [15], a similar closed solution does not seem to be possible for the general problem (13) with A ≥ 3. It is, however, possible to derive optimality-based ratios of sample sizes for any pair of algorithms, which allows us to propose a principled heuristic to generate adequate samples for a set of A algorithms on any individual instance. The proposed heuristic iteratively generates observations of performance for each of the algorithms considered, so that the constraints se 2 (ij)| ≤ (se * ) 2 can be satisfied with as few total algorithm runs as possible.
This heuristic is described in Algorithm 1. The procedure detailed in Algorithm 1 is general enough to accommodate any statistic one may wish to use for quantifying the pairwise differences of performance between algorithms, as long as the standard errors and their sensitivities to changes in n k| can be estimated (which can be done using bootstrap, if analytical expressions are not available). In the case where analytical expressions for the optimal ratio of sample sizes between two algorithms are available (see below), the determination of the algorithm that is contributing the most to se max on a given instance γ (line 5 of the algorithm) can be done based on these optimal ratios: if the observed ratio is smaller than the optimal one, run the algorithm in the numerator; otherwise, run the one in the denominator.
Another point worth mentioning is that the procedure outlined in Algorithm 1 is essentially a greedy approach to reduce the worst-case standard error, which means that even if it is interrupted by the computational budget constraint the resulting standard errors of estimation will be the smallest ones achievable. This, in turn, means that the residual variance due to these estimation uncertainties will be as small as possible, which in turn increases the sensitivity of the experiment Algorithm 1 Sample algorithms on a single instance. Require: Instance γ ; Algorithms a 1 , . . . , a A ; accuracy threshold se * ; initial sample size n 0 ; maximum sample size n max . 1: Generate n 0 initial samples of each algorithm on instance γ . 2: Estimate all se ij of interest

5:
Perform one additional run of the algorithm that is contributing the most to se max on instance γ 6: Vectors of observations generated for all algorithms.
to smaller differences in performance by reducing, even if slightly, the denominator term in (9). The derivation of the standard errors and optimal ratios of sample sizes, for the simple and percent differences of means, both for the all vs. one and all vs. all cases, are presented below. Notice that while we can derive optimal sample size ratios for all paired differences of interest, Algorithm 1 will result in ratios that can deviate (sometimes substantially so) from the optimal reference values. This is a consequence of the fact that ratios of sample sizes are not independent quantities when multiple algorithms are compared. In practice, pairs that present the worst-case standard error ( se max ) more often will tend to have ratios of sample sizes closer to the optimal values derived below, with the other pairs deviating from optimality due to the algorithm focusing the sampling on the pairs that are associated with se max . This, however, does not reduce the validity and usefulness of the optimality derivations (as they are needed to decide which algorithm from the selected pair should receive a new observation), nor does it affect the quality of the results provided by Algorithm 1.
Finally, for the sake of clarity we will omit the instance index from all derivations in the next section, but the reader is advised to keep in mind that the calculations in the remained of this section are related to a single given instance.

Simple differences of means
In the case of the simple differences of means, the statistic of interest for any pair of algorithms a i , a j is which is estimated using the sample quantities Given (12), the sampling distribution of φ ij will be with squared standard error For any pair of algorithms a i , a j , the optimal sample size ratio n i /n j for a given instance γ can be found using the formulation in (13): Find [n i , n j ] = arg min f ij (n i , n j ) = n i + n j ; This formulation has a known analytical solution [15] in terms of the ratio of sample sizes: which can be rewritten in terms of sample estimators aŝ with S i , S j being the sample standard deviations of the observations of algorithms a i , a j , respectively. In this case the decision rule for which algorithm should receive a new observation (line 5 of Algorithm 1) can be expressed as: let a i , a j be the two algorithms that define se max . Then: It is worthwhile to notice that this result is independent on whether one is interested in all vs. all or all vs. one comparisons, the only difference being which pairs a i , a j generate se ij constraints in (13).

Percent differences of means
For the percent differences of means 6 we need to distinguish between the two types of comparisons discussed in this work.
All vs. one: For all vs. one comparisons, assuming a 1 is the reference algorithm, the statistic of interest is given as: which can be interpreted as the percent mean loss in performance of a j when compared to a 1 . This is estimated using which is distributed according to one plus the ratio of two normal variables R/S, with 6 Strictly speaking the difference discussed here is the proportional (per unit) difference, which must be multiplied by a factor of 100 to be actually expressed as percent. This, however, has no effect in the derivations that follow, and is merely a matter of linear scaling.
If we can assume that all algorithms involved return observations that are strictly positive (which is quite common, e.g., in several families of problems in operations research as well as for several common performance indicators in multiobjective optimisation), the squared standard error of (23) can be calculated as [15,26,27]: with The KKT conditions for the problem formulation in (18) state that, at the optimal point n * 1 , n * j the following should be true: The gradient of the objective function is trivially derived as ∇f 1j = [1,1], and the partial derivatives of g 1j are: ∂g 1j The first row of (25) can then be expanded as: which means that C 1j 1 n −2 1 = C 1j 2 n −2 j at the optimal point. 7 Isolating the ratio of sample sizes and simplifying finally yields: which can be estimated from the sample using: Given this expression forr opt , the same rule expressed in (21) can be used in Line 5 of Algorithm 1, simply replacing n i by n 1 .
All vs. all: For all vs. all comparison using percent differences the question of which mean to use as the normalising term in the denominator must be decided. Assuming that no specific preference is given to any of the algorithms in this kind of comparison, we suggest using the grand mean µ instead of any specific mean µ k , i.e.: which can now be interpreted as the difference between the means of a i and a j , in proportion to the grand mean. This is estimated using is an estimator of the grand mean µ. This estimator is unbiased regardless of unequal sample sizes, and has a squared standard error of: Under the same mild assumptions under whichX k ∼ N µ k , σ 2 k n −1 k , we have that the sampling distribution of this estimator is progressively closer to a normal distribution as the sample sizes are increased. These considerations allow us to calculate the standard error of (31) in the same manner as (24), using the simplified formulation of Fieller's standard error [26,27], i.e.: with with φ ij defined as in (30). This new expression for the standard error leads to (26) being replaced by: which, when inserted into the KKT conditions (25) yields: and, consequently, which is identical to the optimal ratio for the simple differences of means, and can be similarly estimated using the sample standard deviations. Once again, the same decision rule from (21) can be used.

Calculating the number of required instances for comparisons of multiple algorithms
In this section we deal with the calculation of the number of instances, N , required to obtain an experiment with predefined statistical properties. In a previous work [15] we developed sample size formulas for the comparison of two algorithms on a problem class of interest. Since we have reduced the multiple algorithm comparison problem to a series of pairwise comparisons (at least for the purposes of sampling), most of the theoretical justifications remain the same, and the only actual difference in the calculation of the required number of instances is the need to adjust the significance levels so as to maintain the overall probability of Type-I errors (false positives) controlled at the desired level α [55]. The calculation of the number of instances for comparing two algorithms is based on the definition of a few desired statistical properties for the test [15]. More specifically, we want the test to have a predefined statistical power, π * = (1 − β * ), to detect differences equal to or greater than a minimally relevant effect size d * , at a predefined significance level α. This led to the definition of the smallest required number of instances for the comparison of two algorithms (based on the paired t-test, for a two-sided alternative hypothesis) as: is the q-quantile of the central t distribution with N − 1 degrees of freedom, and t (N −1) q;|ncp * | is the q-quantile of the non-central t distribution with N − 1 degrees of freedom and noncentrality parameter |ncp * | = |d * | √ N . The required sample sizes can be further reduced in the all vs. one comparisons if one-sided alternative hypotheses can be used, in which case we simply replace t (38). 8 Since our proposed protocol for sample size calculation is based on designing the experiments from the perspective of the pairwise tests that will need to be performed, the sample size formulas outlined above remain valid, and we only need to adjust the significance level to account for the testing of multiple hypotheses. As mentioned previously, we have two common cases in the experimental comparison of algorithms: all vs. all comparisons (3), which result in K = A(A − 1)/2 tests; and all vs. one comparisons (4), in which case only K = A − 1 comparisons are performed.
As discussed by Juliet Shaffer in her review of multiple hypothesis testing [55], "when many hypotheses are tested, and each test has a specified Type I error probability, the probability that at least some Type I errors are committed increases, often sharply, with the number of hypotheses". This is quite easy to illustrate if we consider that each test has a base probability of falsely rejecting a true null hypothesis given by its significance level α. If K comparisons are performed, and assuming that (i) the comparisons are independent, and (ii) all null hypotheses are true, the overall probability of observing one or more false positives (commonly called the family-wise error rate, or FWER) can be calculated as As an example, if we consider a somewhat typical case of all vs. all comparisons of 5 algorithms (which would generate K = 10 hypotheses) with a per-comparison α = 0.05, this would result in an F W ER = 1 − 0.95 10 = 0.401, i.e., a chance of around 40% of falsely rejecting at least one of the null hypotheses.
To prevent this inflation of the FWER, a wide variety of techniques for multiple hypothesis testing are available in the literature. Possibly the most widely known is the Bonferroni correction [23,55], which can maintain the FWER strictly under a desired level α f by using, for each comparison, a reduced significance level α = α f /K (or, equivalently, by multiplying all p-values by K and comparing against the original α). This correction is known to be overly conservative, leading to substantial reductions of the statistical power of each comparison, or to substantially higher sample sizes being required to achieve the same power. However, its simplicity often makes it quite appealing, particularly when few hypotheses need to be tested, in which case the penalisation applied to α is not too extreme. If the Bonferroni correction is selected to control the FWER in a given experiment, the calculation of the number of instances N can be performed by simply dividing the value of α in (38) by the number of planned comparisons (K = A − 1 for the all vs. one case, or K = A(A − 1)/2 for all vs. all ). Figure 1 illustrates an example of the increase in the required sample size for Bonferroni-corrected tests as the number of comparisons is increased.
A uniformly more powerful alternative that also controls the FWER at a predefined level is Holm's step-down method [34,55]. Instead of testing all hypotheses using the same corrected significance level, Holm's method employs a sequential correction approach. First, the p-values for all tests are computed. The hypotheses and their corresponding p-values are then ranked in increasing order of p-values, such that p 1 ≤ p 2 . . . ≤ p K . Each of these ordered hypotheses is then tested at a different, increasingly stricter significance level, in which r is the rank of the hypothesis being tested. If R is the largest value of r for which a hypothesis would be rejected at the corresponding α r level, then Holm's step-down method leads to the rejection of all null hypotheses with ranks r ≤ R. Holm's method is also known to maintain the FWER under the desired value [55], but it leads to significantly less conservative tests than the Bonferroni correction. However, while the calculation of the number of instances is quite straightforward for the Bonferroni correction, it requires some clarification when Holm's method is employed. This is because each test is performed using a different significance level, which leads to individual tests with heterogeneous power levels under a constant sample size.
One possible approach, which is found in the statistical literature, is to calculate the sample size based on the least favourable condition. We can calculate the sample size such that the test with the most strict significance level α r will have at least the desired power π * , which guarantees that the statistical power for all other comparisons will be greater than π * . A simple examination of (39) shows that the most strict test will occur for r = 1, in which case α r = α f /K, i.e., the same corrected significance level generated by the Bonferroni correction. In this case, the same simple adjustment can be made for the formulas in (38), namely dividing the value of α by K. A related possibility, which generalises this one, is to calculate the sample size so that some predefined number of tests can have a statistical power of at least π * . This can also be easily set up by determining a number K of comparisons that should have power levels above the predefined threshold, and then divide α by K − K + 1 in (38).
A second possibility, which is also relatively straightforward, is to design the experiment so that the mean (or median) power of the tests is maintained at the nominal level. The median case is roughly equivalent to setting K = K/2 and applying the method outlined above. Estimating sample size for achieving a mean power of π * is more challenging from an analytic perspective, but can be done iteratively without much effort using Algorithm 2. In this pseudocode SampleSize (α, π, d * , H 1 ) denotes the calculation of the required sample size for performing a hypothesis test (e.g., using a paired t-test) with significance level α, power β, MRES d * and type of alternative hypothesis H 1 , as defined in (38). Similarly, Power (α, n, d * , H 1 ) calculates the power of a test procedure under the same parameters, assuming a sample size of n. Notice that since power increases monotonically with α, the final sample size returned by the procedure outlined in Algorithm 2 is guaranteed to be smaller than the result based on the least favourable condition. Figure 2 illustrates the resulting power profile of an experiment designed for mean power using the procedure outlined in Algorithm 2. Two interesting aspects of this planning are immediately clear. First, the worst-case power appears to stabilize at about 0.85, which is not much lower than Algorithm 2 Estimate sample size for mean power in Holm's procedure. Require: Desired mean power (π * ); desired FWER (α f ); type of alternative hypotheses (one or two-sided) (H 1 ); number of comparisons (K) 1:p ← 0 2: N ← SampleSize (α f , π, d * , H 1 ) − 1 Using (38) 3: whilep < π * do 4: N ← N + 1 5: for i ∈ {1, . . . , K} do 6: See [15] for details.

7:
end for Calculate mean power 9: end while 10: return N the desired 0.9, and would still be considered an acceptable power level in most applications. The second one is that the resulting sample sizes are, as expected, lower than those required if the Bonferroni correction were to be used. Based on these considerations, we can offer some suggestions as to how to proceed calculating the required number of instances (N ) for the comparison of multiple algorithms using multiple problem instances: • It is more efficient, often considerably so, to use Holm's correction instead of Bonferroni's, given that the former requires smaller sample sizes to achieve similar power characteristics (see Figure 1).
• In general it seems reasonable to design experiments based on average power for Holmcorrected tests, given that the resulting worst-case power will not be much lower than the nominal value π * , and the number of instances is substantially lower than what would be required for Bonferroni-corrected tests (see Figure 2).
• If worst-case power must be guaranteed, the number of instances can be estimated using the formulas for the Bonferroni case, but the actual analysis should still be conducted using Holm's method, which in this case will result in better statistical power for all comparisons.
• Whenever possible, all vs. one comparisons should be used instead of all vs. all, since they result in fewer hypotheses being tested and, consequently, smaller sample sizes (or larger power if sample sizes are constant).
• If one-sided alternative hypotheses make sense in the experimental context of all vs. one comparisons they should also be preferred, since this also improves efficiency (in terms of requiring fewer instances to achieve a predefined power).
It is worth repeating here an important point also highlighted in our previous work [15]: the sample sizes calculated with the methodology presented in this section represent the smallest sample size required for a given experiment to present desired statistical properties. This can be particularly useful when, e.g., designing new benchmark sets, sampling a portion of an existing (possibly much larger) set of test functions as part of algorithm development, or designing experiments in computationally expensive contexts. If more instances are available in a particular experiment they can of course be used, which will result in experiments with even higher statistical power and can enable the execution of finer analyses, e.g., to investigate the effect of dimension or other instance characteristics on performance. However, even if that is the case, we argue that the proposed methodology can still be useful for a number of reasons: • It can be used by researchers to obtain a more solid understanding of the statistical properties of their experiments (e.g., by examining power × effect size curves at a given sample size).
• By defining a MRES a priori, the proposed approach provides a sanity check for researchers in overpowered experiments, with can result in arbitrarily-low p-values even for minuscule differences of no practical consequence [46]. That is, it provides a practical relevance aspect, to contrast with the statistical significance of the results.
• Even if the calculation of the number of instances is considered unnecessary in the context of a given experiment, the methodology for estimating the number of repeated runs of each algorithm on each instance can still be used to provide a statistically principled way to generate the data.
In the next section we provide an example of application of the proposed methodology for investigating the contribution of different neighbourhood structures for a scheduling problem. Please notice, however, that the main objective of the experiment is to demonstrate the capabilities of the proposed methodology for defining the sample sizes, and not necessarily to answer specific questions regarding the algorithms used in the example.

Problem Description
In the unrelated parallel machine scheduling problem (UPMSP) with sequence dependent setup times [32,42,58] a number of jobs, J, needs to be processed by a number of machines, M , minimising the completion time of the last job to leave the system, i.e., the makespan. This scheduling problem also presents a few other specificities which are not relevant for the current discussion, but are presented in detail in the relevant literature [58,54].
So far the best algorithm for the solution of this problem seems to be a finely-tuned Simulated Annealing (SA) proposed by Santos et al. in 2016 [54]. Santos' SA was tested on a benchmark set originally introduced by Vallada and Ruiz [58], which is composed of 200 tuning instances with J ∈ {50, 100, 150, 200, 250} jobs and M ∈ {10, 15, 20, 25, 30} machines. For each (M, J) pair, eight instances with different characteristics (e.g., distribution of setup times) are present. A larger set, composed of 1000 instances, is also available to test hypotheses derived using the tuning set.
One of the core aspects of Santos' SA is the use of six neighbourhood structures to explore the performance landscape of the problem, namely Shift, Two-shift, Task Move, Swap, Direct Swap, and Switch. At each iteration one of these perturbation functions is selected randomly and generates a new candidate The authors do not provide any discussion related to how much each of these perturbation functions affects the performance of the algorithm. However, preliminary results [45,51] suggest that the six neighbourhood structures have considerably different effects on the algorithm, with Task move having a much more critical influence than the others.

Experimental questions
To further investigate the effect of the different neighbourhoods on the performance of the algorithm, we designed an experiment in which the full algorithm (i.e., equipped with all six neighbourhood structures) was compared against variants with perturbation functions suppressed from the pool of available movements. We focussed on removing at most two perturbations from the pool, resulting in 21 algorithm variants (6 without a single perturbation, and 15 without two perturbations) which enabled the quantification of contributions of neighbourhood structures to the algorithm's ability to navigate the performance landscape of this particular problem class.
Given that we wanted to investigate the effects of removing these neighbourhood functions from the full algorithm, an all vs. one design was be the most appropriate, resulting in a total of 21 hypotheses to be tested. Also, the variability in instance hardness suggested the use of percent differences (see Section 3.2.2), which quantify differences in a scale that is usually more consistent across heterogeneous instances.

Experimental parameters
We set the MRES at a (reasonably high) value of d * = 0.5, i.e., we were interested in detecting variations in the mean of percent differences of at least half a standard deviation. We decided to design our experiment with a mean power π * = 0.8, under a familywise significance level α = 0.05 using Holm's correction. To prevent excess variability in the estimates of within-instance differences from inflating the overall residual variance of the experiment, we set the desired accuracy for the estimates as se * = 0.05, i.e., a 5% error in the estimates of paired percent differences on each instance. This relatively good accuracy can also be useful if we later want to further investigate the differences on individual instances, or even instance subgroups.
The method described in Section 4 yielded N = 57 instances as the smallest amount necessary to obtain an experiment with the desired properties. 9 At this point we could have opted to use any number of instances greater than or equal to this value, up to the full available set of 200 [58]. We opted to use only the suggested value of 57 not only because it was enough to guarantee the desired statistical properties for our experiment, but also to maintain a reserve of "untouched" tuning instances, which could be useful for further testing of eventual proposals of algorithmic improvement. 10 The A = 22 algorithms (full algorithm plus the 21 variants) were sampled on each instance under a maximum budget of 50A = 50 × 22 = 1100 runs per instance, using an initial sampling of n 0 = 10 runs/algorithm/instance. Figure 3 shows the standard errors obtained for all algorithm pairs, as well as the total sample sizes generated, for each instance used in this experiment. It is very clear that the proposed approach for calculating the number of runs was able to control the standard errors at the desired level of se * = 0.05 using, in most cases, a fraction of the total available computational budget. In only three instances was the available budget insufficient to bring all standard errors under the desired threshold, but even in these cases the resulting standard errors were controlled at relatively low values. These three particular instances all had J = 50 jobs (one with M = 15 and two with M = 25 machines), which suggests further investigation into the behaviour of the methods tested on this particular subset of instance sizes. variants, for all instances tested. The numbers at the top indicate the total number of runs performed on each instance, and the labels on the x-axis provide the instance identifiers. Notice that only in three cases the allocated budget was insufficient to reduce all standard errors below the preset limit of se * = 0.05.  Figure 4 shows the distribution of sample sizes for each algorithm involved in this experiment. As expected, the full algorithm received a reasonably large number of observations for the majority of instances, since it was involved in all comparisons. Another feature that becomes quite clear in this figure is that all versions in which the Task Move (TSK) neighbourhood function was suppressed also received much more observations, which suggests this neighbourhood as strongly influential and corroborates preliminary results obtained for Santos' SA on the UPMSP [51,45].

Experimental results
The results of the pairwise comparisons are shown in Figure 5, which provides the confidence intervals (adjusted for a familywise error rate of α = 0.05) for all comparisons between the full algorithm and the suppressed variants. As indicated by the relative importance attributed to the methods by the sampling method, the suppression of the TSK neighbourhood function led to strong degradations of performance, with mean percent losses around 50% or more. Also of interest in this table is the last column, which shows that the desired statistical properties of the experimentparticularly the ability to detect effects at or above approximately d * = 0.5 -were actually obtained in the designed experiment.
Of possibly equal interest in terms of understanding the contributing factors to algorithm performance is the fact that the algorithm did not experience significant performance losses after the removal of some of its neighbourhood structures. Table 1 presents the results of the t-tests 11 on the mean of paired percent differences of performance. Examining the bottom rows of this table, it shows that the suppression of Swap (SWP), Switch (SWT) and Shift (SHF), individually as well as in pairs, resulted in no significant differences in performance. The width of the associated confidence intervals also suggests that any effects that may have gone undetected must be rather small.
To further explore these initial insights a follow-up experiment was performed. The objective of this second analysis was to obtain a more precise estimate of the effect (or lack thereof) of removing these three neighbourhood functions, individually as well as in pairs, from the pool of movements available to the algorithm. Further, we also added the variant generated by simultaneously removing these three functions, which was not present in the first experiment. This means that this follow-up test set was employed for this follow-up experiment, which provides a mean power of π 0.85 to detect differences as small as d * = 0.25. This was calculated by deriving the d * × π * curve shown in Figure 6, using the same approach described in Section 4.   Figure 7 summarise the results obtained for the follow-up experiment. Notice that even with the increased sensitivity of the tests to the mean of percent differences for the variants that had the SWP, SHF and SWT neighbourhood functions removed from the pool of possible movements the tests failed to detect any statistically significant differences. This result, coupled with the narrow confidence intervals, indicates that any effect that these neighbourhood functions may be having on the performance of the algorithm are probably very small, if they exist at all. Based on these results, algorithm designers and developers could expand this exploration even further by, for instance, (i) focussing on the structures that are being explored by the remaining neighbourhood functions, so as to gain better understanding of the problem's landscape, or (ii) perform further exploration into instance subsets, to investigate whether the systematic lack of effect observed in this experiment also occurs, e.g., when conditioning the results on problem size. These further explorations are, however, outside the scope of the current paper.

Conclusions
In this work we expanded methodology first introduced in [15 Notice that no comparison yielded statistically significant results, and that the narrow confidence intervals suggest that even if the neighbourhoods tested have some effect on the algorithm performance it is likely to be of minor consequence.
problem class of interest. The two aspects of sample size estimation for the comparative performance experiments, namely the number of instances and the amount of runs allocated to each algorithm on each instance, were addressed in a statistically principled manner. The number of instances is determined based on the desired sensitivity (in terms of statistical power) for the detection of effects larger than a predefined minimally relevant effect size (MRES), under a given family-wise confidence level. Holm's step-down procedure was employed to control the rate of false positives while enabling the design of experiments at desired levels of power for the best, worst, median or mean case. The proposed methodology is based on the assumptions that underlie the t-tests, but the results can be easily generalised to the most common non-parametric tests (e.g., Wilcoxon signed-ranks test) based on asymptotic relative efficiency, as discussed previously in [15].
The proposed iterative sampling method used to determine the number of runs of each algorithm on each instance is based on the interpretation of the standard error of estimation as a measure of precision, and generating samples aimed at minimising the total amount of runs necessary to obtain estimates with standard errors below predefined thresholds. Optimal sample size ratios were derived for simple and percent differences between the mean of all pairs of algorithms on any given instance, and a sampling heuristic was proposed that enables generalisation for other statistics that may be of eventual interest to researchers, such as the variance, algorithm rankings etc.
An example of application was provided in Section 5, using the publicly available R package CAISEr [14], which provides an easy to use implementation of the proposed methodology. The example focussed on the comparison of a state-of-the-art heuristic for the solution of a class of scheduling problems against 21 algorithm variants generated by suppressing different neighbourhood functions used to generate candidate solutions. The preliminary results indicated not only the most relevant neighbourhood function but also the apparent lack of effect of three of the six ones commonly employed. The first part of the experiment illustrated the ability of the proposed method to sample the algorithms so as to control the standard errors at each instance under the desired accuracy threshold. It also showcased the compliance between the MRES used for designing the experiment, and the effect sizes that the hypothesis tests were able to detect as statistically significant.
The results obtained in the first part of the experiment were used to design a follow-up investigation, in which the effects of the three neighbourhood structures flagged in the initial part as not statistically significant were further investigated. This follow-up experiment used the full available instance set, and was used to showcase the use of the proposed methodology in a fixed sample size situation. The results obtained not only corroborated those obtained in the first part, but further reinforced the tentative conclusion that the effect of using the three neighbourhood functions as part of the algorithm must be minor at best, which suggests a few possibilities of exploration for algorithm designers and developers.

Limitations and Possibilities
As in our previous work upon which the present paper was based, it is important to reinforce at that the proposed methodology is not the definitive way to compare algorithms. For instance, convergence analysis, algorithm reliability, and performance analysis conditional on problem characteristics are research questions that require different methodologies. The proposed methodology is simply an additional tool in the research arsenal, one that can provide answers to several common questions in the experimental comparison of algorithms.
One of the most straightforward extensions of the work presented in this paper is the incorporation of sequential analysis approaches [12,3] to enable further reductions in the required number of instances, which is possible in cases where the actual effect sizes are substantially larger than the MRES defined in the experimental design. Extending the sample size calculations performed here to Bayesian alternatives [13] is also relatively straightforward, and may allow the aggregation of existing knowledge into the comparison of algorithms, in the form of prior distributions. Bayesian methods may also allow for an easier incorporation of sequential analysis into the sample size calculation methodologies -since the incremental aggregation of observations in Bayesian inference does not require further significance corrections [41] -and may even enable the incorporation of existing published evidence, which would be another step towards the development of meta-analytical tools for algorithmic research.