Abstract
Experimental comparisons of performance represent an important aspect of research on optimization algorithms. In this work we present a methodology for estimating the sample sizes required to design experiments with desired statistical properties for the comparison of two methods on a given problem class. The proposed approach allows the experimenter to define desired levels of accuracy for estimates of mean performance differences on individual problem instances, as well as the desired statistical power for comparing mean performances over a problem class of interest. The method calculates both the required number of problem instances and the number of runs of each algorithm on each test instance, so that the accuracy of the estimated differences in performance is controlled at the predefined level. Two examples illustrate the application of the proposed method and its ability to achieve the desired statistical properties with a methodologically sound definition of the relevant sample sizes.
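The run-level accuracy control described in the abstract can be illustrated with a minimal sketch. This is not the CAISEr implementation: the sequential loop, the stopping threshold `se_max`, the budget `n_max`, and the greedy allocation rule are illustrative assumptions, and the two Gaussian generators stand in for actual algorithm runs.

```python
import math
import random
import statistics

def sample_until_accurate(run_algo1, run_algo2, se_max=0.25, n0=15, n_max=400):
    """Keep sampling runs of two algorithms on one instance until the
    standard error of the estimated mean difference drops below se_max
    (or a total budget of n_max runs is spent)."""
    x1 = [run_algo1() for _ in range(n0)]
    x2 = [run_algo2() for _ in range(n0)]
    while True:
        v1, v2 = statistics.variance(x1), statistics.variance(x2)
        # Standard error of mean(x1) - mean(x2) for independent samples
        se = math.sqrt(v1 / len(x1) + v2 / len(x2))
        if se <= se_max or len(x1) + len(x2) >= n_max:
            break
        # Greedy rule: give the next run to the algorithm whose sample
        # currently contributes more to the standard error
        if v1 / len(x1) >= v2 / len(x2):
            x1.append(run_algo1())
        else:
            x2.append(run_algo2())
    return statistics.mean(x1) - statistics.mean(x2), se, len(x1), len(x2)

random.seed(42)
# Gaussian stand-ins for the (stochastic) performance of two algorithms
diff, se, n1, n2 = sample_until_accurate(lambda: random.gauss(10.0, 2.0),
                                         lambda: random.gauss(10.5, 1.0))
print(round(se, 3), n1, n2)
```

Note how the noisier algorithm ends up receiving the larger share of the run budget, which is the intuition behind controlling accuracy per instance rather than fixing an equal number of runs a priori.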
Notes
Throughout this work we refer to an algorithm as the full structure with specific parameter values, i.e., to a completely defined instantiation of a given algorithmic framework.
\(\varGamma \) can either be explicitly known or implicitly defined as a hypothetical set for which some available test instances can be considered a representative sample.
The direction of the inequalities in (2) will depend on the type of performance measure used, i.e., on whether larger = better or vice versa.
The definition of an initial value of \(n_0\) also helps to increase the probability that the sampling distributions of the means will be approximately normal.
Considering a comparison where larger is better.
If this assumption cannot be guaranteed, the use of percent differences is not advisable, and the researcher should instead perform comparisons using the simple differences.
The independence between \(\phi _j^{(1)}\) and \(\bar{X}_{1j}\) is guaranteed as long as \(X_{1j}\) and \(X_{2j}\) are independent.
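For concreteness, the per-instance percent difference in mean performance can be computed as in the short sketch below. The definition used here (difference relative to the mean of the first algorithm) and the run values are assumptions for illustration only.

```python
import statistics

def percent_difference(x1, x2):
    """Percent difference in mean performance of algorithm 2 relative
    to algorithm 1 on one instance. Assumes mean(x1) has a fixed sign
    and is bounded away from zero; otherwise, as noted above, simple
    differences should be used instead."""
    m1 = statistics.mean(x1)
    m2 = statistics.mean(x2)
    return (m2 - m1) / m1

x1 = [98.2, 101.5, 99.7, 100.4, 100.1]   # hypothetical run results, algo 1
x2 = [104.9, 106.2, 105.1, 105.8, 104.5]  # hypothetical run results, algo 2
print(round(percent_difference(x1, x2), 4))  # → 0.0532
```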
Using inferential tests on \(\mathbf {y}_B\) is not good practice, as the number of resamples can be made arbitrarily large, which would artificially inflate the degrees of freedom of any such test.
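A sound use of the bootstrap output is to summarize \(\mathbf {y}_B\) with quantiles, e.g., a percentile confidence interval, rather than feed it to an inferential test. A minimal sketch of this idea follows; the per-instance differences and the number of resamples are illustrative assumptions.

```python
import random
import statistics

random.seed(1)
# Hypothetical paired per-instance performance differences
d = [0.12, -0.05, 0.30, 0.18, 0.07, -0.02, 0.25, 0.11, 0.04, 0.21]

R = 2000  # number of bootstrap resamples (arbitrary, hence no test on them)
boot_means = []
for _ in range(R):
    resample = [random.choice(d) for _ in d]   # resample with replacement
    boot_means.append(statistics.mean(resample))

boot_means.sort()
# 95% percentile confidence interval for the mean difference: quantiles
# of the bootstrap distribution, NOT a t-test on boot_means (whose
# degrees of freedom would grow with the arbitrary choice of R)
lo = boot_means[int(0.025 * R)]
hi = boot_means[int(0.975 * R)]
print(round(lo, 3), round(hi, 3))
```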
Notice that it is relatively common for the normality assumption to be violated in the original data, but valid under transformations such as log or square root. The topic of data transformations is, however, outside the scope of this manuscript.
It is, however, very common in the literature on the experimental comparison of algorithms to ignore the fact that Wilcoxon’s signed-ranks test works under the assumption of symmetry.
There are other, less conservative ways to calculate the sample size for the binomial sign test, but for the sake of brevity these are not discussed here.
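One common conservative route, sketched below, is a normal-approximation sample-size formula for the two-sided sign test that uses the worst-case variance \(1/4\) under the null hypothesis. The function name, the default levels, and the target probability in the example are assumptions for this sketch, not values taken from the paper.

```python
import math
from statistics import NormalDist

def sign_test_n(p1, alpha=0.05, power=0.8):
    """Conservative sample size for a two-sided binomial sign test,
    via a normal approximation with worst-case variance 1/4 under H0.
    p1: probability of a positive difference under the alternative."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value under H0
    z_b = NormalDist().inv_cdf(power)           # quantile for desired power
    num = z_a * 0.5 + z_b * math.sqrt(p1 * (1 - p1))
    return math.ceil((num / (p1 - 0.5)) ** 2)

# e.g., detecting that one algorithm wins on 65% of instances:
print(sign_test_n(0.65))  # → 85
```

As expected, the required number of instances shrinks quickly as the effect size grows: `sign_test_n(0.7)` already drops to 47.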
While in this particular example the computational budget required to exhaust all available instances would be attainable, limitations on the number of instances that can reasonably be employed in an experiment can be much more severe when researching, for instance, heuristics for optimizing numerical models in engineering applications, or other expensive optimization scenarios (Tenne and Goh 2010). The present example was inspired in part by the authors’ past experience with such problems.
The full replication script for this experiment is available in the Vignette “Adapting Algorithms for CAISEr” of the CAISEr package (Campelo and Takahashi 2017).
More specifically: \(\widehat{se}_{\widehat{\phi }_j} = 0.0518\) for UF5 (28); \(\widehat{se}_{\widehat{\phi }_j} = 0.0544\) for UF3 (29); and \(\widehat{se}_{\widehat{\phi }_j} = 0.0536\) for UF5 (17).
The instance files can be retrieved from http://soa.iti.es/problem-instances
The source codes used for this experiment can be retrieved from http://github.com/andremaravilha/upmsp-scheduling.
The graphical analysis of the residuals did not suggest expressive deviations of normality. The results table and residual analysis are provided in the Supplemental Materials.
References
Barr, R.S., Golden, B.L., Kelly, J.P., Resende, M.G.C., Stewart, W.R.: Designing and reporting on computational experiments with heuristic methods. J. Heuristics 1(1), 9–32 (1995)
Bartroff, J., Lai, T., Shih, M.C.: Sequential Experimentation in Clinical Trials: Design and Analysis. Springer, Berlin (2013)
Bartz-Beielstein, T.: New Experimentalism Applied to Evolutionary Computation. Ph.D. thesis, Universität Dortmund, Germany (2005)
Bartz-Beielstein, T.: Experimental Research in Evolutionary Computation. Springer, Berlin (2006)
Bartz-Beielstein, T., Chiarandini, M., Paquete, L., Preuss, M.: Experimental Methods for the Analysis of Optimization Algorithms. Springer, Berlin (2010)
Bausell, R., Li, Y.F.: Power analysis for experimental research: a practical guide for the biological, medical and social sciences. Cambridge University Press, Cambridge (2006)
Benavoli, A., Corani, G., Mangili, F., Zaffalon, M., Ruggeri, F.: A Bayesian Wilcoxon signed-rank test based on the Dirichlet process. In: 30th International Conference on Machine Learning, pp. 1026–1034 (2014)
Birattari, M.: On the estimation of the expected performance of a metaheuristic on a class of instances: how many instances, how many runs? Tech. Rep. IRIDIA/2004-001, Université Libre de Bruxelles, Belgium (2004)
Birattari, M.: Tuning Metaheuristics: A Machine Learning Perspective. Springer, Berlin (2009)
Birattari, M., Dorigo, M.: How to assess and report the performance of a stochastic algorithm on a benchmark problem: mean or best result on a number of runs? Optim. Lett. 1, 309–311 (2007)
Botella, J., Ximénez, C., Revuelta, J., Suero, M.: Optimization of sample size in controlled experiments: The CLAST rule. Behav. Res. Methods 38(1), 65–76 (2006)
Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap, 1st edn. Chapman and Hall, Boca Raton (1994)
Campelo, F., Batista, L.S., Aranha, C.: The MOEADr package—a component-based framework for multiobjective evolutionary algorithms based on decomposition. J. Stat. Softw. (2018). arXiv:1807.06731
Campelo, F., Takahashi, F.: CAISEr: Comparison of Algorithms with Iterative Sample Size Estimation (2017). https://CRAN.R-project.org/package=CAISEr
Carrano, E.G., Wanner, E.F., Takahashi, R.H.C.: A multicriteria statistical based comparison methodology for evaluating evolutionary algorithms. IEEE Trans. Evol. Comput. 15(6), 848–870 (2011)
Chow, S.C., Wang, H., Shao, J.: Sample Size Calculations in Clinical Research. CRC Press, Boca Raton (2003)
Coffin, M., Saltzman, M.J.: Statistical analysis of computational tests of algorithms and heuristics. INFORMS J. Comput. 12(1), 24–44 (2000)
Crawley, M.: The R Book, 2nd edn. Wiley, Hoboken (2013)
Czarn, A., MacNish, C., Vijayan, K., Turlach, B.: Statistical exploratory analysis of genetic algorithms: the importance of interaction. In: Proceedings of the 2004 IEEE Congress on Evolutionary Computation. Institute of Electrical and Electronics Engineers (IEEE) (2004)
Davison, A.C., Hinkley, D.V.: Bootstrap Methods and Their Application. Cambridge University Press, Cambridge (1997)
del Amo, I.G., Pelta, D.A., González, J.R., Masegosa, A.D.: An algorithm comparison for dynamic optimization problems. Appl. Soft Comput. 12(10), 3176–3192 (2012)
Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
Derrac, J., García, S., Hui, S., Suganthan, P.N., Herrera, F.: Analyzing convergence performance of evolutionary algorithms: a statistical approach. Inf. Sci. 289, 41–58 (2014)
Derrac, J., García, S., Molina, D., Herrera, F.: A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evol. Comput. 1(1), 3–18 (2011)
Eiben, A., Jelasity, M.: A critical note on experimental research methodology in EC. In: Proceedings of the 2002 IEEE Congress on Evolutionary Computation. Institute of Electrical & Electronics Engineers (IEEE) (2002)
Fieller, E.C.: Some problems in interval estimation. J. R. Stat. Soc. Ser. B (Methodological) 16(2), 175–185 (1954)
Franz, V.: Ratios: A short guide to confidence limits and proper use (2007). https://arxiv.org/pdf/0710.2024v1.pdf
García, S., Fernández, A., Luengo, J., Herrera, F.: A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability. Soft Comput. 13(10), 959–977 (2009)
García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Inf. Sci. 180(10), 2044–2064 (2010)
García, S., Molina, D., Lozano, M., Herrera, F.: A study on the use of non-parametric tests for analyzing the evolutionary algorithms’ behaviour: a case study on the CEC’2005 Special session on real parameter optimization. J. Heuristics 15(6), 617–644 (2008)
Grissom, R.J., Kim, J.J.: Effect Sizes for Research, 2nd edn. Routledge, Abington (2012)
Hansen, N., Auger, A., Mersmann, O., Tusar, T., Brockhoff, D.: COCO: A platform for comparing continuous optimizers in a black-box setting. CoRR arXiv:1603.08785 (2016)
Hansen, N., Tušar, T., Mersmann, O., Auger, A., Brockhoff, D.: COCO: The experimental procedure (2016). arXiv:1603.08776
Hooker, J.N.: Testing heuristics: We have it all wrong. J. Heuristics 1(1), 33–42 (1996)
Hurlbert, S.H.: Pseudoreplication and the design of ecological field experiments. Ecol. Monogr. 54(2), 187–211 (1984)
Jain, R.K.: The Art of Computer Systems Performance Analysis. Wiley, Hoboken (1991)
Johnson, D.: A theoretician’s guide to the experimental analysis of algorithms. In: Goldwasser, M., Johnson, D., McGeoch, C. (eds.) Data Structures, Near Neighbor Searches, and Methodology: Fifth and Sixth DIMACS Implementation Challenges, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 59, pp. 215–250. American Mathematical Society, Providence (2002)
Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. J. Glob. Optim. 13(4), 455–492 (1998)
Krohling, R.A., Lourenzutti, R., Campos, M.: Ranking and comparing evolutionary algorithms with hellinger-TOPSIS. Appl. Soft Comput. 37, 217–226 (2015)
Kruschke, J.K.: Doing Bayesian Data Analysis: A Tutorial with R and BUGS, 1st edn. Academic Press Inc, Cambridge (2010)
Lazic, S.E.: The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis? BMC Neurosci. 11(5), 397–407 (2010)
Lenth, R.V.: Some practical guidelines for effective sample size determination. Am. Stat. 55(3), 187–193 (2001)
Li, H., Zhang, Q.: Multiobjective optimization problems with complicated Pareto sets, MOEA/D and NSGA-II. IEEE Trans. Evol. Comput. 13(2), 284–302 (2009)
Mathews, P.: Sample Size Calculations: Practical Methods for Engineers and Scientists, 1st edn. Matthews Malnar & Bailey Inc., Fairport Harbor (2010)
McGeoch, C.C.: Feature article—toward an experimental method for algorithm simulation. INFORMS J. Comput. 8(1), 1–15 (1996)
Montgomery, D.C., Runger, G.C.: Applied Statistics and Probability for Engineers, 6th edn. Wiley, Hoboken (2013)
Mori, T., Sato, Y., Adriano, R., Igarashi, H.: Optimal design of RF energy harvesting device using genetic algorithm. Sens. Imag. 16(1), 14 (2015)
Nuzzo, R.: Scientific method: statistical errors. Nature 506(7487), 150–152 (2014)
R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2017). https://www.R-project.org/
Ridge, E.: Design of Experiments for the Tuning of Optimisation Algorithms. Ph.D. thesis, The University of York, UK (2007)
Santos, H.G., Toffolo, T.A., Silva, C.L., Berghe, G.V.: Analysis of stochastic local search methods for the unrelated parallel machine scheduling problem. Int. Trans. Oper. Res. (2016). https://doi.org/10.1111/itor.12316
Sawilowsky, S.S.: New effect size rules of thumb. J. Mod. Appl. Stat. Methods 8(2), 597–599 (2009)
Shaffer, J.P.: Multiple hypothesis testing. Ann. Rev. Psychol. 46(1), 561–584 (1995)
Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures. Taylor & Francis, Abingdon (2011)
Tenne, Y., Goh, C.K.: Computational Intelligence in Expensive Optimization Problems. Springer, Berlin (2010)
Vallada, E., Ruiz, R.: A genetic algorithm for the unrelated parallel machine scheduling problem with sequence dependent setup times. Eur. J. Oper. Res. 211(3), 612–622 (2011)
Yuan, B., Gallagher, M.: Statistical racing techniques for improved empirical evaluation of evolutionary algorithms. Parallel Problem Solving From Nature - PPSN VIII 3242, 172–181 (2004)
Yuan, B., Gallagher, M.: An improved small-sample statistical test for comparing the success rates of evolutionary algorithms. In: Proceedings of the 11th Annual conference on Genetic and evolutionary computation—GECCO09. Association for Computing Machinery (ACM) (2009)
Zhang, Q., Li, H.: MOEA/D: A multiobjective evolutionary algorithm based on decomposition. IEEE Trans. Evol. Comput. 11(6), 712–731 (2007)
Zhang, Q., Zhou, A., Zhao, S., Suganthan, P., Liu, W., Tiwari, S.: Multiobjective optimization test instances for the CEC 2009 special session and competition. Tech. Rep. CES-887, University of Essex (2008). http://dces.essex.ac.uk/staff/zhang/moeacompetition09.htm. (Revised on 20/04/2009)
Zitzler, E., Thiele, L., Laumanns, M., Fonseca, C., Fonseca, V.: Performance assessment of multiobjective optimizers: an analysis and review. IEEE Trans. Evol. Comput. 7(2), 117–132 (2003)
Additional information
Felipe Campelo has worked under grants from Brazilian agencies FAPEMIG (APQ-01099-16) and CNPq (404988/2016-4). Fernanda Takahashi has been funded by a Ph.D. scholarship from Brazilian agency CAPES. The source code for Experiment 2 was kindly provided by Dr. André Maravilha, ORCS Lab, UFMG, Brazil.
Electronic supplementary material
Cite this article
Campelo, F., Takahashi, F. Sample size estimation for power and accuracy in the experimental comparison of algorithms. J Heuristics 25, 305–338 (2019). https://doi.org/10.1007/s10732-018-9396-7