Sample size estimation for power and accuracy in the experimental comparison of algorithms

Abstract

Experimental comparisons of performance represent an important aspect of research on optimization algorithms. In this work we present a methodology for defining the required sample sizes for designing experiments with desired statistical properties for the comparison of two methods on a given problem class. The proposed approach allows the experimenter to define desired levels of accuracy for estimates of mean performance differences on individual problem instances, as well as the desired statistical power for comparing mean performances over a problem class of interest. The method calculates the required number of problem instances, as well as the number of runs of the algorithms on each test instance, so that the accuracy of the estimated differences in performance is controlled at the predefined level. Two examples illustrate the application of the proposed method, and its ability to achieve the desired statistical properties with a methodologically sound definition of the relevant sample sizes.
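
As a rough illustration of the two sample size questions addressed by the methodology, the base-R sketch below computes (i) the number of problem instances needed to detect a given standardized mean difference with a paired t-test at prescribed significance and power, and (ii) a naive run-allocation loop that keeps sampling an instance until the standard error of the estimated mean difference falls below a threshold. This is only a simplified sketch under those assumptions: the full method described in the paper and implemented in the CAISEr R package is more general and should be preferred, and nothing here reproduces its exact algorithm. The functions run.algo1 and run.algo2, and all numeric values, are illustrative placeholders.

    # (i) Number of instances for a paired comparison at the desired power
    # (illustrative values: smallest effect of interest d* = 0.5, alpha = 0.05,
    #  power = 0.8).
    n.instances <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05,
                                power = 0.8, type = "paired")$n
    ceiling(n.instances)  # round up to the next whole instance

    # (ii) Number of runs on a single instance: keep sampling until the
    # standard error of the estimated difference in means is at most se.max.
    sample_instance <- function(run.algo1, run.algo2, se.max = 0.05,
                                n0 = 20, nmax = 200) {
      x1 <- replicate(n0, run.algo1())   # initial runs of algorithm 1
      x2 <- replicate(n0, run.algo2())   # initial runs of algorithm 2
      repeat {
        se <- sqrt(var(x1) / length(x1) + var(x2) / length(x2))
        if (se <= se.max || (length(x1) + length(x2)) >= nmax) break
        x1 <- c(x1, run.algo1())         # naive allocation: one extra run each
        x2 <- c(x2, run.algo2())
      }
      c(diff = mean(x2) - mean(x1), se = se)
    }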

Notes

  1. Throughout this work we refer to an algorithm as the full structure with specific parameter values, i.e., to a completely defined instantiation of a given algorithmic framework.

  2. \(\varGamma \) can either be explicitly known or implicitly defined as a hypothetical set for which some available test instances can be considered a representative sample.

  3. The direction of the inequalities in (2) will depend on the type of performance measure used, i.e., on whether larger = better or vice versa.

  4. The definition of an initial value of \(n_0\) also helps to increase the probability that the sampling distributions of the means will be approximately normal (see the first sketch after these notes).

  5. Considering a comparison where larger is better.

  6. If this assumption cannot be guaranteed, the use of percent differences is not advisable, and the researcher should instead perform the comparisons using the simple differences (see the second sketch after these notes).

  7. The independence between \(\phi _j^{(1)}\) and \(\bar{X}_{1j}\) is guaranteed as long as \(X_{1j}\) and \(X_{2j}\) are independent.

  8. Using inferential tests on \(\mathbf {y}_B\) is not good practice, as the number of resamples can be made arbitrarily large, which would artificially inflate the degrees of freedom of any such test (see the third sketch after these notes).

  9. Notice that it is relatively common for the normality assumption to be violated in the original data, but valid under transformations such as log or square root (see the fourth sketch after these notes). The topic of data transformations is, however, outside the scope of this manuscript.

  10. Although it is very common in the literature on the experimental comparison of algorithms to ignore the fact that Wilcoxon’s signed-ranks test works under the assumption of symmetry.

  11. There are other ways to calculate the sample size for the binomial sign test that are less conservative, but for the sake of brevity these will not be discussed here (see the fifth sketch after these notes).

  12. While in this particular example the required computational budget for exhausting all available instances would not be unattainable, limitations to the number of instances that can be reasonably employed in an experiment can be much more severe when researching, for instance, heuristics for optimizing numerical models in engineering applications, or other expensive optimization scenarios (Tenne and Goh 2010). The present example was inspired in part by the authors’ past experience with such problems.

  13. The full replication script for this experiment is available in the Vignette “Adapting Algorithms for CAISEr” of the CAISEr package (Campelo and Takahashi 2017).

  14. More specifically: \(\widehat{se}_{\widehat{\phi }_j} = 0.0518\) for UF5 (28); \(\widehat{se}_{\widehat{\phi }_j} = 0.0544\) for UF3 (29); and \(\widehat{se}_{\widehat{\phi }_j} = 0.0536\) for UF5 (17).

  15. The instance files can be retrieved from http://soa.iti.es/problem-instances

  16. The source codes used for this experiment can be retrieved from http://github.com/andremaravilha/upmsp-scheduling.

  17. The graphical analysis of the residuals did not suggest substantial deviations from normality. The results table and residual analysis are provided in the Supplemental Materials.
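
First sketch (Note 4): a small base-R simulation, with purely illustrative numbers, showing why an initial number of runs \(n_0\) helps the sampling distribution of the mean performance on an instance to be approximately normal even when individual run results are skewed.

    # Skewed run results (exponential), yet the distribution of the mean of
    # n0 = 20 runs is already close to normal, as the histogram suggests.
    set.seed(42)
    n0    <- 20
    xbars <- replicate(5000, mean(rexp(n0, rate = 1)))
    hist(xbars, breaks = 50,
         main = "Sampling distribution of the mean performance (n0 = 20)")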
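
Second sketch (Note 6): simple versus percent (relative) differences in mean performance on a single instance. The data and the choice of algorithm 1 as the reference for the percent difference are assumptions made for illustration only; percent differences are only meaningful when all performance values share the same sign.

    # Observed performance values of the two algorithms on one instance
    # (illustrative data).
    x1 <- c(102.3, 98.7, 101.5, 99.9)
    x2 <- c(95.1, 97.4, 96.8, 94.2)
    simple.diff  <- mean(x2) - mean(x1)               # estimated mu2 - mu1
    percent.diff <- (mean(x2) - mean(x1)) / mean(x1)  # relative to algorithm 1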
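
Third sketch (Note 8): bootstrapping the difference in mean performance on an instance. The bootstrap vector \(\mathbf {y}_B\) is summarised by a standard error or confidence interval; it is not fed to an inferential test, precisely because the number of resamples B is arbitrary. Data are illustrative.

    set.seed(1234)
    x1 <- rnorm(30, mean = 10, sd = 2)   # runs of algorithm 1 on the instance
    x2 <- rnorm(30, mean = 11, sd = 2)   # runs of algorithm 2 on the instance
    B  <- 999
    yB <- replicate(B, mean(sample(x2, replace = TRUE)) -
                       mean(sample(x1, replace = TRUE)))
    sd(yB)                        # bootstrap standard error of the difference
    quantile(yB, c(0.025, 0.975)) # percentile confidence interval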
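
Fourth sketch (Note 9): a quick illustration, on simulated lognormal data, of how a normality check can fail in the original scale yet pass after a log transformation.

    set.seed(7)
    x <- rlnorm(40, meanlog = 0, sdlog = 0.8)  # right-skewed raw data
    shapiro.test(x)        # tends to reject normality for skewed raw data
    shapiro.test(log(x))   # log-transformed data are normal by construction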
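
Fifth sketch (Note 11): one common conservative sample size calculation for the binomial sign test, based on the normal approximation with the null proportion fixed at 0.5. The formula and the example value of p1 (the probability, under the alternative, that a randomly chosen instance favours the better algorithm) are standard textbook choices and not necessarily the exact calculation used in the paper.

    n_sign_test <- function(p1, sig.level = 0.05, power = 0.8) {
      za <- qnorm(1 - sig.level / 2)   # two-sided critical value
      zb <- qnorm(power)
      ceiling((za * 0.5 + zb * sqrt(p1 * (1 - p1)))^2 / (p1 - 0.5)^2)
    }
    n_sign_test(p1 = 0.7)   # about 47 instances under these settings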

References

  1. Barr, R.S., Golden, B.L., Kelly, J.P., Resende, M.G.C., Stewart, W.R.: Designing and reporting on computational experiments with heuristic methods. J. Heuristics 1(1), 9–32 (1995)

  2. Bartroff, J., Lai, T., Shih, M.C.: Sequential Experimentation in Clinical Trials: Design and Analysis. Springer, Berlin (2013)

  3. Bartz-Beielstein, T.: New Experimentalism Applied to Evolutionary Computation. Ph.D. thesis, Universität Dortmund, Germany (2005)

  4. Bartz-Beielstein, T.: Experimental Research in Evolutionary Computation. Springer, Berlin (2006)

  5. Bartz-Beielstein, T., Chiarandini, M., Paquete, L., Preuss, M.: Experimental Methods for the Analysis of Optimization Algorithms. Springer, Berlin (2010)

  6. Bausell, R., Li, Y.F.: Power Analysis for Experimental Research: A Practical Guide for the Biological, Medical and Social Sciences. Cambridge University Press, Cambridge (2006)

  7. Benavoli, A., Corani, G., Mangili, F., Zaffalon, M., Ruggeri, F.: A Bayesian Wilcoxon signed-rank test based on the Dirichlet process. In: 30th International Conference on Machine Learning, pp. 1026–1034 (2014)

  8. Birattari, M.: On the estimation of the expected performance of a metaheuristic on a class of instances: how many instances, how many runs? Tech. Rep. IRIDIA/2004-001, Université Libre de Bruxelles, Belgium (2004)

  9. Birattari, M.: Tuning Metaheuristics: A Machine Learning Perspective. Springer, Berlin (2009)

  10. Birattari, M., Dorigo, M.: How to assess and report the performance of a stochastic algorithm on a benchmark problem: mean or best result on a number of runs? Optim. Lett. 1, 309–311 (2007)

  11. Botella, J., Ximénez, C., Revuelta, J., Suero, M.: Optimization of sample size in controlled experiments: The CLAST rule. Behav. Res. Methods 38(1), 65–76 (2006)

  12. Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap, 1st edn. Chapman and Hall, Boca Raton (1994)

  13. Campelo, F., Batista, L.S., Aranha, C.: The MOEADr package—a component-based framework for multiobjective evolutionary algorithms based on decomposition. J. Stat. Softw. (2018). arXiv:1807.06731

  14. Campelo, F., Takahashi, F.: CAISEr: Comparison of Algorithms with Iterative Sample Size Estimation (2017). https://CRAN.R-project.org/package=CAISEr

  15. Carrano, E.G., Wanner, E.F., Takahashi, R.H.C.: A multicriteria statistical based comparison methodology for evaluating evolutionary algorithms. IEEE Trans. Evol. Comput. 15(6), 848–870 (2011)

  16. Chow, S.C., Wang, H., Shao, J.: Sample Size Calculations in Clinical Research. CRC Press, Boca Raton (2003)

  17. Coffin, M., Saltzman, M.J.: Statistical analysis of computational tests of algorithms and heuristics. INFORMS J. Comput. 12(1), 24–44 (2000)

  18. Crawley, M.: The R Book, 2nd edn. Wiley, Hoboken (2013)

  19. Czarn, A., MacNish, C., Vijayan, K., Turlach, B.: Statistical exploratory analysis of genetic algorithms: the importance of interaction. In: Proceedings of the 2004 IEEE Congress on Evolutionary Computation. Institute of Electrical and Electronics Engineers (IEEE) (2004)

  20. Davison, A.C., Hinkley, D.V.: Bootstrap Methods and Their Application. Cambridge University Press, Cambridge (1997)

  21. del Amo, I.G., Pelta, D.A., González, J.R., Masegosa, A.D.: An algorithm comparison for dynamic optimization problems. Appl. Soft Comput. 12(10), 3176–3192 (2012)

  22. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)

  23. Derrac, J., García, S., Hui, S., Suganthan, P.N., Herrera, F.: Analyzing convergence performance of evolutionary algorithms: a statistical approach. Inf. Sci. 289, 41–58 (2014)

  24. Derrac, J., García, S., Molina, D., Herrera, F.: A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evol. Comput. 1(1), 3–18 (2011)

  25. Eiben, A., Jelasity, M.: A critical note on experimental research methodology in EC. In: Proceedings of the 2002 IEEE Congress on Evolutionary Computation. Institute of Electrical & Electronics Engineers (IEEE) (2002)

  26. Fieller, E.C.: Some problems in interval estimation. J. R. Stat. Soc. Ser. B (Methodological) 16(2), 175–185 (1954)

  27. Franz, V.: Ratios: A short guide to confidence limits and proper use (2007). https://arxiv.org/pdf/0710.2024v1.pdf

  28. García, S., Fernández, A., Luengo, J., Herrera, F.: A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability. Soft Comput. 13(10), 959–977 (2009)

  29. García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Inf. Sci. 180(10), 2044–2064 (2010)

  30. García, S., Molina, D., Lozano, M., Herrera, F.: A study on the use of non-parametric tests for analyzing the evolutionary algorithms’ behaviour: a case study on the CEC’2005 Special session on real parameter optimization. J. Heuristics 15(6), 617–644 (2008)

  31. Grissom, R.J., Kim, J.J.: Effect Sizes for Research, 2nd edn. Routledge, Abington (2012)

  32. Hansen, N., Auger, A., Mersmann, O., Tusar, T., Brockhoff, D.: COCO: A platform for comparing continuous optimizers in a black-box setting. CoRR arXiv:1603.08785 (2016)

  33. Hansen, N., Tušar, T., Mersmann, O., Auger, A., Brockhoff, D.: COCO: The experimental procedure (2016). arXiv:1603.08776

  34. Hooker, J.N.: Testing heuristics: We have it all wrong. J. Heuristics 1(1), 33–42 (1996)

  35. Hurlbert, S.H.: Pseudoreplication and the design of ecological field experiments. Ecol. Monogr. 54(2), 187–211 (1984)

  36. Jain, R.K.: The Art of Computer Systems Performance Analysis. Wiley, Hoboken (1991)

  37. Johnson, D.: A theoretician’s guide to the experimental analysis of algorithms. In: Goldwasser, M., Johnson, D., McGeoch, C. (eds.) Data Structures, Near Neighbor Searches, and Methodology: Fifth and Sixth DIMACS Implementation Challenges, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 59, pp. 215–250. American Mathematical Society, Providence (2002)

  38. Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. J. Glob. Optim. 13(4), 455–492 (1998)

  39. Krohling, R.A., Lourenzutti, R., Campos, M.: Ranking and comparing evolutionary algorithms with Hellinger-TOPSIS. Appl. Soft Comput. 37, 217–226 (2015)

  40. Kruschke, J.K.: Doing Bayesian Data Analysis: A Tutorial with R and BUGS, 1st edn. Academic Press Inc., Cambridge (2010)

  41. Lazic, S.E.: The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis? BMC Neurosci. 11(5), 397–407 (2010)

  42. Lenth, R.V.: Some practical guidelines for effective sample size determination. Am. Stat. 55(3), 187–193 (2001)

  43. Li, H., Zhang, Q.: Multiobjective optimization problems with complicated Pareto sets, MOEA/D and NSGA-II. IEEE Trans. Evol. Comput. 13(2), 284–302 (2009)

  44. Mathews, P.: Sample Size Calculations: Practical Methods for Engineers and Scientists, 1st edn. Matthews Malnar & Bailey Inc., Fairport Harbor (2010)

  45. McGeoch, C.C.: Feature article—toward an experimental method for algorithm simulation. INFORMS J. Comput. 8(1), 1–15 (1996)

  46. Montgomery, D.C., Runger, G.C.: Applied Statistics and Probability for Engineers, 6th edn. Wiley, Hoboken (2013)

  47. Mori, T., Sato, Y., Adriano, R., Igarashi, H.: Optimal design of RF energy harvesting device using genetic algorithm. Sens. Imag. 16(1), 14 (2015)

  48. Nuzzo, R.: Scientific method: statistical errors. Nature 506(7487), 150–152 (2014)

  49. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2017). https://www.R-project.org/

  50. Ridge, E.: Design of Experiments for the Tuning of Optimisation Algorithms. Ph.D. thesis, The University of York, UK (2007)

  51. Santos, H.G., Toffolo, T.A., Silva, C.L., Berghe, G.V.: Analysis of stochastic local search methods for the unrelated parallel machine scheduling problem. Int. Trans. Oper. Res. (2016). https://doi.org/10.1111/itor.12316

  52. Sawilowsky, S.S.: New effect size rules of thumb. J. Mod. Appl. Stat. Methods 8(2), 597–599 (2009)

  53. Shaffer, J.P.: Multiple hypothesis testing. Ann. Rev. Psychol. 46(1), 561–584 (1995)

  54. Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures. Taylor & Francis, Abingdon (2011)

  55. Tenne, Y., Goh, C.K.: Computational Intelligence in Expensive Optimization Problems. Springer, Berlin (2010)

  56. Vallada, E., Ruiz, R.: A genetic algorithm for the unrelated parallel machine scheduling problem with sequence dependent setup times. Eur. J. Oper. Res. 211(3), 612–622 (2011)

  57. Yuan, B., Gallagher, M.: Statistical racing techniques for improved empirical evaluation of evolutionary algorithms. In: Parallel Problem Solving from Nature (PPSN VIII), Lecture Notes in Computer Science, vol. 3242, pp. 172–181. Springer, Berlin (2004)

  58. Yuan, B., Gallagher, M.: An improved small-sample statistical test for comparing the success rates of evolutionary algorithms. In: Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation (GECCO’09). Association for Computing Machinery (ACM) (2009)

  59. Zhang, Q., Li, H.: MOEA/D: A multiobjective evolutionary algorithm based on decomposition. IEEE Trans. Evol. Comput. 11(6), 712–731 (2007)

  60. Zhang, Q., Zhou, A., Zhao, S., Suganthan, P., Liu, W., Tiwari, S.: Multiobjective optimization test instances for the CEC 2009 special session and competition. Tech. Rep. CES-887, University of Essex (2008). http://dces.essex.ac.uk/staff/zhang/moeacompetition09.htm. (Revised on 20/04/2009)

  61. Zitzler, E., Thiele, L., Laumanns, M., Fonseca, C., Fonseca, V.: Performance assessment of multiobjective optimizers: an analysis and review. IEEE Trans. Evol. Comput. 7(2), 117–132 (2003)

Author information

Corresponding author

Correspondence to Felipe Campelo.

Additional information

Felipe Campelo has worked under grants from Brazilian agencies FAPEMIG (APQ-01099-16) and CNPq (404988/2016-4). Fernanda Takahashi has been funded by a Ph.D. scholarship from Brazilian agency CAPES. The source code for Experiment 2 was kindly provided by Dr. André Maravilha, ORCS Lab, UFMG, Brazil.

About this article

Cite this article

Campelo, F., Takahashi, F. Sample size estimation for power and accuracy in the experimental comparison of algorithms. J Heuristics 25, 305–338 (2019). https://doi.org/10.1007/s10732-018-9396-7

Keywords

  • Experimental comparison of algorithms
  • Statistical methods
  • Sample size estimation
  • Accuracy of parameter estimation
  • Iterative sampling