Sample size estimation for power and accuracy in the experimental comparison of algorithms

Journal of Heuristics

Abstract

Experimental comparisons of performance represent an important aspect of research on optimization algorithms. In this work we present a methodology for defining the required sample sizes for designing experiments with desired statistical properties for the comparison of two methods on a given problem class. The proposed approach allows the experimenter to define desired levels of accuracy for estimates of mean performance differences on individual problem instances, as well as the desired statistical power for comparing mean performances over a problem class of interest. The method calculates the required number of problem instances, as well as the number of runs of each algorithm on each test instance, so that the accuracy of the estimated differences in performance is controlled at the predefined level. Two examples illustrate the application of the proposed method, and its ability to achieve the desired statistical properties with a methodologically sound definition of the relevant sample sizes.
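
As a rough illustration of the two sample-size decisions described above (this sketch is not the authors' procedure, which is implemented in the CAISEr package cited in the references), the base-R fragment below first computes the number of instances needed for a paired comparison at an illustrative target power of 0.8 and standardized effect size of 0.5, and then keeps adding runs of two placeholder algorithms on a single instance until the standard error of the estimated mean difference falls below a chosen threshold. The functions run_algo1 and run_algo2 and all numerical settings are assumptions made only for this example.

    ## Illustrative sketch only; not the method proposed in the paper.

    # (i) Number of instances: smallest n giving power 0.8 to detect a
    #     standardized mean difference of 0.5 with a paired t-test at alpha = 0.05.
    n_instances <- ceiling(power.t.test(delta = 0.5, sd = 1, sig.level = 0.05,
                                        power = 0.8, type = "paired")$n)

    # (ii) Runs on one instance: sample both (placeholder) algorithms until the
    #      standard error of the estimated mean difference is at most se_max.
    estimate_difference <- function(run_algo1, run_algo2,
                                    se_max = 0.05, n0 = 20, n_max = 500) {
      x1 <- replicate(n0, run_algo1())   # initial runs of algorithm 1
      x2 <- replicate(n0, run_algo2())   # initial runs of algorithm 2
      repeat {
        se <- sqrt(var(x1) / length(x1) + var(x2) / length(x2))
        if (se <= se_max || length(x1) + length(x2) >= n_max) break
        x1 <- c(x1, run_algo1())   # naive even split of extra runs;
        x2 <- c(x2, run_algo2())   # a simplification made for this sketch
      }
      list(d_hat = mean(x2) - mean(x1), se = se,
           n1 = length(x1), n2 = length(x2))
    }

    # Usage with dummy "algorithms" returning random performance values:
    set.seed(1234)
    estimate_difference(function() rnorm(1, mean = 10.0, sd = 0.5),
                        function() rnorm(1, mean = 10.3, sd = 0.5))

Base R's power.t.test is used here only because it is universally available; it is not part of the proposed methodology.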

Notes

  1. Throughout this work we refer to an algorithm as the full structure with specific parameter values, i.e., to a completely defined instantiation of a given algorithmic framework.

  2. \(\varGamma \) can either be explicitly known or implicitly defined as a hypothetical set for which some available test instances can be considered a representative sample.

  3. The direction of the inequalities in (2) will depend on the type of performance measure used, i.e., on whether larger values are better or vice versa.

  4. The definition of an initial value of \(n_0\) also helps to increase the probability that the sampling distributions of the means will be approximately normal.

  5. Considering a comparison where larger is better.

  6. If this assumption cannot be guaranteed, the use of percent differences is not advisable, and the researcher should instead perform comparisons using the simple differences.

  7. The independence between \(\phi _j^{(1)}\) and \(\bar{X}_{1j}\) is guaranteed as long as \(X_{1j}\) and \(X_{2j}\) are independent; a sketch of these quantities is provided after these notes.

  8. Using inferential tests on \(\mathbf {y}_B\) is not good practice, as the number of resamples can be made arbitrarily large, which would artificially inflate the degrees of freedom of any such test.

  9. Notice that it is relatively common for the normality assumption to be violated in the original data but to hold under transformations such as the log or square root. The topic of data transformations is, however, outside the scope of this manuscript.

  10. It is, however, very common in the literature on the experimental comparison of algorithms to ignore the fact that Wilcoxon’s signed-ranks test works under the assumption of symmetry.

  11. There are other ways to calculate the sample size for the binomial sign test that are less conservative, but for the sake of brevity this will not be discussed here.

  12. While in this particular example the required computational budget for exhausting all available instances would not be prohibitive, limitations to the number of instances that can reasonably be employed in an experiment can be much more severe when researching, for instance, heuristics for optimizing numerical models in engineering applications, or other expensive optimization scenarios (Tenne and Goh 2010). The present example was inspired in part by the authors’ past experience with such problems.

  13. The full replication script for this experiment is available in the Vignette “Adapting Algorithms for CAISEr” of the CAISEr package (Campelo and Takahashi 2017).

  14. More specifically: \(\widehat{se}_{\widehat{\phi }_j} = 0.0518\) for UF5 (28); \(\widehat{se}_{\widehat{\phi }_j} = 0.0544\) for UF3 (29); and \(\widehat{se}_{\widehat{\phi }_j} = 0.0536\) for UF5 (17).

  15. The instance files can be retrieved from http://soa.iti.es/problem-instances

  16. The source codes used for this experiment can be retrieved from http://github.com/andremaravilha/upmsp-scheduling.

  17. The graphical analysis of the residuals did not suggest substantial deviations from normality. The results table and residual analysis are provided in the Supplemental Materials.
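
The quantities discussed in Notes 6, 7 and 14 can be made concrete with a short reconstruction. Assuming, as the notation \(\phi _j^{(1)}\) suggests, that the percent difference on instance \(j\) is taken relative to the mean performance of algorithm 1, and using a first-order (delta-method) approximation for the standard error of a ratio of independent sample means, one would write

\[ \widehat{\phi }_j = \frac{\bar{X}_{2j} - \bar{X}_{1j}}{\bar{X}_{1j}} = \frac{\bar{X}_{2j}}{\bar{X}_{1j}} - 1, \qquad \widehat{se}_{\widehat{\phi }_j} \approx \left| \frac{\bar{X}_{2j}}{\bar{X}_{1j}} \right| \sqrt{\frac{s_{1j}^2}{n_{1j}\bar{X}_{1j}^2} + \frac{s_{2j}^2}{n_{2j}\bar{X}_{2j}^2}}, \]

where \(\bar{X}_{ij}\), \(s_{ij}^2\) and \(n_{ij}\) are the sample mean, sample variance and number of runs of algorithm \(i\) on instance \(j\). For the simple difference the corresponding estimate is \(\bar{X}_{2j} - \bar{X}_{1j}\), with standard error \(\sqrt{s_{1j}^2/n_{1j} + s_{2j}^2/n_{2j}}\). This also motivates the fallback recommended in Note 6: ratio-based estimates become unreliable when the denominator \(\bar{X}_{1j}\) is not bounded away from zero, whereas the simple difference is unaffected. The expressions above are only a sketch under the stated assumptions; the exact formulation used in the paper may differ (the reference list includes Fieller (1954) and Franz (2007), which deal with inference for ratios).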

References

  • Barr, R.S., Golden, B.L., Kelly, J.P., Resende, M.G.C., Stewart, W.R.: Designing and reporting on computational experiments with heuristic methods. J. Heuristics 1(1), 9–32 (1995)

  • Bartroff, J., Lai, T., Shih, M.C.: Sequential Experimentation in Clinical Trials: Design and Analysis. Springer, Berlin (2013)

  • Bartz-Beielstein, T.: New Experimentalism Applied to Evolutionary Computation. Ph.D. thesis, Universität Dortmund, Germany (2005)

  • Bartz-Beielstein, T.: Experimental Research in Evolutionary Computation. Springer, Berlin (2006)

  • Bartz-Beielstein, T., Chiarandini, M., Paquete, L., Preuss, M.: Experimental Methods for the Analysis of Optimization Algorithms. Springer, Berlin (2010)

  • Bausell, R., Li, Y.F.: Power analysis for experimental research: a practical guide for the biological, medical and social sciences. Cambridge University Press, Cambridge (2006)

  • Benavoli, A., Corani, G., Mangili, F., Zaffalon, M., Ruggeri, F.: A Bayesian Wilcoxon signed-rank test based on the Dirichlet process. In: 31st International Conference on Machine Learning, pp. 1026–1034 (2014)

  • Birattari, M.: On the estimation of the expected performance of a metaheuristic on a class of instances: how many instances, how many runs? Tech. Rep. IRIDIA/2004-001, Université Libre de Bruxelles, Belgium (2004)

  • Birattari, M.: Tuning Metaheuristics. A Machine Learning Perspective. Springer, Berlin Heidelberg (2009)

  • Birattari, M., Dorigo, M.: How to assess and report the performance of a stochastic algorithm on a benchmark problem: mean or best result on a number of runs? Optim. Lett. 1, 309–311 (2007)

  • Botella, J., Ximénez, C., Revuelta, J., Suero, M.: Optimization of sample size in controlled experiments: The CLAST rule. Behav. Res. Methods 38(1), 65–76 (2006)

  • Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap, 1st edn. Chapman and Hall, Boca Raton (1994)

  • Campelo, F., Batista, L.S., Aranha, C.: The MOEADr package—a component-based framework for multiobjective evolutionary algorithms based on decomposition. J. Stat. Softw. (2018). arXiv:1807.06731

  • Campelo, F., Takahashi, F.: CAISEr: Comparison of Algorithms with Iterative Sample Size Estimation (2017). https://CRAN.R-project.org/package=CAISEr

  • Carrano, E.G., Wanner, E.F., Takahashi, R.H.C.: A multicriteria statistical based comparison methodology for evaluating evolutionary algorithms. IEEE Trans. Evol. Comput. 15(6), 848–870 (2011)

  • Chow, S.C., Wang, H., Shao, J.: Sample Size Calculations in Clinical Research. CRC Press, Boca Raton (2003)

  • Coffin, M., Saltzman, M.J.: Statistical analysis of computational tests of algorithms and heuristics. INFORMS J. Comput. 12(1), 24–44 (2000)

  • Crawley, M.: The R Book, 2nd edn. Wiley, Hoboken (2013)

  • Czarn, A., MacNish, C., Vijayan, K., Turlach, B.: Statistical exploratory analysis of genetic algorithms: the importance of interaction. In: Proceedings of the 2004 IEEE Congress on Evolutionary Computation. Institute of Electrical and Electronics Engineers (IEEE) (2004)

  • Davison, A.C., Hinkley, D.V.: Bootstrap Methods and Their Application. Cambridge University Press, Cambridge (1997)

  • del Amo, I.G., Pelta, D.A., González, J.R., Masegosa, A.D.: An algorithm comparison for dynamic optimization problems. Appl. Soft Comput. 12(10), 3176–3192 (2012)

  • Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)

  • Derrac, J., García, S., Hui, S., Suganthan, P.N., Herrera, F.: Analyzing convergence performance of evolutionary algorithms: a statistical approach. Inf. Sci. 289, 41–58 (2014)

  • Derrac, J., García, S., Molina, D., Herrera, F.: A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evol. Comput. 1(1), 3–18 (2011)

  • Eiben, A., Jelasity, M.: A critical note on experimental research methodology in EC. In: Proceedings of the 2002 IEEE Congress on Evolutionary Computation. Institute of Electrical & Electronics Engineers (IEEE) (2002)

  • Fieller, E.C.: Some problems in interval estimation. J. R. Stat. Soc. Ser. B (Methodological) 16(2), 175–185 (1954)

  • Franz, V.: Ratios: A short guide to confidence limits and proper use (2007). https://arxiv.org/pdf/0710.2024v1.pdf

  • García, S., Fernández, A., Luengo, J., Herrera, F.: A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability. Soft Comput. 13(10), 959–977 (2009)

  • García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Inf. Sci. 180(10), 2044–2064 (2010)

  • García, S., Molina, D., Lozano, M., Herrera, F.: A study on the use of non-parametric tests for analyzing the evolutionary algorithms’ behaviour: a case study on the CEC’2005 Special session on real parameter optimization. J. Heuristics 15(6), 617–644 (2008)

  • Grissom, R.J., Kim, J.J.: Effect Sizes for Research, 2nd edn. Routledge, Abington (2012)

  • Hansen, N., Auger, A., Mersmann, O., Tusar, T., Brockhoff, D.: COCO: A platform for comparing continuous optimizers in a black-box setting. CoRR arXiv:1603.08785 (2016)

  • Hansen, N., Tušar, T., Mersmann, O., Auger, A., Brockhoff, D.: COCO: The experimental procedure (2016). arXiv:1603.08776

  • Hooker, J.N.: Testing heuristics: We have it all wrong. J. Heuristics 1(1), 33–42 (1996)

  • Hurlbert, S.H.: Pseudoreplication and the design of ecological field experiments. Ecol. Monogr. 54(2), 187–211 (1984)

  • Jain, R.K.: The Art of Computer Systems Performance Analysis. Wiley, Hoboken (1991)

  • Johnson, D.: A theoretician’s guide to the experimental analysis of algorithms. In: Goldwasser, M., Johnson, D., McGeoch, C. (eds.) Data Structures, Near Neighbor Searches, and Methodology: Fifth and Sixth DIMACS Implementation Challenges, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 59, pp. 215–250. American Mathematical Society, Providence (2002)

  • Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. J. Glob. Optim. 13(4), 455–492 (1998)

  • Krohling, R.A., Lourenzutti, R., Campos, M.: Ranking and comparing evolutionary algorithms with Hellinger-TOPSIS. Appl. Soft Comput. 37, 217–226 (2015)

  • Kruschke, J.K.: Doing Bayesian Data Analysis: A Tutorial with R and BUGS, 1st edn. Academic Press Inc., Cambridge (2010)

  • Lazic, S.E.: The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis? BMC Neurosci. 11(5), 397–407 (2010)

  • Lenth, R.V.: Some practical guidelines for effective sample size determination. Am Stat. 55(3), 187–193 (2001)

  • Li, H., Zhang, Q.: Multiobjective optimization problems with complicated Pareto sets, MOEA/D and NSGA-II. IEEE Trans. Evol. Comput. 13(2), 284–302 (2009)

  • Mathews, P.: Sample Size Calculations: Practical Methods for Engineers and Scientists, 1st edn. Matthews Malnar & Bailey Inc., Fairport Harbor (2010)

  • McGeoch, C.C.: Feature article—toward an experimental method for algorithm simulation. INFORMS J. Comput. 8(1), 1–15 (1996)

  • Montgomery, D.C., Runger, G.C.: Applied Statistics and Probability for Engineers, 6th edn. Wiley, Hoboken (2013)

  • Mori, T., Sato, Y., Adriano, R., Igarashi, H.: Optimal design of RF energy harvesting device using genetic algorithm. Sens. Imag. 16(1), 14 (2015)

  • Nuzzo, R.: Scientific method: statistical errors. Nature 506(7487), 150–152 (2014)

  • R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2017). https://www.R-project.org/

  • Ridge, E.: Design of Experiments for the Tuning of Optimisation Algorithms. Ph.D. thesis, The University of York, UK (2007)

  • Santos, H.G., Toffolo, T.A., Silva, C.L., Berghe, G.V.: Analysis of stochastic local search methods for the unrelated parallel machine scheduling problem. Int. Trans. Oper. Res. (2016). https://doi.org/10.1111/itor.12316

  • Sawilowsky, S.S.: New effect size rules of thumb. J. Mod. Appl. Stat. Methods 8(2), 597–599 (2009)

  • Shaffer, J.P.: Multiple hypothesis testing. Ann. Rev. Psychol. 46(1), 561–584 (1995)

  • Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures. Taylor & Francis, Abingdon (2011)

  • Tenne, Y., Goh, C.K.: Computational Intelligence in Expensive Optimization Problems. Springer, Berlin (2010)

  • Vallada, E., Ruiz, R.: A genetic algorithm for the unrelated parallel machine scheduling problem with sequence dependent setup times. Eur. J. Oper. Res. 211(3), 612–622 (2011)

  • Yuan, B., Gallagher, M.: Statistical racing techniques for improved empirical evaluation of evolutionary algorithms. Parallel Problem Solving From Nature - PPSN VIII 3242, 172–181 (2004)

  • Yuan, B., Gallagher, M.: An improved small-sample statistical test for comparing the success rates of evolutionary algorithms. In: Proceedings of the 11th Annual conference on Genetic and evolutionary computation—GECCO09. Association for Computing Machinery (ACM) (2009)

  • Zhang, Q., Li, H.: MOEA/D: A multiobjective evolutionary algorithm based on decomposition. IEEE Trans. Evol. Comput. 11(6), 712–731 (2007)

  • Zhang, Q., Zhou, A., Zhao, S., Suganthan, P., Liu, W., Tiwari, S.: Multiobjective optimization test instances for the CEC 2009 special session and competition. Tech. Rep. CES-887, University of Essex (2008). http://dces.essex.ac.uk/staff/zhang/moeacompetition09.htm. (Revised on 20/04/2009)

  • Zitzler, E., Thiele, L., Laumanns, M., Fonseca, C., Fonseca, V.: Performance assessment of multiobjective optimizers: an analysis and review. IEEE Trans. Evol. Comput. 7(2), 117–132 (2003)

Author information

Corresponding author

Correspondence to Felipe Campelo.

Additional information

Felipe Campelo has worked under grants from Brazilian agencies FAPEMIG (APQ-01099-16) and CNPq (404988/2016-4). Fernanda Takahashi has been funded by a Ph.D. scholarship from Brazilian agency CAPES. The source code for Experiment 2 was kindly provided by Dr. André Maravilha, ORCS Lab, UFMG, Brazil.

Electronic supplementary material

Cite this article

Campelo, F., Takahashi, F. Sample size estimation for power and accuracy in the experimental comparison of algorithms. J Heuristics 25, 305–338 (2019). https://doi.org/10.1007/s10732-018-9396-7
