Better software analytics via “DUO”: Data mining algorithms using/used-by optimizers

Empirical Software Engineering


This paper claims that a new field of empirical software engineering research and practice is emerging: data mining using/used-by optimizers for empirical studies, or DUO. For example, data miners can generate models that are explored by optimizers. Also, optimizers can advise how to best adjust the control parameters of a data miner. This combined approach acts like an agent leaning over the shoulder of an analyst that advises “ask this question next” or “ignore that problem, it is not relevant to your goals”. Further, those agents can help us build “better” predictive models, where “better” can be either greater predictive accuracy or faster modeling time (which, in turn, enables the exploration of a wider range of options). We also caution that the era of papers that just use data miners is coming to an end. Results obtained from an unoptimized data miner can be quickly refuted, just by applying an optimizer to produce a different (and better performing) model. Our conclusion, hence, is that for software analytics it is possible, useful and necessary to combine data mining and optimization using DUO.
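The idea that an optimizer can adjust the control parameters of a data miner can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the data miner is a one-feature k-nearest-neighbor classifier, the optimizer is plain random search over k, and the data set is synthetic.

```python
import random

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training rows."""
    nearest = sorted(train, key=lambda row: abs(row[0] - query))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0

def error_rate(train, valid, k):
    """Fraction of held-out rows that the k-NN data miner misclassifies."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in valid)
    return wrong / len(valid)

def random_search(train, valid, default_k=1, budget=10, seed=1):
    """Optimizer: try random values of k and keep the one with the lowest error."""
    rng = random.Random(seed)
    best_k, best_err = default_k, error_rate(train, valid, default_k)
    for _ in range(budget):
        k = rng.randrange(1, 16, 2)  # odd k in 1..15 avoids tied votes
        err = error_rate(train, valid, k)
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err

def make_data(n, rng):
    """Hypothetical noisy data: label is 1 when x > 0.5, flipped 20% of the time."""
    rows = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.2:
            y = 1 - y
        rows.append((x, y))
    return rows

rng = random.Random(0)
train, valid = make_data(200, rng), make_data(100, rng)
untuned = error_rate(train, valid, 1)        # the unoptimized data miner
tuned_k, tuned = random_search(train, valid)  # the same miner, tuned by the optimizer
print(f"default k=1 error={untuned:.2f}; tuned k={tuned_k} error={tuned:.2f}")
```

Because the search starts from the default setting, the tuned error can never exceed the untuned one on the tuning data. A real study would also score the tuned model on a separate test set.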


  1. This definition has been generalized with respect to Boyd and Vandenberghe (2004), not to be restricted to continuous optimization problems, where

  2. The optimization variable is usually identified by the symbol x, and the inequality and equality constraints are frequently identified by the symbols g and h in the optimization literature. However, we use the symbols a, g and \(g^{\prime \prime }\) here to avoid confusion with the terminology used in data mining, which is introduced later in this section.

  3. Any maximization problem can be re-written as a minimization problem.

  4. Using a process called “engineering judgement”; i.e. guessing.

  5., accessed 30 November 2018.

  6. Total time to process 20 repeated runs across multiple subsets of the data, for multiple data sets.

  7. In “order effects experiments”, the training data is re-arranged at random before running the learner again. In such experiments, a result is “unstable” if the learned model changes just by re-ordering the training data.

  8. The Gini index measures class diversity after a set of examples is divided by some criteria – in this case, the values of an attribute.

  9. The distance of a predictor’s performance to the “utopia” point of recall = 1, false alarms = 0.

  10. E.g. in Python: scikit-learn and DEAP (Pedregosa et al. 2011; Rainville et al. 2012). E.g. in Java: Weka and (jMetal or SMAC) (Hall et al. 2009; Durillo and Nebro 2011; Hutter et al. 2011).
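The measures described in footnotes 8 and 9 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the Gini index is computed as the usual impurity 1 − Σ p², weighted by group size after the split, and the distance to utopia is taken as an unnormalized Euclidean distance. The function names and the toy rows are hypothetical.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_after_split(rows, attribute):
    """Size-weighted Gini of the class labels after dividing rows by an attribute's values."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row["class"])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())

def distance_to_utopia(recall, false_alarm):
    """Euclidean distance to the point recall=1, false_alarm=0 (smaller is better)."""
    return math.sqrt((1 - recall) ** 2 + false_alarm ** 2)

# Hypothetical toy data: four modules described by one attribute and a class label.
rows = [
    {"size": "big", "class": "buggy"},
    {"size": "big", "class": "buggy"},
    {"size": "small", "class": "clean"},
    {"size": "small", "class": "buggy"},
]
before = gini([row["class"] for row in rows])
after = gini_after_split(rows, "size")
print(before, after, distance_to_utopia(recall=0.8, false_alarm=0.3))
```

Splitting on an attribute is useful when the weighted Gini after the split is lower than before it, as it is for the toy rows here.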


  • Abdessalem RB, Nejati S, Briand LC, Stifter T (2018) Testing vision-based control systems using learnable evolutionary algorithms. In: Proceedings of the 40th International Conference on Software Engineering, ICSE ’18. ACM, New York, pp 1016–1026.

  • Afzal W, Torkar R (2011) On the application of genetic programming for software engineering predictive modeling: a systematic review. Expert Syst Appl 38(9):11984–11997

  • Agrawal A, Fu W, Menzies T (2018a) What is wrong with topic modeling? and how to fix it using search-based software engineering. Inf Softw Technol 98:74–88

  • Agrawal A, Menzies T (2018b) Is better data better than better data miners?: on the benefits of tuning smote for defect prediction. In: Proceedings of the 40th International Conference on Software Engineering. ACM, pp 1050–1061

  • Ali MH, Al Mohammed BAD, Ismail A, Zolkipli MF (2018) A new intrusion detection system based on fast learning network and particle swarm optimization. IEEE Access 6:20255–20261

  • Allamanis M, Barr ET, Devanbu P, Sutton C (2018) A survey of machine learning for big code and naturalness. ACM Comput Surv (CSUR) 51(4):81

  • Anderson-Cook CM (2005) Practical genetic algorithms

  • Banzhaf W, Nordin P, Keller RE, Francone FD (1998) Genetic programming: an introduction, vol 1. Morgan Kaufmann, San Francisco

  • Barua A, Thomas SW, Hassan AE (2012) What are developers talking about? an analysis of topics and trends in stack overflow. Empir Softw Eng 19:619–654

  • Bird C, Menzies T, Zimmermann T (eds) (2015) The Art and Science of Analyzing Software Data. Morgan Kaufmann, Boston.

  • Bishop C (2006) Pattern recognition and machine learning. Springer, Berlin

  • Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3(Jan):993–1022

  • Boehm B, Clark B, Horowitz E, Westland C, Madachy R, Selby R (1995) Cost models for future software life cycle processes: Cocomo 2.0. Annals of software engineering

  • Boyd SP, Vandenberghe L (2004) Section 4.1 – optimization problems. In: Convex optimization. Cambridge University Press, Cambridge

  • Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth and Brooks, Monterey

  • Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140

  • Catal C, Diri B (2009) Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem. Inf Sci 179(8):1040–1058

  • Chand S, Wagner M (2015) Evolutionary many-objective optimization: a quick-start guide. Surv Oper Res Manag Sci 20(2):35–42.

  • Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) Smote: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357

  • Chen J, Nair V, Menzies T (2017) Beyond evolutionary algorithms for search-based software engineering. Information and Software Technology

  • Chen D, Fu W, Krishna R, Menzies T (2018a) Applications of psychological science for actionable analytics. In: ESEC/SIGSOFT FSE

  • Chen J, Nair V, Krishna R, Menzies T (2018b) “Sampling” as a baseline optimizer for search-based software engineering. IEEE Transactions on Software Engineering

  • Chen J, Nair V, Menzies T (2018c) Beyond evolutionary algorithms for search-based software engineering. Inf Softw Technol 95:281–294

  • Chiu NH, Huang SJ (2007) The adjusted analogy-based software effort estimation based on similarity distances. J Syst Softw 80(4):628–640

  • Clarke J, Dolado JJ, Harman M, Hierons R, Jones B, Lumkin M, Mitchell B, Mancoridis S, Rees K, Roper M et al (2003) Reformulating software engineering as a search problem. IEE Proc-Softw 150(3):161–175

  • Cohen WW (1995) Fast effective rule induction. In: Machine Learning Proceedings 1995. Elsevier, pp 115–123

  • De Carvalho AB, Pozo A, Vergilio SR (2010) A symbolic fault-prediction model based on multiobjective particle swarm optimization. J Syst Softw 83(5):868–882

  • Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: Nsga-ii. IEEE Trans Evol Comput 6(2):182–197.

  • Deng L, Yu D et al (2014) Deep learning: methods and applications. Found Trends® Signal Process 7(3–4):197–387

  • del Sagrado J, Águila IM, Orellana FJ (2011) Requirements interaction in the next release problem. In: Proceedings of the 13th annual conference companion on Genetic and evolutionary computation. ACM, pp 241–242

  • Du X, Yao X, Ni Y, Minku L, Ye P, Xiao R (2015) An evolutionary algorithm for performance optimization at software architecture level. In: 2015 IEEE congress on Evolutionary computation (CEC). IEEE, pp 2129–2136

  • Durillo JJ, Nebro AJ (2011) jmetal: A java framework for multi-objective optimization. Adv Eng Softw 42:760–771.

  • Fayyad U, Irani K (1993) Multi-interval discretization of continuous-valued attributes for classification learning

  • Feather M, Menzies T (2002) Converging on the optimal attainment of requirements. In: 2002. Proceedings. IEEE joint international conference on Requirements engineering. IEEE, pp 263–270

  • Fishburn PC (1991) Nontransitive preferences in decision theory. J Risk Uncertain 4(2):113–134.

  • Frank E, Trigg L, Holmes G, Witten IH (2000) Technical note: Naive bayes for regression. Mach Learn 41(1):5–25.

  • Freund Y, Schapire RE et al (1996) Experiments with a new boosting algorithm. In: Icml, vol 96. Citeseer, pp 148–156

  • Friedrich T, Göbel A, Quinzan F, Wagner M (2018a) Heavy-tailed mutation operators in single-objective combinatorial optimization. In: Auger A, Fonseca CM, Lourenço N, Machado P, Paquete L, Whitley D (eds) Parallel problem solving from nature – PPSN XV. Springer International Publishing, Cham, pp 134–145

  • Friedrich T, Quinzan F, Wagner M (2018b) Escaping large deceptive basins of attraction with heavy-tailed mutation operators. In: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’18. ACM, New York, pp 293–300.

  • Fu W, Menzies T, Shen X (2016a) Tuning for software analytics: is it really necessary? Inf Softw Technol 76:135–146

  • Fu W, Menzies T, Shen X (2016b) Tuning for software analytics: is it really necessary? Inf Softw Technol 76:135–146

  • Fu W, Nair V, Menzies T (2016c) Why is differential evolution better than grid search for tuning defect predictors? arXiv:1609.02613

  • Fu W, Menzies T (2017) Easy over hard: a case study on deep learning. In: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, pp 49–60

  • Fu W, Menzies T, Chen D, Agrawal A (2018) Building better quality predictors using “𝜖-dominance”. arXiv:1803.04608

  • Ghotra B, McIntosh S, Hassan AE (2015) Revisiting the impact of classification techniques on the performance of defect prediction models. In: 2015 IEEE/ACM 37th IEEE international conference on software engineering, vol 1, pp 789–800

  • Glover F, Laguna M (1998) Tabu search. In: Handbook of combinatorial optimization. Springer, pp 2093–2229

  • Gondra I (2008) Applying machine learning to software fault-proneness prediction. J Syst Softw 81(2):186–195

  • Hall MA, Holmes G (2003) Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans Knowl Data Eng 15(6):1437–1447

  • Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The weka data mining software: An update. SIGKDD Explor Newsl 11(1):10–18.

  • Hall T, Beecham S, Bowes D, Gray D, Counsell S (2012) A systematic literature review on fault prediction performance in software engineering. IEEE Trans Softw Eng 38(6):1276–1304

  • Harman M, Jones BF (2001) Search-based software engineering. Inf Softw Technol 43(14):833–839

  • Harman M, Mansouri SA, Zhang Y (2012) Search-based software engineering: trends, techniques and applications. ACM Comput Surv (CSUR) 45(1):11

  • Hellendoorn VJ, Devanbu PT, Alipour MA (2018) On the naturalness of proofs. In: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, pp 724–728

  • Henard C, Papadakis M, Harman M, Le Traon Y (2015) Combining multi-objective search and constraint solving for configuring large software product lines. In: International conference on software engineering

  • Huang VL, Suganthan PN, Qin AK, Baskar S (2005) Multiobjective differential evolution with external archive and harmonic distance-based diversity measure. Technical Report, School of Electrical and Electronic Engineering, Nanyang Technological University

  • Huang SJ, Chiu NH (2006) Optimization of analogy weights by genetic algorithm for software effort estimation. Inf Softw Technol 48(11):1034–1045

  • Huang SJ, Chiu NH, Chen LW (2008) Integration of the grey relational analysis with genetic algorithm for software effort estimation. Eur J Oper Res 188(3):898–909

  • Hutter F, Hoos HH, Leyton-Brown K (2011) Sequential model-based optimization for general algorithm configuration. In: International conference on learning and intelligent optimization. Springer, pp 507–523

  • Jensen IH (2019) Naturalness of software: Science and applications, by prem devanbu

  • Jolliffe I (2011) Principal component analysis. In: International encyclopedia of statistical science. Springer, pp 1094–1096

  • Kamei Y, Fukushima T, McIntosh S, Yamashita K, Ubayashi N, Hassan AE (2016) Studying just-in-time defect prediction using cross-project models. Empir Softw Eng 21(5):2072–2106

  • Kessentini M, Ruhe G (2016) A guest editorial: special section on search-based software engineering. Empir Softw Eng 21(6):2456–2458.

  • Kotthoff L (2016) Algorithm selection for combinatorial search problems: a survey. In: Data mining and constraint programming. Springer, pp 149–190

  • Koza JR (1994) Genetic programming as a means for programming computers by natural selection. Stat Comput 4(2):87–112

  • Krall J, Menzies T, Davies M (2015) Gale: Geometric active learning for search-based software engineering. IEEE Trans Softw Eng 41(10):1001–1018

  • Krishna R, Menzies T (2018) Bellwethers: A baseline method for transfer learning. IEEE Transactions on Software Engineering

  • Krishna R, Menzies T (2019) Bellwethers: a baseline method for transfer learning. IEEE Trans Softw Eng 45(11):1081–1105

  • Kuhn M (2008) Building predictive models in r using the caret package. Journal of Statistical Software. Articles 28(5):1–26.

  • Kumar KV, Ravi V, Carr M, Kiran NR (2008) Software development cost estimation using wavelet neural networks. J Syst Softw 81(11):1853–1867

  • Kwiatkowska M, Norman G, Parker D (2011) Prism 4.0: Verification of probabilistic real-time systems. In: International conference on computer aided verification. Springer, pp 585–591

  • Lessmann S, Baesens B, Mues C, Pietsch S (2008) Benchmarking classification models for software defect prediction: a proposed framework and novel findings. IEEE Trans Softw Eng 34(4):485–496.

  • Liaw A, Wiener M et al (2002) Classification and regression by randomforest. R news 2(3):18–22

  • Liu Y, Khoshgoftaar TM, Seliya N (2010) Evolutionary optimization of software quality modeling with multiple repositories. IEEE Trans Softw Eng 36(6):852–864

  • Majumder S, Balaji N, Brey K, Fu W, Menzies T (2018) 500+ times faster than deep learning (a case study exploring faster methods for text mining stackoverflow). arXiv:1802.05319

  • Menzies T, Elrawas O, Hihn J, Feather M, Madachy R, Boehm B (2007) The business case for automated software engineering. In: Proceedings of the Twenty-second IEEE/ACM International Conference on Automated Software Engineering, ASE ’07. ACM, New York, pp 303–312.

  • Menzies T, Kocagüneli E, Minku L, Peters F, Turhan B (2013a) Data science for software engineering: Sharing data and models

  • Menzies T, Zimmermann T (2013b) Software analytics: so what? IEEE Softw 4:31–37

  • Menzies T, Williams L, Zimmermann T (2016) Perspectives on data science for software engineering. Morgan Kaufmann, Boston

  • Menzies T, Zimmermann T (2018) Software analytics: What’s next? IEEE Softw 35(5):64–70.

  • Menzies T, Shepperd M (2019) ‘bad smells’ in software analytics papers. Inf Softw Technol 112:35–47

  • Minku LL, Yao X (2013a) Software effort estimation as a multiobjective learning problem. ACM Trans Softw Eng Methodol. 22(4).

  • Minku L, Yao X (2013b) An analysis of multi-objective evolutionary algorithms for training ensemble models based on different performance measures in software effort estimation. In: Proceedings of the 9th international conference on predictive models in software engineering. ACM, p 8

  • Minku L, Yao X (2013c) Software effort estimation as a multiobjective learning problem. ACM Trans Softw Eng Methodol (TOSEM) 22(4):35

  • Minku L, Yao X (2014) How to make best use of cross-company data in software effort estimation?. In: ICSE. Hyderabad, pp 446–456

  • Minku L, Yao X (2017) Which models of the past are relevant to the present? a software effort estimation approach to exploiting useful past models. Autom Softw Eng J 24(7):499–542

  • Montañez GD (2013) Bounding the number of favorable functions in stochastic search. In: 2013 IEEE Congress on evolutionary computation, pp 3019–3026.

  • Mori T, Uchihira N (2018) Balancing the trade-off between accuracy and interpretability in software defect prediction. Empirical Software Engineering.

  • Nair V, Menzies T, Siegmund N, Apel S (2017) Using bad learners to find good configurations. arXiv:1702.05701

  • Nair V, Agrawal A, Chen J, Fu W, Mathew G, Menzies T, Minku L, Wagner M, Yu Z (2018a) Data-driven search-based software engineering. In: Proceedings of the 15th International Conference on Mining Software Repositories, MSR ’18. ACM, New York, pp 341–352.

  • Nair V, Krishna R, Menzies T, Jamshidi P (2018b) Transfer learning with bellwethers to find good configurations. arXiv:1803.03900

  • Nair V, Yu Z, Menzies T, Siegmund N, Apel S (2018c) Finding faster configurations using Flash. arXiv:1801.02175

  • Neshat M, Alexander B, Wagner M, Xia Y (2018) A detailed comparison of meta-heuristic methods for optimising wave energy converter placements. In: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’18. ACM, New York, pp 1318–1325.

  • Oliveira AL, Braga PL, Lima RM, Cornélio ML (2010) GA-based method for feature selection and parameters optimization for machine learning regression applied to software effort estimation. Inf Softw Technol 52(11):1155–1166

  • Panichella A, Dit B, Oliveto R, Di Penta M, Poshyvanyk D, De Lucia A (2013) How to effectively use topic models for software engineering tasks? an approach based on genetic algorithms. In: Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, pp 522–531

  • Pareto V (1906) Manuale di economia politica, vol 13. Societa Editrice

  • Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830

  • Peters F, Menzies T, Layman L (2015) Lace2: Better privacy-preserving data sharing for cross project defect prediction. In: 2015 IEEE/ACM 37th IEEE international conference on software engineering, vol 1. IEEE, pp 801–811

  • Pohl R, Lauenroth K, Pohl K (2011) A performance comparison of contemporary algorithmic approaches for automated analysis operations on feature models. In: 2011 26th IEEE/ACM international conference on automated software engineering (ASE 2011), pp 313–322.

  • Poli R, Kennedy J, Blackwell T (2007) Particle swarm optimization. Swarm Intell 1(1):33–57

  • Polikar R (2006) Ensemble based systems in decision making. IEEE Circ Syst Mag 6(3):21–45

  • Quinlan JR (1992) Learning with continuous classes. In: Proceedings AI’92. World Scientific, pp 343–348

  • Rainville D, Fortin FA, Gardner MA, Parizeau M, Gagné C et al (2012) Deap: a python framework for evolutionary algorithms. In: Proceedings of the 14th annual conference companion on Genetic and evolutionary computation. ACM, pp 85–92

  • Riffenburgh RH (1957) Linear discriminant analysis. Ph.D. thesis, Virginia Polytechnic Institute

  • Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323–2326

  • Ryu D, Choi O, Baik J (2016) Value-cognitive boosting with a support vector machine for cross-project defect prediction. Empir Softw Eng 21(1):43–71

  • Saber T, Brevet D, Botterweck G, Ventresque A (2017) Is seeding a good strategy in multi-objective feature selection when feature models evolve? Information and Software Technology

  • Sadiq AS, Alkazemi B, Mirjalili S, Ahmed N, Khan S, Ali I, Pathan ASK, Ghafoor KZ (2018) An efficient ids using hybrid magnetic swarm optimization in wanets. IEEE Access 6:29041–29053

  • Sarro F, Di Martino S, Ferrucci F, Gravino C (2012a) A further analysis on the use of genetic algorithm to configure support vector machines for inter-release fault prediction. In: Proceedings of the 27th annual ACM symposium on applied computing. ACM, pp 1215–1220

  • Sarro F, Ferrucci F, Gravino C (2012b) Single and multi objective genetic programming for software development effort estimation. In: Proceedings of the 27th Annual ACM Symposium on Applied Computing, SAC ’12. ACM, New York, pp 1221–1226.

  • Sarro F, Petrozziello A, Harman M (2016) Multi-objective software effort estimation. In: 2016 IEEE/ACM 38th international conference on Software engineering (ICSE). IEEE, pp 619–630

  • Sayyad AS, Ingram J, Menzies T, Ammar H (2013) Scalable product line configuration: a straw to break the camel’s back. In: 2013 28th IEEE/ACM international conference on automated software engineering (ASE), pp 465–474

  • Sayyad AS, Menzies T, Ammar H (2013) On the value of user preferences in search-based software engineering: a case study in software product lines. In: Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, pp 492–501

  • Shen X, Minku L, Marturi N, Guo YN, Han Y (2018) A q-learning-based memetic algorithm for multi-objective dynamic software project scheduling. Inf Sci 428:1–29.

  • Steinwart I, Christmann A (2008) Support vector machines. Springer Science & Business Media

  • Storn R, Price K (1997) Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces. J Glob Optim 11(4):341–359.

  • Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2016) Automated parameter optimization of classification techniques for defect prediction models. In: 2016 IEEE/ACM 38th international conference on Software engineering (ICSE). IEEE, pp 321–332

  • Treude C, Wagner M (2019) Predicting good configurations for github and stack overflow topic models. In: Proceedings of the 16th International Conference on Mining Software Repositories, MSR ’19. IEEE Press, Piscataway, pp 84–95.

  • Tu H, Nair V (2018) Is one hyperparameter optimizer enough? In: SWAN 2018

  • van Gerven M, Bohte S (2018) Artificial neural networks as models of neural information processing. Frontiers Media, SA

  • Vandecruys O, Martens D, Baesens B, Mues C, De Backer M, Haesen R (2008) Mining software repositories for comprehensible software fault prediction models. J Syst Softw 81(5):823–839

  • Veerappa V, Letier E (2011) Understanding clusters of optimal solutions in multi-objective decision problems. In: 2011 IEEE 19th international requirements engineering conference, pp 89–98.

  • Wagner M, Minku L, Hassan AE, Clark J (2017) NII Shonan Meeting #2017-19: Data-driven search-based software engineering. Available online at Tech. Rep. 2017-19, NII Shonan Meeting Report

  • Wang T, Harman M, Jia Y, Krinke J (2013) Searching for better configurations: a rigorous approach to clone evaluation. In: Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. ACM, pp 455–465

  • Weise T, Wu Z, Wagner M (2019) An improved generic bet-and-run strategy for speeding up stochastic local search. arXiv:1806.08984 (2018). Accepted for publication at AAAI

  • Wolpert DH (1996) The lack of a priori distinctions between learning algorithms. Neural Comput 8(7):1341–1390.

  • Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE Trans Evol Comput 1(1):67–82.

  • Wu X, Consoli P, Minku L, Ochoa G, Yao X, Paechter B (2016) An evolutionary hyper-heuristic for the software project scheduling problem. In: Handl J, Hart E, Lewis PR, López-Ibáñez M, Ochoa G (eds) Parallel problem solving from nature – PPSN XIV. Springer, Cham, pp 37–47

  • Xia T, Krishna R, Chen J, Mathew G, Shen X, Menzies T (2018) Hyperparameter optimization for effort estimation. arXiv:1805.00336

  • Xu T, Jin L, Fan X, Zhou Y, Pasupathy S, Talwadker R (2015) Hey, you have given me too many knobs!: Understanding and dealing with over-designed configuration in system software. In: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015. ACM, New York, pp 307–319.

  • Xu B, Ye D, Xing Z, Xia X, Chen G, Li S (2016) Predicting semantically linkable knowledge in developer online forums via convolutional neural network. In: 2016 31st IEEE/ACM international conference on automated software engineering (ASE), pp 51–62

  • Yu Z, Kraft NA, Menzies T (2018) Finding better active learners for faster literature reviews. Empir Softw Eng 23(6):3161–3186

  • Zhang Q, Li H (2007) Moea/d: a multiobjective evolutionary algorithm based on decomposition. IEEE Trans Evol Comput 11(6):712–731

  • Zhang F, Zheng Q, Zou Y, Hassan AE (2016) Cross-project defect prediction using a connectivity-based unsupervised classifier. In: Proceedings of the 38th International Conference on Software Engineering. ACM, pp 309–320

  • Zhong S, Khoshgoftaar TM, Seliya N (2004) Analyzing software measurement data with clustering techniques. IEEE Intell Syst 19(2):20–27

  • Zitzler E, Künzli S (2004) Indicator-based selection in multiobjective search. In: PPSN

  • Zuluaga M, Krause A, Sergent G, Püschel M (2013) Active learning for multi-objective optimization. In: Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML’13, pp I–462–I–470.


Earlier work ultimately leading to the present one was inspired by the NII Shonan Meeting on Data-Driven Search-based Software Engineering (December 11-14, 2017). We thank the organizers of that workshop (Markus Wagner, Leandro L. Minku, Ahmed E. Hassan, and John Clark) for their academic leadership and inspiration. Dr Menzies’ work was partially supported by NSF grant No. 1703487. Dr Minku’s work was partially supported by EPSRC grant Nos. EP/R006660/1 and EP/R006660/2. Dr Wagner’s work was partially supported by the ARC grant DE160100850.

Author information

Corresponding author

Correspondence to Leandro L. Minku.

Additional information

Communicated by: Yasutaka Kamei and Andy Zaidman

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Agrawal, A., Menzies, T., Minku, L.L. et al. Better software analytics via “DUO”: Data mining algorithms using/used-by optimizers. Empir Software Eng 25, 2099–2136 (2020).
