Abstract
Logistic regression is a simple and efficient supervised learning algorithm for estimating the probability of an outcome or class variable. Despite its simplicity, it performs very well in a range of fields, and it is widely accepted because its results are easy to interpret. The model is usually fitted by maximum likelihood, and the Newton–Raphson algorithm is the most common numerical approach for obtaining the coefficients that maximize the likelihood of the data.
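As a rough illustration of the maximum-likelihood fit mentioned above, the following sketch applies Newton–Raphson to a logistic model with an intercept and a single predictor. It is not taken from the article: the function names (`sigmoid`, `log_likelihood`, `newton_raphson`) and the toy data `xs`, `ys` are assumptions made for this example, and the 2x2 linear system is solved in closed form to keep the code free of dependencies.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_likelihood(beta, xs, ys):
    """Bernoulli log-likelihood of (intercept, slope) for 0/1 labels."""
    b0, b1 = beta
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def newton_raphson(xs, ys, iters=25, tol=1e-10):
    """Maximize the log-likelihood with Newton-Raphson steps."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0              # gradient (score) components
        h00 = h01 = h11 = 0.0      # entries of X^T W X (negative Hessian)
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve (X^T W X) d = g for the 2x2 case, then step beta <- beta + d.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Made-up, non-separable toy data (illustrative only, not from the article).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = newton_raphson(xs, ys)
print(b0, b1, log_likelihood((b0, b1), xs, ys))
```

The classes in the toy data overlap on purpose: with perfectly separable data the maximum-likelihood estimate does not exist and the Newton–Raphson iterates diverge.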
This work presents a novel approach for fitting the logistic regression model based on estimation of distribution algorithms (EDAs), a tool for evolutionary computation. EDAs are suitable not only for maximizing the likelihood, but also for maximizing the area under the receiver operating characteristic curve (AUC).
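To make the idea concrete, here is a minimal sketch of a continuous EDA that searches for logistic regression coefficients maximizing the AUC. It assumes the simplest possible probabilistic model, an independent Gaussian per coefficient re-estimated from the best individuals of each generation; the abstract does not say which EDA variant the authors use, so this should not be read as their method. The names `auc`, `gaussian_eda`, `fitness` and the toy data are hypothetical. Because the logistic function is monotone, the AUC of the predicted probabilities equals the AUC of the linear score, and the intercept has no effect on it, so only the two slopes are searched.

```python
import random
import statistics

def auc(scores, labels):
    """AUC as the Mann-Whitney statistic: P(pos score > neg score), ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def gaussian_eda(fitness, dim, pop_size=60, n_select=20, generations=30, seed=0):
    """Univariate Gaussian EDA: sample, select the best, re-estimate mean and spread."""
    rng = random.Random(seed)
    mu = [0.0] * dim
    sigma = [2.0] * dim
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        pop = [[rng.gauss(mu[j], sigma[j]) for j in range(dim)]
               for _ in range(pop_size)]
        ranked = sorted(pop, key=fitness, reverse=True)
        top_fit = fitness(ranked[0])
        if top_fit > best_fit:
            best, best_fit = ranked[0], top_fit
        elite = ranked[:n_select]
        for j in range(dim):
            column = [ind[j] for ind in elite]
            mu[j] = statistics.mean(column)
            # Floor the spread so sampling never collapses to a single point.
            sigma[j] = max(statistics.pstdev(column), 1e-3)
    return best, best_fit

# Made-up toy data with two predictors (illustrative only, not from the article).
X = [(2.0, 0.5), (3.0, 1.0), (1.0, -1.0), (4.0, 2.0),
     (0.5, 2.0), (1.0, 3.0), (-1.0, 1.0), (2.0, 4.0)]
y = [1, 1, 1, 1, 0, 0, 0, 0]

def fitness(beta):
    """AUC of the linear score beta[0]*x1 + beta[1]*x2 on the toy data."""
    scores = [beta[0] * x1 + beta[1] * x2 for x1, x2 in X]
    return auc(scores, y)

best, best_fit = gaussian_eda(fitness, dim=2)
print(best, best_fit)
```

The AUC is piecewise constant in the coefficients, so gradient-based methods such as Newton–Raphson cannot be applied to it directly; a sampling-based search like the one sketched here needs only fitness evaluations. Swapping `fitness` for a log-likelihood function would turn the same loop into a likelihood maximizer.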
Thus, we tackle the logistic regression problem from a double perspective: likelihood-based to calibrate the model and AUC-based to discriminate between the different classes. Under these two objectives of calibration and discrimination, the Pareto front can be obtained in our EDA framework. These fronts are compared with those yielded by a multiobjective EDA recently introduced in the literature.
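The notion of a Pareto front over the two objectives can be sketched as follows, assuming each candidate coefficient vector has already been scored as a (log-likelihood, AUC) pair and both objectives are to be maximized. The helper names `dominates` and `pareto_front` and the numeric values are hypothetical, chosen only to illustrate non-dominated filtering; they are not results from the article.

```python
def dominates(a, b):
    """True if a is at least as good as b in every objective and strictly better in one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(points):
    """Keep the points that no other point dominates (maximization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (log-likelihood, AUC) pairs for five candidate models.
candidates = [(-5.2, 0.80), (-4.9, 0.78), (-6.0, 0.85), (-5.5, 0.79), (-4.9, 0.70)]
front = pareto_front(candidates)
print(front)
```

Each model on the front trades calibration against discrimination: none of them can be improved in one objective without losing in the other.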
Cite this article
Robles, V., Bielza, C., Larrañaga, P. et al. Optimizing logistic regression coefficients for discrimination and calibration using estimation of distribution algorithms. TOP 16, 345–366 (2008). https://doi.org/10.1007/s11750-008-0054-3
DOI: https://doi.org/10.1007/s11750-008-0054-3
Keywords
- Logistic regression
- Evolutionary algorithms
- Estimation of distribution algorithms
- Calibration and discrimination