Abstract
While searching a huge space of possible hypotheses, rule induction algorithms compare quality estimates for a large number of rules to find the one that appears best. This mechanism can easily pick up random patterns in the data, and the selected rules will have optimistically high quality estimates even when the estimation method itself (such as relative frequency) is unbiased. It is generally believed that this problem, which eventually leads to overfitting, can be alleviated by using the m-estimate of probability. We show that this only partially mends the problem, and propose a novel way of making the common rule evaluation functions account for multiple comparisons in the search. Experiments on artificial data sets and data sets from the UCI repository show a large improvement in the accuracy of probability predictions and a decent gain in the AUC of the constructed models.
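The selection effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' correction method; it only shows, under assumed settings, why picking the best of many rules yields an optimistic estimate even when each individual estimate is unbiased, and how the m-estimate shrinks that estimate without removing the optimism. All function names and parameter values (coverage of 20 examples, prior of 0.5, m = 2) are hypothetical choices made for illustration.

```python
import random


def relative_frequency(pos, n):
    """Unbiased estimate of class probability for a single, pre-chosen rule."""
    return pos / n


def m_estimate(pos, n, prior, m):
    """m-estimate of probability: shrinks the relative frequency toward the prior."""
    return (pos + m * prior) / (n + m)


def best_rule_estimates(n_rules, coverage, prior, m, rng):
    """Simulate n_rules random rules, each covering `coverage` examples whose
    class is independent of the rule (true probability equals `prior`), and
    return both estimates for the rule with the most positive examples."""
    best_pos = max(
        sum(rng.random() < prior for _ in range(coverage))
        for _ in range(n_rules)
    )
    return (
        relative_frequency(best_pos, coverage),
        m_estimate(best_pos, coverage, prior, m),
    )


def average_best(n_rules, coverage=20, prior=0.5, m=2.0, trials=200, seed=0):
    """Average the estimates of the best rule over repeated simulated searches."""
    rng = random.Random(seed)
    rf_sum = 0.0
    m_sum = 0.0
    for _ in range(trials):
        rf, me = best_rule_estimates(n_rules, coverage, prior, m, rng)
        rf_sum += rf
        m_sum += me
    return rf_sum / trials, m_sum / trials


if __name__ == "__main__":
    # The true probability is 0.5 for every rule; only the number of
    # compared rules changes between the lines printed below.
    for n_rules in (1, 10, 100):
        rf, me = average_best(n_rules)
        print(f"rules compared: {n_rules:4d}  "
              f"relative frequency: {rf:.3f}  m-estimate: {me:.3f}")
```

With a single rule the average estimate stays close to the true probability, while the estimate of the best rule grows with the number of rules compared. The m-estimate lies between the prior and the relative frequency, so it reduces the optimism but does not account for how many rules were compared.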
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Možina, M., Demšar, J., Žabkar, J., Bratko, I. (2006). Why Is Rule Learning Optimistic and How to Correct It. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds) Machine Learning: ECML 2006. ECML 2006. Lecture Notes in Computer Science, vol 4212. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11871842_33
DOI: https://doi.org/10.1007/11871842_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45375-8
Online ISBN: 978-3-540-46056-5
eBook Packages: Computer Science, Computer Science (R0)