Optimisation and Evaluation of Random Forests for Imbalanced Datasets
This paper presents an optimisation of Random Forests with two aims: adapting the forest framework to learning from imbalanced data, and taking the user's requirements on recall and precision rates into account. We propose to adapt Random Forests on two levels. First, during forest construction, through the use of an asymmetric entropy measure combined with specific rules for assigning classes to leaves. Second, during the voting step, by replacing the classical majority-voting strategy with an alternative one. Automating this second step requires a dedicated methodology for assessing the quality of results. This methodology lets the user specify (1) the desired recall and precision rates for each class of the concept to learn, and (2) the importance to be conferred on each of those classes. Finally, results of experimental evaluations are presented.
Keywords: Random Forest, Recall Rate, Minority Class, Entropy Measure, Vote Weighting
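The two adaptations described in the abstract can be illustrated with a minimal sketch. The asymmetric entropy shown here is one formulation from the literature on asymmetric split criteria, peaking at a chosen class frequency `w` rather than at 0.5; the weighted vote replaces majority voting with per-class weights reflecting the importance the user assigns to each class. Function names and the binary setting are illustrative assumptions, not the paper's exact implementation.

```python
def asymmetric_entropy(p, w):
    """Impurity of a binary node with minority-class frequency p.

    Unlike classical entropy (maximal at p = 0.5), this measure is maximal
    at p = w, so splits are rewarded for isolating a rare class whose prior
    is w. One formulation of an asymmetric criterion; normalised to peak at 1.
    """
    return p * (1.0 - p) / ((1.0 - 2.0 * w) * p + w * w)


def weighted_vote(tree_probas, class_weights):
    """Aggregate per-tree class-probability estimates with per-class weights
    instead of a plain majority vote, biasing the forest toward the classes
    the user deems important (e.g. a rare positive class)."""
    n_classes = len(class_weights)
    scores = [0.0] * n_classes
    for proba in tree_probas:          # one probability vector per tree
        for c in range(n_classes):
            scores[c] += class_weights[c] * proba[c]
    return max(range(n_classes), key=lambda c: scores[c])
```

With weights of 1.0 for the majority class and 2.0 for the minority class, a forest whose trees lean 0.6/0.4, 0.7/0.3 and 0.4/0.6 toward the majority class would still predict the minority class, whereas a plain majority vote would not.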