Optimisation and Evaluation of Random Forests for Imbalanced Datasets

  • Julien Thomas
  • Pierre-Emmanuel Jouve
  • Nicolas Nicoloyannis
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4203)


This paper presents an optimization of Random Forests that aims both at adapting the forest concept to learning from imbalanced data and at taking the user's wishes regarding recall and precision rates into account. We propose to adapt Random Forests at two levels. First, during forest creation, through an asymmetric entropy measure combined with specific leaf class-assignment rules. Second, during the voting step, by using an alternative to the classical majority-voting strategy. Automating this second step requires a specific methodology for assessing result quality. This methodology allows the user to specify (1) the desired recall and precision rates for each class of the concept to learn, and (2) the importance to confer on each of those classes. Finally, results of experimental evaluations are presented.
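The two adaptations described above can be illustrated with a minimal sketch. This is not the authors' code: the asymmetric entropy follows the binary form proposed by Marcellin et al. (impurity peaking at a chosen reference frequency theta rather than at 0.5), and the per-class vote weights are a simplified stand-in for the paper's tuned alternative voting strategy.

```python
def asymmetric_entropy(p, theta):
    """Binary asymmetric entropy (after Marcellin et al.): maximal at
    p = theta instead of p = 0.5, so split quality is judged relative
    to a skewed class prior. Reduces to a scaled Gini index at theta = 0.5."""
    return p * (1.0 - p) / ((1.0 - 2.0 * theta) * p + theta ** 2)

def weighted_vote(tree_predictions, class_weights):
    """Replace plain majority voting with per-class vote weighting.
    tree_predictions: one predicted label per tree in the forest.
    class_weights: label -> vote weight; in the paper's spirit these
    weights would be tuned until user-specified recall/precision
    targets per class are met (the tuning loop is omitted here)."""
    scores = {}
    for label in tree_predictions:
        scores[label] = scores.get(label, 0.0) + class_weights.get(label, 1.0)
    return max(scores, key=scores.get)

# At theta = 0.5 the measure is maximal at p = 0.5; at theta = 0.1 it
# peaks at p = 0.1, rewarding purity with respect to a rare class.
print(round(asymmetric_entropy(0.5, 0.5), 3))                        # 1.0
print(asymmetric_entropy(0.1, 0.1) > asymmetric_entropy(0.5, 0.1))   # True

# Boosting the minority class's vote weight flips a 6-vs-4 majority:
votes = ["maj"] * 6 + ["min"] * 4
print(weighted_vote(votes, {"maj": 1.0, "min": 2.0}))                # min
```

The weight dictionary is the knob the paper's automated methodology would search over (e.g. by simulated annealing) to satisfy the user's per-class recall and precision wishes.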


Keywords: Random Forest, Recall Rate, Minority Class, Entropy Measure, Vote Weighting





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Julien Thomas (1, 2)
  • Pierre-Emmanuel Jouve (2)
  • Nicolas Nicoloyannis (1)
  1. Laboratoire ERIC, Université Lumière Lyon 2, France
  2. Fenics Company, Lyon, France
