The Bump Hunting by the Decision Tree with the Genetic Algorithm

  • Hideo Hirose

In difficult classification problems where z-dimensional points must be assigned to two groups with 0–1 responses, the data structure is often too messy for a separating boundary to be found; it is then more favorable to search for the denser regions of response-1 points than to look for boundaries between the two groups. For such problems, which are often seen in customer databases, we developed in a previous study a bump hunting method based on probabilistic and statistical methods. When a pureness rate is specified in advance, a maximum capture rate is sought; to find it, we used the decision tree method combined with the genetic algorithm. A trade-off curve between the pureness rate and the capture rate can then be constructed. However, such a curve can be optimistic if the training data set alone is used, so the accuracy of the trade-off curve must be assessed with care. Using accuracy evaluation procedures such as cross-validation or the bootstrapped hold-out method, which combine training and test data sets, we have shown that a trade-off curve applicable in practice can be obtained. We have also shown that an attainable upper-bound trade-off curve can be estimated by extreme-value statistics, because the genetic algorithm yields many local maxima of the capture rate from different initial values. We have constructed three kinds of trade-off curves: the first is obtained from the training data; the second is the return capture rate curve obtained from extreme-value statistics; the third is obtained from the test data. These three are indispensable, like the Trinity, for comprehending the whole picture of the trade-off between the pureness rate and the capture rate. This paper deals with the behavior of the trade-off curve from a statistical viewpoint.
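The two rates in the abstract can be illustrated with a minimal sketch. It assumes definitions the abstract does not spell out: the pureness rate is taken to be the fraction of response-1 points among all points inside a candidate region, and the capture rate the fraction of all response-1 points that fall inside it. The helper `rates`, the box region, and the toy data are hypothetical and are not taken from the paper; the decision tree and genetic algorithm search is not reproduced here. Evaluating one fixed region on separate training and test samples shows how training-only figures can be optimistic.

```python
def rates(points, labels, in_region):
    """Return (pureness, capture) for a region given as a predicate.

    Assumed definitions (not stated in the abstract):
      pureness = response-1 points inside / all points inside
      capture  = response-1 points inside / all response-1 points
    """
    inside = [y for x, y in zip(points, labels) if in_region(x)]
    ones_inside = sum(inside)
    total_ones = sum(labels)
    pureness = ones_inside / len(inside) if inside else 0.0
    capture = ones_inside / total_ones if total_ones else 0.0
    return pureness, capture


# Hypothetical candidate region: an axis-aligned box in two dimensions.
def box(p):
    return p[0] <= 0.5 and p[1] <= 0.5


# Toy training and test samples (invented for illustration only).
train_points = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
                (0.8, 0.9), (0.9, 0.7), (0.7, 0.8)]
train_labels = [1, 1, 0, 0, 1, 0]
test_points = [(0.2, 0.3), (0.4, 0.1), (0.6, 0.6), (0.1, 0.4)]
test_labels = [1, 0, 1, 0]

train_pureness, train_capture = rates(train_points, train_labels, box)
test_pureness, test_capture = rates(test_points, test_labels, box)

print(f"train: pureness={train_pureness:.3f}, capture={train_capture:.3f}")
print(f"test:  pureness={test_pureness:.3f}, capture={test_capture:.3f}")
```

In this toy case the box looks better on the training sample than on the test sample, which is the kind of gap the cross-validation and bootstrapped hold-out procedures are meant to expose.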

Keywords

Bump hunting, Decision tree, Genetic algorithm



Copyright information

© Springer Science+Business Media B.V 2009

Authors and Affiliations

  • Hideo Hirose
  1. Department of Systems Innovation and Informatics, Kyushu Institute of Technology, Fukuoka, Japan
