# The Bump Hunting by the Decision Tree with the Genetic Algorithm

In difficult classification problems, where z-dimensional points must be assigned to two groups with 0–1 responses but the data structure is messy, it is more favorable to search for denser regions of response-1 points than to find boundaries that separate the two groups. For such problems, which are often seen in customer databases, we have developed a bump hunting method using probabilistic and statistical methods, as shown in a previous study. When a pureness rate is specified in advance, the maximum capture rate can be obtained; to find it, we have used the decision tree method combined with the genetic algorithm. A trade-off curve between the pureness rate and the capture rate can then be constructed. However, such a trade-off curve could be optimistic if the training data set alone is used, so the accuracy of the trade-off curve must be assessed carefully. Using accuracy evaluation procedures such as cross-validation or the bootstrapped hold-out method, with separate training and test data sets, we have shown that an actually applicable trade-off curve can be obtained. We have also shown that an attainable upper-bound trade-off curve can be estimated using extreme-value statistics, because the genetic algorithm yields many local maxima of the capture rate from different initial values. We have constructed three kinds of trade-off curves: the first is obtained from the training data; the second is the return capture rate curve obtained from extreme-value statistics; the third is obtained from the test data. These three are indispensable, like the Trinity, for comprehending the whole picture of the trade-off between the pureness rate and the capture rate. This paper deals with the behavior of the trade-off curve from a statistical viewpoint.
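The abstract does not define the two rates, so the following is only a minimal sketch under an assumed reading: the pureness rate is taken to be the fraction of response-1 points among all points inside a candidate region, and the capture rate the fraction of all response-1 points that fall inside it. The axis-aligned box, the toy data, and the names `in_box` and `pureness_and_capture` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

# Assumption: a candidate "bump" region is an axis-aligned box,
# given as one (low, high) interval per dimension.
Box = Sequence[Tuple[float, float]]


def in_box(point: Sequence[float], box: Box) -> bool:
    """Return True if the point lies inside the box (bounds inclusive)."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(point, box))


def pureness_and_capture(
    points: List[Sequence[float]], labels: List[int], box: Box
) -> Tuple[float, float]:
    """Compute (pureness rate, capture rate) of a box under the assumed definitions."""
    inside = [y for p, y in zip(points, labels) if in_box(p, box)]
    ones_inside = sum(inside)
    total_ones = sum(labels)
    # Pureness: share of response-1 points among the points in the region.
    pureness = ones_inside / len(inside) if inside else 0.0
    # Capture: share of all response-1 points that the region contains.
    capture = ones_inside / total_ones if total_ones else 0.0
    return pureness, capture


# Toy two-dimensional data (made up for illustration only).
points = [(0.1, 0.2), (0.4, 0.5), (0.45, 0.55), (0.5, 0.4), (0.9, 0.9), (0.8, 0.1)]
labels = [0, 1, 1, 0, 1, 0]
box = [(0.3, 0.6), (0.3, 0.6)]

pureness, capture = pureness_and_capture(points, labels, box)
print(f"pureness = {pureness:.3f}, capture = {capture:.3f}")
```

Under these assumptions, shrinking the box tends to raise the pureness rate while lowering the capture rate, which is the trade-off the abstract refers to. Evaluating the same box on held-out points rather than on the points used to choose it would give the less optimistic figure.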

## Keywords

Bump hunting, Decision tree, Genetic algorithm
