
Journal of Classification, Volume 32, Issue 1, pp 21–45

Influence Measures for CART Classification Trees

  • Avner Bar-Hen
  • Servane Gey
  • Jean-Michel Poggi

Abstract

This paper deals with measuring the influence of observations on the results obtained with CART classification trees. To assess the influence of individual observations on the analysis, we use influence measures to propose criteria that quantify the sensitivity of a CART classification tree analysis. The proposed criteria are prediction-based and rely on jackknife trees. The analysis is extended to the pruned sequences of CART trees to produce CART-specific notions of influence. Distributional results are derived within the framework of influence functions.

A numerical example, the well-known spam dataset, illustrates the notions developed throughout the paper. Finally, a real dataset relating the administrative classification of cities surrounding Paris, France, to the characteristics of their tax revenue distributions is analyzed using the new influence-based tools.
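To make the jackknife-tree idea concrete, the following is a minimal sketch of a prediction-based influence measure of the kind described above. It is an illustration under stated assumptions, not the authors' exact criterion: scikit-learn's DecisionTreeClassifier stands in for CART, pruning is ignored, and influence is taken to be the fraction of sample predictions that change when one observation is left out of the training set.

```python
# Sketch of a jackknife-based influence measure for classification trees.
# Assumption: DecisionTreeClassifier approximates CART; the influence of
# observation i is the proportion of predicted labels that change when i
# is removed from the training sample. Illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def jackknife_influence(X, y, max_depth=None, random_state=0):
    """For each observation i, return the proportion of sample points
    whose predicted label changes when i is left out of training.
    X is an (n, p) numpy array, y an (n,) array of class labels."""
    n = len(y)
    full_tree = DecisionTreeClassifier(
        max_depth=max_depth, random_state=random_state).fit(X, y)
    reference = full_tree.predict(X)          # predictions of the full-sample tree
    influence = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i              # leave observation i out
        jack_tree = DecisionTreeClassifier(
            max_depth=max_depth, random_state=random_state).fit(X[mask], y[mask])
        # disagreement with the full-sample tree, over the whole sample
        influence[i] = np.mean(jack_tree.predict(X) != reference)
    return influence
```

In this sketch, observations with the largest influence values are those whose removal most perturbs the fitted tree's predictions, and would be the natural candidates for closer inspection in an analysis such as the spam example.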

Keywords

Influential individuals, Influence functions, Decision trees, CART



Copyright information

© Classification Society of North America 2015

Authors and Affiliations

  1. Laboratoire MAP5, Université Paris Descartes, Paris, France
  2. Laboratoire de Mathématiques, Université Paris Sud and Université Paris Descartes, Paris, France
