Computational Statistics, Volume 34, Issue 4, pp 1613–1648

Classification tree algorithm for grouped variables

  • A. Poterie
  • J.-F. Dupuy
  • V. Monbet
  • L. Rouvière
Original paper

Abstract

We consider the problem of predicting a categorical variable based on groups of inputs. Some methods have already been proposed to build classification rules based on groups of variables (e.g. group lasso for logistic regression). However, to our knowledge, no tree-based approach has been proposed to tackle this issue. Here, we propose the Tree Penalized Linear Discriminant Analysis algorithm (TPLDA), a new tree-based approach which constructs a classification rule based on groups of variables. It splits a node by repeatedly selecting a group and then applying a regularized linear discriminant analysis based on this group. This process is repeated until some stopping criterion is satisfied. A pruning strategy is proposed to select an optimal tree. Compared to the existing multivariate classification tree methods, the proposed method is computationally less demanding and the resulting trees are more easily interpretable. Furthermore, TPLDA automatically provides a measure of importance for each group of variables. This score makes it possible to rank groups of variables with respect to their ability to predict the response and can also be used to perform group variable selection. The performance of the proposed algorithm and its value in terms of prediction accuracy, interpretation and group variable selection are assessed and compared to alternative reference methods through simulations and applications on real datasets.
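The node-splitting step described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not specify the regularization or the split criterion, so the sketch assumes a two-class problem, a ridge-type regularized LDA (within-class scatter plus `lam` times the identity) and the Gini impurity decrease as the criterion for choosing a group. All function names, the `groups` dictionary and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * v for x, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ridge_lda(rows, labels, lam):
    """Assumed regularization: w = (S_w + lam * I)^(-1) (mu_1 - mu_0).

    Returns the direction w and a threshold at the midpoint of the
    two class means. Assumes exactly two classes.
    """
    p, n = len(rows[0]), len(rows)
    c0, c1 = sorted(set(labels))
    means = {}
    for c in (c0, c1):
        members = [r for r, lab in zip(rows, labels) if lab == c]
        means[c] = [sum(col) / len(members) for col in zip(*members)]
    # Pooled within-class scatter (divided by n) plus the ridge term.
    s = [[lam if i == j else 0.0 for j in range(p)] for i in range(p)]
    for r, lab in zip(rows, labels):
        d = [x - mu for x, mu in zip(r, means[lab])]
        for i in range(p):
            for j in range(p):
                s[i][j] += d[i] * d[j] / n
    w = solve(s, [a - b for a, b in zip(means[c1], means[c0])])
    mid = [(a + b) / 2 for a, b in zip(means[c1], means[c0])]
    return w, sum(wi * mi for wi, mi in zip(w, mid))

def best_group_split(X, y, groups, lam=1.0):
    """Try each group, keep the one with the largest impurity decrease."""
    parent, best = gini(y), None
    for name, cols in groups.items():
        rows = [[x[c] for c in cols] for x in X]
        w, t = ridge_lda(rows, y, lam)
        side = [sum(wi * xi for wi, xi in zip(w, r)) > t for r in rows]
        left = [lab for lab, s in zip(y, side) if not s]
        right = [lab for lab, s in zip(y, side) if s]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or parent - child > best[1]:
            best = (name, parent - child, w, t)
    return best

# Toy data (made up): columns 0-1 separate the classes, columns 2-3 do not.
X = [[0.0, 0.1, 1.0, 0.0], [0.2, 0.0, 0.0, 1.0],
     [0.1, 0.3, 1.0, 1.0], [0.3, 0.2, 0.0, 0.0],
     [2.0, 2.1, 1.0, 0.0], [2.2, 1.9, 0.0, 1.0],
     [1.9, 2.3, 1.0, 1.0], [2.1, 2.0, 0.0, 0.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
groups = {"signal": [0, 1], "noise": [2, 3]}

name, decrease, w, t = best_group_split(X, y, groups)
print(name, decrease)
```

On this toy data the "signal" group is selected because its linear split separates the two classes, whereas the "noise" group yields no impurity decrease. A full tree would apply the same step recursively to each child node until a stopping criterion is met, then prune.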

Keywords

Supervised classification; Groups of inputs; Group variable selection; Multivariate classification tree algorithms; Group importance measure; Regularized linear discriminant analysis

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  • A. Poterie (1)
  • J.-F. Dupuy (1)
  • V. Monbet (2)
  • L. Rouvière (2)

  1. INSA, CNRS, IRMAR - UMR 6625, Univ Rennes, Rennes, France
  2. CNRS, IRMAR - UMR 6625, Univ Rennes, Rennes, France
