# Classification tree algorithm for grouped variables

- 63 Downloads

## Abstract

We consider the problem of predicting a categorical variable based on groups of inputs. Some methods have already been proposed to elaborate classification rules based on groups of variables (e.g. group lasso for logistic regression). However, to our knowledge, no tree-based approach has been proposed to tackle this issue. Here, we propose the Tree Penalized Linear Discriminant Analysis algorithm (TPLDA), a new-tree based approach which constructs a classification rule based on groups of variables. It consists in splitting a node by repeatedly selecting a group and then applying a regularized linear discriminant analysis based on this group. This process is repeated until some stopping criterion is satisfied. A pruning strategy is proposed to select an optimal tree. Compared to the existing multivariate classification tree methods, the proposed method is computationally less demanding and the resulting trees are more easily interpretable. Furthermore, TPLDA automatically provides a measure of importance for each group of variables. This score allows to rank groups of variables with respect to their ability to predict the response and can also be used to perform group variable selection. The good performances of the proposed algorithm and its interest in terms of prediction accuracy, interpretation and group variable selection are loud and compared to alternative reference methods through simulations and applications on real datasets.

## Keywords

Supervised classification Groups of inputs Group variable selection Multivariate classification tree algorithms Group importance measure Regularized linear discriminant analysis## Notes

## Supplementary material

## References

- Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci 96(12):6745–6750Google Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al (2000) Gene ontology: tool for the unification of biology. Nat Genet 25(1):25Google Scholar
- Bouveyron C, Girard S, Schmid C (2007) High-dimensional discriminant analysis. Commun Stat Theory Methods 36(14):2607–2623MathSciNetzbMATHGoogle Scholar
- Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. CRC Press, Boca RatonzbMATHGoogle Scholar
- Brodley CE, Utgoff PE (1995) Multivariate decision trees. Mach Learn 19(1):45–77zbMATHGoogle Scholar
- Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) Smote: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357zbMATHGoogle Scholar
- Dudoit S, Fridlyand J, Speed TP (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 97(457):77–87MathSciNetzbMATHGoogle Scholar
- Engreitz JM, Daigle BJ Jr, Marshall JJ, Altman RB (2010) Independent component analysis: mining microarray data for fundamental human gene expression modules. J Biomed Inform 43(6):932–944Google Scholar
- Friedman J, Hastie T, Tibshirani R (2001) The elements of statistical learning. Springer series in statistics, vol 1. Springer, New YorkzbMATHGoogle Scholar
- Friedman JH (1989) Regularized discriminant analysis. J Am Stat Assoc 84(405):165–175MathSciNetGoogle Scholar
- Genuer R, Poggi JM (2017) Arbres CART et Forêts aléatoires,Importance et sélection de variables, preprintGoogle Scholar
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA et al (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531–537Google Scholar
- Gregorutti B, Michel B, Saint-Pierre P (2015) Grouped variable importance with random forests and application to multiple functional data analysis. Comput Stat Data Anal 90:15–35MathSciNetzbMATHGoogle Scholar
- Grimonprez Q, Blanck S, Celisse A, Marot G (2018) MLGL: an R package implementing correlated variable selection by hierarchical clustering and group-lasso, preprintGoogle Scholar
- Guo Y, Hastie T, Tibshirani R (2006) Regularized linear discriminant analysis and its application in microarrays. Biostatistics 8(1):86–100zbMATHGoogle Scholar
- Huang D, Quan Y, He M, Zhou B (2009) Comparison of linear discriminant analysis methods for the classification of cancer based on gene expression data. J. Exp. Clin. Cancer Res. 28(1):149Google Scholar
- Huang J, Breheny P, Ma S (2012) A selective review of group selection in high-dimensional models. Stat Sci Rev J Inst Math Stat 27(4):481–499MathSciNetzbMATHGoogle Scholar
- Jacob L, Obozinski G, Vert JP (2009) Group lasso with overlap and graph lasso. In: Proceedings of the 26th annual international conference on machine learning, ACM, pp 433–440Google Scholar
- Kaminski N, Friedman N (2002) Practical approaches to analyzing results of microarray experiments. Am J Respir Cell Mol Biol 27(2):125–132Google Scholar
- Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28(1):27–30Google Scholar
- Lange K, Hunter DR, Yang I (2000) Optimization transfer using surrogate objective functions. J Comput Graph Stat 9(1):1–20MathSciNetGoogle Scholar
- Lee SI, Batzoglou S (2003) Application of independent component analysis to microarrays. Genome Biol 4(11):R76Google Scholar
- Li XB, Sweigart JR, Teng JT, Donohue JM, Thombs LA, Wang SM (2003) Multivariate decision trees using linear discriminants and tabu search. IEEE Trans Syst Man Cybern Part A Syst Hum 33(2):194–205Google Scholar
- Lim TS, Loh WY, Shih YS (2000) A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Mach Learn 40(3):203–228zbMATHGoogle Scholar
- Loh W (2014) Fifty years of classification and regression trees. Int Stat Rev 82(3):329–348MathSciNetzbMATHGoogle Scholar
- Loh WY, Shih YS (1997) Split selection methods for classification trees. Stat Sinica 7(4):815–840MathSciNetzbMATHGoogle Scholar
- Meier L, Geer SVD, Bühlmann P (2008) The group lasso for logistic regression. J R Stat Soc Ser B (Stat Methodol) 70(1):53–71MathSciNetzbMATHGoogle Scholar
- Mola F, Siciliano R (2002) Discriminant analysis and factorial multiple splits in recursive partitioning for data mining. In: International workshop on multiple classifier systems. Springer, Berlin, Heidelberg, pp 118–126zbMATHGoogle Scholar
- Murthy SK, Kasif S, Salzberg S, Beigel R (1993) OC1: a randomized algorithm for building oblique decision trees. In: Proceedings of AAAI, vol 93, pp 322–327Google Scholar
- Picheny V, Servien R, Villa-Vialaneix N (2016) Interpretable sparse sir for functional data, preprintGoogle Scholar
- Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106Google Scholar
- Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc, BurlingtonGoogle Scholar
- Sewak MS, Reddy NP, Duan ZH (2009) Gene expression based leukemia sub-classification using committee neural networks. Bioinform Biol Insights 3:89Google Scholar
- Shao J, Wang Y, Deng X, Wang S et al (2011) Sparse linear discriminant analysis by thresholding for high dimensional data. Ann Stat 39(2):1241–1265MathSciNetzbMATHGoogle Scholar
- Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, Gaasenbeek M, Angelo M, Reich M, Pinkus GS et al (2002) Diffuse large b-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med 8(1):68Google Scholar
- Strobl C, Boulesteix AL, Zeileis A, Hothorn T (2007) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinform 8(1):25Google Scholar
- Tai F, Pan W (2007) Incorporating prior knowledge of gene functional groups into regularized discriminant analysis of microarray data. Bioinformatics 23(23):3170–3177Google Scholar
- Tamayo P, Scanfeld D, Ebert BL, Gillette MA, Roberts CW, Mesirov JP (2007) Metagene projection for cross-platform, cross-species characterization of global transcriptional states. Proc Natl Acad Sci 104(14):5959–5964Google Scholar
- Wei-Yin Loh NV (1988) Tree-structured classification via generalized discriminant analysis. J Am Stat Assoc 83(403):715–725MathSciNetzbMATHGoogle Scholar
- Wickramarachchi D, Robertson B, Reale M, Price C, Brown J (2016) HHCART: an oblique decision tree. Comput Stat Data Anal 96:12–23MathSciNetzbMATHGoogle Scholar
- Witten DM, Tibshirani R (2011) Penalized classification using Fisher’s linear discriminant. J R Stat Soc Ser B (Stat Methodol) 73(5):753–772MathSciNetzbMATHGoogle Scholar
- Xu P, Brock GN, Parrish RS (2009) Modified linear discriminant analysis approaches for classification of high-dimensional microarray data. Comput Stat Data Anal 53(5):1674–1687MathSciNetzbMATHGoogle Scholar
- Yin L, Huang CH, Ni J (2006) Clustering of gene expression data: performance and similarity analysis. BMC Bioinform 7(4):19–30Google Scholar