Supervised clustering of variables

Regular Article

Abstract

In predictive modelling, highly correlated predictors lead to unstable models that are often difficult to interpret. Two common remedies are feature selection and the use of latent components that reduce the complexity among correlated observed variables. The procedure we advocate here aims to achieve both purposes: to highlight the group structure among the variables and to identify the groups of variables most relevant for prediction. It is an iterative adaptation of a method developed for the clustering of variables around latent variables (CLV). Modifying the standard CLV algorithm yields a supervised procedure, in the sense that the variable to be predicted plays an active role in the clustering. The latent variables associated with the groups of variables, selected for their “proximity” to the variable to be predicted and their “internal homogeneity”, are progressively added to a predictive model. The features of the methodology are illustrated with a simulation study and a real-world application.
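
To make the ingredients named in the abstract concrete, the following is a minimal Python sketch, not the authors' supervised procedure, whose details the abstract does not give. It assumes the usual unsupervised CLV alternating step (each group's latent variable is the first principal component of its variables, and each variable is reassigned to the group whose latent variable it has the largest squared covariance with), followed by a simple ranking of group latent variables by squared correlation with the response. The function names, the toy data, and the ranking rule are all illustrative assumptions.

```python
import numpy as np

def first_component(block):
    """Scores of the first principal component of a column-centred block."""
    u, s, _ = np.linalg.svd(block, full_matrices=False)
    return u[:, 0] * s[0]

def clv_partition(X, init_labels, n_iter=20):
    """Assumed CLV-style alternating step: update latent variables, reassign variables."""
    X = X - X.mean(axis=0)
    labels = np.asarray(init_labels).copy()
    k = labels.max() + 1
    for _ in range(n_iter):
        # Latent variable of each group = first PC of the variables in it.
        comps = np.column_stack(
            [first_component(X[:, labels == g]) for g in range(k)]
        )
        # Squared covariance of every variable with every group latent variable.
        cov2 = (X.T @ comps / X.shape[0]) ** 2
        new_labels = cov2.argmax(axis=1)
        # Stop when the partition is stable or a group would become empty.
        if np.array_equal(new_labels, labels) or len(np.unique(new_labels)) < k:
            break
        labels = new_labels
    comps = np.column_stack(
        [first_component(X[:, labels == g]) for g in range(k)]
    )
    return labels, comps

def rank_groups(comps, y):
    """Order groups by squared correlation of their latent variable with y (illustrative rule)."""
    r2 = np.array(
        [np.corrcoef(comps[:, g], y)[0, 1] ** 2 for g in range(comps.shape[1])]
    )
    return np.argsort(-r2), r2

# Toy data (hypothetical): two independent factors, three noisy copies of each.
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=200), rng.normal(size=200)
X = np.column_stack(
    [f1 + 0.3 * rng.normal(size=200) for _ in range(3)]
    + [f2 + 0.3 * rng.normal(size=200) for _ in range(3)]
)
y = f1 + 0.3 * rng.normal(size=200)  # response driven by the first factor only

labels, comps = clv_partition(X, init_labels=[0, 1, 0, 1, 0, 1])
order, r2 = rank_groups(comps, y)
```

In this sketch the response only enters after the clustering, when groups are ranked; in the procedure described above it plays an active role in the clustering itself, so the sketch should be read as a baseline illustration of the building blocks rather than as the method.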

Keywords

Prediction; Clustering of variables around latent variables (CLV); Forward regression model; Sparse PLS regression

Mathematics Subject Classification

62H30 62J05 

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  1. Sensometrics and Chemometrics Laboratory, LUNAM University, ONIRIS, Nantes CEDEX 3, France
