Abstract
In this paper we present a new method for joint feature selection and classifier learning using a sparse Bayesian approach. Both tasks are performed by optimizing a global loss function that includes a term associated with the empirical loss and another that imposes a feature-selection and regularization constraint on the parameters. To minimize this function we use a recently proposed technique, the Boosted Lasso algorithm, which follows the regularization path of the empirical risk associated with our loss function. We develop the algorithm for a well-known non-parametric classification method, the relevance vector machine, and perform experiments on a synthetic data set and three databases from the UCI Machine Learning Repository. The results show that our method selects the relevant features and, in some cases, improves classification accuracy when feature selection is performed.
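For concreteness, the global loss described above has the generic penalized-risk form

\min_{\beta} \; \sum_{i=1}^{n} \ell\bigl(y_i, f(\mathbf{x}_i; \beta)\bigr) \; + \; \lambda \, \lVert \beta \rVert_1

where the first term is the empirical loss and the second a sparsity-inducing penalty; this is a sketch in standard Lasso notation, and the symbols \ell, f, \beta and \lambda (and the choice of an L1 penalty) are illustrative conventions rather than the paper's exact formulation. The Boosted Lasso of Zhao and Yu traces the regularization path of such an objective by alternating small forward steps, which adjust the coordinate of \beta that most reduces the empirical loss by a fixed step size \varepsilon, with backward steps that shrink a coordinate whenever doing so lowers the penalized loss, so that the full path is approximated without re-solving the penalized problem at each value of \lambda.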
References
Bellman R (1961) Adaptive control processes: a guided tour. Princeton University Press, New Jersey
Duda R, Hart P, Stork D (2001) Pattern classification, 2nd edn. Wiley
Vapnik VN (1995) The nature of statistical learning theory. Springer, New York
Schapire RE, Freund Y, Bartlett PL, Lee WS (1998) Boosting the margin: a new explanation for the effectiveness of voting methods. Ann Stat 26(5):1651–1686
Madigan D, Genkin A, Lewis DD, Fradkin D (2005) Bayesian multinomial logistic regression for author identification. In: AIP conference proceedings of the 25th international workshop on Bayesian inference and maximum entropy methods in science and engineering, vol 803, pp 509–516
Abe N, Kudo M, Toyama J, Shimbo M (2006) Classifier-independent feature selection on the basis of divergence criterion. Pattern Anal Appl 9(2–3):127–137
Zivkovic Z, van der Heijden F (2004) Improving the selection of feature points for tracking. Pattern Anal Appl 7(2):144–150
Jain A, Zongker D (1997) Feature selection: evaluation, application, and small sample performance. IEEE Trans Pattern Anal Mach Intell 19(2):153–158
Fisher R (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7:179–188
Fukunaga K (1990) Introduction to statistical pattern recognition, 2nd edn. Academic Press, Boston
Masip D, Kuncheva LI, Vitrià J (2005) An ensemble-based method for linear feature extraction for two-class problems. Pattern Anal Appl 8:227–237
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46(1–3):389–422
Mao KZ (2004) Feature subset selection for support vector machines through discriminative function pruning analysis. IEEE Trans Syst Man Cybern Part B 34(1):60–67
Chen S, Wang X, Hong X, Harris CJ (2006) Kernel classifier construction using orthogonal forward selection and boosting with Fisher ratio class separability measure. IEEE Trans Neural Netw 17(6):1652–1656
Hong X, Mitchell RJ (2007) Backward elimination model construction for regression and classification using leave-one-out criteria. Int J Syst Sci 38(2):101–113
Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, Vapnik V (2000) Feature selection for SVMs. In: Leen TK, Dietterich TG, Tresp V (eds) NIPS. MIT Press, Cambridge, pp 668–674
Neal RM (1996) Bayesian learning for neural networks. Lecture notes in statistics, vol 118. Springer, Heidelberg
Seeger M (1999) Bayesian model selection for support vector machines, Gaussian processes and other kernel classifiers. In: Solla SA, Leen TK, Müller K-R (eds) NIPS. The MIT Press, Cambridge, pp 603–609
Zhu J, Rosset S, Hastie T, Tibshirani R (2004) 1-norm support vector machines. In: Thrun S, Saul L, Schölkopf B (eds) Advances in neural information processing systems, vol 16. MIT Press, Cambridge
Jebara T, Jaakkola T (2000) Feature selection and dualities in maximum entropy discrimination. In: Proceedings of the 16th conference on uncertainty in artificial intelligence. Morgan Kaufmann, San Francisco, pp 291–300
Li K, Peng J, Bai E (2006) A two-stage algorithm for identification of nonlinear dynamic systems. Automatica 42(7):1189–1197
Krishnapuram B, Hartemink AJ, Carin L, Figueiredo MAT (2004) A Bayesian approach to joint feature selection and classifier design. IEEE Trans Pattern Anal Mach Intell 26(9):1105–1111
Tipping ME (2001) Sparse Bayesian learning and the relevance vector machine. J Mach Learn Res 1:211–244
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58(1):267–288
Zhao P, Yu B (2007) Stagewise lasso. J Mach Learn Res 8:2701–2726
Osborne M, Presnell B, Turlach B (2000) A new approach to variable selection in least squares problems. IMA J Numer Anal 20(3):389–403
Osborne M, Presnell B, Turlach B (2000) On the lasso and its dual. J Comput Graph Stat 9(2):319–337
Vert J-P, Foveau N, Lajaunie C, Vandenbrouck Y (2006) An accurate and interpretable model for siRNA efficacy prediction. BMC Bioinf 7:520–537
Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused lasso. J R Stat Soc Ser B 67(1):91–108
Ghosh D, Chinnaiyan A (2005) Classification and selection of biomarkers in genomic data using lasso. J Biomed Biotechnol 2005(2):147–154
Gao J, Suzuki H, Yu B (2006) Approximation lasso methods for language modeling. In: ACL ’06: proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the ACL, Association for Computational Linguistics. Morristown, NJ, pp 225–232
Obozinski G, Taskar B, Jordan M (2006) Multi-task feature selection. Tech rep, Department of Statistics, UC Berkeley
Igual L, Seguí S, Vitrià J, Azpiroz F, Radeva P (2007) Sparse bayesian feature selection applied to intestinal motility analysis. In: XVI Congreso Argentino de Bioingeniería, pp 467–470
Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting. Ann Stat 28:337–374
Blake C, Merz C (1998) UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html
Pranckeviciene E, Ho T, Somorjai RL (2006) Class separability in spaces reduced by feature selection. In: ICPR (3). IEEE Computer Society, pp 254–257
Pudil P, Novovicova J, Kittler J (1994) Floating search methods in feature selection. Pattern Recognit Lett 15(11):1119–1125
Acknowledgments
This work was partially supported by MEC grant TIC2006-15308-C02-01 and CONSOLIDER-INGENIO 2010 (CSD2007-00018).
Cite this article
Lapedriza, À., Seguí, S., Masip, D. et al. A sparse Bayesian approach for joint feature selection and classifier learning. Pattern Anal Applic 11, 299–308 (2008). https://doi.org/10.1007/s10044-008-0130-1