Abstract
The main challenge in working with gene expression microarrays is that the sample size is small compared to the large number of variables (genes). In many studies, the main focus is on finding a small subset of the genes, which are the most important ones for differentiating between different types of cancer, for simpler and cheaper diagnostic arrays. In this paper, a sparse Bayesian variable selection method in probit model is proposed for gene selection and classification. We assign a sparse prior for regression parameters and perform variable selection by indexing the covariates of the model with a binary vector. The correlation prior for the binary vector assigned in this paper is able to distinguish models with the same size. The performance of the proposed method is demonstrated with one simulated data and two well known real data sets, and the results show that our method is comparable with other existing methods in variable selection and classification.
Similar content being viewed by others
References
Albert J, Chib S (1993) Bayesian analysis of binary and polychotomous response data. J Am Stat Assoc 88:669–679
Armagan A, Dunson DB, Lee J (2013) Generalized double Pareto shrinkage. Stat Sin 3(1):119–143
Bae K, Mallick BK (2004) Gene selection using a two-level hierarchical Bayesian model. Bioinformatics 20(18):3423–3430
Baragatti M (2011) Bayesian variable selection for probit mixed models applied to gene selection. Bayesian Anal 6(2):209–230
Baragatti M, Pommeret D (2012) A study of variable selection using g-prior distribution with ridge parameter. Comput Stat Data Anal 56:1920–1934
Bradley P, Mangasarian O (1998) Feature selection via concave minimization and support vector machines. In: Proceedings of the 15th international conference on machine learning, pp 82–90
Brotherick I, Robson CN, Browell DA, Shenfine J, White MD, Cunliffe WJ, Shenton BK, Egan M, Webb LA, Lunt LG, Young JR, Higgs MJ (1998) Cytokeratin expression in breast cancer: phenotypic changes associated with disease progression. Cytometry 32:301–308
Chakraborty S (2009) Bayesian Binary kernel probit model for microarray based cancer classification and gene selection. Comput Stat Data Anal 53:4198–4209
Chakraborty S, Guo R (2011) Bayesian hybrid huberized SVM and its applications in high dimensional medical data. Comput Stat Data Anal 55(3):1342–1356
Chhikara R, Folks L (1989) The inverse Gaussian distribution: theory, methodology, and applications. Marcel Dekker, New York
Devroye L (1986) Non-uniform random variate generation. Springer, New York
Dougherty ER (2001) Small sample issues for microarray-based classification. Comp Funct Genomics 2:28–34
Dudoit Y, Yang H, Callow M, Speed T (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 97:77–87
Geman S, Geman D (1984) Stochastic relaxation, Gibbls distribution, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell 6:721–741
George EI, McCulloch RE (1993) Variable selection via Gibbs sampling. J Am Stat Assoc 88:881–889
Geyer CJ (1992) Practical Markov chain Monte Carlo. Stat Sci 7:473–511
Gilks W, Richardson S, Spiegelhalter D (1996) Markov Chain Monte Carlo in practise. Chapman and Hall, London
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286:531–537
Gupta M, Ibrahim JG (2007) Variable selection in regression mixture modeling for the discovery of gene regulatory networks. J Am Stat Assoc 102(479):867–880
Guyon I, Weston J, Barnhill S, Vapnik V et al (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46:389–422
Hastie T, Tibshirani R, Friedman J (2001) The element of statistical learning. Springer, New York
Hendenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Kallioniemi OP, Wilfond B, Borg A, Trent J (2001) Gene expression profiles in hereditary breast cancer. N Engl J Med 344:539–548
Hirota T, Morisaki T, Nishiyama Y, Marumoto T, Tada K, Hara T, Masuko N, Inagaki M, Hatakeyama K, Saya H (2000) Zyxin a regulator of actin filament assembly, targets the mitotic apparatus by interacting with h-warts/LATS1 tumor suppressor. J Cell Biol 149:1073–1086
Ishwaran H, Rao JS (2005) Spike and slab variable selection: frequentist and bayesian strategies. Ann Stat 33(2):730–773
Kass RE, Carlin BP, Gelman A, Neal R (1998) Markov Chain Monte Carlo in practice: a roundtable discussion. Am Stat 52:93–100
Lamnisos D, Griffin JE, Steel FJ Mark (2009) Transdimensional sampling algorithms for Bayesian variable selection in classification problems with many more variables than observations. J Comput Graph Stat 18:592–612
Lee KE et al (2003) Gene selection: a Bayesian variable selection approach. Bioinformatics 19:90–97
Li F, Zhang NR (2010) Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. J Am Stat Assoc 105(491):1202–1214
Liu X, Krishnan A, Mondry A (2005) An entropy-based gene selection method for cancer classification using microarray data. BMC Bioinform 6:76
Mallick BK, Ghosh D, Ghosh M (2005) Bayesian classification of tumors using gene expression data. J R Stat Soc B 67:219–232
Maruyama Y, George EI (2011) gBF: a fully Bayes factor with a generalized g-prior. Technical Report, University of Pennsylvania. arXiv:0801.4410
Mitchell TJ, Beauchamp JJ (1988) Bayesian variable selection in linear regression. J Am Stat Assoc 83:1023–1036
Nguyen DV, Rocke DM (2002) Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics 18:1216–1226
OHara RB, Sillanpaa MJ (2009) A review of Bayesian variable selection methods: what, how and which. Bayesian Anal 4:85–118
Panagiotelisa A, Smith M (2008) Bayesian identification, selection and estimation of semiparametric functions in high dimensional additive models. J Econom 143:291–316
Park K, Casella G (2008) The Bayesian Lasso. J Am Stat Assoc 103:681–686
Quintana MA, Conti DV (2013) Integrative variable selection via Bayesian model uncertainty. Stat Med 32(28):4938–4953
Sha N, Vannucci M, Tadesse M, Brown P, Dragoni I, Davies N, Roberts T, Contestabile A, Salmon M, Buckley C, Falciani F (2004) Bayesian variable selection in multinomial probit models to identify molecular signatures of disease stage. Biometrics 60:812–819
Stingo FC, Vannucci M (2011) Variable selection for discriminant analysis with Markov random field priors for the analysis of microarray data. Bioinformatics 27(4):495–501
Strawderman WE (1971) Proper Bayes minimax estimators of the multivariate normal mean. Ann Math Stat 42:385–388
Tolosi L, Lengauer T (2011) Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics 27:1986–1994
Yang K, Cai Z, Li J, Lin G (2006) A stable gene selection in microarray data analysis. BMC Bioinform 7:228
Yang A, Song X (2010) Bayesian variable selection for disease classication using gene expression data. Bioinformatics 26(2):215–222
Yuan M, Lin Y (2005) Efficient empirical bayes variable selection and estimation in linear models. J Am Stat Assoc 472:1215–1225
Zellner A (1986) On assessing prior distributions and Bayesian regression analysis with g-prior distributions. Bayesian inference and decision techniques: essays in honor of Bruno de Finetti. NorthHolland, Amsterdam, pp 233–243
Zhou X, Liu K, Wong S (2004) Cancer classification and prediction using logistic regression with Bayesian gene selection. J Biomed Inform 37:249–259
Acknowledgments
The authors gratefully acknowledge the financial support of the Natural Science Foundation of China (11501294, 11101432, 11571073), the China Postdoctoral Science Foundation (2015M580374), and the Natural Science Foundation of Jiangsu (BK20141326).
Author information
Authors and Affiliations
Corresponding author
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
About this article
Cite this article
Yang, A., Jiang, X., Shu, L. et al. Bayesian variable selection with sparse and correlation priors for high-dimensional data analysis. Comput Stat 32, 127–143 (2017). https://doi.org/10.1007/s00180-016-0665-3
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00180-016-0665-3