Classification Approaches for Microarray Gene Expression Data Analysis
Classification approaches have been developed, adopted, and applied to distinguish disease classes at the molecular level using microarray data. Recently, a novel class of hierarchical probabilistic models based on a kernel-imbedding technique has become one of the best classification tools for microarray data analysis. These models were first developed as kernel-imbedded Gaussian processes (KIGPs) for binary classification problems using microarray gene expression data, and were then extended to multiclass classification problems under a unifying Bayesian framework. Specifically, an adaptive algorithm with a cascading structure was designed to find appropriate featuring kernels, to discover potentially significant genes, and to make optimal disease (e.g., tumor/cancer) class predictions with associated Bayesian posterior probabilities. Simulation studies and applications to published real data showed that KIGPs performed very close to the Bayesian bound and consistently outperformed, or performed among the best of, many state-of-the-art methods. The distinctive advantage of the KIGP approach is its ability to explore both linear and nonlinear underlying relationships between the target features of a given disease classification problem and the explanatory gene expression data involved. This line of research has shed light on the broader usability of the KIGP approach for the analysis of other high-throughput omics data and of omics data collected as time series, especially when linear-model-based methods fail to work.
Key words: Microarray gene expression, Kernel-imbedding, Gaussian processes, Markov chains Monte Carlo methods, Nonlinear systems
This work was partially supported by the Loyola University Medical Center Research Development Funds and the SUN Microsystems Academic Equipment Grant for Bioinformatics. The author would like to thank Dr. Xin Zhao at Sanjole Inc. for his involvement in the KIGP work.