Abstract
Feature selection with a relatively small sample size and an extremely high-dimensional feature space is common in many areas of contemporary statistics. The high dimensionality of the feature space causes serious difficulties: (i) the sample correlations between features become high even if the features are stochastically independent; (ii) the computation becomes intractable. These difficulties make conventional approaches either inapplicable or inefficient. Reducing the dimensionality of the feature space and then applying low-dimensional approaches appears to be the only feasible way to tackle the problem. Along this line, we develop in this article a tournament screening cum EBIC (extended Bayes information criterion) approach for feature selection with a high-dimensional feature space. The procedure of tournament screening mimics that of a tournament. It is shown theoretically that tournament screening has the sure screening property, a necessary property that any valid screening procedure should satisfy. Numerical studies demonstrate that the tournament screening cum EBIC approach enjoys desirable properties, such as a higher positive selection rate and a lower false discovery rate than other approaches.
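Difficulty (i) can be illustrated with a small simulation. The sketch below is not taken from the article: the function name, the sample size n = 20, the feature counts, and the use of standard normal features are all assumptions chosen for illustration. It draws mutually independent features and reports the largest absolute pairwise sample correlation, which tends to grow as the number of features increases while the sample size stays fixed.

```python
import numpy as np


def max_abs_sample_correlation(n, p, seed=0):
    """Largest absolute pairwise sample correlation among p independent features.

    Hypothetical helper for illustration only: the features are i.i.d.
    standard normal, so every population correlation is exactly zero.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, p))      # n observations, p independent features
    corr = np.corrcoef(x, rowvar=False)  # p x p sample correlation matrix
    np.fill_diagonal(corr, 0.0)          # ignore each feature's correlation with itself
    return float(np.abs(corr).max())


if __name__ == "__main__":
    # Fixed small sample size, growing feature space.
    for p in (10, 100, 1000):
        print(p, round(max_abs_sample_correlation(n=20, p=p), 3))
```

With the sample size held at 20, the maximum over roughly half a million feature pairs (p = 1000) is typically far larger than the maximum over 45 pairs (p = 10), even though no pair is truly correlated.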
Additional information
Dedicated to Professor Zhidong Bai on the occasion of his 65th birthday
Zehua Chen was supported by the Singapore Ministry of Education's ACRF Tier 1 (Grant No. R-155-000-065-112). Jiahua Chen was supported by the Natural Sciences and Engineering Research Council of Canada and MITACS, Canada.
About this article
Cite this article
Chen, Z., Chen, J. Tournament screening cum EBIC for feature selection with high-dimensional feature spaces. Sci. China Ser. A-Math. 52, 1327–1341 (2009). https://doi.org/10.1007/s11425-009-0089-4
DOI: https://doi.org/10.1007/s11425-009-0089-4
Keywords
- extended Bayes information criterion
- feature selection
- penalized likelihood
- reduction of dimensionality
- small-n-large-P
- sure screening