Tournament screening cum EBIC for feature selection with high-dimensional feature spaces

Science in China Series A: Mathematics

Abstract

Feature selection with a relatively small sample size and an extremely high-dimensional feature space is common in many areas of contemporary statistics. The high dimensionality of the feature space causes serious difficulties: (i) the sample correlations between features become high even if the features are stochastically independent; (ii) the computation becomes intractable. These difficulties make conventional approaches either inapplicable or inefficient. Reducing the dimensionality of the feature space and then applying low-dimensional approaches appears to be the only feasible way to tackle the problem. Along this line, we develop in this article a tournament screening cum EBIC approach for feature selection with high-dimensional feature spaces. The procedure of tournament screening mimics that of a tournament. It is shown theoretically that tournament screening has the sure screening property, a property that any valid screening procedure must satisfy. Numerical studies demonstrate that the tournament screening cum EBIC approach achieves a higher positive selection rate and a lower false discovery rate than other approaches.
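
To make the two-stage idea concrete, the following is a minimal Python sketch, not the authors' procedure. The abstract does not say how each tournament round scores features, so absolute marginal correlation with the response is used here as a stand-in. The group size, per-group quota, target size, the greedy forward search, and all function names are likewise illustrative assumptions. The criterion is the extended BIC of Chen and Chen (Biometrika, 2008), EBIC_gamma(s) = -2 log L(s) + k log n + 2 gamma log C(p, k) for a model s with k of the p features, specialised here to a Gaussian linear model.

```python
import math
import numpy as np


def ebic_linear(y, X, subset, gamma=1.0):
    """EBIC of a Gaussian linear model on the columns listed in `subset`."""
    n, p = X.shape
    k = len(subset)
    design = np.column_stack([np.ones(n), X[:, subset]])  # intercept + chosen features
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    # n*log(RSS/n) equals -2 log-likelihood up to an additive constant.
    return n * math.log(rss / n) + k * math.log(n) + 2.0 * gamma * math.log(math.comb(p, k))


def tournament_screen(y, X, group_size=100, keep_per_group=10, target=20, seed=0):
    """Hypothetical tournament: random groups compete and top scorers advance."""
    if not keep_per_group < group_size or keep_per_group > target:
        raise ValueError("need keep_per_group < group_size and keep_per_group <= target")
    rng = np.random.default_rng(seed)
    yc = y - y.mean()
    survivors = np.arange(X.shape[1])
    while len(survivors) > target:
        rng.shuffle(survivors)  # fresh random draw of groups each round
        winners = []
        for start in range(0, len(survivors), group_size):
            group = survivors[start:start + group_size]
            Xg = X[:, group] - X[:, group].mean(axis=0)
            # Stand-in score (assumption): absolute marginal correlation with y.
            score = np.abs(Xg.T @ yc) / (np.linalg.norm(Xg, axis=0) * np.linalg.norm(yc))
            quota = min(keep_per_group, len(group))
            winners.extend(group[np.argsort(score)[-quota:]])
        survivors = np.array(winners)
    return sorted(int(j) for j in survivors)


def forward_ebic(y, X, candidates, gamma=1.0):
    """Greedy forward search over the survivors, stopping when EBIC stops falling."""
    chosen = []
    best = ebic_linear(y, X, chosen, gamma)
    while True:
        trials = [(ebic_linear(y, X, chosen + [j], gamma), j)
                  for j in candidates if j not in chosen]
        if not trials:
            break
        value, j = min(trials)
        if value >= best:
            break
        chosen.append(j)
        best = value
    return sorted(chosen)


# Simulated example: only features 0 and 1 carry signal.
rng = np.random.default_rng(1)
n, p = 100, 2000
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] - 3.0 * X[:, 1] + rng.standard_normal(n)

survivors = tournament_screen(y, X)
selected = forward_ebic(y, X, survivors)
print("survivors:", survivors)
print("selected:", selected)
```

With these settings the 2000 features are cut to 200 after one round and to 20 after a second, and EBIC is evaluated only on that short list. The combinatorial term 2 gamma log C(p, k) still uses the full p, so the penalty reflects the size of the original model space rather than the screened one.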

Author information

Corresponding author

Correspondence to ZeHua Chen.

Additional information

Dedicated to Professor Zhidong Bai on the occasion of his 65th birthday

Zehua Chen was supported by the Singapore Ministry of Education's ACRF Tier 1 (Grant No. R-155-000-065-112). Jiahua Chen was supported by the Natural Sciences and Engineering Research Council of Canada and MITACS, Canada.

About this article

Cite this article

Chen, Z., Chen, J. Tournament screening cum EBIC for feature selection with high-dimensional feature spaces. Sci. China Ser. A-Math. 52, 1327–1341 (2009). https://doi.org/10.1007/s11425-009-0089-4

  • DOI: https://doi.org/10.1007/s11425-009-0089-4
