
Statistical Papers, Volume 57, Issue 2, pp 381–405

On two simple and effective procedures for high dimensional classification of general populations

  • Zhaoyuan Li
  • Jianfeng Yao
Regular Article

Abstract

In this paper, we generalize two criteria, the determinant-based and trace-based criteria proposed by Saranadasa (J Multivar Anal 46:154–174, 1993), to general populations for high dimensional classification. These two criteria compare certain distances between a new observation and several known groups. The determinant-based criterion performs well for correlated variables by integrating the covariance structure and is competitive with many other existing rules. The criterion, however, requires the measurement dimension to be smaller than the sample size. The trace-based criterion, in contrast, is an independence rule and is effective in the “large dimension, small sample size” scenario. An appealing property of these two criteria is that their implementation is straightforward and there is no need for preliminary variable selection or tuning parameters. Their asymptotic misclassification probabilities are derived using the theory of large dimensional random matrices. Their competitive performance is illustrated by extensive Monte Carlo experiments and a real data analysis.
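To make the contrast between the two criteria concrete, the following is a minimal illustrative sketch, not the paper's exact rules. It assumes a determinant-based rule in the spirit of Saranadasa's D-criterion, which compares |A_k + c_k dd'|/|A_k| for the within-group scatter matrix A_k and deviation d = x − x̄_k (by the matrix determinant lemma this equals 1 + c_k d'A_k⁻¹d, a Mahalanobis-type quantity), and a trace-based rule that replaces A_k by its trace, yielding a scaled Euclidean distance. The function names, the constant c_k = n_k/(n_k + 1), and the multi-group extension are assumptions of this sketch.

```python
import numpy as np

def d_criterion(x, samples):
    # Determinant-based (D-type) criterion, sketched for illustration:
    # score_k = |A_k + c_k dd'| / |A_k| = 1 + c_k d' A_k^{-1} d,
    # where A_k is the within-group scatter matrix of group k and
    # d = x - xbar_k.  Requires p < n_k so that A_k is invertible.
    scores = []
    for X in samples:                       # X: (n_k, p) sample of group k
        n, p = X.shape
        xbar = X.mean(axis=0)
        A = (X - xbar).T @ (X - xbar)       # within-group scatter matrix
        d = x - xbar
        c = n / (n + 1.0)                   # assumed scaling constant
        scores.append(1.0 + c * d @ np.linalg.solve(A, d))
    return int(np.argmin(scores))           # assign to lowest-score group

def a_criterion(x, samples):
    # Trace-based (A-type) criterion: an independence rule replacing the
    # scatter matrix by its trace, i.e. a scaled squared Euclidean
    # distance to each group mean; usable even when p >> n_k.
    scores = []
    for X in samples:
        n, p = X.shape
        xbar = X.mean(axis=0)
        tr = ((X - xbar) ** 2).sum()        # trace of the scatter matrix
        d = x - xbar
        scores.append((n / (n + 1.0)) * (d @ d) / tr)
    return int(np.argmin(scores))
```

The determinant-based rule exploits the full covariance structure through A_k⁻¹ but fails when p ≥ n_k, whereas the trace-based rule ignores correlations and remains applicable in the high-dimensional regime, mirroring the trade-off described in the abstract.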

Keywords

High dimensional classification · Large sample covariance matrix · Delocalization · Determinant-based criterion · Trace-based criterion

Mathematics Subject Classification

62H30 


Acknowledgments

Jianfeng Yao is partly supported by the GRF Grant HKU 705413P.

References

  1. Bai Z, Liu H, Wong WK (2009) Enhancement of the applicability of Markowitz’s portfolio optimization by utilizing random matrix theory. Math Financ 19:639–667
  2. Bai Z, Saranadasa H (1996) Effect of high dimension: by an example of a two sample problem. Stat Sin 6:311–329
  3. Bai Z, Silverstein JW (2010) Spectral analysis of large dimensional random matrices. Science Press, Beijing
  4. Bickel P, Levina E (2004) Some theory for Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations. Bernoulli 10:989–1010
  5. Chen SX, Zhang LX, Zhong PS (2010) Tests for high dimensional covariance matrices. J Am Stat Assoc 105:810–819
  6. Cheng Y (2004) Asymptotic probabilities of misclassification of two discriminant functions in cases of high dimensional data. Stat Probab Lett 67:9–17
  7. Fan J, Fan Y (2008) High dimensional classification using features annealed independence rules. Ann Stat 36:2605–2637
  8. Fan J, Feng Y, Tong X (2012) A road to classification in high dimensional space: the regularized optimal affine discriminant. J R Stat Soc Series B 74:745–771
  9. Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7:179–188
  10. Guo Y, Hastie T, Tibshirani R (2005) Regularized discriminant analysis and its application in microarrays. Biostatistics 1:1–18. R package downloadable at http://cran.r-project.org/web/packages/ascrda/
  11. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286:531–537
  12. Lange T, Mosler K, Mozharovskyi P (2014) Fast nonparametric classification based on data depth. Stat Pap 55:49–69
  13. Leung CY (2001) Error rates in classification consisting of discrete and continuous variables in the presence of covariates. Stat Pap 42:265–273
  14. Li J, Chen SX (2012) Two sample tests for high dimensional covariance matrices. Ann Stat 40:908–940
  15. Krzyśko M, Skorzybut M (2009) Discriminant analysis of multivariate repeated measures data with a Kronecker product structured covariance matrices. Stat Pap 50:817–835
  16. Saranadasa H (1993) Asymptotic expansion of the misclassification probabilities of D- and A-criteria for discrimination from two high dimensional populations using the theory of large dimensional random matrices. J Multivar Anal 46:154–174
  17. Shao J, Wang Y, Deng X, Wang S (2011) Sparse linear discriminant analysis by thresholding for high dimensional data. Ann Stat 39:1241–1265
  18. Srivastava MS, Kollo T, Rosen D (2011) Some tests for the covariance matrix with fewer observations than the dimension under non-normality. J Multivar Anal 102:1090–1103
  19. Tibshirani R, Hastie T, Narasimhan B, Chu G (2002) Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 99:6567–6572
  20. Vapnik VN (1995) The nature of statistical learning theory. Springer, New York

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong, China
