A distance-based, misclassification rate adjusted classifier for multiclass, high-dimensional data

Abstract

In this paper, we consider a scale-adjusted-type distance-based classifier for high-dimensional data. We first construct such a classifier for two-class classification that ensures high accuracy in terms of misclassification rates. We show that the classifier is not only consistent but also asymptotically normal for high-dimensional data. We determine the sample sizes required so that the misclassification rates do not exceed a prespecified value, and we propose a classification procedure called the misclassification rate adjusted classifier. We then extend the classifier to multiclass classification and show that it retains these asymptotic properties while still ensuring high accuracy in misclassification rates. Finally, we demonstrate the proposed classifier by analyzing a microarray data set.
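The abstract does not state the classifier's formula, so the following is only a minimal sketch of one generic scale-adjusted distance rule, not necessarily the authors' exact procedure. It rests on the standard fact that, for a fixed observation x0, the expected squared distance to a class sample mean exceeds the squared distance to the true mean by tr(Σ_i)/n_i, so subtracting tr(S_i)/n_i removes that bias. The function names `class_stats`, `adjusted_distance`, and `classify` are hypothetical and chosen for illustration.

```python
def class_stats(samples):
    """Return (mean vector, trace of the unbiased sample covariance, n)."""
    n = len(samples)
    p = len(samples[0])
    mean = [sum(x[j] for x in samples) / n for j in range(p)]
    # tr(S) is the sum of per-coordinate sample variances (divisor n - 1).
    trace = sum((x[j] - mean[j]) ** 2 for x in samples for j in range(p)) / (n - 1)
    return mean, trace, n


def adjusted_distance(x0, stats):
    """Squared distance from x0 to the class mean, minus the bias term tr(S)/n."""
    mean, trace, n = stats
    squared = sum((a - b) ** 2 for a, b in zip(x0, mean))
    return squared - trace / n


def classify(x0, training):
    """Assign x0 to the class with the smallest adjusted distance.

    `training` maps each class label to a list of observations, each a list
    of p numbers; every class needs at least two observations.
    """
    stats = {label: class_stats(obs) for label, obs in training.items()}
    return min(stats, key=lambda label: adjusted_distance(x0, stats[label]))
```

Because the rule simply takes the minimum over classes, the same sketch covers both the two-class and the multiclass setting. The adjustment matters most when classes differ in sample size or total variance, since the uncorrected distance would otherwise favor the class whose mean is estimated more precisely.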

Keywords

Asymptotic normality, Distance-based classifier, HDLSS, Sample size determination, Two-stage procedure

Copyright information

© The Institute of Statistical Mathematics, Tokyo 2013

Authors and Affiliations

  1. Institute of Mathematics, University of Tsukuba, Tsukuba, Japan
