Distance-based classifier by data transformation for high-dimension, strongly spiked eigenvalue models

  • Makoto Aoshima
  • Kazuyoshi Yata


We consider classifiers for high-dimensional data under the strongly spiked eigenvalue (SSE) model. We first show that high-dimensional data often fit the SSE model. We then consider a distance-based classifier that uses the eigenstructure of the SSE model, applying the noise-reduction methodology to estimate its eigenvalues and eigenvectors. We construct a new distance-based classifier by transforming data from the SSE model to the non-SSE model. We report simulation studies, discuss the performance of the new classifier, and finally demonstrate it on microarray data sets.
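To make the notion of a distance-based classifier concrete, the following is a minimal sketch of a generic bias-corrected distance rule. It is an illustration under stated assumptions, not the classifier proposed in this paper: it assigns an observation to the class whose sample mean is closest in squared Euclidean distance, after subtracting tr(S_i)/n_i to offset the bias that comes from estimating each mean from n_i samples. It performs no eigenstructure estimation, noise reduction, or SSE-to-non-SSE transformation. All function names and the toy data are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(samples: Sequence[Vector]) -> List[float]:
    """Coordinate-wise sample mean of a list of p-dimensional observations."""
    n = len(samples)
    p = len(samples[0])
    return [sum(x[j] for x in samples) / n for j in range(p)]


def trace_sample_cov(samples: Sequence[Vector], mean: Vector) -> float:
    """Trace of the unbiased sample covariance matrix, tr(S).

    Only the diagonal is needed, so the p x p matrix is never formed.
    """
    n = len(samples)
    total = sum((x[j] - mean[j]) ** 2 for x in samples for j in range(len(mean)))
    return total / (n - 1)


def distance_score(x0: Vector, samples: Sequence[Vector]) -> float:
    """Squared distance from x0 to the class mean, minus tr(S)/n.

    The correction term is an assumption of this sketch: it removes the
    extra tr(Sigma)/n that the estimated mean adds to the expected distance.
    """
    n = len(samples)
    m = mean_vector(samples)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x0, m))
    return squared_distance - trace_sample_cov(samples, m) / n


def classify(x0: Vector, classes: Sequence[Sequence[Vector]]) -> int:
    """Return the index of the class with the smallest corrected distance."""
    scores = [distance_score(x0, samples) for samples in classes]
    return scores.index(min(scores))


# Toy usage with made-up data (two classes, p = 3, n = 4 each).
class_a = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [2.0, 2.0, 0.0]]
class_b = [[10.0, 10.0, 10.0], [12.0, 10.0, 10.0], [10.0, 12.0, 10.0], [12.0, 12.0, 10.0]]
label = classify([1.0, 1.0, 1.0], [class_a, class_b])
print("assigned class index:", label)
```

The correction matters most when the class sample sizes differ and the dimension is large, because tr(S_i)/n_i then differs noticeably between classes and would otherwise tilt the rule toward the class with more samples.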


Keywords: Asymptotic normality; Data transformation; Discriminant analysis; Large p small n; Noise-reduction methodology; Spiked model



We would like to thank two anonymous referees for their constructive comments.



Copyright information

© The Institute of Statistical Mathematics, Tokyo 2018

Authors and Affiliations

  1. Institute of Mathematics, University of Tsukuba, Tsukuba, Japan
