Distance-based classifier by data transformation for high-dimension, strongly spiked eigenvalue models

Aoshima, Makoto; Yata, Kazuyoshi

doi:10.1007/s10463-018-0655-z

Published: 10 March 2018

Annals of the Institute of Statistical Mathematics, Volume 71, pages 473–503 (2019)

Makoto Aoshima and Kazuyoshi Yata

Abstract

We consider classifiers for high-dimensional data under the strongly spiked eigenvalue (SSE) model. We first show that high-dimensional data often follow the SSE model. We then consider a distance-based classifier that uses the eigenstructure of the SSE model, applying the noise-reduction methodology to estimate its eigenvalues and eigenvectors. We create a new distance-based classifier by transforming data from the SSE model to the non-SSE model. We give simulation studies and discuss the performance of the new classifier. Finally, we demonstrate the new classifier on microarray data sets.
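As an informal illustration of the idea summarized above (transform the data so that the strongly spiked directions are removed, then classify by distance), the following Python sketch may help. It is not the authors' procedure: it uses plain sample eigenvectors from the pooled, class-centered data in place of the noise-reduction estimates, it pools the two classes (the abstract does not say how the classes are handled), and the names `fit_projector`, `remove_spikes`, `distance_classify` and the number of removed directions `k` are hypothetical choices made for this example.

```python
import numpy as np

def fit_projector(X1, X2, k):
    """Return a p x k orthonormal basis of the k leading sample
    eigen-directions of the pooled, class-centered training data.
    (Assumption: plain SVD, not the noise-reduction estimator.)"""
    Z = np.vstack([X1 - X1.mean(axis=0), X2 - X2.mean(axis=0)])
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Vt[:k].T

def remove_spikes(X, H):
    """Project X (one vector or a matrix of rows) onto the orthogonal
    complement of the columns of H."""
    return X - (X @ H) @ H.T

def distance_classify(x0, X1, X2):
    """Assign x0 to class 1 or 2 by comparing squared Euclidean
    distances to the class means, each corrected by tr(S_i) / n_i."""
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    tr1 = X1.var(axis=0, ddof=1).sum()  # trace of sample covariance S_1
    tr2 = X2.var(axis=0, ddof=1).sum()  # trace of sample covariance S_2
    d1 = np.sum((x0 - m1) ** 2) - tr1 / n1
    d2 = np.sum((x0 - m2) ** 2) - tr2 / n2
    return 1 if d1 < d2 else 2

# Synthetic example (made-up data): one strong common direction plus noise.
rng = np.random.default_rng(0)
p, n = 200, 15
spike = np.ones(p) / np.sqrt(p)
shift = np.zeros(p)
shift[:20] = 1.0  # class 2 mean differs in the first 20 coordinates

def sample(size, mean):
    scores = rng.normal(size=(size, 1)) * np.sqrt(p)  # large-variance component
    return mean + scores * spike + rng.normal(size=(size, p))

X1, X2 = sample(n, np.zeros(p)), sample(n, shift)
x0 = sample(1, shift)[0]  # a new observation drawn from class 2

H = fit_projector(X1, X2, k=1)
label = distance_classify(remove_spikes(x0, H),
                          remove_spikes(X1, H),
                          remove_spikes(X2, H))
print("predicted class:", label)
```

The same projection is applied to the training samples and to the new observation, so the distances are compared in the reduced, less spiked space. Choosing `k` and estimating the eigenvectors reliably are the hard parts in high dimensions, and this sketch does not attempt either.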

References

Ahn, J., Marron, J. S. (2010). The maximal data piling direction for discrimination. Biometrika, 97, 254–259.
Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., et al. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America, 96, 6745–6750.
Aoshima, M., Yata, K. (2011). Two-stage procedures for high-dimensional data. Sequential Analysis (Editor’s special invited paper), 30, 356–399.
Aoshima, M., Yata, K. (2014). A distance-based, misclassification rate adjusted classifier for multiclass, high-dimensional data. Annals of the Institute of Statistical Mathematics, 66, 983–1010.
Aoshima, M., Yata, K. (2015a). Geometric classifier for multiclass, high-dimensional data. Sequential Analysis, 34, 279–294.
Aoshima, M., Yata, K. (2015b). High-dimensional quadratic classifiers in non-sparse settings. arXiv preprint. arXiv:1503.04549.
Aoshima, M., Yata, K. (2018). Two-sample tests for high-dimension, strongly spiked eigenvalue models. Statistica Sinica, 28, 43–62.
Bai, Z., Saranadasa, H. (1996). Effect of high dimension: By an example of a two sample problem. Statistica Sinica, 6, 311–329.
Bickel, P. J., Levina, E. (2004). Some theory for Fisher’s linear discriminant function, “naive Bayes”, and some alternatives when there are many more variables than observations. Bernoulli, 10, 989–1010.
Cai, T. T., Liu, W. (2011). A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106, 1566–1577.
Chan, Y.-B., Hall, P. (2009). Scale adjustments for classifiers in high-dimensional, low sample size settings. Biometrika, 96, 469–478.
Chen, S. X., Qin, Y.-L. (2010). A two-sample test for high-dimensional data with applications to gene-set testing. The Annals of Statistics, 38, 808–835.
Christensen, B. C., Houseman, E. A., Marsit, C. J., Zheng, S., Wrensch, M. R., Wiemels, J. L., et al. (2009). Aging and environmental exposures alter tissue-specific DNA methylation dependent upon CpG island context. PLoS Genetics, 5, e1000602.
Dudoit, S., Fridlyand, J., Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97, 77–87.
Fan, J., Fan, Y. (2008). High-dimensional classification using features annealed independence rules. The Annals of Statistics, 36, 2605–2637.
Glaab, E., Bacardit, J., Garibaldi, J. M., Krasnogor, N. (2012). Using rule-based machine learning for candidate disease gene prioritization and sample classification of cancer gene expression data. PLoS ONE, 7, e39932.
Gravier, E., Pierron, G., Vincent-Salomon, A., Gruel, N., Raynal, V., Savignoni, A., et al. (2010). A prognostic DNA signature for T1T2 node-negative breast cancer patients. Genes, Chromosomes and Cancer, 49, 1125–1134.
Hall, P., Marron, J. S., Neeman, A. (2005). Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society, Series B, 67, 427–444.
Hall, P., Pittelkow, Y., Ghosh, M. (2008). Theoretical measures of relative performance of classifiers for high dimensional data with small sample sizes. Journal of the Royal Statistical Society, Series B, 70, 159–173.
Jeffery, I. B., Higgins, D. G., Culhane, A. C. (2006). Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data. BMC Bioinformatics, 7, 359.
Li, Q., Shao, J. (2015). Sparse quadratic discriminant analysis for high dimensional data. Statistica Sinica, 25, 457–473.
Marron, J. S., Todd, M. J., Ahn, J. (2007). Distance-weighted discrimination. Journal of the American Statistical Association, 102, 1267–1271.
McLeish, D. L. (1974). Dependent central limit theorems and invariance principles. The Annals of Probability, 2, 620–628.
Naderi, A., Teschendorff, A. E., Barbosa-Morais, N. L., Pinder, S. E., Green, A. R., Powe, D. G., et al. (2007). A gene-expression signature to predict survival in breast cancer across independent data sets. Oncogene, 26, 1507–1516.
Nakayama, Y., Yata, K., Aoshima, M. (2017). Support vector machine and its bias correction in high-dimension, low-sample-size settings. Journal of Statistical Planning and Inference, 191, 88–100.
Ramey, J. A. (2016). Datamicroarray: collection of data sets for classification. https://github.com/ramhiser/datamicroarray.
Shao, J., Wang, Y., Deng, X., Wang, S. (2011). Sparse linear discriminant analysis by thresholding for high dimensional data. The Annals of Statistics, 39, 1241–1265.
Shipp, M. A., Ross, K. N., Tamayo, P., Weng, A. P., Kutok, J. L., Aguiar, R. C., et al. (2002). Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8, 68–74.
Tian, E., Zhan, F., Walker, R., Rasmussen, E., Ma, Y., Barlogie, B., et al. (2003). The role of the Wnt-signaling antagonist DKK1 in the development of osteolytic lesions in multiple myeloma. The New England Journal of Medicine, 349, 2483–2494.
Watanabe, H., Hyodo, M., Seo, T., Pavlenko, T. (2015). Asymptotic properties of the misclassification rates for Euclidean distance discriminant rule in high-dimensional data. Journal of Multivariate Analysis, 140, 234–244.
Yata, K., Aoshima, M. (2010). Effective PCA for high-dimension, low-sample-size data with singular value decomposition of cross data matrix. Journal of Multivariate Analysis, 101, 2060–2077.
Yata, K., Aoshima, M. (2012). Effective PCA for high-dimension, low-sample-size data with noise reduction via geometric representations. Journal of Multivariate Analysis, 105, 193–215.
Yata, K., Aoshima, M. (2013). PCA consistency for the power spiked model in high-dimensional settings. Journal of Multivariate Analysis, 122, 334–354.
Yata, K., Aoshima, M. (2015). Principal component analysis based clustering for high-dimension, low-sample-size data. arXiv preprint. arXiv:1503.04525.

Acknowledgements

We would like to thank two anonymous referees for their constructive comments.

Author information

Authors and Affiliations

Institute of Mathematics, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8571, Japan
Makoto Aoshima & Kazuyoshi Yata

Corresponding author

Correspondence to Makoto Aoshima.

Additional information

Research of the first author was partially supported by Grants-in-Aid for Scientific Research (A) and Challenging Exploratory Research, Japan Society for the Promotion of Science (JSPS), under Contract Numbers 15H01678 and 26540010. Research of the second author was partially supported by Grant-in-Aid for Young Scientists (B), JSPS, under Contract Number 26800078.

About this article

Cite this article

Aoshima, M., Yata, K. Distance-based classifier by data transformation for high-dimension, strongly spiked eigenvalue models. Ann Inst Stat Math 71, 473–503 (2019). https://doi.org/10.1007/s10463-018-0655-z

Received: 26 December 2016
Revised: 12 January 2018
Published: 10 March 2018
Issue Date: 01 June 2019
DOI: https://doi.org/10.1007/s10463-018-0655-z
