Abstract
We consider the nonparametric classification of high dimensional, low sample size (HDLSS) data where the classical discrimination methods break down due to the singularity of the sample covariance matrix. We present new dissimilarity indices, discuss their asymptotic properties in the HDLSS setting, use them in building powerful classifiers, and compare their behavior with existing methods. We illustrate the difficulties with the Euclidean nearest neighbor method and prove that dissimilarity-based classifiers produce misclassification rates that tend to zero as \(p\rightarrow \infty \). We present test-based classifiers in the HDLSS setting. A simulation study compares the misclassification rates of diagonal linear discriminant analysis with twelve other nonparametric classifiers. The methods are applied to microarray data for classification of prostate cancer.
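A small simulation (not from the paper) illustrates the difficulty with the Euclidean nearest neighbor method that the abstract alludes to: as the dimension \(p\) grows, pairwise Euclidean distances concentrate around a common value, so the relative contrast between the nearest and farthest neighbor vanishes.

```python
# Illustrative sketch (not the paper's code): distance concentration in high
# dimensions. For i.i.d. N(0, I_p) points, the relative contrast
# (max - min) / mean of pairwise distances shrinks as p grows.
import numpy as np

rng = np.random.default_rng(0)
contrast = {}
for p in (10, 2000):
    X = rng.standard_normal((30, p))                  # 30 points in R^p
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d = D[np.triu_indices(30, k=1)]                   # the 435 pairwise distances
    contrast[p] = (d.max() - d.min()) / d.mean()      # relative contrast

print(contrast)  # contrast is far smaller at p = 2000 than at p = 10
```

With concentration, the identity of the "nearest" neighbor becomes unstable, which is why the paper turns to dissimilarity indices designed for the HDLSS regime.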
Acknowledgements
I would like to thank two anonymous referees for constructive suggestions.
Appendix
Proof of Lemma 3
Assume (B1)–(B3) hold. If \(\mathbf{z}\) is from \(\pi _x\), then \(\mathbb {I}_j^\delta (\mathbf{X}_i)\rightarrow 1\) and \(\mathbb {I}_j^\delta (\mathbf{Y}_j)\rightarrow 0\), so that \(T_1(NN_{\delta _0})\rightarrow k>0\) while \(T_2(NN_{\delta _0})\rightarrow 0\) with probability 1 as \(p\rightarrow \infty \). Hence \(T_1(NN_{\delta _0})>T_2(NN_{\delta _0})\) and \(\mathbf{z}\) will be assigned to \(\pi _x\) with probability 1 as \(p\rightarrow \infty \). Similarly, if \(\mathbf{z}\) is from \(\pi _y\), then \(\mathbb {I}_j^\delta (\mathbf{X}_i)\rightarrow 0\) and \(\mathbb {I}_j^\delta (\mathbf{Y}_j)\rightarrow 1\), so that \(T_1(NN_{\delta _0})\rightarrow 0\) while \(T_2(NN_{\delta _0})\rightarrow k>0\) with probability 1 as \(p\rightarrow \infty \). Hence \(T_1(NN_{\delta _0})<T_2(NN_{\delta _0})\) and \(\mathbf{z}\) will be assigned to \(\pi _y\) with probability 1 as \(p\rightarrow \infty \). Replacing \(\delta _0\) with \(\rho _0\) in the above argument establishes that the NN method with index \(\rho _0\) also has misclassification rate tending to zero as \(p\rightarrow \infty \). \(\square \)
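The counting rule behind Lemma 3 can be sketched as follows. This is a minimal illustration, not the paper's implementation: \(T_1\) and \(T_2\) count how many of the \(k\) nearest neighbors of \(\mathbf{z}\) come from the \(\mathbf{X}\) and \(\mathbf{Y}\) samples, and the mean squared coordinate difference below merely stands in for the index \(\delta _0\), which is not defined in this excerpt.

```python
# Sketch of the NN vote in Lemma 3 (stand-in dissimilarity, not delta_0).
import numpy as np

def nn_vote(z, X, Y, k=3, diss=lambda a, b: np.mean((a - b) ** 2)):
    # Score every training point by its dissimilarity to z, keep the k nearest,
    # and let the two classes vote: T1 counts X-neighbours, T2 counts Y-neighbours.
    scored = [(diss(z, x), "x") for x in X] + [(diss(z, y), "y") for y in Y]
    labels = [lab for _, lab in sorted(scored)[:k]]
    T1, T2 = labels.count("x"), labels.count("y")
    return "pi_x" if T1 > T2 else "pi_y"

X, Y = np.zeros((5, 20)), np.full((5, 20), 3.0)
print(nn_vote(np.full(20, 0.1), X, Y))  # a point near the X sample -> pi_x
```

The lemma says that, under (B1)–(B3), all \(k\) votes eventually go to the correct class as \(p\rightarrow \infty \).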
Proof of Lemma 4
Assume (B1)–(B3) hold. Using comment 1, if \(\mathbf{z}\) is from \(\pi _x\), then \(T_1(\delta _0)\rightarrow 0\) while \(T_2(\delta _0)\rightarrow m\tilde{\delta }_0({\mathbb F},{\mathbb G})>0\) with probability 1 as \(p\rightarrow \infty \). Hence \(T_1(\delta _0)<T_2(\delta _0)\) and \(\mathbf{z}\) will be assigned to \(\pi _x\) with probability 1 as \(p\rightarrow \infty \). If \(\mathbf{z}\) is from \(\pi _y\), then \(T_1(\delta _0)\rightarrow m\tilde{\delta }_0({\mathbb F},{\mathbb G})>0\) while \(T_2(\delta _0)\rightarrow 0\) with probability 1 as \(p\rightarrow \infty \). Hence \(T_1(\delta _0)>T_2(\delta _0)\) and \(\mathbf{z}\) will be assigned to \(\pi _y\) with probability 1 as \(p\rightarrow \infty \).
Comment 2 explains the behavior of \(T_1(\rho _0)\). If \(\mathbf{z}\) is from \(\pi _x\), then \(T_1(\rho _0)\rightarrow 0\) while \(T_2(\rho _0)\rightarrow m\tilde{\rho }_0({\mathbb F},{\mathbb G})>0\) with probability 1 as \(p\rightarrow \infty \). Hence \(T_1(\rho _0)<T_2(\rho _0)\) and \(\mathbf{z}\) will be assigned to \(\pi _x\) with probability 1 as \(p\rightarrow \infty \). If the new observation \(\mathbf{z}\) is from \(\pi _y\), then \(T_1(\rho _0)\rightarrow m\tilde{\rho }_0({\mathbb F},{\mathbb G})>0\) while \(T_2(\rho _0)\rightarrow 0\) with probability 1 as \(p\rightarrow \infty \). Hence \(T_1(\rho _0)>T_2(\rho _0)\) and \(\mathbf{z}\) will be assigned to \(\pi _y\) with probability 1 as \(p\rightarrow \infty \). \(\square \)
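The classifier behind Lemma 4 compares the aggregate dissimilarity from \(\mathbf{z}\) to each training sample and assigns \(\mathbf{z}\) to the class for which the statistic is smaller. A minimal sketch, again with a stand-in dissimilarity in place of the paper's \(\delta _0\) or \(\rho _0\):

```python
# Sketch of the Lemma 4 rule: assign z to the class with the smaller average
# dissimilarity (the stand-in here is the mean squared coordinate difference).
import numpy as np

def mean_diss_classify(z, X, Y, diss=lambda a, b: np.mean((a - b) ** 2)):
    T1 = np.mean([diss(z, x) for x in X])  # average dissimilarity of z to pi_x
    T2 = np.mean([diss(z, y) for y in Y])  # average dissimilarity of z to pi_y
    return "pi_x" if T1 < T2 else "pi_y"
```

This mirrors the limits in the proof: for \(\mathbf{z}\) from \(\pi _x\), the own-class statistic tends to 0 while the other tends to a positive constant, so the rule is correct with probability 1 as \(p\rightarrow \infty \).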
Proof of Lemma 1
We obtain \(T_1(BF_\delta )=2\bar{\delta }_{(x\cup z)y}-\bar{\delta }_{x\cup z}-\bar{\delta }_{y}\) and \(T_2({BF}_\delta )=2\bar{\delta }_{x(y\cup z)}-\bar{\delta }_{x}-\bar{\delta }_{y\cup z}\). Assume (B1)–(B3) hold. If \(\mathbf{z}\) is from \(\pi _x\) and assigned to \(\pi _x\), then there are \(m+1\) observations in \(x\cup z\) and n observations in the \(\mathbf{Y}\) sample. Hence, using comment 2, \(\bar{\delta }_{x\cup z}\rightarrow 0\), \(\bar{\delta }_{y}\rightarrow 0\), and \(\bar{\delta }_{(x\cup z)y}\rightarrow \tilde{\delta }_0({\mathbb F},{\mathbb G})\) as \(p\rightarrow \infty \), so \(T_1({BF}_\delta )\) converges to \(2\tilde{\delta }_0({\mathbb F},{\mathbb G})>0\) in probability as \(p\rightarrow \infty \). If the new observation \(\mathbf{z}\) is from \(\pi _x\) and assigned to \(\pi _y\), then there are m observations in the \(\mathbf{X}\) sample and \(n+1\) observations in the \(y\cup z\) sample. Hence \(\bar{\delta }_{x}\rightarrow 0\), \(\bar{\delta }_{y\cup z}\rightarrow \frac{2}{n+1}\tilde{\delta }_0({\mathbb F},{\mathbb G})\), and \(\bar{\delta }_{x(y\cup z)}\rightarrow \frac{n}{n+1}\tilde{\delta }_0({\mathbb F},{\mathbb G})\), so \(T_2({BF}_\delta )\) converges to \(\frac{2(n-1)}{n+1}\tilde{\delta }_0({\mathbb F},{\mathbb G})\) in probability as \(p\rightarrow \infty \). Since \(T_1(BF_{\delta _0})>T_2(BF_{\delta _0})\), \(\mathbf{z}\) will be assigned to \(\pi _x\) with probability 1 as \(p\rightarrow \infty \).
If \(\mathbf{z}\) is from \(\pi _y\) and assigned to \(\pi _x\), then there are \(m+1\) observations in \(x\cup z\) and n observations in the \(\mathbf{Y}\) sample. Hence \(\bar{\delta }_{x\cup z}\rightarrow \frac{2}{m+1}\tilde{\delta }_0({\mathbb F},{\mathbb G})\), \(\bar{\delta }_{y}\rightarrow 0\), and \(\bar{\delta }_{(x\cup z)y}\rightarrow \frac{m}{m+1}\tilde{\delta }_0({\mathbb F},{\mathbb G})\) as \(p\rightarrow \infty \), so \(T_1({BF}_\delta )\) converges to \(\frac{2(m-1)}{m+1}\tilde{\delta }_0({\mathbb F},{\mathbb G})>0\) in probability as \(p\rightarrow \infty \). If \(\mathbf{z}\) is from \(\pi _y\) and assigned to \(\pi _y\), then there are m observations in the \(\mathbf{X}\) sample and \(n+1\) observations in the \(y\cup z\) sample; every pair between the two samples is then a between-class pair. Hence \(\bar{\delta }_{x}\rightarrow 0\), \(\bar{\delta }_{y\cup z}\rightarrow 0\), and \(\bar{\delta }_{x(y\cup z)}\rightarrow \tilde{\delta }_0({\mathbb F},{\mathbb G})\), so \(T_2({BF}_\delta )\) converges to \(2\tilde{\delta }_0({\mathbb F},{\mathbb G})\) in probability as \(p\rightarrow \infty \). Since \(T_1(BF_{\delta _0})<T_2(BF_{\delta _0})\), \(\mathbf{z}\) will be assigned to \(\pi _y\) with probability 1 as \(p\rightarrow \infty \). \(\square \)
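The statistic in Lemma 1 is an energy-type (Baringhaus–Franz) contrast: twice the mean cross-sample dissimilarity minus the two within-sample means, computed once with \(\mathbf{z}\) pooled into each class. A minimal sketch, with a stand-in dissimilarity in place of \(\delta _0\) and the assignment rule read off the lemma's conclusion (\(\mathbf{z}\) goes to the class yielding the larger statistic):

```python
# Sketch of the BF-type classifier in Lemma 1 (stand-in dissimilarity).
import numpy as np

def mean_cross(A, B, diss):
    # average dissimilarity over all between-sample pairs
    return np.mean([diss(a, b) for a in A for b in B])

def mean_within(S, diss):
    # average dissimilarity over all within-sample pairs
    n = len(S)
    return np.mean([diss(S[i], S[j]) for i in range(n) for j in range(i + 1, n)])

def bf_classify(z, X, Y, diss=lambda a, b: np.mean((a - b) ** 2)):
    Xz, Yz = np.vstack([X, z[None, :]]), np.vstack([Y, z[None, :]])
    T1 = 2 * mean_cross(Xz, Y, diss) - mean_within(Xz, diss) - mean_within(Y, diss)
    T2 = 2 * mean_cross(X, Yz, diss) - mean_within(X, diss) - mean_within(Yz, diss)
    return "pi_x" if T1 > T2 else "pi_y"
```

Pooling \(\mathbf{z}\) into the wrong class contaminates that class's within-sample mean, which is exactly why the wrong-pooling statistic stays strictly below \(2\tilde{\delta }_0({\mathbb F},{\mathbb G})\) in the proof.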
Proof of Lemma 2
We obtain \(T_1({BG}_\delta )=(\bar{\delta }_{(x\cup z) y}- \bar{\delta }_{x\cup z})^2+(\bar{\delta }_{(x\cup z) y}- \bar{\delta }_{y})^2\) and \(T_2({BG}_\delta )=(\bar{\delta }_{x (y\cup z)}- \bar{\delta }_{y\cup z})^2+(\bar{\delta }_{x(y\cup z)}- \bar{\delta }_{x})^2\). Assume (B1)–(B3) hold. If \(\mathbf{z}\) is from \(\pi _x\) and assigned to \(\pi _x\), then there are \(m+1\) observations in \(x\cup z\) and n observations in the \(\mathbf{Y}\) sample. Hence \(\bar{\delta }_{x\cup z}\rightarrow 0\), \(\bar{\delta }_{y}\rightarrow 0\), and \(\bar{\delta }_{(x\cup z)y}\rightarrow \tilde{\delta }_0({\mathbb F},{\mathbb G})\) as \(p\rightarrow \infty \), so, using comment 2, \(T_1({BG}_\delta )\) converges to \(2\tilde{\delta }^2_0({\mathbb F},{\mathbb G})>0\) in probability as \(p\rightarrow \infty \). If \(\mathbf{z}\) is from \(\pi _x\) and assigned to \(\pi _y\), then there are m observations in the \(\mathbf{X}\) sample and \(n+1\) observations in the \(y\cup z\) sample. Hence \(\bar{\delta }_{x}\rightarrow 0\), \(\bar{\delta }_{y\cup z}\rightarrow \frac{2}{n+1}\tilde{\delta }_0({\mathbb F},{\mathbb G})\), and \(\bar{\delta }_{x(y\cup z)}\rightarrow \frac{n}{n+1}\tilde{\delta }_0({\mathbb F},{\mathbb G})\), so \(T_2({BG}_\delta )\) converges to \(\frac{(n-2)^2+n^2}{(n+1)^2}\tilde{\delta }^2_0({\mathbb F},{\mathbb G})\) in probability as \(p\rightarrow \infty \). Since \(T_1(BG_{\delta _0})>T_2(BG_{\delta _0})\), \(\mathbf{z}\) will be assigned to \(\pi _x\) with probability 1 as \(p\rightarrow \infty \).
If \(\mathbf{z}\) is from \(\pi _y\) and assigned to \(\pi _x\), then there are \(m+1\) observations in \(x\cup z\) and n observations in the \(\mathbf{Y}\) sample. Hence, using comment 1, \(\bar{\delta }_{x\cup z}\rightarrow \frac{2}{m+1}\tilde{\delta }_0({\mathbb F},{\mathbb G})\), \(\bar{\delta }_{y}\rightarrow 0\), and \(\bar{\delta }_{(x\cup z)y}\rightarrow \frac{m}{m+1}\tilde{\delta }_0({\mathbb F},{\mathbb G})\) as \(p\rightarrow \infty \), so \(T_1({BG}_\delta )\) converges to \(\frac{(m-2)^2+m^2}{(m+1)^2}\tilde{\delta }^2_0({\mathbb F},{\mathbb G})>0\) in probability as \(p\rightarrow \infty \). If the new observation \(\mathbf{z}\) is from \(\pi _y\) and assigned to \(\pi _y\), then there are m observations in the \(\mathbf{X}\) sample and \(n+1\) observations in the \(y\cup z\) sample; every pair between the two samples is then a between-class pair. Hence \(\bar{\delta }_{x}\rightarrow 0\), \(\bar{\delta }_{y\cup z}\rightarrow 0\), and \(\bar{\delta }_{x(y\cup z)}\rightarrow \tilde{\delta }_0({\mathbb F},{\mathbb G})\), so \(T_2({BG}_\delta )\) converges to \(2\tilde{\delta }^2_0({\mathbb F},{\mathbb G})\) in probability as \(p\rightarrow \infty \). Since \(T_1(BG_{\delta _0})<T_2(BG_{\delta _0})\), \(\mathbf{z}\) will be assigned to \(\pi _y\) with probability 1 as \(p\rightarrow \infty \). \(\square \)
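The BG statistic of Lemma 2 squares the gaps between the cross-sample mean and each within-sample mean, again once per pooling of \(\mathbf{z}\). A minimal sketch under the same stand-in dissimilarity, with the larger statistic winning as in the lemma:

```python
# Sketch of the BG-type classifier in Lemma 2 (stand-in dissimilarity).
import numpy as np

def _cross(A, B, diss):
    return np.mean([diss(a, b) for a in A for b in B])

def _within(S, diss):
    n = len(S)
    return np.mean([diss(S[i], S[j]) for i in range(n) for j in range(i + 1, n)])

def bg_classify(z, X, Y, diss=lambda a, b: np.mean((a - b) ** 2)):
    Xz, Yz = np.vstack([X, z[None, :]]), np.vstack([Y, z[None, :]])
    c1, c2 = _cross(Xz, Y, diss), _cross(X, Yz, diss)
    # squared gaps between the cross mean and the two within-sample means
    T1 = (c1 - _within(Xz, diss)) ** 2 + (c1 - _within(Y, diss)) ** 2
    T2 = (c2 - _within(Yz, diss)) ** 2 + (c2 - _within(X, diss)) ** 2
    return "pi_x" if T1 > T2 else "pi_y"
```

Squaring makes the statistic sensitive to both gaps symmetrically, which is why the limiting constants in the proof involve \(\tilde{\delta }^2_0({\mathbb F},{\mathbb G})\) rather than \(\tilde{\delta }_0({\mathbb F},{\mathbb G})\).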
Modarres, R. Nonparametric classification of high dimensional observations. Stat Papers 64, 1833–1859 (2023). https://doi.org/10.1007/s00362-022-01363-3