High dimensional binary classification under label shift: phase transition and regularization

Abstract

Label shift is widely believed to be harmful to the generalization performance of machine learning models. Researchers have proposed many approaches to mitigate its impact, e.g., balancing the training data. However, these methods usually consider the underparametrized regime, where the sample size is much larger than the data dimension; research in the overparametrized regime is very limited. To bridge this gap, we propose a new asymptotic analysis of the Fisher Linear Discriminant classifier for binary classification with label shift. Specifically, we prove that a phase transition phenomenon exists: in a certain overparametrized regime, the classifier trained on imbalanced data outperforms its counterpart trained on reduced, balanced data. Moreover, we investigate the impact of regularization on the label shift: the aforementioned phase transition vanishes as the regularization becomes strong.

References

  1. Anderson, T.W.: An Introduction to Multivariate Statistical Analysis. Wiley, New York (1962)

  2. Bai, Z., Silverstein, J.W.: Spectral Analysis of Large Dimensional Random Matrices. Springer (2010)

  3. Bai, Z.D., Yin, Y.Q.: Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. Ann. Probab. 21(3), 1275–1294 (1993). https://doi.org/10.1214/aop/1176989118

  4. Barandela, Ricardo, Rangel, E., Sánchez, José Salvador., Ferri, Francesc J.: Restricted decontamination for the imbalanced training sample problem. In Iberoamerican congress on pattern recognition, pages 424–431. Springer, (2003)

  5. Bartlett, Peter L., Long, Philip M., Lugosi, Gábor, Tsigler, Alexander: Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117 (48):30063–30070, (2020)

  6. Belkin, Mikhail, Hsu, Daniel, Ma, Siyuan, Mandal, Soumik: Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences 116(32), 15849–15854 (2019)

  7. Belkin, Mikhail, Hsu, Daniel, Ji, Xu.: Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science 2(4), 1167–1180 (2020)

  8. Bibas, Koby, Fogel, Yaniv, Feder, Meir.: A new look at an old problem: A universal learning approach to linear regression. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 2304–2308. IEEE, (2019)

  9. Bickel, P.J., Levina, E.: Some theory for Fisher's linear discriminant function, "naive Bayes", and some alternatives when there are many more variables than observations. Bernoulli 10(6), 989–1010 (2004)

  10. Bishop, Christopher M.: Neural networks for pattern recognition. (1995)

  11. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)

  12. Branco, Paula, Torgo, Luis, Ribeiro, Rita: A survey of predictive modelling under imbalanced distributions, (2015)

  13. Cai, Tony, Liu, Weidong: A direct estimation approach to sparse linear discriminant analysis. Journal of the American statistical association 106(496), 1566–1577 (2011)

  14. Cao, Kaidi, Wei, Colin, Gaidon, Adrien, Arechiga, Nikos, Ma, Tengyu.: Learning imbalanced datasets with label-distribution-aware margin loss. (2019). arXiv:1906.07413 [cs.LG]

  15. Chan, Yee Seng, Ng, Hwee Tou: Word sense disambiguation with distribution estimation. page 1010-1015, (2005)

  16. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)

  17. Cui, Yin, Jia, Menglin, Lin, Tsung-Yi, Song, Yang, Belongie, Serge.: Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9268–9277, (2019)

  18. Deng, Z., Kammoun, A., Thrampoulidis, C.: A model of double descent for high-dimensional binary linear classification (2019)

  19. Dereziński, Michał., Liang, Feynman., Mahoney, Michael W.: Exact expressions for double descent and implicit regularization via surrogate random design. arXiv preprint arXiv:1912.04533, (2019)

  20. Duin, Robert PW.: Small sample size generalization. In Proceedings of the Scandinavian Conference on Image Analysis, volume 2, pages 957–964. PROCEEDINGS PUBLISHED BY VARIOUS PUBLISHERS, (1995)

  21. Elkan, Charles.: The foundations of cost-sensitive learning. page 973-978, (2001)

  22. Elkhalil, Khalil, Kammoun, Abla, Couillet, Romain, Al-Naffouri, Tareq Y., Alouini, Mohamed-Slim.: A large dimensional study of regularized discriminant analysis. IEEE Transactions on Signal Processing, 68: 2464–2479, (2020)

  23. Fisher, Ronald A.: The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2):179–188, (1936)

  24. Friedman, Jerome H.: Regularized discriminant analysis. Journal of the American statistical association, 84 (405):165–175, (1989)

  25. Fukunaga, Keinosuke.: Introduction to statistical pattern recognition. Elsevier, (2013)

  26. Grzymala-Busse, Jerzy W., Goodwin, Linda K., Grzymala-Busse, Witold J., Zheng, Xinqun.: An approach to imbalanced data sets based on changing rule strength. In Rough-neural computing, pages 543–553. Springer, (2004)

  27. Guo, Yaqian, Hastie, Trevor, Tibshirani, Robert: Regularized linear discriminant analysis and its application in microarrays. Biostatistics 8(1), 86–100 (2007)

  28. Guo, H., Li, Y., Shang, J., Gu, M., Huang, Y., Gong, B.: Learning from class-imbalanced data: review of methods and applications. Expert Systems with Applications 73, 220–239 (2017)

  29. Hastie, Trevor, Montanari, Andrea, Rosset, Saharon, Tibshirani, Ryan J.: Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, (2019)

  30. He, Haibo, Garcia, Edwardo A.: Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9):1263–1284, (2009)

  31. Hong, Xia., Chen, Sheng, Harris, Chris J.: A kernel-based two-class classifier for imbalanced data sets. IEEE Transactions on neural networks, 18 (1):28–41, (2007)

  32. Hua, Jianping, Xiong, Zixiang, Lowey, James, Suh, Edward, Dougherty, Edward R.: Optimal number of features as a function of sample size for various classification rules. Bioinformatics, 21(8):1509–1515, (2005)

  33. Huang, Chen, Li, Yining, Loy, Chen Change., Tang, Xiaoou: Learning deep representation for imbalanced classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5375–5384, (2016)

  34. Hughes, Gordon: On the mean accuracy of statistical pattern recognizers. IEEE transactions on information theory 14(1), 55–63 (1968)

  35. Wolfram Research, Inc.: Mathematica, Version 12.2. Champaign, IL (2020). https://www.wolfram.com/mathematica

  36. Japkowicz, Nathalie, Stephen, Shaju: The class imbalance problem: A systematic study. Intelligent data analysis 6(5), 429–449 (2002)

  37. Jo, Taeho, Japkowicz, Nathalie: Class imbalances versus small disjuncts. ACM Sigkdd Explorations Newsletter 6(1), 40–49 (2004)

  38. Kingma, Diederik P., Ba, Jimmy.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, (2014)

  39. Krizhevsky, Alex, et al: Learning multiple layers of features from tiny images. (2009)

  40. Kubat, Miroslav, Matwin, Stan, et al.: Addressing the curse of imbalanced training sets: one-sided selection. In International Conference on Machine Learning, volume 97, pages 179–186. Citeseer, (1997)

  41. Kukar, Matjaz, Kononenko, Igor, et al.: Cost-sensitive learning with neural networks. In ECAI, volume 15, pages 88–94. Citeseer, (1998)

  42. LeCun, Yann, Bottou, Léon., Bengio, Yoshua, Haffner, Patrick: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)

  43. Lehmann, Erich L.: Fisher, Neyman, and the creation of classical statistics. Springer Science & Business Media, (2011)

  44. Lipton, Zachary, Wang, Yu-Xiang, Smola, Alexander.: Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning, pages 3122–3130. PMLR, (2018)

  45. Liu, Xu-Ying, Wu, Jianxin, Zhou, Zhi-Hua: Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):539–550, (2008)

  46. Liu, Yang, An, Aijun, Huang, Xiangji: Boosting prediction accuracy on imbalanced datasets with svm ensembles. In Pacific-Asia conference on knowledge discovery and data mining, pages 107–118. Springer, (2006)

  47. Mac Namee, B., Cunningham, P., Byrne, S., Corrigan, O.I.: The problem of bias in training data in regression problems in medical decision support. Artificial Intelligence in Medicine 24(1), 51–70 (2002)

  48. Mai, Qing, Zou, Hui, Yuan, Ming: A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika 99(1), 29–42 (2012)

  49. Maloof, Marcus A.: Learning when data sets are imbalanced and when costs are unequal and unknown. In ICML-2003 workshop on learning from imbalanced data sets II, volume 2, pages 2–1, (2003)

  50. Marcenko, V.A., Pastur, L.A.: Distribution of eigenvalues for some sets of random matrices. Math. USSR-Sbornik 1, 457–483 (1967)

  51. Mazurowski, Maciej A., Habas, Piotr A., Zurada, Jacek M., Lo, Joseph Y., Baker, Jay A., Tourassi, Georgia D.: Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural networks, 21(2-3):427–436, (2008)

  52. Mei, Song., Montanari, Andrea.: The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, (2019)

  53. Montanari, A., Ruan, F., Sohn, Y., Yan, J.: The generalization error of max-margin linear classifiers: high-dimensional asymptotics in the overparametrized regime (2019)

  54. Muthukumar, Vidya, Vodrahalli, Kailas, Subramanian, Vignesh, Sahai, Anant: Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory 1(1), 67–83 (2020)

  55. Nakkiran, Preetum.: More data can hurt for linear regression: Sample-wise double descent. arXiv preprint arXiv:1912.07242, (2019)

  56. Chan, P.K., Stolfo, S.J.: Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 164–168 (1998)

  57. Pinsky, M.A.: An Introduction to Probability Theory and Its Applications, Vol. 2 (William Feller). SIAM Rev. 14(4), 662–663 (1972). https://doi.org/10.1137/1014119

  58. Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D. (eds.): Dataset Shift in Machine Learning (When Training and Test Sets Are Different: Characterizing Learning Transfer). MIT Press (2009)

  59. Radivojac, Predrag, Chawla, Nitesh V., Dunker, A Keith, Obradovic, Zoran: Classification and knowledge discovery in protein databases. Journal of Biomedical Informatics, 37(4): 224–239, (2004)

  60. Raudys, Sarunas, Duin, Robert PW.: Expected classification error of the fisher linear classifier with pseudo-inverse covariance matrix. Pattern recognition letters 19(5–6), 385–392 (1998)

  61. Seiffert, Chris, Khoshgoftaar, Taghi M., Van Hulse, Jason, Napolitano, Amri: Rusboost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 40(1):185–197, (2009)

  62. Shao, Jun, Wang, Yazhen, Deng, Xinwei, Wang, Sijian, et al.: Sparse linear discriminant analysis by thresholding for high dimensional data. Annals of Statistics 39(2), 1241–1265 (2011)

  63. Sifaou, Houssem, Kammoun, Abla, Alouini, Mohamed-Slim: High-dimensional linear discriminant analysis classifier for spiked covariance model. Journal of Machine Learning Research, 21, (2020)

  64. Sima, Chao, Dougherty, Edward R.: The peaking phenomenon in the presence of feature-selection. Pattern Recognition Letters, 29(11): 1667–1674, (2008)

  65. Storkey, Amos J.: When training and test sets are different: characterising learning transfer. pages 3–28, (2009)

  66. Sun, Yanmin, Kamel, Mohamed S, Wong, Andrew KC., Wang, Yang: Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12):3358–3378, (2007)

  67. Velilla, Santiago, Hernández, Adolfo: On the consistency properties of linear and quadratic discriminant analyses. Journal of multivariate analysis 96(2), 219–236 (2005)

  68. Wang, Benjamin X., Japkowicz, Nathalie.: Boosting support vector machines for imbalanced data sets. Knowledge and information systems, 25(1): 1–20, (2010)

  69. Wang, Cheng, Jiang, Binyan: On the dimension effect of regularized linear discriminant analysis. Electron. J. Statist. 12(2), 2709–2742 (2018). https://doi.org/10.1214/18-EJS1469

  70. Wang, Yu-Xiong, Ramanan, Deva, Hebert, Martial.: Learning to model the tail. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 7032–7042, (2017)

  71. Weiss, Gary M.: Mining with rarity: a unifying framework. ACM Sigkdd Explorations Newsletter, 6(1): 7–19, (2004)

  72. Xiao, Jianxiong, Hays, James, Ehinger, Krista A., Oliva, Aude, Torralba, Antonio.: Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE, (2010)

  73. Xu, Ji, Hsu, Daniel.: On the number of variables to use in principal component regression. arXiv preprint arXiv:1906.01139, (2019)

  74. Zhao, Eric, Liu, Anqi, Anandkumar, Animashree, Yue, Yisong.: Active learning under label shift. (2021). arXiv:2007.08479 [cs.LG]

  75. Zollanvari, Amin, Dougherty, Edward R.: Generalized consistent error estimator of linear discriminant analysis. IEEE transactions on signal processing, 63 (11):2804–2814, (2015)

  76. Zollanvari, Amin, Braga-Neto, Ulisses M., Dougherty, Edward R.: Analytic study of performance of error estimators for linear discriminant analysis. IEEE Transactions on Signal Processing, 59 (9):4238–4255, (2011)

Author information

Corresponding author

Correspondence to Wenjing Liao.

Additional information

Communicated by Mark Iwen.

This research is partially supported by NSF DMS 2012652 and NSF CAREER 2145167.

Appendices

A Proof of the phase transition in Sect. 3.2

Misclassification error as \(n_1/ n_0 \rightarrow \infty \)

In this section, we prove that the misclassification error tends to 0.5 as \(n_1/n_0\rightarrow \infty \). When the training data set is extremely imbalanced, i.e., \(n_1/n_0\rightarrow \infty \) and hence \(\gamma _0/\gamma _1\rightarrow \infty \), the classifier tends to assign all data points to class 1. This leads to the following limits,

$$\begin{aligned} \frac{\Delta ^2+\gamma _0-\gamma _1}{\sqrt{\Delta ^2+\gamma _0+\gamma _1}}\rightarrow +\infty ,\qquad \frac{\Delta ^2+\gamma _1-\gamma _0}{\sqrt{\Delta ^2+\gamma _0+\gamma _1}}\rightarrow -\infty . \end{aligned}$$

Then the limit of misclassification error is given by

$$\begin{aligned} \mathcal {R}(f_{\widehat{\alpha }, \widehat{\beta }}^{\widehat{b}})\overset{\mathrm{a.s.}}{\longrightarrow }\frac{1}{2}\Phi (-\infty )+\frac{1}{2}\Phi (+\infty )=0.5. \end{aligned}$$

The proof is complete.
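For intuition (our illustration, not part of the proof), the following Python sketch trains a Fisher-type linear discriminant on an extremely imbalanced Gaussian sample and evaluates it on a balanced test set; the empirical error is close to 0.5. The classifier form \(\widehat{\beta }=\widehat{\Sigma }^\dagger (\widehat{\mu }_0-\widehat{\mu }_1)\) with intercept \(\ln (n_0/n_1)\) is assumed here and may differ in minor details from \(f_{\widehat{\alpha }, \widehat{\beta }}^{\widehat{b}}\) as defined in the main text.

    import numpy as np

    rng = np.random.default_rng(0)
    p, n0, n1, n_test = 50, 20, 20_000, 5_000
    mu0, mu1 = np.zeros(p), np.ones(p) / np.sqrt(p)   # ||mu0 - mu1||_2 = 1

    # training data (identity covariance), heavily imbalanced toward class 1
    X0 = mu0 + rng.standard_normal((n0, p))
    X1 = mu1 + rng.standard_normal((n1, p))
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Z = np.vstack([X0 - m0, X1 - m1])
    Sigma_hat = Z.T @ Z / (n0 + n1 - 2)               # pooled sample covariance

    beta = np.linalg.pinv(Sigma_hat) @ (m0 - m1)

    def predict(X):
        # assign class 0 when the discriminant score (with prior-ratio intercept) is positive
        score = (X - (m0 + m1) / 2) @ beta + np.log(n0 / n1)
        return np.where(score > 0, 0, 1)

    T0 = mu0 + rng.standard_normal((n_test, p))
    T1 = mu1 + rng.standard_normal((n_test, p))
    err = 0.5 * np.mean(predict(T0) != 0) + 0.5 * np.mean(predict(T1) != 1)
    print(err)   # close to 0.5: nearly everything is assigned to class 1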

Phase transition knots

In this section, we provide theoretical justifications of the phase transition knots given in Sect. 3.2. Denote the asymptotic misclassification error in (6) and (7) as \(\mathcal {R}(\gamma _0,\gamma _1)\). We fix \(\gamma _0\) and let \(\gamma _1\) vary, starting from the balanced case with \(\gamma _1=\gamma _0\).

The transition knots are obtained by a local analysis of the instantaneous change of the misclassification error as \(n_1/n_0\) slightly increases from 1. We observe that, as \(n_1/n_0\) slightly increases from 1, the misclassification error decreases in Phases I and III, and increases in Phase II. Notice that \(\gamma _1\) slightly decreases from \(\gamma _0\) as \(n_1/n_0\) slightly increases from 1.

The instantaneous change of \(\mathcal {R}(\gamma _0,\gamma _1)\) with respect to \(\gamma _1\) can be characterized by the following partial derivative, \(\frac{\partial }{\partial \gamma _1}\mathcal {R}(\gamma _0,\gamma _1)\,\vert \,_{\gamma _1=\gamma _0}\):

$$\begin{aligned} \frac{\partial }{\partial \gamma _1}\mathcal {R}\,\vert \,_{\gamma _1=\gamma _0} = \frac{\Delta ^2 \phi \left( \frac{-\Delta ^2(1-\frac{1}{2}\gamma _0)^{\frac{1}{2}}}{2\sqrt{\Delta ^2+2\gamma _0}}\right) }{16(\Delta ^2+2\gamma _0)^{\frac{3}{2}}(1-\frac{1}{2}\gamma _0)^{\frac{1}{2}}} \times [4+\Delta ^2], \end{aligned}$$

for \(\gamma _0<2\), and

$$\begin{aligned} \frac{\partial }{\partial \gamma _1}\mathcal {R}\,\vert \,_{\gamma _1=\gamma _0} = \frac{\Delta ^2\phi \left( \frac{-\Delta ^2(\frac{1}{2}\gamma _0-1)^{\frac{1}{2}}}{\gamma _0\sqrt{\Delta ^2+2\gamma _0}}\right) }{4\gamma _0^2(\Delta ^2+2\gamma _0)^{\frac{3}{2}}(\frac{1}{2}\gamma _0-1)^{\frac{1}{2}}} \times \underbrace{[4\gamma _0^2-(12-\Delta ^2)\gamma _0-4\Delta ^2]}_{Q(\gamma _0,\Delta )}, \end{aligned}$$

for \(\gamma _0>2\), where \(\phi \) is the probability density function of the standard normal distribution. The term \(Q(\gamma _0,\Delta )\) is a quadratic function of \(\gamma _0\) with two roots of opposite signs. The positive root is \(\gamma _b=\frac{1}{8}(12-\Delta ^2+\sqrt{\Delta ^4+40\Delta ^2+144})\). The sign of the above partial derivative falls into the following cases:

  • When \(\gamma _0\in (0,2)\), \(\frac{\partial }{\partial \gamma _1}\mathcal {R}(\gamma _0,\gamma _1)\,\vert \,_{\gamma _1=\gamma _0}\) is always positive. As a result, \(\mathcal {R}(\gamma _0,\gamma _1)\) decreases as \(\gamma _1\) decreases from \(\gamma _0\), which corresponds to Phase I.

  • When \(\gamma _0 \in (2,\gamma _b)\), \(\frac{\partial }{\partial \gamma _1}\mathcal {R}(\gamma _0,\gamma _1)\,\vert \,_{\gamma _1=\gamma _0}\) is negative. In this case \(\mathcal {R}(\gamma _0,\gamma _1)\) increases as \(\gamma _1\) decreases from \(\gamma _0\), which corresponds to Phase II.

  • When \(\gamma _0 \in (\gamma _b,+\infty )\), \(\frac{\partial }{\partial \gamma _1}\mathcal {R}(\gamma _0,\gamma _1)\,\vert \,_{\gamma _1=\gamma _0}\) is positive. In this case \(\mathcal {R}(\gamma _0,\gamma _1)\) decreases as \(\gamma _1\) decreases from \(\gamma _0\), which corresponds to Phase III.
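As a quick numerical sanity check (ours, not from the paper), the following Python snippet verifies that \(\gamma _b\) is the positive root of \(Q(\gamma _0,\Delta )=4\gamma _0^2-(12-\Delta ^2)\gamma _0-4\Delta ^2\), and that Q is negative on \((2,\gamma _b)\) and positive beyond \(\gamma _b\).

    import numpy as np

    def gamma_b(delta):
        # positive root of Q(gamma, Delta) = 4 gamma^2 - (12 - Delta^2) gamma - 4 Delta^2
        d2 = delta ** 2
        return (12 - d2 + np.sqrt(d2 ** 2 + 40 * d2 + 144)) / 8

    def Q(gamma0, delta):
        d2 = delta ** 2
        return 4 * gamma0 ** 2 - (12 - d2) * gamma0 - 4 * d2

    for delta in [1.0, 2.0, 3.0]:
        gb = gamma_b(delta)
        # Q vanishes at the knot, is negative between 2 and gamma_b, and positive beyond it
        print(delta, gb, Q(gb, delta), Q((2 + gb) / 2, delta) < 0, Q(gb + 1, delta) > 0)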

B Proofs of Lemmas in Sect. 6.1

B.1 Proofs of Lemmas 4 and 5

Proof of Lemma 4

We prove this result by Basu's theorem, i.e., we show that \(\widehat{\mu }\) is a complete and sufficient statistic for \(\mu \) and that \(\widehat{\Sigma }\) is an ancillary statistic whose distribution does not depend on \(\mu \).

First, we show that \(\widehat{\mu }\) is a complete statistic. We need to check that, for any measurable function g, \(\mathbb {E}[g(\hat{\mu })] = 0\) for every \(\mu \) implies \(\mathbb {P}(g(\hat{\mu }) = 0) = 1\) for every \(\mu \). Indeed, for any measurable function g such that the expectation of \(g(\widehat{\mu })\) over the sample \((x_1,x_2,\dots ,x_n)\) is zero, i.e.,

$$\begin{aligned} \mathbb {E}[g(\widehat{\mu })] = 0\quad \text {for any}~\mu , \end{aligned}$$
(30)

we can derive \(\mathbb {P}(g(\hat{\mu }) = 0) = 1\) by taking derivatives of Equation (30) with respect to \(\mu \) repeatedly, which yields

$$\begin{aligned} \mathbb {E}\big [ h(\widehat{\mu }) g(\widehat{\mu })\big ] = 0 \quad \text {for any polynomial } h, \end{aligned}$$

and therefore \(\widehat{\mu }\) is a complete statistic w.r.t. parameter \({\mu }\).

To prove that \(\widehat{\mu }\) is also a sufficient statistic for \({\mu }\), we need to show that, given the statistic \(\widehat{\mu }\), the conditional distribution of \(x_1,\dots ,x_n\) does not depend on \(\mu \). Note that \(\widehat{\mu }\) has a multivariate normal distribution, i.e., \(\widehat{\mu } \sim \mathcal {N}({\mu }, \frac{1}{n}\Sigma )\), since \(\widehat{\mu } = \frac{1}{n} \sum _{i=1}^n x_i\) is a linear combination of the i.i.d. multivariate normal vectors \(x_1,x_2, \dots , x_n\). The pdf of \(\widehat{\mu }\) and the joint density of \(x_1, x_2, \dots , x_n\) are given by

$$\begin{aligned} f(\widehat{\mu } )&= \frac{1}{(2\pi )^{\frac{p}{2}}\vert \frac{1}{n}\Sigma \vert ^{\frac{1}{2}}} \exp \left( -\frac{n}{2}(\widehat{\mu } - {\mu })^\top \Sigma ^{-1} (\widehat{\mu } - {\mu })\right) ,\nonumber \\ f(x_1,\dots ,x_n)&= \frac{1}{(2\pi )^\frac{np}{2} \vert \Sigma \vert ^\frac{n}{2}} \exp \left( -\sum _{i=1}^n \frac{1}{2}(x_i-{\mu })^\top \Sigma ^{-1} (x_i-{\mu })\right) . \end{aligned}$$
(31)

The joint density function of \(x_1, \dots , x_n\) and \(\hat{\mu }\) is given by

$$\begin{aligned} f(x_1,\dots ,x_n,\widehat{\mu }) =f(x_1,\dots ,x_n) \mathbbm {1}\left( \widehat{\mu }=\frac{1}{n}(x_1+x_2+\dots +x_n)\right) . \end{aligned}$$
(32)

By taking the ratio of (32) to (31), the conditional density of \(x_1,\dots , x_n\) given \(\hat{\mu }\) is

$$\begin{aligned} f\left( x_1,\dots ,x_n\,\vert \,\widehat{\mu }\right) = C \exp \left( -\frac{1}{2}\sum _{i=1}^n(x_i-\widehat{\mu })^\top \Sigma ^{-1} (x_i-\widehat{\mu })\right) , \end{aligned}$$
(33)

where C is a constant. By the Fisher–Neyman factorization theorem [43], since the conditional distribution of \(x_1,\dots ,x_n\) given \(\widehat{\mu }\) does not depend on \(\mu \), \(\hat{\mu }\) is a sufficient statistic for \({\mu }\).

The sample covariance has a distribution that does not depend on the parameter \({\mu }\):

$$\begin{aligned} \widehat{\Sigma }= \sum _{i=1}^n (x_i - \widehat{\mu })^\top (x_i - \widehat{\mu }) = \sum _{i=1}^n z_i^\top z_i, \end{aligned}$$
(34)

where \(z_i = x_i - \widehat{\mu } = (x_i-\mu ) - (\widehat{\mu }-\mu )\) is a function of the centered samples \(x_1-\mu ,\dots ,x_n-\mu \sim \mathcal {N}(0,\Sigma )\), whose joint distribution does not involve \(\mu \). Therefore \(\widehat{\Sigma }\) is an ancillary statistic.

Combining the facts that \(\widehat{\mu }\) is a complete and sufficient statistic and that \(\widehat{\Sigma }\) is an ancillary statistic, we obtain by Basu's theorem that \(\widehat{\mu }\) and \(\widehat{\Sigma }\) are independent. \(\square \)
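The independence asserted by Basu's theorem can be probed numerically. The sketch below (ours, with an arbitrary choice of \(\mu \) and \(\Sigma =I_p\)) checks that a coordinate of \(\widehat{\mu }\) is empirically uncorrelated with an entry of \(\widehat{\Sigma }\) across replications, a necessary (though not sufficient) consequence of independence.

    import numpy as np

    rng = np.random.default_rng(4)
    p, n, reps = 3, 50, 5_000
    mu = np.array([1.0, -2.0, 0.5])

    mean_coord, cov_entry = [], []
    for _ in range(reps):
        X = mu + rng.standard_normal((n, p))
        mu_hat = X.mean(axis=0)
        Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / (n - 1)
        mean_coord.append(mu_hat[0])
        cov_entry.append(Sigma_hat[0, 0])

    print(np.corrcoef(mean_coord, cov_entry)[0, 1])   # close to 0, consistent with independence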

Proof of Lemma 5

The following isotropic property of the Wishart distribution is given in [69]: for any orthogonal matrix \(U \in \mathbb {R}^{p \times p}\), we have

$$\begin{aligned} U^\top \left( \frac{Z^\top Z}{n-2}\right) U \sim \mathcal {W}(I_p,n-2). \end{aligned}$$
(35)

We next apply this property to the left-hand side of equation (10),

$$\begin{aligned} \begin{aligned} z^\top \left( \frac{Z^\top Z}{n-2}\right) ^{\dagger }z&=z^\top U_i U_i^\top \left( \frac{Z^\top Z}{n-2}\right) ^\dagger U_i U_i^\top z\\&=\Vert {z}\Vert ^2 e_i^\top \left( U_i^\top \left( \frac{Z^\top Z}{n-2}\right) U_i\right) ^{\dagger } e_i\\&\mathop {=}\limits ^d\Vert {z}\Vert ^2 e_i^\top \left( \frac{Z^\top Z}{n-2}\right) ^{\dagger } e_i, \end{aligned} \end{aligned}$$
(36)

where \(U_i\) is an orthogonal matrix that rotates the vector z onto the direction of the canonical basis vector \(e_i\), i.e.,

$$\begin{aligned} U_i^\top z = \Vert {z}\Vert _2 e_i, \quad i=1,2,\dots ,p. \end{aligned}$$
(37)

We can further simplify the product \( z ^\top (\frac{Z^\top Z}{n-2})^{\dagger } z\) by taking an average over the index i,

$$\begin{aligned} \begin{aligned} z^\top \left( \frac{Z^\top Z}{n-2}\right) ^\dagger z&\mathop {=}\limits ^d \frac{1}{p} \Vert z \Vert _2^2 \sum _{i=1,...,p} e_i^\top \left( \frac{Z^\top Z}{n-2}\right) ^{\dagger } e_i\\&= \frac{1}{p}\Vert {z}\Vert ^2_2 \text {tr}\left( \left( \frac{Z^\top Z}{n-2}\right) ^\dagger \right) , \end{aligned} \end{aligned}$$
(38)

where the equality in distribution uses the isotropic property of the Wishart distribution; this establishes Equation (10). \(\square \)
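As an aside (not part of the paper), the distributional identity above can be checked by comparing Monte Carlo averages of the two sides; the short Python sketch below assumes z is a standard Gaussian vector drawn independently of Z.

    import numpy as np

    rng = np.random.default_rng(0)
    p, n, trials = 50, 30, 2_000          # overparametrized: p > n - 2

    lhs, rhs = [], []
    for _ in range(trials):
        Z = rng.standard_normal((n - 2, p))          # rows are N(0, I_p) samples
        z = rng.standard_normal(p)                   # independent Gaussian vector
        W_pinv = np.linalg.pinv(Z.T @ Z / (n - 2))   # pseudo-inverse of the scaled Gram matrix
        lhs.append(z @ W_pinv @ z)
        rhs.append(np.dot(z, z) * np.trace(W_pinv) / p)

    print(np.mean(lhs), np.mean(rhs))   # the two averages should be close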

B.2 Proofs of Lemmas 6, 7 and 8

Proof of Lemma 6

We use the eigenvalue decomposition \(Z^\top Z=U D U^\top \), where U is orthogonal and D is diagonal, to simplify the left-hand side of Eq. (11),

$$\begin{aligned} \begin{aligned} \textrm{tr}[(Z^\top Z)^\dagger ]&=\text {tr}\left( U D^\dagger U^\top \right) \\&=\text {tr}\left( D^\dagger \right) \\&=\sum \limits _{s \in \lambda (Z^\top Z),s\ne 0} \frac{1}{s}. \end{aligned} \end{aligned}$$

The result above implies that the trace of the pseudo-inverse of \(Z^\top Z\) is equal to the sum of the reciprocals of its non-zero eigenvalues. By the same argument applied to \(ZZ^\top \), we can show that

$$\begin{aligned} \textrm{tr}[(ZZ^\top )^\dagger ] =\sum \limits _{s \in \lambda (ZZ^\top ),s\ne 0} \frac{1}{s}. \end{aligned}$$

Then we deduce the desired result by the fact that the set of non-zero eigenvalues of \(ZZ^\top \) matches that of \(Z^\top Z\). \(\square \)
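A quick numerical illustration of this trace identity (our addition, in Python):

    import numpy as np

    rng = np.random.default_rng(1)
    Z = rng.standard_normal((8, 20))            # a short, wide matrix: Z^T Z is rank deficient

    t_pp = np.trace(np.linalg.pinv(Z.T @ Z))    # trace of the p x p pseudo-inverse
    t_nn = np.trace(np.linalg.pinv(Z @ Z.T))    # trace of the (n-2) x (n-2) pseudo-inverse
    eig = np.linalg.eigvalsh(Z @ Z.T)           # the shared non-zero spectrum
    print(t_pp, t_nn, np.sum(1.0 / eig[eig > 1e-10]))   # all three agree up to numerical error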

Proof of Lemma 7

We first compute the limit of \(\frac{1}{\sqrt{p}}\mu _d^\top z\). Since \(z\sim \mathcal {N}(0,I_p)\), the linear combination \(\frac{1}{\sqrt{p}}\mu _d^\top z\) is a normal random variable, namely \(\frac{1}{\sqrt{p}}\mu _d^\top z \sim \mathcal {N}(0, \frac{1}{p}{\mu }_d^\top \mu _d)\). From the tail bound for normal random variables [57], we have

$$\begin{aligned} \mathbb {P}\left( \left| \frac{1}{\sqrt{p}}\mu _d^\top z\right| \ge \frac{\epsilon }{\sqrt{p}}\Vert \mu _d\Vert _2\right) \le 2 \frac{1}{\sqrt{2\pi }\epsilon } e^{-\epsilon ^2/2} \quad \text {for all }\epsilon > 0. \end{aligned}$$
(39)

Combining Eq. (39), the bound \(e^{-x}<\frac{1}{x}\) for \(x>0\), and \(\Vert \mu _d\Vert _2 < 2\Delta \) for sufficiently large p, the sum of the probabilities of the events \(\frac{1}{\sqrt{p}}\vert \mu _d^\top {z}\vert > \epsilon \) is finite for any \(\epsilon > 0\), i.e.,

$$\begin{aligned} \begin{aligned} \sum _{p=1}^{\infty }\mathbb {P}\left( \frac{1}{\sqrt{p}}\vert \mu _d^\top {z}\vert > \epsilon \right)&\le \sum _{p=1}^{\infty } \frac{\Vert \mu _d\Vert _2}{\sqrt{2\pi p}\epsilon } e^{-(\epsilon ^2p)/(2\Vert \mu _d\Vert _2^2)}\\&<\sum _{p=1}^{\infty }\frac{2\Vert \mu _d\Vert _2^3}{\sqrt{2\pi }\epsilon ^3}~\frac{1}{p^{3/2}}\\&<\infty . \end{aligned} \end{aligned}$$

By the Borel–Cantelli lemma, we have

$$\begin{aligned} \frac{1}{\sqrt{p}}\mu _d^\top {z} \overset{\mathrm{a.s.}}{\longrightarrow }0. \end{aligned}$$

We next consider the limit of \(\frac{1}{p}z_\ell ^\top z_\ell \). Since the squared entries \(z_{\ell ,i}^2\) of \(z_\ell \) are i.i.d. chi-squared random variables with expectation \(\mathbb {E}[z_{\ell ,i}^2]=1\) and finite variance \(\mathbb {V}ar (z_{\ell ,i}^2)=2\), the average of the squared entries of \(z_\ell \) converges to its expectation almost surely by the strong law of large numbers, namely,

$$\begin{aligned} \frac{1}{p}z_\ell ^\top z_\ell \overset{\mathrm{a.s.}}{\longrightarrow }1. \end{aligned}$$

By a similar argument, \(n_\ell \) follows the binomial distribution \(B(n_0+n_1,\pi _\ell )\), i.e., it is a sum of \(n_0+n_1\) independent Bernoulli random variables with expectation \(\pi _\ell \) and variance \(\pi _0\pi _1\). From the strong law of large numbers, we have

$$\begin{aligned} \frac{n_\ell }{n_0+n_1}\overset{\mathrm{a.s.}}{\longrightarrow }\pi _\ell . \end{aligned}$$

\(\square \)
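The three almost sure limits in Lemma 7 are easy to check numerically; the sketch below (ours) draws one large instance with \(\Vert \mu _d\Vert _2=\Delta \) and \(\Sigma =I_p\) and reports the corresponding ratios.

    import numpy as np

    rng = np.random.default_rng(2)
    p = 200_000
    Delta = 3.0
    mu_d = np.full(p, Delta / np.sqrt(p))       # a deterministic vector with ||mu_d||_2 = Delta
    z = rng.standard_normal(p)

    print(mu_d @ z / np.sqrt(p))                # -> 0
    print(z @ z / p)                            # -> 1

    pi0, n = 0.3, 100_000
    n0 = rng.binomial(n, pi0)                   # class-0 count among n draws
    print(n0 / n)                               # -> pi_0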

Proof of Lemma 8

We first derive the expression of \(m(\zeta )\). The Marchenko–Pastur law is supported on a compact subset of \(\mathbb {R}^+\), i.e., \(\textrm{supp}(F_{\gamma })\subset [a,b]\), where

$$\begin{aligned} a=(1-\sqrt{\gamma })^2, \quad \text {and} \quad b=(1+\sqrt{\gamma })^2. \end{aligned}$$

Let \(\{z_k\}\) be a sequence of complex numbers such that \(\textrm{Im}(z_k)>0\) and \(\textrm{Re}(z_k)=\zeta \) for every k, and \(\lim _{k\rightarrow \infty }z_k=\zeta \). Consider the sequence of integrals

$$\begin{aligned} \int _{a}^b \frac{1}{s-z_k} dF_\gamma (s). \end{aligned}$$

For any k, \(0<\gamma <1\) and \(s>a\), we have

$$\begin{aligned} \left| \frac{1}{s-z_k}\right| \le \left| \frac{1}{s}\right|<\frac{1}{(1-\sqrt{\gamma })^2}<\infty . \end{aligned}$$

By the dominated convergence theorem, we have

$$\begin{aligned} \int \frac{1}{s-\zeta } dF_\gamma (s)&=\int _{a}^b \lim _{k\rightarrow \infty }\frac{1}{s-z_k} dF_\gamma (s)=\lim _{k\rightarrow \infty } \int _{a}^b \frac{1}{s-z_k} dF_\gamma (s)\nonumber \\&=\lim _{k\rightarrow \infty } \int \frac{1}{s-z_k} dF_\gamma (s). \quad \end{aligned}$$
(40)

To compute \(\int \frac{1}{s-z_k} dF_\gamma (s)\), [2, Lemma 3.11] gives

$$\begin{aligned} \int \frac{1}{s-z_k} dF_\gamma (s)= \frac{1-\gamma -z_k+\sqrt{(z_k-\gamma -1)^2-4\gamma }}{2\gamma z_k}. \end{aligned}$$
(41)

According to the definition of the square root of complex numbers in [2, Eq. (2.3.2)], the real part of \(\sqrt{(z_k-\gamma -1)^2-4\gamma }\) has the same sign as that of \(z_k-\gamma -1\). Since \(\textrm{Re}(z_k)=\zeta \le 0\) and \(\gamma >0\), the real part of \(\sqrt{(z_k-\gamma -1)^2-4\gamma }\) is negative, which gives

$$\begin{aligned} \lim _{k\rightarrow \infty } \sqrt{(z_k-\gamma -1)^2-4\gamma }= -\sqrt{(\zeta -\gamma -1)^2-4\gamma }. \end{aligned}$$
(42)

Substituting (42) and (41) into (40) gives rise to (12).

We then compute m(0). When we substitute \(\zeta =0\) into (12), both the numerator and the denominator are 0, so we apply L'Hôpital's rule:

$$\begin{aligned} m(0)&=\lim _{\zeta \rightarrow 0}\frac{1-\gamma -\zeta -\sqrt{(\zeta -\gamma -1)^2-4\gamma }}{2\gamma \zeta } \\&=\lim _{\zeta \rightarrow 0} \frac{1}{2\gamma } \left( -1-\frac{\zeta -\gamma -1}{\sqrt{(\zeta -\gamma -1)^2-4\gamma }}\right) \\&=\frac{1}{2\gamma } \left( -1-\frac{-\gamma -1}{\sqrt{(-\gamma -1)^2-4\gamma }}\right) \nonumber \\&=\frac{1}{2\gamma } \left( -1+\frac{1+\gamma }{1-\gamma }\right) \nonumber \\&=\frac{1}{1-\gamma }. \end{aligned}$$

We next derive the expression of \(\frac{d}{d\zeta }m(\zeta )\). We first show that

$$\begin{aligned} \int \frac{1}{(s-\zeta )^2} dF_\gamma (s)= \lim _{z\rightarrow \zeta }\frac{d }{d z} \int \frac{1}{s-z} dF_\gamma (s) \quad \text {for }z\in \mathbb {C}\text { with }\textrm{Re}(z)=\zeta , \textrm{Im}(z)>0. \end{aligned}$$
(43)

Let \(\{h_k\}\) be a sequence of complex numbers such that \(\vert \textrm{Re}(h_k)\vert \le \vert \zeta \vert /2\) for every k and \(\lim _{k\rightarrow \infty }h_k=0\). For any k and \(s\ge a\), we have

$$\begin{aligned} \left| \frac{1}{(s-z-h_k)(s-z)}\right| \le \frac{1}{(1-\sqrt{\gamma })^4}<\infty . \end{aligned}$$

By the dominated convergence theorem, we have

$$\begin{aligned} \frac{\partial }{\partial z} \int \frac{1}{s-z} dF_\gamma (s)=&\lim _{k\rightarrow \infty }\frac{1}{h_k}\left[ \int \frac{1}{s-(z+h_k)} dF_\gamma (s)-\int \frac{1}{s-z} dF_\gamma (s)\right] \\ =&\lim _{k\rightarrow \infty }\frac{1}{h_k}\int _{a}^{b} \frac{1}{s-z-h_k}-\frac{1}{s-z} dF_\gamma (s) \\ =&\lim _{k\rightarrow \infty }\int _{a}^{b} \frac{1}{(s-z-h_k)(s-z)} dF_\gamma (s) \\ =&\int _{a}^{b}\frac{1}{(s-z)^2} dF_\gamma (s), \end{aligned}$$

where the second equality holds since \(F_{\gamma }\) is supported on \([a,b]\).

Since

$$\begin{aligned} \left| \frac{1}{(s-z)^2}\right| \le \frac{1}{(1-\sqrt{\gamma })^4}<\infty , \end{aligned}$$

for any \(s\ge a\), we have

$$\begin{aligned} \int \frac{1}{(s-\zeta )^2}dF_{\gamma }(s)&= \int _{a}^{b} \lim _{z\rightarrow \zeta }\frac{1}{(s-z)^2}dF_{\gamma }(s) =\lim _{z\rightarrow \zeta } \int _{a}^{b} \frac{1}{(s-z)^2}dF_{\gamma }(s)\\&= \lim _{z\rightarrow \zeta }\frac{\partial }{\partial z} \int \frac{1}{s-z} dF_\gamma (s). \end{aligned}$$

Using (41), we have

$$\begin{aligned}&\frac{\partial }{\partial z} \int \frac{1}{s-z} dF_\gamma (s) \nonumber \\ {}&= \frac{(2\gamma z)\left[ -1+\frac{z-\gamma -1}{\sqrt{(z-\gamma -1)^2-4\gamma }}\right] -(2\gamma )\left( 1-\gamma -z+\sqrt{(z-\gamma -1)^2-4\gamma }\right) }{4\gamma ^2z^2} \nonumber \\&=\frac{(2\gamma )(\gamma -1)+\frac{(2\gamma z)(z-\gamma -1)}{\sqrt{(z-\gamma -1)^2-4\gamma }}-2\gamma \sqrt{(z-\gamma -1)^2-4\gamma }}{4\gamma ^2z^2}. \end{aligned}$$
(44)

Letting \(z\rightarrow \zeta \) in (44) and recalling that the real part of \(\sqrt{(z-\gamma -1)^2-4\gamma }\) is negative, one gets (13).

To compute \(\frac{d}{d\zeta }m(0)\), by L'Hôpital's rule, we deduce

$$\begin{aligned} \frac{d}{d\zeta }m(0)= \lim _{\zeta \rightarrow 0} \frac{(2\gamma )(\gamma -1)-\frac{(2\gamma \zeta )(\zeta -\gamma -1)}{\sqrt{(\zeta -\gamma -1)^2-4\gamma }}+2\gamma \sqrt{(\zeta -\gamma -1)^2-4\gamma }}{4\gamma ^2\zeta ^2}=\frac{1}{(1-\gamma )^3} . \end{aligned}$$
(45)

\(\square \)
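For intuition (our addition), the closed forms derived above can be compared with the empirical quantities \(\frac{1}{p}\textrm{tr}[(\widehat{S}+\lambda I_p)^{-1}]\) and \(\frac{1}{p}\textrm{tr}[(\widehat{S}+\lambda I_p)^{-2}]\) for a large sample covariance matrix \(\widehat{S}=\frac{1}{n}Z^\top Z\) with aspect ratio \(\gamma =p/n<1\); a short Python sketch follows.

    import numpy as np

    def m(lam, g):
        # m(-lambda) for the Marchenko-Pastur law with aspect ratio g, cf. Eq. (12)
        return (np.sqrt((g + lam + 1) ** 2 - 4 * g) + g - lam - 1) / (2 * g * lam)

    rng = np.random.default_rng(3)
    n, p = 4_000, 1_000                 # aspect ratio g = p / n = 0.25 < 1
    g = p / n
    Z = rng.standard_normal((n, p))
    S = Z.T @ Z / n
    lam = 0.5
    eig = np.linalg.eigvalsh(S)         # all positive since n > p

    print(np.mean(1.0 / (eig + lam)), m(lam, g))             # ~ m(-lambda)
    print(np.mean(1.0 / eig), 1.0 / (1.0 - g))               # ~ m(0) = 1/(1-gamma)
    print(np.mean(1.0 / eig ** 2), 1.0 / (1.0 - g) ** 3)     # ~ m'(0) = 1/(1-gamma)^3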

C Proofs in Sect. 4

C.1 Proof of Theorem 3

Proof of Theorem 3

The proof uses the same technique as the proof of Theorem 2: the misclassification error takes the same form as (14), and we only need to derive the limits of \(q_0\) and \(q_1\). By the change of variables formula in (15) and Lemma 8, we deduce

$$\begin{aligned}&\hat{\beta }^\top (\hat{\alpha } - \mu _0) \nonumber \\&= - \frac{1}{2} (\mu _d + \frac{1}{\sqrt{n_0}}z_0 - \frac{1}{\sqrt{n_1}}z_1)^\top \left( \frac{Z^\top Z}{n-2} + \lambda I_p\right) ^\dagger (\mu _d - \frac{1}{\sqrt{n_0}} z_0 - \frac{1}{\sqrt{n_1}} z_1)\nonumber \\&=\frac{1}{2n_0} z_0^\top \left( \frac{Z^\top Z}{n-2}+ \lambda I_p\right) ^\dagger z_0 - \frac{1}{2}(\mu _d - z_1/\sqrt{n_1})^\top \left( \frac{Z^\top Z }{n-2}+ \lambda I_p\right) ^\dagger (\mu _d - z_1/\sqrt{n_1}) \nonumber \\&\overset{d}{=}\ \left[ -\frac{1}{2}\Vert \mu _d -\tfrac{1}{\sqrt{n_1}}z_1\Vert _2^2 + \frac{1}{2n_0} \Vert z_0\Vert _2^2 \right] \times \frac{1}{p} \text {tr}\left( \frac{Z^\top Z}{n-2} + \lambda I_p\right) ^{\dagger } \nonumber \\&\overset{\mathrm{a.s.}}{\longrightarrow }-\frac{1}{2} (\Delta ^2 + \gamma _1 -\gamma _0) m(-\lambda ). \end{aligned}$$
(46)

Similarly we can derive

$$\begin{aligned} \Vert \widehat{\beta }\Vert _\Sigma ^2 =&(\widehat{\mu }_0-\widehat{\mu }_1)^\top (\widehat{\Sigma }+\lambda I_p)^\dagger \Sigma (\widehat{\Sigma }+ \lambda I_p)^\dagger (\widehat{\mu }_0-\widehat{\mu }_1)\nonumber \\ =&({\mu }_d + \frac{1}{\sqrt{n_0}}{z}_0 - \frac{1}{\sqrt{n_1}}{z}_1)^\top \left( \left( \frac{Z^\top Z}{n-2} + \lambda I_p\right) ^{\dagger }\right) ^2 ({\mu }_d + \frac{1}{\sqrt{n_0}}{z}_0 - \frac{1}{\sqrt{n_1}}{z}_1)\nonumber \\ \overset{d}{=}\ {}&\Vert {\mu _d + \frac{1}{\sqrt{n_0}}z_0 - \frac{1}{\sqrt{n_1}}z_1}\Vert _2^2 \times \frac{1}{p} \text {tr}\left( \left( \left( \frac{Z^\top Z}{n-2}+\lambda I_p\right) ^\dagger \right) ^2\right) \nonumber \\ \overset{\mathrm{a.s.}}{\longrightarrow }&(\Delta ^2 + \gamma _0 + \gamma _1)m'(-\lambda ). \end{aligned}$$
(47)

Combining (46) and (47), as well as putting the threshold term \(\ln (n_1/n_0)\) back, we obtain

$$\begin{aligned} q_0 \overset{\mathrm{a.s.}}{\longrightarrow }\frac{-\frac{1}{2}(\Delta ^2 -\gamma _0+\gamma _1)m(-\lambda )+\ln {\frac{\gamma _0}{\gamma _1}}}{\sqrt{(\Delta ^2 + \gamma _1 + \gamma _0)m'(-\lambda )}}. \end{aligned}$$

The same argument used for \(q_0\) applies to \(q_1\), and therefore we have

$$\begin{aligned} q_1 \overset{\mathrm{a.s.}}{\longrightarrow }\frac{-\frac{1}{2}(\Delta ^2 + \gamma _0 - \gamma _1)m(-\lambda )+\ln {\frac{\gamma _1}{\gamma _0}}}{\sqrt{(\Delta ^2 + \gamma _1 + \gamma _0)m'(-\lambda )}}. \end{aligned}$$

We complete the proof by substituting \(q_0,q_1\) above into (14). \(\square \)
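To make the limiting expressions concrete, the sketch below (our illustration, assuming that (14) takes the form \(\mathcal {R}=\frac{1}{2}\Phi (q_0)+\frac{1}{2}\Phi (q_1)\) and that \(\gamma _\ell =p/n_\ell \), so the pooled aspect ratio is \(\gamma =(1/\gamma _0+1/\gamma _1)^{-1}\)) evaluates \(q_0\), \(q_1\) and the resulting error, using the closed form of \(m(-\lambda )\) from Lemma 8 and a finite-difference approximation of \(m'(-\lambda )\).

    import numpy as np
    from scipy.stats import norm

    def m(lam, g):
        # m(-lambda) for the Marchenko-Pastur law with aspect ratio g, cf. Eq. (12) / In[1]
        return (np.sqrt((g + lam + 1) ** 2 - 4 * g) + g - lam - 1) / (2 * g * lam)

    def m_prime(lam, g, h=1e-6):
        # m'(-lambda), approximated by a central finite difference in zeta = -lambda
        return (m(lam - h, g) - m(lam + h, g)) / (2 * h)

    def limiting_error(delta2, g0, g1, lam):
        g = 1.0 / (1.0 / g0 + 1.0 / g1)           # pooled aspect ratio, assuming gamma_l = p / n_l
        denom = np.sqrt((delta2 + g0 + g1) * m_prime(lam, g))
        q0 = (-0.5 * (delta2 - g0 + g1) * m(lam, g) + np.log(g0 / g1)) / denom
        q1 = (-0.5 * (delta2 + g0 - g1) * m(lam, g) + np.log(g1 / g0)) / denom
        return 0.5 * norm.cdf(q0) + 0.5 * norm.cdf(q1)   # assumed form of (14)

    print(limiting_error(delta2=9.0, g0=1.5, g1=1.5, lam=1.0))    # balanced case
    print(limiting_error(delta2=9.0, g0=1.5, g1=0.75, lam=1.0))   # imbalanced case: n_1 = 2 n_0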

C.2 Proof of regularized phase transition in Sect. 4.2

In this section, we show that with strong regularization the phase transition phenomenon vanishes. Denote the asymptotic misclassification error in Theorem 3 as

$$\begin{aligned} \mathcal {R}_\lambda (\gamma _0, \gamma _1) = \sum _{\ell =0,1} \frac{1}{2}\Phi \left( \frac{g(\gamma _0, \gamma _1, \ell )m(-\lambda )+(-1)^\ell \ln {\frac{\gamma _0}{\gamma _1}}}{k(\gamma _0, \gamma _1)\sqrt{m'(-\lambda )}}\right) , \end{aligned}$$

and we use the shorthand \(\mathcal {R}_\lambda (\gamma )\) to denote \(\mathcal {R}_\lambda (\gamma _0,\gamma _1)\) with the balanced data, i.e., \(\gamma _0=\gamma _1 = 2\gamma \),

$$\begin{aligned} \mathcal {R}_\lambda (\gamma ) := \mathcal {R}_\lambda (2\gamma ,2\gamma ) = \Phi \left( \frac{-\Delta ^2 m(-\lambda )}{2\sqrt{(\Delta ^2+4\gamma )m'(-\lambda )}}\right) . \end{aligned}$$

We show that the phase transition phenomenon vanishes under strong regularization, namely,

$$\begin{aligned} \frac{\partial }{\partial \gamma _1} \mathcal {R}_\lambda (\gamma _0,\gamma _1)\,\vert \,_{\gamma _0=\gamma _1=2\gamma }> 0 \quad \text {for sufficiently large } \lambda >0. \end{aligned}$$

To see this, we need to show that \(\frac{\partial \mathcal {R}_\lambda (\gamma )}{\partial \gamma }>0\) under strong regularization. Specifically, invoking the chain rule and after some manipulation, we have

$$\begin{aligned} \frac{\partial }{\partial \gamma _1}\mathcal {R}_\lambda (\gamma _0, \gamma _1)\,\vert \,_{\gamma _0=\gamma _1=2\gamma } = \frac{\partial \mathcal {R}_\lambda (\gamma ) }{\partial \gamma }\frac{\partial \gamma }{\partial \gamma _1}\,\vert \,_{\gamma _0=\gamma _1=2\gamma }. \end{aligned}$$

Using Mathematica [35], we check that

$$\begin{aligned} \frac{\partial \mathcal {R}_\lambda (\gamma )}{\partial \gamma } = \phi \left( \frac{-\Delta ^2m(-\lambda )}{2\sqrt{(\Delta ^2+4\gamma )m'(-\lambda )}}\right) \underbrace{\frac{\partial }{\partial \gamma } \left( \frac{-\Delta ^2m(-\lambda )}{2\sqrt{(\Delta ^2+4\gamma )m'(-\lambda )}}\right) }_{(\star )}, \end{aligned}$$

where \(\phi \) is the pdf of the standard normal distribution. Using Mathematica [35] again, the denominator of \((\star )\) is always positive, and is given by

$$\begin{aligned}&2 \sqrt{2} \gamma ^3 \lambda ^3 \frac{\left[ (\gamma +\lambda +1)^2-4 \gamma \right] ^{3/2}\left( 4 \gamma +\Delta ^2\right) ^{3/2}}{\left[ \gamma \lambda ^2 \left( (\gamma +\lambda +1)^2-4 \gamma \right) \right] ^{3/2}} \\ {}&\times \Big [\gamma ^3 + \gamma ^2 (\sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}+2 \lambda -3)\\&\quad +\gamma \Big (-2 \sqrt{\gamma ^2+2 \gamma (\lambda -1)+(\lambda +1)^2}+\lambda \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}+\lambda ^2+3\Big )\\&\quad +(\lambda +1) (\sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}-\lambda -1)\Big ]^{3/2}. \end{aligned}$$

As a result, the sign of \(\frac{\partial \mathcal {R}_\lambda (\gamma )}{\partial \gamma }\) is determined by the numerator of \((\star )\) given as

$$\begin{aligned}&\Delta ^2 \Big \{\gamma (\lambda +1) \Big [\Delta ^2 \Big (-3 \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2} -2 \lambda \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2 +\lambda ^2}\\&+\lambda ^2+5 \lambda +4\Big )+8 (\lambda +1)^2 \left( \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}-\lambda -1\right) \Big ]\\&+\gamma ^2 \Big [\Delta ^2 \Big (2 \lambda \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}+3 \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}\\ {}&-6 \lambda -6\Big ) -4 (\lambda +1) \Big (2 \lambda \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}-9 \lambda -9\\ {}&+7 \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}\Big )\Big ] +\gamma ^3 \Big [4 \lambda \Big (\sqrt{\gamma ^2+2 \gamma (\lambda -1)+(\lambda +1)^2}\\ {}&+\lambda \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}+\lambda ^2-9\Big )+36 \sqrt{\gamma ^2+2 \gamma (\lambda -1)+(\lambda +1)^2} \\ {}&+\Delta ^2 \left( -\sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}+\lambda +4\right) -64\Big ] +\gamma ^4 \Big [-\Delta ^2+56 \\ {}&{} -20 \sqrt{\gamma ^2+2 \gamma (\lambda -1)+(\lambda +1)^2}+4 \lambda \left( 2 \sqrt{\gamma ^2+2 \gamma (\lambda -1)+(\lambda +1)^2}+3 \lambda -4\right) \Big ] \\&+4 \gamma ^5 \left[ \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}+3 \lambda -6\right] +4 \gamma ^6 \\&+\Delta ^2 (\lambda +1)^3 \left( \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}-\lambda -1\right) \Big \}. \end{aligned}$$

Combining the denominator and the numerator, \(\frac{\partial }{\partial \gamma } \mathcal {R}_\lambda (\gamma )\) is positive only when one of the following cases holds:

  1. \(\Delta>0 ~\textrm{and}~ 0<\gamma \le 1~\textrm{and}~\lambda >0\).

  2. \(\Delta>0 ~\textrm{and}~ \gamma \ge \frac{\sqrt{\Delta ^2+4}}{2}+1~\textrm{and}~ \lambda >0\).

  3. \(\Delta >0 ~\textrm{and}~ 1<\gamma <\frac{\sqrt{\Delta ^2+4}}{2}+1 ~\textrm{and}~ \lambda \ge \text {the smallest real root of} \big [{\#1}^4 (32 \gamma +4 \Delta ^2)+{\#1}^3 (96 \gamma ^2+8 \gamma \Delta ^2+128 \gamma +16 \Delta ^2) +{\#1}^2 (96 \gamma ^3+112 \gamma ^2+16 \gamma \Delta ^2+192 \gamma +\Delta ^4+24 \Delta ^2) +{\#1} (-8 \gamma ^3 \Delta ^2-16 \gamma ^2 \Delta ^2+32 \gamma ^4-96 \gamma ^3-64 \gamma ^2-2 \gamma \Delta ^4+8 \gamma \Delta ^2+128 \gamma +2 \Delta ^4+16 \Delta ^2) -4 \gamma ^4 \Delta ^2 +16 \gamma ^3 \Delta ^2+\gamma ^2 \Delta ^4-16 \gamma ^2 \Delta ^2-16 \gamma ^4+64 \gamma ^3- 80 \gamma ^2-2 \gamma \Delta ^4+32 \gamma +\Delta ^4+4 \Delta ^2\big ]\).

Consequently, we deduce that the misclassification error increases when \(\gamma \) grows in the interval \(\left( 0, 1\right) \) or \(\left( \frac{\sqrt{\Delta ^2+4}}{2}+1, \infty \right) \); when \(\gamma \) grows in \(\left( 1,\frac{\sqrt{\Delta ^2+4}}{2}+1\right) \), the misclassification error decreases when \(\lambda \) is small, yet increases when \(\lambda \) is large. For example, when \(\Delta ^2=9\) and \(\lambda =1\), \(\mathcal {R}_\lambda (\gamma )\) increases monotonically with respect to \(\gamma \), and the peaking phenomenon disappears. Meanwhile, the instantaneous derivative \(\frac{\partial \mathcal {R}_\lambda (\gamma _0,\gamma _1)}{\partial \gamma _1}\,\vert \,_{\gamma _0=\gamma _1=2\gamma }\) is positive for any \(\gamma _0\), which implies that the phase transition phenomenon vanishes.

The Mathematica commands are provided as follows.

In[1]: \(m(\gamma \_,\lambda \_)\text {:=}\frac{\sqrt{(\gamma +\lambda +1)^2-4 \gamma }+\gamma -\lambda -1}{2 \gamma \lambda }\)

In[2]: \(R(\gamma \_,\lambda \_,\Delta \_)\text {:=}-\frac{\Delta ^2 m(\gamma ,\lambda )}{2 \sqrt{-\left( 4 \gamma +\Delta ^2\right) \frac{\partial m(\gamma ,\lambda )}{\partial \lambda }}}\)

In[3]: \(\text {de}(\gamma \_,\lambda \_,\Delta \_)\text {:=}\)

\(\text {Evaluate}\left[ \text {Denominator}\left[ \text {FullSimplify}\left[ \text {Together}\left[ \frac{\partial R(\gamma ,\lambda ,\Delta )}{\partial \gamma }\right] \right] \right] \right] \)

In[4]: \(\text {nu}(\gamma \_ ,\lambda \_ ,\Delta \_ )\text {:=}\)

           \(\text {Evaluate}\left[ \text {Numerator}\left[ \text {FullSimplify}\left[ \text {Together}\left[ \frac{\partial R(\gamma ,\lambda ,\Delta )}{\partial \gamma }\right] \right] \right] \right] \)

In[5]: \(\text {Reduce}[\text {de}(\gamma ,\lambda ,\Delta )\ge 0\wedge \lambda>0\wedge \gamma>0\wedge \Delta >0,\{\gamma ,\lambda \}]\)

In[6]: \(\text {Reduce}[\text {nu}(\gamma ,\lambda ,\Delta )\ge 0\wedge \lambda>0\wedge \gamma>0\wedge \Delta >0,\{\gamma ,\lambda \}]\)

In[7]: \(\text {Reduce}[\text {nu}(\gamma ,1,3)\ge 0\wedge \gamma >0,\{\gamma \}]\)
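In addition to the symbolic computation above, a quick numerical check (ours) of the example \(\Delta ^2=9\), \(\lambda =1\) can be done in Python by evaluating \(\mathcal {R}_\lambda (\gamma )\) on a grid:

    import numpy as np
    from scipy.stats import norm

    def m(lam, g):
        # m(-lambda) for the Marchenko-Pastur law with aspect ratio g, as in In[1]
        return (np.sqrt((g + lam + 1) ** 2 - 4 * g) + g - lam - 1) / (2 * g * lam)

    def m_prime(lam, g, h=1e-6):
        # central finite difference approximation of m'(-lambda)
        return (m(lam - h, g) - m(lam + h, g)) / (2 * h)

    def R_balanced(gamma, lam, delta2):
        # R_lambda(gamma) for the balanced case gamma_0 = gamma_1 = 2 gamma
        return norm.cdf(-delta2 * m(lam, gamma)
                        / (2 * np.sqrt((delta2 + 4 * gamma) * m_prime(lam, gamma))))

    grid = np.linspace(0.05, 5.0, 200)
    vals = np.array([R_balanced(g, lam=1.0, delta2=9.0) for g in grid])
    print(np.all(np.diff(vals) > 0))   # expect True per the discussion above: no peaking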

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cheng, J., Chen, M., Liu, H. et al. High dimensional binary classification under label shift: phase transition and regularization. Sampl. Theory Signal Process. Data Anal. 21, 32 (2023). https://doi.org/10.1007/s43670-023-00071-9
