High dimensional binary classification under label shift: phase transition and regularization

Abstract

Label shift is widely believed to be harmful to the generalization performance of machine learning models. Researchers have proposed many approaches to mitigate its impact, e.g., balancing the training data. However, these methods usually consider the underparametrized regime, where the sample size is much larger than the data dimension; research in the overparametrized regime is very limited. To bridge this gap, we propose a new asymptotic analysis of the Fisher Linear Discriminant classifier for binary classification with label shift. Specifically, we prove that a phase transition phenomenon exists: in a certain overparametrized regime, the classifier trained on imbalanced data outperforms its counterpart trained on reduced, balanced data. Moreover, we investigate the impact of regularization on the label shift: the aforementioned phase transition vanishes as the regularization becomes strong.

References

  1. Anderson, T.W.: An Introduction to Multivariate Statistical Analysis. Wiley, New York (1962)

  2. Bai, Z., Silverstein, J.W.: Spectral Analysis of Large Dimensional Random Matrices. Springer (2010)

  3. Bai, Z.D., Yin, Y.Q.: Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. Ann. Probab. 21(3), 1275–1294 (1993). https://doi.org/10.1214/aop/1176989118

  4. Barandela, Ricardo, Rangel, E., Sánchez, José Salvador., Ferri, Francesc J.: Restricted decontamination for the imbalanced training sample problem. In Iberoamerican congress on pattern recognition, pages 424–431. Springer, (2003)

  5. Bartlett, Peter L., Long, Philip M., Lugosi, Gábor, Tsigler, Alexander: Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117 (48):30063–30070, (2020)

  6. Belkin, Mikhail, Hsu, Daniel, Ma, Siyuan, Mandal, Soumik: Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences 116(32), 15849–15854 (2019)

  7. Belkin, Mikhail, Hsu, Daniel, Ji, Xu.: Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science 2(4), 1167–1180 (2020)

  8. Bibas, Koby, Fogel, Yaniv, Feder, Meir.: A new look at an old problem: A universal learning approach to linear regression. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 2304–2308. IEEE, (2019)

  9. Bickel, P.J., Levina, E.: Some theory for Fisher's linear discriminant function, "naive Bayes", and some alternatives when there are many more variables than observations. Bernoulli 10(6), 989–1010 (2004)

  10. Bishop, Christopher M.: Neural networks for pattern recognition. (1995)

  11. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)

  12. Branco, Paula, Torgo, Luis, Ribeiro, Rita: A survey of predictive modelling under imbalanced distributions, (2015)

  13. Cai, Tony, Liu, Weidong: A direct estimation approach to sparse linear discriminant analysis. Journal of the American statistical association 106(496), 1566–1577 (2011)

  14. Cao, Kaidi, Wei, Colin, Gaidon, Adrien, Arechiga, Nikos, Ma, Tengyu.: Learning imbalanced datasets with label-distribution-aware margin loss. (2019). arXiv:1906.07413 [cs.LG]

  15. Chan, Yee Seng, Ng, Hwee Tou: Word sense disambiguation with distribution estimation. page 1010-1015, (2005)

  16. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)

  17. Cui, Yin, Jia, Menglin, Lin, Tsung-Yi, Song, Yang, Belongie, Serge.: Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9268–9277, (2019)

  18. Deng, Z., Kammoun, A., Thrampoulidis, C.: A model of double descent for high-dimensional binary linear classification (2019)

  19. Dereziński, Michał., Liang, Feynman., Mahoney, Michael W.: Exact expressions for double descent and implicit regularization via surrogate random design. arXiv preprint arXiv:1912.04533, (2019)

  20. Duin, Robert PW.: Small sample size generalization. In Proceedings of the Scandinavian Conference on Image Analysis, volume 2, pages 957–964. PROCEEDINGS PUBLISHED BY VARIOUS PUBLISHERS, (1995)

  21. Elkan, Charles.: The foundations of cost-sensitive learning. page 973-978, (2001)

  22. Elkhalil, Khalil, Kammoun, Abla, Couillet, Romain, Al-Naffouri, Tareq Y., Alouini, Mohamed-Slim.: A large dimensional study of regularized discriminant analysis. IEEE Transactions on Signal Processing, 68: 2464–2479, (2020)

  23. Fisher, Ronald A.: The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2):179–188, (1936)

  24. Friedman, Jerome H.: Regularized discriminant analysis. Journal of the American statistical association, 84 (405):165–175, (1989)

  25. Fukunaga, Keinosuke.: Introduction to statistical pattern recognition. Elsevier, (2013)

  26. Grzymala-Busse, Jerzy W., Goodwin, Linda K., Grzymala-Busse, Witold J., Zheng, Xinqun.: An approach to imbalanced data sets based on changing rule strength. In Rough-neural computing, pages 543–553. Springer, (2004)

  27. Guo, Yaqian, Hastie, Trevor, Tibshirani, Robert: Regularized linear discriminant analysis and its application in microarrays. Biostatistics 8(1), 86–100 (2007)

  28. Guo, H., Li, Y., Shang, J., Gu, M., Huang, Y., Gong, B.: Learning from class-imbalanced data: review of methods and applications. Expert Systems with Applications 73, 220–239 (2017)

  29. Hastie, Trevor, Montanari, Andrea, Rosset, Saharon, Tibshirani, Ryan J.: Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, (2019)

  30. He, Haibo, Garcia, Edwardo A.: Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9):1263–1284, (2009)

  31. Hong, Xia., Chen, Sheng, Harris, Chris J.: A kernel-based two-class classifier for imbalanced data sets. IEEE Transactions on neural networks, 18 (1):28–41, (2007)

  32. Hua, Jianping, Xiong, Zixiang, Lowey, James, Suh, Edward, Dougherty, Edward R.: Optimal number of features as a function of sample size for various classification rules. Bioinformatics, 21(8):1509–1515, (2005)

  33. Huang, Chen, Li, Yining, Loy, Chen Change., Tang, Xiaoou: Learning deep representation for imbalanced classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5375–5384, (2016)

  34. Hughes, Gordon: On the mean accuracy of statistical pattern recognizers. IEEE transactions on information theory 14(1), 55–63 (1968)

  35. Wolfram Research, Inc.: Mathematica, Version 12.2. Champaign, IL (2020). https://www.wolfram.com/mathematica

  36. Japkowicz, Nathalie, Stephen, Shaju: The class imbalance problem: A systematic study. Intelligent data analysis 6(5), 429–449 (2002)

  37. Jo, Taeho, Japkowicz, Nathalie: Class imbalances versus small disjuncts. ACM Sigkdd Explorations Newsletter 6(1), 40–49 (2004)

  38. Kingma, Diederik P., Ba, Jimmy.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, (2014)

  39. Krizhevsky, Alex, et al: Learning multiple layers of features from tiny images. (2009)

  40. Kubat, Miroslav, Matwin, Stan, et al.: Addressing the curse of imbalanced training sets: one-sided selection. In International Conference on Machine Learning, volume 97, pages 179–186. Citeseer, (1997)

  41. Kukar, Matjaz, Kononenko, Igor, et al.: Cost-sensitive learning with neural networks. In ECAI, volume 15, pages 88–94. Citeseer, (1998)

  42. LeCun, Yann, Bottou, Léon., Bengio, Yoshua, Haffner, Patrick: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)

  43. Lehmann, Erich L.: Fisher, Neyman, and the creation of classical statistics. Springer Science & Business Media, (2011)

  44. Lipton, Zachary, Wang, Yu-Xiang, Smola, Alexander.: Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning, pages 3122–3130. PMLR, (2018)

  45. Liu, Xu-Ying, Wu, Jianxin, Zhou, Zhi-Hua: Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):539–550, (2008)

  46. Liu, Yang, An, Aijun, Huang, Xiangji: Boosting prediction accuracy on imbalanced datasets with svm ensembles. In Pacific-Asia conference on knowledge discovery and data mining, pages 107–118. Springer, (2006)

  47. Mac Namee, B., Cunningham, P., Byrne, S., Corrigan, O.I.: The problem of bias in training data in regression problems in medical decision support. Artificial Intelligence in Medicine 24(1), 51–70 (2002)

  48. Mai, Qing, Zou, Hui, Yuan, Ming: A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika 99(1), 29–42 (2012)

  49. Maloof, Marcus A.: Learning when data sets are imbalanced and when costs are unequal and unknown. In ICML-2003 workshop on learning from imbalanced data sets II, volume 2, pages 2–1, (2003)

  50. Marcenko, V.A., Pastur, L.A.: Distribution of eigenvalues for some sets of random matrices. Math. USSR-Sbornik 1, 457–483 (1967)

  51. Mazurowski, Maciej A., Habas, Piotr A., Zurada, Jacek M., Lo, Joseph Y., Baker, Jay A., Tourassi, Georgia D.: Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural networks, 21(2-3):427–436, (2008)

  52. Mei, Song., Montanari, Andrea.: The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, (2019)

  53. Montanari, A., Ruan, F., Sohn, Y., Yan, J.: The generalization error of max-margin linear classifiers: high-dimensional asymptotics in the overparametrized regime (2019)

  54. Muthukumar, Vidya, Vodrahalli, Kailas, Subramanian, Vignesh, Sahai, Anant: Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory 1(1), 67–83 (2020)

  55. Nakkiran, Preetum.: More data can hurt for linear regression: Sample-wise double descent. arXiv preprint arXiv:1912.07242, (2019)

  56. Chan, P.K., Stolfo, S.J.: Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 164–168 (1998)

  57. Pinsky, M.A.: An Introduction to Probability Theory and Its Applications, Vol. 2 (William Feller). SIAM Rev. 14(4), 662–663 (1972). https://doi.org/10.1137/1014119

  58. Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D. (eds.): Dataset Shift in Machine Learning (When Training and Test Sets Are Different: Characterizing Learning Transfer). MIT Press (2009)

  59. Radivojac, Predrag, Chawla, Nitesh V., Dunker, A Keith, Obradovic, Zoran: Classification and knowledge discovery in protein databases. Journal of Biomedical Informatics, 37(4): 224–239, (2004)

  60. Raudys, Sarunas, Duin, Robert PW.: Expected classification error of the fisher linear classifier with pseudo-inverse covariance matrix. Pattern recognition letters 19(5–6), 385–392 (1998)

  61. Seiffert, Chris, Khoshgoftaar, Taghi M., Van Hulse, Jason, Napolitano, Amri: Rusboost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 40(1):185–197, (2009)

  62. Shao, Jun, Wang, Yazhen, Deng, Xinwei, Wang, Sijian, et al.: Sparse linear discriminant analysis by thresholding for high dimensional data. Annals of Statistics 39(2), 1241–1265 (2011)

  63. Sifaou, Houssem, Kammoun, Abla, Alouini, Mohamed-Slim: High-dimensional linear discriminant analysis classifier for spiked covariance model. Journal of Machine Learning Research, 21, (2020)

  64. Sima, Chao, Dougherty, Edward R.: The peaking phenomenon in the presence of feature-selection. Pattern Recognition Letters, 29(11): 1667–1674, (2008)

  65. Storkey, Amos J.: When training and test sets are different: characterising learning transfer. pages 3–28, (2009)

  66. Sun, Yanmin, Kamel, Mohamed S, Wong, Andrew KC., Wang, Yang: Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12):3358–3378, (2007)

  67. Velilla, Santiago, Hernández, Adolfo: On the consistency properties of linear and quadratic discriminant analyses. Journal of multivariate analysis 96(2), 219–236 (2005)

  68. Wang, Benjamin X., Japkowicz, Nathalie.: Boosting support vector machines for imbalanced data sets. Knowledge and information systems, 25(1): 1–20, (2010)

  69. Wang, Cheng, Jiang, Binyan: On the dimension effect of regularized linear discriminant analysis. Electron. J. Statist. 12(2), 2709–2742 (2018). https://doi.org/10.1214/18-EJS1469

  70. Wang, Yu-Xiong, Ramanan, Deva, Hebert, Martial.: Learning to model the tail. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 7032–7042, (2017)

  71. Weiss, Gary M.: Mining with rarity: a unifying framework. ACM Sigkdd Explorations Newsletter, 6(1): 7–19, (2004)

  72. Xiao, Jianxiong, Hays, James, Ehinger, Krista A., Oliva, Aude, Torralba, Antonio.: Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE, (2010)

  73. Xu, Ji, Hsu, Daniel.: On the number of variables to use in principal component regression. arXiv preprint arXiv:1906.01139, (2019)

  74. Zhao, Eric, Liu, Anqi, Anandkumar, Animashree, Yue, Yisong.: Active learning under label shift. (2021). arXiv:2007.08479 [cs.LG]

  75. Zollanvari, Amin, Dougherty, Edward R.: Generalized consistent error estimator of linear discriminant analysis. IEEE transactions on signal processing, 63 (11):2804–2814, (2015)

  76. Zollanvari, Amin, Braga-Neto, Ulisses M., Dougherty, Edward R.: Analytic study of performance of error estimators for linear discriminant analysis. IEEE Transactions on Signal Processing, 59 (9):4238–4255, (2011)

Author information

Corresponding author

Correspondence to Wenjing Liao.

Additional information

Communicated by Mark Iwen.

This research is partially supported by NSF DMS 2012652 and NSF CAREER 2145167.

Appendices

A Proof of the phase transition in Sect. 3.2

Misclassification error as \(n_1/ n_0 \rightarrow \infty \)

In this section, we prove that the misclassification error tends to 0.5 as \(n_1/n_0\rightarrow \infty \). When the training data set is extremely imbalanced, i.e., \(n_1/n_0\rightarrow \infty \) and hence \(\gamma _0/\gamma _1\rightarrow \infty \), the classifier tends to assign all data points to class 1. This leads to the following limits,

$$\begin{aligned} \frac{\Delta ^2+\gamma _0-\gamma _1}{\sqrt{\Delta ^2+\gamma _0+\gamma _1}}\rightarrow +\infty ,\qquad \frac{\Delta ^2+\gamma _1-\gamma _0}{\sqrt{\Delta ^2+\gamma _0+\gamma _1}}\rightarrow -\infty . \end{aligned}$$

Then the limit of misclassification error is given by

$$\begin{aligned} \mathcal {R}(f_{\widehat{\alpha }, \widehat{\beta }}^{\widehat{b}})\overset{\mathrm{a.s.}}{\longrightarrow }\frac{1}{2}\Phi (-\infty )+\frac{1}{2}\Phi (+\infty )=0.5. \end{aligned}$$

The proof is complete.
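For intuition (our illustration, not part of the proof), the following Python sketch trains a Fisher-type linear discriminant on an extremely imbalanced Gaussian sample and evaluates it on a balanced test set; the empirical error is close to 0.5. The classifier form \(\widehat{\beta }=\widehat{\Sigma }^\dagger (\widehat{\mu }_0-\widehat{\mu }_1)\) with intercept \(\ln (n_0/n_1)\) is assumed here and may differ in minor details from \(f_{\widehat{\alpha }, \widehat{\beta }}^{\widehat{b}}\) as defined in the main text.

    import numpy as np

    rng = np.random.default_rng(0)
    p, n0, n1, n_test = 50, 20, 20_000, 5_000
    mu0, mu1 = np.zeros(p), np.ones(p) / np.sqrt(p)   # ||mu0 - mu1||_2 = 1

    # training data (identity covariance), heavily imbalanced toward class 1
    X0 = mu0 + rng.standard_normal((n0, p))
    X1 = mu1 + rng.standard_normal((n1, p))
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Z = np.vstack([X0 - m0, X1 - m1])
    Sigma_hat = Z.T @ Z / (n0 + n1 - 2)               # pooled sample covariance

    beta = np.linalg.pinv(Sigma_hat) @ (m0 - m1)

    def predict(X):
        # assign class 0 when the discriminant score (with prior-ratio intercept) is positive
        score = (X - (m0 + m1) / 2) @ beta + np.log(n0 / n1)
        return np.where(score > 0, 0, 1)

    T0 = mu0 + rng.standard_normal((n_test, p))
    T1 = mu1 + rng.standard_normal((n_test, p))
    err = 0.5 * np.mean(predict(T0) != 0) + 0.5 * np.mean(predict(T1) != 1)
    print(err)   # close to 0.5: nearly everything is assigned to class 1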

Phase transition knots

In this section, we provide theoretical justifications of the phase transition knots given in Sect. 3.2. Denote the asymptotic misclassification error in (6) and (7) as \(\mathcal {R}(\gamma _0,\gamma _1)\). We fix \(\gamma _0\) and let \(\gamma _1\) vary, starting from the balanced case with \(\gamma _1=\gamma _0\).

The transition knots are obtained by a local analysis of the instantaneous change of the misclassification error as \(n_1/n_0\) slightly increases from 1. We observe that, as \(n_1/n_0\) slightly increases from 1, the misclassification error decreases in Phases I and III, and increases in Phase II. Notice that \(\gamma _1\) slightly decreases from \(\gamma _0\) as \(n_1/n_0\) slightly increases from 1.

The instantaneous change of \(\mathcal {R}(\gamma _0,\gamma _1)\) with respect to \(\gamma _1\) can be characterized by the following partial derivative, \(\frac{\partial }{\partial \gamma _1}\mathcal {R}(\gamma _0,\gamma _1)\,\vert \,_{\gamma _1=\gamma _0}\):

$$\begin{aligned} \frac{\partial }{\partial \gamma _1}\mathcal {R}\,\vert \,_{\gamma _1=\gamma _0} = \frac{\Delta ^2 \phi \left( \frac{-\Delta ^2(1-\frac{1}{2}\gamma _0)^{\frac{1}{2}}}{2\sqrt{\Delta ^2+2\gamma _0}}\right) }{16(\Delta ^2+2\gamma _0)^{\frac{3}{2}}(1-\frac{1}{2}\gamma _0)^{\frac{1}{2}}} \times [4+\Delta ^2], \end{aligned}$$

for \(\gamma _0<2\), and

$$\begin{aligned} \frac{\partial }{\partial \gamma _1}\mathcal {R}\,\vert \,_{\gamma _1=\gamma _0} = \frac{\Delta ^2\phi \left( \frac{-\Delta ^2(\frac{1}{2}\gamma _0-1)^{\frac{1}{2}}}{\gamma _0\sqrt{\Delta ^2+2\gamma _0}}\right) }{4\gamma _0^2(\Delta ^2+2\gamma _0)^{\frac{3}{2}}(\frac{1}{2}\gamma _0-1)^{\frac{1}{2}}} \times \underbrace{[4\gamma _0^2-(12-\Delta ^2)\gamma _0-4\Delta ^2]}_{Q(\gamma _0,\Delta )}, \end{aligned}$$

for \(\gamma _0>2\), where \(\phi \) is the probability density function of the standard normal distribution. The term \(Q(\gamma _0,\Delta )\) is a quadratic function of \(\gamma _0\) with two roots of opposite signs. The positive root is \(\gamma _b=\frac{1}{8}(12-\Delta ^2+\sqrt{\Delta ^4+40\Delta ^2+144})\). The sign of the above partial derivative falls into the following cases:

  • When \(\gamma _0\in (0,2)\), \(\frac{\partial }{\partial \gamma _1}\mathcal {R}(\gamma _0,\gamma _1)\,\vert \,_{\gamma _1=\gamma _0}\) is always positive. As a result, \(\mathcal {R}(\gamma _0,\gamma _1)\) decreases as \(\gamma _1\) decreases from \(\gamma _0\), which corresponds to Phase I.

  • When \(\gamma _0 \in (2,\gamma _b)\), \(\frac{\partial }{\partial \gamma _1}\mathcal {R}(\gamma _0,\gamma _1)\,\vert \,_{\gamma _1=\gamma _0}\) is negative. In this case \(\mathcal {R}(\gamma _0,\gamma _1)\) increases as \(\gamma _1\) decreases from \(\gamma _0\), which corresponds to Phase II.

  • When \(\gamma _0 \in (\gamma _b,+\infty )\), \(\frac{\partial }{\partial \gamma _1}\mathcal {R}(\gamma _0,\gamma _1)\,\vert \,_{\gamma _1=\gamma _0}\) is positive. In this case \(\mathcal {R}(\gamma _0,\gamma _1)\) decreases as \(\gamma _1\) decreases from \(\gamma _0\), which corresponds to Phase III.
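As a quick numerical sanity check (ours, not from the paper), the following Python snippet verifies that \(\gamma _b\) is the positive root of \(Q(\gamma _0,\Delta )=4\gamma _0^2-(12-\Delta ^2)\gamma _0-4\Delta ^2\), and that Q is negative on \((2,\gamma _b)\) and positive beyond \(\gamma _b\).

    import numpy as np

    def gamma_b(delta):
        # positive root of Q(gamma, Delta) = 4 gamma^2 - (12 - Delta^2) gamma - 4 Delta^2
        d2 = delta ** 2
        return (12 - d2 + np.sqrt(d2 ** 2 + 40 * d2 + 144)) / 8

    def Q(gamma0, delta):
        d2 = delta ** 2
        return 4 * gamma0 ** 2 - (12 - d2) * gamma0 - 4 * d2

    for delta in [1.0, 2.0, 3.0]:
        gb = gamma_b(delta)
        # Q vanishes at the knot, is negative between 2 and gamma_b, and positive beyond it
        print(delta, gb, Q(gb, delta), Q((2 + gb) / 2, delta) < 0, Q(gb + 1, delta) > 0)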

B Proofs of Lemmas in Sect. 6.1

B.1 Proofs of Lemmas 4 and 5

Proof of Lemma 4

We prove this result by Basu's theorem, i.e., we show that \(\widehat{\mu }\) is a complete and sufficient statistic for \(\mu \) and that \(\widehat{\Sigma }\) is an ancillary statistic whose distribution does not depend on \(\mu \).

First, we show that \(\widehat{\mu }\) is a complete statistic. We need to check that, for any measurable function g, \(\mathbb {E}[g(\hat{\mu })] = 0\) for every \(\mu \) implies \(\mathbb {P}(g(\hat{\mu }) = 0) = 1\) for every \(\mu \). Indeed, for any measurable function g such that the expectation of \(g(\widehat{\mu })\) over the sample \((x_1,x_2,\dots ,x_n)\) is zero, i.e.,

$$\begin{aligned} \mathbb {E}[g(\widehat{\mu })] = 0\quad \text {for any}~\mu , \end{aligned}$$
(30)

we can derive \(\mathbb {P}(g(\hat{\mu }) = 0) = 1\) by taking derivatives of Equation (30) with respect to \(\mu \) repeatedly, which yields

$$\begin{aligned} \mathbb {E}\big [ h(\widehat{\mu }) g(\widehat{\mu })\big ] = 0 \quad \text {for any polynomial } h, \end{aligned}$$

and therefore \(\widehat{\mu }\) is a complete statistic w.r.t. parameter \({\mu }\).

To prove that \(\widehat{\mu }\) is also a sufficient statistic for \({\mu }\), we need to show that, given the statistic \(\widehat{\mu }\), the conditional distribution of \(x_1,\dots ,x_n\) does not depend on \(\mu \). Note that \(\widehat{\mu }\) has a multivariate normal distribution, i.e., \(\widehat{\mu } \sim \mathcal {N}({\mu }, \frac{1}{n}\Sigma )\), since \(\widehat{\mu } = \frac{1}{n} \sum _{i=1}^n x_i\) is a linear combination of the i.i.d. multivariate normal vectors \(x_1,x_2, \dots , x_n\). The pdf of \(\widehat{\mu }\) and the joint density of \(x_1, x_2, \dots , x_n\) are given by

$$\begin{aligned} f(\widehat{\mu } )&= \frac{1}{(2\pi )^{\frac{p}{2}}\vert \frac{1}{n}\Sigma \vert ^{\frac{1}{2}}} \exp \left( -\frac{n}{2}(\widehat{\mu } - {\mu })^\top \Sigma ^{-1} (\widehat{\mu } - {\mu })\right) ,\nonumber \\ f(x_1,\dots ,x_n)&= \frac{1}{(2\pi )^\frac{np}{2} \vert \Sigma \vert ^\frac{n}{2}} \exp \left( -\sum _{i=1}^n \frac{1}{2}(x_i-{\mu })^\top \Sigma ^{-1} (x_i-{\mu })\right) . \end{aligned}$$
(31)

The joint density function of \(x_1, \dots , x_n\) and \(\hat{\mu }\) is given by

$$\begin{aligned} f(x_1,\dots ,x_n,\widehat{\mu }) =f(x_1,\dots ,x_n) \mathbbm {1}\left( \widehat{\mu }=\frac{1}{n}(x_1+x_2+\dots +x_n)\right) . \end{aligned}$$
(32)

By taking the ratio of (32) to (31), the conditional density of \(x_1,\dots , x_n\) given \(\hat{\mu }\) is

$$\begin{aligned} f\left( x_1,\dots ,x_n\,\vert \,\widehat{\mu }\right) = C \exp \left( -\frac{1}{2}\sum _{i=1}^n(x_i-\widehat{\mu })^\top \Sigma ^{-1} (x_i-\widehat{\mu })\right) , \end{aligned}$$
(33)

where C is a constant. By the Fisher–Neyman factorization theorem [43], since the conditional distribution of \(x_1,\dots ,x_n\) given \(\widehat{\mu }\) does not depend on \(\mu \), \(\hat{\mu }\) is a sufficient statistic for \({\mu }\).

The sample covariance has a distribution that does not depend on the parameter \({\mu }\):

$$\begin{aligned} \widehat{\Sigma }= \sum _{i=1}^n (x_i - \widehat{\mu })^\top (x_i - \widehat{\mu }) = \sum _{i=1}^n z_i^\top z_i, \end{aligned}$$
(34)

where \(z_i = x_i - \widehat{\mu } = (x_i-\mu ) - (\widehat{\mu }-\mu )\) is a function of the centered samples \(x_1-\mu ,\dots ,x_n-\mu \sim \mathcal {N}(0,\Sigma )\), whose joint distribution does not involve \(\mu \). Therefore \(\widehat{\Sigma }\) is an ancillary statistic.

Combining the facts that \(\widehat{\mu }\) is a complete and sufficient statistic and that \(\widehat{\Sigma }\) is an ancillary statistic, we obtain by Basu's theorem that \(\widehat{\mu }\) and \(\widehat{\Sigma }\) are independent. \(\square \)
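The independence asserted by Basu's theorem can be probed numerically. The sketch below (ours, with an arbitrary choice of \(\mu \) and \(\Sigma =I_p\)) checks that a coordinate of \(\widehat{\mu }\) is empirically uncorrelated with an entry of \(\widehat{\Sigma }\) across replications, a necessary (though not sufficient) consequence of independence.

    import numpy as np

    rng = np.random.default_rng(4)
    p, n, reps = 3, 50, 5_000
    mu = np.array([1.0, -2.0, 0.5])

    mean_coord, cov_entry = [], []
    for _ in range(reps):
        X = mu + rng.standard_normal((n, p))
        mu_hat = X.mean(axis=0)
        Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / (n - 1)
        mean_coord.append(mu_hat[0])
        cov_entry.append(Sigma_hat[0, 0])

    print(np.corrcoef(mean_coord, cov_entry)[0, 1])   # close to 0, consistent with independence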

Proof of Lemma 5

The following isotropic property of the Wishart distribution is given in [69]: for any orthogonal matrix \(U \in \mathbb {R}^{p \times p}\), we have

$$\begin{aligned} U^\top \left( \frac{Z^\top Z}{n-2}\right) U \sim \mathcal {W}(I_p,n-2). \end{aligned}$$
(35)

We next apply this property to the left-hand side of equation (10),

$$\begin{aligned} \begin{aligned} z^\top \left( \frac{Z^\top Z}{n-2}\right) ^{\dagger }z&=z^\top U_i U_i^\top \left( \frac{Z^\top Z}{n-2}\right) ^\dagger U_i U_i^\top z\\&=\Vert {z}\Vert ^2 e_i^\top \left( U_i^\top \left( \frac{Z^\top Z}{n-2}\right) U_i\right) ^{\dagger } e_i\\&\mathop {=}\limits ^d\Vert {z}\Vert ^2 e_i^\top \left( \frac{Z^\top Z}{n-2}\right) ^{\dagger } e_i, \end{aligned} \end{aligned}$$
(36)

where \(U_i\) is an orthogonal matrix that rotates the vector z onto the direction of the canonical basis vector \(e_i\), i.e.,

$$\begin{aligned} U_i^\top z = \Vert {z}\Vert _2 e_i, \quad i=1,2,\dots ,p. \end{aligned}$$
(37)

We can further simplify the product \( z ^\top (\frac{Z^\top Z}{n-2})^{\dagger } z\) by taking an average over the index i,

$$\begin{aligned} \begin{aligned} z^\top \left( \frac{Z^\top Z}{n-2}\right) ^\dagger z&\mathop {=}\limits ^d \frac{1}{p} \Vert z \Vert _2^2 \sum _{i=1,...,p} e_i^\top \left( \frac{Z^\top Z}{n-2}\right) ^{\dagger } e_i\\&= \frac{1}{p}\Vert {z}\Vert ^2_2 \text {tr}\left( \left( \frac{Z^\top Z}{n-2}\right) ^\dagger \right) , \end{aligned} \end{aligned}$$
(38)

where the equality in distribution uses the isotropic property of the Wishart distribution; this establishes Equation (10). \(\square \)
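As an aside (not part of the paper), the distributional identity above can be checked by comparing Monte Carlo averages of the two sides; the short Python sketch below assumes z is a standard Gaussian vector drawn independently of Z.

    import numpy as np

    rng = np.random.default_rng(0)
    p, n, trials = 50, 30, 2_000          # overparametrized: p > n - 2

    lhs, rhs = [], []
    for _ in range(trials):
        Z = rng.standard_normal((n - 2, p))          # rows are N(0, I_p) samples
        z = rng.standard_normal(p)                   # independent Gaussian vector
        W_pinv = np.linalg.pinv(Z.T @ Z / (n - 2))   # pseudo-inverse of the scaled Gram matrix
        lhs.append(z @ W_pinv @ z)
        rhs.append(np.dot(z, z) * np.trace(W_pinv) / p)

    print(np.mean(lhs), np.mean(rhs))   # the two averages should be close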

B.2 Proofs of Lemmas 6, 7 and 8

Proof of Lemma 6

We use the eigenvalue decomposition \(Z^\top Z=U D U^\top \), where U is orthogonal and D is diagonal, to simplify the left-hand side of Eq. (11),

$$\begin{aligned} \begin{aligned} \textrm{tr}[(Z^\top Z)^\dagger ]&=\text {tr}\left( U D^\dagger U^\top \right) \\&=\text {tr}\left( D^\dagger \right) \\&=\sum \limits _{s \in \lambda (Z^\top Z),s\ne 0} \frac{1}{s}. \end{aligned} \end{aligned}$$

The result above implies that the trace of the pseudo-inverse of \(Z^\top Z\) is equal to the sum of the reciprocals of its non-zero eigenvalues. By the same argument applied to \(ZZ^\top \), we can show that

$$\begin{aligned} \textrm{tr}[(ZZ^\top )^\dagger ] =\sum \limits _{s \in \lambda (ZZ^\top ),s\ne 0} \frac{1}{s}. \end{aligned}$$

Then we deduce the desired result by the fact that the set of non-zero eigenvalues of \(ZZ^\top \) matches that of \(Z^\top Z\). \(\square \)
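A quick numerical illustration of this trace identity (our addition, in Python):

    import numpy as np

    rng = np.random.default_rng(1)
    Z = rng.standard_normal((8, 20))            # a short, wide matrix: Z^T Z is rank deficient

    t_pp = np.trace(np.linalg.pinv(Z.T @ Z))    # trace of the p x p pseudo-inverse
    t_nn = np.trace(np.linalg.pinv(Z @ Z.T))    # trace of the (n-2) x (n-2) pseudo-inverse
    eig = np.linalg.eigvalsh(Z @ Z.T)           # the shared non-zero spectrum
    print(t_pp, t_nn, np.sum(1.0 / eig[eig > 1e-10]))   # all three agree up to numerical error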

Proof of Lemma 7

We first compute the limit of \(\frac{1}{\sqrt{p}}\mu _d^\top z\). Since \(z\sim \mathcal {N}(0,I_p)\), the linear combination \(\frac{1}{\sqrt{p}}\mu _d^\top z\) is a normal random variable, namely \(\frac{1}{\sqrt{p}}\mu _d^\top z \sim \mathcal {N}(0, \frac{1}{p}{\mu }_d^\top \mu _d)\). From the tail bound for normal random variables [57], we have

$$\begin{aligned} \mathbb {P}\left( \left| \frac{1}{\sqrt{p}}\mu _d^\top z\right| \ge \frac{\epsilon }{\sqrt{p}}\Vert \mu _d\Vert _2\right) \le 2 \frac{1}{\sqrt{2\pi }\epsilon } e^{-\epsilon ^2/2} \quad \text {for all }\epsilon > 0. \end{aligned}$$
(39)

Combining Eq. (39), the bound \(e^{-x}<\frac{1}{x}\) for \(x>0\), and \(\Vert \mu _d\Vert _2 < 2\Delta \) for sufficiently large p, the sum of the probabilities of the events \(\frac{1}{\sqrt{p}}\vert \mu _d^\top {z}\vert > \epsilon \) is finite for any \(\epsilon > 0\), i.e.,

$$\begin{aligned} \begin{aligned} \sum _{p=1}^{\infty }\mathbb {P}\left( \frac{1}{\sqrt{p}}\vert \mu _d^\top {z}\vert > \epsilon \right)&\le \sum _{p=1}^{\infty } \frac{\Vert \mu _d\Vert _2}{\sqrt{2\pi p}\epsilon } e^{-(\epsilon ^2p)/(2\Vert \mu _d\Vert _2^2)}\\&<\sum _{p=1}^{\infty }\frac{2\Vert \mu _d\Vert _2^3}{\sqrt{2\pi }\epsilon ^3}~\frac{1}{p^{3/2}}\\&<\infty . \end{aligned} \end{aligned}$$

By the Borel–Cantelli lemma, we have

$$\begin{aligned} \frac{1}{\sqrt{p}}\mu _d^\top {z} \overset{\mathrm{a.s.}}{\longrightarrow }0. \end{aligned}$$

We next consider the limit of \(\frac{1}{p}z_\ell ^\top z_\ell \). Since the squared entries \(z_{\ell ,i}^2\) of \(z_\ell \) are i.i.d. chi-squared random variables with expectation \(\mathbb {E}[z_{\ell ,i}^2]=1\) and finite variance \(\mathbb {V}ar (z_{\ell ,i}^2)=2\), the average of the squared entries of \(z_\ell \) converges to its expectation almost surely by the strong law of large numbers, namely,

$$\begin{aligned} \frac{1}{p}z_\ell ^\top z_\ell \overset{\mathrm{a.s.}}{\longrightarrow }1. \end{aligned}$$

By a similar argument, \(n_\ell \) follows the binomial distribution \(B(n_0+n_1,\pi _\ell )\), i.e., it is a sum of \(n_0+n_1\) independent Bernoulli random variables with expectation \(\pi _\ell \) and variance \(\pi _0\pi _1\). From the strong law of large numbers, we have

$$\begin{aligned} \frac{n_\ell }{n_0+n_1}\overset{\mathrm{a.s.}}{\longrightarrow }\pi _\ell . \end{aligned}$$

\(\square \)
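The three almost sure limits in Lemma 7 are easy to check numerically; the sketch below (ours) draws one large instance with \(\Vert \mu _d\Vert _2=\Delta \) and \(\Sigma =I_p\) and reports the corresponding ratios.

    import numpy as np

    rng = np.random.default_rng(2)
    p = 200_000
    Delta = 3.0
    mu_d = np.full(p, Delta / np.sqrt(p))       # a deterministic vector with ||mu_d||_2 = Delta
    z = rng.standard_normal(p)

    print(mu_d @ z / np.sqrt(p))                # -> 0
    print(z @ z / p)                            # -> 1

    pi0, n = 0.3, 100_000
    n0 = rng.binomial(n, pi0)                   # class-0 count among n draws
    print(n0 / n)                               # -> pi_0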

Proof of Lemma 8

We first derive the expression of \(m(\zeta )\). The Marchenko–Pastur law is supported on a compact subset of \(\mathbb {R}^+\), i.e., \(\textrm{supp}(F_{\gamma })\subset [a,b]\), where

$$\begin{aligned} a=(1-\sqrt{\gamma })^2, \quad \text {and} \quad b=(1+\sqrt{\gamma })^2. \end{aligned}$$

Let \(\{z_k\}\) be a sequence of complex numbers such that \(\textrm{Im}(z_k)>0\) and \(\textrm{Re}(z_k)=\zeta \) for every k, and \(\lim _{k\rightarrow \infty }z_k=\zeta \). Consider the sequence of integrals

$$\begin{aligned} \int _{a}^b \frac{1}{s-z_k} dF_\gamma (s). \end{aligned}$$

For any k, \(0<\gamma <1\) and \(s>a\), we have

$$\begin{aligned} \left| \frac{1}{s-z_k}\right| \le \left| \frac{1}{s}\right|<\frac{1}{(1-\sqrt{\gamma })^2}<\infty . \end{aligned}$$

By the dominated convergence theorem, we have

$$\begin{aligned} \int \frac{1}{s-\zeta } dF_\gamma (s)&=\int _{a}^b \lim _{k\rightarrow \infty }\frac{1}{s-z_k} dF_\gamma (s)=\lim _{k\rightarrow \infty } \int _{a}^b \frac{1}{s-z_k} dF_\gamma (s)\nonumber \\&=\lim _{k\rightarrow \infty } \int \frac{1}{s-z_k} dF_\gamma (s). \quad \end{aligned}$$
(40)

To compute \(\int \frac{1}{s-z_k} dF_\gamma (s)\), [2, Lemma 3.11] gives

$$\begin{aligned} \int \frac{1}{s-z_k} dF_\gamma (s)= \frac{1-\gamma -z_k+\sqrt{(z_k-\gamma -1)^2-4\gamma }}{2\gamma z_k}. \end{aligned}$$
(41)

According to the definition of the square root of complex numbers in [2, Eq. (2.3.2)], the real part of \(\sqrt{(z_k-\gamma -1)^2-4\gamma }\) has the same sign as that of \(z_k-\gamma -1\). Since \(\textrm{Re}(z_k)=\zeta \le 0\) and \(\gamma >0\), the real part of \(\sqrt{(z_k-\gamma -1)^2-4\gamma }\) is negative, which gives

$$\begin{aligned} \lim _{k\rightarrow \infty } \sqrt{(z_k-\gamma -1)^2-4\gamma }= -\sqrt{(\zeta -\gamma -1)^2-4\gamma }. \end{aligned}$$
(42)

Substituting (42) and (41) into (40) gives rise to (12).

We then compute m(0). When we substitute \(\zeta =0\) into (12), both the numerator and the denominator are 0, so we apply L'Hôpital's rule:

$$\begin{aligned} m(0)&=\lim _{\zeta \rightarrow 0}\frac{1-\gamma -\zeta -\sqrt{(\zeta -\gamma -1)^2-4\gamma }}{2\gamma \zeta } \\&=\lim _{\zeta \rightarrow 0} \frac{1}{2\gamma } \left( -1-\frac{\zeta -\gamma -1}{\sqrt{(\zeta -\gamma -1)^2-4\gamma }}\right) \\&=\frac{1}{2\gamma } \left( -1-\frac{-\gamma -1}{\sqrt{(-\gamma -1)^2-4\gamma }}\right) \nonumber \\&=\frac{1}{2\gamma } \left( -1+\frac{1+\gamma }{1-\gamma }\right) \nonumber \\&=\frac{1}{1-\gamma }. \end{aligned}$$

We next derive the expression of \(\frac{d}{d\zeta }m(\zeta )\). We first show that

$$\begin{aligned} \int \frac{1}{(s-\zeta )^2} dF_\gamma (s)= \lim _{z\rightarrow \zeta }\frac{d }{d z} \int \frac{1}{s-z} dF_\gamma (s) \quad \text {for }z\in \mathbb {C}\text { with }\textrm{Re}(z)=\zeta , \textrm{Im}(z)>0. \end{aligned}$$
(43)

Let \(\{h_k\}\) be a sequence of complex numbers such that \(\vert \textrm{Re}(h_k)\vert \le \vert \zeta \vert /2\) for every k and \(\lim _{k\rightarrow \infty }h_k=0\). For any k and \(s\ge a\), we have

$$\begin{aligned} \left| \frac{1}{(s-z-h_k)(s-z)}\right| \le \frac{1}{(1-\sqrt{\gamma })^4}<\infty . \end{aligned}$$

By the dominated convergence theorem, we have

$$\begin{aligned} \frac{\partial }{\partial z} \int \frac{1}{s-z} dF_\gamma (s)=&\lim _{k\rightarrow \infty }\frac{1}{h_k}\left[ \int \frac{1}{s-(z+h_k)} dF_\gamma (s)-\int \frac{1}{s-z} dF_\gamma (s)\right] \\ =&\lim _{k\rightarrow \infty }\frac{1}{h_k}\int _{a}^{b} \frac{1}{s-z-h_k}-\frac{1}{s-z} dF_\gamma (s) \\ =&\lim _{k\rightarrow \infty }\int _{a}^{b} \frac{1}{(s-z-h_k)(s-z)} dF_\gamma (s) \\ =&\int _{a}^{b}\frac{1}{(s-z)^2} dF_\gamma (s), \end{aligned}$$

where the second equality holds since \(F_{\gamma }\) is supported on \([a,b]\).

Since

$$\begin{aligned} \left| \frac{1}{(s-z)^2}\right| \le \frac{1}{(1-\sqrt{\gamma })^4}<\infty , \end{aligned}$$

for any \(s\ge a\), we have

$$\begin{aligned} \int \frac{1}{(s-\zeta )^2}dF_{\gamma }(s)&= \int _{a}^{b} \lim _{z\rightarrow \zeta }\frac{1}{(s-z)^2}dF_{\gamma }(s) =\lim _{z\rightarrow \zeta } \int _{a}^{b} \frac{1}{(s-z)^2}dF_{\gamma }(s)\\&= \lim _{z\rightarrow \zeta }\frac{\partial }{\partial z} \int \frac{1}{s-z} dF_\gamma (s). \end{aligned}$$

Using (41), we have

$$\begin{aligned}&\frac{\partial }{\partial z} \int \frac{1}{s-z} dF_\gamma (s) \nonumber \\ {}&= \frac{(2\gamma z)\left[ -1+\frac{z-\gamma -1}{\sqrt{(z-\gamma -1)^2-4\gamma }}\right] -(2\gamma )\left( 1-\gamma -z+\sqrt{(z-\gamma -1)^2-4\gamma }\right) }{4\gamma ^2z^2} \nonumber \\&=\frac{(2\gamma )(\gamma -1)+\frac{(2\gamma z)(z-\gamma -1)}{\sqrt{(z-\gamma -1)^2-4\gamma }}-2\gamma \sqrt{(z-\gamma -1)^2-4\gamma }}{4\gamma ^2z^2}. \end{aligned}$$
(44)

Letting \(z\rightarrow \zeta \) in (44) and recalling that the real part of \(\sqrt{(z-\gamma -1)^2-4\gamma }\) is negative, one gets (13).

To compute \(\frac{d}{d\zeta }m(0)\), by L'Hôpital's rule, we deduce

$$\begin{aligned} \frac{d}{d\zeta }m(0)= \lim _{\zeta \rightarrow 0} \frac{(2\gamma )(\gamma -1)-\frac{(2\gamma \zeta )(\zeta -\gamma -1)}{\sqrt{(\zeta -\gamma -1)^2-4\gamma }}+2\gamma \sqrt{(\zeta -\gamma -1)^2-4\gamma }}{4\gamma ^2\zeta ^2}=\frac{1}{(1-\gamma )^3} . \end{aligned}$$
(45)

\(\square \)
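For intuition (our addition), the closed forms derived above can be compared with the empirical quantities \(\frac{1}{p}\textrm{tr}[(\widehat{S}+\lambda I_p)^{-1}]\) and \(\frac{1}{p}\textrm{tr}[(\widehat{S}+\lambda I_p)^{-2}]\) for a large sample covariance matrix \(\widehat{S}=\frac{1}{n}Z^\top Z\) with aspect ratio \(\gamma =p/n<1\); a short Python sketch follows.

    import numpy as np

    def m(lam, g):
        # m(-lambda) for the Marchenko-Pastur law with aspect ratio g, cf. Eq. (12)
        return (np.sqrt((g + lam + 1) ** 2 - 4 * g) + g - lam - 1) / (2 * g * lam)

    rng = np.random.default_rng(3)
    n, p = 4_000, 1_000                 # aspect ratio g = p / n = 0.25 < 1
    g = p / n
    Z = rng.standard_normal((n, p))
    S = Z.T @ Z / n
    lam = 0.5
    eig = np.linalg.eigvalsh(S)         # all positive since n > p

    print(np.mean(1.0 / (eig + lam)), m(lam, g))             # ~ m(-lambda)
    print(np.mean(1.0 / eig), 1.0 / (1.0 - g))               # ~ m(0) = 1/(1-gamma)
    print(np.mean(1.0 / eig ** 2), 1.0 / (1.0 - g) ** 3)     # ~ m'(0) = 1/(1-gamma)^3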

C Proofs in Sect. 4

C.1 Proof of Theorem 3

Proof of Theorem 3

The proof uses the same technique as the proof of Theorem 2: the misclassification error takes the same form as (14), and we only need to derive the limits of \(q_0\) and \(q_1\). By the change of variables formula in (15) and Lemma 8, we deduce

$$\begin{aligned}&\hat{\beta }^\top (\hat{\alpha } - \mu _0) \nonumber \\&= - \frac{1}{2} (\mu _d + \frac{1}{\sqrt{n_0}}z_0 - \frac{1}{\sqrt{n_1}}z_1)^\top \left( \frac{Z^\top Z}{n-2} + \lambda I_p\right) ^\dagger (\mu _d - \frac{1}{\sqrt{n_0}} z_0 - \frac{1}{\sqrt{n_1}} z_1)\nonumber \\&=\frac{1}{2n_0} z_0^\top \left( \frac{Z^\top Z}{n-2}+ \lambda I_p\right) ^\dagger z_0 - \frac{1}{2}(\mu _d - z_1/\sqrt{n_1})^\top \left( \frac{Z^\top Z }{n-2}+ \lambda I_p\right) ^\dagger (\mu _d - z_1/\sqrt{n_1}) \nonumber \\&\overset{d}{=}\ \left[ -\frac{1}{2}\Vert \mu _d -\tfrac{1}{\sqrt{n_1}}z_1\Vert _2^2 + \frac{1}{2n_0} \Vert z_0\Vert _2^2 \right] \times \frac{1}{p} \text {tr}\left( \frac{Z^\top Z}{n-2} + \lambda I_p\right) ^{\dagger } \nonumber \\&\overset{\mathrm{a.s.}}{\longrightarrow }-\frac{1}{2} (\Delta ^2 + \gamma _1 -\gamma _0) m(-\lambda ). \end{aligned}$$
(46)

Similarly we can derive

$$\begin{aligned} \Vert \widehat{\beta }\Vert _\Sigma ^2 =&(\widehat{\mu }_0-\widehat{\mu }_1)^\top (\widehat{\Sigma }+\lambda I_p)^\dagger \Sigma (\widehat{\Sigma }+ \lambda I_p)^\dagger (\widehat{\mu }_0-\widehat{\mu }_1)\nonumber \\ =&({\mu }_d + \frac{1}{\sqrt{n_0}}{z}_0 - \frac{1}{\sqrt{n_1}}{z}_1)^\top \left( \left( \frac{Z^\top Z}{n-2} + \lambda I_p\right) ^{\dagger }\right) ^2 ({\mu }_d + \frac{1}{\sqrt{n_0}}{z}_0 - \frac{1}{\sqrt{n_1}}{z}_1)\nonumber \\ \overset{d}{=}\ {}&\Vert {\mu _d + \frac{1}{\sqrt{n_0}}z_0 - \frac{1}{\sqrt{n_1}}z_1}\Vert _2^2 \times \frac{1}{p} \text {tr}\left( \left( \left( \frac{Z^\top Z}{n-2}+\lambda I_p\right) ^\dagger \right) ^2\right) \nonumber \\ \overset{\mathrm{a.s.}}{\longrightarrow }&(\Delta ^2 + \gamma _0 + \gamma _1)m'(-\lambda ). \end{aligned}$$
(47)

Combining (46) and (47), as well as putting the threshold term \(\ln (n_1/n_0)\) back, we obtain

$$\begin{aligned} q_0 \overset{\mathrm{a.s.}}{\longrightarrow }\frac{-\frac{1}{2}(\Delta ^2 -\gamma _0+\gamma _1)m(-\lambda )+\ln {\frac{\gamma _0}{\gamma _1}}}{\sqrt{(\Delta ^2 + \gamma _1 + \gamma _0)m'(-\lambda )}}. \end{aligned}$$

The same argument used for \(q_0\) applies to \(q_1\), and therefore we have

$$\begin{aligned} q_1 \overset{\mathrm{a.s.}}{\longrightarrow }\frac{-\frac{1}{2}(\Delta ^2 + \gamma _0 - \gamma _1)m(-\lambda )+\ln {\frac{\gamma _1}{\gamma _0}}}{\sqrt{(\Delta ^2 + \gamma _1 + \gamma _0)m'(-\lambda )}}. \end{aligned}$$

We complete the proof by substituting \(q_0,q_1\) above into (14). \(\square \)
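To make the limiting expressions concrete, the sketch below (our illustration, assuming that (14) takes the form \(\mathcal {R}=\frac{1}{2}\Phi (q_0)+\frac{1}{2}\Phi (q_1)\) and that \(\gamma _\ell =p/n_\ell \), so the pooled aspect ratio is \(\gamma =(1/\gamma _0+1/\gamma _1)^{-1}\)) evaluates \(q_0\), \(q_1\) and the resulting error, using the closed form of \(m(-\lambda )\) from Lemma 8 and a finite-difference approximation of \(m'(-\lambda )\).

    import numpy as np
    from scipy.stats import norm

    def m(lam, g):
        # m(-lambda) for the Marchenko-Pastur law with aspect ratio g, cf. Eq. (12) / In[1]
        return (np.sqrt((g + lam + 1) ** 2 - 4 * g) + g - lam - 1) / (2 * g * lam)

    def m_prime(lam, g, h=1e-6):
        # m'(-lambda), approximated by a central finite difference in zeta = -lambda
        return (m(lam - h, g) - m(lam + h, g)) / (2 * h)

    def limiting_error(delta2, g0, g1, lam):
        g = 1.0 / (1.0 / g0 + 1.0 / g1)           # pooled aspect ratio, assuming gamma_l = p / n_l
        denom = np.sqrt((delta2 + g0 + g1) * m_prime(lam, g))
        q0 = (-0.5 * (delta2 - g0 + g1) * m(lam, g) + np.log(g0 / g1)) / denom
        q1 = (-0.5 * (delta2 + g0 - g1) * m(lam, g) + np.log(g1 / g0)) / denom
        return 0.5 * norm.cdf(q0) + 0.5 * norm.cdf(q1)   # assumed form of (14)

    print(limiting_error(delta2=9.0, g0=1.5, g1=1.5, lam=1.0))    # balanced case
    print(limiting_error(delta2=9.0, g0=1.5, g1=0.75, lam=1.0))   # imbalanced case: n_1 = 2 n_0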

C.2 Proof of regularized phase transition in Sect. 4.2

In this section, we show that with strong regularization the phase transition phenomenon vanishes. Denote the asymptotic misclassification error in Theorem 3 as

$$\begin{aligned} \mathcal {R}_\lambda (\gamma _0, \gamma _1) = \sum _{\ell =0,1} \frac{1}{2}\Phi \left( \frac{g(\gamma _0, \gamma _1, \ell )m(-\lambda )+(-1)^\ell \ln {\frac{\gamma _0}{\gamma _1}}}{k(\gamma _0, \gamma _1)\sqrt{m'(-\lambda )}}\right) , \end{aligned}$$

and we use the shorthand \(\mathcal {R}_\lambda (\gamma )\) to denote \(\mathcal {R}_\lambda (\gamma _0,\gamma _1)\) with the balanced data, i.e., \(\gamma _0=\gamma _1 = 2\gamma \),

$$\begin{aligned} \mathcal {R}_\lambda (\gamma ) := \mathcal {R}_\lambda (2\gamma ,2\gamma ) = \Phi \left( \frac{-\Delta ^2 m(-\lambda )}{2\sqrt{(\Delta ^2+4\gamma )m'(-\lambda )}}\right) . \end{aligned}$$

We show that the phase transition phenomenon vanishes under strong regularization, namely,

$$\begin{aligned} \frac{\partial }{\partial \gamma _1} \mathcal {R}_\lambda (\gamma _0,\gamma _1)\,\vert \,_{\gamma _0=\gamma _1=2\gamma }> 0 \quad \text {for sufficiently large } \lambda >0. \end{aligned}$$

To see this, we need to show that \(\frac{\partial \mathcal {R}_\lambda (\gamma )}{\partial \gamma }>0\) under strong regularization. Specifically, invoking the chain rule and after some manipulation, we have

$$\begin{aligned} \frac{\partial }{\partial \gamma _1}\mathcal {R}_\lambda (\gamma _0, \gamma _1)\,\vert \,_{\gamma _0=\gamma _1=2\gamma } = \frac{\partial \mathcal {R}_\lambda (\gamma ) }{\partial \gamma }\frac{\partial \gamma }{\partial \gamma _1}\,\vert \,_{\gamma _0=\gamma _1=2\gamma }. \end{aligned}$$

Using Mathematica [35], we check that

$$\begin{aligned} \frac{\partial \mathcal {R}_\lambda (\gamma )}{\partial \gamma } = \phi \left( \frac{-\Delta ^2m(-\lambda )}{2\sqrt{(\Delta ^2+4\gamma )m'(-\lambda )}}\right) \underbrace{\frac{\partial }{\partial \gamma } \left( \frac{-\Delta ^2m(-\lambda )}{2\sqrt{(\Delta ^2+4\gamma )m'(-\lambda )}}\right) }_{(\star )}, \end{aligned}$$

where \(\phi \) is the pdf of the standard normal distribution. Using Mathematica [35] again, the denominator of \((\star )\) is always positive, and is given by

$$\begin{aligned}&2 \sqrt{2} \gamma ^3 \lambda ^3 \frac{\left[ (\gamma +\lambda +1)^2-4 \gamma \right] ^{3/2}\left( 4 \gamma +\Delta ^2\right) ^{3/2}}{\left[ \gamma \lambda ^2 \left( (\gamma +\lambda +1)^2-4 \gamma \right) \right] ^{3/2}} \\ {}&\times \Big [\gamma ^3 + \gamma ^2 (\sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}+2 \lambda -3)\\&\quad +\gamma \Big (-2 \sqrt{\gamma ^2+2 \gamma (\lambda -1)+(\lambda +1)^2}+\lambda \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}+\lambda ^2+3\Big )\\&\quad +(\lambda +1) (\sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}-\lambda -1)\Big ]^{3/2}. \end{aligned}$$

As a result, the sign of \(\frac{\partial \mathcal {R}_\lambda (\gamma )}{\partial \gamma }\) is determined by the numerator of \((\star )\) given as

$$\begin{aligned}&\Delta ^2 \Big \{\gamma (\lambda +1) \Big [\Delta ^2 \Big (-3 \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2} -2 \lambda \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2 +\lambda ^2}\\&+\lambda ^2+5 \lambda +4\Big )+8 (\lambda +1)^2 \left( \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}-\lambda -1\right) \Big ]\\&+\gamma ^2 \Big [\Delta ^2 \Big (2 \lambda \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}+3 \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}\\ {}&-6 \lambda -6\Big ) -4 (\lambda +1) \Big (2 \lambda \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}-9 \lambda -9\\ {}&+7 \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}\Big )\Big ] +\gamma ^3 \Big [4 \lambda \Big (\sqrt{\gamma ^2+2 \gamma (\lambda -1)+(\lambda +1)^2}\\ {}&+\lambda \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}+\lambda ^2-9\Big )+36 \sqrt{\gamma ^2+2 \gamma (\lambda -1)+(\lambda +1)^2} \\ {}&+\Delta ^2 \left( -\sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}+\lambda +4\right) -64\Big ] +\gamma ^4 \Big [-\Delta ^2+56 \\ {}&{} -20 \sqrt{\gamma ^2+2 \gamma (\lambda -1)+(\lambda +1)^2}+4 \lambda \left( 2 \sqrt{\gamma ^2+2 \gamma (\lambda -1)+(\lambda +1)^2}+3 \lambda -4\right) \Big ] \\&+4 \gamma ^5 \left[ \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}+3 \lambda -6\right] +4 \gamma ^6 \\&+\Delta ^2 (\lambda +1)^3 \left( \sqrt{2 (\gamma +1) \lambda +(\gamma -1)^2+\lambda ^2}-\lambda -1\right) \Big \}. \end{aligned}$$

Combining the denominator and the numerator, \(\frac{\partial }{\partial \gamma } \mathcal {R}_\lambda (\gamma )\) is positive only when one of the following cases holds:

  1. \(\Delta>0 ~\textrm{and}~ 0<\gamma \le 1~\textrm{and}~\lambda >0\).

  2. \(\Delta>0 ~\textrm{and}~ \gamma \ge \frac{\sqrt{\Delta ^2+4}}{2}+1~\textrm{and}~ \lambda >0\).

  3. \(\Delta >0 ~\textrm{and}~ 1<\gamma <\frac{\sqrt{\Delta ^2+4}}{2}+1 ~\textrm{and}~ \lambda \ge \text {the smallest real root of} \big [{\#1}^4 (32 \gamma +4 \Delta ^2)+{\#1}^3 (96 \gamma ^2+8 \gamma \Delta ^2+128 \gamma +16 \Delta ^2) +{\#1}^2 (96 \gamma ^3+112 \gamma ^2+16 \gamma \Delta ^2+192 \gamma +\Delta ^4+24 \Delta ^2) +{\#1} (-8 \gamma ^3 \Delta ^2-16 \gamma ^2 \Delta ^2+32 \gamma ^4-96 \gamma ^3-64 \gamma ^2-2 \gamma \Delta ^4+8 \gamma \Delta ^2+128 \gamma +2 \Delta ^4+16 \Delta ^2) -4 \gamma ^4 \Delta ^2 +16 \gamma ^3 \Delta ^2+\gamma ^2 \Delta ^4-16 \gamma ^2 \Delta ^2-16 \gamma ^4+64 \gamma ^3- 80 \gamma ^2-2 \gamma \Delta ^4+32 \gamma +\Delta ^4+4 \Delta ^2\big ]\).

Consequently, we deduce that the misclassification error increases when \(\gamma \) grows in the interval \(\left( 0, 1\right) \) or \(\left( \frac{\sqrt{\Delta ^2+4}}{2}+1, \infty \right) \); when \(\gamma \) grows in \(\left( 1,\frac{\sqrt{\Delta ^2+4}}{2}+1\right) \), the misclassification error decreases when \(\lambda \) is small, yet increases when \(\lambda \) is large. For example, when \(\Delta ^2=9\) and \(\lambda =1\), \(\mathcal {R}_\lambda (\gamma )\) increases monotonically with respect to \(\gamma \), and the peaking phenomenon disappears. Meanwhile, the instantaneous derivative \(\frac{\partial \mathcal {R}_\lambda (\gamma _0,\gamma _1)}{\partial \gamma _1}\,\vert \,_{\gamma _0=\gamma _1=2\gamma }\) is positive for any \(\gamma _0\), which implies that the phase transition phenomenon vanishes.

The Mathematica commands are provided as follows.

In[1]: \(m(\gamma \_,\lambda \_)\text {:=}\frac{\sqrt{(\gamma +\lambda +1)^2-4 \gamma }+\gamma -\lambda -1}{2 \gamma \lambda }\)

In[2]: \(R(\gamma \_,\lambda \_,\Delta \_)\text {:=}-\frac{\Delta ^2 m(\gamma ,\lambda )}{2 \sqrt{-\left( 4 \gamma +\Delta ^2\right) \frac{\partial m(\gamma ,\lambda )}{\partial \lambda }}}\)

In[3]: \(\text {de}(\gamma \_,\lambda \_,\Delta \_)\text {:=}\)

\(\text {Evaluate}\left[ \text {Denominator}\left[ \text {FullSimplify}\left[ \text {Together}\left[ \frac{\partial R(\gamma ,\lambda ,\Delta )}{\partial \gamma }\right] \right] \right] \right] \)

In[4]: \(\text {nu}(\gamma \_ ,\lambda \_ ,\Delta \_ )\text {:=}\)

           \(\text {Evaluate}\left[ \text {Numerator}\left[ \text {FullSimplify}\left[ \text {Together}\left[ \frac{\partial R(\gamma ,\lambda ,\Delta )}{\partial \gamma }\right] \right] \right] \right] \)

In[5]: \(\text {Reduce}[\text {de}(\gamma ,\lambda ,\Delta )\ge 0\wedge \lambda>0\wedge \gamma>0\wedge \Delta >0,\{\gamma ,\lambda \}]\)

In[6]: \(\text {Reduce}[\text {nu}(\gamma ,\lambda ,\Delta )\ge 0\wedge \lambda>0\wedge \gamma>0\wedge \Delta >0,\{\gamma ,\lambda \}]\)

In[7]: \(\text {Reduce}[\text {nu}(\gamma ,1,3)\ge 0\wedge \gamma >0,\{\gamma \}]\)
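In addition to the symbolic computation above, a quick numerical check (ours) of the example \(\Delta ^2=9\), \(\lambda =1\) can be done in Python by evaluating \(\mathcal {R}_\lambda (\gamma )\) on a grid:

    import numpy as np
    from scipy.stats import norm

    def m(lam, g):
        # m(-lambda) for the Marchenko-Pastur law with aspect ratio g, as in In[1]
        return (np.sqrt((g + lam + 1) ** 2 - 4 * g) + g - lam - 1) / (2 * g * lam)

    def m_prime(lam, g, h=1e-6):
        # central finite difference approximation of m'(-lambda)
        return (m(lam - h, g) - m(lam + h, g)) / (2 * h)

    def R_balanced(gamma, lam, delta2):
        # R_lambda(gamma) for the balanced case gamma_0 = gamma_1 = 2 gamma
        return norm.cdf(-delta2 * m(lam, gamma)
                        / (2 * np.sqrt((delta2 + 4 * gamma) * m_prime(lam, gamma))))

    grid = np.linspace(0.05, 5.0, 200)
    vals = np.array([R_balanced(g, lam=1.0, delta2=9.0) for g in grid])
    print(np.all(np.diff(vals) > 0))   # expect True per the discussion above: no peaking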

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cheng, J., Chen, M., Liu, H. et al. High dimensional binary classification under label shift: phase transition and regularization. Sampl. Theory Signal Process. Data Anal. 21, 32 (2023). https://doi.org/10.1007/s43670-023-00071-9
