Abstract
It has been reported that using unlabeled data together with labeled data to construct a discriminant function works successfully in practice. However, theoretical studies have implied that unlabeled data can sometimes adversely affect the performance of discriminant functions. Therefore, it is important to know what situations call for the use of unlabeled data. In this paper, asymptotic relative efficiency is presented as the measure for comparing analyses with and without unlabeled data under the heteroscedastic normality assumption. The linear discriminant function maximizing the area under the receiver operating characteristic curve is considered. Asymptotic relative efficiency is evaluated to investigate when and how unlabeled data contribute to improving discriminant performance under several conditions. The results show that asymptotic relative efficiency depends mainly on the heteroscedasticity of the covariance matrices and the stochastic structure of observing the labels of the cases.
References
Airoldi J-P, Flury BD, Salvioni M (1995) Discrimination between two species of Microtus using both classified and unclassified observations. J Theor Biol 177:247–262
Anderson TW, Bahadur RR (1962) Classification into two multivariate normal distributions with different covariance matrices. Ann Math Stat 33:420–431
Boldea O, Magnus JR (2009) Maximum likelihood estimation of the multivariate normal mixture model. J Am Stat Assoc 104:1539–1549
Brefeld U, Scheffer T (2005) AUC maximizing support vector learning. In: Proceedings of ICML workshop on ROC Analysis in Machine Learning
Castelli V, Cover TM (1996) The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Trans Inform Theory 42:2102–2117
Chang YCI (2013) Maximizing an ROC-type measure via linear combination of markers when the gold reference is continuous. Stat Med 32:1893–1903
Chapelle O, Schölkopf B, Zien A (2006) Semi-supervised learning. MIT Press, Cambridge
Cozman FG, Cohen I (2002) Unlabeled data can degrade classification performance of generative classifiers. In: Fifteenth International Florida Artificial Intelligence Society Conference, pp 327–331
Cozman FG, Cohen I, Cirelo MC (2003) Semi-supervised learning of mixture models. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), pp 99–106
Efron B (1975) The efficiency of logistic regression compared to normal discriminant analysis. J Am Stat Assoc 70:892–898
Eguchi S, Copas J (2002) A class of logistic-type discriminant functions. Biometrika 89:1–22
Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7:179–188
Fujisawa H (2006) Robust estimation in the normal mixture model. J Stat Plann Inference 136:3989–4011
Hanley JA, McNeil B (1982) The meaning and use of the area under the receiver operating characteristic (ROC) curve. Radiology 143:29–36
Hayashi K, Takai K (2015) Finite-sample analysis of impacts of unlabelled data and their labelling mechanisms in linear discriminant analysis. Commun Stat Simul Comput (in press). doi:10.1080/03610918.2014.957847
Johnson RA, Wichern DW (2007) Applied multivariate statistical analysis, 6th edn. Prentice Hall, Upper Saddle River
Kawakita M, Kanamori T (2013) Semi-supervised learning with density-ratio estimation. Mach Learn 91:189–209
Komori O (2011) A boosting method for maximization of the area under the ROC curve. Ann Inst Stat Math 63:961–979
Lehmann EL (1999) Elements of large sample theory. Springer, New York
Little RJA, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, New York
Ma S, Huang J (2005) Regularized ROC method for disease classification and biomarker selection with microarray data. Bioinformatics 21:4356–4362
Magnus JR, Neudecker H (1999) Matrix differential calculus with applications in statistics and econometrics. Wiley, New York
McLachlan GJ (2004) Discriminant analysis and statistical pattern recognition, 2nd edn. Wiley, New York
McLachlan GJ, Scot D (1995) Asymptotic relative efficiency of the linear discriminant function under partial nonrandom classification of the training data. J Stat Comput Simul 52:415–426
Oba S, Ishii S (2006) Semi-supervised discovery of differential genes. BMC Bioinform 7:1–13
O’Neill TJ (1978) Normal discrimination with unclassified observations. J Am Stat Assoc 73:821–826
Pepe MS, Thompson ML (2000) Combining diagnostic test results to increase accuracy. Biostatistics 1:123–140
Rosset S, Zhu J, Zou H, Hastie T (2005) A method for inferring label sampling mechanisms in semi-supervised learning. In: Advances in Neural Information Processing Systems 17. MIT Press, Cambridge
Sokolovska N, Cappé O, Yvon F (2008) The asymptotics of semi-supervised learning in discriminative probabilistic models. In: Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML), pp 984–991
Su JQ, Liu JS (1993) Linear combinations of multiple diagnostic markers. J Am Stat Assoc 88:1350–1355
Takai K, Hayashi K (2014) Effects of unlabeled data on classification error in normal discriminant analysis. J Stat Plann Inference 147:66–83
Takai K, Kano Y (2013) Asymptotic inference with incomplete data. Commun Stat Theor Methods 42:2474–2490
Zhu X, Ghahramani Z, Lafferty J (2003) Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML), pp 912–919
Zhu X, Goldberg AB (2009) Introduction to semi-supervised learning. Morgan & Claypool Press, San Rafael
Appendices
Appendix A: Asymptotic covariance matrix of \(\hat{{\varvec{\vartheta }}}_\mathrm{C,Ind}\)
In the appendices, we give the details of the asymptotic covariance matrices of estimators of \({\varvec{\vartheta }}\) based on four score functions. First, we derive the asymptotic covariance matrix of the estimator of \({\varvec{\theta }}\) based on \(S_\mathrm{C,Ind}({\varvec{\theta }})\). This is obtained by calculating \(\mathrm{E}\left[ \frac{\partial ^2}{\partial {\varvec{\theta }}\partial {\varvec{\theta }}'}S_\mathrm{C,Ind}({\varvec{\theta }})\right] \) and taking its inverse. Denote this matrix by \(\bar{{\varvec{\Lambda }}}_\mathrm{C,Ind}\). Then, \({\varvec{\Lambda }}_\mathrm{C,Ind}\), appearing in the equation for ARE\(_\mathrm{Ind}\), is obtained by eliminating the first column and row of \(\bar{{\varvec{\Lambda }}}_\mathrm{C,Ind}\). The other asymptotic covariance matrices \({\varvec{\Lambda }}_\mathrm{P,Ind}\), \({\varvec{\Lambda }}_\mathrm{C,Dep}\) and \({\varvec{\Lambda }}_\mathrm{P,Dep}\) are obtained in the same manner.
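In symbols, the relation just described (restating the text above, with the sign convention of the source) is

```latex
\bar{{\varvec{\Lambda}}}_\mathrm{C,Ind}
  = \left\{ \mathrm{E}\!\left[
      \frac{\partial^2}{\partial{\varvec{\theta}}\,\partial{\varvec{\theta}}'}
      S_\mathrm{C,Ind}({\varvec{\theta}})
    \right] \right\}^{-1},
```

and \({\varvec{\Lambda }}_\mathrm{C,Ind}\) is the submatrix of \(\bar{{\varvec{\Lambda }}}_\mathrm{C,Ind}\) that remains after deleting its first row and column.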
Because the feature-independent labeling mechanism implies that the labeling probability \(\mathrm{P}[R=1|{\varvec{x}};{\varvec{\phi }}]=\mathrm{P}[R=1;{\varvec{\phi }}]=\gamma \) is the same for all \({\varvec{x}}\), the expectation above reduces to the standard Fisher information calculation up to the factor \(\gamma \) (see, e.g., Magnus and Neudecker 1999). Therefore, the asymptotic variance can be represented analytically as
where \(\mathbf{L}_\mathrm{Ind}\) is the matrix defined as
Clearly \(\bar{{\varvec{\Lambda }}}_\mathrm{C,Ind}\) is non-singular unless either \({\varvec{\Sigma }}_1\) or \({\varvec{\Sigma }}_0\) is singular.
Appendix B: Asymptotic covariance matrix of \(\hat{{\varvec{\vartheta }}}_\mathrm{P,Ind}\)
We derive the asymptotic covariance matrix of the estimator \(\hat{{\varvec{\vartheta }}}_\mathrm{P,Ind}\) under the feature-independent labeling mechanism. Using the result in Appendix A, the asymptotic covariance matrix, denoted by \(\bar{{\varvec{\Lambda }}}_\mathrm{P,Ind}\), is obtained by taking the expectation of the second derivative of the score function \(S_\mathrm{P,Ind}({\varvec{\theta }})\) and inverting it. Because the information corresponding to the first term of \(S_\mathrm{P,Ind}({\varvec{\theta }})\) has already been given by \(\mathbf{L}_\mathrm{Ind}\), it suffices to calculate the information of the unlabeled data. That is,
with \(\mathbf{U}_\mathrm{Ind} = (1-\gamma )\int _{\mathbf {R}^d} {{\varvec{a}}}({\varvec{x}}){{\varvec{a}}}({\varvec{x}})'f({\varvec{x}};{\varvec{\theta }})d{\varvec{x}}\), where
The form of \(\mathbf{U}_\mathrm{Ind}\) is equivalent to a result given in Boldea and Magnus (2009) up to the constant \((1-\gamma )\). Because the integral in \(\mathbf{U}_\mathrm{Ind}\) cannot be expressed in analytic form, a Monte Carlo approximation is needed for its calculation.
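The Monte Carlo approximation of \(\mathbf{U}_\mathrm{Ind}\) can be sketched as follows. This is an illustrative sketch, not the paper's code: draws from the mixture \(f({\varvec{x}};{\varvec{\theta }})\) replace the integral by a sample average of outer products. The vector `a(x)` below is a hypothetical placeholder for the true gradient of \(\log f({\varvec{x}};{\varvec{\theta }})\).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, pi1, mu1, mu0, sigma1, sigma0):
    # Draw n points from the two-component normal mixture f(x; theta)
    z = rng.random(n) < pi1
    x1 = rng.multivariate_normal(mu1, sigma1, n)
    x0 = rng.multivariate_normal(mu0, sigma0, n)
    return np.where(z[:, None], x1, x0)

def monte_carlo_information(a, x):
    # Approximate the integral of a(x) a(x)' f(x) dx by the sample mean
    # of outer products a(x_i) a(x_i)' over draws x_i ~ f
    A = np.apply_along_axis(a, 1, x)   # (n, p) matrix of score vectors
    return A.T @ A / x.shape[0]

# Toy 2-d heteroscedastic mixture; a(x) is a placeholder score vector
mu1, mu0 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
S1, S0 = np.eye(2), 2.0 * np.eye(2)
x = sample_mixture(5000, 0.5, mu1, mu0, S1, S0)
a = lambda v: np.concatenate(([1.0], v))   # hypothetical a(x)
gamma = 0.7                                # labeling probability
U_ind = (1.0 - gamma) * monte_carlo_information(a, x)
print(U_ind.shape)  # (3, 3)
```

The resulting matrix is symmetric positive semidefinite by construction, as an information matrix should be.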
Appendix C: Asymptotic covariance matrix of \(\hat{{\varvec{\vartheta }}}_\mathrm{C,Dep}\)
As with the feature-independent labeling mechanism, we derive the asymptotic covariance matrix of the estimator of \({\varvec{\theta }}\) based on \(S_\mathrm{C,Dep}({\varvec{\theta }})\), now under the feature-dependent labeling mechanism. This situation differs from the feature-independent case in that the term \(1/\mathrm{P}[R=1|{\varvec{x}};{\varvec{\phi }}]\) appears in the score functions. Because the labeling probability \(\mathrm{P}[R=1|{\varvec{x}};{\varvec{\phi }}]\) is not constant under feature-dependent labeling mechanisms, the information matrices have no explicit representation, unlike \(\mathbf{L}_\mathrm{Ind}\). Therefore, a computational method such as Monte Carlo integration is needed to calculate them.
The asymptotic variance of estimator \(\hat{{\varvec{\vartheta }}}_\mathrm{C,Dep}\) is obtained as follows:
where \(\mathbf{L}_\mathrm{Dep}= \int _{\mathbf {R}^d}\left\{ \pi _1f_1({\varvec{x}};{\varvec{\theta }})\mathbf{B}_{1}({\varvec{x}})+\pi _0f_0({\varvec{x}};{\varvec{\theta }})\mathbf{B}_{0}({\varvec{x}}) \right\} d{\varvec{x}}\),
\(\mathbf{B}_{\ell }({\varvec{x}}) = \displaystyle \frac{{{\varvec{b}}_\ell ({\varvec{x}}){\varvec{b}}_\ell ({\varvec{x}})'}}{\mathrm{P}[R=1|{\varvec{x}};{\varvec{\phi }}]}\), \(\ell =0,1\), \({\varvec{b}}_{1}({\varvec{x}}) = \begin{pmatrix} \pi _1^{-1}\\ {\varvec{d}}_1({\varvec{x}}) \\ -\frac{1}{2}{\varvec{v}}_1({\varvec{x}})'{} \mathbf{D} \\ {\varvec{0}}_{d} \\ {\varvec{0}}_{d(d+1)/2} \end{pmatrix}\), \({\varvec{b}}_{0}({\varvec{x}}) = \begin{pmatrix} -\pi _0^{-1}\\ {\varvec{0}}_{d} \\ {\varvec{0}}_{d(d+1)/2} \\ {\varvec{d}}_0({\varvec{x}}) \\ -\frac{1}{2}{\varvec{v}}_0({\varvec{x}})'{} \mathbf{D} \end{pmatrix}\) and \({\varvec{0}}_d\) is the d-dimensional zero vector.
Appendix D: Asymptotic covariance matrix of \(\hat{{\varvec{\vartheta }}}_\mathrm{P,Dep}\)
Rather than calculating the asymptotic covariance matrix of the estimator \(\hat{{\varvec{\vartheta }}}_\mathrm{P,Dep}\) directly, it suffices to calculate the expectation of the second derivative of the second term on the right-hand side of \(S_\mathrm{P,Dep}({\varvec{\theta }})\). Then, the asymptotic covariance matrix, denoted by \(\bar{{\varvec{\Lambda }}}_\mathrm{P,Dep}\), is obtained from
where \(\mathbf{U}_\mathrm{Dep} = \int _{\mathbf {R}^d} \frac{{{\varvec{a}}}({\varvec{x}}){{\varvec{a}}}({\varvec{x}})'}{\mathrm{P}[R=0|{\varvec{x}};{\varvec{\phi }}]}f({\varvec{x}};{\varvec{\theta }})d{\varvec{x}}. \)
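A Monte Carlo evaluation of \(\mathbf{U}_\mathrm{Dep}\) weights each sampled outer product by \(1/\mathrm{P}[R=0|{\varvec{x}};{\varvec{\phi }}]\). The sketch below is illustrative only: the logistic labeling probability and the score vector `A` are hypothetical stand-ins for \(\mathrm{P}[R=1|{\varvec{x}};{\varvec{\phi }}]\) and the true \({\varvec{a}}({\varvec{x}})\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-d mixture f(x; theta): equal-weight N(1, 1) and N(-1, 2)
n = 10_000
z = rng.random(n) < 0.5
x = np.where(z, rng.normal(1.0, 1.0, n), rng.normal(-1.0, np.sqrt(2.0), n))

# Hypothetical feature-dependent labeling probability P[R=1|x; phi]
p_label = 1.0 / (1.0 + np.exp(-x))   # logistic in x, for illustration

# Placeholder score vectors a(x_i); the true a(x) is the gradient of
# log f(x; theta) with respect to theta
A = np.column_stack([np.ones(n), x, x**2])

# U_Dep ~= (1/n) sum_i a(x_i) a(x_i)' / P[R=0 | x_i],  x_i ~ f
w = 1.0 / (1.0 - p_label)
U_dep = (A * w[:, None]).T @ A / n
print(U_dep.shape)  # (3, 3)
```

Because the weight \(1/\mathrm{P}[R=0|{\varvec{x}};{\varvec{\phi }}]\) can become large where labeling is nearly certain, a large sample size helps stabilize the approximation.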
Cite this article
Hayashi, K. Asymptotic comparison of semi-supervised and supervised linear discriminant functions for heteroscedastic normal populations. Adv Data Anal Classif 12, 315–339 (2018). https://doi.org/10.1007/s11634-016-0266-6
Keywords
- Area under the ROC curve
- Labeling mechanism
- Linear discriminant function
- Missing data
- Receiver operating characteristic curve
- Semi-supervised learning