
Asymptotic comparison of semi-supervised and supervised linear discriminant functions for heteroscedastic normal populations


Abstract

It has been reported that using unlabeled data together with labeled data to construct a discriminant function works successfully in practice. However, theoretical studies have implied that unlabeled data can sometimes adversely affect the performance of discriminant functions. Therefore, it is important to know what situations call for the use of unlabeled data. In this paper, asymptotic relative efficiency is presented as the measure for comparing analyses with and without unlabeled data under the heteroscedastic normality assumption. The linear discriminant function maximizing the area under the receiver operating characteristic curve is considered. Asymptotic relative efficiency is evaluated to investigate when and how unlabeled data contribute to improving discriminant performance under several conditions. The results show that asymptotic relative efficiency depends mainly on the heteroscedasticity of the covariance matrices and the stochastic structure of observing the labels of the cases.


References

  • Airoldi J-P, Flury BD, Salvioni M (1995) Discrimination between two species of Microtus using both classified and unclassified observations. J Theor Biol 177:247–262
  • Anderson TW, Bahadur RR (1962) Classification into two multivariate normal distributions with different covariance matrices. Ann Math Stat 33:420–431
  • Boldea O, Magnus JR (2009) Maximum likelihood estimation of the multivariate normal mixture model. J Am Stat Assoc 104:1539–1549
  • Brefeld U, Scheffer T (2005) AUC maximizing support vector learning. In: Proceedings of the ICML workshop on ROC analysis in machine learning
  • Castelli V, Cover TM (1996) The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Trans Inform Theory 42:2102–2117
  • Chang YCI (2013) Maximizing an ROC-type measure via linear combination of markers when the gold reference is continuous. Stat Med 136:1893–1903
  • Chapelle O, Schölkopf B, Zien A (2006) Semi-supervised learning. MIT Press, Cambridge
  • Cozman FG, Cohen I (2002) Unlabeled data can degrade classification performance of generative classifiers. In: Proceedings of the Fifteenth International Florida Artificial Intelligence Research Society Conference, pp 327–331
  • Cozman FG, Cohen I, Cirelo MC (2003) Semi-supervised learning of mixture models. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), pp 99–106
  • Efron B (1975) The efficiency of logistic regression compared to normal discriminant analysis. J Am Stat Assoc 70:892–898
  • Eguchi S, Copas J (2002) A class of logistic-type discriminant functions. Biometrika 1:1–22
  • Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7:179–188
  • Fujisawa H (2006) Robust estimation in the normal mixture model. J Stat Plann Inference 136:3989–4011
  • Hanley JA, McNeil B (1982) The meaning and use of the area under the receiver operating characteristic (ROC) curve. Radiology 143:29–36
  • Hayashi K, Takai K (2015) Finite-sample analysis of impacts of unlabelled data and their labelling mechanisms in linear discriminant analysis. Commun Stat Simul Comput (in press). doi:10.1080/03610918.2014.957847
  • Johnson RA, Wichern DW (2007) Applied multivariate statistical analysis, 6th edn. Prentice Hall, Upper Saddle River
  • Kawakita M, Kanamori T (2013) Semi-supervised learning with density-ratio estimation. Mach Learn 91:189–209
  • Komori O (2011) A boosting method for maximization of the area under the ROC curve. Ann Inst Stat Math 63:961–979
  • Lehmann EL (1999) Elements of large sample theory. Springer, New York
  • Little RJA, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, New York
  • Ma S, Huang J (2005) Regularized ROC method for disease classification and biomarker selection with microarray data. Bioinformatics 73:821–826
  • Magnus JR, Neudecker H (1999) Matrix differential calculus with applications in statistics and econometrics. Wiley, New York
  • McLachlan GJ (2004) Discriminant analysis and statistical pattern recognition, 2nd edn. Wiley, New York
  • McLachlan GJ, Scot D (1995) Asymptotic relative efficiency of the linear discriminant function under partial nonrandom classification of the training data. J Stat Comput Simul 52:415–426
  • Oba S, Ishii S (2006) Semi-supervised discovery of differential genes. BMC Bioinform 7:1–13
  • O’Neill TJ (1978) Normal discrimination with unclassified observations. J Am Stat Assoc 73:821–826
  • Pepe MS, Thompson ML (2000) Combining diagnostic test results to increase accuracy. Biostatistics 1:123–140
  • Rosset S, Zhu J, Zou H, Hastie T (2005) A method for inferring label sampling mechanisms in semi-supervised learning. In: Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA
  • Sokolovska N, Cappé O, Yvon F (2008) The asymptotics of semi-supervised learning in discriminative probabilistic models. In: Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML), pp 984–991
  • Su JQ, Liu JS (1993) Linear combinations of multiple diagnostic markers. J Am Stat Assoc 88:1350–1355
  • Takai K, Hayashi K (2014) Effects of unlabeled data on classification error in normal discriminant analysis. J Stat Plann Inference 147:66–83
  • Takai K, Kano Y (2013) Asymptotic inference with incomplete data. Commun Stat Theory Methods 42:2474–2490
  • Zhu X, Ghahramani Z, Lafferty J (2003) Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML), pp 912–919
  • Zhu X, Goldberg AB (2009) Introduction to semi-supervised learning. Morgan & Claypool Publishers, San Rafael

Author information

Corresponding author

Correspondence to Kenichi Hayashi.

Appendices

Appendix A: Asymptotic covariance matrix of \(\hat{{\varvec{\vartheta }}}_\mathrm{C,Ind}\)

In the appendices, we give the details of the asymptotic covariance matrices of estimators of \({\varvec{\vartheta }}\) based on four score functions. First, we derive the asymptotic covariance matrix of the estimator of \({\varvec{\theta }}\) based on \(S_\mathrm{C,Ind}({\varvec{\theta }})\). This is obtained by calculating \(\mathrm{E}\left[ \frac{\partial ^2}{\partial {\varvec{\theta }}\partial {\varvec{\theta }}'}S_\mathrm{C,Ind}({\varvec{\theta }})\right] \) and taking its inverse. Denote this matrix by \(\bar{{\varvec{\Lambda }}}_\mathrm{C,Ind}\). Then, \({\varvec{\Lambda }}_\mathrm{C,Ind}\), appearing in the equation for ARE\(_\mathrm{Ind}\), is obtained by eliminating the first column and row of \(\bar{{\varvec{\Lambda }}}_\mathrm{C,Ind}\). The other asymptotic covariance matrices \({\varvec{\Lambda }}_\mathrm{P,Ind}\), \({\varvec{\Lambda }}_\mathrm{C,Dep}\) and \({\varvec{\Lambda }}_\mathrm{P,Dep}\) are obtained in the same manner.

Because the feature-independent labeling mechanism means that the labeling probability \(\mathrm{P}[R=1|{\varvec{x}};{\varvec{\phi }}]=\mathrm{P}[R=1;{\varvec{\phi }}]=\gamma \) is the same for all \({\varvec{x}}\), the expectation above reduces to the standard Fisher-information calculation, up to the factor \(\gamma \) (see, e.g., Magnus and Neudecker 1999). Therefore, the asymptotic variance has the analytic representation

$$\begin{aligned} \bar{{\varvec{\Lambda }}}_\mathrm{C,Ind} = \mathbf{L}_\mathrm{Ind}^{-1} , \end{aligned}$$

where \(\mathbf{L}_\mathrm{Ind}\) is the matrix defined as

$$\begin{aligned} \frac{1}{\gamma } \begin{pmatrix} (\pi _1\pi _0)^{-1} & \mathbf{0}^T & \mathbf{0}^T & \mathbf{0}^T & \mathbf{0}^T\\ \mathbf{0} & \pi _1{\varvec{\Sigma }}_1^{-1} & O & O & O\\ \mathbf{0} & O & \frac{\pi _1}{2}\mathbf{D}'({\varvec{\Sigma }}_1\otimes {\varvec{\Sigma }}_1)^{-1}\mathbf{D} & O & O\\ \mathbf{0} & O & O & \pi _0{\varvec{\Sigma }}_0^{-1} & O \\ \mathbf{0} & O & O & O & \frac{\pi _0}{2}\mathbf{D}'({\varvec{\Sigma }}_0\otimes {\varvec{\Sigma }}_0)^{-1}\mathbf{D} \end{pmatrix} . \end{aligned}$$

Clearly \(\bar{{\varvec{\Lambda }}}_\mathrm{C,Ind}\) is non-singular unless either \({\varvec{\Sigma }}_1\) or \({\varvec{\Sigma }}_0\) is singular.
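For illustration, the construction of \(\mathbf{L}_\mathrm{Ind}\) can be sketched in a few lines of NumPy. This sketch is not part of the paper: the function names and numerical values are illustrative, and \(\mathbf{D}\) is assumed to be the usual duplication matrix satisfying \(\mathrm{vec}(\mathbf{S})=\mathbf{D}\,\mathrm{vech}(\mathbf{S})\) for symmetric \(\mathbf{S}\). The last two lines carry out the step described above: inverting \(\mathbf{L}_\mathrm{Ind}\) and deleting the first row and column to obtain \({\varvec{\Lambda }}_\mathrm{C,Ind}\).

```python
import numpy as np
from scipy.linalg import block_diag

def duplication_matrix(d):
    """Duplication matrix D (d^2 x d(d+1)/2): vec(S) = D vech(S) for symmetric S."""
    D = np.zeros((d * d, d * (d + 1) // 2))
    col = 0
    for j in range(d):
        for i in range(j, d):
            D[j * d + i, col] = 1.0   # entry (i, j) in column-major vec()
            D[i * d + j, col] = 1.0   # symmetric counterpart (j, i)
            col += 1
    return D

def L_ind(pi1, gamma, Sigma1, Sigma0):
    """Block-diagonal information matrix L_Ind of Appendix A."""
    pi0 = 1.0 - pi1
    D = duplication_matrix(Sigma1.shape[0])
    S1inv, S0inv = np.linalg.inv(Sigma1), np.linalg.inv(Sigma0)
    # (Sigma x Sigma)^{-1} = Sigma^{-1} x Sigma^{-1}, so invert before the Kronecker product
    blocks = [np.array([[1.0 / (pi1 * pi0)]]),
              pi1 * S1inv,
              0.5 * pi1 * D.T @ np.kron(S1inv, S1inv) @ D,
              pi0 * S0inv,
              0.5 * pi0 * D.T @ np.kron(S0inv, S0inv) @ D]
    return block_diag(*blocks) / gamma

# illustrative values only
L = L_ind(pi1=0.4, gamma=0.3, Sigma1=np.eye(2), Sigma0=np.diag([2.0, 0.5]))
Lambda_bar_C_ind = np.linalg.inv(L)                                 # asymptotic covariance
Lambda_C_ind = np.delete(np.delete(Lambda_bar_C_ind, 0, 0), 0, 1)   # drop first row/column
```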

Appendix B: Asymptotic covariance matrix of \(\hat{{\varvec{\vartheta }}}_\mathrm{P,Ind}\)

We derive the asymptotic covariance matrix of the estimator \(\hat{{\varvec{\vartheta }}}_\mathrm{P,Ind}\) under the feature-independent labeling mechanism. As in Appendix A, the asymptotic covariance matrix, denoted by \(\bar{{\varvec{\Lambda }}}_\mathrm{P,Ind}\), is obtained by taking the expectation of the second derivative of the score function \(S_\mathrm{P,Ind}({\varvec{\theta }})\) and inverting it. Because the information corresponding to the first term of \(S_\mathrm{P,Ind}({\varvec{\theta }})\) has already been given by \(\mathbf{L}_\mathrm{Ind}\), it suffices to calculate the information contributed by the unlabeled data. That is,

$$\begin{aligned} \bar{{\varvec{\Lambda }}}_\mathrm{P,Ind} = (\mathbf{L}_\mathrm{Ind}+\mathbf{U}_\mathrm{Ind})^{-1} \end{aligned}$$

with \(\mathbf{U}_\mathrm{Ind} = (1-\gamma )\int _{\mathbf {R}^d} {{\varvec{a}}}({\varvec{x}}){{\varvec{a}}}({\varvec{x}})'f({\varvec{x}};{\varvec{\theta }})d{\varvec{x}}\), where

$$\begin{aligned}&{{\varvec{a}}}({\varvec{x}}) = \begin{pmatrix} a_{\bullet }({\varvec{x}}) \\ a_1({\varvec{x}}){\varvec{c}}_1({\varvec{x}}) \\ a_0({\varvec{x}}){\varvec{c}}_0({\varvec{x}}) \end{pmatrix},\\&a_\ell ({\varvec{x}})=\frac{\pi _\ell f_\ell ({\varvec{x}};{\varvec{\theta }})}{\pi _1f_1({\varvec{x}};{\varvec{\theta }})+\pi _0f_0({\varvec{x}};{\varvec{\theta }})},\ \ {\varvec{c}}_\ell ({\varvec{x}}) = \begin{pmatrix} {\varvec{d}}_\ell ({\varvec{x}}) \\ -\frac{1}{2}D'{\varvec{v}}_\ell ({\varvec{x}}) \end{pmatrix} ,\\&{\varvec{d}}_\ell ({\varvec{x}})={\varvec{\Sigma }}_{\ell }^{-1}({\varvec{x}}-{\varvec{\mu }}_\ell ),\ \ {\varvec{v}}_\ell ({\varvec{x}})=\mathrm{vec}({\varvec{\Sigma }}_{\ell }^{-1}-{\varvec{d}}_\ell ({\varvec{x}}){\varvec{d}}_\ell ({\varvec{x}})'),\ \ \ell =1,0\\&\mathrm{and}\ \ a_{\bullet }({\varvec{x}}) = \frac{a_1({\varvec{x}})}{\pi _1}-\frac{a_0({\varvec{x}})}{\pi _0}. \end{aligned}$$

The form of \(\mathbf{U}_\mathrm{Ind}\) is equivalent, up to the constant \((1-\gamma )\), to a result given in Boldea and Magnus (2009). Because the integral in \(\mathbf{U}_\mathrm{Ind}\) cannot be expressed in analytic form, Monte Carlo approximation is needed to compute it.
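One way to carry out this Monte Carlo approximation is sketched below, as an illustration rather than the author's implementation; the sampling scheme and all names are assumptions, and duplication_matrix is the helper from the sketch in Appendix A.

```python
import numpy as np
from scipy.stats import multivariate_normal

def score_pieces(x, mu, Sigma, D):
    """c_ell(x) = (d_ell(x); -1/2 D' v_ell(x)) for a single observation x."""
    Sinv = np.linalg.inv(Sigma)
    d_vec = Sinv @ (x - mu)                                    # d_ell(x)
    v_vec = (Sinv - np.outer(d_vec, d_vec)).ravel(order="F")   # vec(Sigma^{-1} - d d')
    return np.concatenate([d_vec, -0.5 * D.T @ v_vec])

def U_ind_mc(pi1, gamma, mu1, mu0, Sigma1, Sigma0, n_mc=50_000, seed=0):
    """Monte Carlo approximation of U_Ind = (1 - gamma) E_f[a(X) a(X)']."""
    rng = np.random.default_rng(seed)
    pi0 = 1.0 - pi1
    d = len(mu1)
    D = duplication_matrix(d)                      # helper from the Appendix A sketch
    # draw from the two-component normal mixture f(x; theta)
    comp1 = rng.random(n_mc) < pi1
    X = np.where(comp1[:, None],
                 rng.multivariate_normal(mu1, Sigma1, size=n_mc),
                 rng.multivariate_normal(mu0, Sigma0, size=n_mc))
    f1 = multivariate_normal(mu1, Sigma1).pdf(X)
    f0 = multivariate_normal(mu0, Sigma0).pdf(X)
    a1 = pi1 * f1 / (pi1 * f1 + pi0 * f0)          # posterior weights a_1(x)
    a0 = 1.0 - a1                                  # a_0(x)
    q = d + d * (d + 1) // 2
    U = np.zeros((1 + 2 * q, 1 + 2 * q))
    for i, x in enumerate(X):
        a = np.concatenate([[a1[i] / pi1 - a0[i] / pi0],            # a_bullet(x)
                            a1[i] * score_pieces(x, mu1, Sigma1, D),
                            a0[i] * score_pieces(x, mu0, Sigma0, D)])
        U += np.outer(a, a)
    return (1.0 - gamma) * U / n_mc
```

The asymptotic covariance \(\bar{{\varvec{\Lambda }}}_\mathrm{P,Ind}\) is then the inverse of L_ind(...) + U_ind_mc(...).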

Appendix C: Asymptotic covariance matrix of \(\hat{{\varvec{\vartheta }}}_\mathrm{C,Dep}\)

As with the feature-independent labeling mechanism, we derive the asymptotic covariance matrix of the estimator of \({\varvec{\theta }}\) based on \(S_\mathrm{C,Dep}({\varvec{\theta }})\), now under the feature-dependent labeling mechanism. This situation differs from the feature-independent case in that the term \(1/\mathrm{P}[R=1|{\varvec{x}};{\varvec{\phi }}]\) appears in the score functions. Because the labeling probability \(\mathrm{P}[R=1|{\varvec{x}};{\varvec{\phi }}]\) is not constant under a feature-dependent labeling mechanism, the information matrices have no explicit representation, unlike \(\mathbf{L}_\mathrm{Ind}\); a numerical method such as Monte Carlo integration is therefore needed to compute them.

The asymptotic variance of estimator \(\hat{{\varvec{\vartheta }}}_\mathrm{C,Dep}\) is obtained as follows:

$$\begin{aligned} \bar{{\varvec{\Lambda }}}_\mathrm{C,Dep} = \mathbf{L}_\mathrm{Dep}^{-1}, \end{aligned}$$

where \(\mathbf{L}_\mathrm{Dep}= \int _{\mathbf {R}^d}\left\{ \pi _1f_1({\varvec{x}};{\varvec{\theta }})\mathbf{B}_{1}({\varvec{x}})+\pi _0f_0({\varvec{x}};{\varvec{\theta }})\mathbf{B}_{0}({\varvec{x}}) \right\} d{\varvec{x}}\),

\(\mathbf{B}_{\ell }({\varvec{x}}) = \displaystyle \frac{{\varvec{b}}_\ell ({\varvec{x}}){\varvec{b}}_\ell ({\varvec{x}})'}{\mathrm{P}[R=1|{\varvec{x}};{\varvec{\phi }}]}\), \(\ell =0,1\), \({\varvec{b}}_{1}({\varvec{x}}) = \begin{pmatrix} \pi _1^{-1}\\ {\varvec{d}}_1({\varvec{x}}) \\ -\frac{1}{2}{\varvec{v}}_1({\varvec{x}})'\mathbf{D} \\ {\varvec{0}}_{d} \\ {\varvec{0}}_{d(d+1)/2} \end{pmatrix}\), \({\varvec{b}}_{0}({\varvec{x}}) = \begin{pmatrix} -\pi _0^{-1}\\ {\varvec{0}}_{d} \\ {\varvec{0}}_{d(d+1)/2} \\ {\varvec{d}}_0({\varvec{x}}) \\ -\frac{1}{2}{\varvec{v}}_0({\varvec{x}})'\mathbf{D} \end{pmatrix}\) and \({\varvec{0}}_d\) is the \(d\)-dimensional zero vector.
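Since \(\mathbf{L}_\mathrm{Dep}=\pi _1\mathrm{E}_{f_1}[\mathbf{B}_1(X)]+\pi _0\mathrm{E}_{f_0}[\mathbf{B}_0(X)]\), one way to approximate it is to sample each component separately, as in the sketch below. This is illustrative only: p_label stands in for a user-specified labeling probability \(\mathrm{P}[R=1|{\varvec{x}};{\varvec{\phi }}]\), and duplication_matrix and score_pieces are the helpers from the sketches in Appendices A and B.

```python
import numpy as np

def L_dep_mc(pi1, mu1, mu0, Sigma1, Sigma0, p_label, n_mc=50_000, seed=0):
    """Monte Carlo approximation of L_Dep = pi1 E_{f1}[B_1(X)] + pi0 E_{f0}[B_0(X)]."""
    rng = np.random.default_rng(seed)
    pi0 = 1.0 - pi1
    d = len(mu1)
    D = duplication_matrix(d)               # Appendix A sketch
    q = d + d * (d + 1) // 2
    L = np.zeros((1 + 2 * q, 1 + 2 * q))
    for x in rng.multivariate_normal(mu1, Sigma1, size=n_mc):
        b1 = np.concatenate([[1.0 / pi1],
                             score_pieces(x, mu1, Sigma1, D),   # Appendix B sketch
                             np.zeros(q)])
        L += pi1 * np.outer(b1, b1) / p_label(x)
    for x in rng.multivariate_normal(mu0, Sigma0, size=n_mc):
        b0 = np.concatenate([[-1.0 / pi0],
                             np.zeros(q),
                             score_pieces(x, mu0, Sigma0, D)])
        L += pi0 * np.outer(b0, b0) / p_label(x)
    return L / n_mc

# purely illustrative feature-dependent labeling probability P[R=1|x]
p_label = lambda x: 1.0 / (1.0 + np.exp(-(0.5 + x[0])))
```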

Appendix D: Asymptotic covariance matrix of \(\hat{{\varvec{\vartheta }}}_\mathrm{P,Dep}\)

Rather than calculating the asymptotic covariance matrix of the estimator \(\hat{{\varvec{\vartheta }}}_\mathrm{P,Dep}\) directly, it suffices to calculate the expectation of the second derivative of the second term on the right-hand side of \(S_\mathrm{P,Dep}({\varvec{\theta }})\). The asymptotic covariance matrix, denoted by \(\bar{{\varvec{\Lambda }}}_\mathrm{P,Dep}\), is then obtained from

$$\begin{aligned} \bar{{\varvec{\Lambda }}}_\mathrm{P,Dep} = (\mathbf{L}_\mathrm{Dep}+\mathbf{U}_\mathrm{Dep})^{-1}, \end{aligned}$$

where \(\mathbf{U}_\mathrm{Dep} = \int _{\mathbf {R}^d} \frac{{{\varvec{a}}}({\varvec{x}}){{\varvec{a}}}({\varvec{x}})'}{\mathrm{P}[R=0|{\varvec{x}};{\varvec{\phi }}]}f({\varvec{x}};{\varvec{\theta }})d{\varvec{x}}. \)
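\(\mathbf{U}_\mathrm{Dep}\) can be approximated in the same manner by drawing from the mixture \(f({\varvec{x}};{\varvec{\theta }})\) and weighting each term by \(1/\mathrm{P}[R=0|{\varvec{x}};{\varvec{\phi }}]\). The sketch below is again illustrative, reusing the helpers and the p_label placeholder from the previous sketches; \(\bar{{\varvec{\Lambda }}}_\mathrm{P,Dep}\) is then the inverse of L_dep_mc(...) + U_dep_mc(...).

```python
import numpy as np
from scipy.stats import multivariate_normal

def U_dep_mc(pi1, mu1, mu0, Sigma1, Sigma0, p_label, n_mc=50_000, seed=0):
    """Monte Carlo approximation of U_Dep = E_f[a(X) a(X)' / P[R=0|X]]."""
    rng = np.random.default_rng(seed)
    pi0 = 1.0 - pi1
    d = len(mu1)
    D = duplication_matrix(d)                       # Appendix A sketch
    comp1 = rng.random(n_mc) < pi1
    X = np.where(comp1[:, None],
                 rng.multivariate_normal(mu1, Sigma1, size=n_mc),
                 rng.multivariate_normal(mu0, Sigma0, size=n_mc))
    f1 = multivariate_normal(mu1, Sigma1).pdf(X)
    f0 = multivariate_normal(mu0, Sigma0).pdf(X)
    a1 = pi1 * f1 / (pi1 * f1 + pi0 * f0)
    a0 = 1.0 - a1
    q = d + d * (d + 1) // 2
    U = np.zeros((1 + 2 * q, 1 + 2 * q))
    for i, x in enumerate(X):
        a = np.concatenate([[a1[i] / pi1 - a0[i] / pi0],
                            a1[i] * score_pieces(x, mu1, Sigma1, D),   # Appendix B sketch
                            a0[i] * score_pieces(x, mu0, Sigma0, D)])
        U += np.outer(a, a) / (1.0 - p_label(x))    # divide by P[R=0|x]
    return U / n_mc
```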

About this article

Cite this article

Hayashi, K. Asymptotic comparison of semi-supervised and supervised linear discriminant functions for heteroscedastic normal populations. Adv Data Anal Classif 12, 315–339 (2018). https://doi.org/10.1007/s11634-016-0266-6
