Variational discriminant analysis with variable selection

Abstract

A fast Bayesian method that seamlessly fuses classification and hypothesis testing via discriminant analysis is developed. Building upon the original discriminant analysis classifier, we add modelling components that identify discriminative variables. A combination of cake priors and a novel form of variational Bayes, which we call reverse collapsed variational Bayes, gives rise to variable selection that can be posed directly as a multiple hypothesis testing approach using likelihood ratio statistics. Theoretical arguments are presented showing that Chernoff consistency (asymptotically zero type I and type II errors) is maintained across all hypotheses. We apply our method to several publicly available genomics datasets and show that it performs well in practice for its computational cost. An R package, VaDA, is available on GitHub.



Acknowledgements

The authors would like to thank Rachel Wang (University of Sydney), the associate editor, and the anonymous reviewers for their valuable feedback, which greatly improved this manuscript.

Author information


Corresponding author

Correspondence to Weichang Yu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below are the links to the electronic supplementary material.

Supplementary material 1 (pdf 287 KB)

Supplementary material 2 (pdf 178 KB)

Appendix


VQDA derivations

In the VQDA setting (\(\sigma _{j 1}^2 \ne \sigma _{j 0}^2\)) the posterior distribution of \(\gamma _j\) given \({\mathcal {D}}\) and \(\mathbf{x}_{n+1}\) may be expressed as

$$\begin{aligned} p(\gamma _j \; | \; {\mathcal {D}}, \mathbf{x}_{n+1}) = \frac{ p(\gamma _j, {\mathcal {D}}, \mathbf{x}_{n+1})}{ p({\mathcal {D}}, \mathbf{x}_{n+1}) }. \end{aligned}$$

By letting \(h \rightarrow \infty \), the marginal likelihood of the data in the denominator is of the same form as Eq. (9) with the exception that

$$\begin{aligned} {\varvec{\theta }}_1 = ({\varvec{\mu }}_1, {\varvec{\mu }}_0, {\varvec{\mu }}, {\varvec{\sigma }}_1^2, {\varvec{\sigma }}_0^2, {\varvec{\sigma }}^2, \rho _y, \rho _\gamma ), \end{aligned}$$

and

$$\begin{aligned}&\lambda _{\text {LRT}} (\widetilde{\mathbf{x}}_j, x_{n+1,j}, \mathbf{y}, y_{n+1}) \\&\quad = (n+1) \log (\widehat{\sigma }_j^2) - (n_1 + y_{n+1}) \log (\widehat{\sigma }_{j1}^2) \\&\qquad - (n_0 + 1 - y_{n+1}) \log (\widehat{\sigma }_{j0}^2), \end{aligned}$$

where

$$\begin{aligned} \widehat{\sigma }_{j 1}^2 =&\tfrac{1}{n_1+y_{n+1}} \bigg [ ||\mathbf{y}^\mathrm{T}\{\widetilde{\mathbf{x}}_{j} - \widehat{\mu }_{j1}\mathbf{1}\}||^2 \\&\quad + y_{n+1} (x_{n+1,j} - \widehat{\mu }_{j1})^2 \bigg ],\\ \widehat{\sigma }_{j 0}^2 =&\tfrac{1}{n_0 +1 - y_{n+1}} \bigg [ ||(\mathbf{1}- \mathbf{y})^\mathrm{T}\{\widetilde{\mathbf{x}}_{j} - \widehat{\mu }_{j0}\mathbf{1}\}||^2 \\&\quad + (1 - y_{n+1}) (x_{n+1,j} - \widehat{\mu }_{j0})^2 \bigg ], \end{aligned}$$

and the \(j{\text {th}}\) entry of \({\varvec{\lambda }}_{\text {Bayes}}\) is (as \(h \rightarrow \infty \))

$$\begin{aligned}&\lambda _{\text {Bayes}} (\widetilde{\mathbf{x}}_j, x_{n+1,j}, \mathbf{y}, y_{n+1}) \\&\quad \rightarrow \lambda _{\text {LRT}}(\widetilde{\mathbf{x}}_j, x_{n+1,j}, \mathbf{y}, y_{n+1}) + \log (n_1 + y_{n+1} ) \\&\qquad + \log (n_0 + 1 - y_{n+1} ) - \log (2) - 3\log (n+1) \\&\qquad -\, 2\xi \{(n+1)/2\} + 2\xi \{(n_1 + y_{n+1})/2\} \\&\qquad +\, 2\xi \{ (n_0 + 1 - y_{n+1})/2 \}, \\&= \lambda _{\text {LRT}}(\widetilde{\mathbf{x}}_j, x_{n+1,j}, \mathbf{y}, y_{n+1}) - 2 \log (n+1) \\&\qquad +\,O(n_0^{-1} + n_1^{-1}), \end{aligned}$$

where \(\xi (x) = \log \varGamma (x) + x - x \log (x) - \tfrac{1}{2} \log (2 \pi )\). Since the calculation of the marginal likelihood involves a combinatorial sum over \(2^{p+1}\) binary combinations, exact Bayesian inference is also computationally impractical in the VQDA setting.
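As a concrete illustration, \(\lambda _{\text {LRT}}\) can be computed directly from class-wise sample variances. The following Python sketch is a hypothetical helper, not the VaDA implementation; it uses the large-\(n\) plug-in estimates of approximation (16), i.e. it ignores the contribution of the single new observation. Here `x_j` holds the \(n\) training values of variable \(j\) and `y` the binary class labels:

```python
import numpy as np

def lrt_statistic(x_j, y):
    """Likelihood-ratio statistic for unequal class variances (VQDA setting).

    Uses the plug-in variance estimates from the large-n approximation (16),
    so the single new observation's contribution is ignored.
    """
    x_j = np.asarray(x_j, dtype=float)
    y = np.asarray(y)
    n = x_j.size
    n1 = int(y.sum())
    n0 = n - n1
    s2 = np.mean((x_j - x_j.mean()) ** 2)                    # pooled variance
    s2_1 = np.mean((x_j[y == 1] - x_j[y == 1].mean()) ** 2)  # class-1 variance
    s2_0 = np.mean((x_j[y == 0] - x_j[y == 0].mean()) ** 2)  # class-0 variance
    return n * np.log(s2) - n1 * np.log(s2_1) - n0 * np.log(s2_0)
```

By concavity of the logarithm the statistic is non-negative, and it grows with the sample size whenever the two classes differ in mean or variance for variable \(j\).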

Table 2 Iterative scheme for obtaining the parameters in the optimal densities \(q(\varvec{\gamma }, y_{n+1}, \ldots , y_{n+m})\) in VQDA

Similar to VLDA, we will use RCVB to approximate the posterior \(p({\varvec{\gamma }}, y_{n+1} | \mathbf{x}, \mathbf{x}_{n+1}, \mathbf{y})\) by

$$\begin{aligned} q(y_{n+1}, {\varvec{\gamma }}) = q (y_{n+1}) \prod _{j=1}^{p} q_j (\gamma _j). \end{aligned}$$

This yields the approximate posterior for \(\gamma _j\) as

$$\begin{aligned}&q_j(\gamma _j) \propto \int \exp \big [ {\mathbb {E}}_{-q_j} \{ \log p({\mathcal {D}}, \mathbf{x}_{n+1}, y_{n+1},{\varvec{\gamma }}, {\varvec{\theta }}_1) \} \big ] \text {d} {\varvec{\theta }}_1, \\&\quad \propto \exp \bigg [ {\mathbb {E}}_{-q_j} \Big \{ \log {\mathcal {B}}(a_\gamma + \mathbf{1}^\mathrm{T}{\varvec{\gamma }}, b_\gamma + p - \mathbf{1}^\mathrm{T}{\varvec{\gamma }}) \Big \} \\&\qquad + \tfrac{\gamma _j}{2} {\mathbb {E}}_{-q_j} \Big \{ \lambda _{\text {Bayes}} (\widetilde{\mathbf{x}}_j, x_{n+1,j}, \mathbf{y}, y_{n+1}) \Big \} \bigg ]. \end{aligned}$$

For a sufficiently large \(n\), we can avoid the need to evaluate the expectation \( {\mathbb {E}}_{-q_j} \{ \lambda _{\text {Bayes}} (\widetilde{\mathbf{x}}_j, x_{n+1,j}, \mathbf{y}, y_{n+1}) \}\) by applying a Taylor expansion, which yields the approximations

$$\begin{aligned}&{\mathbb {E}}_{-q_j} \log (a_\gamma + \mathbf{1}^\mathrm{T} {\varvec{\gamma }}_{-j}) \approx \log (a_\gamma + \mathbf{1}^\mathrm{T} \mathbf{w}_{-j}), \nonumber \\&{\mathbb {E}}_{-q_j} \log (b_\gamma + p - \mathbf{1}^\mathrm{T} {\varvec{\gamma }}_{-j} - 1) \nonumber \\&\quad \approx \log (b_\gamma + p - \mathbf{1}^\mathrm{T} \mathbf{w}_{-j} - 1), \nonumber \\&\widehat{\sigma }_{j 1}^2 \approx \tfrac{1}{n_1} ||\mathbf{y}^\mathrm{T}\{\widetilde{\mathbf{x}}_{j} - \widehat{\mu }_{j1}\mathbf{1}\}||^2, \nonumber \\&\widehat{\sigma }_{j 0}^2 \approx \tfrac{1}{n_0} ||(\mathbf{1}- \mathbf{y})^\mathrm{T}\{\widetilde{\mathbf{x}}_{j} - \widehat{\mu }_{j0}\mathbf{1}\}||^2, \nonumber \\&\widehat{\sigma }_{j}^2 \approx \tfrac{1}{n} ||\widetilde{\mathbf{x}}_{j} - \widehat{\mu }_{j}\mathbf{1}||^2, \nonumber \\&\widehat{\mu }_{j 1} \approx \tfrac{1}{n_1} \mathbf{y}^\mathrm{T} \widetilde{\mathbf{x}}_j , \;\; \widehat{\mu }_{j 0} \approx \tfrac{1}{n_0} (\mathbf{1}- \mathbf{y})^\mathrm{T} \widetilde{\mathbf{x}}_j, \;\; \widehat{\mu }_{j} \approx \tfrac{1}{n} \mathbf{1}^\mathrm{T} \widetilde{\mathbf{x}}_j, \end{aligned}$$
(16)

under which, as in VLDA, \(\lambda _{\text {Bayes}}\) no longer depends on the new observation \((\mathbf{x}_{n+1}, y_{n+1})\). Using the approximations in (16), we have

$$\begin{aligned} w_j&= \frac{q_j(\gamma _j = 1)}{q_j(\gamma _j = 1) + q_j(\gamma _j = 0)}, \\&\approx \text {expit}\bigg [ \log (a_\gamma {+} \mathbf{1}^\mathrm{T} \mathbf{w}_{-j}) {-} \log (b_\gamma + p - \mathbf{1}^\mathrm{T} \mathbf{w}_{-j} - 1) \\&\quad + \tfrac{1}{2} \log (\tfrac{n_1 n_0}{2}) + \xi (\tfrac{n_1}{2}) + \xi (\tfrac{n_0}{2}) - \xi (\tfrac{n}{2}) \\&\quad - \tfrac{3}{2} \log (n+1) + \tfrac{1}{2} \lambda _{\text {LRT}} (\widetilde{\mathbf{x}}_j, \mathbf{y}) \bigg ], \\&= \text {expit}\bigg [ \text {penalty}_{QDA,j} + \tfrac{1}{2} \lambda _{\text {LRT}} (\widetilde{\mathbf{x}}_j, \mathbf{y}) \bigg ]. \end{aligned}$$
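In code, one sweep of this coordinate update might look as follows. This is a hedged sketch (the function and argument names are ours, not the VaDA package's): `w` is the current vector of inclusion probabilities, `lam_lrt` a vector of precomputed likelihood-ratio statistics, and `xi` the function \(\xi \) defined earlier:

```python
import numpy as np
from math import lgamma, log, pi

def xi(x):
    """xi(x) = log Gamma(x) + x - x log(x) - 0.5 log(2 pi)."""
    return lgamma(x) + x - x * log(x) - 0.5 * log(2.0 * pi)

def rcvb_sweep(w, lam_lrt, n1, n0, a_gamma=1.0, b_gamma=1.0):
    """One in-place coordinate sweep over the inclusion probabilities w_j.

    Each w_j is set to expit of the QDA penalty plus half the
    likelihood-ratio statistic for variable j.
    """
    n = n1 + n0
    p = w.size
    penalty = (0.5 * log(n1 * n0 / 2.0) + xi(n1 / 2.0) + xi(n0 / 2.0)
               - xi(n / 2.0) - 1.5 * log(n + 1.0))
    for j in range(p):
        s_minus_j = w.sum() - w[j]          # 1^T w_{-j}
        eta = (log(a_gamma + s_minus_j)
               - log(b_gamma + p - s_minus_j - 1.0)
               + penalty + 0.5 * lam_lrt[j])
        w[j] = 1.0 / (1.0 + np.exp(-eta))   # expit
    return w
```

Note that the penalty term is shared across variables, so it is computed once per sweep; only the \(\mathbf{w}_{-j}\) terms change from coordinate to coordinate.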

To obtain the approximate density for \(y_{n+1}\), we integrate analytically over \({\varvec{\theta }}_1\), giving

$$\begin{aligned} q(y_{n+1})&\propto \int \exp \big [ {\mathbb {E}}_{-y} \{ \log p({\mathcal {D}}, \mathbf{x}_{n+1}, y_{n+1},{\varvec{\gamma }}, {\varvec{\theta }}_1) \} \big ] \text {d} {\varvec{\theta }}_1, \\&\propto \exp \bigg [ \log {\mathcal {B}}(a_y + n_1 + y_{n+1}, b_y + n_0 + 1 - y_{n+1}) \\&\quad + \mathbf{1}^\mathrm{T} \mathbf{w}\Big \{ \log \varGamma (\tfrac{n_1 + y_{n+1}}{2}) + \log \varGamma (\tfrac{n_0 + 1 - y_{n+1}}{2}) \Big \} \\&\quad + \tfrac{1}{2} \mathbf{w}^\mathrm{T} \Big \{ \log {\varvec{\phi }}(\mathbf{x}_{n+1}; \widehat{{\varvec{\mu }}}_1, \widehat{{\varvec{\sigma }}}_{1}^2) - \log {\varvec{\phi }}(\mathbf{x}_{n+1}; \widehat{{\varvec{\mu }}}_0, \widehat{{\varvec{\sigma }}}_{0}^2) \Big \} \bigg ], \end{aligned}$$

where the \(j{\text {th}}\) element of the \(p \times 1\) vector \({\varvec{\phi }}(\mathbf{x}_{n+1}; \widehat{{\varvec{\mu }}}_k, \widehat{{\varvec{\sigma }}}_{k}^2)\) is the Gaussian density

$$\begin{aligned} \phi (x_{n+1,j}; \widehat{\mu }_{j k}, \widehat{\sigma }_{j k}^2), \end{aligned}$$

and the \(\log \) prefix denotes an element-wise \(\log \) of a vector.

In the general case with \(m\) new observations, we may apply the Taylor expansion results from (16) to compute the approximate classification probability for \(y_{n+i}\) as

$$\begin{aligned} \widetilde{y}_i&= \frac{q(y_{n+i} = 1)}{q(y_{n+i} = 1) + q(y_{n+i} = 0)}, \\&\approx \text {expit}\bigg [ \log \big ( \tfrac{n_1}{n_0} \big ) + \mathbf{1}^\mathrm{T} \mathbf{w}\big \{ \log \varGamma (\tfrac{n_1 + 1}{2}) - \log \varGamma (\tfrac{n_1}{2}) \\&\quad - \log \varGamma (\tfrac{n_0 + 1}{2}) + \log \varGamma (\tfrac{n_0}{2}) \big \} \\&\quad + \tfrac{1}{2} \mathbf{w}^\mathrm{T} \Big \{ \log {\varvec{\phi }}(\mathbf{x}_{n+i}; \widehat{{\varvec{\mu }}}_1, \widehat{{\varvec{\sigma }}}_{1}^2) - \log {\varvec{\phi }}(\mathbf{x}_{n+i}; \widehat{{\varvec{\mu }}}_0, \widehat{{\varvec{\sigma }}}_{0}^2) \Big \} \bigg ]. \end{aligned}$$

The RCVB algorithm for VQDA may be found in Table 2.
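Computationally, this classification probability is just an expit of a weighted log-density contrast. The sketch below uses hypothetical helper names; the \(\varGamma \)-function correction follows the log-odds of the two values of \(q(y_{n+1})\) given above. Inputs `mu1`, `mu0`, `s2_1`, `s2_0` are the per-variable class-wise sample means and variances, and `w` the inclusion probabilities:

```python
import numpy as np
from math import lgamma, log

def classify_prob(x_new, w, mu1, mu0, s2_1, s2_0, n1, n0):
    """Approximate probability q(y = 1) for a new observation x_new."""
    def log_phi(x, mu, s2):
        # element-wise log of the Gaussian density phi(x; mu, s2)
        return -0.5 * np.log(2.0 * np.pi * s2) - (x - mu) ** 2 / (2.0 * s2)

    eta = log(n1 / n0)
    eta += w.sum() * (lgamma((n1 + 1) / 2.0) - lgamma(n1 / 2.0)
                      - lgamma((n0 + 1) / 2.0) + lgamma(n0 / 2.0))
    eta += 0.5 * w @ (log_phi(x_new, mu1, s2_1) - log_phi(x_new, mu0, s2_0))
    return 1.0 / (1.0 + np.exp(-eta))
```

For balanced classes (\(n_1 = n_0\)) the \(\varGamma \) terms cancel and the rule reduces to an expit of half the \(\mathbf{w}\)-weighted log-density contrast, so variables with small inclusion probabilities contribute little to the classification.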


About this article


Cite this article

Yu, W., Ormerod, J.T. & Stewart, M. Variational discriminant analysis with variable selection. Stat Comput 30, 933–951 (2020). https://doi.org/10.1007/s11222-020-09928-8


Keywords

  • Discriminant analysis
  • Variational Bayes approximation
  • Variable selection
  • Cake priors
  • Multiple hypothesis tests
  • Classification
  • Fast algorithms