Bayesian Discriminant Analysis Using a High Dimensional Predictor

Abstract

We consider the problem of Bayesian discriminant analysis using a high dimensional predictor. In this setting, the underlying precision matrices can be estimated with reasonable accuracy only if some appropriate additional structure like sparsity is assumed. We induce a prior on the precision matrix through a sparse prior on its Cholesky decomposition. For computational ease, we use shrinkage priors to induce sparsity on the off-diagonal entries of the Cholesky decomposition matrix and exploit certain conditional conjugacy structure. We obtain the contraction rate of the posterior distribution for the mean and the precision matrix respectively using the Euclidean and the Frobenius distance, and show that under a milder restriction on the growth of the dimension, the misclassification probability of the Bayesian classification procedure converges to that of the oracle classifier for both linear and quadratic discriminant analysis. Extensive simulations show that the proposed Bayesian methods perform very well. An application to identifying cancerous breast tumors based on image data obtained using fine needle aspirate is considered.

References

  • Banerjee, S. and Ghosal, S. (2014). Posterior convergence rates for estimating large precision matrices using graphical models. Electronic Journal of Statistics 8, 2, 2111–2137.

  • Banerjee, S. and Ghosal, S. (2015). Bayesian structure learning in graphical models. Journal of Multivariate Analysis 136, 147–162.

  • Bhattacharya, A., Pati, D., Pillai, N. S. and Dunson, D. B. (2015). Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association 110, 512, 1479–1490.

  • Bhadra, A., Datta, J., Polson, N. G. and Willard, B. (2017). The horseshoe+ estimator of ultra-sparse signals. Bayesian Analysis 12, 4, 1105–1131.

  • Bickel, P. J. and Levina, E. (2008a). Covariance regularization by thresholding. The Annals of Statistics 36, 6, 2577–2604.

  • Bickel, P. J. and Levina, E. (2008b). Regularized estimation of large covariance matrices. The Annals of Statistics 36, 1, 199–227.

  • Cai, T. T., Zhang, C. H. and Zhou, H. H. (2010). Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics 38, 4, 2118–2144.

  • Cai, T., Liu, W. and Luo, X. (2011). A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association 106, 494, 594–607.

  • Carvalho, C. M., Polson, N. G. and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika 97, 2, 465–480.

  • Carvalho, C. M., Polson, N. G. and Scott, J. G. (2009). Handling sparsity via the horseshoe. Artificial Intelligence and Statistics, p. 73–80.

  • Du, X. and Ghosal, S. (2017). Multivariate Gaussian network structure learning. Journal of Statistical Planning and Inference (to appear).

  • Fan, J. and Fan, X. (2008). High dimensional classification using features annealed independence rules. The Annals of Statistics 36, 6, 2605–2637.

  • Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 3, 432–441.

  • George, E. I. and McCulloch, R.E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association 88, 423, 881–889.

  • Ghosal, S. and van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference, 44. Cambridge University Press, Cambridge.

  • Griffin, J. E. and Brown, P. J. (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis 5, 1, 171–188.

  • Huang, J. Z., Liu, N., Pourahmadi, M. and Liu, L. (2006). Covariance matrix selection and estimation via penalised normal likelihood. Biometrika 93, 1, 85–98.

  • Ishwaran, H. and Rao, J. S. (2005). Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics 33, 2, 730–773.

  • Izenman, A. J. (2008). Modern Multivariate Statistical Techniques. Regression, Classification and Manifold Learning. Springer texts in statistics, Springer-Verlag, New York.

  • Khare, K., Oh, S. -Y. and Rajaratnam, B. (2015). A convex pseudolikelihood framework for high dimensional partial correlation estimation with convergence guarantees. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 77, 4, 803–825.

  • Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88, 2, 365–411.

  • Liu, H., Lafferty, J. and Wasserman, L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research 10, Oct, 2295–2328.

  • Mahalanobis, P. C. (1925). Analysis of race-mixture in Bengal. Proceedings of the Indian Science Congress.

  • Mahalanobis, P. C. (1928). Statistical study of the Chinese head. Man in India 8, 107–122.

  • Mahalanobis, P. C. (1930). A statistical study of certain anthropometric measurements from Sweden. Biometrika 22, 94–108.

  • Mahalanobis, P. (1930). On test and measures of group divergence. Journal of Asiatic Society Bengal 26, 541–588.

  • Mahalanobis, P. C. (1931). Anthropological observations on Anglo-Indians of Calcutta, Part II: Analysis of Anglo-India head length. Rec. Indian Museum, 23.

  • Mahalanobis, P. (1936). On the generalized distance in statistics. Proceedings of the National Institute of Science, India 2, 49–55.

  • Mahalanobis, P. C. (1949). Historical note on the D2-statistic. Appendix I: Anthropometric survey of the United Provinces, 1941: a statistical study. Sankhyā: The Indian Journal of Statistics 9, 237–239.

  • Mahalanobis, P. C., Majumdar, D. N., Yeatts, M. W. M. and Rao, C. R. (1949). Anthropometric survey of the United Provinces, 1941: a statistical study. Sankhyā: The Indian Journal of Statistics 3, 1, 89–324.

  • Majumdar, D. N., Rao, C. R. and Mahalanobis, P. C. (1958). Bengal anthropometric survey, 1945: A statistical study. Sankhyā: The Indian Journal of Statistics 19, 201–408.

  • Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 34, 3, 1436–1462.

  • Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association 83, 404, 1023–1032.

  • Mulgrave, J. J. and Ghosal, S. (2018). Bayesian inference in nonparanormal graphical models. arXiv:1806.04334.

  • Peng, J., Wang, P., Zhou, N. and Zhu, J. (2012). Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association 104, 486, 735–746.

  • Ročková, V. and George, E. I. (2014). EMVS: The EM approach to Bayesian variable selection. Journal of the American Statistical Association 109, 506, 828–846.

  • Wang, H. (2010). Bayesian graphical lasso models and efficient posterior computation. Bayesian Analysis 7, 4, 867–886.

  • Wei, R. and Ghosal, S. (2017). Contraction properties of shrinkage priors in logistic regression. Preprint at http://www4.stat.ncsu.edu/ghoshal/papers.

  • Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94, 1, 19–35.

Author information

Correspondence to Xingqi Du.

Additional information

Research of the second author is partially supported by NSF grant DMS-1510238.

Appendix

Proof of Proposition 1.

The (i,j)th element of the precision matrix Ω = LDLT is non-zero if and only if \({\sum }_{k = 1}^{\min (i,j)}l_{ik}l_{jk}\neq 0\), which, almost surely under the continuous priors on the non-zero entries, happens if and only if likljk≠ 0 for some k. Now

$$\begin{array}{@{}rcl@{}} {\Pi}\left( \sum\limits_{k = 1}^{\min(i,j)}l_{ik}l_{jk}\neq0\right)&= & {\Pi}(l_{ik}l_{jk}\neq0 \text{ for some } k)\\ &= & 1 - {\Pi}(l_{ik}l_{jk}= 0\text{ for all } k)\\ &= & 1 - \prod\limits_{k = 1}^{\min(i,j)}{\Pi}(l_{ik}l_{jk}= 0)\\ &= & 1 - \prod\limits_{k = 1}^{\min(i,j)}\left\{1-{\Pi}(l_{ik}l_{jk}\neq0)\right\}\\ &= & 1 - \prod\limits_{k = 1}^{\min(i,j)}\left\{1-{\Pi}(l_{ik}\neq0){\Pi}(l_{jk}\neq0)\right\}\\ &= & 1 - (1-\pi_{i}\pi_{j})^{\min(i,j)}. \end{array} $$

Suppose that i and j are of the same order. Then the expression above can be approximated as

$$(1-\pi_{i}\pi_{j})^{\min(i,j)}\approx(1-{\pi_{i}^{2}})^{i}\approx1-i{\pi_{i}^{2}}. $$

In order to assume the same sparsity for each i, we need \(i{\pi _{i}^{2}}\) to remain constant in i, that is, \(\pi _{i}\sim {C_{p}}/{\sqrt {i}}\). □
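
As an illustration (not part of the proof), the following Python sketch checks the identity above by Monte Carlo under the assumption of independent spike-and-slab-type priors in which lik is non-zero with probability πi and the non-zero values come from a continuous distribution; the constant C and the index pairs are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_nonzero_prob(i, j, pi, n_sim=100_000):
    """Monte Carlo estimate of the probability that sum_k l_ik l_jk != 0 when,
    independently, l_ik is non-zero with probability pi[i] and the non-zero values
    come from a continuous distribution (so the sum vanishes only on a null set)."""
    m = min(i, j)
    zi = rng.random((n_sim, m)) < pi[i]     # indicators of {l_ik != 0}, k = 1..min(i,j)
    zj = rng.random((n_sim, m)) < pi[j]     # indicators of {l_jk != 0}
    return np.mean((zi & zj).any(axis=1))   # at least one non-zero product

p, C = 200, 0.9                             # illustrative dimension and constant C_p
pi = C / np.sqrt(np.arange(1, p + 1))       # pi_i ~ C_p / sqrt(i), as in the proposition
pi = np.concatenate([[np.nan], pi])         # shift to 1-based indexing

for (i, j) in [(10, 12), (50, 60), (150, 160)]:
    theo = 1 - (1 - pi[i] * pi[j]) ** min(i, j)
    emp = empirical_nonzero_prob(i, j, pi)
    print(f"i={i:3d} j={j:3d}  formula={theo:.4f}  Monte Carlo={emp:.4f}")
```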

Proof of Theorem 1.

The proof of the theorem uses an argument similar to that used in the proof of the Neyman-Pearson lemma. Observe that

$$\begin{array}{@{}rcl@{}} &&r(\phi^{*})-r(\phi)\\ &&= \text{E} [Y(1-\phi^{*}(X))+(1-Y)\phi^{*}(X) ]-\text{E} [Y(1-\phi(X))+(1-Y)\phi(X) ]\\ &&= \text{E} [Y (\phi(X)-\phi^{*}(X) )+(1-Y) (\phi^{*}(X)-\phi(X) ) ]\\ &&= \text{E} [ (\phi(X)-\phi^{*}(X) )(2Y-1) ]\\ &&= \text{E} [\text{E} ( (\phi(X)-\phi^{*}(X) )(2Y-1)\mid X )]\\ &&= \text{E} [ (\phi(X)-\phi^{*}(X) ) (2\text{E}(Y\mid X)-1 ) ]. \end{array} $$

Clearly

$$ \text{E}(Y|X)=\frac{\pi p_{1}(X)}{\pi p_{1}(X)+(1-\pi)p_{0}(X)}>\frac{1}{2} $$
(7.2)

if and only if πp1(X) > (1 − π)p0(X). Since in this case ϕ∗(X) = 1 ≥ ϕ(X), we have

$$ (\phi(X)-\phi^{*}(X))(2\text{E}(Y|X)-1)\le 0. $$
(7.3)

On the other hand, when the expression in (7.2) is less than or equal to 1/2, ϕ∗(X) = 0 ≤ ϕ(X), again leading to (7.3). Therefore, r(ϕ∗) ≤ r(ϕ).

Alternatively, we can view ϕ∗ as the Bayes decision rule for parameter space (p1,p0), action space (1,0) and prior distribution (π,1 − π) based on observation X, and r(⋅) as the corresponding risk, which is minimized by the Bayes rule ϕ∗. □
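
The optimality of ϕ∗ can also be seen numerically. The sketch below is illustrative only: the one-dimensional Gaussian model, the competing midpoint rule and all constants are our own choices, not from the paper; it compares the Monte Carlo risk of the oracle rule with that of an arbitrary classifier.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Illustrative two-class model: Y ~ Bernoulli(pi_star), X | Y = k ~ N(mu_k, 1)
pi_star, mu0, mu1 = 0.3, 0.0, 1.5
n = 500_000
Y = rng.random(n) < pi_star
X = np.where(Y, rng.normal(mu1, 1.0, n), rng.normal(mu0, 1.0, n))

def risk(phi):
    """Misclassification risk r(phi) = E[Y(1 - phi(X)) + (1 - Y) phi(X)]."""
    pred = phi(X)
    return np.mean(Y * (1 - pred) + (~Y) * pred)

# Oracle rule phi*: classify as 1 iff pi * p1(x) > (1 - pi) * p0(x)
phi_star = lambda x: (pi_star * norm.pdf(x, mu1, 1.0)
                      > (1 - pi_star) * norm.pdf(x, mu0, 1.0)).astype(int)
# An arbitrary competitor, here thresholding at the midpoint of the two means
phi_mid = lambda x: (x > (mu0 + mu1) / 2).astype(int)

print("r(phi*) =", risk(phi_star))   # Theorem 1: never exceeds the risk of any phi
print("r(phi ) =", risk(phi_mid))
```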

To prove Theorem 2, we need the following lemma, which is an extension of Lemma A.1 of Banerjee and Ghosal (2015) to accommodate non-zero means.

Lemma 1.

Let q and q′ respectively denote the densities of Nm(μ, Ω−1) and Nm(μ′, Ω′−1), and let \(h(q,q^{\prime })=\|\sqrt {q}-\sqrt {q^{\prime }}\|_{2}\) stand for their Hellinger distance, where the dimension m is potentially growing but the eigenvalues of Ω remain bounded between two fixed positive numbers. Then there exist positive constants c0 (depending on Ω) and δ0 such that \( h^{2}(q,q^{\prime })\leq c_{0}\left (\|\mu -\mu ^{\prime }\|^{2}+\|{\Omega }-{\Omega }^{\prime }{\|_{F}^{2}}\right )\), and if h(q,q′) < δ0, then \( \left (\|\mu -\mu ^{\prime }\|^{2}+\|{\Omega }-{\Omega }^{\prime }{\|_{F}^{2}}\right )\leq c_{0} h^{2}(q,q^{\prime })\). Moreover, c0 can be taken to be a constant multiple of max{∥Ω∥S, ∥Ω−1∥S}.

Proof.

Let λi, i = 1,⋯ ,m, be the eigenvalues of the matrix A = Ω−1/2Ω′Ω−1/2. Then the squared Hellinger distance h2(q,q′) between q and q′ is given by

$$\begin{array}{@{}rcl@{}} && 1-\frac{\det({\Omega})^{-1/4}\det({\Omega}^{\prime})^{-1/4}}{\det(\frac{{\Omega}^{-1}+{\Omega}^{\prime-1}}{2})^{1/2}}\exp\left\{ - \frac18(\mu - \mu^{\prime})^{T}\left( \frac{{\Omega}^{-1} + {\Omega}^{\prime-1}}{2}\right)^{-1}(\mu - \mu^{\prime})\right\}\\ & = & 1 - 2^{m} \prod\limits_{i = 1}^{m} (\lambda_{i}^{1/2} + \lambda_{i}^{-1/2})^{-1/2} \exp\left\{ - \frac14(\mu - \mu^{\prime})^{T}\left( {{\Omega}^{-1} + {\Omega}^{\prime-1}}\right)^{-1}(\mu - \mu^{\prime})\right\}\\ &\le & \left[1-2^{m} \prod\limits_{i = 1}^{m} (\lambda_{i}^{1/2}+\lambda_{i}^{-1/2})^{-1/2}\right] +\frac14(\mu-\mu^{\prime})^{T}{\Omega} (\mu-\mu^{\prime}). \end{array} $$

Clearly the second term is bounded by a multiple of ∥μ − μ′∥2. The first term has been bounded by a multiple of \(\|{\Omega }-{\Omega }^{\prime }\|_{F}^{2}\) in Lemma A.1 (ii) of Banerjee and Ghosal (2015).

For the converse, observe that \(1-2^{m} {\prod }_{i = 1}^{m} (\lambda _{i}^{1/2}+\lambda _{i}^{-1/2})^{-1/2}\) is bounded by h2(q,q′). Let δ = h(q,q′) < δ0 for a sufficiently small δ0, so that by arguments used in the proof of Part (ii) of Lemma A.1 of Banerjee and Ghosal (2015), it follows that \(\|{\Omega }-{\Omega }^{\prime }\|_{F}^{2}\le \frac {1}{2} c_{0} [1-2^{m} {\prod }_{i = 1}^{m} (\lambda _{i}^{1/2}+\lambda _{i}^{-1/2})^{-1/2}]\le c_{0} h^{2}(q,q^{\prime })\) for some constant c0 > 0. As the Frobenius norm dominates the spectral norm, this in particular implies that ∥Ω − Ω′∥S is also small, and hence the eigenvalues of Ω′ also lie between two fixed positive numbers. Now h(q,q′) < δ0 also implies that

$$1-\exp\left\{-\frac18(\mu-\mu^{\prime})^{T}\left( \frac{{\Omega}^{-1}+{\Omega}^{\prime-1}}{2}\right)^{-1}(\mu-\mu^{\prime})\right\}<{\delta_{0}^{2}},$$

which implies that

$$(\mu-\mu^{\prime})^{T}\left( \frac{{\Omega}^{-1}+{\Omega}^{\prime-1}}{2}\right)^{-1}(\mu-\mu^{\prime})\lesssim h^{2}(q,q^{\prime}).$$

Hence \(\|\mu -\mu ^{\prime }\|^{2}\le \frac {1}{2} c_{0} h^{2}(q,q^{\prime })\). This gives the desired bound. □
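
For a concrete check of Lemma 1, the following sketch evaluates the Gaussian Hellinger distance in closed form, as in the first display of the proof, and compares h2(q,q′) with ∥μ − μ′∥2 + ∥Ω − Ω′∥F2 for a small perturbation; the dimension, conditioning constant and perturbation size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def hellinger_sq(mu, Omega, mu_p, Omega_p):
    """Squared Hellinger distance between N(mu, Omega^{-1}) and N(mu_p, Omega_p^{-1}),
    via the Gaussian closed form used in the first display of the proof."""
    Sig, Sig_p = np.linalg.inv(Omega), np.linalg.inv(Omega_p)
    Sbar = (Sig + Sig_p) / 2
    d = mu - mu_p
    coef = (np.linalg.det(Sig) ** 0.25 * np.linalg.det(Sig_p) ** 0.25
            / np.linalg.det(Sbar) ** 0.5)
    return 1 - coef * np.exp(-d @ np.linalg.solve(Sbar, d) / 8)

m = 5
A = rng.normal(size=(m, m))
Omega = A @ A.T + m * np.eye(m)               # well-conditioned precision matrix
mu = rng.normal(size=m)

eps = 0.05                                    # small perturbation of (mu, Omega)
E = rng.normal(size=(m, m)); E = (E + E.T) / 2
Omega_p, mu_p = Omega + eps * E, mu + eps * rng.normal(size=m)

h2 = hellinger_sq(mu, Omega, mu_p, Omega_p)
rhs = np.sum((mu - mu_p) ** 2) + np.sum((Omega - Omega_p) ** 2)
print(f"h^2 = {h2:.6f},  ||mu-mu'||^2 + ||Omega-Omega'||_F^2 = {rhs:.6f}")
# Lemma 1: the two quantities are comparable up to constants depending on Omega.
```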

Proof of Theorem 2.

To obtain the posterior contraction rate 𝜖n of (π, μ0, μ1, Ω0, Ω1) at \((\pi ^{*},\mu _{0}^{*},\mu _{1}^{*},{\Omega }_{0}^{*}, {\Omega }_{1}^{*})\), we apply the general theory of posterior contraction rates described in Ghosal and van der Vaart (2017), by executing the steps stated below.

  1. (i)

    Find a sieve \(\mathcal {P}_{n}\subset \mathcal {P}\) such that \({\Pi }(\mathcal {P}_{n}^{c})\le e^{-M n{\epsilon _{n}^{2}}}\) for a sufficiently large constant M > 0 and the 𝜖n-metric entropy of \(\mathcal {P}_{n}\) is bounded by a constant multiple of \(n{\epsilon _{n}^{2}}\).

  2. (ii)

    Show that the prior probability of an \({\epsilon _{n}^{2}}\)-neighborhood of the true density in the Kullback-Leibler sense is at least \(e^{-c n{\epsilon _{n}^{2}}}\) for some constant c > 0.

This gives the posterior contraction rate 𝜖n in terms of the Hellinger distance on the densities, which can be converted to the Euclidean distance on the mean and Frobenius distance on the precision matrix in view of Lemma 1.

Let \(q_{\mu _{0},{\Omega }_{0}}\) and \(q_{\mu _{1},{\Omega }_{1}}\) denote the two possible densities of an observation conditional on the classification information. The learning of π is solely based on N, and the posterior for π clearly concentrates at its true value π∗ at rate \(1/\sqrt {n}\). Thus, to simplify notation, in the remaining analysis we may treat π as given to be π∗ and establish posterior concentration for (μ0, μ1, Ω0, Ω1) based on N samples from Class 1 and n − N from Class 0. Then the average squared Hellinger distance is expressed as

$$ \frac{n-N}{n} h^{2}\left( p_{\mu_{0},{\Omega}_{0}},p_{\mu_{0}^{*},{\Omega}_{0}^{*}}\right)+\frac{N}{n}h^{2}\left( p_{\mu_{1},{\Omega}_{1}},p_{\mu_{1}^{*},{\Omega}_{1}^{*}}\right). $$
(7.4)

As N/n → π∗ almost surely and 0 < π∗ < 1, it suffices to establish posterior concentration for (μ0, Ω0) and (μ1, Ω1) separately. Therefore we use the generic notation (μ,Ω) to denote either pair, and we establish the rate of posterior contraction for μ and Ω here. The analysis of posterior concentration is similar to that of Theorem 3.1 of Banerjee and Ghosal (2015), with the difference being that there is an extra mean parameter. Moreover, unlike them, we put a prior on the precision matrix through its Cholesky decomposition, and the sparsity of the off-diagonal entries is only approximate since we do not use point-mass priors. However, posterior contraction for a point-mass prior can be recovered from our results since it corresponds to the limiting case v0 → 0 in the spike-and-slab prior.

Observe that for (μ,L,D), (μ′,L′,D′) with all entries of μ, μ′, L, L′, D, D′ bounded by B, and Ω = LDLT, Ω′ = L′D′L′T, we can write \(\|{\Omega }-{\Omega }^{\prime }\|_{F}^{2}\) as

$$\begin{array}{@{}rcl@{}} &&\sum\limits_{i,j = 1}^{p} [\sum\limits_{k = 1}^{p} (d_{k} l_{ik}l_{jk}-d_{k}^{\prime} l^{\prime}_{ik}l^{\prime}_{jk})]^{2}\\ &&~= \sum\limits_{i,j = 1}^{p} [\sum\limits_{k = 1}^{p} \{(d_{k}-d_{k}^{\prime}) l_{ik} l_{jk}+d_{k}^{\prime} (l_{ik}-l^{\prime}_{ik})l_{jk}+d_{k}^{\prime} l_{ik}^{\prime}(l_{jk}-l^{\prime}_{jk})\}]^{2} \end{array} $$

which can be bounded as

$$\begin{array}{@{}rcl@{}} &&3p \sum\limits_{i,j = 1}^{p} \sum\limits_{k = 1}^{p} \{(d_{k}-d_{k}^{\prime})^{2} l_{ik}^{2} l_{jk}^{2}+d_{k}^{\prime 2} (l_{ik}-l^{\prime}_{ik})^{2} l_{jk}^{2} + d_{k}^{\prime 2} l_{ik}^{\prime 2} (l_{jk}-l^{\prime}_{jk})^{2}\}\\ && ~\le 3 B^{4} p^{2} \{p \sum\limits_{k = 1}^{p} (d_{k}-d_{k}^{\prime})^{2} +\sum\limits_{i = 1}^{p} \sum\limits_{k = 1}^{p} (l_{ik}-l^{\prime}_{ik})^{2} + \sum\limits_{j = 1}^{p} \sum\limits_{k = 1}^{p} (l_{jk}-l^{\prime}_{jk})^{2}\}\\ && ~= 3B^{4} p^{2} [ p\|D-D^{\prime}\|_{F}^{2}+ 2\|L-L^{\prime}\|_{F}^{2}]. \end{array} $$

Hence if \(\|D-D^{\prime }\|_{\infty }\le \epsilon _{n}/(3B^{2}p^{2+\nu })\) and \(\|L-L^{\prime }\|_{\infty }\le \epsilon _{n}/(3B^{2}p^{2+\nu })\), where ∥⋅∥∞ stands for the maximum norm of a vector or matrix, then

$$\|{\Omega}-{\Omega}^{\prime}\|_{F}^{2}\le 3B^{4} p^{2} [ p^{2}\|D-D^{\prime}\|_{\infty}^{2}+ 2p^{2}\|L-L^{\prime}\|_{\infty}^{2}] \le \frac{9B^{4} p^{4}{\epsilon_{n}^{2}}}{(3B^{2} p^{2+\nu})^{2}} = \frac{{\epsilon_{n}^{2}}}{p^{2\nu}}.$$
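
The displayed Frobenius-norm bound can be verified numerically; the following sketch does so for a randomly generated unit lower-triangular L and diagonal D with entries bounded by B, where all sizes and perturbation scales are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

p, B = 8, 2.0
# Unit lower-triangular L, L' and diagonals d, d', all entries bounded by B in absolute value
L = np.tril(rng.uniform(-1.0, 1.0, (p, p)), k=-1) + np.eye(p)
Lp = L + np.tril(rng.uniform(-0.1, 0.1, (p, p)), k=-1)
d = rng.uniform(0.5, 1.5, p)
dp = d + rng.uniform(-0.1, 0.1, p)

Omega = L @ np.diag(d) @ L.T
Omega_p = Lp @ np.diag(dp) @ Lp.T

lhs = np.sum((Omega - Omega_p) ** 2)                    # ||Omega - Omega'||_F^2
rhs = 3 * B**4 * p**2 * (p * np.sum((d - dp) ** 2)      # p * ||D - D'||_F^2
                         + 2 * np.sum((L - Lp) ** 2))   # 2 * ||L - L'||_F^2
print(f"{lhs:.3f} <= {rhs:.3f}: {lhs <= rhs}")
```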

Further, \(\|{\Omega }\|_{S}\le \text {tr}({\Omega })\le p^{2}B^{2}\le p^{\nu }\) without loss of generality, by increasing ν if necessary. Define the effective edges to be the set {(i,j) : |lij| > 𝜖n/pν, i > j}. Consider the sieve \(\mathcal {P}_{n}\) consisting of all L with at most \(\bar {r}\) effective edges, where \(\bar {r}\) is a sufficiently large multiple of \(n{\epsilon _{n}^{2}}/\log n\), and with each entry of μ, D and L bounded in absolute value by \(B\in [b_{1}n{\epsilon _{n}^{2}},b_{1}n{\epsilon _{n}^{2}}+ 1]\). Then the 𝜖n/pν-metric entropy of \(\mathcal {P}_{n}\) with respect to the Euclidean norm for μ and the Frobenius norm for D and L is given by

$$\log\left\{\left( \frac{B p^{1/2+\nu}}{\epsilon_{n}}\right)^{p}\sum\limits_{j = 1}^{\bar{r}} \binom{\binom{p}{2}}{j} \left( \frac{3B^{2} p^{2+\nu}}{\epsilon_{n}}\right)^{j} \left( \frac{3B^{2}{p}^{2+\nu}}{\epsilon_{n}}\right)^{p}\right\}, $$

where the first factor \(({Bp^{1/2+\nu }}/{\epsilon _{n}})^{p}\) comes from the mean parameter, the second from the off-diagonal entries of L and the last from the diagonal elements of D. Note that for a component with span at most 𝜖n/pν, only one point is needed for a covering. In view of Lemma 1, the Hellinger distance between the corresponding densities is bounded by 𝜖n since max{∥Ω∥S, ∥Ω′∥S, ∥Ω−1∥S, ∥Ω′−1∥S} ≤ pν. Thus the entropy is bounded by a multiple of

$$\log\left\{\bar{r}\left( \frac{3B^{2}{p}^{2+\nu}}{\epsilon_{n}}\right)^{\bar{r}+ 2p} {p}^{2\bar{r}}\right\}\lesssim (\bar{r}+p)(\log p+\log B+\log (1/\epsilon_{n})).$$

For our choice of \(\bar {r}\) and B, this shows that the metric entropy is bounded by a constant multiple of \(n{\epsilon _{n}^{2}}\).

Now we bound \({\Pi }(\mathcal {P}_{n}^{c})\). Let \(\bar {R}\) be the number of effective edges of L. Then \(\bar {R}\sim \text {Bin}(\binom {p-1}{2},\eta _{0})\), where \(\eta _{0} ={\Pi } (|l|>\epsilon _{n}/p)\le p^{-b^{\prime }}\). From the tail estimate of the binomial distribution, \({\Pi }(\bar {R}>\bar {r})\le e^{-c \bar {r}\log \bar {r}}\) for some c > 0. Using the condition on the prior and the choice of B, we have

$$ {\Pi}(\mathcal{P}_{n}^{c})\leq {\Pi}(\bar{R}>\bar{r})+ 2 p^{2} \exp(-b_{1} n{\epsilon_{n}^{2}})\lesssim e^{-Cn{\epsilon_{n}^{2}}} , $$
(7.5)

where C can be chosen as large as we like, by simply choosing b1 large enough. This verifies the required conditions on the sieve.

Finally, to check the prior concentration rate, we need

$$ {\Pi}\{B(p^{*},\epsilon_{n})\}:={\Pi}\{p:K(p^{*},p){\leq\epsilon_{n}^{2}},V(p^{*},p){\leq\epsilon_{n}^{2}}\}\geq\exp(-n{\epsilon_{n}^{2}}), $$
(7.6)

where \(K(p^{*},p)=\int p^{*}\log (p^{*}/p)\), \(V(p^{*},p)=\int p^{*}\{\log (p^{*}/p)\}^{2}\). Note that for X ∼Np(μ,Σ) and a p × p symmetric matrix A, we have

$$\text{E}(X^{T}AX)=\text{tr}(A{\Sigma})+\mu^{T}A\mu,\quad \text{var}(X^{T}AX)= 2\text{tr}(A{\Sigma} A{\Sigma})+ 4\mu^{T}A{\Sigma} A\mu. $$
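
These moment formulas can be confirmed by simulation; the sketch below draws from Np(μ,Σ) and compares the empirical mean and variance of XTAX with the stated expressions (the dimension and sample size are illustrative).

```python
import numpy as np

rng = np.random.default_rng(4)

p = 4
mu = rng.normal(size=p)
M = rng.normal(size=(p, p)); Sigma = M @ M.T + np.eye(p)   # covariance matrix
A = rng.normal(size=(p, p)); A = (A + A.T) / 2             # symmetric matrix

X = rng.multivariate_normal(mu, Sigma, size=300_000)
q = np.einsum('ni,ij,nj->n', X, A, X)                       # X^T A X for each draw

mean_theory = np.trace(A @ Sigma) + mu @ A @ mu
var_theory = 2 * np.trace(A @ Sigma @ A @ Sigma) + 4 * mu @ A @ Sigma @ A @ mu
print(f"mean: MC = {q.mean():.3f}   theory = {mean_theory:.3f}")
print(f"var : MC = {q.var():.3f}   theory = {var_theory:.3f}")
```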

We use the above result to find the expressions for K(p∗,p) and V (p∗,p). Denoting the eigenvalues of the matrix Ω∗−1/2ΩΩ∗−1/2 by λi, i = 1,…,p, we have K(p∗,p) given by

$$\begin{array}{@{}rcl@{}} && \frac{1}{2}\log\det\frac{{\Omega}^{*}}{\Omega} -\frac{1}{2}\text{E}_{\mu^{*},{\Omega}^{*}}\left\{(X - \mu^{*})^{T}{\Omega}^{*}(X - \mu^{*})-(X-\mu)^{T}{\Omega}(X-\mu)\right\}\\ &=&\frac{1}{2}\log\det\frac{{\Omega}^{*}}{\Omega}-\frac{p}{2}+\frac{1}{2}\text{E}_{\mu^{*},{\Omega}^{*}}\left\{(X-\mu)^{T}{\Omega}(X-\mu)\right\}\\ &=& -\frac{1}{2}\sum\limits_{i = 1}^{p} (\log \lambda_{i}+1-\lambda_{i})+\frac{1}{2}(\mu^{*}-\mu)^{T}{\Omega}(\mu^{*}-\mu)\\ &\leq & -\frac{1}{2}\sum\limits_{i = 1}^{p}\log \lambda_{i}-\frac{1}{2}\sum\limits_{i = 1}^{p}(1-\lambda_{i})+\frac{1}{2}\|\mu^{*}-\mu\|_{2}^{2}\|{\Omega}\|_{S}. \end{array} $$

Now \({\sum }_{i = 1}^{p} (1-\lambda _{i})^{2} =\|I-{{\Omega }^{*}}^{-1/2}{\Omega }{\Omega }^{*-1/2}\|_{F}^{2}\), so if ∥I − Ω∗−1/2ΩΩ∗−1/2∥F is sufficiently small, then maxi|1 − λi| < 1, and hence \({\sum }_{i = 1}^{p} (\lambda _{i}-1-\log \lambda _{i})\lesssim {\sum }_{i = 1}^{p}(1-\lambda _{i})^{2}\), leading to the relation

$$ K(p^{*},p)\lesssim{\sum}_{i = 1}^{p}(1-\lambda_{i})^{2}+\|\mu^{*}-\mu\|_{2}^{2}\|{\Omega}\|_{S}. $$
(7.7)

Also, V(p∗,p) is given by

$$\begin{array}{@{}rcl@{}} && \!\frac14\text{Var}_{\mu^{*},{\Omega}^{*}}\left\{-(X-\mu^{*})^{T}{\Omega}^{*}(X-\mu^{*})+(X-\mu)^{T}{\Omega}(X-\mu)\right\}\\ & = & \!\frac14\text{Var}_{\mu^{*},{\Omega}^{*}}\left\{(X-\mu^{*})^{T}({\Omega}-{\Omega}^{*})(X-\mu^{*})+ 2(\mu-\mu^{*})^{T}{\Omega}(X-\mu^{*})\right\}\\ &\!\leq\! & \!\frac{1}{2}\text{Var}_{\mu^{*},{\Omega}^{*}}\left\{(X - \mu^{*})^{T}({\Omega} - {\Omega}^{*})(X - \mu^{*})\right\} + 2\text{Var}_{\mu^{*},{\Omega}^{*}}\left\{(\mu - \mu^{*})^{T}{\Omega}(X - \mu^{*})\right\}\\ & = & \!\text{tr}\left\{({\Omega}-{\Omega}^{*}){\Omega}^{*-1}({\Omega}-{\Omega}^{*}){\Omega}^{*-1}\right\}+(\mu-\mu^{*})^{T}{\Omega}{\Omega}^{*-1}{\Omega}(\mu-\mu^{*})\\ & = & \!\text{tr}(I_{p}-{\Omega}^{*-1/2}{\Omega}{\Omega}^{*-1/2})^{2}+(\mu-\mu^{*})^{T}{\Omega}{\Omega}^{*-1}{\Omega}(\mu-\mu^{*})\\ &\!\lesssim\! & \!{\sum}_{i = 1}^{p}(1-\lambda_{i})^{2}+\|\mu^{*}-\mu\|^{2}{\|{\Omega}\|_{S}^{2}}\|{\Omega}^{*-1}\|_{S}. \end{array} $$

By the assumption that Ω∗− 1 has bounded spectral norm, we have

$$\sum\limits_{i = 1}^{p}(1-\lambda_{i})^{2} = \| I_{p}-{\Omega}^{*-1/2}{\Omega}{\Omega}^{*-1/2}{\|_{F}^{2}} \leq \|{\Omega}^{*-1}{\|_{S}^{2}}\|{\Omega}-{\Omega}^{*}{\|_{F}^{2}}, $$

implying that for some sufficiently small constant c > 0,

$${\Pi}\{p:K(p^{*},p){\leq\epsilon_{n}^{2}}, V(p^{*},p){\leq\epsilon_{n}^{2}}\} \geq {\Pi}\{\|{\Omega}-{\Omega}^{*}{\|_{F}^{2}}\leq c{\epsilon_{n}^{2}}, \|\mu-\mu^{*}\|^{2}\leq c{\epsilon_{n}^{2}}\}. $$

Furthermore, we have

$$\begin{array}{@{}rcl@{}} && \|{\Omega}-{\Omega}^{*}{\|_{F}^{2}}\\ &= & \| LDL^{T}-L^{*}D^{*}L^{*T}{\|_{F}^{2}}\\ &\leq & 3\| LDL^{T}-LDL^{*T}{\|_{F}^{2}}+ 3\| LDL^{*T}-LD^{*}L^{*T}{\|_{F}^{2}}+ 3\| LD^{*}L^{*T}-L^{*}D^{*}L^{*T}{\|_{F}^{2}}\\ &\leq & 3\| L{\|_{S}^{2}}\| D{\|_{S}^{2}}\| L-L^{*}{\|_{F}^{2}}+ 3\| L{\|_{S}^{2}}\| L^{*}{\|_{S}^{2}}\| D-D^{*}{\|_{F}^{2}}+ 3\| L^{*}{\|_{S}^{2}}\| D^{*}{\|_{S}^{2}}\| L-L^{*}{\|_{F}^{2}}. \end{array} $$

Since the Frobenius norm dominates the spectral norm and ∥D∗∥S and ∥L∗∥S are bounded by some constant, so are ∥D∥S and ∥L∥S if ∥D − D∗∥F and ∥L − L∗∥F are small. Thus for some sufficiently small constant c′ > 0,

$$\begin{array}{@{}rcl@{}} && {\Pi}\{p:K(p^{*},p){\leq\epsilon_{n}^{2}},V(p^{*},p){\leq\epsilon_{n}^{2}}\}\\ &\geq & {\Pi}\{\| L-L^{*}{\|_{F}^{2}}\leq c^{\prime}{\epsilon_{n}^{2}}, \| D-D^{*}{\|_{F}^{2}}\leq c^{\prime}{\epsilon_{n}^{2}}, \|\mu-\mu^{*}\|^{2}\leq c^{\prime}{\epsilon_{n}^{2}}\}\\ &\geq & {\Pi}\{\| L - L^{*}\|_{\infty}\!\leq\! c^{\prime}\epsilon_{n}/p, \| D - D^{*}\|_{\infty}\leq c^{\prime}\epsilon_{n}/\sqrt{p}, \|\mu-\mu^{*}\|_{\infty}\leq c^{\prime}\epsilon_{n}/\sqrt{p}\}. \end{array} $$

In the actual prior, we constrain L and D such that Ω−1 has spectral norm bounded by B, but such a constraint can only increase the probability of the Kullback-Leibler neighborhood of the true density since Ω∗ satisfies the required constraints. Therefore, we may pretend that the components of L and D are independently distributed. Then the above expression simplifies into products of marginal probabilities. Let ηi = Π(|lij| > 𝜖n/p) be the probability that an element in the i th row of L is non-zero, and \(\zeta _{i}={\Pi }(| l_{ij}-l_{ij}^{*}| <c_{2}\epsilon _{n}/p)\) be the probability that this element is in a neighborhood of its true value when \(l_{ij}^{*}\ne 0\). Then

$${\Pi}\{\| L-L^{*}\|_{\infty}\leq c^{\prime}\epsilon_{n}/p\}=\prod\limits_{i = 1}^{p}\zeta_{i}^{s_{i}}(1-\eta_{i})^{i-1-s_{i}},$$

where si is the number of non-zero elements in the i th row of L∗. Note that s1 + ⋯ + sp = s. From the assumptions, \(\eta _{i}\le {p^{-b^{\prime }}}/\sqrt {i}\le p^{-b^{\prime }-1/2}\) for some b′ > 2 and \(\zeta _{i} ={\Pi } (| l_{ij}-l_{ij}^{*}| <c_{2}\epsilon _{n}/p \mid |l_{ij}|>\epsilon _{n}/p ){\Pi } (|l_{ij}|>\epsilon _{n}/p ) \ge \epsilon _{n} p^{-c^{\prime }}\) for some c′ > 0. Therefore, the lower bound for the above probability is given by

$$\begin{array}{@{}rcl@{}} \prod\limits_{i = 1}^{p}\zeta_{i}^{s_{i}}(1-\eta_{i})^{i-1-s_{i}}\geq\left( {c^{\prime}\epsilon_{n}}{p}^{-c^{\prime}}\right)^{s}\left( 1-{p^{-b^{\prime}-1/2}}\right)^{p(p-1)/2-s}\gtrsim e^{-c's\log (p/\epsilon_{n})}. \end{array} $$

Thus we have \({\Pi }\{\| L-L^{*}\|_{\infty }\leq c^{\prime }\epsilon _{n}/p\}\gtrsim (c^{\prime }\epsilon _{n}/p)^{s}.\) Similarly, we have \({\Pi }\{\| D-D^{*}\|_{\infty }\leq c^{\prime }\epsilon _{n}/p\}\gtrsim (c^{\prime }\epsilon _{n}/p)^{p}\) and \({\Pi }\{\| \mu -\mu ^{*}\|_{\infty }\leq c^{\prime }\epsilon _{n}/p\}\gtrsim (c^{\prime }\epsilon _{n}/p)^{p}\). Hence, the prior concentration rate condition holds as

$$(p+s)(\log p+\log({1}/{\epsilon_{n}}))\asymp n{\epsilon_{n}^{2}}$$

for the choice 𝜖n = n− 1/2(p + s)1/2(log n)1/2.

For LDA, the proof uses a similar idea but is notationally slightly more complicated because the same parameter Ω is shared by the two groups and the two groups of observations need to be considered together. In this case, the X-observations are not i.i.d., so we work with the average squared Hellinger distance (7.4) and dominate it by distances on μ0, μ1, Ω. The Kullback-Leibler divergences can also be bounded similarly. Then entropy and prior probability estimates of the same nature are established analogously. □

Proof of Theorem 3.

We first consider the case of LDA. The misclassification rate can be written as

$$\begin{array}{@{}rcl@{}} r&=&\pi^{*}{\Phi}\left( \frac{\frac{1}{2}{\mu_{1}^{T}}{\Omega}\mu_{1}-\frac{1}{2}{\mu_{0}^{T}}{\Omega}\mu_{0}-(\mu_{1}-\mu_{0})^{T}{\Omega}\mu_{1}^{*}+\log\frac{1-\pi}{\pi}}{\sqrt{(\mu_{1}-\mu_{0})^{T}{\Omega}{\Sigma}^{*}{\Omega}(\mu_{1}-\mu_{0})}}\right)\\ & &+(1-\pi^{*}){\Phi}\left( \frac{\frac{1}{2}{\mu_{0}^{T}}{\Omega}\mu_{0}-\frac{1}{2}{\mu_{1}^{T}}{\Omega}\mu_{1}-(\mu_{0}-\mu_{1})^{T}{\Omega}\mu_{0}^{*}+\log\frac{\pi}{1-\pi}}{\sqrt{(\mu_{1}-\mu_{0})^{T}{\Omega}{\Sigma}^{*}{\Omega}(\mu_{1}-\mu_{0})}}\right). \end{array} $$

Let

$$\begin{array}{@{}rcl@{}} C_{1} &=& \frac{1}{2}{\mu_{1}^{T}}{\Omega}\mu_{1}-\frac{1}{2}{\mu_{0}^{T}}{\Omega}\mu_{0}-(\mu_{1}-\mu_{0})^{T}{\Omega}\mu_{1}^{*},\\ C_{0} &=& \frac{1}{2}{\mu_{0}^{T}}{\Omega}\mu_{0}-\frac{1}{2}{\mu_{1}^{T}}{\Omega}\mu_{1}-(\mu_{0}-\mu_{1})^{T}{\Omega}\mu_{0}^{*},\\ C_{2} &=& \sqrt{(\mu_{1}-\mu_{0})^{T}{\Omega}{\Sigma}^{*}{\Omega}(\mu_{1}-\mu_{0})}. \end{array} $$
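
For a numerical illustration of the misclassification-rate formula in terms of C0, C1 and C2, the sketch below evaluates r from the formula and checks it against a Monte Carlo estimate of the error rate of the corresponding linear rule; the helper lda_risk and all plug-in parameters are illustrative assumptions, not from the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

def lda_risk(pi, mu0, mu1, Omega, pi_star, mu0_star, mu1_star, Sigma_star):
    """Misclassification rate of the linear rule built from (pi, mu0, mu1, Omega)
    when the truth is (pi_star, mu0_star, mu1_star, Sigma_star), i.e. the displayed
    formula written in terms of C0, C1 and C2."""
    w = Omega @ (mu1 - mu0)
    C2 = np.sqrt(w @ Sigma_star @ w)
    C1 = 0.5 * mu1 @ Omega @ mu1 - 0.5 * mu0 @ Omega @ mu0 - w @ mu1_star
    C0 = 0.5 * mu0 @ Omega @ mu0 - 0.5 * mu1 @ Omega @ mu1 + w @ mu0_star
    return (pi_star * norm.cdf((C1 + np.log((1 - pi) / pi)) / C2)
            + (1 - pi_star) * norm.cdf((C0 + np.log(pi / (1 - pi))) / C2))

# made-up truth and slightly misspecified plug-in parameters
p = 3
mu0_star, mu1_star = np.zeros(p), np.ones(p)
Sigma_star = 0.5 * np.eye(p) + 0.5               # equicorrelated covariance
pi_star = 0.4
mu0, mu1 = mu0_star + 0.1, mu1_star - 0.1
Omega = np.linalg.inv(Sigma_star + 0.05 * np.eye(p))
pi = 0.45

r_formula = lda_risk(pi, mu0, mu1, Omega, pi_star, mu0_star, mu1_star, Sigma_star)

# Monte Carlo check of the same error rate
n = 400_000
Y = rng.random(n) < pi_star
X = np.where(Y[:, None],
             rng.multivariate_normal(mu1_star, Sigma_star, n),
             rng.multivariate_normal(mu0_star, Sigma_star, n))
score = (X @ Omega @ (mu1 - mu0)
         - 0.5 * (mu1 @ Omega @ mu1 - mu0 @ Omega @ mu0) + np.log(pi / (1 - pi)))
print(f"formula: {r_formula:.4f}   Monte Carlo: {np.mean((score > 0) != Y):.4f}")
```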

Then

$$\begin{array}{@{}rcl@{}} && | r(\pi^{*},\mu_{1}^{*},\mu_{0}^{*},{\Omega}^{*})-r(\pi,\mu_{1},\mu_{0},{\Omega})| \\ &= &\Big|\pi^{*}{\Phi}\left( \frac{C_{1}+\log\frac{1-\pi}{\pi}}{C_{2}}\right)+(1-\pi^{*}){\Phi}\left( \frac{C_{0}+\log\frac{\pi}{1-\pi}}{C_{2}}\right)\\ && -\pi^{*}{\Phi}\left( \frac{C_{1}^{*}+\log\frac{1-\pi^{*}}{\pi^{*}}}{C_{2}^{*}}\right)-(1-\pi^{*}){\Phi}\left( \frac{C_{0}^{*}+\log\frac{\pi^{*}}{1-\pi^{*}}}{C_{2}^{*}}\right)\Big|\\ &\leq & \Big|{\Phi}\left( \frac{C_{1}+\log\frac{1-\pi}{\pi}}{C_{2}}\right)-{\Phi}\left( \frac{C_{1}^{*}+\log\frac{1-\pi^{*}}{\pi^{*}}}{C_{2}^{*}}\right)\Big|\\ && +\Big|{\Phi}\left( \frac{C_{0}+\log\frac{\pi}{1-\pi}}{C_{2}}\right)-{\Phi}\left( \frac{C_{0}^{*}+\log\frac{\pi^{*}}{1-\pi^{*}}}{C_{2}^{*}}\right)\Big|. \end{array} $$

We have

$$\begin{array}{@{}rcl@{}} | {C_{2}^{2}}-C_{2}^{*2}| &= &| (\mu_{1}-\mu_{0})^{T}{\Omega}{\Sigma}^{*}{\Omega}(\mu_{1}-\mu_{0})-(\mu_{1}^{*}-\mu_{0}^{*})^{T}{\Omega}^{*}{\Sigma}^{*}{\Omega}^{*}(\mu_{1}^{*}-\mu_{0}^{*})|\\ &\leq &| (\mu_{1}-\mu_{0})^{T}{\Omega}{\Sigma}^{*}{\Omega}(\mu_{1}-\mu_{0})-(\mu_{1}^{*}-\mu_{0}^{*})^{T}{\Omega}{\Sigma}^{*}{\Omega}(\mu_{1}-\mu_{0})|\\ && +| (\mu_{1}^{*}-\mu_{0}^{*})^{T}{\Omega}{\Sigma}^{*}{\Omega}(\mu_{1}-\mu_{0})-(\mu_{1}^{*}-\mu_{0}^{*})^{T}{\Omega}{\Sigma}^{*}{\Omega}(\mu_{1}^{*}-\mu_{0}^{*})|\\ && +| (\mu_{1}^{*}-\mu_{0}^{*})^{T}({\Omega}{\Sigma}^{*}{\Omega}-{\Omega}{\Sigma}^{*}{\Omega}^{*})(\mu_{1}^{*}-\mu_{0}^{*})|\\ && +| (\mu_{1}^{*}-\mu_{0}^{*})^{T}({\Omega}{\Sigma}^{*}{\Omega}^{*}-{\Omega}^{*}{\Sigma}^{*}{\Omega}^{*})(\mu_{1}^{*}-\mu_{0}^{*})|. \end{array} $$

We obtain that

$$\begin{array}{@{}rcl@{}} \lefteqn{| (\mu_{1}-\mu_{0})^{T}{\Omega}{\Sigma}^{*}{\Omega}(\mu_{1}-\mu_{0})-(\mu_{1}^{*}-\mu_{0}^{*})^{T}{\Omega}{\Sigma}^{*}{\Omega}(\mu_{1}-\mu_{0})|}\\ &&\leq \|(\mu_{1}-\mu_{0})-(\mu_{1}^{*}-\mu_{0}^{*}){\|\|{\Omega}\|_{S}^{2}}\|{\Sigma}^{*}\|_{S}\|\mu_{1}-\mu_{0}\|_{2}\\ &&= O\left( \sqrt{\frac{p(p+s)\log p}{n}}\right), \end{array} $$

and

$$\begin{array}{@{}rcl@{}} \lefteqn{| (\mu_{1}^{*}-\mu_{0}^{*})^{T}{\Omega}{\Sigma}^{*}{\Omega}(\mu_{1}-\mu_{0})-(\mu_{1}^{*}-\mu_{0}^{*})^{T}{\Omega}{\Sigma}^{*}{\Omega}(\mu_{1}^{*}-\mu_{0}^{*})|}\\ &&\leq \|(\mu_{1}-\mu_{0})-(\mu_{1}^{*}-\mu_{0}^{*})\|_{2}{\|{\Omega}\|_{S}^{2}}\|{\Sigma}^{*}\|_{S}\|\mu_{1}^{*}-\mu_{0}^{*}\|_{2}\\ && = O\left( \sqrt{\frac{p(p+s)\log p}{n}}\right). \end{array} $$

Let \(\bar {R}\) be the number of effective edges in L from \(\mathcal {P}_{n}\). From the proof of Theorem 2 it follows that the number of effective edges in Ω is bounded by \(\bar {R}\), and by the choice of \(\mathcal {P}_{n}\), with posterior probability tending to one in probability, \(\bar {R} \lesssim n{\epsilon _{n}^{2}}/\log n=O(p+s)\). Let A = Ω − Ω∗; then the number of non-zero elements in A is also O(p + s) with posterior probability tending to one in probability. Thus,

$$\begin{array}{@{}rcl@{}} \lefteqn{| (\mu_{1}^{*}-\mu_{0}^{*})^{T}({\Omega}{\Sigma}^{*}{\Omega}-{\Omega}{\Sigma}^{*}{\Omega}^{*})(\mu_{1}^{*}-\mu_{0}^{*})|}\\ &&\leq \|{\Omega}\|_{S}\|{\Sigma}^{*}\|_{S}\sum\limits_{i,j = 1}^{p}| a_{ij}||\mu_{1,i}^{*}-\mu_{0,i}^{*}||\mu_{1,j}^{*}-\mu_{0,j}^{*}| \end{array} $$

which is \(O(n^{-1/2}(p+s)\sqrt {\log p})\) because

$$\sum\limits_{i,j = 1}^{p}| a_{ij}|\leq \sqrt{p+s}\| {\Omega}-{\Omega}^{*}\|_{F}=O\left( \sqrt{\frac{(p+s)^{2}\log p}{n}}\right).$$
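
The Cauchy–Schwarz step behind this display, namely that the absolute entry sum of a matrix with k non-zero entries is at most \(\sqrt {k}\) times its Frobenius norm, can be checked as follows; k and the matrix are illustrative, with k playing the role of the O(p + s) non-zero entries of A.

```python
import numpy as np

rng = np.random.default_rng(6)

p, k = 50, 120                                  # k non-zero entries, standing in for O(p + s)
A = np.zeros((p, p))
idx = rng.choice(p * p, size=k, replace=False)
A.flat[idx] = rng.normal(size=k)                # a sparse difference matrix (illustrative)

lhs = np.abs(A).sum()
rhs = np.sqrt(k) * np.linalg.norm(A, 'fro')     # Cauchy-Schwarz over the support
print(f"sum|a_ij| = {lhs:.2f} <= sqrt(k)*||A||_F = {rhs:.2f}: {lhs <= rhs}")
```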

Similarly, it follows that

$$\begin{array}{@{}rcl@{}} | (\mu_{1}^{*}-\mu_{0}^{*})^{T}({\Omega}{\Sigma}^{*}{\Omega}^{*}-{\Omega}^{*}{\Sigma}^{*}{\Omega}^{*})(\mu_{1}^{*}-\mu_{0}^{*})|=O\left( \sqrt{\frac{(p+s)^{2}\log p}{n}}\right). \end{array} $$

Thus, \(| {C_{2}^{2}}-C_{2}^{*2}|=O\left (n^{-1/2}(p+s)\sqrt {\log p}\right )=o(1)\). Hence by Assumption (A4), \({C_{2}^{2}}\geq M_{0}>0\). Therefore,

$$\begin{array}{@{}rcl@{}} && | r(\pi^{*},\mu_{1}^{*},\mu_{0}^{*},{\Omega}^{*})-r(\pi,\mu_{1},\mu_{0},{\Omega})|\\ &\leq & \left|{\Phi}\left( \frac{C_{1}+\log\frac{1-\pi}{\pi}}{C_{2}}\right)-{\Phi}\left( \frac{C_{1}^{*}+\log\frac{1-\pi^{*}}{\pi^{*}}}{C_{2}^{*}}\right)\right|\\ && +\left|{\Phi}\left( \frac{C_{0}+\log\frac{\pi}{1-\pi}}{C_{2}}\right)-{\Phi}\left( \frac{C_{0}^{*}+\log\frac{\pi^{*}}{1-\pi^{*}}}{C_{2}^{*}}\right)\right|\\ &\leq & \frac{1}{\sqrt{M}}\left( | C_{1}-C_{1}^{*}|+| C_{0}-C_{0}^{*}|\right.\\ && \left.+|\log\frac{1-\pi}{\pi}-\log\frac{1-\pi^{*}}{\pi^{*}}|+|\log\frac{\pi}{1-\pi}-\log\frac{\pi^{*}}{1-\pi^{*}}|\right). \end{array} $$

We know that

$$\begin{array}{@{}rcl@{}} {|\mu_{1}^{T}}{\Omega}\mu_{1}-\mu_{1}^{*T}{\Omega}^{*}\mu_{1}^{*}|\leq& {|\mu_{1}^{T}}{\Omega}\mu_{1}-{\mu_{1}^{T}}{\Omega}\mu_{1}^{*}|+{|\mu_{1}^{T}}{\Omega}\mu_{1}^{*}-\mu_{1}^{*T}{\Omega}\mu_{1}^{*}|\\ & +|\mu_{1}^{*T}{\Omega}\mu_{1}^{*}-\mu_{1}^{*T}{\Omega}^{*}\mu_{1}^{*}|. \end{array} $$

Similar to the proof above,

$$|\mu_{1}^{*T}{\Omega}\mu_{1}^{*}-\mu_{1}^{*T}{\Omega}^{*}\mu_{1}^{*}|\leq \sum\limits_{i,j = 1}^{p}| a_{ij}||\mu_{1,i}^{*}||\mu_{1,j}^{*}| \asymp{\sum}_{i,j = 1}^{p}| a_{ij}|\leq \sqrt{p+s}\| A\|_{F},$$

which is \( O(n^{-1/2} (p+s)\sqrt {\log p})\). Also,

$${|\mu_{1}^{T}}{\Omega}\mu_{1}^{*}-\mu_{1}^{*T}{\Omega}\mu_{1}^{*}|\leq\| \mu_{1}-\mu_{1}^{*}\|\|{\Omega}\|_{S}\|\mu_{1}^{*}\|=O\left( \sqrt{\frac{p(p+s)\log p}{n}}\right), $$
$$|{\mu_{1}^{T}}{\Omega}\mu_{1}-{\mu_{1}^{T}}{\Omega}\mu_{1}^{*}|\leq\| \mu_{1}-\mu_{1}^{*}\|\|{\Omega}\|_{S}\|\mu_{1}\|=O\left( \sqrt{\frac{p(p+s)\log p}{n}}\right). $$

Therefore, we have that

$${|\mu_{k}^{T}}{\Omega}\mu_{k}-\mu_{k}^{*T}{\Omega}^{*}\mu_{k}^{*}|=O\left( \sqrt{\frac{(p+s)^{2}\log p}{n}}\right), k = 0,1.$$

In addition,

$$\begin{array}{@{}rcl@{}} && |(\mu_{1}-\mu_{0})^{T}{\Omega}\mu_{1}^{*}-(\mu_{1}^{*}-\mu_{0}^{*})^{T}{\Omega}^{*}\mu_{1}^{*}|\\ &\leq&|(\mu_{1} - \mu_{0})^{T}{\Omega}\mu_{1}^{*} - (\mu_{1}^{*} - \mu_{0}^{*})^{T}{\Omega}\mu_{1}^{*}|+|(\mu_{1}^{*}-\mu_{0}^{*})^{T}{\Omega}\mu_{1}^{*}-(\mu_{1}^{*}-\mu_{0}^{*})^{T}{\Omega}^{*}\mu_{1}^{*}|\\ &\leq & \|(\mu_{1}-\mu_{0})-(\mu_{1}^{*}-\mu_{0}^{*})\|\|{\Omega}\|_{S}\|\mu_{1}^{*}\|+\|\mu_{1}^{*}-\mu_{0}^{*}\|\|\mu_{1}^{*}\|\|{\Omega}-{\Omega}^{*}\|_{F}\\ &= & O\left( \sqrt{\frac{p(p+s)\log p}{n}}\right)+O\left( \sqrt{\frac{(p+s)^{2}\log p}{n}}\right)\\ &= & O\left( \sqrt{\frac{(p+s)^{2}\log p}{n}}\right). \end{array} $$

Therefore, \(\max \{| C_{1}-C_{1}^{*}|,| C_{0}-C_{0}^{*}|\} =O(n^{-1/2}(p+s)\sqrt {\log p})\to 0\) under the condition that (p + s)2 log p = o(n).

For QDA, the misclassification rate does not have an explicit expression but (2.5) leads to upper bounds of similar nature. □

Cite this article

Du, X., Ghosal, S. Bayesian Discriminant Analysis Using a High Dimensional Predictor. Sankhya A 80 (Suppl 1), 112–145 (2018). https://doi.org/10.1007/s13171-018-0140-z
