Abstract
We consider the problem of Bayesian discriminant analysis using a high dimensional predictor. In this setting, the underlying precision matrices can be estimated with reasonable accuracy only if some appropriate additional structure like sparsity is assumed. We induce a prior on the precision matrix through a sparse prior on its Cholesky decomposition. For computational ease, we use shrinkage priors to induce sparsity on the off-diagonal entries of the Cholesky decomposition matrix and exploit certain conditional conjugacy structure. We obtain the contraction rate of the posterior distribution for the mean and the precision matrix respectively using the Euclidean and the Frobenius distance, and show that under a mild restriction on the growth of the dimension, the misclassification probability of the Bayesian classification procedure converges to that of the oracle classifier for both linear and quadratic discriminant analysis. Extensive simulations show that the proposed Bayesian methods perform very well. An application to identifying cancerous breast tumors based on image data obtained using fine needle aspirates is considered.
References
Banerjee, S. and Ghosal, S. (2014). Posterior convergence rates for estimating large precision matrices using graphical models. Electronic Journal of Statistics 8, 2, 2111–2137.
Banerjee, S. and Ghosal, S. (2015). Bayesian structure learning in graphical models. Journal of Multivariate Analysis 136, 147–162.
Bhattacharya, A., Pati, D., Pillai, N. S. and Dunson, D. B. (2015). Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association 110, 512, 1479–1490.
Bhadra, A., Datta, J., Polson, N. G. and Willard, B. (2017). The horseshoe+ estimator of ultra-sparse signals. Bayesian Analysis 12, 4, 1105–1131.
Bickel, P. J. and Levina, E. (2008a). Covariance regularization by thresholding. The Annals of Statistics 36, 6, 2577–2604.
Bickel, P. J. and Levina, E. (2008b). Regularized estimation of large covariance matrices. The Annals of Statistics 36, 1, 199–227.
Cai, T. T., Zhang, C. H. and Zhou, H. H. (2010). Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics 38, 4, 2118–2144.
Cai, T., Liu, W. and Luo, X. (2011). A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association 106, 494, 594–607.
Carvalho, C. M., Polson, N. G. and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika 97, 2, 465–480.
Carvalho, C. M., Polson, N. G. and Scott, J. G. (2009). Handling sparsity via the horseshoe. Artificial Intelligence and Statistics, p. 73–80.
Du, X. and Ghosal, S. (2017). Multivariate Gaussian network structure learning. Journal of Statistical Planning and Inference (to appear).
Fan, J. and Fan, X. (2008). High dimensional classification using features annealed independence rules. The Annals of Statistics 36, 6, 2605–2637.
Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 3, 432–441.
George, E. I. and McCulloch, R.E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association 88, 423, 881–889.
Ghosal, S. and van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference, 44. Cambridge University Press, Cambridge.
Griffin, J. E. and Brown, P. J. (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis 5, 1, 171–188.
Huang, J. Z., Liu, N., Pourahmadi, M. and Liu, L. (2006). Covariance matrix selection and estimation via penalised normal likelihood. Biometrika 93, 1, 85–98.
Ishwaran, H. and Rao, J. S. (2005). Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics 33, 2, 730–773.
Izenman, A. J. (2008). Modern Multivariate Statistical Techniques. Regression, Classification and Manifold Learning. Springer texts in statistics, Springer-Verlag, New York.
Khare, K., Oh, S. -Y. and Rajaratnam, B. (2015). A convex pseudolikelihood framework for high dimensional partial correlation estimation with convergence guarantees. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 77, 4, 803–825.
Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88, 2, 365–411.
Liu, H., Lafferty, J. and Wasserman, L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research 10, Oct, 2295–2328.
Mahalanobis, P. C. (1925). Analysis of race-mixture in Bengal. Proceedings of the Indian Science Congress.
Mahalanobis, P. C. (1928). Statistical study of the Chinese head. Man in India 8, 107–122.
Mahalanobis, P. C. (1930). A statistical study of certain anthropometric measurements from Sweden. Biometrika 22, 94–108.
Mahalanobis, P. (1930). On test and measures of group divergence. Journal of Asiatic Society Bengal 26, 541–588.
Mahalanobis, P. C. (1931). Anthropological observations on Anglo-Indians of Calcutta, Part II: Analysis of Anglo-India head length. Rec. Indian Museum, 23.
Mahalanobis, P. (1936). On the generalized distance in statistics. Proceedings of the National Institute of Science, India 2, 49–55.
Mahalanobis, P. C. (1949). Historical note on the D2-statistic. Appendix I: Anthropometric survey of the United Provinces, 1941: a statistical study. Sankhyā: The Indian Journal of Statistics 9, 237–239.
Mahalanobis, P. C., Majumdar, D. N., Yeatts, M. W. M. and Rao, C. R. (1949). Anthropometric survey of the United Provinces, 1941: a statistical study. Sankhyā: The Indian Journal of Statistics 3, 1, 89–324.
Majumdar, D. N., Rao, C. R. and Mahalanobis, P. C. (1958). Bengal anthropometric survey, 1945: A statistical study. Sankhyā: The Indian Journal of Statistics 19, 201–408.
Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 1436–1462.
Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association 83, 404, 1023–1032.
Mulgrave, J. J. and Ghosal, S. (2018). Bayesian inference in nonparanormal graphical models. arXiv:1806.04334.
Peng, J., Wang, P., Zhou, N. and Zhu, J. (2012). Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association 104, 486, 735–746.
Ročková, V. and George, E. I. (2014). EMVS: The EM approach to Bayesian variable selection. Journal of the American Statistical Association 109, 506, 828–846.
Wang, H. (2010). Bayesian graphical lasso models and efficient posterior computation. Bayesian Analysis 7, 4, 867–886.
Wei, R. and Ghosal, S. (2017). Contraction properties of shrinkage priors in logistic regression. Preprint at http://www4.stat.ncsu.edu/ghoshal/papers.
Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94, 1, 19–35.
Research of the second author is partially supported by NSF grant DMS-1510238.
Appendix
Proof of Proposition 1.
The (i,j)th element of a precision matrix Ω = LDLT is nonzero if and only if \({\sum }_{k = 1}^{\min (i,j)}l_{ik}l_{jk}\neq 0\). Now
Suppose that i and j are of the same order. Then the above equation can be written as
In order to assume the same sparsity for each i, we need \(i{\pi _{i}^{2}}\) to remain constant in i, that is, \(\pi _{i}\sim {C_{p}}/{\sqrt {i}}\). □
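As a numerical sanity check of this scaling, the sketch below (the names and the independence assumption are ours, not from the paper) treats each entry \(l_{ik}\) in row i as nonzero independently with probability \(\pi_i = C/\sqrt{i}\) and evaluates the induced probability that ω<sub>ij</sub> is nonzero; along the diagonal i = j it stays essentially constant at about \(C^2\), as the proposition requires.

```python
import numpy as np

def nonzero_prob(i, j, C=0.1):
    """Probability that omega_ij != 0 when each l_ik (k <= min(i, j)) is
    nonzero independently with probability pi_i = C / sqrt(i): at least
    one index k must have both l_ik and l_jk nonzero."""
    pi_i, pi_j = C / np.sqrt(i), C / np.sqrt(j)
    return 1.0 - (1.0 - pi_i * pi_j) ** min(i, j)

# Along the diagonal i = j the probability is ~ C^2, independent of i.
probs = [nonzero_prob(i, i) for i in (10, 100, 1000)]
```

For small C the exact probability \(1-(1-\pi_i\pi_j)^{\min(i,j)}\) is close to its first-order approximation \(\min(i,j)\,\pi_i\pi_j\), which is exactly the quantity balanced in the proof.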
Proof of Theorem 1.
The proof of the theorem uses an argument similar to that used in the proof of the Neyman-Pearson lemma. Observe that
Clearly
if and only if πp1(X) > (1 − π)p0(X). Since in this case ϕ∗(X) = 1 ≥ ϕ(X), we have
On the other hand, when the expression in (7.2) is less than or equal to 1/2, ϕ∗(X) = 0 ≤ ϕ(X), again leading to (7.3). Therefore, r(ϕ∗) ≤ r(ϕ).
Alternatively, we can view ϕ∗ as the Bayes decision rule for parameter space (p1,p0), action space (1,0) and prior distribution (π,1 − π) based on observation X and r(⋅) as the Bayes risk, which is minimized by the Bayes rule ϕ∗. □
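The rule ϕ∗ in this proof is a likelihood-ratio threshold and can be sketched directly in code. The following is a minimal illustration (the function name and test values are ours, not from the paper); it covers QDA, and reduces to LDA when the two precision matrices coincide.

```python
import numpy as np
from scipy.stats import multivariate_normal

def oracle_rule(x, pi, mu0, omega0, mu1, omega1):
    """Oracle Bayes classifier phi*: assign class 1 exactly when
    pi * p1(x) > (1 - pi) * p0(x), where p_k is the density of
    N(mu_k, Omega_k^{-1})."""
    p0 = multivariate_normal.pdf(x, mean=mu0, cov=np.linalg.inv(omega0))
    p1 = multivariate_normal.pdf(x, mean=mu1, cov=np.linalg.inv(omega1))
    return int(pi * p1 > (1 - pi) * p0)
```

With equal class probabilities and a common precision matrix, this threshold reduces to comparing Mahalanobis distances to the two class means.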
To prove Theorem 2, we need the following lemma, which is an extension of Lemma A.1 of Banerjee and Ghosal (2015) to accommodate non-zero means.
Lemma 1.
Let q and q′ respectively denote the densities of Nm(μ,Ω− 1) and Nm(μ′,Ω′− 1), and let \(h(q,q^{\prime })=\|\sqrt {q}-\sqrt {q^{\prime }}\|_{2}\) stand for their Hellinger distance, where the dimension m is potentially growing but the eigenvalues of Ω remain bounded between two fixed positive numbers. Then there exist positive constants c0 (depending on Ω) and δ0 such that \( h^{2}(q,q^{\prime })\leq c_{0}\left (\|\mu -\mu ^{\prime }\|^{2}+\|{\Omega }-{\Omega }^{\prime }{\|_{F}^{2}}\right )\), and if h(q,q′) < δ0, then \( \left (\|\mu -\mu ^{\prime }\|^{2}+\|{\Omega }-{\Omega }^{\prime }{\|_{F}^{2}}\right )\leq c_{0} h^{2}(q,q^{\prime })\). Moreover, c0 can be taken to be a constant multiple of max{∥Ω∥S,∥Ω− 1∥S}.
Proof.
Let λi, i = 1,⋯ ,m, be the eigenvalues of the matrix A = Ω− 1/2Ω′Ω− 1/2. Then the squared Hellinger distance h2(q,q′) between q and q′ is given by
Clearly the second term is bounded by a multiple of ∥μ − μ′∥2. The first term has been bounded by a multiple of \(\|{\Omega }-{\Omega }^{\prime }\|_{F}^{2}\) in Lemma A.1 (ii) of Banerjee and Ghosal (2015).
For the converse, observe that \(1-2^{m} {\prod }_{i = 1}^{m} (\lambda _{i}^{1/2}+\lambda _{i}^{-1/2})^{-1/2}\) is bounded by h2(q,q′). Let δ = h(q,q′) < δ0 for a sufficiently small δ0, so that by arguments used in the proof of Part (ii) of Lemma A.1 of Banerjee and Ghosal (2015), it follows that \(\|{\Omega }-{\Omega }^{\prime }\|_{F}^{2}\le \frac {1}{2} c_{0} [1-2^{m} {\prod }_{i = 1}^{m} (\lambda _{i}^{1/2}+\lambda _{i}^{-1/2})^{-1/2}]\le c_{0} h^{2}(q,q^{\prime })\) for some constant c0 > 0. As the Frobenius norm dominates the spectral norm, this in particular implies that ∥Ω−Ω′∥S is also small, and hence the eigenvalues of Ω′ also lie between two fixed positive numbers. Now h(q,q′) < δ0 also implies that
which implies that
Hence \(\|\mu -\mu ^{\prime }\|^{2}\le \frac {1}{2} c_{0} h^{2}(q,q^{\prime })\). This gives the desired bound. □
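A quick numerical check of the two-sided comparison in Lemma 1 is possible because the Hellinger distance between two Gaussians has a standard closed form via the Bhattacharyya coefficient. The sketch below (names are illustrative) computes \(h^2(q,q^{\prime})\), which can be compared against \(\|\mu-\mu^{\prime}\|^{2}+\|{\Omega}-{\Omega}^{\prime}\|_{F}^{2}\):

```python
import numpy as np

def hellinger_sq(mu1, cov1, mu2, cov2):
    """Closed-form squared Hellinger distance between N(mu1, cov1) and
    N(mu2, cov2): 1 minus the Bhattacharyya coefficient."""
    avg = (cov1 + cov2) / 2.0
    d = np.asarray(mu1) - np.asarray(mu2)
    bc = ((np.linalg.det(cov1) * np.linalg.det(cov2)) ** 0.25
          / np.sqrt(np.linalg.det(avg))
          * np.exp(-d @ np.linalg.solve(avg, d) / 8.0))
    return 1.0 - bc
```

For identical parameters the distance is zero, and for a small mean perturbation \(h^2\) is of order \(\|\mu-\mu^{\prime}\|^{2}/8\), consistent with the first inequality of the lemma.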
Proof of Theorem 2.
To obtain the posterior contraction rate 𝜖n of (π,μ0,μ1,Ω0,Ω1) at \((\pi ^{*},\mu _{0}^{*},\mu _{1}^{*},{\Omega }_{0}^{*}, {\Omega }_{1}^{*})\), we apply the general theory of posterior contraction rate described in Ghosal and van der Vaart (2017), by executing the steps stated below.
-
(i)
Find a sieve \(\mathcal {P}_{n}\subset \mathcal {P}\) such that \({\Pi }(\mathcal {P}_{n}^{c})\le e^{-M n{\epsilon _{n}^{2}}}\) for a sufficiently large constant M > 0 and the 𝜖n-metric entropy of \(\mathcal {P}_{n}\) is bounded by a constant multiple of \(n{\epsilon _{n}^{2}}\).
-
(ii)
Show that the prior probability of an \({\epsilon _{n}^{2}}\)-neighborhood of the true density in the Kullback-Leibler sense is at least \(e^{-c n{\epsilon _{n}^{2}}}\) for some constant c > 0.
This gives the posterior contraction rate 𝜖n in terms of the Hellinger distance on the densities, which can be converted to the Euclidean distance on the mean and Frobenius distance on the precision matrix in view of Lemma 1.
Let \(q_{\mu _{0},{\Omega }_{0}}\) and \(q_{\mu _{1},{\Omega }_{1}}\) denote the two possible densities of an observation conditional on the classification information. The learning of π is solely based on N, and the posterior for π is clearly concentrating at its true value π∗ at rate \(1/\sqrt {n}\). Thus, to simplify notation, in the remaining analysis we may treat π as given to be π∗ and establish posterior concentration for (μ0,μ1,Ω0,Ω1) based on N samples from Class 1 and n − N from Class 0. Then the average squared Hellinger distance is expressed as
As N/n → π∗ almost surely and 0 < π∗ < 1, it suffices to establish posterior concentration for (μ0,Ω0) and (μ1,Ω1) separately. Therefore we use a generic notation (μ,Ω) to denote either, and we establish the rate of posterior contraction for μ and Ω here. The analysis of posterior concentration is similar to that of Theorem 3.1 of Banerjee and Ghosal (2015) with the difference being that there is an extra mean parameter. Moreover, unlike them we use a prior on the precision matrix through its Cholesky decomposition, and the sparsity of off-diagonal entries is only approximate since we do not use point-mass priors. However, posterior contraction for a point mass prior can be recovered from our results since it corresponds to the limiting case v0 → 0 in the spike-and-slab prior.
Observe that for (μ,L,D),(μ′,L′,D′) with all entries of μ,μ′,L,L′,D,D′ bounded by B, for Ω = LDLT, Ω′ = L′D′L′T, we can write \(\|{\Omega }-{\Omega }^{\prime }\|_{F}^{2}\) as
which can be bounded as
Hence if ∥D − D′∥∞≤ 𝜖n/(3B2p2 + ν), ∥L − L′∥∞≤ 𝜖n/(3B2p2 + ν), where ∥⋅∥∞ stands for the maximum norm for a vector or matrix, then
Further, \(\|{\Omega }\|_{S}\leq \text {tr}({\Omega })\leq p^{2}B^{2}\leq p^{\nu }\) without loss of generality, by increasing ν if necessary. Define the effective edges to be the set \(\{(i,j): |l_{ij}|>\epsilon _{n}/p^{\nu }, i>j\}\). Consider the sieve \(\mathcal {P}_{n}\) consisting of L with at most \(\bar {r}\) effective edges, where \(\bar {r}\) is a sufficiently large multiple of \(n{\epsilon _{n}^{2}}/\log n\), and with each entry of μ, D and L bounded by \(B\in [b_{1}n{\epsilon _{n}^{2}},b_{1}n{\epsilon _{n}^{2}}+ 1]\) in absolute value. Then the \(\epsilon _{n}/p^{\nu }\)-metric entropy of \(\mathcal {P}_{n}\) with respect to the Euclidean norm for μ and Frobenius norm for D and L is given by
where in the left part of the inequality, the first \(({B\sqrt {p}}/{\epsilon _{n}})^{p}\) is from the mean parameter, the second is from the off-diagonal entries of L and the last is from the diagonal elements of D. Note that for a component with span at most 𝜖n/pν, only one point is needed for a covering. In view of Lemma 1, the Hellinger distance between the corresponding densities is bounded by 𝜖n since max{∥Ω∥S,∥Ω′∥S,∥Ω− 1∥S,∥Ω′− 1∥S}≤ pν. Thus the entropy is bounded by a multiple of
For our choice of \(\bar {r}\) and B, this shows that the metric entropy is bounded by a constant multiple of \(n{\epsilon _{n}^{2}}\).
Now we bound \({\Pi }(\mathcal {P}_{n}^{c})\). Let R be the number of effective edges of L. Then \(R\sim \text {Bin}(\binom {p-1}{2},\eta _{0})\), where \(\eta _{0} ={\Pi } (|l|>\epsilon _{n}/p)\le p^{-b^{\prime }}\). From the tail estimate of the binomial distribution, \({\Pi }(R>\bar {r})\le e^{-c \bar {r}\log \bar {r}}\) for some c > 0. Using the condition on the prior and the choice of B, we have
where C can be chosen as large as we like, by simply choosing b1 large enough. This verifies the required conditions on the sieve.
Finally to check the prior concentration rate,
where \(K(p^{*},p)=\int p^{*}\log (p^{*}/p)\), \(V(p^{*},p)=\int p^{*}\{\log (p^{*}/p)\}^{2}\). Note that for X ∼Np(μ,Σ) and a p × p symmetric matrix A, we have
We use the above result to find the expressions for K(p∗,p) and V (p∗,p). Denoting the eigenvalues of the matrix Ω∗− 1/2ΩΩ∗− 1/2 by λi, i = 1,…,p, we have K(p∗,p) given by
Now \({\sum }_{i = 1}^{p} (1-\lambda _{i})^{2} =\|I-{{\Omega }^{*}}^{-1/2}{\Omega }{\Omega }^{*-1/2}\|_{F}^{2}\), so if ∥I −Ω∗− 1/2ΩΩ∗− 1/2∥F is sufficiently small, then maxi|1 − λi| < 1, and hence \({\sum }_{i = 1}^{p} (1-\lambda _{i}-\log \lambda _{i})\lesssim {\sum }_{i = 1}(1-\lambda _{i})^{2}\), leading to the relation
Also, V (p∗,p) is given by
By the assumption that Ω∗− 1 has bounded spectral norm, we have
implying that for some sufficiently small constant c > 0,
Furthermore, we have
Since the Frobenius norm dominates the spectral norm and ∥D∗∥S and ∥L∗∥S are bounded by some constant, so are ∥D∥S and ∥L∥S if ∥D − D∗∥F and ∥L − L∗∥F are small. Thus for some sufficiently small constant c′ > 0,
In the actual prior, we constrain L and D such that Ω− 1 has spectral norm bounded by B′, but such a constraint can only increase the probability of the Kullback-Leibler neighborhood of the true density since Ω∗ satisfies the required constraints. Therefore, we may pretend that the components of L and D are independently distributed. Then the above expression simplifies in terms of products of marginal probabilities. Let ηi = Π(|lij| > 𝜖n/p) be the probability that an element in the i th row of L is non-zero, and \(\zeta _{i}={\Pi }(| l_{ij}-l_{ij}^{*}| <c_{2}\epsilon _{n}/p)\) be the probability that this element is in the neighborhood of its true value when \(l_{ij}^{*}\ne 0\). Then
where si is the number of non-zero elements in the i th row of L∗. Note that s1 + ⋯ + sp = s. From the assumptions, \(\eta _{i}\le {p^{-b^{\prime }}}/\sqrt {i}\le p^{-b^{\prime }-1/2}\) for some b′ > 2 and \(\zeta _{i} ={\Pi } (| l_{ij}-l_{ij}^{*}| <c_{2}\epsilon _{n}/p \mid |l_{ij}|>\epsilon _{n}/p )\, {\Pi } (|l_{ij}|>\epsilon _{n}/p ) \ge \epsilon _{n} p^{-c^{\prime }}\) for some c′ > 0. Therefore, the lower bound for the above probability is given by
Thus we have \({\Pi }\{\| L-L^{*}\|_{\infty }\leq c^{\prime }\epsilon _{n}/p\}\gtrsim (c^{\prime }\epsilon _{n}/p)^{s}.\) Similarly, we have \({\Pi }\{\| D-D^{*}\|_{\infty }\leq c^{\prime }\epsilon _{n}/p\}\gtrsim (c^{\prime }\epsilon _{n}/p)^{p}\) and \({\Pi }\{\| \mu -\mu ^{*}\|_{\infty }\leq c^{\prime }\epsilon _{n}/p\}\gtrsim (c^{\prime }\epsilon _{n}/p)^{p}\). Hence, the prior concentration rate condition holds as
for the choice 𝜖n = n− 1/2(p + s)1/2(log n)1/2.
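To see where this choice of \(\epsilon_n\) comes from, note that the prior-mass bound just derived is of order \((\epsilon_n/p)^{c(p+s)}\) for a generic constant c; matching it against the required level \(e^{-cn\epsilon_n^2}\), and using that \(\log(p/\epsilon_n)\) is of order \(\log n\) under the assumed growth of p (a simplification on our part), gives

```latex
(\epsilon_n/p)^{c(p+s)} = \exp\{-c(p+s)\log(p/\epsilon_n)\}
  \ \ge\ e^{-c' n\epsilon_n^{2}}
\quad\text{whenever}\quad n\epsilon_n^{2} \gtrsim (p+s)\log n,
```

and the smallest such rate, up to constants, is \(\epsilon_n = n^{-1/2}(p+s)^{1/2}(\log n)^{1/2}\).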
For LDA, the proof uses a similar idea but is notationally slightly more complicated because the same parameter Ω is shared by two groups and the two groups of observations need to be considered together. In this case, the X-observations are not i.i.d., so we work with the average squared Hellinger distance (7.4) and dominate that by distances on μ0,μ1,Ω. The Kullback-Leibler divergences can also be bounded similarly. Then entropy and prior probability estimates of the same nature are established analogously. □
Proof of Theorem 3.
We first consider the case of LDA. The misclassification rate can be written as
Let
Then
We have
We obtain that
and
Let \(\bar {R}\) be the number of effective edges in L from \(\mathcal {P}_{n}\). From the proof of Theorem 2 it follows that the number of effective edges in Ω is bounded by \(\bar {R}\) and, by the choice of \(\mathcal {P}_{n}\), with posterior probability tending to one in probability, \(\bar {R} \lesssim n{\epsilon _{n}^{2}}/\log n=O(p+s)\). Let A = Ω −Ω∗; then the number of non-zero elements in A is also O(p + s) with posterior probability tending to one in probability. Thus,
which is \(O(n^{-1/2}(p+s)\sqrt {\log p})\) because
Similarly, it follows that
Thus, \(| {C_{2}^{2}}-C_{2}^{*2}|=O\left (n^{-1/2}(p+s)\sqrt {\log p}\right )=o(1)\). Hence by Assumption (A4), \({C_{2}^{2}}\geq M_{0}>0\). Therefore,
We know that
Similar to the proof above,
which is \( O(n^{-1/2} (p+s)\sqrt {\log p})\). Also,
Therefore, we have that
In addition,
Therefore, \(\max \{| C_{1}-C_{1}^{*}|,| C_{0}-C_{0}^{*}|\} =O(n^{-1/2}(p+s)\sqrt {\log p})\to 0\) under the condition that (p + s)2 log p = o(n).
For QDA, the misclassification rate does not have an explicit expression but (2.5) leads to upper bounds of similar nature. □
Du, X., Ghosal, S. Bayesian Discriminant Analysis Using a High Dimensional Predictor. Sankhya A 80 (Suppl 1), 112–145 (2018). https://doi.org/10.1007/s13171-018-0140-z