Abstract
We consider the problem of Bayesian discriminant analysis using a high dimensional predictor. In this setting, the underlying precision matrices can be estimated with reasonable accuracy only if some appropriate additional structure like sparsity is assumed. We induce a prior on the precision matrix through a sparse prior on its Cholesky decomposition. For computational ease, we use shrinkage priors to induce sparsity on the off-diagonal entries of the Cholesky decomposition matrix and exploit certain conditional conjugacy structure. We obtain the contraction rate of the posterior distribution for the mean and the precision matrix respectively using the Euclidean and the Frobenius distance, and show that under a mild restriction on the growth of the dimension, the misclassification probability of the Bayesian classification procedure converges to that of the oracle classifier for both linear and quadratic discriminant analysis. Extensive simulations show that the proposed Bayesian methods perform very well. An application to identifying cancerous breast tumors based on image data obtained using fine needle aspirates is considered.
References
Banerjee, S. and Ghosal, S. (2014). Posterior convergence rates for estimating large precision matrices using graphical models. Electronic Journal of Statistics 8, 2, 2111–2137.
Banerjee, S. and Ghosal, S. (2015). Bayesian structure learning in graphical models. Journal of Multivariate Analysis 136, 147–162.
Bhattacharya, A., Pati, D., Pillai, N. S. and Dunson, D. B. (2015). Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association 110, 512, 1479–1490.
Bhadra, A., Datta, J., Polson, N. G. and Willard, B. (2017). The horseshoe+ estimator of ultra-sparse signals. Bayesian Analysis 12, 4, 1105–1131.
Bickel, P. J. and Levina, E. (2008a). Covariance regularization by thresholding. The Annals of Statistics 36, 6, 2577–2604.
Bickel, P. J. and Levina, E. (2008b). Regularized estimation of large covariance matrices. The Annals of Statistics 36, 1, 199–227.
Cai, T. T., Zhang, C. H. and Zhou, H. H. (2010). Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics 38, 4, 2118–2144.
Cai, T., Liu, W. and Luo, X. (2011). A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association 106, 494, 594–607.
Carvalho, C. M., Polson, N. G. and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika 97, 2, 465–480.
Carvalho, C. M., Polson, N. G. and Scott, J. G. (2009). Handling sparsity via the horseshoe. Artificial Intelligence and Statistics, p. 73–80.
Du, X. and Ghosal, S. (2017). Multivariate Gaussian network structure learning. Journal of Statistical Planning and Inference (to appear).
Fan, J. and Fan, X. (2008). High dimensional classification using features annealed independence rules. The Annals of Statistics 36, 6, 2605–2637.
Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 3, 432–441.
George, E. I. and McCulloch, R.E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association 88, 423, 881–889.
Ghosal, S. and van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference, 44. Cambridge University Press, Cambridge.
Griffin, J. E. and Brown, P. J. (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis 5, 1, 171–188.
Huang, J. Z., Liu, N., Pourahmadi, M. and Liu, L. (2006). Covariance matrix selection and estimation via penalised normal likelihood. Biometrika 93, 1, 85–98.
Ishwaran, H. and Rao, J. S. (2005). Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics 33, 2, 730–773.
Izenman, A. J. (2008). Modern Multivariate Statistical Techniques. Regression, Classification and Manifold Learning. Springer texts in statistics, Springer-Verlag, New York.
Khare, K., Oh, S. -Y. and Rajaratnam, B. (2015). A convex pseudolikelihood framework for high dimensional partial correlation estimation with convergence guarantees. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 77, 4, 803–825.
Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88, 2, 365–411.
Liu, H., Lafferty, J. and Wasserman, L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research 10, Oct, 2295–2328.
Mahalanobis, P. C. (1925). Analysis of race-mixture in Bengal. Proceedings of the Indian Science Congress.
Mahalanobis, P. C. (1928). Statistical study of the Chinese head. Man in India 8, 107–122.
Mahalanobis, P. C. (1930). A statistical study of certain anthropometric measurements from Sweden. Biometrika 22, 94–108.
Mahalanobis, P. (1930). On test and measures of group divergence. Journal of Asiatic Society Bengal 26, 541–588.
Mahalanobis, P. C. (1931). Anthropological observations on Anglo-Indians of Calcutta, Part II: Analysis of Anglo-India head length. Rec. Indian Museum, 23.
Mahalanobis, P. (1936). On the generalized distance in statistics. Proceedings of the National Institute of Science, India 2, 49–55.
Mahalanobis, P. C. (1949). Historical note on the D2-statistic. Appendix I: Anthropometric survey of the United Provinces, 1941: a statistical study. Sankhyā: The Indian Journal of Statistics 9, 237–239.
Mahalanobis, P. C., Majumdar, D. N., Yeatts, M. W. M. and Rao, C. R. (1949). Anthropometric survey of the United Provinces, 1941: a statistical study. Sankhyā: The Indian Journal of Statistics 3, 1, 89–324.
Majumdar, D. N., Rao, C. R. and Mahalanobis, P. C. (1958). Bengal anthropometric survey, 1945: A statistical study. Sankhyā: The Indian Journal of Statistics 19, 201–408.
Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 1436–1462.
Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association 83, 404, 1023–1032.
Mulgrave, J. J. and Ghosal, S. (2018). Bayesian inference in nonparanormal graphical models. arXiv:1806.04334.
Peng, J., Wang, P., Zhou, N. and Zhu, J. (2012). Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association 104, 486, 735–746.
Ročková, V. and George, E. I. (2014). EMVS: The EM approach to Bayesian variable selection. Journal of the American Statistical Association 109, 506, 828–846.
Wang, H. (2010). Bayesian graphical lasso models and efficient posterior computation. Bayesian Analysis 7, 4, 867–886.
Wei, R. and Ghosal, S. (2017). Contraction properties of shrinkage priors in logistic regression. Preprint at http://www4.stat.ncsu.edu/ghoshal/papers.
Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94, 1, 19–35.
Research of the second author is partially supported by NSF grant DMS-1510238.
Appendix
Proof of Proposition 1.
The (i,j)th element of a precision matrix Ω = LDLT is nonzero if and only if \({\sum }_{k = 1}^{\min (i,j)}l_{ik}l_{jk}\neq 0\). Now
Suppose that i and j are of the same order. Then the above equation can be written as
In order to assume the same sparsity for each i, we need \(i{\pi _{i}^{2}}\) to remain constant in i, that is, \(\pi _{i}\sim {C_{p}}/{\sqrt {i}}\). □
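As a numerical sanity check of this scaling, the sketch below (the names and the independence assumption are ours, not from the paper) treats each entry \(l_{ik}\) in row i as nonzero independently with probability \(\pi_i = C/\sqrt{i}\) and evaluates the induced probability that ω<sub>ij</sub> is nonzero; along the diagonal i = j it stays essentially constant at about \(C^2\), as the proposition requires.

```python
import numpy as np

def nonzero_prob(i, j, C=0.1):
    """Probability that omega_ij != 0 when each l_ik (k <= min(i, j)) is
    nonzero independently with probability pi_i = C / sqrt(i): at least
    one index k must have both l_ik and l_jk nonzero."""
    pi_i, pi_j = C / np.sqrt(i), C / np.sqrt(j)
    return 1.0 - (1.0 - pi_i * pi_j) ** min(i, j)

# Along the diagonal i = j the probability is ~ C^2, independent of i.
probs = [nonzero_prob(i, i) for i in (10, 100, 1000)]
```

For small C the exact probability \(1-(1-\pi_i\pi_j)^{\min(i,j)}\) is close to its first-order approximation \(\min(i,j)\,\pi_i\pi_j\), which is exactly the quantity balanced in the proof.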
Proof of Theorem 1.
The proof of the theorem uses an argument similar to that used in the proof of the Neyman-Pearson lemma. Observe that
Clearly
if and only if πp1(X) > (1 − π)p0(X). Since in this case ϕ∗(X) = 1 ≥ ϕ(X), we have
On the other hand, when the expression in (7.2) is less than or equal to 1/2, ϕ∗(X) = 0 ≤ ϕ(X), again leading to (7.3). Therefore, r(ϕ∗) ≤ r(ϕ).
Alternatively, we can view ϕ∗ as the Bayes decision rule for parameter space (p1,p0), action space (1,0) and prior distribution (π,1 − π) based on observation X and r(⋅) as the Bayes risk, which is minimized by the Bayes rule ϕ∗. □
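The rule ϕ∗ in this proof is a likelihood-ratio threshold and can be sketched directly in code. The following is a minimal illustration (the function name and test values are ours, not from the paper); it covers QDA, and reduces to LDA when the two precision matrices coincide.

```python
import numpy as np
from scipy.stats import multivariate_normal

def oracle_rule(x, pi, mu0, omega0, mu1, omega1):
    """Oracle Bayes classifier phi*: assign class 1 exactly when
    pi * p1(x) > (1 - pi) * p0(x), where p_k is the density of
    N(mu_k, Omega_k^{-1})."""
    p0 = multivariate_normal.pdf(x, mean=mu0, cov=np.linalg.inv(omega0))
    p1 = multivariate_normal.pdf(x, mean=mu1, cov=np.linalg.inv(omega1))
    return int(pi * p1 > (1 - pi) * p0)
```

With equal class probabilities and a common precision matrix, this threshold reduces to comparing Mahalanobis distances to the two class means.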
To prove Theorem 2, we need the following lemma, which is an extension of Lemma A.1 of Banerjee and Ghosal (2015) to accommodate non-zero means.
Lemma 1.
Let q and q′ respectively denote the densities of Nm(μ,Ω− 1) and Nm(μ′,Ω′− 1), and let \(h(q,q^{\prime })=\|\sqrt {q}-\sqrt {q^{\prime }}\|_{2}\) stand for their Hellinger distance, where the dimension m is potentially growing but the eigenvalues of Ω remain bounded between two fixed positive numbers. Then there exist positive constants c0 (depending on Ω) and δ0 such that \( h^{2}(q,q^{\prime })\leq c_{0}\left (\|\mu -\mu ^{\prime }\|^{2}+\|{\Omega }-{\Omega }^{\prime }{\|_{F}^{2}}\right )\), and if h(q,q′) < δ0, then \( \left (\|\mu -\mu ^{\prime }\|^{2}+\|{\Omega }-{\Omega }^{\prime }{\|_{F}^{2}}\right )\leq c_{0} h^{2}(q,q^{\prime })\). Moreover, c0 can be taken to be a constant multiple of max{∥Ω∥S,∥Ω− 1∥S}.
Proof.
Let λi, i = 1,⋯ ,m, be the eigenvalues of the matrix A = Ω− 1/2Ω′Ω− 1/2. Then the squared Hellinger distance h2(q,q′) between q and q′ is given by
Clearly the second term is bounded by a multiple of ∥μ − μ′∥2. The first term has been bounded by a multiple of \(\|{\Omega }-{\Omega }^{\prime }\|_{F}^{2}\) in Lemma A.1 (ii) of Banerjee and Ghosal (2015).
For the converse, observe that \(1-2^{m} {\prod }_{i = 1}^{m} (\lambda _{i}^{1/2}+\lambda _{i}^{-1/2})^{-1/2}\) is bounded by h2(q,q′). Let δ = h(q,q′) < δ0 for a sufficiently small δ0, so that by arguments used in the proof of Part (ii) of Lemma A.1 of Banerjee and Ghosal (2015), it follows that \(\|{\Omega }-{\Omega }^{\prime }\|_{F}^{2}\le \frac {1}{2} c_{0} [1-2^{m} {\prod }_{i = 1}^{m} (\lambda _{i}^{1/2}+\lambda _{i}^{-1/2})^{-1/2}]\le c_{0} h^{2}(q,q^{\prime })\) for some constant c0 > 0. As the Frobenius norm dominates the spectral norm, this in particular implies that ∥Ω−Ω′∥S is also small, and hence the eigenvalues of Ω′ also lie between two fixed positive numbers. Now h(q,q′) < δ0 also implies that
which implies that
Hence \(\|\mu -\mu ^{\prime }\|^{2}\le \frac {1}{2} c_{0} h^{2}(q,q^{\prime })\). This gives the desired bound. □
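A quick numerical check of the two-sided comparison in Lemma 1 is possible because the Hellinger distance between two Gaussians has a standard closed form via the Bhattacharyya coefficient. The sketch below (names are illustrative) computes \(h^2(q,q^{\prime})\), which can be compared against \(\|\mu-\mu^{\prime}\|^{2}+\|{\Omega}-{\Omega}^{\prime}\|_{F}^{2}\):

```python
import numpy as np

def hellinger_sq(mu1, cov1, mu2, cov2):
    """Closed-form squared Hellinger distance between N(mu1, cov1) and
    N(mu2, cov2): 1 minus the Bhattacharyya coefficient."""
    avg = (cov1 + cov2) / 2.0
    d = np.asarray(mu1) - np.asarray(mu2)
    bc = ((np.linalg.det(cov1) * np.linalg.det(cov2)) ** 0.25
          / np.sqrt(np.linalg.det(avg))
          * np.exp(-d @ np.linalg.solve(avg, d) / 8.0))
    return 1.0 - bc
```

For identical parameters the distance is zero, and for a small mean perturbation \(h^2\) is of order \(\|\mu-\mu^{\prime}\|^{2}/8\), consistent with the first inequality of the lemma.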
Proof of Theorem 2.
To obtain the posterior contraction rate 𝜖n of (π,μ0,μ1,Ω0,Ω1) at \((\pi ^{*},\mu _{0}^{*},\mu _{1}^{*},{\Omega }_{0}^{*}, {\Omega }_{1}^{*})\), we apply the general theory of posterior contraction rate described in Ghosal and van der Vaart (2017), by executing the steps stated below.
-
(i)
Find a sieve \(\mathcal {P}_{n}\subset \mathcal {P}\) such that \({\Pi }(\mathcal {P}_{n}^{c})\le e^{-M n{\epsilon _{n}^{2}}}\) for a sufficiently large constant M > 0 and the 𝜖n-metric entropy of \(\mathcal {P}_{n}\) is bounded by a constant multiple of \(n{\epsilon _{n}^{2}}\).
-
(ii)
Show that the prior probability of an \({\epsilon _{n}^{2}}\)-neighborhood of the true density in the Kullback-Leibler sense is at least \(e^{-c n{\epsilon _{n}^{2}}}\) for some constant c > 0.
This gives the posterior contraction rate 𝜖n in terms of the Hellinger distance on the densities, which can be converted to the Euclidean distance on the mean and Frobenius distance on the precision matrix in view of Lemma 1.
Let \(q_{\mu _{0},{\Omega }_{0}}\) and \(q_{\mu _{1},{\Omega }_{1}}\) denote the two possible densities of an observation conditional on the classification information. The learning of π is solely based on N, and the posterior for π is clearly concentrating at its true value π∗ at rate \(1/\sqrt {n}\). Thus, to simplify notation, in the remaining analysis we may treat π as given to be π∗ and establish posterior concentration for (μ0,μ1,Ω0,Ω1) based on N samples from Class 1 and n − N from Class 0. Then the average squared Hellinger distance is expressed as
As N/n → π∗ almost surely and 0 < π∗ < 1, it suffices to establish posterior concentration for (μ0,Ω0) and (μ1,Ω1) separately. Therefore we use a generic notation (μ,Ω) to denote either, and we establish the rate of posterior contraction for μ and Ω here. The analysis of posterior concentration is similar to that of Theorem 3.1 of Banerjee and Ghosal (2015) with the difference being that there is an extra mean parameter. Moreover, unlike them we use a prior on the precision matrix through its Cholesky decomposition, and the sparsity of off-diagonal entries is only approximate since we do not use point-mass priors. However, posterior contraction for a point mass prior can be recovered from our results since it corresponds to the limiting case v0 → 0 in the spike-and-slab prior.
Observe that for (μ,L,D),(μ′,L′,D′) with all entries of μ,μ′,L,L′,D,D′ bounded by B, for Ω = LDLT, Ω′ = L′D′L′T, we can write \(\|{\Omega }-{\Omega }^{\prime }\|_{F}^{2}\) as
which can be bounded as
Hence if ∥D − D′∥∞≤ 𝜖n/(3B2p2 + ν), ∥L − L′∥∞≤ 𝜖n/(3B2p2 + ν), where ∥⋅∥∞ stands for the maximum norm for a vector or matrix, then
Further, \(\|{\Omega }\|_{S}\leq \text {tr}({\Omega })\leq p^{2}B^{2}\leq p^{\nu }\) without loss of generality, by increasing ν if necessary. Define the effective edges to be the set \(\{(i,j): |l_{ij}|>\epsilon _{n}/p^{\nu }, i>j\}\). Consider the sieve \(\mathcal {P}_{n}\) consisting of L with at most \(\bar {r}\) effective edges, where \(\bar {r}\) is a sufficiently large multiple of \(n{\epsilon _{n}^{2}}/\log n\), and with each entry of μ, D and L bounded by \(B\in [b_{1}n{\epsilon _{n}^{2}},b_{1}n{\epsilon _{n}^{2}}+ 1]\) in absolute value. Then the \(\epsilon _{n}/p^{\nu }\)-metric entropy of \(\mathcal {P}_{n}\) with respect to the Euclidean norm for μ and Frobenius norm for D and L is given by
where in the left part of the inequality, the first \(({B\sqrt {p}}/{\epsilon _{n}})^{p}\) is from the mean parameter, the second is from the off-diagonal entries of L and the last is from the diagonal elements of D. Note that for a component with span at most 𝜖n/pν, only one point is needed for a covering. In view of Lemma 1, the Hellinger distance between the corresponding densities is bounded by 𝜖n since max{∥Ω∥S,∥Ω′∥S,∥Ω− 1∥S,∥Ω′− 1∥S}≤ pν. Thus the entropy is bounded by a multiple of
For our choice of \(\bar {r}\) and B, this shows that the metric entropy is bounded by a constant multiple of \(n{\epsilon _{n}^{2}}\).
Now we bound \({\Pi }(\mathcal {P}_{n}^{c})\). Let R be the number of effective edges of L. Then \(R\sim \text {Bin}(\binom {p-1}{2},\eta _{0})\), where \(\eta _{0} ={\Pi } (|l|>\epsilon _{n}/p)\le p^{-b^{\prime }}\). From the tail estimate of the binomial distribution, \({\Pi }(R>\bar {r})\le e^{-c \bar {r}\log \bar {r}}\) for some c > 0. Using the condition on the prior and the choice of B, we have
where C can be chosen as large as we like, by simply choosing b1 large enough. This verifies the required conditions on the sieve.
Finally to check the prior concentration rate,
where \(K(p^{*},p)=\int p^{*}\log (p^{*}/p)\), \(V(p^{*},p)=\int p^{*}\{\log (p^{*}/p)\}^{2}\). Note that for X ∼Np(μ,Σ) and a p × p symmetric matrix A, we have
We use the above result to find the expressions for K(p∗,p) and V (p∗,p). Denoting the eigenvalues of the matrix Ω∗− 1/2ΩΩ∗− 1/2 by λi, i = 1,…,p, we have K(p∗,p) given by
Now \({\sum }_{i = 1}^{p} (1-\lambda _{i})^{2} =\|I-{{\Omega }^{*}}^{-1/2}{\Omega }{\Omega }^{*-1/2}\|_{F}^{2}\), so if ∥I −Ω∗− 1/2ΩΩ∗− 1/2∥F is sufficiently small, then maxi|1 − λi| < 1, and hence \({\sum }_{i = 1}^{p} (1-\lambda _{i}-\log \lambda _{i})\lesssim {\sum }_{i = 1}(1-\lambda _{i})^{2}\), leading to the relation
Also, V (p∗,p) is given by
By the assumption that Ω∗− 1 has bounded spectral norm, we have
implying that for some sufficiently small constant c > 0,
Furthermore, we have
Since the Frobenius norm dominates the spectral norm and ∥D∗∥S and ∥L∗∥S are bounded by some constant, so are ∥D∥S and ∥L∥S if ∥D − D∗∥F and ∥L − L∗∥F are small. Thus for some sufficiently small constant c′ > 0,
In the actual prior, we constrain L and D such that Ω− 1 has spectral norm bounded by B′, but such a constraint can only increase the probability of the Kullback-Leibler neighborhood of the true density since Ω∗ satisfies the required constraints. Therefore, we may pretend that the components of L and D are independently distributed. Then the above expression simplifies in terms of products of marginal probabilities. Let ηi = Π(|lij| > 𝜖n/p) be the probability that an element in the i th row of L is non-zero, and \(\zeta _{i}={\Pi }(| l_{ij}-l_{ij}^{*}| <c_{2}\epsilon _{n}/p)\) be the probability that this element is in the neighborhood of its true value when \(l_{ij}^{*}\ne 0\). Then
where si is the number of non-zero elements in the i th row of L∗. Note that s1 + ⋯ + sp = s. From the assumptions, \(\eta _{i}\le {p^{-b^{\prime }}}/\sqrt {i}\le p^{-b^{\prime }-1/2}\) for some b′ > 2 and \(\zeta _{i} ={\Pi } (| l_{ij}-l_{ij}^{*}| <c_{2}\epsilon _{n}/p \mid |l_{ij}|>\epsilon _{n}/p )\, {\Pi } (|l_{ij}|>\epsilon _{n}/p ) \ge \epsilon _{n} p^{-c^{\prime }}\) for some c′ > 0. Therefore, the lower bound for the above probability is given by
Thus we have \({\Pi }\{\| L-L^{*}\|_{\infty }\leq c^{\prime }\epsilon _{n}/p\}\gtrsim (c^{\prime }\epsilon _{n}/p)^{s}.\) Similarly, we have \({\Pi }\{\| D-D^{*}\|_{\infty }\leq c^{\prime }\epsilon _{n}/p\}\gtrsim (c^{\prime }\epsilon _{n}/p)^{p}\) and \({\Pi }\{\| \mu -\mu ^{*}\|_{\infty }\leq c^{\prime }\epsilon _{n}/p\}\gtrsim (c^{\prime }\epsilon _{n}/p)^{p}\). Hence, the prior concentration rate condition holds as
for the choice 𝜖n = n− 1/2(p + s)1/2(log n)1/2.
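To see where this choice of \(\epsilon_n\) comes from, note that the prior-mass bound just derived is of order \((\epsilon_n/p)^{c(p+s)}\) for a generic constant c; matching it against the required level \(e^{-cn\epsilon_n^2}\), and using that \(\log(p/\epsilon_n)\) is of order \(\log n\) under the assumed growth of p (a simplification on our part), gives

```latex
(\epsilon_n/p)^{c(p+s)} = \exp\{-c(p+s)\log(p/\epsilon_n)\}
  \ \ge\ e^{-c' n\epsilon_n^{2}}
\quad\text{whenever}\quad n\epsilon_n^{2} \gtrsim (p+s)\log n,
```

and the smallest such rate, up to constants, is \(\epsilon_n = n^{-1/2}(p+s)^{1/2}(\log n)^{1/2}\).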
For LDA, the proof uses a similar idea but is notationally slightly more complicated because the same parameter Ω is shared by two groups and the two groups of observations need to be considered together. In this case, the X-observations are not i.i.d., so we work with the average squared Hellinger distance (7.4) and dominate that by distances on μ0,μ1,Ω. The Kullback-Leibler divergences can also be bounded similarly. Then entropy and prior probability estimates of the same nature are established analogously. □
Proof of Theorem 3.
We first consider the case of LDA. The misclassification rate can be written as
Let
Then
We have
We obtain that
and
Let \(\bar {R}\) be the number of effective edges in L from \(\mathcal {P}_{n}\). From the proof of Theorem 2 it follows that the number of effective edges in Ω is bounded by \(\bar {R}\) and, by the choice of \(\mathcal {P}_{n}\), with posterior probability tending to one in probability, \(\bar {R} \lesssim n{\epsilon _{n}^{2}}/\log n=O(p+s)\). Let A = Ω −Ω∗; then the number of non-zero elements in A is also O(p + s) with posterior probability tending to one in probability. Thus,
which is \(O(n^{-1/2}(p+s)\sqrt {\log p})\) because
Similarly, it follows that
Thus, \(| {C_{2}^{2}}-C_{2}^{*2}|=O\left (n^{-1/2}(p+s)\sqrt {\log p}\right )=o(1)\). Hence by Assumption (A4), \({C_{2}^{2}}\geq M_{0}>0\). Therefore,
We know that
Similar to the proof above,
which is \( O(n^{-1/2} (p+s)\sqrt {\log p})\). Also,
Therefore, we have that
In addition,
Therefore, \(\max \{| C_{1}-C_{1}^{*}|,| C_{0}-C_{0}^{*}|\} =O(n^{-1/2}(p+s)\sqrt {\log p})\to 0\) under the condition that (p + s)2 log p = o(n).
For QDA, the misclassification rate does not have an explicit expression but (2.5) leads to upper bounds of similar nature. □
Du, X., Ghosal, S. Bayesian Discriminant Analysis Using a High Dimensional Predictor. Sankhya A 80 (Suppl 1), 112–145 (2018). https://doi.org/10.1007/s13171-018-0140-z