Abstract
Shrinkage priors have enjoyed great success in many data analyses; however, their applications have mostly focused on Bayesian modeling of sparse parameters. In this work, we apply Bayesian shrinkage to model a high-dimensional parameter that possesses an unknown blocking structure. We propose to impose a heavy-tailed shrinkage prior, e.g., a t prior, on the differences of successive parameter entries; such a fusion prior shrinks successive differences toward zero and hence induces posterior blocking. Compared with the conventional Bayesian fused LASSO, which employs a Laplace fusion prior, the t fusion prior induces a stronger shrinkage effect and enjoys a nice posterior consistency property. Simulation studies and real data analyses show that t fusion is superior to the frequentist fusion estimator and the Bayesian Laplace fusion prior. The t fusion strategy is further developed to conduct Bayesian clustering analysis, and our simulations show that the proposed algorithm compares favorably to classical Dirichlet process modeling.
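The fusion construction described above can be sketched numerically. The following is a minimal illustration (not the authors' implementation): θ1 receives a normal prior and each successive difference receives a heavy-tailed t prior with a small scale, so prior draws are nearly piecewise constant with occasional jumps. The function name, degrees of freedom, and scale values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def t_fusion_prior_draw(n, df=1.0, scale=0.01, lambda1=1.0, rng=rng):
    """Draw one parameter vector theta from a t fusion prior sketch:
    theta_1 ~ N(0, lambda1) and each successive difference
    theta_i - theta_{i-1} follows a t distribution with a small scale,
    so most differences are shrunk near zero while occasional
    heavy-tail draws create jumps (i.e., blocks)."""
    diffs = scale * rng.standard_t(df, size=n - 1)
    theta1 = rng.normal(0.0, np.sqrt(lambda1))
    return theta1 + np.concatenate(([0.0], np.cumsum(diffs)))

theta = t_fusion_prior_draw(200)
# Most successive differences are tiny; a few are large (heavy tails).
d = np.abs(np.diff(theta))
print(np.mean(d < 0.05), np.max(d))
```

With df = 1 (Cauchy increments) most |differences| fall well below the jump scale, which is the prior-level analogue of the posterior blocking discussed above.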
References
Andrews, D.F. and Mallows, C.L. (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B (Methodological), 99–102.
Barron, A. (1998). Information-theoretic characterization of Bayes performance and the choice of priors in parametric and nonparametric problems. In J.M. Bernardo, J. Berger, A. Dawid, A. Smith, eds. Bayesian Statistics 6, 27–52.
Berger, J.O., Wang, X. and Shen, L. (2014). A Bayesian approach to subgroup identification. Journal of Biopharmaceutical Statistics 24, 1, 110–129.
Betancourt, B., Rodríguez, A. and Boyd, N. (2017). Bayesian fused lasso regression for dynamic binary networks. Journal of Computational and Graphical Statistics 26, 4, 840–850.
Bhattacharya, A., Pati, D., Pillai, N.S. and Dunson, D.B. (2015). Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association 110, 1479–1490.
Carvalho, C.M., Polson, N.G. and Scott, J.G. (2010). The horseshoe estimator for sparse signals. Biometrika 97, 465–480.
Castillo, I. and van der Vaart, A. (2012). Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. The Annals of Statistics 40, 4, 2069–2101.
Castillo, I., Schmidt-Hieber, J. and van der Vaart, A.W. (2015). Bayesian linear regression with sparse priors. Annals of Statistics, 1986–2018.
Chen, J. and Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95, 759–771.
Chen, J. and Chen, Z. (2012). Extended BIC for small-n-large-P sparse GLM. Statistica Sinica 22, 555–574.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360.
Ghosal, S., Ghosh, J.K. and Van Der Vaart, A.W. (2000). Convergence rates of posterior distributions. Annals of Statistics 28, 2, 500–531.
Ghosal, S. and van der Vaart, A.W. (2007). Convergence rates of posterior distributions for noniid observations. Annals of Statistics 35, 1, 192–223.
Hahn, P.R. and Carvalho, C.M. (2015). Decoupling shrinkage and selection in Bayesian linear models: a posterior summary perspective. Journal of the American Statistical Association 110, 435–448.
Heller, K.A. and Ghahramani, Z. (2005). Bayesian hierarchical clustering. In Proceedings of the 22nd International Conference on Machine Learning, 297–304.
Ishwaran, H. and Rao, J.S. (2005). Spike and slab variable selection: frequentist and Bayesian strategies. Annals of Statistics, 730–773.
Jiang, W. (2007). Bayesian variable selection for high dimensional generalized linear models: Convergence rate of the fitted densities. Annals of Statistics 35, 1487–1511.
Johnson, V.E. and Rossell, D. (2012). Bayesian model selection in high-dimensional settings. Journal of the American Statistical Association 107, 649–660.
Johnstone, I.M. (2010). High dimensional Bernstein–von Mises: simple examples. Institute of Mathematical Statistics Collections 6, 87.
Ke, Z.T., Fan, J. and Wu, Y. (2015a). Homogeneity pursuit. Journal of the American Statistical Association 110, 509, 175–194.
Ke, Z.T., Fan, J. and Wu, Y. (2015b). Homogeneity pursuit. Journal of the American Statistical Association 110, 175–194.
Kleijn, B.J.K. and van der Vaart, A.W. (2006a). Misspecification in infinite-dimensional Bayesian statistics. The Annals of Statistics 34, 2, 837–877.
Kleijn, B.J.K. and van der Vaart, A.W. (2006b). Misspecification in infinite-dimensional Bayesian statistics. Annals of Statistics 34, 837–877.
Kyung, M., Gill, J., Ghosh, M. and Casella, G. (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis 5, 2, 369–411.
Laurent, B. and Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 1302–1338.
Li, H. and Pati, D. (2017). Variable selection using shrinkage priors. Computational Statistics & Data Analysis 107, 107–119.
Li, F. and Sang, H. (2018). Spatial homogeneity pursuit of regression coefficients for large datasets. Journal of the American Statistical Association, in press, 1–37.
Liang, F., Song, Q. and Yu, K. (2013). Bayesian subset modeling for high dimensional generalized linear models. Journal of the American Statistical Association 108, 589–606.
Liu, J., Yuan, L. and Ye, J. (2010). An efficient algorithm for a class of fused lasso problems. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, 323–332.
Ma, S. and Huang, J. (2017). A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association 112, 517, 410–423.
Mozeika, A. and Coolen, A. (2018). Mean-field theory of Bayesian clustering. arXiv:1709.01632.
Narisetty, N.N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing priors. The Annals of Statistics 42, 2, 789–817.
Neal, R.M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9, 2, 249–265.
Park, T. and Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association 103, 681–686.
Rinaldo, A. et al. (2009). Properties and refinements of the fused lasso. The Annals of Statistics 37, 5B, 2922–2952.
Robbins, H. (1985). An empirical bayes approach to statistics. In Herbert Robbins Selected Papers, 41–47.
Royston, J.P. (1982). Algorithm AS 177: expected normal order statistics (exact and approximate). Journal of the Royal Statistical Society, Series C (Applied Statistics) 31, 2, 161–165.
Scott, J.G. and Berger, J.O. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Annals of Statistics, 2587–2619.
Shen, X. and Huang, H.-C. (2012). Grouping pursuit through a regularization solution surface. Journal of the American Statistical Association 105, 727–739.
Shimamura, K., Ueki, M., Kawano, S. and Konishi, S. (2018). Bayesian generalized fused lasso modeling via NEG distribution. Communications in Statistics - Theory and Methods, 1–23.
Song, Q. and Liang, F. (2014). A split-and-merge bayesian variable selection approach for ultra-high dimensional regression. Journal of the Royal Statistical Society, Series B, in press.
Song, Q. and Liang, F. (2017). Nearly optimal bayesian shrinkage for high dimensional regression. arXiv:1712.08964.
Tang, X., Xu, X., Ghosh, M. and Ghosh, P. (2016). Bayesian variable selection and estimation based on global-local shrinkage priors. arXiv:1605.07981.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.
Tibshirani, R., Saunders, M., Rosset, S., Ji, Z. and Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 1, 91–108.
Tibshirani, R. and Wang, P. (2007). Spatial smoothing and hot spot detection for cgh data using the fused lasso. Biostatistics 9, 1, 18–29.
van de Geer, S. and Bühlmann, P. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics, Springer.
van der Pas, S.L., Szabo, B. and van der Vaart, A. (2017). Adaptive posterior contraction rates for the horseshoe. arXiv:1702.03698.
Wade, S. and Ghahramani, Z. (2018). Bayesian cluster analysis: Point estimation and credible balls. Bayesian Analysis 13, 559–626.
Xu, Z., Schmidt, D.F., Makalic, E., Qian, G. and Hopper, J.L. (2017). Bayesian sparse global-local shrinkage regression for grouped variables. arXiv:1709.04333.
Yang, Y., Wainwright, M.J. and Jordan, M.I. (2015). On the computational complexity of high-dimensional bayesian variable selection. Annals of Statistics, in press.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics 38, 894–942.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.
Zubkov, A.M. and Serov, A.A. (2013). A complete proof of universal inequalities for the distribution function of the binomial law. Theory of Probability & Its Applications 57, 539–544.
Acknowledgments
This work is in memory of Prof. Jayanta Ghosh, who jointly supervised the first PhD student of the second author. Qifan Song’s research is sponsored by NSF DMS-1811812. Guang Cheng’s research is sponsored by NSF DMS-1712907, DMS-1811812, DMS-1821183, and the Office of Naval Research (ONR N00014-18-2759).
Appendix
First, let us state some useful lemmas.
Lemma A.1 (Lemma 1 of Laurent and Massart 2000).
Let \({\chi ^{2}_{d}}(\kappa )\) be a chi-square distribution with d degrees of freedom and noncentrality parameter κ; then we have the following concentration inequality: for any x > 0,

$$P\left({\chi^{2}_{d}}(\kappa)\geq d+\kappa+2\sqrt{(d+2\kappa)x}+2x\right)\leq e^{-x},\qquad P\left({\chi^{2}_{d}}(\kappa)\leq d+\kappa-2\sqrt{(d+2\kappa)x}\right)\leq e^{-x}.$$
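A quick Monte Carlo sanity check of the upper-tail bound \(P({\chi^{2}_{d}}(\kappa)\geq d+\kappa+2\sqrt{(d+2\kappa)x}+2x)\leq e^{-x}\) commonly attributed to this lemma; the constants d, κ, x below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, kappa, x = 5, 3.0, 2.0
n_mc = 200_000

# Sample a noncentral chi-square with d degrees of freedom and
# noncentrality kappa, then compare the empirical upper-tail
# probability against the Laurent-Massart-style bound exp(-x).
samples = rng.noncentral_chisquare(d, kappa, size=n_mc)
threshold = d + kappa + 2 * np.sqrt((d + 2 * kappa) * x) + 2 * x
tail = np.mean(samples >= threshold)
print(tail, np.exp(-x))  # empirical tail should sit below exp(-x)
```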
Lemma A.2 (Theorem 1 of Zubkov and Serov 2013).
Let X be a binomial random variable, X ∼ B(n,v). For any 1 < k < n − 1,

$${\Phi}\left(\text{sign}(k/n-v)\sqrt{2nH(v,k/n)}\right)\leq P(X\leq k)\leq {\Phi}\left(\text{sign}((k+1)/n-v)\sqrt{2nH(v,(k+1)/n)}\right),$$
where Φ is the cumulative distribution function of the standard Gaussian distribution and H(v,k/n) = (k/n)log[(k/n)/v] + (1 − k/n)log[(1 − k/n)/(1 − v)].
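The function H(v, k/n) above is exactly the Kullback–Leibler divergence between Bernoulli(k/n) and Bernoulli(v), hence nonnegative and zero only when k/n = v. A small sketch confirming this (the argument values are chosen purely for illustration):

```python
import math

def H(v, a):
    """H(v, a) from Lemma A.2: a*log(a/v) + (1-a)*log((1-a)/(1-v)).
    This equals the Kullback-Leibler divergence KL(Bern(a) || Bern(v)),
    so it is nonnegative and vanishes iff a == v."""
    return a * math.log(a / v) + (1 - a) * math.log((1 - a) / (1 - v))

print(H(0.3, 0.3))      # 0.0
print(H(0.3, 0.5) > 0)  # True
```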
The next lemma is a refined result of Lemma 6 in Barron (1998):
Lemma A.3.
Let f∗ be the true probability density of the data-generating process, f𝜃 be the likelihood function with parameter 𝜃 ∈ Θ, and E∗, E𝜃 denote the corresponding expectations. Let Bn and Cn be two subsets of the parameter space Θ, and let ϕn be a test function satisfying ϕn(Dn) ∈ [0,1] for any data Dn. If π(Bn) ≤ bn, \(E^{*}\phi (D_{n})\leq b_{n}^{\prime }\), \(\sup _{\theta \in C_{n}}E_{\theta }(1-\phi (D_{n}))\leq c_{n}\), and furthermore,
where \(m(D_{n})={\int }_{\Theta } \pi (\theta )f_{\theta }(D_{n})d\theta \) is the marginal probability of Dn. Then,
Proof.
Define Ωn to be the event that m(Dn)/f∗(Dn) ≥ an, and \(m(D_{n}, C_{n}\cup B_{n}) = {\int }_{C_{n}\cup B_{n}} \pi (\theta )f_{\theta }(D_{n})d\theta \). Then
By Fubini’s theorem,
Combining the above inequalities leads to the conclusion.□
Proof of Theorems 2.1 and 2.2.
Let G = {g1,g2,…,gd} be a generic subset of {2,…,n}; it also represents a potential (d + 1)-group structure of 𝜃, namely {{1,…,g1},{g1 + 1,…,g2},…,{gd + 1,…,n}}. Given G and its corresponding blocking structure, \(\widehat \theta _{G}(y)\) denotes the estimator of 𝜃 based on block means, i.e. \(\widehat \theta _{G,j}(y) = {\sum }_{i=g_{j}+1}^{g_{j+1}}y_{i}/(g_{j+1}-g_{j})\) for all 0 ≤ j ≤ d (with the convention g0 = 0 and gd+ 1 = n), and \(\widehat {\sigma }^{2}_{G}(y)=\|y-\widehat \theta _{G}\|^{2}/(n-|G|-1)\).
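The block-mean estimator \(\widehat\theta_G\) and variance estimator \(\widehat\sigma^2_G\) defined above can be sketched directly; the breakpoint convention (G lists the last index of every block except the final one) follows the definition above, while the function name and example data are illustrative assumptions.

```python
import numpy as np

def block_mean_fit(y, G):
    """Given breakpoints G (1-based last index of each block except the
    final one, as in the paper), replace y within each block by its mean
    and return (theta_hat, sigma2_hat), where
    sigma2_hat = ||y - theta_hat||^2 / (n - |G| - 1)."""
    n = len(y)
    bounds = [0] + sorted(G) + [n]          # 0-based block boundaries
    theta_hat = np.empty(n)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        theta_hat[lo:hi] = y[lo:hi].mean()  # block mean
    sigma2_hat = np.sum((y - theta_hat) ** 2) / (n - len(G) - 1)
    return theta_hat, sigma2_hat

y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
theta_hat, sigma2_hat = block_mean_fit(y, G=[3])  # blocks {1,2,3}, {4,5,6}
print(theta_hat, sigma2_hat)  # [1. 1. 1. 5. 5. 5.] 0.0
```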
To prove the posterior contraction, we will apply Lemma A.3. We define the following testing function
for some δ > 0, and define Cn and Bn as:
Note that when G ⊃ G∗, \(\|\widehat \theta _{G}(y)-\theta ^{*}\|^{2}\sim \sigma ^{*2}\chi ^{2}_{|G|+1}\) and \(\|y-\widehat \theta _{G}\|^{2}\)\(\sim \sigma ^{*2}\chi ^{2}_{n-|G|-1}\), thus by Lemma A.1, we have that
for some constant c1, since \(|G|=O(|G^{*}|)\prec n{\epsilon _{n}^{2}}\) and 𝜖n ≺ 1. Therefore,
as long as \(n{\epsilon _{n}^{2}}/[|G^{*}|\log n]\) is sufficiently large.
For any (𝜃,σ2) ∈ Cn satisfying \(\|\theta -\theta ^{*}\|\leq M\sqrt n\sigma ^{*}\epsilon _{n}\) and σ2/σ∗2 ≥ (1 − 𝜖n)/(1 + 𝜖n), we define \(\widehat G=\{i:\theta _{i}-\theta _{i-1}\geq \sigma \epsilon _{n}/n^{2}\}\cup G^{*}\) (hence \(|\widehat G|\leq (1+\delta )|G^{*}|\)); thus
for some c2 given a large M, where the second inequality is due to the fact that \(\|\widehat \theta _{\widehat G}(\theta )-\theta \|\leq \sqrt {n}\sigma \epsilon _{n}\) when 𝜃 ∈ Bn.
For any (𝜃,σ2) ∈ Cn satisfying σ2/σ∗2 < (1 − 𝜖n)/(1 + 𝜖n) or σ2/σ∗2 > (1 + 𝜖n)/(1 − 𝜖n),
for some \(c_{2}^{\prime }\), where the noncentral parameter \(\lambda <n{\epsilon _{n}^{2}}\prec (n-|G|-1)\epsilon _{n}\).
Combining the results from the previous two paragraphs, it is easy to obtain that
Now we consider the marginal density of the data y. With probability \(P(\|\varepsilon \|\leq 2\sqrt n\sigma ^{*})\) (which converges to 1),
for some constant \(c_{3}^{\prime }\). Besides,
for some constant \(c_{3}^{\prime \prime }\). Thus
for some c3, where c3 can be sufficiently small when \(n{\epsilon _{n}^{2}}/[|G^{*}|\log n]\) is large enough.
Finally, we study the prior probability of the set Bn. Due to the prior independence of the \(\vartheta _{i}\)’s, π(Bn) = π[Bin(n − 1 −|G∗|,p) > δ|G∗|], where \(p\leq (1/n)^{1+u}\). By Lemma A.2,
for some c4. Combining results (A.2), (A.3), (A.4) and (A.5) and applying Lemma A.3, we obtain the posterior consistency result that
given sufficiently large δ and \(n{\epsilon _{n}^{2}}/[|G^{*}|\log n]\). □
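The prior bound above hinges on the binomial tail P(Bin(n − 1 − |G∗|, p) > δ|G∗|) with \(p\leq(1/n)^{1+u}\) being negligibly small. A direct numerical sketch (the constants n, u, |G∗|, δ are arbitrary illustrations, not values from the paper):

```python
import math

def binom_tail(m, p, k):
    """Exact P(Bin(m, p) > k) by summing the pmf; fine for moderate m."""
    return sum(math.comb(m, j) * p**j * (1 - p)**(m - j)
               for j in range(k + 1, m + 1))

n, u = 200, 1.0
G_star, delta = 5, 2
m = n - 1 - G_star          # number of independent jump indicators
p = (1.0 / n) ** (1 + u)    # per-coordinate prior jump probability
tail = binom_tail(m, p, delta * G_star)
print(tail)  # astronomically small, as the proof requires
```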
Proof of Theorem 3.1.
Consider y = 𝜃∗ + d, where \(\theta _{i}^{*}\equiv 0\) for all i and the error d is the vector of order statistics of standard normal variables, i.e., the density of d is \(f(d)=n!\prod \phi (d_{i})1(d_{1}\leq d_{2}\leq \dots \leq d_{n})\) and ϕ denotes the standard normal density. The prior of 𝜃 follows \(\pi (\theta )=\pi _{1}(\theta _{1}){\prod }_{i=2}^{n}\pi _{t,s}(\theta _{i}-\theta _{i-1})\), where \(\pi _{1}\) is the density of N(0,λ1) and πt,s is the density of the t distribution with a tiny scale parameter satisfying − log s ≍ log n, i.e., the conditions in Corollary 2.1 hold, and we consider the misspecified posterior of the form π(𝜃|Dn) ∝ exp{−∥y − 𝜃∥2/2}π(𝜃).
Define \(\mu \in \mathbb {R}^{n}\) as μi = 0 for all 1 ≤ i ≤ k = 3n/4, and μi = Z0.25/2 for i > k, where Z0.25 is the upper 25% quantile of the standard normal distribution; thus ∥μ − 𝜃∗∥2 ≍ n.
Let Δ𝜃 be any vector such that ∥Δ𝜃∥2 ≤ M log n. Then
And
for some positive constant c given sufficiently large n, where the inequalities above hold since yn ≥ yn− 1 ≥⋯ ≥ yk+ 1, and yk+ 1 ≈ Z0.25 with high probability by large-sample empirical quantile theory.
Combining the above two results, we have that the posterior density satisfies π(μ + Δ𝜃|Dn) ≫ π(𝜃∗ + Δ𝜃|Dn) for any ∥Δ𝜃∥2 ≤ M log n with high probability. Therefore, more posterior mass is distributed within the \(\sqrt {M\log n}\)-radius ball centered at μ than at the true parameter 𝜃∗. □
Proof of Theorem 3.2.
The proof of this theorem is quite similar to that of Theorems 2.1 and 2.2. We define the same testing function as in that proof, and define the following two sets:
Using the same arguments, one can still establish exponential separation results (A.2) and (A.3).
To establish (A.4), we notice that
for some constant \(c_{3}^{\prime }\) and
This ensures (A.4).
As for the prior probability of Bn, if the scale parameter of the t distribution is sufficiently small, i.e. \(s=n^{-w}\) for some large w and \({\int }_{-\epsilon _{n}/n^{2}}^{\epsilon _{n}/n^{2}}\pi _{t,s}(x)dx\geq 1-1/n^{1+u}\) for some sufficiently large u, where πt,s denotes the t density function with scale parameter s, then for any ranking r,
This hence implies that − log(π(Bn)) ≥ u log n. □
Song, Q., Cheng, G. Bayesian Fusion Estimation via t Shrinkage. Sankhya A 82, 353–385 (2020). https://doi.org/10.1007/s13171-019-00177-0