Abstract
Shrinkage priors have enjoyed great success in many data analyses; however, their applications have mostly focused on Bayesian modeling of sparse parameters. In this work, we apply Bayesian shrinkage to model a high-dimensional parameter that possesses an unknown blocking structure. We propose to impose a heavy-tailed shrinkage prior, e.g., a t prior, on the differences of successive parameter entries; such a fusion prior shrinks successive differences toward zero and hence induces posterior blocking. Compared with the conventional Bayesian fused LASSO, which employs a Laplace fusion prior, the t fusion prior induces a stronger shrinkage effect and enjoys a nice posterior consistency property. Simulation studies and real data analyses show that t fusion is superior to the frequentist fusion estimator and the Bayesian Laplace fusion prior. The t fusion strategy is further developed to conduct Bayesian clustering analysis, and our simulations show that the proposed algorithm compares favorably to classical Dirichlet process modeling.
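The fusion construction described above can be sketched numerically. The following is a minimal illustration (not the authors' implementation): θ1 receives a normal prior and each successive difference receives a heavy-tailed t prior with a small scale, so prior draws are nearly piecewise constant with occasional jumps. The function name, degrees of freedom, and scale values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def t_fusion_prior_draw(n, df=1.0, scale=0.01, lambda1=1.0, rng=rng):
    """Draw one parameter vector theta from a t fusion prior sketch:
    theta_1 ~ N(0, lambda1) and each successive difference
    theta_i - theta_{i-1} follows a t distribution with a small scale,
    so most differences are shrunk near zero while occasional
    heavy-tail draws create jumps (i.e., blocks)."""
    diffs = scale * rng.standard_t(df, size=n - 1)
    theta1 = rng.normal(0.0, np.sqrt(lambda1))
    return theta1 + np.concatenate(([0.0], np.cumsum(diffs)))

theta = t_fusion_prior_draw(200)
# Most successive differences are tiny; a few are large (heavy tails).
d = np.abs(np.diff(theta))
print(np.mean(d < 0.05), np.max(d))
```

With df = 1 (Cauchy increments) most |differences| fall well below the jump scale, which is the prior-level analogue of the posterior blocking discussed above.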
References
Andrews, D.F. and Mallows, C.L. (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B (Methodological), 99–102.
Barron, A. (1998). Information-theoretic characterization of Bayes performance and the choice of priors in parametric and nonparametric problems. In J.M. Bernardo, J. Berger, A. Dawid, A. Smith, eds. Bayesian Statistics 6, 27–52.
Berger, J.O., Wang, X. and Shen, L. (2014). A Bayesian approach to subgroup identification. Journal of Biopharmaceutical Statistics 24, 1, 110–129.
Betancourt, B., Rodríguez, A. and Boyd, N. (2017). Bayesian fused lasso regression for dynamic binary networks. Journal of Computational and Graphical Statistics 26, 4, 840–850.
Bhattacharya, A., Pati, D., Pillai, N.S. and Dunson, D.B. (2015). Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association 110, 1479–1490.
Carvalho, C.M., Polson, N.G. and Scott, J.G. (2010). The horseshoe estimator for sparse signals. Biometrika 97, 465–480.
Castillo, I. and van der Vaart, A. (2012). Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. The Annals of Statistics 40, 4, 2069–2101.
Castillo, I., Schmidt-Hieber, J. and van der Vaart, A.W. (2015). Bayesian linear regression with sparse priors. Annals of Statistics, 1986–2018.
Chen, J. and Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95, 759–771.
Chen, J. and Chen, Z. (2012). Extended BIC for small-n-large-P sparse GLM. Statistica Sinica 22, 555–574.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360.
Ghosal, S., Ghosh, J.K. and Van Der Vaart, A.W. (2000). Convergence rates of posterior distributions. Annals of Statistics 28, 2, 500–531.
Ghosal, S. and van der Vaart, A.W. (2007). Convergence rates of posterior distributions for noniid observations. Annals of Statistics 35, 1, 192–223.
Hahn, P.R. and Carvalho, C.M. (2015). Decoupling shrinkage and selection in Bayesian linear models: a posterior summary perspective. Journal of the American Statistical Association 110, 435–448.
Heller, K.A. and Ghahramani, Z. (2005). Bayesian hierarchical clustering. In Proceedings of the 22nd International Conference on Machine Learning, 297–304.
Ishwaran, H. and Rao, J.S. (2005). Spike and slab variable selection: frequentist and Bayesian strategies. Annals of Statistics, 730–773.
Jiang, W. (2007). Bayesian variable selection for high dimensional generalized linear models: Convergence rate of the fitted densities. Annals of Statistics 35, 1487–1511.
Johnson, V.E. and Rossell, D. (2012). Bayesian model selection in high-dimensional settings. Journal of the American Statistical Association 107, 649–660.
Johnstone, I.M. (2010). High dimensional Bernstein–von Mises: simple examples. Institute of Mathematical Statistics Collections 6, 87.
Ke, Z.T., Fan, J. and Wu, Y. (2015a). Homogeneity pursuit. Journal of the American Statistical Association 110, 509, 175–194.
Ke, Z.T., Fan, J. and Wu, Y. (2015b). Homogeneity pursuit. Journal of the American Statistical Association 110, 175–194.
Kleijn, B.J.K. and van der Vaart, A.W. (2006a). Misspecification in infinite-dimensional Bayesian statistics. The Annals of Statistics 34, 2, 837–877.
Kleijn, B.J.K. and van der Vaart, A.W. (2006b). Misspecification in infinite-dimensional Bayesian statistics. Annals of Statistics 34, 837–877.
Kyung, M., Gill, J., Ghosh, M. and Casella, G. (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis 5, 2, 369–411.
Laurent, B. and Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 1302–1338.
Li, H. and Pati, D. (2017). Variable selection using shrinkage priors. Computational Statistics & Data Analysis 107, 107–119.
Li, F. and Sang, H. (2018). Spatial homogeneity pursuit of regression coefficients for large datasets. Journal of the American Statistical Association, in press, 1–37.
Liang, F., Song, Q. and Yu, K. (2013). Bayesian subset modeling for high dimensional generalized linear models. Journal of the American Statistical Association 108, 589–606.
Liu, J., Yuan, L. and Ye, J. (2010). An efficient algorithm for a class of fused lasso problems. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, 323–332.
Ma, S. and Huang, J. (2017). A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association 112, 517, 410–423.
Mozeika, A. and Coolen, A. (2018). Mean-field theory of Bayesian clustering. arXiv:1709.01632.
Narisetty, N.N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing priors. The Annals of Statistics 42, 2, 789–817.
Neal, R.M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9, 2, 249–265.
Park, T. and Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association 103, 681–686.
Rinaldo, A. et al. (2009). Properties and refinements of the fused lasso. The Annals of Statistics 37, 5B, 2922–2952.
Robbins, H. (1985). An empirical bayes approach to statistics. In Herbert Robbins Selected Papers, 41–47.
Royston, J.P. (1982). Algorithm AS 177: expected normal order statistics (exact and approximate). Journal of the Royal Statistical Society, Series C (Applied Statistics) 31, 2, 161–165.
Scott, J.G. and Berger, J.O. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Annals of Statistics, 2587–2619.
Shen, X. and Huang, H.-C. (2012). Grouping pursuit through a regularization solution surface. Journal of the American Statistical Association 105, 727–739.
Shimamura, K., Ueki, M., Kawano, S. and Konishi, S. (2018). Bayesian generalized fused lasso modeling via NEG distribution. Communications in Statistics - Theory and Methods, 1–23.
Song, Q. and Liang, F. (2014). A split-and-merge bayesian variable selection approach for ultra-high dimensional regression. Journal of the Royal Statistical Society, Series B, in press.
Song, Q. and Liang, F. (2017). Nearly optimal bayesian shrinkage for high dimensional regression. arXiv:1712.08964.
Tang, X., Xu, X., Ghosh, M. and Ghosh, P. (2016). Bayesian variable selection and estimation based on global-local shrinkage priors. arXiv:1605.07981.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.
Tibshirani, R., Saunders, M., Rosset, S., Ji, Z. and Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 1, 91–108.
Tibshirani, R. and Wang, P. (2007). Spatial smoothing and hot spot detection for cgh data using the fused lasso. Biostatistics 9, 1, 18–29.
van de Geer, S. and Bühlmann, P. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics, Springer.
van der Pas, S.L., Szabo, B. and van der Vaart, A. (2017). Adaptive posterior contraction rates for the horseshoe. arXiv:1702.03698.
Wade, S. and Ghahramani, Z. (2018). Bayesian cluster analysis: Point estimation and credible balls. Bayesian Analysis 13, 559–626.
Xu, Z., Schmidt, D.F., Makalic, E., Qian, G. and Hopper, J.L. (2017). Bayesian sparse global-local shrinkage regression for grouped variables. arXiv:1709.04333.
Yang, Y., Wainwright, M.J. and Jordan, M.I. (2015). On the computational complexity of high-dimensional bayesian variable selection. Annals of Statistics, in press.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics 38, 894–942.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.
Zubkov, A.M. and Serov, A.A. (2013). A complete proof of universal inequalities for the distribution function of the binomial law. Theory of Probability & Its Applications 57, 539–544.
Acknowledgments
This work is in memory of Prof. Jayanta Ghosh, who jointly supervised the first PhD student of the second author. Qifan Song’s research is sponsored by NSF DMS-1811812. Guang Cheng’s research is sponsored by NSF DMS-1712907, DMS-1811812, DMS-1821183, and the Office of Naval Research (ONR N00014-18-2759).
Appendix
First, let us state some useful lemmas.
Lemma A.1 (Lemma 1 of Laurent and Massart 2000).
Let \({\chi ^{2}_{d}}(\kappa )\) be a chi-square distribution with d degrees of freedom and noncentrality parameter κ; then we have the following concentration inequality: for any x > 0,

$$P\left({\chi^{2}_{d}}(\kappa)\geq d+\kappa+2\sqrt{(d+2\kappa)x}+2x\right)\leq e^{-x},\qquad P\left({\chi^{2}_{d}}(\kappa)\leq d+\kappa-2\sqrt{(d+2\kappa)x}\right)\leq e^{-x}.$$
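A quick Monte Carlo sanity check of the upper-tail bound \(P({\chi^{2}_{d}}(\kappa)\geq d+\kappa+2\sqrt{(d+2\kappa)x}+2x)\leq e^{-x}\) commonly attributed to this lemma; the constants d, κ, x below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, kappa, x = 5, 3.0, 2.0
n_mc = 200_000

# Sample a noncentral chi-square with d degrees of freedom and
# noncentrality kappa, then compare the empirical upper-tail
# probability against the Laurent-Massart-style bound exp(-x).
samples = rng.noncentral_chisquare(d, kappa, size=n_mc)
threshold = d + kappa + 2 * np.sqrt((d + 2 * kappa) * x) + 2 * x
tail = np.mean(samples >= threshold)
print(tail, np.exp(-x))  # empirical tail should sit below exp(-x)
```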
Lemma A.2 (Theorem 1 of Zubkov and Serov 2013).
Let X be a binomial random variable, X ∼ B(n,v). For any 1 < k < n − 1,

$${\Phi}\left(\text{sign}(k/n-v)\sqrt{2nH(v,k/n)}\right)\leq P(X\leq k)\leq {\Phi}\left(\text{sign}((k+1)/n-v)\sqrt{2nH(v,(k+1)/n)}\right),$$
where Φ is the cumulative distribution function of the standard Gaussian distribution and H(v,k/n) = (k/n)log[(k/n)/v] + (1 − k/n)log[(1 − k/n)/(1 − v)].
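The function H(v, k/n) above is exactly the Kullback–Leibler divergence between Bernoulli(k/n) and Bernoulli(v), hence nonnegative and zero only when k/n = v. A small sketch confirming this (the argument values are chosen purely for illustration):

```python
import math

def H(v, a):
    """H(v, a) from Lemma A.2: a*log(a/v) + (1-a)*log((1-a)/(1-v)).
    This equals the Kullback-Leibler divergence KL(Bern(a) || Bern(v)),
    so it is nonnegative and vanishes iff a == v."""
    return a * math.log(a / v) + (1 - a) * math.log((1 - a) / (1 - v))

print(H(0.3, 0.3))      # 0.0
print(H(0.3, 0.5) > 0)  # True
```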
The next lemma is a refined result of Lemma 6 in Barron (1998):
Lemma A.3.
Let f∗ be the true probability density of the data-generating process, f𝜃 be the likelihood function with parameter 𝜃 ∈ Θ, and E∗, E𝜃 denote the corresponding expectations. Let Bn and Cn be two subsets of the parameter space Θ, and let ϕn be a test function satisfying ϕn(Dn) ∈ [0,1] for any data Dn. If π(Bn) ≤ bn, \(E^{*}\phi (D_{n})\leq b_{n}^{\prime }\), \(\sup _{\theta \in C_{n}}E_{\theta }(1-\phi (D_{n}))\leq c_{n}\), and furthermore,
where \(m(D_{n})={\int }_{\Theta } \pi (\theta )f_{\theta }(D_{n})d\theta \) is the marginal probability of Dn. Then,
Proof.
Define Ωn to be the event that m(Dn)/f∗(Dn) ≥ an, and \(m(D_{n}, C_{n}\cup B_{n}) = {\int }_{C_{n}\cup B_{n}} \pi (\theta )f_{\theta }(D_{n})d\theta \). Then
By Fubini’s theorem,
Combining the above inequalities leads to the conclusion.□
Proof of Theorems 2.1 and 2.2.
Let G = {g1,g2,…,gd} be a generic subset of {2,…,n}; it also represents a potential (d + 1)-group structure of 𝜃, namely {{1,…,g1},{g1 + 1,…,g2},…,{gd + 1,…,n}}. Given G and its corresponding blocking structure, \(\widehat \theta _{G}(y)\) denotes the estimator of 𝜃 based on block means, i.e. \(\widehat \theta _{G,j}(y) = {\sum }_{i=g_{j}+1}^{g_{j+1}}y_{i}/(g_{j+1}-g_{j})\) for all 0 ≤ j ≤ d (with the convention g0 = 0 and gd+ 1 = n), and \(\widehat {\sigma }^{2}_{G}(y)=\|y-\widehat \theta _{G}\|^{2}/(n-|G|-1)\).
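The block-mean estimator \(\widehat\theta_G\) and variance estimator \(\widehat\sigma^2_G\) defined above can be sketched directly; the breakpoint convention (G lists the last index of every block except the final one) follows the definition above, while the function name and example data are illustrative assumptions.

```python
import numpy as np

def block_mean_fit(y, G):
    """Given breakpoints G (1-based last index of each block except the
    final one, as in the paper), replace y within each block by its mean
    and return (theta_hat, sigma2_hat), where
    sigma2_hat = ||y - theta_hat||^2 / (n - |G| - 1)."""
    n = len(y)
    bounds = [0] + sorted(G) + [n]          # 0-based block boundaries
    theta_hat = np.empty(n)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        theta_hat[lo:hi] = y[lo:hi].mean()  # block mean
    sigma2_hat = np.sum((y - theta_hat) ** 2) / (n - len(G) - 1)
    return theta_hat, sigma2_hat

y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
theta_hat, sigma2_hat = block_mean_fit(y, G=[3])  # blocks {1,2,3}, {4,5,6}
print(theta_hat, sigma2_hat)  # [1. 1. 1. 5. 5. 5.] 0.0
```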
To prove the posterior contraction, we will apply Lemma A.3. We define the following testing function
for some δ > 0, and define Cn and Bn as:
Note that when G ⊃ G∗, \(\|\widehat \theta _{G}(y)-\theta ^{*}\|^{2}\sim \sigma ^{*2}\chi ^{2}_{|G|+1}\) and \(\|y-\widehat \theta _{G}\|^{2}\)\(\sim \sigma ^{*2}\chi ^{2}_{n-|G|-1}\), thus by Lemma A.1, we have that
for some constant c1, since \(|G|=O(|G^{*}|)\prec n{\epsilon _{n}^{2}}\) and 𝜖n ≺ 1. Therefore,
as long as \(n{\epsilon _{n}^{2}}/[|G^{*}|\log n]\) is sufficiently large.
For any (𝜃,σ2) ∈ Cn satisfying \(\|\theta -\theta ^{*}\|\leq M\sqrt n\sigma ^{*}\epsilon _{n}\) and σ2/σ∗2 ≥ (1 − 𝜖n)/(1 + 𝜖n), we define \(\widehat G=\{i:\theta _{i}-\theta _{i-1}\geq \sigma \epsilon _{n}/n^{2}\}\cup G^{*}\) (hence \(|\widehat G|\leq (1+\delta )|G^{*}|\)); thus
for some c2 given a large M, where the second inequality is due to the fact that \(\|\widehat \theta _{\widehat G}(\theta )-\theta \|\leq \sqrt {n}\sigma \epsilon _{n}\) when 𝜃 ∈ Bn.
For any (𝜃,σ2) ∈ Cn satisfying σ2/σ∗2 < (1 − 𝜖n)/(1 + 𝜖n) or σ2/σ∗2 > (1 + 𝜖n)/(1 − 𝜖n),
for some \(c_{2}^{\prime }\), where the noncentral parameter \(\lambda <n{\epsilon _{n}^{2}}\prec (n-|G|-1)\epsilon _{n}\).
Combining the results from the previous two paragraphs, it is easy to obtain that
Now we consider the marginal density of the data y. With probability \(P(\|\varepsilon \|\leq 2\sqrt n\sigma ^{*})\) (which converges to 1),
for some constant \(c_{3}^{\prime }\). Besides,
for some constant \(c_{3}^{\prime \prime }\). Thus
for some c3, where c3 can be sufficiently small when \(n{\epsilon _{n}^{2}}/[|G^{*}|\log n]\) is large enough.
Finally, we study the prior probability of the set Bn. Due to the prior independence of the \(\vartheta _{i}\)’s, π(Bn) = π[Bin(n − 1 −|G∗|,p) > δ|G∗|], where \(p\leq (1/n)^{1+u}\). By Lemma A.2,
for some c4. Combining results (A.2), (A.3), (A.4) and (A.5) and applying Lemma A.3, we obtain the posterior consistency result that
given sufficiently large δ and \(n{\epsilon _{n}^{2}}/[|G^{*}|\log n]\). □
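The prior bound above hinges on the binomial tail P(Bin(n − 1 − |G∗|, p) > δ|G∗|) with \(p\leq(1/n)^{1+u}\) being negligibly small. A direct numerical sketch (the constants n, u, |G∗|, δ are arbitrary illustrations, not values from the paper):

```python
import math

def binom_tail(m, p, k):
    """Exact P(Bin(m, p) > k) by summing the pmf; fine for moderate m."""
    return sum(math.comb(m, j) * p**j * (1 - p)**(m - j)
               for j in range(k + 1, m + 1))

n, u = 200, 1.0
G_star, delta = 5, 2
m = n - 1 - G_star          # number of independent jump indicators
p = (1.0 / n) ** (1 + u)    # per-coordinate prior jump probability
tail = binom_tail(m, p, delta * G_star)
print(tail)  # astronomically small, as the proof requires
```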
Proof of Theorem 3.1.
Consider y = 𝜃∗ + d, where \(\theta _{i}^{*}\equiv 0\) for all i and the error d is the vector of order statistics of standard normal variables, i.e., the density of d is \(f(d)=n!\prod \phi (d_{i})1(d_{1}\leq d_{2}\leq \dots \leq d_{n})\) and ϕ denotes the standard normal density. The prior of 𝜃 follows \(\pi (\theta )=\pi _{1}(\theta _{1}){\prod }_{i=2}^{n}\pi _{t,s}(\theta _{i}-\theta _{i-1})\), where \(\pi _{1}\) is the density of N(0,λ1) and πt,s is the density of the t distribution with a tiny scale parameter satisfying − log s ≍ log n, i.e., the conditions in Corollary 2.1 hold, and we consider the misspecified posterior of the form π(𝜃|Dn) ∝ exp{−∥y − 𝜃∥2/2}π(𝜃).
Define \(\mu \in \mathbb {R}^{n}\) as μi = 0 for all 1 ≤ i ≤ k = 3n/4, and μi = Z0.25/2 for i > k, where Z0.25 is the upper 25% quantile of the standard normal distribution; thus ∥μ − 𝜃∗∥2 ≍ n.
Let Δ𝜃 be any vector such that ∥Δ𝜃∥2 ≤ M log n. Then
And
for some positive constant c given sufficiently large n, where the inequalities above hold since yn ≥ yn− 1 ≥⋯ ≥ yk+ 1, and yk+ 1 ≈ Z0.25 with high probability by large-sample empirical quantile theory.
Combining the above two results, we have that the posterior density satisfies π(μ + Δ𝜃|Dn) ≫ π(𝜃∗ + Δ𝜃|Dn) for any ∥Δ𝜃∥2 ≤ M log n with high probability. Therefore, more posterior mass is distributed within the \(\sqrt {M\log n}\)-radius ball centered at μ than at the true parameter 𝜃∗. □
Proof of Theorem 3.2.
The proof of this theorem is quite similar to that of Theorems 2.1 and 2.2. We define the same testing function as in that proof, and define the following two sets:
Using the same arguments, one can still establish exponential separation results (A.2) and (A.3).
To establish (A.4), we notice that
for some constant \(c_{3}^{\prime }\) and
This ensures (A.4).
As for the prior probability of Bn, if the scale parameter of the t distribution is sufficiently small, i.e. \(s=n^{-w}\) for some large w and \({\int }_{-\epsilon _{n}/n^{2}}^{\epsilon _{n}/n^{2}}\pi _{t,s}(x)dx\geq 1-1/n^{1+u}\) for some sufficiently large u, where πt,s denotes the t density function with scale parameter s, then for any ranking r,
This hence implies that − log(π(Bn)) ≥ u log n. □
Song, Q., Cheng, G. Bayesian Fusion Estimation via t Shrinkage. Sankhya A 82, 353–385 (2020). https://doi.org/10.1007/s13171-019-00177-0