Bayesian Fusion Estimation via t Shrinkage

Abstract

Shrinkage priors have enjoyed great success in many data analyses; however, their applications mostly focus on the Bayesian modeling of sparse parameters. In this work, we apply Bayesian shrinkage to model a high dimensional parameter that possesses an unknown blocking structure. We propose to impose a heavy-tailed shrinkage prior, e.g., a t prior, on the differences of successive parameter entries; such a fusion prior shrinks successive differences towards zero and hence induces posterior blocking. Compared with the conventional Bayesian fused LASSO, which implements a Laplace fusion prior, the t fusion prior induces a stronger shrinkage effect and enjoys a nice posterior consistency property. Simulation studies and real data analyses show that t fusion has superior performance over the frequentist fusion estimator and the Bayesian Laplace fusion prior. This t fusion strategy is further developed to conduct Bayesian clustering analysis, and our simulations show that the proposed algorithm compares favorably to classical Dirichlet process modeling.
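As a concrete toy illustration of the fusion idea (a minimal sketch, not the authors' sampler), the following code forms a negative log posterior that combines a Gaussian likelihood with a t (heavy-tailed) prior on successive differences, and computes a MAP estimate. The function name, the scale `s`, and the degrees of freedom `nu` are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def t_fusion_neg_log_post(theta, y, s=0.05, nu=1.0):
    """Negative log posterior: Gaussian likelihood plus a t (heavy-tailed)
    fusion prior on the successive differences theta_i - theta_{i-1}."""
    fit = 0.5 * np.sum((y - theta) ** 2)
    diffs = np.diff(theta)
    # negative log t-density up to constants: (nu+1)/2 * log(1 + (d/s)^2 / nu)
    penalty = 0.5 * (nu + 1.0) * np.sum(np.log1p((diffs / s) ** 2 / nu))
    return fit + penalty

rng = np.random.default_rng(0)
truth = np.repeat([0.0, 2.0, -1.0], 20)          # blocked mean vector
y = truth + 0.3 * rng.standard_normal(truth.size)

# MAP sketch: starting from y, the optimizer can only decrease the objective,
# trading a little fit for much smaller within-block differences.
res = minimize(t_fusion_neg_log_post, x0=y.copy(), args=(y,), method="L-BFGS-B")
theta_hat = res.x
```

Because the t penalty's derivative vanishes for large differences, the two genuine jumps in `truth` are penalized far less than the many small noise-driven differences, which is exactly the blocking behavior the abstract describes.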


[Figures 1–5 appear here in the published article.]

References

  • Andrews, D.F. and Mallows, C.L. (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B (Methodological), 99–102.

  • Barron, A. (1998). Information-theoretic characterization of Bayes performance and the choice of priors in parametric and nonparametric problems. In J.M. Bernardo, J. Berger, A. Dawid and A. Smith, eds., Bayesian Statistics 6, 27–52.

  • Berger, J.O., Wang, X. and Shen, L. (2014). A Bayesian approach to subgroup identification. Journal of Biopharmaceutical Statistics 24, 1, 110–129.

  • Betancourt, B., Rodríguez, A. and Boyd, N. (2017). Bayesian fused lasso regression for dynamic binary networks. Journal of Computational and Graphical Statistics 26, 4, 840–850.

  • Bhattacharya, A., Pati, D., Pillai, N.S. and Dunson, D.B. (2015). Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association 110, 1479–1490.

  • Carvalho, C.M., Polson, N.G. and Scott, J.G. (2010). The horseshoe estimator for sparse signals. Biometrika 97, 465–480.

  • Castillo, I. and van der Vaart, A. (2012). Needles and straw in a haystack: posterior concentration for possibly sparse sequences. The Annals of Statistics 40, 4, 2069–2101.

  • Castillo, I., Schmidt-Hieber, J. and van der Vaart, A.W. (2015). Bayesian linear regression with sparse priors. Annals of Statistics, 1986–2018.

  • Chen, J. and Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95, 759–771.

  • Chen, J. and Chen, Z. (2012). Extended BIC for small-n-large-P sparse GLM. Statistica Sinica 22, 555–574.

  • Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360.

  • Ghosal, S., Ghosh, J.K. and van der Vaart, A.W. (2000). Convergence rates of posterior distributions. Annals of Statistics 28, 2, 500–531.

  • Ghosal, S. and van der Vaart, A.W. (2007). Convergence rates of posterior distributions for non-i.i.d. observations. Annals of Statistics 35, 1, 192–223.

  • Hahn, P.R. and Carvalho, C.M. (2015). Decoupling shrinkage and selection in Bayesian linear models: a posterior summary perspective. Journal of the American Statistical Association 110, 435–448.

  • Heller, K.A. and Ghahramani, Z. (2005). Bayesian hierarchical clustering. In Proceedings of the 22nd International Conference on Machine Learning, 297–304.

  • Ishwaran, H. and Rao, J.S. (2005). Spike and slab variable selection: frequentist and Bayesian strategies. Annals of Statistics, 730–773.

  • Jiang, W. (2007). Bayesian variable selection for high dimensional generalized linear models: convergence rate of the fitted densities. Annals of Statistics 35, 1487–1511.

  • Johnson, V.E. and Rossell, D. (2012). Bayesian model selection in high-dimensional settings. Journal of the American Statistical Association 107, 649–660.

  • Johnstone, I.M. (2010). High dimensional Bernstein–von Mises: simple examples. Institute of Mathematical Statistics Collections 6, 87.

  • Ke, Z.T., Fan, J. and Wu, Y. (2015). Homogeneity pursuit. Journal of the American Statistical Association 110, 175–194.

  • Kleijn, B.J.K. and van der Vaart, A.W. (2006). Misspecification in infinite-dimensional Bayesian statistics. Annals of Statistics 34, 837–877.

  • Kyung, M., Gill, J., Ghosh, M. and Casella, G. (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis 5, 2, 369–411.

  • Laurent, B. and Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 1302–1338.

  • Li, H. and Pati, D. (2017). Variable selection using shrinkage priors. Computational Statistics & Data Analysis 107, 107–119.

  • Li, F. and Sang, H. (2018). Spatial homogeneity pursuit of regression coefficients for large datasets. Journal of the American Statistical Association, 1–37.

  • Liang, F., Song, Q. and Yu, K. (2013). Bayesian subset modeling for high dimensional generalized linear models. Journal of the American Statistical Association 108, 589–606.

  • Liu, J., Yuan, L. and Ye, J. (2010). An efficient algorithm for a class of fused lasso problems. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 323–332.

  • Ma, S. and Huang, J. (2017). A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association 112, 517, 410–423.

  • Mozeika, A. and Coolen, A. (2018). Mean-field theory of Bayesian clustering. arXiv:1709.01632.

  • Narisetty, N.N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing priors. The Annals of Statistics 42, 2, 789–817.

  • Neal, R.M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9, 2, 249–265.

  • Park, T. and Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association 103, 681–686.

  • Rinaldo, A. (2009). Properties and refinements of the fused lasso. The Annals of Statistics 37, 5B, 2922–2952.

  • Robbins, H. (1985). An empirical Bayes approach to statistics. In Herbert Robbins Selected Papers, 41–47.

  • Royston, J.P. (1982). Algorithm AS 177: expected normal order statistics (exact and approximate). Journal of the Royal Statistical Society, Series C (Applied Statistics) 31, 2, 161–165.

  • Scott, J.G. and Berger, J.O. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Annals of Statistics, 2587–2619.

  • Shen, X. and Huang, H.-C. (2012). Grouping pursuit through a regularization solution surface. Journal of the American Statistical Association 105, 727–739.

  • Shimamura, K., Ueki, M., Kawano, S. and Konishi, S. (2018). Bayesian generalized fused lasso modeling via NEG distribution. Communications in Statistics - Theory and Methods, 1–23.

  • Song, Q. and Liang, F. (2014). A split-and-merge Bayesian variable selection approach for ultra-high dimensional regression. Journal of the Royal Statistical Society, Series B, in press.

  • Song, Q. and Liang, F. (2017). Nearly optimal Bayesian shrinkage for high dimensional regression. arXiv:1712.08964.

  • Tang, X., Xu, X., Ghosh, M. and Ghosh, P. (2016). Bayesian variable selection and estimation based on global-local shrinkage priors. arXiv:1605.07981.

  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

  • Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 67, 1, 91–108.

  • Tibshirani, R. and Wang, P. (2007). Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics 9, 1, 18–29.

  • van de Geer, S. and Bühlmann, P. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics, Springer.

  • van der Pas, S.L., Szabo, B. and van der Vaart, A. (2017). Adaptive posterior contraction rates for the horseshoe. arXiv:1702.03698.

  • Wade, S. and Ghahramani, Z. (2018). Bayesian cluster analysis: point estimation and credible balls. Bayesian Analysis 13, 559–626.

  • Xu, Z., Schmidt, D.F., Makalic, E., Qian, G. and Hopper, J.L. (2017). Bayesian sparse global-local shrinkage regression for grouped variables. arXiv:1709.04333.

  • Yang, Y., Wainwright, M.J. and Jordan, M.I. (2015). On the computational complexity of high-dimensional Bayesian variable selection. Annals of Statistics, in press.

  • Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics 38, 894–942.

  • Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.

  • Zubkov, A.M. and Serov, A.A. (2013). A complete proof of universal inequalities for the distribution function of the binomial law. Theory of Probability & Its Applications 57, 539–544.


Acknowledgments

This work is in memory of Prof. Jayanta Ghosh, who jointly supervised the first PhD student of the second author. Qifan Song’s research is sponsored by NSF DMS-1811812. Guang Cheng’s research is sponsored by NSF DMS-1712907, DMS-1811812, DMS-1821183, and the Office of Naval Research (ONR N00014-18-2759).

Author information


Corresponding author

Correspondence to Qifan Song.


Appendix


First, let us state some useful lemmas.

Lemma A.1 (Lemma 1 of Laurent and Massart 2000).

Let \({\chi^{2}_{d}}(\kappa)\) be a chi-square distribution with \(d\) degrees of freedom and noncentrality parameter \(\kappa\). Then we have the following concentration inequalities:

$$ \begin{array}{@{}rcl@{}} &&Pr({\chi^{2}_{d}}(\kappa)>d+\kappa+2x+\sqrt{(4d+8\kappa)x})\leq \exp(-x), \text{ and}\\ &&Pr({\chi^{2}_{d}}(\kappa)<d+\kappa-\sqrt{(4d+8\kappa)x})\leq \exp(-x). \end{array} $$
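The two tail bounds can be checked numerically against the exact noncentral chi-square distribution; the values of d, κ, and x below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import ncx2

d, kappa = 5, 2.0

# Upper tail: Pr(chi2_d(kappa) > d + kappa + 2x + sqrt((4d + 8kappa) x)) <= exp(-x)
for x in [0.5, 1.0, 3.0, 10.0]:
    upper = d + kappa + 2 * x + np.sqrt((4 * d + 8 * kappa) * x)
    tail = ncx2.sf(upper, df=d, nc=kappa)   # exact survival probability
    assert tail <= np.exp(-x)

# Lower tail: Pr(chi2_d(kappa) < d + kappa - sqrt((4d + 8kappa) x)) <= exp(-x)
for x in [0.5, 1.0, 3.0]:
    lower = d + kappa - np.sqrt((4 * d + 8 * kappa) * x)
    assert ncx2.cdf(lower, df=d, nc=kappa) <= np.exp(-x)
```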

Lemma A.2 (Theorem 1 of Zubkov and Serov 2013).

Let X be a binomial random variable, X ∼ B(n,v). For any 1 < k < n − 1,

$$ Pr(X\geq k+1)\leq 1- {\Phi}(\text{sign}(k-nv)\{2nH(v, k/n)\}^{1/2}), $$

where Φ is the cumulative distribution function of the standard Gaussian distribution and H(v,k/n) = (k/n)log[k/(nv)] + (1 − k/n)log[(1 − k/n)/(1 − v)].
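This Gaussian-type bound on the binomial tail can also be verified numerically for the right-tail regime k > nv; the parameters n, v, and k below are illustrative.

```python
import numpy as np
from scipy.stats import binom, norm

def H(v, q):
    # the KL-type divergence from Lemma A.2, with q = k/n
    return q * np.log(q / v) + (1 - q) * np.log((1 - q) / (1 - v))

n, v = 100, 0.1
for k in [15, 20, 30]:                        # k > nv
    bound = norm.sf(np.sign(k - n * v) * np.sqrt(2 * n * H(v, k / n)))
    exact = binom.sf(k, n, v)                 # Pr(X >= k + 1)
    assert exact <= bound                     # the Zubkov-Serov inequality
```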

The next lemma is a refined result of Lemma 6 in Barron (1998):

Lemma A.3.

Let f be the true probability density of the data generation, f𝜃 be the likelihood function with parameter 𝜃 ∈Θ, and let E,E𝜃 denote the corresponding expectations, respectively. Let Bn and Cn be two subsets of the parameter space Θ, and let ϕ be a test function satisfying ϕ(Dn) ∈ [0,1] for any data Dn. If π(Bn) ≤ bn, \(E^{*}\phi (D_{n})\leq b_{n}^{\prime }\), \(\sup _{\theta \in C_{n}}E_{\theta }(1-\phi (D_{n}))\leq c_{n}\), and furthermore,

$$P^{*}\left\{\frac{m(D_{n})}{f^{*}(D_{n})}\geq a_{n} \right\}\geq 1-a_{n}^{\prime},$$

where \(m(D_{n})={\int }_{\Theta } \pi (\theta )f_{\theta }(D_{n})d\theta \) is the marginal probability of Dn. Then,

$$ E^{*}\left[ \pi(C_{n}\cup B_{n}|D_{n})\right]\leq \frac{b_{n}+c_{n}}{a_{n}}+a_{n}^{\prime}+b_{n}^{\prime}. $$

Proof 1.

Define Ωn to be the event \(\{m(D_{n})/f^{*}(D_{n}) \geq a_{n}\}\), and \(m(D_{n}, C_{n}\cup B_{n}) = {\int }_{C_{n}\cup B_{n}} \pi (\theta )f_{\theta }(D_{n})d\theta \). Then

$$ \begin{array}{@{}rcl@{}} &&E^{*}[\pi(C_{n}\cup B_{n}|D_{n})] = E^{*}[\pi(C_{n}\cup B_{n}|D_{n})(1-\phi(D_{n}))1_{{\Omega}_{n}}]\\ &+&E^{*}[\pi(C_{n}\cup B_{n}|D_{n})(1-\phi(D_{n}))(1-1_{{\Omega}_{n}})]+ E^{*}[\pi(C_{n}\cup B_{n}|D_{n})\phi(D_{n})]\\ &\leq& E^{*}[\pi(C_{n}\cup B_{n}|D_{n})(1-\phi(D_{n}))1_{{\Omega}_{n}}]+E^{*}(1-1_{{\Omega}_{n}})+ E^{*}\phi(D_{n})\\ &\leq& E^{*}[\pi(C_{n}\cup B_{n}|D_{n})(1-\phi(D_{n}))1_{{\Omega}_{n}}]+b_{n}^{\prime}+a_{n}^{\prime}\\ &\leq& E^{*}\left\{\frac{m(D_{n}, C_{n}\cup B_{n})}{a_{n}f^{*}(D_{n})}(1-\phi(D_{n}))\right\}+b_{n}^{\prime}+a_{n}^{\prime}. \end{array} $$

By Fubini's theorem,

$$ \begin{array}{@{}rcl@{}} &&E^{*}(1-\phi(D_{n}))m(D_{n}, C_{n}\cup B_{n})/f^{*}(D_{n}) = {\int}_{C_{n}\cup B_{n}} {\int}_{\mathcal{X}}[1-\phi(D_{n})]f_{\theta}(D_{n}) dD_{n}\pi(\theta)d\theta\\ &\leq&{\int}_{C_{n}} E_{\theta}(1-\phi(D_{n}))\pi(\theta)d\theta+{\int}_{B_{n}} {\int}_{\mathcal{X}}f_{\theta}(D_{n}) dD_{n}\pi(\theta)d\theta\leq b_{n}+c_{n}. \end{array} $$

Combining the above inequalities leads to the conclusion.

Proof 2 (Proof of Theorems 2.1 and 2.2).

Let G = {g1,g2,…,gd} be a generic subset of {2,…,n}; it also represents a potential (d + 1)-group structure of 𝜃, namely {{1,…,g1},{g1 + 1,…,g2},…,{gd + 1,…,n}}. Given G and its corresponding blocking structure, \(\widehat \theta _{G}(y)\) denotes the estimator of 𝜃 based on block means, i.e., \(\widehat \theta _{G,i}(y) = {\sum }_{l=g_{j}+1}^{g_{j+1}}y_{l}/(g_{j+1}-g_{j})\) for all gj + 1 ≤ i ≤ gj+ 1, and \(\widehat {\sigma }^{2}_{G}(y)=\|y-\widehat \theta _{G}\|^{2}/(n-|G|-1)\).
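In code, the block-mean estimator and the associated variance estimator look as follows; `block_mean_estimates` is a hypothetical helper written to mirror the proof's notation, not part of the paper.

```python
import numpy as np

def block_mean_estimates(y, G):
    """G is the set of block right-endpoints g_1 < ... < g_d (1-indexed,
    a subset of {2, ..., n}); each entry of theta_hat is the mean of its
    block, and sigma2_hat = ||y - theta_hat||^2 / (n - |G| - 1)."""
    n = y.size
    edges = [0] + sorted(G) + [n]            # 0-indexed block boundaries
    theta_hat = np.empty_like(y, dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        theta_hat[lo:hi] = y[lo:hi].mean()   # replace entries by block mean
    sigma2_hat = np.sum((y - theta_hat) ** 2) / (n - len(G) - 1)
    return theta_hat, sigma2_hat

# Toy usage: blocks {1,2,3} and {4,5}, so G = {3}
y = np.array([1.0, 1.2, 0.8, 5.1, 4.9])
theta_hat, s2 = block_mean_estimates(y, G={3})
# theta_hat is piecewise constant: [1.0, 1.0, 1.0, 5.0, 5.0]
```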

To prove the posterior contraction, we will apply Lemma A.3. We define the following testing function

$$ \begin{array}{ll} \phi(y) = 1\{&\|\widehat\theta_{G}-\theta^{*}\|\geq \sqrt{n}\sigma^{*}\epsilon_{n}\text{ or\ } |\widehat{\sigma}^{2}_{G}-\sigma^{*2}|>\sigma^{*2}\epsilon_{n}\\ &\text{ for some\ } G\supset G^{*}, |G|\leq (1+\delta)|G^{*}|\} \end{array} $$
(A.1)

for some δ > 0, and define Cn and Bn as:

$$ \begin{array}{@{}rcl@{}} &&C_{n}=\{\theta: \|\theta-\theta^{*}\|\leq M\sqrt n\sigma^{*}\epsilon_{n}, (1-\epsilon_{n})/(1+\epsilon_{n})<\sigma^{2}/\sigma^{*2}<(1+\epsilon_{n})/(1-\epsilon_{n})\}^{c}\backslash B_{n},\\ &&B_{n}=\{\theta: \text{among all\ } \{\vartheta_{i}\}_{i\notin G^{*}}, \text{ at least } \delta|G^{*}| \text{ of them are greater than } \sigma\epsilon_{n}/n\}. \end{array} $$

Note that when GG, \(\|\widehat \theta _{G}(y)-\theta ^{*}\|^{2}\sim \sigma ^{*2}\chi ^{2}_{|G|+1}\) and \(\|y-\widehat \theta _{G}\|^{2}\)\(\sim \sigma ^{*2}\chi ^{2}_{n-|G|-1}\), thus by Lemma A.1, we have that

$$ P(\|\widehat\theta_{G}(y)-\theta^{*}\|\geq \sqrt{n}\sigma^{*}\epsilon_{n}\text{ or\ } |\widehat{\sigma}^{2}_{G}-\sigma^{*2}|>\sigma^{*2}\epsilon_{n}) \leq \exp\{-c_{1}n{\epsilon_{n}^{2}}\}, $$

for some constant c1, since \(|G|=O(|G^{*}|)\prec n{\epsilon _{n}^{2}}\) and 𝜖n ≺ 1. Therefore,

$$ E_{(\theta^{*},\sigma^{*2})}\phi(y)\leq {n-1 \choose (1+\delta)|G^{*}|}\exp\{-c_{1}n{\epsilon_{n}^{2}}\}=\exp\{-c_{1}^{\prime}n{\epsilon_{n}^{2}}\}, $$
(A.2)

as long as \(n{\epsilon_{n}^{2}}/[|G^{*}|\log n]\) is sufficiently large.

For any (𝜃,σ2) ∈ Cn satisfying \(\|\theta -\theta ^{*}\|> M\sqrt n\sigma ^{*}\epsilon _{n}\) and σ2/σ∗2 ≤ (1 + 𝜖n)/(1 − 𝜖n), we define \(\widehat G=\{i:\theta _{i}-\theta _{i-1}\geq \sigma \epsilon _{n}/n^{2}\}\cup G^{*}\) (hence \(|\widehat G|\leq (1+\delta)|G^{*}|\), since \(\theta\notin B_{n}\)), thus

$$ \begin{array}{@{}rcl@{}} && P_{(\theta,\sigma^{2})}(\|\widehat\theta_{\widehat G}(y)-\theta^{*}\|\leq \sqrt{n}\sigma^{*}\epsilon_{n}) = P_{(\theta,\sigma^{2})}(\|\widehat\theta_{\widehat G}(y)-\widehat\theta_{\widehat G}(\theta) + \widehat\theta_{\widehat G}(\theta)-\theta^{*}\|\leq \sqrt{n}\sigma^{*}\epsilon_{n})\\ &\leq&P_{(\theta,\sigma^{2})}(\|\widehat\theta_{\widehat G}(y)-\widehat\theta_{\widehat G}(\theta)\| \geq \| \widehat\theta_{\widehat G}(\theta)-\theta^{*}\|-\sqrt{n}\sigma^{*}\epsilon_{n}) \\ &\leq& P_{(\theta,\sigma^{2})}(\|\widehat\theta_{\widehat G}(y)-\widehat\theta_{\widehat G}(\theta)\| \geq M\sqrt{n}\sigma^{*}\epsilon_{n}-\sqrt{n}\sigma\epsilon_{n}-\sqrt{n}\sigma^{*}\epsilon_{n}) \\ &\leq&P\left( \chi_{|\widehat G|+1}^{2}\geq \left[\sqrt{\frac{1-\epsilon_{n}}{1+\epsilon_{n}}} (M-1)-1\right]^{2} n{\epsilon_{n}^{2}}\right)\leq \exp\{-c_{2}n{\epsilon_{n}^{2}}\} \end{array} $$

for some c2 given a large M, where the second inequality is due to the fact that \(\|\widehat \theta _{\widehat G}(\theta )-\theta \|\leq \sqrt {n}\sigma \epsilon _{n}\) when \(\theta\notin B_{n}\).

For any (𝜃,σ2) ∈ Cn satisfying σ2/σ∗2 < (1 − 𝜖n)/(1 + 𝜖n) or σ2/σ∗2 > (1 + 𝜖n)/(1 − 𝜖n),

$$ \begin{array}{@{}rcl@{}} && P_{(\theta,\sigma^{2})}(|\widehat{\sigma}^{2}_{G}-\sigma^{*2}|<\sigma^{*2}\epsilon_{n}) = P_{(\theta,\sigma^{2})}(|\|y-\widehat\theta_{G}\|^{2}/[\sigma^{*2}(n-|G|-1)]-1|<\epsilon_{n})\\ &\leq & P_{(\theta,\sigma^{2})}(1-\epsilon_{n}< \|y-\widehat\theta_{G}\|^{2}/[\sigma^{*2}(n-|G|-1)]<1+\epsilon_{n})\\ &\leq &P_{(\theta,\sigma^{2})}\left( \left|\frac{\|y-\widehat\theta_{G}(y)\|^{2}}{\sigma^{2}}-(n-|G|-1)\right|>(n-|G|-1)\epsilon_{n}\right)\\ &=&P_{(\theta,\sigma^{2})}\left( |\chi_{n-|G|-1}^{2}(\lambda)-(n-|G|-1)|>(n-|G|-1)\epsilon_{n}\right)\leq \exp\{-c_{2}^{\prime}n{\epsilon_{n}^{2}}\} \end{array} $$

for some \(c_{2}^{\prime }\), where the noncentrality parameter \(\lambda <n{\epsilon _{n}^{2}}\prec (n-|G|-1)\epsilon _{n}\).

Combining the results from the previous two paragraphs, it is easy to obtain that

$$ \sup_{(\theta,\sigma^{2})\in C_{n}}E_{(\theta,\sigma^{2})}[1-\phi(y)]\leq \max\{\exp(-c_{2}n{\epsilon_{n}^{2}}),\exp(-c_{2}^{\prime}n{\epsilon_{n}^{2}})\}. $$
(A.3)

Now we consider the marginal density m(y) of the data y. With probability \(P(\|\varepsilon \|\leq 2\sqrt n\sigma ^{*})\), which converges to 1,

$$ \begin{array}{@{}rcl@{}} \frac{m(y)}{f^{*}(y)}&=&{\int}_{\sigma^{2}}{\int}_{\theta} \left(\frac{\sigma^{*}}{\sigma}\right)^{n}\exp\left\{-\frac{\|\theta^{*}-\theta+\varepsilon\|^{2}}{2\sigma^{2}}+\frac{\|\varepsilon\|^{2}}{2\sigma^{*2}}\right\}\pi(\theta,\sigma^{2}) d\theta d\sigma^{2}\\ &=&{\int}_{\sigma^{2}}{\int}_{\theta}\exp\left\{-\frac{\|\theta^{*}-\theta\|^{2}}{2\sigma^{2}} -\frac{(\theta^{*}-\theta)^{T}\varepsilon}{\sigma^{2}}+\frac{\|\varepsilon\|^{2}}{2\sigma^{*2}}-\frac{\|\varepsilon\|^{2}}{2\sigma^{2}}-n\log(\sigma/\sigma^{*})\right\} \\ &&\pi(\theta,\sigma^{2})d\theta d\sigma^{2}\\ &\geq& \pi\left(\max\{|\theta_{1}-\theta_{1}^{*}|, |\vartheta_{i}-\vartheta_{i}^{*}|\}/\sigma\leq |G^{*}|\log n/n^{2},\ 0\leq\sigma^{2}-\sigma^{*2}\leq \sigma^{*2}|G^{*}|\log n/n\right)\\ &&\times\exp\{-c_{3}^{\prime}|G^{*}|\log n \} \end{array} $$

for some constant \(c_{3}^{\prime }\). Besides,

$$ \begin{array}{@{}rcl@{}} &&\pi(\max\{|\theta_{1}-\theta_{1}^{*}|, |\vartheta_{i}-\vartheta_{i}^{*}|\}/\sigma\leq |G^{*}|\log n/n^{2},\ 0\leq\sigma^{2}-\sigma^{*2}\leq\sigma^{*2} |G^{*}|\log n/n)\\ &\geq&\pi_{\sigma}(\sigma^{*2})\cdot O(\sigma^{*2}|G^{*}|\log n/n)\cdot \underline\pi_{\theta}\cdot O(|G^{*}|\log n/n^{2}) \cdot\underline\pi_{\vartheta}^{|G^{*}|} [|G^{*}|\log n/n^{2}]^{|G^{*}|}\\ &&\cdot\,[\pi_{\vartheta}([-|G^{*}|\log n/n^{2}, |G^{*}|\log n/n^{2}])]^{n}\\ &=&\exp\{-c_{3}^{\prime\prime}|G^{*}|\log n\} \quad\text{ (by the conditions imposed on the prior specifications)} \end{array} $$

for some constant \(c_{3}^{\prime \prime }\). Thus

$$ m(y)/f^{*}(y)\geq \exp\{-(c_{3}^{\prime}+c_{3}^{\prime\prime})|G^{*}|\log n\} =\exp\{-c_{3}n{\epsilon_{n}^{2}}\}, \text{ with probability tending to 1} $$
(A.4)

for some c3, where c3 can be sufficiently small when \(n{\epsilon _{n}^{2}}/[|G^{*}|\log n]\) is large enough.

At last, we study the prior probability of the set Bn. Due to the prior independence of the \(\vartheta _{i}\)'s, \(\pi(B_{n}) = \pi[\text{Bin}(n-1-|G^{*}|,p) > \delta|G^{*}|]\), where \(p \leq (1/n)^{1+u}\). By Lemma A.2,

$$ \pi(B_{n}) \leq \exp\{ -c_{4}\delta |G^{*}|\log n\} $$
(A.5)

for some c4. Combining results (A.2), (A.3), (A.4) and (A.5), we apply Lemma A.3 to obtain the posterior consistency result that

$$ \pi(B_{n}\cup C_{n}|y) \rightarrow^{p} 0, $$

given sufficiently large constants δ and \(n{\epsilon_{n}^{2}}/[|G^{*}|\log n]\). □

Proof 3 (Proof of Theorem 3.1).

Consider y = 𝜃 + d, where \(\theta _{i}^{*}\equiv 0\) for all i and the error d is the vector of order statistics of standard normal variables, i.e., the density of d is \(f(d)=n!\prod \phi (d_{i})1(d_{1}\leq d_{2}\leq \dots \leq d_{n})\), where ϕ denotes the standard normal density. The prior of 𝜃 follows \(\pi (\theta )=\pi _{1}(\theta _{1}){\prod }_{i=2}^{n}\pi _{t,s}(\theta _{i}-\theta _{i-1})\), where \(\pi _{1}\) is the density of N(0,λ1), and πt,s is the density of a t distribution with a tiny scale parameter satisfying − log s ≍ log n, i.e., the conditions in Corollary 2.1 hold. We consider the misspecified posterior of the form π(𝜃|Dn) ∝ exp{−∥y − 𝜃∥2/2}π(𝜃).

Define \(\mu \in \mathbb {R}^{n}\) as μi = 0 for all 1 ≤ i ≤ k = 3n/4, and μi = Z0.25/2 for i > k, where Z0.25 is the right 25% quantile of the standard normal distribution; thus \(\|\mu -\theta ^{*}\|^{2}\asymp n\).

Let Δ𝜃 be any vector such that \(\|{\Delta }\theta \|^{2}\leq M\log n\). Then

$$ -\log\left( \frac{\pi(\mu+{\Delta}\theta)}{\pi(\theta^{*}+{\Delta}\theta)}\right) =-\log\left( \frac{\pi_{t,s}(Z_{0.25}/2+{\Delta}\theta_{k})}{\pi_{t,s}({\Delta}\theta_{k})}\right) = O(-\log s) = O(\log n). $$

And

$$ \begin{array}{@{}rcl@{}} &&\log\left( \frac{\exp\{-(y-\mu-{\Delta}\theta)^{2}/2\}}{\exp\{-(y-\theta^{*}-{\Delta}\theta)^{2}/2\}}\right) =[(y-\theta^{*}-{\Delta}\theta)^{2}-(y-\mu-{\Delta}\theta)^{2}]/2\\ &=&\frac{1}{2}\sum\limits_{i=k+1}^{n}[(y_{i}-{\Delta}\theta_{i})^{2}-(y_{i}-Z_{0.25}/2-{\Delta}\theta_{i})^{2}]=\frac{1}{2}\sum\limits_{i=k+1}^{n}[(y_{i}-{\Delta}\theta_{i})Z_{0.25}-\frac{Z_{0.25}^{2}}{4}]\\ &\geq &\frac{1}{2}\left[(\|y_{k+1:n}\|_{1}-\sqrt{nM\log n/4})Z_{0.25}-\frac{nZ_{0.25}^{2}}{16} \right]\geq cn, \end{array} $$

for some positive constant c given sufficiently large n, where the inequalities above hold since \(y_{n}\geq y_{n-1}\geq \cdots \geq y_{k+1}\), and \(y_{k+1}\approx Z_{0.25}\) with high probability, by large-sample empirical quantile theory.

Combining the above two results, we have that the posterior density satisfies π(μ + Δ𝜃|Dn) ≫ π(𝜃 + Δ𝜃|Dn) for any \(\|{\Delta }\theta \|^{2}\leq M\log n\) with high probability. Therefore, more posterior mass is distributed within the \(\sqrt {M\log n}\)-radius ball centered at μ than in the one centered at the true parameter 𝜃. □
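The empirical-quantile fact used in the proof above can be checked numerically (an illustration only, not part of the argument): for standard normal data, the (3n/4)-th order statistic concentrates near the right 25% quantile Z0.25 ≈ 0.6745. The sample size and seed below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
d = np.sort(rng.standard_normal(n))   # order statistics d_1 <= ... <= d_n
k = 3 * n // 4
z_25 = 0.6745                         # right 25% quantile of N(0, 1)

# the (3n/4)-th order statistic is close to Z_0.25 for large n
assert abs(d[k] - z_25) < 0.02
```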

Proof 4 (Proof of Theorem 3.2).

The proof of this theorem is quite similar to that of Theorems 2.1 and 2.2. We define the same testing function as in that proof, and define the following two sets:

$$ \begin{array}{@{}rcl@{}} &&C_{n}=\{\theta: \|\theta-\theta^{*}\|\leq M\sqrt n\sigma^{*}\epsilon_{n}, (1-\epsilon_{n})/(1+\epsilon_{n})<\sigma^{2}/\sigma^{*2}<(1+\epsilon_{n})/(1-\epsilon_{n})\}^{c}\backslash B_{n},\\ &&B_{n}=\{\theta: \text{among all} \ \{\theta_{i}-\theta_{i-1}\}_{i=2}^{n}, \text{ at least } \delta \text{ of them are greater than } \sigma\epsilon_{n}/n\}. \end{array} $$

Using the same arguments, one can still establish exponential separation results (A.2) and (A.3).

To establish (A.4), we notice that

$$ \begin{array}{@{}rcl@{}} \frac{m(y)}{f^{*}(y)}&=&{\int}_{\sigma^{2}}{\int}_{\theta} \left(\frac{\sigma^{*}}{\sigma}\right)^{n}\exp\left\{-\frac{\|\theta^{*}-\theta+\varepsilon\|^{2}}{2\sigma^{2}}+\frac{\|\varepsilon\|^{2}}{2\sigma^{*2}}\right\}\pi(\theta,\sigma^{2}) d\theta d\sigma^{2}\\ &=&{\int}_{\sigma^{2}}{\int}_{\theta}\exp\left\{-\frac{\|\theta^{*}-\theta\|^{2}}{2\sigma^{2}} -\frac{(\theta^{*}-\theta)^{T}\varepsilon}{\sigma^{2}}+\frac{\|\varepsilon\|^{2}}{2\sigma^{*2}}-\frac{\|\varepsilon\|^{2}}{2\sigma^{2}}-n\log(\sigma/\sigma^{*})\right\} \pi(\theta,\sigma^{2})d\theta d\sigma^{2}\\ &\geq& \pi(\max\{|\theta_{i}-\theta_{i}^{*}|\}/\sigma\leq \log n/n,\ 0\leq\sigma^{2}-\sigma^{*2}\leq \sigma^{*2}\log n/n)\exp\{-c_{3}^{\prime}\log n \} \end{array} $$

for some constant \(c_{3}^{\prime }\) and

$$ \begin{array}{@{}rcl@{}} &&\pi(\max\{|\theta_{i}-\theta_{i}^{*}|\}/\sigma\leq \log n/n, 0\leq\sigma^{2}-\sigma^{*2}\leq\sigma^{*2} \log n/n)\\ &\geq &\sum\limits_{r}\pi(\max\{|\theta_{r(1)}-\theta_{r(1)}^{*}|, |\theta_{r(i)}-\theta_{r(i-1)}|\}/\sigma\leq |G^{*}|\log n/n^{2}, 0\\ &\leq&\sigma^{2}-\sigma^{*2}\leq\sigma^{*2} \log n/n|r)\pi(r).\\ \end{array} $$

This ensures (A.4).

As for the prior probability of Bn: if the scale parameter of the t distribution is sufficiently small, i.e., \(s = n^{-w}\) for some large w, and \({\int }_{-\epsilon _{n}/n^{2}}^{\epsilon _{n}/n^{2}}\pi _{t,s}(x)dx\geq 1-1/n^{1+u}\) for some sufficiently large u, where πt,s denotes the t density function with scale parameter s, then for any ranking r,

$$ \pi(B_{n}|r)\leq 1-\pi(\max\{\theta_{r(i)}-\theta_{r(i-1)}\}\leq \sigma\epsilon_{n}/n^{2} |r) \leq 1-(1-1/n^{1+u})^{n} \approx n^{-u}. $$

This hence implies that − log(π(Bn)) ≥ u log n. □


Cite this article

Song, Q., Cheng, G. Bayesian Fusion Estimation via t Shrinkage. Sankhya A 82, 353–385 (2020). https://doi.org/10.1007/s13171-019-00177-0


Keywords and phrases

  • t shrinkage prior
  • Bayesian fusion
  • Bayesian clustering
  • Posterior consistency

AMS (2000) subject classification

  • Primary 62F15; Secondary 62J07