Abstract
The drastic improvement in data collection and acquisition technologies has enabled scientists to collect vast amounts of data. With growing dataset sizes typically comes growing complexity, both of the data structures and of the models needed to account for them. Estimating the parameters of such complex models poses a great challenge to current statistical methods. This paper proposes a blockwise consistency approach as a potential solution to the problem, which works by iteratively finding consistent estimates for each block of parameters conditional on the current estimates of the parameters in the other blocks. The blockwise consistency approach decomposes the high-dimensional parameter estimation problem into a series of lower-dimensional parameter estimation problems, which often have much simpler structures than the original problem and can thus be solved easily. Moreover, under the framework provided by the blockwise consistency approach, a variety of methods, such as Bayesian and frequentist methods, can be used jointly to achieve a consistent estimator for the original high-dimensional complex model. The blockwise consistency approach is illustrated using high-dimensional linear regression with both univariate and multivariate responses. The results of both problems show that the blockwise consistency approach can provide drastic improvements over existing methods. Extending the blockwise consistency approach to many other complex models is straightforward.
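The iterative scheme described in the abstract can be sketched in a few lines of code. The following is a minimal illustration, not the authors' estimator: it assumes a sparse linear model whose coefficients are split into blocks, and it uses a coordinate-wise Lasso pass as the per-block estimation routine; the function names, the blocking, and the choice of regularization parameter are all hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def blockwise_estimate(X, y, blocks, lam=0.1, sweeps=50):
    """Blockwise iteration: cycle over blocks of coefficients, re-estimating
    each block on the partial residual while the other blocks are held fixed.
    Each block update here is a coordinate-wise Lasso pass (illustrative only).
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)  # squared column norms
    for _ in range(sweeps):
        for block in blocks:
            for j in block:
                # partial residual: remove predictor j's current contribution
                r = y - X @ beta + X[:, j] * beta[j]
                beta[j] = soft_threshold(X[:, j] @ r, n * lam) / col_ss[j]
    return beta
```

Each sweep mirrors the conditional estimation step of the blockwise approach: the held-out blocks are treated as known while the active block is re-estimated. Any consistent per-block estimator (for instance, a Bayesian posterior-mode estimate) could replace the Lasso pass here.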
References
Banerjee, M., Durot, C. and Sen, B. (2019). Divide and conquer in non-standard problems and the super-efficiency phenomenon. Ann. Stat. 47, 720–757.
Barretina, J., Caponigro, G., Stransky, N., Venkatesan, K., Margolin, A., Kim, S., Wilson, C., Lehar, J., Kryukov, G., Sonkin, D., Reddy, A., Liu, M., Murray, L., Berger, M., Monahan, J., Morais, P., Meltzer, J., Korejwa, A., Jane-Valbuena, J., Mapa, F., Thibault, J., Bric-Furlong, E., Raman, P., Shipway, A. and Engels, I. (2012). The Cancer Cell Line Encyclopedia enables predictive modeling of anticancer drug sensitivity. Nature 483, 603–607.
Bhadra, A. and Mallick, B. (2013). Joint high-dimensional Bayesian variable and covariance selection with an application to eQTL analysis. Biometrics 69, 447–457.
Breheny, P. and Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat. 5, 232–252.
Cai, T., Li, H., Liu, W. and Xie, J. (2013). Covariate-adjusted precision matrix estimation with an application in genetical genomics. Biometrika 100, 139–156.
Chen, J. and Chen, Z. (2008). Extended Bayesian information criterion for model selection with large model space. Biometrika 95, 759–771.
Dempster, A. (1972). Covariance selection. Biometrics 28, 157–175.
Duguet, M. (1997). When helicase and topoisomerase meet. J. Cell Sci. 110, 1345–1350.
Efron, B. and Tibshirani, R. (1993). An introduction to the bootstrap. Chapman & Hall/CRC, Boca Raton.
Fan, J., Feng, Y., Saldana, D. F., Samworth, R. and Wu, Y. (2015). Sure independence screening. CRAN R package.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360.
Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). J. R. Stat. Soc. Ser. B 70, 849–911.
Fan, J., Samworth, R. and Wu, Y. (2009). Ultrahigh dimensional feature selection: beyond the linear model. J. Mach. Learn. Res. 10, 1829–1853.
Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. Ann. Stat. 38, 3567–3604.
Fan, J., Xue, L. and Zou, H. (2014). Strong oracle optimality of folded concave penalized estimation. Ann. Stat. 42, 819–849.
Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika 80, 27–38.
Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441.
Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22.
Friedman, J., Hastie, T. and Tibshirani, R. (2015). glasso: graphical lasso estimation of Gaussian graphical models. CRAN R package.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741.
Guo, N., Wan, Y., Tosun, K., Lin, H., Msiska, Z., Flynn, D., Remick, S., Vallyathan, V., Dowlati, A., Shi, X., Castranova, V., Beer, D. and Qian, Y. (2008). Confirmation of gene expression-based prediction of survival in non-small cell lung cancer. Clin. Cancer Res. 14, 8213–8220.
Hamburg, M. and Collins, F. (2010). The path to personalized medicine. New Engl. J. Med. 363, 301–304.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The elements of statistical learning. Springer, Berlin.
Li, R., Lin, D. and Li, B. (2013). Statistical inference in massive data sets. Appl. Stoch. Model. Bus. Ind. 29, 399–409.
Li, X., Xu, S., Cheng, Y. and Shu, J. (2016). HSPB1 polymorphisms might be associated with radiation-induced damage risk in lung cancer patients treated with radiotherapy. Tumour Biol. 37, 5743–5749.
Liang, F., Jia, B., Xue, J., Li, Q. and Luo, Y. (2018). An imputation-regularized optimization algorithm for high-dimensional missing data problems and beyond. J. R. Stat. Soc. Ser. B 80, 899–926.
Liang, F., Song, Q. and Qiu, P. (2015). An equivalent measure of partial correlation coefficients for high-dimensional Gaussian graphical models. J. Am. Stat. Assoc. 110, 1248–1265.
Liang, F., Song, Q. and Yu, K. (2013). Bayesian subset modeling for high-dimensional generalized linear models. J. Am. Stat. Assoc. 108, 589–606.
Liu, H., Lafferty, J. and Wasserman, L. (2009). The nonparanormal: semiparametric estimation of high dimensional undirected graphs. J. Mach. Learn. Res. 10, 2295–2328.
Mazumder, R., Friedman, J. and Hastie, T. (2011). SparseNet: coordinate descent with nonconvex penalties. J. Am. Stat. Assoc. 106, 1125–1138.
Mazumder, R. and Hastie, T. (2012). The graphical lasso: new insights and alternatives. Electron. J. Stat. 6, 2125–2149.
Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Stat. 34, 1436–1462.
Peng, J., Zhu, J., Bergamaschi, A., Han, W., Noh, D.-Y., Pollack, J. R. and Wang, P. (2010). Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Ann. Appl. Stat. 4, 53–77.
Peng, Z., Wu, T., Xu, Y., Yan, M. and Yin, W. (2016). Coordinate friendly structures, algorithms and applications. Annals of Mathematical Sciences and Applications 1, 57–119.
Raskutti, G., Wainwright, M. and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Trans. Inf. Theory 57, 6976–6994.
Rothman, A. (2015). MRCE: multivariate regression with covariance estimation. CRAN R package.
Rothman, A., Levina, E. and Zhu, J. (2010). Sparse multivariate regression with covariance estimation. J. Comput. Graph. Stat. 19, 947–962.
Sofer, T., Dicker, L. and Lin, X. (2014). Variable selection for high dimensional multivariate outcomes. Stat. Sin. 24, 1633–1654.
Song, Q. and Liang, F. (2015a). High dimensional variable selection with reciprocal L1-regularization. J. Am. Stat. Assoc. 110, 1607–1620.
Song, Q. and Liang, F. (2015b). A split-and-merge Bayesian variable selection approach for ultra-high dimensional regression. J. R. Stat. Soc. Ser. B 77, 947–972.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58, 267–288.
Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 109, 475–494.
Tseng, P. and Yun, S. (2009). A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, Series B 117, 387–423.
Turlach, B., Venables, W. and Wright, S. (2005). Simultaneous variable selection. Technometrics 47, 349–363.
van der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence and empirical processes: with applications to statistics. Springer, New York.
Vershynin, R. (2015). Estimation in high dimensions: a geometric perspective. In Sampling theory, a renaissance, Pfander, G. (ed.), Birkhäuser, Cham, p. 3–66.
Wang, J. (2015). Joint estimation of sparse multivariate regression and conditional graphical models. Stat. Sin. 25, 831–851.
Weickert, C. E. (2009). Transcriptome analysis of male–female differences in prefrontal cortical development. Molecular Psychiatry 14, 558–561.
Witten, D., Friedman, J. and Simon, N. (2011). New insights and faster computations for the graphical lasso. J. Comput. Graph. Stat. 20, 892–900.
Xue, J. and Liang, F. (2019). Double-parallel Monte Carlo for Bayesian analysis of big data. Stat. Comput. 29, 23–32.
Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94, 19–35.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38, 894–942.
Zhang, Y., Duchi, J. and Wainwright, M. (2013). Divide and conquer kernel ridge regression. In Conference on Learning Theory, pp. 592–617.
Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. J. Mach. Learn. Res. 7, 2541–2563.
Zou, H. (2006). The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101, 1418–1429.
Acknowledgments
Liang’s research was supported in part by the grants DMS-1612924, DMS/NIGMS R01-GM117597, and NIGMS R01-GM126089. The authors thank the editor, associate editor, and two referees for their helpful comments, which have led to significant improvement of this paper.
Appendix
Proof of Theorem 2.1
We follow the proof of Theorem 2.4.3 of van der Vaart and Wellner (1996). By the symmetrization Lemma 2.3.1 of van der Vaart and Wellner (1996), measurability of the class \(\mathcal {F}_{n}\), and Fubini’s theorem,

\(E^{*} \sup _{f \in \mathcal {F}_{n}} \left | (\mathbb {P}_{n}-P)f \right | \leq 2 E^{*}_{x} E_{\epsilon } \sup _{f \in \mathcal {F}_{n}} \left | \frac {1}{n} \sum _{i=1}^{n} \epsilon _{i} f(x_{i}) \right |,\)
where 𝜖i are i.i.d. Rademacher random variables with P(𝜖i = + 1) = P(𝜖i = − 1) = 1/2, and E∗ denotes the outer expectation.
By condition (B2)-(a), 2E∗[mn(x)1(mn(x) > M)] → 0 for sufficiently large M. To prove convergence in mean, it thus suffices to show that the first term converges to zero for fixed M. Fix x1,...,xn, and let \(\mathcal {H}\) be an 𝜖-net in \(L_{1}(\mathbb {P}_{n})\) over \(\mathcal {G}_{M}\); then
where the cardinality of \(\mathcal {H}\) can be chosen equal to \(N(\epsilon ,\mathcal {G}_{n,M},L_{1}(\mathbb {P}_{n}))\). Bounding the L1-norm on the right by the Orlicz norm ψ2 and applying the maximal inequality (Lemma 2.2.2 of van der Vaart and Wellner (1996)) together with Hoeffding’s inequality, it can be shown that
where K is a constant, and P∗ denotes outer probability. It has been shown that the left side of Eq. A.1 converges to zero in probability. Since it is bounded by M, its expectation with respect to x1,…,xn converges to zero by the dominated convergence theorem.
This concludes the proof that \(\sup _{\theta ^{(s)} \in {\Theta }_{n}^{(s)},\hat {\boldsymbol {\theta }}_{t-1}^{(-s)}\in {\Theta }_{n,T}^{(-s)}} \left | \widehat {G}_{n}(\theta ^{(s)}|\hat {\boldsymbol {\theta }}_{t-1}^{(-s)}) - G_{n}(\theta ^{(s)}|\hat {\boldsymbol {\theta }}_{t-1}^{(-s)}) \right | \rightarrow _{p} 0\) in mean. Further, by Markov’s inequality, we conclude that Eq. 2.6 holds.
Proof of Theorem 2.2
Since both \(\widehat {G}_{n}(\theta ^{(s)}|\hat {\boldsymbol {\theta }}_{t-1}^{(-s)})\) and \(G_{n}(\theta ^{(s)}|\hat {\boldsymbol {\theta }}_{t-1}^{(-s)})\) are continuous in 𝜃(s) as implied by the continuity of log π(x|𝜃) in 𝜃, the remaining part of the proof follows from Lemma A.1.
Lemma A.1.
Consider a sequence of functions Qt(𝜃, Xn) for t = 1,2,…,T. Suppose that the following conditions are satisfied:

(C1) For each t, Qt(𝜃, Xn) is continuous in 𝜃, and there exists a function \(Q_{t}^{*}(\boldsymbol {\theta })\) which is continuous in 𝜃 and uniquely maximized at \(\boldsymbol {\theta }_{*}^{(t)}\).

(C2) For any 𝜖 > 0, \(\sup _{\boldsymbol {\theta } \in {\Theta }_{n}\setminus B_{t}(\epsilon )} Q_{t}^{*}(\boldsymbol {\theta })\) exists, where \(B_{t}(\epsilon )=\{\boldsymbol {\theta }: \|\boldsymbol {\theta }-\boldsymbol {\theta }_{*}^{(t)}\| < \epsilon \}\). Let \(\delta _{t}=Q_{t}^{*}(\boldsymbol {\theta }_{*}^{(t)})- \sup _{\boldsymbol {\theta } \in {\Theta }_{n}\setminus B_{t}(\epsilon )} Q_{t}^{*}(\boldsymbol {\theta })\) and δ = mint∈{1,2,…,T} δt > 0.

(C3) \(\sup _{t\in \{1,2,\ldots ,T\}} \sup _{\boldsymbol {\theta } \in {\Theta }_{n}} |Q_{t}(\boldsymbol {\theta }, \boldsymbol {X}_{n})-Q_{t}^{*}(\boldsymbol {\theta })| \to _{p} 0\) as n →∞.

(C4) The penalty function \(P_{\lambda _{n}}(\boldsymbol {\theta })\) is non-negative and converges to 0 uniformly over the set \(\{\boldsymbol {\theta }_{*}^{(t)}: t = 1,2,\ldots ,T\}\) as n →∞, where λn is a regularization parameter whose value can depend on the sample size n.

Let \(\boldsymbol {\theta }_{n}^{(t)}=\arg \max _{\boldsymbol {\theta }\in {\Theta }_{n}} \{ Q_{t}(\boldsymbol {\theta }, \boldsymbol {X}_{n})-P_{\lambda _{n}}(\boldsymbol {\theta })\}\). Then the uniform convergence holds, i.e., \(\sup _{t \in \{1,2,\ldots ,T\}} \|\boldsymbol {\theta }_{n}^{(t)}- \boldsymbol {\theta }_{*}^{(t)}\|\to _{p} 0\).
Proof.
Consider two events: (i) \(\sup _{t \in \{1,2,\ldots ,T\}} \sup _{\boldsymbol {\theta } \in {\Theta }_{n}\setminus B_{t}(\epsilon )} |Q_{t}(\boldsymbol {\theta },\boldsymbol {X}_{n})-Q_{t}^{*}(\boldsymbol {\theta })| < \delta /2\), and (ii) \(\sup _{t\in \{1,2,\ldots ,T\}} \sup _{\boldsymbol {\theta } \in B_{t}(\epsilon )} |Q_{t}(\boldsymbol {\theta },\boldsymbol {X}_{n})-Q_{t}^{*}(\boldsymbol {\theta })| < \delta /2\). From event (i), we can deduce that for any t ∈ {1,2,…,T} and any 𝜃 ∈ Θn ∖ Bt(𝜖), \(Q_{t}(\boldsymbol {\theta }, \boldsymbol {X}_{n}) < Q_{t}^{*}(\boldsymbol {\theta })+\delta /2 \leq Q_{t}^{*}(\boldsymbol {\theta }_{*}^{(t)}) -\delta _{t} +\delta /2 \leq Q_{t}^{*}(\boldsymbol {\theta }_{*}^{(t)}) -\delta /2\). Therefore, \(Q_{t}(\boldsymbol {\theta }, \boldsymbol {X}_{n}) -P_{\lambda _{n}}(\boldsymbol {\theta }) < Q_{t}^{*}(\boldsymbol {\theta }_{*}^{(t)}) -\delta /2 -o(1)\) by condition (C4).
From event (ii), we can deduce that for any t ∈ {1,2,…,T} and any 𝜃 ∈ Bt(𝜖), \(Q_{t}(\boldsymbol {\theta }, \boldsymbol {X}_{n})> Q_{t}^{*}(\boldsymbol {\theta }) -\delta /2\), and in particular \(Q_{t}(\boldsymbol {\theta }_{*}^{(t)}, \boldsymbol {X}_{n})> Q_{t}^{*}(\boldsymbol {\theta }_{*}^{(t)}) -\delta /2\). Therefore, \(Q_{t}(\boldsymbol {\theta }_{*}^{(t)}, \boldsymbol {X}_{n})-P_{\lambda _{n}}(\boldsymbol {\theta }_{*}^{(t)}) > Q_{t}^{*}(\boldsymbol {\theta }_{*}^{(t)}) -\delta /2- o(1)\) by condition (C4).
If both events hold simultaneously, then we must have \({\boldsymbol {\theta }}_{n}^{(t)} \in B_{t}(\epsilon )\) for all t ∈ {1,2,…,T} as n →∞. By condition (C3), the probability that both events hold tends to 1. Therefore,

\(\sup _{t \in \{1,2,\ldots ,T\}} \|\boldsymbol {\theta }_{n}^{(t)}- \boldsymbol {\theta }_{*}^{(t)}\| \to _{p} 0,\)
which concludes the lemma. □
Proof of Theorem 2.3
Applying a Taylor expansion to \(G_{n}(\theta ^{(s)}|\hat {\boldsymbol {\theta }}_{t-1,g}^{(-s)})\) at \(\theta _{t*}^{(s)}\), we get \(G_{n}(\theta ^{(s)}|\hat {\boldsymbol {\theta }}_{t-1,g}^{(-s)}) - G_{n}(\theta _{t*}^{(s)}|\hat {\boldsymbol {\theta }}_{t-1,g}^{(-s)}) =O_{p}(1/n)\), which follows from condition (B5) and condition (B3), the latter stating that \(G_{n}(\theta ^{(s)}|\hat {\boldsymbol {\theta }}_{t-1,g}^{(-s)})\) is maximized at \(\theta _{t*}^{(s)}\). Therefore,
where 𝜖n = Op(1), and
By Bernstein’s inequality,
for \(\tilde {v}^{\prime } \geq (\tilde {v}_{1}+\cdots +\tilde {v}_{n})/n^{2}\) and \(\tilde {M}_{b}^{\prime }=\tilde {M}_{b}/n\). Applying Taylor expansion to the right of Eq. A.3 at z and combining with Eq. A.2 leads to
where \(K = 2+\frac {3}{\tilde {M}_{b}^{\prime }}O_{p}(1/n)= 2+\frac {3}{\tilde {M}_{b}}O_{p}(1)\), since the derivative \(|d[z^{2}/(\tilde {v}^{\prime }+\tilde {M}_{b}^{\prime } z)]/dz| \leq 3/\tilde {M}_{b}^{\prime }\).
By applying Lemma 2.2.10 of van der Vaart and Wellner (1996), for Orlicz norm ψ1, we have
for a constant K′ and any 𝜖 > 0. Since \(\tilde {v}^{\prime }=O(1/n)\), \(\tilde {M}_{b}^{\prime }=O(1/n)\), log(T) = o(n), and \(\log N(\epsilon ,\mathcal {G}_{n,M},L_{1}(\mathbb {P}_{n}))=o(n)\), we have
Therefore,
Note that, as implied by the proof of Lemma 2.2.10 of van der Vaart and Wellner (1996), Eq. A.5 holds for a general constant K in Eq. A.4. Then, by condition (B3), we must have the uniform convergence that \(\theta _{t,g}^{(s)} \in B_{t}(\epsilon )\) for all t as n →∞, where Bt(𝜖) is as defined in (B3). This statement can be proved by contradiction as follows:
Assume \(\theta _{t,g}^{(s)} \notin B_{i}(\epsilon )\) for some i ∈{1,2,…,T}. By the uniform convergence established in Theorem 2.1, \(\left |\widehat {G}_{n}(\theta _{t,g}^{(s)}| \hat {\boldsymbol {\theta }}_{t-1,g}^{(-s)})- G_{n}(\theta _{t,g}^{(s)} |\hat {\boldsymbol {\theta }}_{t-1,g}^{(-s)}) \right | =o_{p}(1)\). Further, by condition (B3) and the assumption \(\theta _{t,g}^{(s)} \notin B_{i}(\epsilon )\),
which contradicts the uniform convergence established in Eq. A.6. This concludes the proof.
Proof of Theorem 2.4
Define \(d_{t}^{(n)}:=\|\hat {\boldsymbol {\theta }}_{t}-\widetilde {\boldsymbol {\theta }}_{t}\|\), where n indicates the implicit dependence of \(\hat {\boldsymbol {\theta }}_{t}\) on n. Then
For the first component of the inequality (A.7), we define
which converges to zero in probability as n →∞, following from Theorems 2.2 and 2.3 for both types of consistent estimation procedures considered in the paper. For the second component of the inequality (A.7), we have
following from condition (B6).
Combining these bounds with the fact that d0 = 0, i.e., the two paths \(\{\hat \theta _{t}\}\) and \(\{\widetilde \theta _{t}\}\) start from the same point, we have
where the convergence is uniform over t as \(g_{n} \stackrel {p}{\to } 0\). Moreover, since \(\widetilde {\boldsymbol {\theta }}_{t}\) converges to a coordinatewise maximum point of \(E_{\boldsymbol {\theta }_{*}} \log \pi (X|\boldsymbol {\theta })\) under conditions (A1) and (A2), \(\hat {\boldsymbol {\theta }}_{t}\) will converge to the same point in probability. That is, \(\hat {\boldsymbol {\theta }}_{\infty }:=\lim _{t\rightarrow \infty }\hat {\boldsymbol {\theta }}_{t} \stackrel {p}{\to } \widetilde {\boldsymbol {\theta }}_{\infty }: =\lim _{t\rightarrow \infty } \widetilde {\boldsymbol {\theta }}_{t}\).
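For intuition, the recursive bound on \(d_{t}^{(n)}\) can be unwound as a geometric series. The following is a sketch under an assumption not spelled out here: that condition (B6) supplies a contraction factor ρ < 1 for the second component of inequality (A.7); the symbol ρ is introduced for illustration only.

```latex
d_{t}^{(n)} \le g_n + \rho\, d_{t-1}^{(n)}, \qquad d_{0}^{(n)} = 0
\;\Longrightarrow\;
d_{t}^{(n)} \le g_n \sum_{j=0}^{t-1} \rho^{\,j} \le \frac{g_n}{1-\rho},
```

so that \(d_{t}^{(n)} \to 0\) uniformly in t whenever \(g_{n} \stackrel{p}{\to} 0\), which is the uniform convergence claimed above.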
Shi, R., Liang, F., Song, Q. et al. A Blockwise Consistency Method for Parameter Estimation of Complex Models. Sankhya B 80 (Suppl 1), 179–223 (2018). https://doi.org/10.1007/s13571-018-0183-0