Abstract
The drastic improvement in data collection and acquisition technologies has enabled scientists to collect vast amounts of data. With growing dataset sizes typically comes growing complexity, both of the data structures and of the models needed to account for them. Estimating the parameters of such complex models poses a great challenge to current statistical methods. This paper proposes a blockwise consistency approach as a potential solution to the problem, which works by iteratively finding consistent estimates for each block of parameters conditional on the current estimates of the parameters in the other blocks. The blockwise consistency approach decomposes the high-dimensional parameter estimation problem into a series of lower-dimensional parameter estimation problems, which often have much simpler structures than the original problem and can thus be solved easily. Moreover, under the framework provided by the blockwise consistency approach, a variety of methods, such as Bayesian and frequentist methods, can be used jointly to achieve a consistent estimator for the original high-dimensional complex model. The blockwise consistency approach is illustrated using high-dimensional linear regression with both univariate and multivariate responses. The results of both problems show that the blockwise consistency approach can provide drastic improvements over existing methods. Extending the blockwise consistency approach to many other complex models is straightforward.
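The iterative scheme described in the abstract can be sketched in a few lines of code. The following is a minimal illustration, not the authors' estimator: it assumes a sparse linear model whose coefficients are split into blocks, and it uses a coordinate-wise Lasso pass as the per-block estimation routine; the function names, the blocking, and the choice of regularization parameter are all hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def blockwise_estimate(X, y, blocks, lam=0.1, sweeps=50):
    """Blockwise iteration: cycle over blocks of coefficients, re-estimating
    each block on the partial residual while the other blocks are held fixed.
    Each block update here is a coordinate-wise Lasso pass (illustrative only).
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)  # squared column norms
    for _ in range(sweeps):
        for block in blocks:
            for j in block:
                # partial residual: remove predictor j's current contribution
                r = y - X @ beta + X[:, j] * beta[j]
                beta[j] = soft_threshold(X[:, j] @ r, n * lam) / col_ss[j]
    return beta
```

Each sweep mirrors the conditional estimation step of the blockwise approach: the held-out blocks are treated as known while the active block is re-estimated. Any consistent per-block estimator (for instance, a Bayesian posterior-mode estimate) could replace the Lasso pass here.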
References
Banerjee, M., Durot, C. and Sen, B. (2019). Divide and conquer in non-standard problems and the super-efficiency phenomenon. Ann. Stat. 47, 720–757.
Barretina, J., Caponigro, G., Stransky, N., Venkatesan, K., Margolin, A., Kim, S., Wilson, C., Lehar, J., Kryukov, G., Sonkin, D., Reddy, A., Liu, M., Murray, L., Berger, M., Monahan, J., Morais, P., Meltzer, J., Korejwa, A., Jane-Valbuena, J., Mapa, F., Thibault, J., Bric-Furlong, E., Raman, P., Shipway, A. and Engels, I. (2012). The Cancer Cell Line Encyclopedia enables predictive modeling of anticancer drug sensitivity. Nature 483, 603–607.
Bhadra, A. and Mallick, B. (2013). Joint high-dimensional Bayesian variable and covariance selection with an application to eQTL analysis. Biometrics 69, 447–457.
Breheny, P. and Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat. 5, 232–252.
Cai, T., Li, H., Liu, W. and Xie, J. (2013). Covariate-adjusted precision matrix estimation with an application in genetical genomics. Biometrika 100, 139–156.
Chen, J. and Chen, Z. (2008). Extended Bayesian information criterion for model selection with large model space. Biometrika 95, 759–771.
Dempster, A. (1972). Covariance selection. Biometrics 28, 157–175.
Duguet, M. (1997). When helicase and topoisomerase meet. J. Cell Sci. 110, 1345–1350.
Efron, B. and Tibshirani, R. (1993). An introduction to the bootstrap. Chapman & Hall/CRC, Boca Raton.
Fan, J., Feng, Y., Saldana, D. F., Samworth, R. and Wu, Y. (2015). Sure independence screening. CRAN R package.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360.
Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). J. R. Stat. Soc. Ser. B 70, 849–911.
Fan, J., Samworth, R. and Wu, Y. (2009). Ultrahigh dimensional feature selection: beyond the linear model. J. Mach. Learn. Res. 10, 1829–1853.
Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. Ann. Stat. 38, 3567–3604.
Fan, J., Xue, L. and Zou, H. (2014). Strong oracle optimality of folded concave penalized estimation. Ann. Stat. 42, 819–849.
Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika 80, 27–38.
Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441.
Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22.
Friedman, J., Hastie, T. and Tibshirani, R. (2015). glasso: graphical lasso estimation of Gaussian graphical models. CRAN R package.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741.
Guo, N., Wan, Y., Tosun, K., Lin, H., Msiska, Z., Flynn, D., Remick, S., Vallyathan, V., Dowlati, A., Shi, X., Castranova, V., Beer, D. and Qian, Y. (2008). Confirmation of gene expression-based prediction of survival in non-small cell lung cancer. Clin. Cancer Res. 14, 8213–8220.
Hamburg, M. and Collins, F. (2010). The path to personalized medicine. New Engl. J. Med. 363, 301–304.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The elements of statistical learning. Springer, Berlin.
Li, R., Lin, D. and Li, B. (2013). Statistical inference in massive data sets. Appl. Stoch. Model. Bus. Ind. 29, 399–409.
Li, X., Xu, S., Cheng, Y. and Shu, J. (2016). HSPB1 polymorphisms might be associated with radiation-induced damage risk in lung cancer patients treated with radiotherapy. Tumour Biol. 37, 5743–5749.
Liang, F., Jia, B., Xue, J., Li, Q. and Luo, Y. (2018). An imputation-regularized optimization algorithm for high-dimensional missing data problems and beyond. J. R. Stat. Soc. Ser. B 80, 899–926.
Liang, F., Song, Q. and Qiu, P. (2015). An equivalent measure of partial correlation coefficients for high-dimensional Gaussian graphical models. J. Am. Stat. Assoc. 110, 1248–1265.
Liang, F., Song, Q. and Yu, K. (2013). Bayesian subset modeling for high-dimensional generalized linear models. J. Am. Stat. Assoc. 108, 589–606.
Liu, H., Lafferty, J. and Wasserman, L. (2009). The nonparanormal: semiparametric estimation of high dimensional undirected graphs. J. Mach. Learn. Res. 10, 2295–2328.
Mazumder, R., Friedman, J. and Hastie, T. (2011). SparseNet: coordinate descent with nonconvex penalties. J. Am. Stat. Assoc. 106, 1125–1138.
Mazumder, R. and Hastie, T. (2012). The graphical lasso: new insights and alternatives. Electron. J. Stat. 6, 2125–2149.
Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Stat. 34, 1436–1462.
Peng, J., Zhu, J., Bergamaschi, A., Han, W., Noh, D.-Y., Pollack, J. R. and Wang, P. (2010). Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Ann. Appl. Stat. 4, 53–77.
Peng, Z., Wu, T., Xu, Y., Yan, M. and Yin, W. (2016). Coordinate friendly structures, algorithms and applications. Annals of Mathematical Sciences and Applications 1, 57–119.
Raskutti, G., Wainwright, M. and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Trans. Inf. Theory 57, 6976–6994.
Rothman, A. (2015). MRCE: multivariate regression with covariance estimation. CRAN R package.
Rothman, A., Levina, E. and Zhu, J. (2010). Sparse multivariate regression with covariance estimation. J. Comput. Graph. Stat. 19, 947–962.
Sofer, T., Dicker, L. and Lin, X. (2014). Variable selection for high dimensional multivariate outcomes. Stat. Sin. 24, 1633–1654.
Song, Q. and Liang, F. (2015a). High dimensional variable selection with reciprocal L1-regularization. J. Am. Stat. Assoc. 110, 1607–1620.
Song, Q. and Liang, F. (2015b). A split-and-merge Bayesian variable selection approach for ultra-high dimensional regression. J. R. Stat. Soc. Ser. B 77, 947–972.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58, 267–288.
Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 109, 475–494.
Tseng, P. and Yun, S. (2009). A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, Series B 117, 387–423.
Turlach, B., Venables, W. and Wright, S. (2005). Simultaneous variable selection. Technometrics 47, 349–363.
van der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence and empirical processes: with applications to statistics. Springer, New York.
Vershynin, R. (2015). Estimation in high dimensions: a geometric perspective. In Sampling theory, a renaissance, Pfander, G. (ed.), Birkhäuser, Cham, p. 3–66.
Wang, J. (2015). Joint estimation of sparse multivariate regression and conditional graphical models. Stat. Sin. 25, 831–851.
Weickert, C. E. (2009). Transcriptome analysis of male–female differences in prefrontal cortical development. Molecular Psychiatry 14, 558–561.
Witten, D., Friedman, J. and Simon, N. (2011). New insights and faster computations for the graphical lasso. J. Comput. Graph. Stat. 20, 892–900.
Xue, J. and Liang, F. (2019). Double-parallel Monte Carlo for Bayesian analysis of big data. Stat. Comput. 29, 23–32.
Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94, 19–35.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38, 894–942.
Zhang, Y., Duchi, J. and Wainwright, M. (2013). Divide and conquer kernel ridge regression. In Conference on Learning Theory, pp. 592–617.
Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. J. Mach. Learn. Res. 7, 2541–2563.
Zou, H. (2006). The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101, 1418–1429.
Acknowledgments
Liang’s research was supported in part by the grants DMS-1612924, DMS/NIGMS R01-GM117597, and NIGMS R01-GM126089. The authors thank the editor, associate editor, and two referees for their helpful comments, which have led to significant improvement of this paper.
Appendix
Proof of Theorem 2.1
We follow the proof of Theorem 2.4.3 of van der Vaart and Wellner (1996). By the symmetrization Lemma 2.3.1 of van der Vaart and Wellner (1996), measurability of the class \(\mathcal {F}_{n}\), and Fubini’s theorem,

\(E^{*} \sup _{f \in \mathcal {F}_{n}} \left | (\mathbb {P}_{n}-P)f \right | \leq 2 E^{*}_{x} E_{\epsilon } \sup _{f \in \mathcal {F}_{n}} \left | \frac {1}{n} \sum _{i=1}^{n} \epsilon _{i} f(x_{i}) \right |,\)
where 𝜖i are i.i.d. Rademacher random variables with P(𝜖i = + 1) = P(𝜖i = − 1) = 1/2, and E∗ denotes the outer expectation.
By condition (B2)-(a), 2E∗[mn(x)1(mn(x) > M)] → 0 for sufficiently large M. To prove convergence in mean, it thus suffices to show that the first term converges to zero for fixed M. Fix x1,...,xn, and let \(\mathcal {H}\) be an 𝜖-net in \(L_{1}(\mathbb {P}_{n})\) over \(\mathcal {G}_{M}\); then
where the cardinality of \(\mathcal {H}\) can be chosen equal to \(N(\epsilon ,\mathcal {G}_{n,M},L_{1}(\mathbb {P}_{n}))\). Bounding the L1-norm on the right by the Orlicz norm ψ2 and applying the maximal inequality (Lemma 2.2.2 of van der Vaart and Wellner (1996)) together with Hoeffding’s inequality, it can be shown that
where K is a constant, and P∗ denotes outer probability. It has been shown that the left side of Eq. A.1 converges to zero in probability. Since it is bounded by M, its expectation with respect to x1,…,xn converges to zero by the dominated convergence theorem.
This concludes the proof that \(\sup _{\theta ^{(s)} \in {\Theta }_{n}^{(s)},\hat {\boldsymbol {\theta }}_{t-1}^{(-s)}\in {\Theta }_{n,T}^{(-s)}} \left | \widehat {G}_{n}(\theta ^{(s)}|\hat {\boldsymbol {\theta }}_{t-1}^{(-s)}) - G_{n}(\theta ^{(s)}|\hat {\boldsymbol {\theta }}_{t-1}^{(-s)}) \right | \rightarrow _{p} 0\) in mean. Further, by Markov’s inequality, we conclude that Eq. 2.6 holds.
Proof of Theorem 2.2
Since both \(\widehat {G}_{n}(\theta ^{(s)}|\hat {\boldsymbol {\theta }}_{t-1}^{(-s)})\) and \(G_{n}(\theta ^{(s)}|\hat {\boldsymbol {\theta }}_{t-1}^{(-s)})\) are continuous in 𝜃(s) as implied by the continuity of log π(x|𝜃) in 𝜃, the remaining part of the proof follows from Lemma A.1.
Lemma A.1.
Consider a sequence of functions Qt(𝜃, Xn) for t = 1,2,…,T. Suppose that the following conditions are satisfied:

(C1) For each t, Qt(𝜃, Xn) is continuous in 𝜃, and there exists a function \(Q_{t}^{*}(\boldsymbol {\theta })\) which is continuous in 𝜃 and uniquely maximized at \(\boldsymbol {\theta }_{*}^{(t)}\).

(C2) For any 𝜖 > 0, \(\sup _{\boldsymbol {\theta } \in {\Theta }_{n}\setminus B_{t}(\epsilon )} Q_{t}^{*}(\boldsymbol {\theta })\) exists, where \(B_{t}(\epsilon )=\{\boldsymbol {\theta }: \|\boldsymbol {\theta }-\boldsymbol {\theta }_{*}^{(t)}\| < \epsilon \}\). Let \(\delta _{t}=Q_{t}^{*}(\boldsymbol {\theta }_{*}^{(t)})- \sup _{\boldsymbol {\theta } \in {\Theta }_{n}\setminus B_{t}(\epsilon )} Q_{t}^{*}(\boldsymbol {\theta })\) and δ = mint∈{1,2,…,T} δt > 0.

(C3) \(\sup _{t\in \{1,2,\ldots ,T\}} \sup _{\boldsymbol {\theta } \in {\Theta }_{n}} |Q_{t}(\boldsymbol {\theta }, \boldsymbol {X}_{n})-Q_{t}^{*}(\boldsymbol {\theta })| \to _{p} 0\) as n →∞.

(C4) The penalty function \(P_{\lambda _{n}}(\boldsymbol {\theta })\) is non-negative and converges to 0 uniformly over the set \(\{\boldsymbol {\theta }_{*}^{(t)}: t = 1,2,\ldots ,T\}\) as n →∞, where λn is a regularization parameter whose value can depend on the sample size n.

Let \(\boldsymbol {\theta }_{n}^{(t)}=\arg \max _{\boldsymbol {\theta }\in {\Theta }_{n}} \{ Q_{t}(\boldsymbol {\theta }, \boldsymbol {X}_{n})-P_{\lambda _{n}}(\boldsymbol {\theta })\}\). Then the uniform convergence holds, i.e., \(\sup _{t \in \{1,2,\ldots ,T\}} \|\boldsymbol {\theta }_{n}^{(t)}- \boldsymbol {\theta }_{*}^{(t)}\|\to _{p} 0\).
Proof.
Consider two events: (i) \(\sup _{t \in \{1,2,\ldots ,T\}} \sup _{\boldsymbol {\theta } \in {\Theta }_{n}\setminus B_{t}(\epsilon )} |Q_{t}(\boldsymbol {\theta },\boldsymbol {X}_{n})-Q_{t}^{*}(\boldsymbol {\theta })| < \delta /2\), and (ii) \(\sup _{t\in \{1,2,\ldots ,T\}} \sup _{\boldsymbol {\theta } \in B_{t}(\epsilon )} |Q_{t}(\boldsymbol {\theta },\boldsymbol {X}_{n})-Q_{t}^{*}(\boldsymbol {\theta })| < \delta /2\). From event (i), we can deduce that for any t ∈ {1,2,…,T} and any 𝜃 ∈ Θn ∖ Bt(𝜖), \(Q_{t}(\boldsymbol {\theta }, \boldsymbol {X}_{n}) < Q_{t}^{*}(\boldsymbol {\theta })+\delta /2 \leq Q_{t}^{*}(\boldsymbol {\theta }_{*}^{(t)}) -\delta _{t} +\delta /2 \leq Q_{t}^{*}(\boldsymbol {\theta }_{*}^{(t)}) -\delta /2\). Therefore, \(Q_{t}(\boldsymbol {\theta }, \boldsymbol {X}_{n}) -P_{\lambda _{n}}(\boldsymbol {\theta }) < Q_{t}^{*}(\boldsymbol {\theta }_{*}^{(t)}) -\delta /2 -o(1)\) by condition (C4).
From event (ii), we can deduce that for any t ∈ {1,2,…,T} and any 𝜃 ∈ Bt(𝜖), \(Q_{t}(\boldsymbol {\theta }, \boldsymbol {X}_{n})> Q_{t}^{*}(\boldsymbol {\theta }) -\delta /2\), and in particular \(Q_{t}(\boldsymbol {\theta }_{*}^{(t)}, \boldsymbol {X}_{n})> Q_{t}^{*}(\boldsymbol {\theta }_{*}^{(t)}) -\delta /2\). Therefore, \(Q_{t}(\boldsymbol {\theta }_{*}^{(t)}, \boldsymbol {X}_{n})-P_{\lambda _{n}}(\boldsymbol {\theta }_{*}^{(t)}) > Q_{t}^{*}(\boldsymbol {\theta }_{*}^{(t)}) -\delta /2- o(1)\) by condition (C4).
If both events hold simultaneously, then we must have \({\boldsymbol {\theta }}_{n}^{(t)} \in B_{t}(\epsilon )\) for all t ∈ {1,2,…,T} as n →∞. By condition (C3), the probability that both events hold tends to 1. Therefore,

\(\sup _{t \in \{1,2,\ldots ,T\}} \|\boldsymbol {\theta }_{n}^{(t)}- \boldsymbol {\theta }_{*}^{(t)}\| \to _{p} 0,\)
which concludes the lemma. □
Proof of Theorem 2.3
Applying a Taylor expansion to \(G_{n}(\theta ^{(s)}|\hat {\boldsymbol {\theta }}_{t-1,g}^{(-s)})\) at \(\theta _{t*}^{(s)}\), we get \(G_{n}(\theta ^{(s)}|\hat {\boldsymbol {\theta }}_{t-1,g}^{(-s)}) - G_{n}(\theta _{t*}^{(s)}|\hat {\boldsymbol {\theta }}_{t-1,g}^{(-s)}) =O_{p}(1/n)\), which follows from condition (B5) and condition (B3), the latter stating that \(G_{n}(\theta ^{(s)}|\hat {\boldsymbol {\theta }}_{t-1,g}^{(-s)})\) is maximized at \(\theta _{t*}^{(s)}\). Therefore,
where 𝜖n = Op(1), and
By Bernstein’s inequality,
for \(\tilde {v}^{\prime } \geq (\tilde {v}_{1}+\cdots +\tilde {v}_{n})/n^{2}\) and \(\tilde {M}_{b}^{\prime }=\tilde {M}_{b}/n\). Applying Taylor expansion to the right of Eq. A.3 at z and combining with Eq. A.2 leads to
where \(K = 2+\frac {3}{\tilde {M}_{b}^{\prime }}O_{p}(1/n)= 2+\frac {3}{\tilde {M}_{b}}O_{p}(1)\), since the derivative \(|d[z^{2}/(\tilde {v}^{\prime }+\tilde {M}_{b}^{\prime } z)]/dz| \leq 3/\tilde {M}_{b}^{\prime }\).
By applying Lemma 2.2.10 of van der Vaart and Wellner (1996), for Orlicz norm ψ1, we have
for a constant K′ and any 𝜖 > 0. Since \(\tilde {v}^{\prime }=O(1/n)\), \(\tilde {M}_{b}^{\prime }=O(1/n)\), log(T) = o(n), and \(\log N(\epsilon ,\mathcal {G}_{n,M},L_{1}(\mathbb {P}_{n}))=o(n)\), we have
Therefore,
Note that, as implied by the proof of Lemma 2.2.10 of van der Vaart and Wellner (1996), Eq. A.5 holds for a general constant K in Eq. A.4. Then, by condition (B3), we must have the uniform convergence that \(\theta _{t,g}^{(s)} \in B_{t}(\epsilon )\) for all t as n →∞, where Bt(𝜖) is as defined in (B3). This statement can be proved by contradiction as follows:
Assume \(\theta _{t,g}^{(s)} \notin B_{i}(\epsilon )\) for some i ∈{1,2,…,T}. By the uniform convergence established in Theorem 2.1, \(\left |\widehat {G}_{n}(\theta _{t,g}^{(s)}| \hat {\boldsymbol {\theta }}_{t-1,g}^{(-s)})- G_{n}(\theta _{t,g}^{(s)} |\hat {\boldsymbol {\theta }}_{t-1,g}^{(-s)}) \right | =o_{p}(1)\). Further, by condition (B3) and the assumption \(\theta _{t,g}^{(s)} \notin B_{i}(\epsilon )\),
which contradicts the uniform convergence established in Eq. A.6. This concludes the proof.
Proof of Theorem 2.4
Define \(d_{t}^{(n)}:=\|\hat {\boldsymbol {\theta }}_{t}-\widetilde {\boldsymbol {\theta }}_{t}\|\), where n indicates the implicit dependence of \(\hat {\boldsymbol {\theta }}_{t}\) on n. Then
For the first component of the inequality (A.7), we define
which converges to zero in probability as n →∞, following from Theorems 2.2 and 2.3 for both types of consistent estimation procedures considered in the paper. For the second component of the inequality (A.7), we have
following from condition (B6).
Combining these bounds with the fact that d0 = 0, i.e., the two paths \(\{\hat \theta _{t}\}\) and \(\{\widetilde \theta _{t}\}\) start from the same point, we have
where the convergence is uniform over t as \(g_{n} \stackrel {p}{\to } 0\). Moreover, since \(\widetilde {\boldsymbol {\theta }}_{t}\) converges to a coordinatewise maximum point of \(E_{\boldsymbol {\theta }_{*}} \log \pi (X|\boldsymbol {\theta })\) under conditions (A1) and (A2), \(\hat {\boldsymbol {\theta }}_{t}\) will converge to the same point in probability. That is, \(\hat {\boldsymbol {\theta }}_{\infty }:=\lim _{t\rightarrow \infty }\hat {\boldsymbol {\theta }}_{t} \stackrel {p}{\to } \widetilde {\boldsymbol {\theta }}_{\infty }: =\lim _{t\rightarrow \infty } \widetilde {\boldsymbol {\theta }}_{t}\).
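For intuition, the recursive bound on \(d_{t}^{(n)}\) can be unwound as a geometric series. The following is a sketch under an assumption not spelled out here: that condition (B6) supplies a contraction factor ρ < 1 for the second component of inequality (A.7); the symbol ρ is introduced for illustration only.

```latex
d_{t}^{(n)} \le g_n + \rho\, d_{t-1}^{(n)}, \qquad d_{0}^{(n)} = 0
\;\Longrightarrow\;
d_{t}^{(n)} \le g_n \sum_{j=0}^{t-1} \rho^{\,j} \le \frac{g_n}{1-\rho},
```

so that \(d_{t}^{(n)} \to 0\) uniformly in t whenever \(g_{n} \stackrel{p}{\to} 0\), which is the uniform convergence claimed above.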
Shi, R., Liang, F., Song, Q. et al. A Blockwise Consistency Method for Parameter Estimation of Complex Models. Sankhya B 80 (Suppl 1), 179–223 (2018). https://doi.org/10.1007/s13571-018-0183-0