Abstract
Group-level variance estimates of zero often arise when fitting multilevel or hierarchical linear models, especially when the number of groups is small. For situations where zero variances are implausible a priori, we propose a maximum penalized likelihood approach to avoid such boundary estimates. This approach is equivalent to estimating variance parameters by their posterior mode, given a weakly informative prior distribution. By choosing the penalty from the log-gamma family with shape parameter greater than 1, we ensure that the estimated variance will be positive. We suggest a default log-gamma(2,λ) penalty with λ→0, which ensures that the maximum penalized likelihood estimate is approximately one standard error from zero when the maximum likelihood estimate is zero, thus remaining consistent with the data while being nondegenerate. We also show that the maximum penalized likelihood estimator with this default penalty is a good approximation to the posterior median obtained under a noninformative prior.
Our default method provides better estimates of model parameters and standard errors than the maximum likelihood or the restricted maximum likelihood estimators. The log-gamma family can also be used to convey substantive prior information. In either case—pure penalization or prior information—our recommended procedure gives nondegenerate estimates and in the limit coincides with maximum likelihood as the number of groups increases.
References
Alderman, D., & Powers, D. (1980). The effects of special preparation on SAT-verbal scores. American Educational Research Journal, 17(2), 239–251.
Bates, D., & Maechler, M. (2010). lme4: Linear mixed-effects models using S4 classes. R package version 0.999375-37.
Bell, W. (1999). Accounting for uncertainty about variances in small area estimation. In Bulletin of the International Statistical Institute, 52nd session, Helsinki.
Borenstein, M., Hedges, L., Higgins, J., & Rothstein, H. (2009). Introduction to meta-analysis. Chichester: Wiley.
Box, G., & Cox, D. (1964). An analysis of transformations. Journal of the Royal Statistical Society. Series B, 26(2), 211–252.
Browne, W., & Draper, D. (2006). A comparison of Bayesian and likelihood methods for fitting multilevel models. Bayesian Analysis, 1(3), 473–514.
Ciuperca, G., Ridolfi, A., & Idier, J. (2003). Penalized maximum likelihood estimator for normal mixtures. Scandinavian Journal of Statistics, 30(1), 45–59.
Crainiceanu, C., & Ruppert, D. (2004). Likelihood ratio tests in linear mixed models with one variance component. Journal of the Royal Statistical Society. Series B, 66(1), 165–185.
Crainiceanu, C., Ruppert, D., & Vogelsang, T. (2003). Some properties of likelihood ratio tests in linear mixed models (Technical report). Available at http://www.orie.cornell.edu/~davidr/papers.
Curcio, D., & Verde, P. (2011). Comment on: Efficacy and safety of tigecycline: a systematic review and meta-analysis. Journal of Antimicrobial Chemotherapy, 66(12), 2893–2895.
DerSimonian, R., & Laird, N. (1986). Meta-analysis in clinical trials. Controlled Clinical Trials, 7(3), 177–188.
Dorie, V. (2013). Mixed methods for mixed models: Bayesian point estimation and classical uncertainty measures in multilevel models. PhD thesis, Columbia University.
Dorie, V., Liu, J., & Gelman, A. (2013). Bridging between point estimation and Bayesian inference for generalized linear models (Technical report). Department of Statistics, Columbia University.
Draper, D. (1995). Assessment and propagation of model uncertainty. Journal of the Royal Statistical Society. Series B, 57(1), 45–97.
Drum, M., & McCullagh, P. (1993). [Regression models for discrete longitudinal responses]: comment. Statistical Science, 8(3), 300–301.
Fay, R.E., & Herriot, R.A. (1979). Estimates of income for small places: an application of James–Stein procedures to census data. Journal of the American Statistical Association, 74(366), 269–277.
Fu, J., & Gleser, L. (1975). Classical asymptotic properties of a certain estimator related to the maximum likelihood estimator. Annals of the Institute of Statistical Mathematics, 27(1), 213–233.
Galindo-Garre, F., & Vermunt, J. (2006). Avoiding boundary estimates in latent class analysis by Bayesian posterior mode estimation. Behaviormetrika, 33(1), 43–59.
Galindo-Garre, F., Vermunt, J., & Bergsma, W. (2004). Bayesian posterior mode estimation of logit parameters with small samples. Sociological Methods & Research, 33(1), 88–117.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1(3), 515–533.
Gelman, A., Carlin, J., Stern, H., & Rubin, D. (2004). Bayesian data analysis (2nd ed.). London: Chapman & Hall/CRC.
Gelman, A., Jakulin, A., Pittau, M.G., & Su, Y.S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4), 1360–1383.
Gelman, A., & Meng, X. (1996). Model checking and model improvement. In Markov chain Monte Carlo in practice (pp. 189–201). London: Chapman & Hall.
Gelman, A., Shor, B., Bafumi, J., & Park, D. (2007). Rich state, poor state, red state, blue state: what’s the matter with Connecticut? Quarterly Journal of Political Science, 2(4), 345–367.
Greenland, S. (2000). When should epidemiologic regressions use random coefficients? Biometrics, 56(3), 915–921.
Hardy, R., & Thompson, S. (1998). Detecting and describing heterogeneity in meta-analysis. Statistics in Medicine, 17(8), 841–856.
Harville, D.A. (1974). Bayesian inference for variance components using only error contrasts. Biometrika, 61(2), 383–385.
Harville, D.A. (1977). Maximum likelihood approaches to variance components estimation and related problems. Journal of the American Statistical Association, 72(358), 320–338.
Higgins, J.P.T., Thompson, S.G., & Spiegelhalter, D.J. (2009). A re-evaluation of random-effects meta-analysis. Journal of the Royal Statistical Society. Series A, 172(1), 137–159.
Huber, P.J. (1967). The behavior of maximum likelihood estimation under nonstandard conditions. In L.M. LeCam & J. Neyman (Eds.), Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1, pp. 221–233). Berkeley: University of California Press.
Kenward, M., & Roger, J.H. (1997). Small-sample inference for fixed effects from restricted maximum likelihood. Biometrics, 53(3), 983–997.
Laird, N.M., & Ware, J.H. (1982). Random effects models for longitudinal data. Biometrics, 38(4), 963–974.
Li, H., & Lahiri, P. (2010). An adjusted maximum likelihood method for solving small area estimation problems. Journal of Multivariate Analysis, 101(4), 882–892.
Longford, N.T. (2000). On estimating standard errors in multilevel analysis. Journal of the Royal Statistical Society. Series D, 49(3), 389–398.
Maris, E. (1999). Estimating multiple classification latent class models. Psychometrika, 64(2), 187–212.
Miller, J. (1977). Asymptotic properties of maximum likelihood estimates in the mixed model of the analysis of variance. The Annals of Statistics, 5(4), 746–762.
Mislevy, R.J. (1986). Bayes modal estimation in item response models. Psychometrika, 51(2), 177–195.
Morris, C. (2006). Mixed model prediction and small area estimation (with discussions). Test, 15(1), 72–76.
Morris, C., & Tang, R. (2011). Estimating random effects via adjustment for density maximization. Statistical Science, 26(2), 271–287.
Neyman, J., & Scott, E.L. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16(1), 1–32.
O’Hagan, A. (1976). On posterior joint and marginal modes. Biometrika, 63(2), 329–333.
Overton, R. (1998). A comparison of fixed-effects and mixed (random-effects) models for meta-analysis tests of moderator variable effects. Psychological Methods, 3(3), 354.
Patterson, H.D., & Thompson, R. (1971). Recovery of inter-block information when block sizes are unequal. Biometrika, 58(3), 545–554.
Rabe-Hesketh, S., & Skrondal, A. (2012). Multilevel and longitudinal modeling using Stata (3rd ed.). College Station: Stata Press.
Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2005). Maximum likelihood estimation of limited and discrete dependent variable models with nested random effects. Journal of Econometrics, 128(2), 301–323.
Raudenbush, S., & Bryk, A. (1985). Empirical Bayes meta-analysis. Journal of Educational Statistics, 10(2), 75–98.
Rubin, D.B. (1981). Estimation in parallel randomized experiments. Journal of Educational Statistics, 6(4), 377–401.
Self, S.G., & Liang, K.Y. (1987). Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association, 82(398), 605–610.
Snijders, T., & Bosker, R. (1993). Standard errors and sample sizes for two-level research. Journal of Educational and Behavioral Statistics, 18(3), 237–259.
Stram, D.O., & Lee, J.W. (1994). Variance components testing in the longitudinal mixed effects model. Biometrics, 50(4), 1171–1177.
Swallow, W., & Monahan, J. (1984). Monte Carlo comparison of ANOVA, MIVQUE, REML, and ML estimators of variance components. Technometrics, 26(1), 47–57.
Swaminathan, H., & Gifford, J.A. (1985). Bayesian estimation in the two-parameter logistic model. Psychometrika, 50(3), 349–364.
Tsutakawa, R.K., & Lin, H.Y. (1986). Bayesian estimation of item response curves. Psychometrika, 51(2), 251–267.
Verbeke, G., & Molenberghs, G. (2000). Linear mixed models for longitudinal data. Berlin: Springer.
Vermunt, J., & Magidson, J. (2005). Technical guide for Latent Gold 4.0: basic and advanced (Technical report). Statistical Innovations Inc., Belmont, Massachusetts.
Viechtbauer, W. (2005). Bias and efficiency of meta-analytic variance estimators in the random-effects model. Journal of Educational and Behavioral Statistics, 30(3), 261–293.
Warton, D.I. (2008). Penalized normal likelihood and ridge regularization of correlation and covariance matrices. Journal of the American Statistical Association, 103(481), 340–349.
Weiss, R.E. (2005). Modeling longitudinal data. New York: Springer.
Whaley, S., Sigman, M., Neumann, C.G., Bwibo, N.O., Guthrie, D., Weiss, R.E., Alber, S., & Murphy, S.P. (2003). Animal source foods improve dietary quality, micronutrient status, growth and cognitive function in Kenyan school children: background, study design and baseline findings. The Journal of Nutrition, 133(11), 3965–3971.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817–838.
Acknowledgements
The research reported here was supported by the Institute of Education Sciences (grant R305D100017), the National Science Foundation (SES-1023189), the Department of Energy (DE-SC0002099), and the National Security Agency (H98230-10-1-0184).
Appendices
Appendix A. Derivation of Properties in Section 4
Here, we derive the properties stated in Section 4.
Properties 1 and 2
With the quadratic approximation of the profile log-likelihood in Section 3.2 using Equation (5), the MPL estimator is given by
\[
\widehat{\sigma}_{\theta} = \frac{1}{2} \Bigl[ \bigl(\hat{\sigma}_{\theta}^{\operatorname{ML}} - \lambda\,\widehat{\mathrm{se}}^{2}\bigr) + \sqrt{\bigl(\hat{\sigma}_{\theta}^{\operatorname{ML}} - \lambda\,\widehat{\mathrm{se}}^{2}\bigr)^{2} + 4(\alpha-1)\,\widehat{\mathrm{se}}^{2}}\, \Bigr],
\tag{A.1}
\]
where \(\widehat{\mathrm{se}}\) denotes \(\widehat{\mathrm{se}}(\hat{\sigma}_{\theta}^{\operatorname{ML}})\).
A simple calculation shows that \(\partial\widehat{\sigma}_{\theta}/\partial\lambda \leq 0\). Therefore, as λ→0 for fixed α and \(\widehat{\mathrm{se}}(\hat{\sigma}_{\theta}^{\operatorname{ML}})\), the MPL estimate increases monotonically to its maximum. When \(\hat{\sigma}_{\theta}^{\operatorname{ML}} = 0\), the maximum is \(\widehat{\mathrm{se}}(\hat{\sigma}_{\theta}^{\operatorname{ML}}) \sqrt{\alpha-1}\). When \(\hat{\sigma}_{\theta}^{\operatorname{ML}} > 0\), (A.1) reduces to
\[
\widehat{\sigma}_{\theta} = \frac{1}{2} \Bigl[ \hat{\sigma}_{\theta}^{\operatorname{ML}} + \sqrt{\bigl(\hat{\sigma}_{\theta}^{\operatorname{ML}}\bigr)^{2} + 4(\alpha-1)\,\widehat{\mathrm{se}}^{2}}\, \Bigr].
\]
In addition, \({\partial\widehat{\sigma}_{\theta}}/{\partial\widehat{\mathrm{se}}(\hat{\sigma}_{\theta}^{\operatorname{ML}})}\) becomes
\[
\frac{\partial\widehat{\sigma}_{\theta}}{\partial\widehat{\mathrm{se}}} = \frac{2(\alpha-1)\,\widehat{\mathrm{se}}}{\sqrt{\bigl(\hat{\sigma}_{\theta}^{\operatorname{ML}}\bigr)^{2} + 4(\alpha-1)\,\widehat{\mathrm{se}}^{2}}},
\]
which decreases as \(\hat{\sigma}_{\theta}^{\operatorname{ML}}\) increases.
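The behavior of the quadratic-approximation MPL estimator can be illustrated numerically. The sketch below assumes the estimator is the positive root of the penalized score \(-(s - \hat{\sigma}^{\operatorname{ML}})/\widehat{\mathrm{se}}^{2} + (\alpha-1)/s - \lambda = 0\), which follows from maximizing the quadratic approximation plus the log-gamma(α,λ) penalty; function names are ours, not from the paper.

```python
import math

def mpl_estimate(sigma_ml, se, alpha=2.0, lam=0.0):
    """Positive root of the penalized quadratic score
    -(s - sigma_ml)/se**2 + (alpha - 1)/s - lam = 0,
    i.e. of s**2 - (sigma_ml - lam*se**2)*s - (alpha - 1)*se**2 = 0."""
    b = sigma_ml - lam * se ** 2
    return 0.5 * (b + math.sqrt(b ** 2 + 4.0 * (alpha - 1.0) * se ** 2))

# When the ML estimate is zero and lam -> 0, the MPL estimate sits
# se * sqrt(alpha - 1) above zero: one standard error for alpha = 2.
print(mpl_estimate(0.0, se=0.5, alpha=2.0, lam=0.0))  # 0.5

# The estimate decreases monotonically in lam (d sigma / d lam <= 0).
vals = [mpl_estimate(1.0, se=0.5, alpha=2.0, lam=l) for l in (0.0, 0.5, 1.0, 2.0)]
assert all(a >= b for a, b in zip(vals, vals[1:]))
```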
Property 3
If we assign the log-gamma(α,λ) penalty to \(\sigma_{\theta}^{2}\) instead of \(\sigma_{\theta}\), the penalty becomes \(\log p(\sigma_{\theta}^{2}) = 2(\alpha-1)\log\sigma_{\theta} - \lambda\sigma_{\theta}^{2}\). In the limit λ→0, the term \(2(\alpha-1)\log\sigma_{\theta}\) is the same as the corresponding term of the log-gamma(2α−1,λ) penalty on \(\sigma_{\theta}\).
Property 6
Let \(t = g_{\gamma}(\sigma_{\theta})\). Then the Jacobian is \(\partial g_{\gamma}^{-1}(t)/\partial t = (\gamma t + 1)^{1/\gamma - 1}\), which is \(\sigma_{\theta}^{1-\gamma}\) when written as a function of \(\sigma_{\theta}\). Therefore, the prior \(p(g_{\gamma}(\sigma_{\theta}))\) of \(g_{\gamma}(\sigma_{\theta})\) is proportional to \(\sigma_{\theta}^{\alpha-\gamma} e^{-\lambda\sigma_{\theta}}\), which is proportional to a gamma(α−γ+1,λ) density.
Appendix B. Proof of Theorem 4
Proof
Let \(S_{nJ} = \frac{1}{J}\sum_{j} (\bar{y}_{\cdot j} - \mu)^{2}\) and \(T_{nJ} = S_{nJ} - (\sigma_{\epsilon}^{0})^{2}/n - (\sigma_{\theta}^{0})^{2}\). Then \(S_{nJ}\) follows \(\{(\sigma_{\epsilon}^{0})^{2}/n + (\sigma_{\theta}^{0})^{2}\}\,\chi^{2}_{J}/J\), so that \(T_{nJ} = O_{p}(J^{-1/2})\), \(E(T_{nJ}) = 0\) and \(\mathit{Var}(T_{nJ}) = 2\{(\sigma_{\epsilon}^{0})^{2}/n + (\sigma_{\theta}^{0})^{2}\}^{2}/J\). Using these terms, we can expand \(\hat{\sigma}_{\theta}^{\operatorname{ML}}\) as
Therefore, we have
For the asymptotic bias of \(\hat{\sigma}_{\theta}^{\operatorname{MPL}}\), we outline the proof here; details are in Dorie (2013). We work with an estimating equation \(\psi_{nJ}(\sigma_{\theta})\), given by
and \(\hat{\sigma}_{\theta}^{\operatorname{MPL}}\) will be a root of \(\psi_{nJ}(\sigma_{\theta}) = 0\). The expression above Theorem 4 gives \(\hat{\sigma}_{\theta}^{\operatorname{MPL}} - \sigma_{\theta}^{0} = O_{p}(J^{-1/2})\). Therefore, the Taylor expansion of \(\psi_{nJ}\) around \(\sigma_{\theta}^{0}\) is given by
As the left-hand side of the approximation is 0, we can complete the square to obtain:
Note that \(\psi\), \(\psi'\), and \(\psi''\) are each of order \(O_{p}(J)\), so that when we pass \(1/J\) under the root we make each term \(O_{p}(1)\),
The difference \(\sqrt{J}(\hat{\sigma}_{\theta}^{\operatorname{MPL}} - \sigma_{\theta}^{0})\) will blow up unless we take the positive root so that the leading terms cancel. Using the expansions of ψ, ψ′ and ψ″ and the expansion of the square root, we can reduce the numerator to
for some constants \(a_{1}\), \(a_{2}\), and \(a_{3}\).
Similarly, the Taylor expansion of the reciprocal of the denominator can be written as
with constants \(b_{1}\) and \(b_{2}\). Multiplying (B.1) by (B.2) gives the bias up to order \(J^{-1}\), and it follows that
Since \(\hat{\sigma}_{\theta}^{\operatorname{MPL}}\) is uniformly integrable, the expectation of the above is
□
Appendix C. Proof of Equation (9)
The model in (2) can be written as \(y = X\beta + \epsilon\), where X is a covariate matrix, ϵ follows N(0,V), V is a block-diagonal matrix with n×n blocks \(V_{j}\), and each \(V_{j}\) contains \(\sigma_{\theta}^{2} + \sigma_{\epsilon}^{2}\) on the diagonal and \(\sigma_{\theta}^{2}\) on the off-diagonals. As noted in Section 4.4, the REML log-likelihood can be written as the log-likelihood with an additive penalty term, \(-\frac{1}{2}\log\det(X^{T}V^{-1}X)\).
The inverse of V is also block-diagonal with the same structure as V, but with \(\{\sigma_{\epsilon}^{2} + (n_{j}-1)\sigma_{\theta}^{2}\}/\{\sigma_{\epsilon}^{2}(\sigma_{\epsilon}^{2} + n_{j}\sigma_{\theta}^{2})\}\) on the diagonals and \(-\sigma_{\theta}^{2}/\{\sigma_{\epsilon}^{2}(\sigma_{\epsilon}^{2} + n_{j}\sigma_{\theta}^{2})\}\) on the off-diagonals.
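The closed-form inverse of a compound-symmetric block can be checked numerically; a minimal sketch with numpy (block size and variance values are arbitrary illustrations):

```python
import numpy as np

n, s2_th, s2_eps = 4, 1.3, 0.7  # block size and variances (arbitrary)

# One block of V: s2_th + s2_eps on the diagonal, s2_th off the diagonal.
V_j = s2_eps * np.eye(n) + s2_th * np.ones((n, n))

# Closed-form inverse stated above (balanced case, n_j = n).
denom = s2_eps * (s2_eps + n * s2_th)
diag = (s2_eps + (n - 1) * s2_th) / denom
offdiag = -s2_th / denom
V_j_inv = (diag - offdiag) * np.eye(n) + offdiag * np.ones((n, n))

# Agrees with the numerical inverse.
assert np.allclose(np.linalg.inv(V_j), V_j_inv)
```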
Let the columns of X consist of a vector of ones, q level-1 covariates \((z_{1},\dots,z_{q})\), and r level-2 covariates \((w_{1},\dots,w_{r})\). When we assume that \(w_{1},\dots,w_{r}\) are dummy variables for the first r groups, that \(\boldsymbol{z}_{i}^{T}\boldsymbol{z}_{i} = 1\) and \(\boldsymbol{z}_{i}^{T}\boldsymbol{z}_{j} = 0\) for all i≠j, and that the data are balanced, \(X^{T}V^{-1}X\) simplifies to a block-diagonal matrix with blocks
and \(\frac{J}{\sigma_{\epsilon}^{2}} I_{q\times q}\).
Therefore it follows that
and
Appendix D. REML and Log-Gamma Penalty in General Cases (Referred to in Section 4.4)
Figure 8 compares the REML penalty function in (9), the log of the gamma density with corresponding \(\alpha = (r+1)/2 + 1\), and the REML penalty function in the second term of (8) for a dataset with n=30, J=5, q=1, and r=0, 1, or 2 that does not have the form assumed when deriving (9). For evaluating the REML penalty term in (8), the columns of the covariate matrix X consist of a vector of ones, a level-1 covariate \(z_{1}\) with \(z_{1ij} = i\), and two level-2 covariates \(w_{1}\) and \(w_{2}\), where \(w_{1j} = j\) for all j=1,…,J and \(w_{2}\) is the same as \(w_{1}\) except that the values for the last group are 0 instead of J. Comparing Figures 8(a) and (c), the penalties differ by a constant, which does not affect the mode, so formula (9) appears to hold more generally.
For Figures 8(a) and (b), the constant terms were ignored to make the figures easier to compare. The REML penalty functions with r=0, 1, and 2 look very similar to the gamma penalty on \(\sigma_{\theta}\) with α=2, 3, and 4, respectively, except where \(\sigma_{\theta}^{2}\) is close to zero. At \(\sigma_{\theta}^{2} = 0\), the log-gamma penalty is −∞ for α>1, whereas the REML penalty approaches −∞ only if \(\sigma_{\epsilon} \to 0\) or n→∞. This explains why REML can produce boundary estimates. Further, it implies that the log-gamma penalty penalizes \(\sigma_{\theta}^{2}\) close to zero more heavily than REML does for small n and large \(\sigma_{\epsilon}\). Otherwise, REML can approximately be viewed as a special case of our method with a log-gamma penalty.
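The constant offset between the REML penalty and a log-type penalty can be checked numerically for the r=0 design described above. The sketch below computes \(-\frac{1}{2}\log\det(X^{T}V^{-1}X)\) directly; comparing it against \(\frac{1}{2}\log(\sigma_{\epsilon}^{2} + n\sigma_{\theta}^{2})\) is our own closed-form derivation for this special case (intercept plus one level-1 covariate), not a formula quoted from the text.

```python
import numpy as np

n, J, s2_eps = 30, 5, 1.0
N = n * J

# Intercept plus one level-1 covariate z1 with z1_ij = i (the r = 0 case).
z1 = np.tile(np.arange(1, n + 1, dtype=float), J)
X = np.column_stack([np.ones(N), z1])

def reml_penalty(s2_th):
    """-log det(X' V^{-1} X) / 2 for compound-symmetric n x n blocks."""
    block_inv = np.linalg.inv(s2_eps * np.eye(n) + s2_th * np.ones((n, n)))
    XtVinvX = sum(X[j * n:(j + 1) * n].T @ block_inv @ X[j * n:(j + 1) * n]
                  for j in range(J))
    return -0.5 * np.log(np.linalg.det(XtVinvX))

# Away from zero, the REML penalty should differ from
# (1/2) log(s2_eps + n * s2_th) only by a constant (flat difference).
grid = np.linspace(0.25, 4.0, 8)
diff = [reml_penalty(s) - 0.5 * np.log(s2_eps + n * s) for s in grid]
print(max(diff) - min(diff))  # ~0: a constant shift does not move the mode
```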
Appendix E. Simulation of Unbalanced Variance Component Model
Swallow and Monahan (1984) compared several variance estimation methods for the one-way model, given by
where \(\theta_{j} \sim N(0,\sigma_{\theta}^{2})\) and \(\epsilon_{ij} \sim N(0,\sigma_{\epsilon}^{2})\). They considered unbalanced data with eight different patterns of group sizes \((n_{1},\dots,n_{J})\), and compared the bias and RMSE of estimators of \(\sigma_{\theta}\) using simulated datasets.
In this appendix, we picked two of the patterns considered by Swallow and Monahan (1984), \((n_{1},\dots,n_{J}) = (1,5,9)\) and (1,1,1,1,13,13) with \(\sigma_{\epsilon} = 1\), and compared the performance of ML and REML with that of the MPL estimator with a log-gamma(2,0) penalty on \(\sigma_{\theta}\), which approximates the REML penalty for this model.
As for the balanced case in Section 6, both ML and REML tend to underestimate \(\sigma_{\theta}\) for \(\sigma_{\theta} > 0\) (see the left column of Figure 9). On the other hand, MPL tends to overestimate \(\sigma_{\theta}\), but the magnitude of the bias decreases as \(\sigma_{\theta}\) increases. For \(\sigma_{\theta} = 1\), the MPL estimator has the smallest bias for both patterns of group sizes. The RMSE is smallest for the MPL estimator when \(\sigma_{\theta} > 0\), as shown in the middle column of Figure 9.
The last column in Figure 9 shows the estimated bias of the standard error of \(\hat{\mu}\). When \(\sigma_{\theta}\) is zero, there is almost no difference in bias between the ML and REML estimators. As \(\sigma_{\theta}\) increases, the bias of the MPL estimator becomes increasingly smaller than that of the other estimators.
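The simulation just described can be sketched as follows. This is a minimal Monte Carlo illustration, not the authors' code: `fit` maximizes the marginal log-likelihood of the one-way model over \((\mu, \sigma_{\epsilon}, \sigma_{\theta})\), optionally adding the log-gamma(2,0) penalty \((\alpha-1)\log\sigma_{\theta}\); all function names are ours.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
sizes = np.array([1, 5, 9])          # one of the unbalanced patterns
sigma_th, sigma_eps = 1.0, 1.0

def simulate():
    """One dataset: a group effect plus within-group noise per group."""
    return [rng.normal(0.0, sigma_th) + rng.normal(0.0, sigma_eps, nj)
            for nj in sizes]

def neg_loglik(par, groups, alpha=1.0):
    mu, log_se, st = par
    se2 = np.exp(2 * log_se)
    nll = 0.0
    for y in groups:
        nj, ybar = len(y), y.mean()
        ssw = ((y - ybar) ** 2).sum()
        v = se2 + nj * st ** 2          # variance of the group mean times nj
        nll += 0.5 * ((nj - 1) * np.log(se2) + np.log(v)
                      + ssw / se2 + nj * (ybar - mu) ** 2 / v)
    # log-gamma(alpha, 0) penalty on sigma_theta; alpha = 1 gives plain ML.
    if alpha > 1.0:
        nll -= (alpha - 1.0) * np.log(st)
    return nll

def fit(groups, alpha=1.0):
    eps = 1e-8 if alpha > 1.0 else 0.0
    res = minimize(neg_loglik, x0=[0.0, 0.0, 0.5], args=(groups, alpha),
                   method="L-BFGS-B",
                   bounds=[(None, None), (-5, 5), (eps, None)])
    return res.x[2]                    # estimate of sigma_theta

groups = simulate()
ml, mpl = fit(groups, alpha=1.0), fit(groups, alpha=2.0)
print(ml, mpl)  # MPL is bounded away from zero; ML may hit the boundary
```

Repeating this over many simulated datasets and averaging the estimates reproduces the bias comparison in the left column of Figure 9.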
Chung, Y., Rabe-Hesketh, S., Dorie, V. et al. A Nondegenerate Penalized Likelihood Estimator for Variance Parameters in Multilevel Models. Psychometrika 78, 685–709 (2013). https://doi.org/10.1007/s11336-013-9328-2