Abstract
Nonlinear mixed-effects models are commonly used in fisheries and ecological studies to account for complex relationships and dependencies in data. These models involve both fixed parameters to estimate and random effects (REs) to predict. This paper addresses the inferential setting in which the data are repeatedly sampled conditional on the unknown REs. This setting is more appropriate when the focus is on statistical inferences based on the specific values of the REs that generated the data. Assuming the Laplace approximation is adequate for the marginal likelihood, and working in a frequentist framework, we derive RE-conditional bias approximations for maximum likelihood parameter estimators and empirical Bayes RE predictors, as well as the conditional covariance and mean squared error (MSE) of parameter estimators and RE predictors. We show that the RE-conditional MSE can be approximated by the unconditional MSE. Simulation studies demonstrate that the variance and MSE approximations are reasonably accurate for relevant sample sizes. Given the finite-sample RE-conditional biases in the parameter estimates and RE predictions, the MSE is more appropriate for constructing confidence intervals (CIs), and the CI coverage for REs should be interpreted as the average coverage over a range of REs or over repeated generation of REs.
1 Introduction
Nonlinear mixed-effects models have been widely implemented to address complex multivariate correlation structures in data (see, e.g., [10, 11]; among many others) and cover a broad spectrum of statistical models. In some applications, the fixed effects, such as regression parameters, are of primary interest, while the random effects (REs) are introduced only to account for complex dependencies in the data (e.g., [16, 40]). However, in many other applications, REs or functions of REs represent quantities of practical significance and hence are also important to predict, and the correlations among REs are used to improve statistical inferences at spatiotemporal locations with few data (e.g., [3, 39]).
According to the definition of conditional probability, mixed-effects models (linear or nonlinear) can be written in the form of \(f(D,\Psi |\theta )=f(D|\Psi ,\theta )f(\Psi |\theta )\) (see, e.g., [20, 29]), where the vector of data D is assumed to have a multivariate probability density/mass function (pdf/pmf) \(f(D|\Psi ,\theta )\), given values of the vectors of fixed-effects parameters \(\theta \) and REs \(\Psi \). The marginal distribution of \(\Psi \) is \(f(\Psi |\theta )\). A specific example of the mixed-effects model is the fisheries state-space population dynamics model where \(f(\Psi |\theta )\) is the process model describing how the latent population processes evolve over time and/or space and \(f(D|\Psi ,\theta )\) is the observation model linking data to the latent processes (e.g., [33]). Nonlinear mixed-effects models have numerous applications in many fields including fisheries, ecology, environmental sciences, econometrics and engineering (e.g., [17]). The implementation of these models in fisheries and ecological studies relies heavily on software packages including Automatic Differentiation Model Builder (ADMB, Fournier et al. [9]) and Template Model Builder (TMB, Kristensen et al. [19]). Therefore, in this paper we study inference for nonlinear mixed-effects models as implemented with TMB or ADMB. These packages use the maximum marginal likelihood estimator (MMLE) to estimate the fixed effects \(\theta \).
The marginal distribution of D is
\(f(D|\theta )=\int \cdots \int f(D|\Psi ,\theta )f(\Psi |\theta )\,d\Psi _1 \cdots d\Psi _q, \qquad (1)\)
where \(\Psi _1,\ldots ,\Psi _q\) are the elements of the \(q \times 1\) vector \(\Psi \). For simplicity, this q-fold integral is denoted as \(\int f(D|\Psi ,\theta )f(\Psi |\theta )d\Psi \). The MMLE of \(\theta \) is the value \(\hat{\theta }\) that maximizes \(f(D|\theta )\); throughout this paper, we use \(\hat{\theta }\) to denote the MMLE of \(\theta \). The integral in Equation (1) will usually not have a closed form; however, TMB can approximate the marginal likelihood via the Laplace approximation quickly for possibly many (i.e., tens of thousands of) REs by efficiently exploiting the sparseness of the joint distribution \(f(D,\Psi |\theta )\) with respect to \(\Psi \). The REs \(\Psi \) can be predicted with the conditional mean \(\hat{\Psi }_{\mathrm {E}}(\hat{\theta })=\int {\Psi f(\Psi |D, \hat{\theta })d\Psi }\), which is also the empirical Bayes predictor of \(\Psi \) in the Bayesian framework (e.g., [18]). McCulloch and Neuhaus [23] showed, for generalized linear mixed models, that \(\mathrm {E}\{\Psi |D,\theta \}\) is the best predictor in the sense of minimizing the overall mean squared error (MSE) of prediction. REs can also be predicted with the posterior mode \(\hat{\Psi }(\hat{\theta })\), which maximizes the joint distribution \(f(D,\Psi |\theta )\), or equivalently the posterior \(f(\Psi |D,\theta )\), when \(\theta =\hat{\theta }\). Note the difference between the posterior mean \(\hat{\Psi }_{\mathrm {E}}\) and the posterior mode \(\hat{\Psi }\). In linear mixed models, the posterior mode RE predictor \(\hat{\Psi }\) is known as the empirical best linear unbiased predictor (EBLUP; Robinson [30]). In generalized linear mixed models, Jiang et al. [15] called \(\hat{\Psi }\) the maximum posterior estimate (MPE) of \(\Psi \) and proved that, given sufficient information about the REs, a restricted version of the MPE is consistent overall regardless of the values of the dispersion parameters of the RE distribution at which \(\hat{\Psi }\) is evaluated, even though the prediction of an individual RE is biased. In this paper, we use the posterior mode \(\hat{\Psi }\) to predict REs in a more general situation where there may not be sufficient data for all the REs, and in particular there may be no data for some subset of REs. When the joint pdf \(f(D,\Psi |\theta )\) is unimodal and approximately symmetric in \(\Psi \), then \(\hat{\Psi }_{\mathrm {E}}\) and \(\hat{\Psi }\) are approximately the same. The focus of our research is statistical inference with TMB and ADMB, which apply the Laplace approximation by assuming \(f(D,\Psi |\theta )\) is approximately multivariate normal (MVN). Under these circumstances, \(\hat{\Psi }_{\mathrm {E}}\) is approximately equivalent to \(\hat{\Psi }\), and hence the good properties of \(\hat{\Psi }_{\mathrm {E}}\) carry over to \(\hat{\Psi }\).
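As a toy illustration of how the Laplace approximation evaluates the integral in Eq. (1), the following sketch (ours, not TMB code; all names are hypothetical) compares the Laplace log-marginal likelihood with the exact value in a one-RE Gaussian model, where the approximation happens to be exact:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import multivariate_normal

# Toy model: y_i | psi ~ N(psi, sig_e^2), psi ~ N(0, sig_p^2).
# The Laplace approximation replaces the integral in Eq. (1) with
#   f(D|theta) ~= f(D, psi_hat | theta) * sqrt(2*pi / {-l_j''(psi_hat)}),
# where psi_hat maximizes the joint loglikelihood l_j(psi, theta).
rng = np.random.default_rng(1)
n, sig_e, sig_p = 10, 0.5, 1.0
y = rng.normal(0.3, sig_e, size=n)

def l_j(psi):  # joint loglikelihood l_c(psi) + l_r(psi)
    l_c = -0.5*n*np.log(2*np.pi*sig_e**2) - np.sum((y - psi)**2)/(2*sig_e**2)
    l_r = -0.5*np.log(2*np.pi*sig_p**2) - psi**2/(2*sig_p**2)
    return l_c + l_r

psi_hat = minimize_scalar(lambda p: -l_j(p)).x
neg_hess = n/sig_e**2 + 1/sig_p**2            # -l_j''(psi), analytic here
log_marg_laplace = l_j(psi_hat) + 0.5*np.log(2*np.pi/neg_hess)

# Because this toy model is Gaussian, the Laplace approximation is exact:
V = sig_p**2*np.ones((n, n)) + sig_e**2*np.eye(n)
log_marg_exact = multivariate_normal.logpdf(y, mean=np.zeros(n), cov=V)
```

For non-Gaussian joint likelihoods the two quantities differ, and the accuracy of the Laplace approximation is exactly what the MVN assumption discussed above controls.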
We consider a conceptual frequentist inferential setting in which the REs, \(\Psi \), are drawn once from the process model \(f(\Psi |\theta )\) and then fixed at these values during repeated data generation from the observation model \(f(D|\Psi ,\theta )\). This is a realistic inferential setting because, in many cases, an effect is treated as random only because it is unobservable and high-dimensional, not because it is truly random. For instance, the popular lasso/\(L_1\) regularization [32] for high-dimensional (HD) linear regression parameters is equivalent to placing a double-exponential (Laplace) marginal distribution (or prior, in a Bayesian interpretation) on the HD coefficients and then estimating them with the posterior mode [2]. In fisheries state-space assessment models, the annual population abundance and fishing mortality rates are frequently modeled as REs (e.g., [5, 28]). Even though there are process errors in how these effects are modeled, there is only one set of process errors and only one time-series of true population abundance and fishing mortality rates to make statistical inferences about. That is, the yearly time-series of unknown population abundance and mortality rates may be one draw from a larger population (i.e., they are random variables), but once established they behave like high-dimensional parameters that stay constant during repeated sampling (i.e., catches) in different months of the year or at different locations. Under such circumstances, it is more appropriate to make statistical inferences conditional on the unknown REs. In this conditional inferential setting, rather than the marginal mean and covariance, we should evaluate the conditional mean \(\mathrm {E}\{\cdot |\Psi \}\) and covariance \(\mathrm {Cov}\{\cdot |\Psi \}\).
The marginal statistical properties differ from the conditional properties, and this can lead to misinterpretation of confidence intervals (CIs), and possibly wrong fisheries management decisions, if the conditional setting is actually appropriate. For example, in the marginal setting the parameter estimators and RE predictors are all approximately unbiased [38]; however, in the conditional setting their biases are not negligible (see Sect. 2.2). In this paper, we investigate the biases and covariances of parameter estimators and RE predictors in the conditional setting, and we also examine the CI coverage properties using simulation studies.
The marginal inferential setting may be mis-specified, and this is often revealed when simulation testing the efficacy of state-space models. In the marginal setting, in each simulation run the REs \(\Psi \) must be generated from \(f(\Psi |\theta )\), which frequently results in unrealistic REs, the extinction of the simulated fish stock, and unusable simulation data. A commonly used procedure to address this problem is repeated sampling of D from \(f(D|\hat{\Psi },\hat{\theta })\) (e.g., [5, 26, 28]); that is, the REs are fixed at \(\hat{\Psi }\) instead of being randomly generated in each simulation. In stock assessment, this is referred to as a simulation self-test [6]. This simulation setup is much closer to our conditional setting than to the marginal setting, and hence a study based on the conditional setting can reveal and explain the difference between marginal inference and self-tests and improve the interpretation of self-test results. However, the distribution of the RE predictor \(\hat{\Psi }\) is different from that of the RE \(\Psi \), namely \(f(\Psi |\theta )\), and thus the results in this paper for conditioning on the true REs may not be fully applicable to self-tests. This issue is further clarified in Sect. 4.
The results in this paper are also generally applicable to the Gaussian process semiparametric regression model of He and Severini [12, 13] and to the type of integrated likelihood [1, 31] used for primary model parameters (e.g., regression coefficients), where the unknown nuisance parameters, even though fixed, are integrated out by technically assuming some distribution for them, usually MVN. We illustrate this application with an example in Sect. 2.3.
2 Materials and Methods
2.1 Notations and Background
Consider a nonlinear mixed-effects model for random response data that are collected in an \(n \times 1\) vector D and are assumed to have a multivariate pdf \(f(D|\Psi ,\theta )\). The means and covariances of D depend on the fixed-effects parameters \(\theta \) (\(p \times 1\)) and the random effects \(\Psi \) (\(q \times 1\)), possibly via nonlinear functions of \(\theta \), \(\Psi \), and covariates, which we do not develop notation for and leave implicit in \(f(D|\Psi ,\theta )\). The pdf of \(\Psi \) is \(f(\Psi |\theta )\). We denote the joint loglikelihood of \(\theta \) and \(\Psi \) as
\(l_{j}(\Psi ,\theta ) = \ln \{f(D,\Psi |\theta )\} = l_{c}(\Psi ,\theta ) + l_{r}(\Psi ,\theta ), \qquad (2)\)
with the conditional data loglikelihood \(l_{c}(\Psi ,\theta )=\ln \{f(D|\Psi ,\theta )\}\) and the loglikelihood of the REs \(l_r(\Psi ,\theta )=\ln \{f(\Psi |\theta )\}\). The marginal distribution of D is given by Eq. (1), and the marginal loglikelihood is denoted as \(l(\theta )\). The true parameters \(\theta _o\) are estimated with the MMLE \(\hat{\theta }\), and the REs \(\Psi \) are predicted with the mode of \(l_{j}(\Psi ,\hat{\theta })\) with respect to \(\Psi \), which is denoted as \(\hat{\Psi }(\hat{\theta })\). Here the unknown true parameters \(\theta _o\) are replaced with the MMLE \(\hat{\theta }\). \(\hat{\Psi }(\theta )\) denotes the mode of \(l_{j}(\Psi ,\theta )\) with respect to \(\Psi \) for general \(\theta \) and can be found by solving the equation
\(\partial l_{j}(\Psi ,\theta )/\partial \Psi \,\big |_{\Psi =\hat{\Psi }(\theta )} = 0. \qquad (3)\)
If the joint pdf \(f(D,\Psi \,|\,\theta )=f(D|\Psi ,\theta )f(\Psi |\theta )\) is unimodal and approximately symmetric in \(\Psi \), then \(\hat{\Psi }(\theta )\) is a good approximation to the conditional mean of the REs given the data, \(\mathrm {E}\{\Psi \,|\,D,\theta \}\).
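A minimal sketch of how the mode \(\hat{\Psi }(\theta )\) can be located in practice, here by Newton iteration on the score equation for a small Gaussian random-walk model (our illustration with hypothetical names; TMB's inner optimizer is more general):

```python
import numpy as np

# Find the posterior mode psi_hat(theta) by Newton iteration on
# d l_j / d psi = 0 for a Gaussian random walk with T=5 times and one
# observation per time.  For this Gaussian model, -l_j'' is constant.
rng = np.random.default_rng(2)
T, sig_p, sig_e = 5, 1.0, 0.5
psi_true = np.cumsum(rng.normal(0, sig_p, T))
y = rng.normal(psi_true, sig_e)

Q = np.zeros((T, T))                      # -l_r'': random-walk precision
for t in range(T - 1):
    Q[t, t] += 1/sig_p**2
    Q[t+1, t+1] += 1/sig_p**2
    Q[t, t+1] -= 1/sig_p**2
    Q[t+1, t] -= 1/sig_p**2
H = Q + np.eye(T)/sig_e**2                # -l_j'' = -l_r'' - l_c''

def grad_lj(psi):                         # d l_j / d psi
    return (y - psi)/sig_e**2 - Q @ psi

psi = np.zeros(T)
for _ in range(20):                       # Newton: psi <- psi + H^{-1} grad
    step = np.linalg.solve(H, grad_lj(psi))
    psi += step
    if np.max(np.abs(step)) < 1e-10:
        break
psi_hat = psi
```

Because the joint loglikelihood is quadratic in \(\Psi \) here, a single Newton step solves the score equation; for nonlinear models the same iteration is run to convergence.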
When deriving approximation orders, we assume that there are \(i=1,\ldots ,T\) observational units and that there are \(n_i\) observations in the ith unit that share the same subset of REs. For example, in a time-series setting, T may be the number of years and \(n_t\) the number of observations in year t. Our approximation orders will be conservative in some cases.
One of the main results in Zheng and Cadigan [38] is given here as a proposition for future reference.
Proposition 1
(adapted from Eqs. (13) and (14) of Zheng and Cadigan [38]) If the conditional distribution of \(\Psi \) given data D is approximately MVN, the mean squared error (MSE) of RE predictors and parameter estimators can be estimated with
\(\mathrm {MSE}\begin{pmatrix} \hat{\Psi }(\hat{\theta }) \\ \hat{\theta } \end{pmatrix} = \begin{pmatrix} -\ddot{l}_{j}^{-1} & 0 \\ 0 & 0 \end{pmatrix} + \begin{pmatrix} \partial \hat{\Psi }(\hat{\theta })/\partial \hat{\theta }^{\top } \\ I \end{pmatrix} \mathrm {Cov}(\hat{\theta }) \begin{pmatrix} \partial \hat{\Psi }(\hat{\theta })/\partial \hat{\theta }^{\top } \\ I \end{pmatrix}^{\top }, \qquad (4)\)
where \(\ddot{l}_{j}= \partial ^{2}l_{j}(\Psi ,\theta )/\partial \Psi \partial \Psi ^{\top }|_{\theta =\hat{\theta },\Psi =\hat{\Psi }}\), I is a \(p\times p\) identity matrix, \(\partial \hat{\Psi }(\hat{\theta })/\partial \hat{\theta }^{\top }\) denotes \(\partial \hat{\Psi }(\theta )/\partial \theta ^{\top }|_{\theta =\hat{\theta }}\) and \(\mathrm {Cov}(\hat{\theta }) = -\ddot{l}^{-1}(\hat{\theta })\) which is the matrix inverse of the Hessian of the negative marginal loglikelihood evaluated at \(\hat{\theta }\).
TMB uses Eq. (4) combined with the generalized delta method to calculate the prediction standard errors (SEs) for user-specified differentiable functions of REs and parameters (\(g(\Psi ,\theta )\); see Eq. 15 in Zheng and Cadigan [38]). Hence, TMB implicitly assumes that the conditional distribution of \(\Psi \) given data D is approximately normal, which is also required for the Laplace approximation to be accurate for the marginal likelihood in Eq. (1). TMB generalized delta SEs implicitly assume that both D and \(\Psi \) are random. In the next section, we provide \(\Psi \)-conditional covariances that can be used with the generalized delta method to derive SEs that are appropriate when only D is considered to be random and \(\Psi \) is fixed.
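The block structure in Proposition 1 can be assembled numerically as below; the matrices are illustrative stand-ins rather than output from a fitted model, and the variable names are ours:

```python
import numpy as np

# Numeric sketch of the Proposition 1 block formula: the joint MSE of
# (psi_hat, theta_hat) combines the RE curvature term -l_j''^{-1} with
# the parameter covariance Cov(theta_hat) propagated through the
# sensitivity d psi_hat / d theta.  All matrices below are made up.
q, p = 3, 2
neg_ljdd_inv = np.diag([0.20, 0.15, 0.25])          # -l_j''^{-1}, q x q
cov_theta = np.array([[0.05, 0.01],
                      [0.01, 0.08]])                # Cov(theta_hat), p x p
dpsi_dtheta = np.array([[1.0, 0.2],
                        [0.8, 0.1],
                        [0.5, 0.3]])                # d psi_hat / d theta', q x p

J = np.vstack([dpsi_dtheta, np.eye(p)])             # (q+p) x p
mse = J @ cov_theta @ J.T                           # parameter-driven part
mse[:q, :q] += neg_ljdd_inv                         # add RE-curvature block

pred_var = np.diag(mse)[:q]                         # RE prediction variances
```

The \(\theta \)-block of the result simply reproduces \(\mathrm {Cov}(\hat{\theta })\), while each RE prediction variance exceeds the corresponding curvature term because parameter uncertainty is added.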
2.2 Conditional Covariance and MSE
We consider the inferential setting in which \(\Psi \) is randomly generated from the true model \(f(\Psi |\theta _o)\) only once and then fixed in the subsequent generations of the data D from \(f(D\,|\,\Psi ,\theta _o)\), as the basis for frequentist inference. Throughout this paper, we use the subscript “\(_o\)” to denote the true value. The conditional covariance \(\mathrm {Cov}(\hat{\Psi }\,|\,\Psi )\) measures the variability of \(\hat{\Psi }\) when only D is re-sampled from \(f(D\,|\,\Psi ,\theta _o)\). We derive an approximation of \(\mathrm {Cov}(\hat{\Psi }\,|\,\Psi )\) using a first-order Taylor series expansion of \(\hat{\Psi }(\hat{\theta })\) about \(\hat{\theta }=\theta _o\), which gives
\(\hat{\Psi }(\hat{\theta }) = \hat{\Psi }(\theta _o) + \{\partial \hat{\Psi }(\theta _o)/\partial \theta _o^{\top }\}(\hat{\theta }-\theta _o) + O_p(T^{-1}). \qquad (5)\)
The \(O_p(T^{-1})\) in Eq. (5) comes from \((\hat{\theta }-\theta _o)^2\) and higher-order expansion terms. We use the \(O(\cdot )\) and \(o(\cdot )\) notations in a matrix sense, such that they apply to each element of \((\cdot )\). Based on Eq. (5), we can show that
where \(\mathrm {Cov}\{ \hat{\Psi }(\hat{\theta }), \hat{\theta } \,|\, \Psi \}\) denotes the conditional covariance between vectors \(\hat{\Psi }(\hat{\theta })\) and \(\hat{\theta }\), and the approximation orders come from \(\mathrm {Cov}\{\hat{\theta },\hat{\Psi }(\theta _o)\,|\,\Psi \}=o(T^{-1})\), \(\mathrm {Cov}\{\hat{\Psi }(\theta _o),O_p(T^{-1})\,|\,\Psi \}=o(T^{-1})\) and \(\mathrm {Cov}\{\hat{\theta },O_p(T^{-1})\,|\,\Psi \}=o(T^{-1})\), which are proved in Appendix C. These results can be summarized in the following matrix form.
Theorem 1
The conditional covariance of RE predictors and parameter estimators is given by
\(\mathrm {Cov}\left\{ \begin{pmatrix} \hat{\Psi }(\hat{\theta }) \\ \hat{\theta } \end{pmatrix} \Big |\, \Psi \right\} = \begin{pmatrix} \mathrm {Cov}\{\hat{\Psi }(\theta _o)\,|\,\Psi \} & 0 \\ 0 & 0 \end{pmatrix} + \begin{pmatrix} \partial \hat{\Psi }(\hat{\theta })/\partial \hat{\theta }^{\top } \\ I \end{pmatrix} \mathrm {Cov}(\hat{\theta }\,|\,\Psi ) \begin{pmatrix} \partial \hat{\Psi }(\hat{\theta })/\partial \hat{\theta }^{\top } \\ I \end{pmatrix}^{\top } + o(T^{-1}). \qquad (6)\)
With this formula and the subsequent approximations, the generalized Delta method can be used to evaluate the conditional covariance of the estimate of a differentiable function of \(\theta \) and \(\Psi \).
Define \(\widetilde{\Psi } = \Psi - \{\partial \hat{\Psi }(\hat{\theta })/\partial \hat{\theta }^{\top }\}\theta \), where \(\partial \hat{\Psi }(\hat{\theta })/\partial \hat{\theta }^{\top }\) is treated as a constant matrix. Also, let \(\tilde{l}_r(\widetilde{\Psi },\theta )\) be \(l_r(\Psi ,\theta )\) in (2) with the variables \((\Psi ,\theta )\) transformed to \((\widetilde{\Psi },\theta )\). For the conditional bias and covariance of the MMLE given \(\Psi \), we prove the following theorem in Appendix A.
Theorem 2
If the marginal distribution (1) can be well evaluated with the Laplace approximation, then the bias of MMLEs of \(\theta \) conditional on the REs \(\Psi \) is given by
and the conditional covariance is given by
where
When the estimator \(\ddot{l}^{-1}(\hat{\theta }) \, \widetilde{\mathcal {I}}_r(\hat{\theta },\hat{\Psi })\, \ddot{l}^{-1}(\hat{\theta })\) for the second term in (8) is not positive definite, we recommend using its nearest positive definite matrix [14], as discussed in the paragraph following Eq. (A.6) in the Appendix. Note that \(\mathrm {Cov}(\hat{\theta })\) in Eq. (8) involves expectations with respect to the marginal distribution of the random response variables, namely the data D, while the \(\mathrm {Cov}\) part of \(\mathrm {Cov}\lbrace \mathrm {E}(\hat{\theta } \,|\,\Psi ) \rbrace \) involves expectations with respect to the distribution of \(\Psi \). The conditional bias in (7) is of order \(O(T^{-1/2})\).
For evaluating \(\mathrm {Cov}\{ \hat{\Psi }(\hat{\theta })\,|\,\Psi \}\), following Theorem 1 and Eq. (B.5) in Appendix B, we have the following corollary.
Corollary 1
When the distribution of the REs given data is approximately MVN, then
If the REs are also MVN, then
\(\mathrm {Cov}\{\hat{\Psi }(\hat{\theta })\,|\,\Psi \} \approx -\ddot{l}_{j}(\Psi ,\theta _o)^{-1} - \ddot{l}_{j}(\Psi ,\theta _o)^{-1}\Sigma ^{-1}\ddot{l}_{j}(\Psi ,\theta _o)^{-1} + \{\partial \hat{\Psi }(\hat{\theta })/\partial \hat{\theta }^{\top }\}\,\mathrm {Cov}(\hat{\theta }\,|\,\Psi )\,\{\partial \hat{\Psi }(\hat{\theta })/\partial \hat{\theta }^{\top }\}^{\top }. \qquad (9)\)
Here, if the REs are MVN, then \(\Sigma ^{-1}=-\ddot{l}_r(\Psi )\). Note that \(\ddot{l}_{j}(\Psi ,\theta _o)^{-1}\Sigma ^{-1} \ddot{l}_{j}(\Psi ,\theta _o)^{-1}\) is a positive definite matrix, so the diagonal variances of \(\mathrm {Cov}\lbrace \hat{\Psi }(\hat{\theta })\,|\,\Psi \rbrace \) will be smaller than the diagonal variances of \(\mathrm {Cov}\lbrace \hat{\Psi }(\hat{\theta })-\Psi \rbrace \) when \(\Psi \) is random (i.e., compare Eqs. 9 and 4). This also makes sense because fixing \(\Psi \) removes a source of variation. However, the difference will be small when the data are highly informative about the REs (i.e., \(\ddot{l}_{j}(\hat{\Psi },\hat{\theta })^{-1}\rightarrow 0\) in some sense; e.g., Fahrmeir and Kaufmann [7]), in which case estimates of these effects behave statistically like fixed-effects parameters.
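The diagonal comparison described above can be checked numerically; the following sketch uses an iid Gaussian toy model of our own construction, with one observation per RE:

```python
import numpy as np

# With MVN REs, the Psi-conditional covariance of psi_hat is smaller on
# the diagonal than the random-Psi version by the positive definite
# matrix l_j''^{-1} Sigma^{-1} l_j''^{-1}.  Toy setup: q iid REs,
# psi ~ N(0, sig_p^2), one observation per RE with sd sig_e.
q, sig_p, sig_e = 4, 1.0, 0.5
Sigma_inv = np.eye(q)/sig_p**2             # Sigma^{-1} = -l_r'' for MVN REs
neg_ljdd = Sigma_inv + np.eye(q)/sig_e**2  # -l_j''
A = np.linalg.inv(neg_ljdd)                # (-l_j'')^{-1}

var_random = A                             # leading term of the random-Psi MSE
shrink = A @ Sigma_inv @ A                 # l_j''^{-1} Sigma^{-1} l_j''^{-1}
var_conditional = var_random - shrink      # leading term given fixed Psi
```

As the data become more informative (\(\sigma _\epsilon \rightarrow 0\) or more observations per RE), the `shrink` term vanishes relative to `var_random`, matching the remark about highly informative data.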
When \(\Psi \) are actually fixed effects that are treated as random for smoothing purposes, \(\hat{\Psi }\) is a biased estimator of \(\Psi \). For this bias, we have the following evaluation.
Theorem 3
If the conditional distribution of \(\Psi \) given data D is approximately MVN, then
If \(\Psi \) is also MVN with covariance matrix \(\Sigma \), then
\(\mathrm {E}\{\hat{\Psi }(\hat{\theta })-\Psi \,|\,\Psi \} \approx \ddot{l}_{j}(\Psi ,\theta _o)^{-1}\Sigma ^{-1}\Psi + \{\partial \hat{\Psi }(\hat{\theta })/\partial \hat{\theta }^{\top }\}\,\mathrm {E}(\hat{\theta }-\theta _o\,|\,\Psi ). \qquad (10)\)
Here the leading term for \(\Psi _i\), the ith element of \(\Psi \), is of order \(O(1/(n_i+1))\), with \(n_i\) being the sample size associated with \(\Psi _i\). This theorem can be proved directly from Eqs. (5), (A.8) and (B.2) in the Appendices.
Theorem 2 says that, conditional on \(\Psi \), \(\hat{\theta }\) is also a biased estimator of \(\theta _o\), with a bias of order \(O(T^{-1/2})\). Let \(\Omega =(\Psi ^{\top },\theta _o^{\top })^{\top }\) and \(\hat{\Omega }=(\hat{\Psi }(\hat{\theta })^{\top },\hat{\theta }^{\top })^{\top }\). Based on the results in this section, we prove the following corollary in Appendix D.
Corollary 2
If \(\Psi \) is MVN with covariance matrix \(\Sigma \) and the conditional distribution of \(\Psi \) given data D is approximately MVN, then
\(\mathrm {MSE}(\hat{\Omega }\,|\,\Psi ) \approx \mathrm {Cov}(\hat{\Omega }\,|\,\Psi ) + \mathrm {E}\lbrace \mathrm {E}(\hat{\Omega }-\Omega \,|\,\Psi )\,\mathrm {E}(\hat{\Omega }-\Omega \,|\,\Psi )^{\top } \rbrace . \qquad (11)\)
Equation (11) is equal to Eq. (4), namely the unconditional MSE of \(\hat{\Omega }\), \(\mathrm {MSE}( \hat{\Omega } )\). Here, because \(\Psi \) is unknown and can only be estimated with a bias of order O(1), as indicated by Eq. (10), we use \(\mathrm {E}\lbrace \mathrm {E}(\hat{\Omega }-\Omega \,|\,\Psi )\, \mathrm {E}(\hat{\Omega }-\Omega \,|\,\Psi )^{\top } \rbrace \) to give an overall estimate of the conditional squared bias \(\mathrm {E}( \hat{\Omega }-\Omega \,|\,\Psi ) \mathrm {E}( \hat{\Omega }-\Omega \,|\,\Psi )^{\top }\).
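The decomposition underlying this section, unconditional MSE equals the average conditional variance plus the average squared conditional bias, can be checked by Monte Carlo in a one-RE shrinkage toy model with known \(\theta \) (our illustration, not one of the models used later in the paper):

```python
import numpy as np

# Toy model: n observations per RE, psi ~ N(0, sig_p^2), with theta
# known, so psi_hat = w * ybar is the usual shrinkage predictor.
# Conditional on psi: Var(psi_hat|psi) = w^2 sig_e^2 / n (free of psi)
# and bias(psi) = (w - 1) psi, so averaging the squared bias over psi
# gives (1 - w)^2 sig_p^2.
rng = np.random.default_rng(3)
n, sig_e, sig_p, R = 5, 0.5, 1.0, 200_000
w = (n/sig_e**2) / (n/sig_e**2 + 1/sig_p**2)   # shrinkage weight

psi = rng.normal(0, sig_p, R)                  # one RE per replicate
ybar = rng.normal(psi, sig_e/np.sqrt(n))
psi_hat = w * ybar

mse_uncond = np.mean((psi_hat - psi)**2)       # unconditional MSE
avg_cond_var = w**2 * sig_e**2 / n             # E over psi of Var(psi_hat|psi)
avg_sq_bias = (1 - w)**2 * sig_p**2            # E over psi of bias(psi)^2
```

In this Gaussian toy model the identity is exact, which mirrors the statement that the expression in Eq. (11) reproduces the unconditional MSE in Eq. (4).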
2.3 Semiparametric Regression Example
As a partial validation and application of the theoretical results in Sect. 2.2, we consider the Gaussian process semiparametric regression studied in He and Severini [13],
\(Y_i = x_i^{\top }\beta + \gamma (z_i) + \epsilon _i, \quad i=1,\ldots ,n, \qquad (12)\)
where \(x_1,\ldots ,x_n\) are \(p\times 1\) covariate vectors, \(\epsilon _1,\ldots ,\epsilon _n\) are unobserved independent normal random variables, each with mean 0 and standard deviation \(\sigma >0\), \(\beta \) is a \(p\times 1\) vector of unknown regression parameters, \(z_1,\ldots ,z_n\) are observed constants taking values in a set \(\mathcal {Z}\), and \(\gamma \) is an unknown real-valued function on \(\mathcal {Z}\). He and Severini [13] further denoted \(Y=(Y_1,\ldots ,Y_n)^{\top }\), \(X=(x_1^{\top },\ldots ,x_n^{\top })^{\top }\), \(\epsilon =(\epsilon _1,\ldots ,\epsilon _n)^{\top }\), and \(g_{\gamma } = (\gamma (z_1),\ldots ,\gamma (z_n))^{\top }\), and wrote the model as \(Y=X\beta + g_{\gamma } + \epsilon \). The covariance matrix of \(\epsilon \) is denoted as \(\varOmega _{\phi }\) and assumed to have a parametric form with parameter \(\phi \). The regression coefficients \(\beta \) are of primary interest, and \(g_{\gamma }\) are nuisance effects. Even though \(g_{\gamma }\) is actually fixed, He and Severini [13] technically treated it as a mean-zero Gaussian process with \(n\times n\) covariance matrix \(\varSigma _{\lambda }\), parameterized by \(\lambda \), so that \(g_{\gamma }\) can be integrated out to obtain the marginal likelihood.
2.4 Random Walk Simulation Example
We also illustrate Eqs. (8), (9) and (11) using a simple random-walk example. The random walk is \(\Psi _t|\Psi _{t-1} {\mathop {\sim }\limits ^{indep}} N(\Psi _{t-1},\sigma ^{2}_{\Psi })\) for \(t= 2,\ldots ,T\), and \(\Psi _{1} = \beta \) is an unknown parameter to estimate. Here \(N(\mu ,\sigma ^2)\) denotes the normal distribution with mean \(\mu \) and variance \(\sigma ^2\). At each time-step, there are n independent observations of the process, \(Y_{t,i}|\Psi _t {\mathop {\sim }\limits ^{i.i.d}} N(\Psi _t,\sigma ^{2}_{\epsilon })\), \(i=1,\ldots ,n\) and \(t=1,\ldots ,T\). The parameters are \(\theta = (\beta , \sigma _{\Psi },\sigma _{\epsilon })^{\top }\) and the REs are \(\Psi = (\Psi _{2},\ldots ,\Psi _{T})^{\top }\), a \((T-1) \times 1\) vector. This process can be regarded as a specific realization of the Gaussian process semiparametric regression described in the previous section, with regression parameter \(\beta \), \(\Psi = \gamma \), and \(\sigma _{\epsilon }\) and \(\sigma _{\Psi }\) corresponding to the dispersion parameters \(\phi \) and \(\lambda \), respectively. In Sect. 3.1, we show that \(\mathrm {Cov}\lbrace \hat{\beta } \,|\, \Psi \rbrace \) can be correctly evaluated by Eq. (8) and that the RE predictor obtained by maximizing the joint loglikelihood is the Best Linear Predictor (BLP; e.g., Robinson [30]) used in He and Severini [13]. In this example, we demonstrate the statistical properties of \(\hat{\Psi }\) and the parameter estimators using a simulation study.
We generated y responses from the random-walk model with \(\beta =0\), \(\sigma _{\Psi } = 1\), \(\sigma _{\epsilon } = 0.5\), and two choices each for \(n=2,5\) and \(T=50,200\).
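A minimal sketch of this data-generating process (our code, not the TMB implementation used for the actual study):

```python
import numpy as np

# Generate data from the Sect. 2.4 random-walk model: Psi_1 = beta is a
# parameter, Psi_t | Psi_{t-1} ~ N(Psi_{t-1}, sig_psi^2) for t >= 2, and
# n independent observations Y_{t,i} ~ N(Psi_t, sig_eps^2) at each time.
def simulate_rw(beta=0.0, sig_psi=1.0, sig_eps=0.5, n=2, T=50, seed=0):
    rng = np.random.default_rng(seed)
    psi = np.empty(T)
    psi[0] = beta                               # Psi_1 = beta (a parameter)
    for t in range(1, T):
        psi[t] = rng.normal(psi[t-1], sig_psi)  # random-walk step
    y = rng.normal(psi[:, None], sig_eps, size=(T, n))
    return psi, y

psi, y = simulate_rw()                          # defaults: n=2, T=50
```

Conditional simulation, as used throughout this example, amounts to generating `psi` once and then redrawing only `y` with `psi` held fixed.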
2.5 Stock Assessment Example
The Schaefer form of the state-space surplus production model (SPM, e.g., Meyer and Millar [24]) gives latent total stock biomass in year t (i.e., \(B_t \ge 0\)) as a function of the biomass in the previous year plus production (births\(+\)growth−natural deaths) minus the fishery catch (\(C_t\), tonnes), with production modeled as a quadratic function of biomass, that is,
where the parameter r controls the intrinsic rate of biomass increase at low population size and K is the carrying capacity. We assume that there is measurement error (ME) in catches and include a catch model, \(C_t = H_t B_t\), where \(H_t \ge 0\) is based on a random walk. The stochastic population dynamics model is
where \(t=1,\ldots ,T\), \(\delta _{H1},\ldots ,\delta _{HT} {\mathop {\sim }\limits ^{iid}} N(0,\sigma ^{2}_{\delta H})\), and \(\delta _{B1},\ldots ,\delta _{BT} {\mathop {\sim }\limits ^{iid}} N(0,\sigma ^{2}_{\delta B})\).
We apply this model to data for a flatfish species off the east coast of Canada. The available data include annual estimates of total fishery catches of American plaice in Northwest Atlantic Fisheries Organization Subdiv. 3Ps during 1960–2019 (see Table 2 and Fig. 1 in Morgan et al. [25]). Another common data source is a time-series of average catch from research surveys, commonly referred to as stock size indices. The American plaice assessment uses indices derived from stratified random surveys since 1980. Our state-space model observation equations for the time-series of survey indices (I) and the catch observations (\(C_{ot}\)) are:
SPM times \(t=0,\ldots ,T\) correspond to years 1960–2019. Both \(q_E\) and \(q_C\) are survey catchability parameters to estimate. These are different because there was a major change in survey gears and stratification in Subdiv. 3Ps starting with the 1996 survey. The MEs are \(\epsilon _{It} {\mathop {\sim }\limits ^{iid}} N(0,\sigma ^{2}_{\epsilon I})\) and \(\epsilon _{Ct} {\mathop {\sim }\limits ^{iid}} N(0,\sigma ^{2}_{\epsilon C})\). We assume that \(\sigma _{\epsilon C} = 0.1\) and do not estimate this parameter, so that the model fits the catches closely, consistent with Morgan et al. [25]. The total set of parameters to estimate for the state-space model is \(H_0, r, K, \sigma _{\delta H}, \sigma _{\delta B}, q_E, q_C\), and \(\sigma _{\epsilon I}\), along with RE predictions for \(B_t\) and \(H_t, t=1961,\ldots ,2019\), and \(B_0\) for 1960. We also assume that the initial biomass is random,
which is broadly similar to the prior for \(B_0\) in Morgan et al. [25]. More details about the development of this model are provided in the Supplementary Information.
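Since the displayed SPM equations are not reproduced above, the following simulation sketch assumes a standard multiplicative-error Schaefer parameterization, which may differ in detail from the fitted model; all parameter values and variable names are illustrative:

```python
import numpy as np

# Hedged sketch of state-space Schaefer surplus production dynamics
# (an assumed parameterization, not necessarily the authors' exact one):
#   B_t = max(B_{t-1} + r*B_{t-1}*(1 - B_{t-1}/K) - C_{t-1}, eps) * exp(dB_t)
#   log H_t = log H_{t-1} + dH_t,   C_t = H_t * B_t
rng = np.random.default_rng(4)
T, r, K = 60, 0.3, 100.0                   # 60 years, illustrative r and K
sig_dB, sig_dH = 0.05, 0.1                 # process error SDs (made up)
B, H, C = np.empty(T), np.empty(T), np.empty(T)
B[0], H[0] = 0.8*K, 0.05                   # illustrative initial conditions
C[0] = H[0]*B[0]
for t in range(1, T):
    H[t] = H[t-1] * np.exp(rng.normal(0, sig_dH))      # harvest-rate RW
    prod = r*B[t-1]*(1 - B[t-1]/K)                      # quadratic production
    B[t] = max(B[t-1] + prod - C[t-1], 1e-6) * np.exp(rng.normal(0, sig_dB))
    C[t] = H[t]*B[t]

# Survey index with lognormal measurement error (q_I, sig_eI assumed):
q_I, sig_eI = 0.5, 0.2
I_obs = q_I * B * np.exp(rng.normal(0, sig_eI, T))
```

This kind of generator also shows why unconditional simulation can be fragile: with larger process-error SDs, some draws drive the biomass toward the floor, i.e., toward simulated stock extinction, which motivates the conditional simulation design in Sect. 2.5.1.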
2.5.1 SPM Simulations
The SPM is much slower to fit than the random walk model in Sect. 2.4, so we only generated 250 datasets conditional on a random value of \(\Psi \) drawn from \(f(\Psi \,|\,\theta )\); we repeated this procedure with 250 randomly generated \(\Psi \)'s from \(f(\Psi \,|\,\theta )\) to see the average effect over different \(\Psi \)'s, for a total of 62,500 simulations. Also, as mentioned in the Introduction, this procedure will often generate datasets that are unrealistically different from the observed data. In fact, some values of \(\Psi \) even result in stock extinction and model estimation errors. To avoid these problems, we simply fixed the \(\delta _{Ht}\) REs at their predicted values in the simulations and only generated random \(\Psi \)'s for the model process errors (i.e., \(\delta _{Bt}\) in Eq. 14). The random walk standard deviation for \(\delta _{Ht}\) (i.e., \(\sigma _{\delta H}\) in Table 2) is large and results in many unrealistic simulated harvest rate series.
In many of the simulations, the estimates of the process error variance (\(\sigma ^{2}_{\delta B}\)) hit a very small lower bound indicating that process errors were not needed to fit the simulation data well. This resulted in long simulation run times and problems computing \(V_f\) and \(V_r\). To avoid these problems, we fixed \(\sigma ^{2}_{\delta B}\) at the value in Table 2.
3 Results
3.1 Semiparametric Regression Example
The MMLE of \(\beta \) is
\(\hat{\beta } = (X^{\top }V(\theta )^{-1}X)^{-1}X^{\top }V(\theta )^{-1}Y,\)
which agrees with the generalized least-squares estimate in He and Severini [13]. Here \(V(\theta )=\varOmega _{\phi }+\varSigma _{\lambda }\) and \(\theta =(\phi ,\lambda )^{\top }\). The predictor of \(g_{\gamma }\) obtained by maximizing the joint likelihood is
\(\hat{g}_{\gamma } = \varSigma _{\lambda }V(\theta )^{-1}(Y - X\hat{\beta }),\)
which is the same as the BLP of \(g_{\gamma }\) in He and Severini [13]. Applying Theorem 2, we obtain
\(\mathrm {Cov}(\hat{\beta }\,|\,\gamma ) = (X^{\top }V(\theta _o)^{-1}X)^{-1}X^{\top }V(\theta _o)^{-1}\varOmega _{\phi }V(\theta _o)^{-1}X(X^{\top }V(\theta _o)^{-1}X)^{-1},\)
which is the same as the result in Theorem 4.2 of He and Severini [13]. Furthermore, Theorem 2 gives a bias \(\mathrm {E}(\hat{\beta }\,|\,\gamma )-\beta _o=(X^{\top }V(\theta _o)^{-1}X)^{-1}X^{\top }V(\theta _o)^{-1}g_{\gamma }\), which is also consistent with the results of He and Severini [13]. The detailed derivations of all these results are provided in Appendix E.
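The GLS form of \(\hat{\beta }\) makes the conditional bias exactly linear in \(g_{\gamma }\), which can be verified numerically; the design matrix, covariances and \(g_{\gamma }\) below are illustrative choices of ours:

```python
import numpy as np

# beta_hat = (X'V^{-1}X)^{-1} X'V^{-1} Y is linear in Y, so with g_gamma
# held fixed,  E(beta_hat | gamma) - beta = (X'V^{-1}X)^{-1} X'V^{-1} g_gamma
# exactly, whatever the working covariance V = Omega + Sigma.
rng = np.random.default_rng(5)
n, p, R = 40, 2, 20_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, -0.5])
z = np.linspace(0, 1, n)
g = np.sin(2*np.pi*z)                           # fixed nuisance g_gamma

Omega = 0.25*np.eye(n)                          # Cov(eps), illustrative
Sigma = np.exp(-np.abs(z[:, None]-z[None, :])/0.2)  # working Cov for g
Vinv = np.linalg.inv(Omega + Sigma)
A = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv)  # GLS operator, p x n

bias_formula = A @ g                            # (X'V^{-1}X)^{-1}X'V^{-1}g

# Monte Carlo: average beta_hat over repeated eps draws, g held fixed.
eps = rng.normal(0, 0.5, size=(R, n))           # Cov(eps) = Omega = 0.25 I
Y = X @ beta + g + eps                          # R x n response matrix
beta_hat = Y @ A.T                              # R x p GLS estimates
bias_mc = beta_hat.mean(axis=0) - beta
```

The same linearity gives the sandwich form of \(\mathrm {Cov}(\hat{\beta }\,|\,\gamma )\), since conditional on \(\gamma \) the only randomness in Y is \(\epsilon \) with covariance \(\varOmega _{\phi }\).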
3.2 Random Walk Simulation Example
The data from an arbitrary simulation with \(n=2\) and \(T=50\) are illustrated in Fig. 1, along with predictions of \(\Psi _t\) and 95% confidence intervals (CIs) based on the conditional-\(\Psi \) MSE in Eq. (11) and the conditional-\(\Psi \) variance from Eq. (9), which we denote as Vc. The conditional-\(\Psi \) MSE (11) is equal to the random-\(\Psi \) MSE (4) that TMB provides; therefore, we denote the conditional-\(\Psi \) MSE as Vr. The Vr-based (i.e., TMB) CIs cover the true values of \(\Psi \) in \(92\%\) of the years, which is close to the nominal \(95\%\) coverage of the CIs. The Vc-based CIs cover in \(86\%\) of the years. However, this is based on only one simulated set of y's. We repeated the simulation 1000 times conditional on the true \(\Psi _t\) values in Fig. 1; that is, we generated 1000 datasets from \(f(D|\Psi ,\theta )\).
We computed the average of the 1000 \(\hat{\Psi }_t\)'s at each time point; these are shown in the top panel of Fig. 2. The \(\hat{\Psi }_t\)'s are nearly unbiased but a little smoother than the true \(\Psi _t\)'s, such that the average \(\hat{\Psi }_t\)'s do not exactly match the peaks and valleys of the \(\Psi _t\)'s. This is typical of smoothing estimators. The biases are shown in the middle panel of Fig. 2. The simulation average of the estimated bias using the approximation in Eq. (10) is reasonably accurate. Note that the grey points in this panel are based on the true value of \(\Psi \) in the leading term of Eq. (10) and are usually very close to the actual simulated bias. The bias estimates using the \(\hat{\Psi }_t\)'s (heavy black lines) differ more from the simulated bias. This demonstrates that plug-in estimates of the bias may themselves be biased, since the bias is of about order O(1) in this case. The bottom panel of Fig. 2 demonstrates that the Vc estimates using Eq. (9) are a little low for this example, while the MSE estimates using Eq. (11) give average levels of the simulation-based MSEs, which vary substantially across time because Eq. (11) was derived by taking the expectation of the squared bias over \(\Psi \). The MSE estimates are also a little low on average.
We conducted many more simulations, and the patterns were similar to Fig. 2. However, the results depended on the specific \(\Psi _t\)'s in each simulation. We averaged results (see Fig. 3) over 500 randomly generated sets of random-walk \(\Psi _t\)'s, each with 500 simulations of the data conditional on the \(\Psi _t\)'s, and for different choices of n and T. The results demonstrate that when T is large the variance estimator (9) is reliable on average, and the MSE estimator (11), which is also the TMB estimator (4), represents an average level of the MSE for RE predictions. However, the 90% quantiles in Fig. 3 demonstrate that for specific values of \(\Psi \) the Vc-based estimates of the SEs of \(\hat{\Psi }\) and the Vr-based estimates of RMSEs can be considerably different from the simulation-based results. The quantiles and means are based on the average variance estimates (i.e., the Vc-based and Vr-based SEs) for each of the 500 \(\Psi \) cases: for each \(\Psi \), as in the lower panel of Fig. 2, we computed either the averages of the 500 Vc- and Vr-based SEs over the 500 simulated datasets, or the sample standard deviations and RMSEs of \(\hat{\Psi }\).
We also investigated the coverage properties of \(95\%\) confidence intervals (CIs) for \(\Psi \) based on a normal distribution assumption for \(\hat{\Psi }_t\), with Vr or Vc estimates of the variance. The simulation average coverage of the Vr CIs (see Fig. 4) was close to the \(95\%\) nominal level, whereas the probability that the Vc CIs contained \(\Psi \) was somewhat lower than \(95\%\) even when \(T=200\) and \(n=5\). The \(\Psi \)-conditional bias in \(\hat{\Psi }\) contributes to the reduced coverage of the Vc-based CIs. However, for specific values of \(\Psi \) the Vr-based CIs could differ somewhat from the nominal 95% level, with simulated coverages less than 95% for approximately 50% of the randomly generated \(\Psi \)’s. Hence, our results indicate that the Vr-based CIs are somewhat inaccurate and not simply conservative, and that the Vc-based CIs are less accurate than the Vr-based CIs, which is expected because of the conditional bias in \(\hat{\Psi }\).
The GAM (generalized additive model) literature suggests that GAM confidence intervals are accurate when averaged over the smoothing covariates (e.g., [21, 36]), which is time in the random-walk example. For each fixed \(\Psi \), we calculated the average CI coverage over time points and the 500 simulated datasets. The Vr-based CI coverage was accurate, and in the worst case (\(n=2\) and \(T=50\)) ranged from 0.93 to 0.95 over the 500 \(\Psi \) cases. The coverages were closer to 0.95 for the other choices of n and T. The Vc-based CI coverage was less accurate, and in the worst case (\(n=2\) and \(T=50\)) ranged from 0.88 to 0.93.
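The coverage contrast between Vc- and Vr-based intervals can be made exact in the same kind of linear-Gaussian sketch: conditional on \(\Psi \), the predictor at each time point is normal with known bias and known SD, so the coverage of each interval can be computed from the normal CDF rather than simulated. This is an illustrative toy model with arbitrary settings, not the paper's simulation design:

```python
import math
import numpy as np

def Phi(x):  # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = np.random.default_rng(7)
T, n, tau2, sig2, z = 50, 2, 0.3, 1.0, 1.96

L = np.eye(T) - np.eye(T, k=-1)
Q = L.T @ L / tau2                      # random-walk precision
H = Q + (n / sig2) * np.eye(T)
Hinv = np.linalg.inv(H)
se_c = np.sqrt(np.diag((n / sig2) * Hinv @ Hinv))  # Vc-based SEs
se_r = np.sqrt(np.diag(Hinv))                      # Vr-based SEs (wider)

Psi = np.cumsum(rng.normal(0.0, np.sqrt(tau2), T))
bias = -Hinv @ Q @ Psi                  # conditional bias of the predictor

# Exact coverage of CI = psi-hat +/- z*SE, since psi-hat_t ~ N(Psi_t + bias_t, se_c_t^2)
cov_c = np.array([Phi((z * sc - b) / sc) - Phi((-z * sc - b) / sc)
                  for b, sc in zip(bias, se_c)])
cov_r = np.array([Phi((z * sr - b) / sc) - Phi((-z * sr - b) / sc)
                  for b, sr, sc in zip(bias, se_r, se_c)])
print(cov_c.mean(), cov_r.mean())  # Vc coverage < 0.95; Vr coverage is higher
```

Because the Vr interval is wider than the Vc interval at every time point while the sampling distribution is the same, its conditional coverage is never lower; whether it lands above or below the nominal level at a given time depends on the size of the conditional bias there, matching the behaviour described above.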
The simulation standard deviations (SDs) and RMSEs for the random-walk parameter estimates, along with the averages of the Vr-based and Vc-based SEs, are shown in Table 1. Keeping in mind the simulation approximation errors, the Vc-based SEs are reasonably accurate for the simulation SD of the estimates, and the Vr-based SEs are reasonably accurate for the RMSE. These approximations were more accurate for the \(\beta \) variances.
3.3 Stock Assessment Example
Our SPM parameter estimates and stock assessment results (Table 2) are broadly similar to those of Morgan et al. [25]. Here the Vc- and Vr-based standard errors are shown as coefficients of variation, CVc and CVr, respectively. The CVc’s are usually substantially smaller than the CVr’s, but because of the conditional bias in the parameter estimators and RE predictors, the CVr’s provide more reliable confidence intervals. This is what we found in the simulations in Sect. 3.2, and it is also the case for the SPM simulation results we provide later in this section. More real-data analyses are provided in the Supplementary Information.
3.3.1 SPM Simulations
Simulation assessment results (Fig. 5; top panels) demonstrate that the Vc-based SEs were accurate for the \(\Psi \)-averaged simulation standard deviations in most years. The Vr-based SEs were also usually accurate for the \(\Psi \)-averaged simulation RMSE, but they were slight over-estimates in the first half of the assessment time series for biomass and harvest rates. However, especially for RMSE, for some values of \(\Psi \) the Vr SEs were substantially different from the simulation RMSE values, as indicated by the much wider blue shaded regions compared to the grey regions. The thin blue lines show the median RMSE’s across the 250 random sets of \(\Psi \), indicating that the Vr SEs were larger than the RMSE’s in slightly more than 50% of the \(\Psi \)-cases. The simulation coverage probabilities (CPs) of 95% confidence intervals (CIs) based on the Vr SE (middle panels) were much more accurate than those based on the Vc SE. When averaged over the 250 random \(\Psi \)’s, the Vr CIs had CPs close to 0.95. However, the CIs were too wide for more than half of the sets of \(\Psi \)’s, and for a small number of \(\Psi \)’s the CIs were inaccurate, with CPs \(<<0.95\). Hence, we conclude that Vr CIs are usually conservative, but not always; for some \(\Psi \sim f(\Psi |\theta )\) the Vr CIs may be unreliable. It would be practically useful to have some indication of when this problem occurs, which is a useful area for future research. The Vc-based CIs were unreliable because they do not account for the assessment model estimation bias, which can be large relative to the SEs (bottom panels).
This example illustrates problems caused by data limitations: there is only one data point per year before 1980 and two data points per year since 1980, while there are two REs to predict for each year. In this case, \(\ddot{l}_{j}(\Psi ,\theta _o)\) in Eq. (10) is close to \(-\Sigma ^{-1}\), since its \(\ddot{l}_{c}(\Psi ,\theta )\) component in (2) is small relative to the case with many data each year, and hence the biases of the RE predictions are close to minus the true RE values, \(-\Psi \) (or minus the deviation of the true RE values from the marginal means of the REs). Here the double dots denote second-order derivatives with respect to \(\Psi \). Because there are fewer data for each year before 1980 than after 1980, the biases in the RE predictions before 1980 are generally larger than after 1980, but the corresponding SEs tend to be smaller, as indicated in the top panels of Fig. 5, because the model predictions of the REs tend to cluster around their marginal means due to the lack of data. Therefore, there is a substantial decrease in the Bias/RE ratio after 1980 in the lower panels of Fig. 5. Also due to the data limitations, the MSEs of the RE predictors are close to the marginal covariance matrix \(\Sigma \) of the REs (i.e., of \(\Psi \)), as suggested by Eq. (11). Hence, combined with the previous result that the bias approaches \(-\Psi \), the distribution of bias/RMSE is close to standard normal, especially before 1980, as evidenced in the bottom panels of Fig. 5.
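This limiting behaviour, with the bias tending to \(-\Psi \) and the MSE tending to the marginal variance \(\Sigma \) as the per-year information shrinks, can be checked numerically in the Gaussian random-walk sketch, where the data contribute a per-time information \(\kappa \) playing the role of \(-\ddot{l}_{c}\). This is an illustrative sketch; \(\kappa \), T, and \(\tau ^2\) are arbitrary choices, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(3)
T, tau2 = 30, 0.5
L = np.eye(T) - np.eye(T, k=-1)
Q = L.T @ L / tau2              # RE precision, i.e., Sigma^{-1}
Sigma = np.linalg.inv(Q)
Psi = np.cumsum(rng.normal(0.0, np.sqrt(tau2), T))  # one fixed set of REs

for kappa in (10.0, 1.0, 1e-8):       # per-time data information n/sigma^2
    H = Q + kappa * np.eye(T)         # negative Hessian of the joint log-likelihood
    Hinv = np.linalg.inv(H)
    bias = -Hinv @ Q @ Psi            # conditional bias of the RE predictor
    # As kappa -> 0: bias -> -Psi, and the marginal MSE Hinv -> Sigma
    print(kappa, np.abs(bias + Psi).max(), np.abs(Hinv - Sigma).max())
```

As \(\kappa \) decreases, the predictor shrinks toward the marginal RE mean, the bias approaches \(-\Psi \), and the Vr-style MSE approaches \(\Sigma \), which is the pattern seen before 1980 in this example.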
We calculated the average CI simulation coverage probability (CP) over years and the 250 simulated datasets. Unlike the random-walk simulations, the SPM annual-average CPs depended on \(\Psi \). The range for Vr CPs was 0.334–0.997 for biomass (5% and 95% quantiles: 0.829–0.993) and 0.385–0.995 (quantiles: 0.841–0.992) for harvest rates, although the \(\Psi \)-averages were accurate (0.945 for biomass and 0.947 for harvest rates). Hence, for some values of \(\Psi \) the Vr CIs can have CPs very different from the nominal 0.95 value, even when averaged over years. However, the \(\Psi \)-medians were 0.971 for biomass and 0.969 for harvest rates, so for more than 50% of the \(\Psi \)-cases the CIs were conservative when averaged over years. The Vc-based CI CPs were less accurate than the Vr CI CPs. The range was 0.138–0.910 for biomass and 0.169–0.918 for harvest rates, and the \(\Psi \)-averages were 0.738 for biomass and 0.767 for harvest rates.
Simulated parameter estimates (Table 3) demonstrate that the Vc SEs are accurate for the simulation SDs, and the Vr SEs are accurate for the simulation RMSEs. An exception is \(\sigma _{\delta H}\); however, the \(\delta _{Ht}\)’s were fixed at the model predictions when generating the \(250\times 250\) simulated datasets, and we expect that in this situation the Vr SEs of \(\sigma _{\delta H}\) can be inaccurate.
4 Discussion
We developed frequentist variance approximations for predictions of REs (i.e., \(\Psi \)) in nonlinear mixed-effects models, and functions of these predictions and estimates of model parameters, \(\theta \). We focused on maximum likelihood estimators of \(\theta \) (i.e., \(\hat{\theta }\)) and the conditional mean predictor of \(\Psi \) given data (i.e., \(\hat{\Psi }\)). Our main contribution is a variance approximation for the inferential setting involving repeated sampling of the data when \(\Psi \), although unknown, are assumed to be fixed. This setting is more appropriate when we are particularly interested in statistical inferences based on the specific \(\Psi \) values that generated the data. This setting is also relevant when \(\Psi \) is not a RE but a high dimensional parameter that is modeled as a RE for nonparametric smoothing. There are well-known connections between smoothing methods and RE models (e.g., [4, 34, 37]). This is the case for some state-space fish stock assessment models in which subsets of \(\Psi \) will usually be time-dependent and could be considered to be complex but smooth functions of time.
We demonstrated in simulations that our Vc variance approximation (Eq. 9) is reasonably accurate. The more commonly used Vr variance approximation (Eq. 4, or equivalently Eq. 11), which is the MSE of \(\hat{\Psi }\) when \(\Psi \) is random, represents an average level of the MSE when \(\Psi \) is treated as fixed. This “average” interpretation of the MSE also appears in the generalized additive model (GAM) literature, such as Marra and Wood [21] and Wood [36]. However, for particular values of \(\Psi \) the MSE of \(\hat{\Psi }\) may differ substantially from Vr, depending on the magnitude of the difference between \(\mathrm {E}(\hat{\Psi }|\Psi )\) and \(\Psi \), i.e., the bias. We also developed an accurate approximation for the bias (i.e., Eq. 10). We recommend Vr (Eq. 11) for statistical inferences about \(\Psi \) because it reflects an average level of the MSE, although for specific values of \(\Psi \) the MSE may be quite different from the Vr estimate. Although Vc (Eq. 9) is a good approximation of \(\mathrm {Cov}(\hat{\Psi }|\Psi )\), confidence intervals (CIs) based on Vc have worse coverage probabilities than CIs based on Vr because of the conditional bias in \(\hat{\Psi }\).
The statistical properties of estimates of model quantities derived from nonlinear mixed-effects models can be complex, especially when these quantities are functions of parameters and REs. In practice, the types of confidence intervals illustrated in our stock assessment model example may be over-interpreted by managers or whoever uses the results for decision making. For users without extensive statistical expertise, the Vr-based confidence intervals may be better and more simply described from the GAM perspective: they are usually slightly conservative when averaged over a large number of years; that is, they are reliable only in the annual-average sense [21, 36]. However, this was not always the case in our surplus production model (SPM) simulations, and in some cases (i.e., for some choices of \(\Psi \)) the simulation CI coverage probability could be quite different from the nominal 95% value, even when averaged over years. We suggest that users be cautioned that confidence intervals may be unreliable in specific instances, such as for a particular year. In the stock assessment context, managers may make fishing quota decisions based on the estimated probability that stock size in the most recent year is greater than a reference value, or that the harvest rate is less than a reference value. Our analyses indicate these probabilities may be substantially inaccurate in some cases, and it is difficult to know when this occurs. However, we anticipate that better assessment model formulation and the inclusion of more of the available data can improve the accuracy of CIs.
It is well known that the parameters of SPMs may be poorly identified and estimated with the available data. In this case the biases of parameter estimates and other derived quantities may be highly uncertain, which can produce unreliable statistical inferences. It may be possible to improve inferences by improving the SPM formulation. For our American plaice case study, this could involve better modeling of how harvest rates change over years in light of major changes in fisheries management, such as a fishing moratorium. There is also information available about the differences in the two index catchability parameters (i.e., \(q_E\) and \(q_C\)) that could be used to improve model estimation. There are other formulations of SPMs that may be better (e.g., [27]). However, major improvements will likely involve very different assessment model formulations. There are substantial data on the length structure of the American plaice stock and fisheries that are informative about how reproduction rates have varied over time. There is also historic age information available that is informative about body growth rates. This information could be utilized within an integrated size-structured assessment model (e.g., [22]) to produce more precise and reliable estimates of, and inferences about, the stock population dynamics. However, this assessment model formulation research is beyond the scope of this paper.
When developing the formulas in Sect. 2.2, we assumed that \(\Psi \sim f(\Psi |\theta )\) is correctly specified. However, in simulation self-tests (described in Sect. 1), which are commonly used to examine the reliability of stock assessment models, the data are generated using \(\hat{\Psi }\), and when the distribution of \(\hat{\Psi }\) is considerably different from \(f(\Psi |\theta )\) our Vc approximation may not be reliable. For instance, the covariance matrix of \(\hat{\Psi }\) is given by Eq. (9) in this conditional setting and by Eq. (11) of Zheng and Cadigan [38] in the marginal setting, and both can be fairly different from that of \(f(\Psi |\theta )\). A complication is models that include some \(\Psi \) with little or no corresponding data, in which case \(\hat{\Psi }\) may be close to zero and simulation self-tests will not reflect uncertainty about the value of \(\Psi \). The SPM simulation results in Sect. 3.3.1 illustrated this case. Even though the true SPM was fit to the simulated data, the biases of the RE predictions were substantial. This raises concerns about the reliability of simulation self-tests that generate data with these highly biased RE predictions, whereby correct models may be identified as inadequate. We will investigate simulation self-tests in a separate paper.
Even though we demonstrated in Eq. (17) that the conditional covariance Eq. (8) can give the asymptotic variance of the regression parameter estimators for Gaussian process semiparametric regression [13], we also found that the parameter estimates are biased to order \(O(T^{-1/2})\) in finite samples in the conditional setup, which is quite common in semiparametric inference, and in some cases the bias can be larger (see, e.g., Eq. 79 in Zheng and Sutradhar [40]). Therefore, we suggest that the MSE (4) be used for constructing conditional CIs in finite samples.
The conditional inference discussed in this paper shares some similarities with the GAM framework, and hence the two share many similar results. We have seen this in the interpretation of CI coverage. As another example, Eq. (4) gives the posterior covariance matrix \(\mathbf {V}_{\beta }\) for the optimal coefficients \(\mathbf {\beta }\) of the basis functions on page 327 of Wood [36]. Nevertheless, GAMs are a large topic and their inference approaches can differ from the maximum likelihood method we consider here. Therefore, we do not attempt to build a full connection between GAMs and our results in this paper. We leave the study of statistical inference when implementing GAMs and GAMMs (generalized additive mixed models) with TMB based on the results in this paper as future research.
References
Berger JO, Liseo B, Wolpert RL (1999) Integrated likelihood methods for eliminating nuisance parameters. Stat Sci 14(1):1–28
Bhattacharya A, Pati D, Pillai NS, Dunson DB (2015) Dirichlet–Laplace priors for optimal shrinkage. J Am Stat Assoc 110(512):1479–1490
Breivik ON, Aanes F, Søvik G, Aglen A, Mehl S, Johnsen E (2021) Predicting abundance indices in areas without coverage with a latent spatio-temporal Gaussian model. ICES J Mar Sci 78(6):2031–2042
Brown PE, De Jong P (2001) Nonparametric smoothing using state space techniques. Can J Stat 29(1):37–50
Cadigan NG (2015) A state-space stock assessment model for northern cod, including under-reported catches and variable natural mortality rates. Can J Fish Aquat Sci 73(2):296–308
Deroba J, Butterworth DS, Methot R Jr, De Oliveira J, Fernandez C, Nielsen A, Cadrin S, Dickey-Collas M, Legault C, Ianelli J et al (2015) Simulation testing the robustness of stock assessment models to error: some results from the ICES strategic initiative on stock assessment methods. ICES J Mar Sci 72(1):19–30
Fahrmeir L, Kaufmann H (1985) Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. Ann Stat 13(1):342–368
Feller W (2008) An introduction to probability theory and its applications, vol 2. Wiley, New York
Fournier DA, Skaug HJ, Ancheta J, Ianelli J, Magnusson A, Maunder MN, Nielsen A, Sibert J (2012) AD Model Builder: using automatic differentiation for statistical inference of highly parameterized complex nonlinear models. Optim Methods Softw 27(2):233–249
Hall DB, Clutter M (2004) Multivariate multilevel nonlinear mixed effects models for timber yield predictions. Biometrics 60(1):16–24
Harring JR, Blozis SA (2014) Fitting correlated residual error structures in nonlinear mixed-effects models using SAS PROC NLMIXED. Behav Res Methods 46(2):372–384
He H, Severini T (2014) Integrated likelihood inference in semiparametric regression models. Metron 72(2):185–199
He H, Severini TA (2016) A flexible approach to inference in semiparametric regression models with correlated errors using Gaussian processes. Comput Stat Data Anal 103:316–329
Higham NJ (2002) Computing the nearest correlation matrix-a problem from finance. IMA J Numer Anal 22(3):329–343
Jiang J, Jia H, Chen H (2001) Maximum posterior estimation of random effects in generalized linear mixed models. Statistica Sinica 11:97–120
Johnson TR, Kim J-S (2004) A generalized estimating equations approach to mixed-effects ordinal probit models. Br J Math Stat Psychol 57(2):295–310
Kantas N, Doucet A, Singh SS, Maciejowski J, Chopin N et al (2015) On particle methods for parameter estimation in state-space models. Stat Sci 30(3):328–351
Kass RE, Steffey D (1989) Approximate Bayesian inference in conditionally independent hierarchical models (parametric empirical bayes models). J Am Stat Assoc 84(407):717–726
Kristensen K, Nielsen A, Berg CW, Skaug H, Bell B (2015) TMB: automatic differentiation and Laplace approximation. arXiv preprint arXiv:1509.00660
Lindstrom MJ, Bates DM (1990) Nonlinear mixed effects models for repeated measures data. Biometrics 46(3):673–687
Marra G, Wood SN (2012) Coverage properties of confidence intervals for generalized additive model components. Scand J Stat 39(1):53–74
Maunder MN, Punt AE (2013) A review of integrated analysis in fisheries stock assessment. Fish Res 142:61–74
McCulloch CE, Neuhaus JM (2011) Prediction of random effects in linear and generalized linear models under model misspecification. Biometrics 67(1):270–279
Meyer R, Millar RB (1999) BUGS in Bayesian stock assessments. Can J Fish Aquat Sci 56(6):1078–1087
Morgan M, Rogers R, Ings D, Wheeland L (2020) Assessment of the American plaice (Hippoglossoides platessoides) stock in NAFO Subdivision 3Ps in 2019. Tech. rep., Canadian Science Advisory Secretariat (CSAS) 2020/019. iv+ 17 p. URL https://waves-vagues.dfo-mpo.gc.ca/Library/40888149.pdf
Nielsen A, Berg CW (2014) Estimation of time-varying selectivity in stock assessments using state-space models. Fish Res 158:96–101
Pedersen MW, Berg CW (2017) A stochastic surplus production model in continuous time. Fish Fish 18(2):226–243
Perreault AM, Wheeland LJ, Morgan MJ, Cadigan NG (2020) A state-space stock assessment model for American plaice on the Grand Bank of Newfoundland. J Northw Atl Fish Sci 51:45–104
Pinheiro JC, Bates DM (1995) Approximations to the log-likelihood function in the nonlinear mixed-effects model. J Comput Graph Stat 4(1):12–35
Robinson GK (1991) That BLUP is a good thing: the estimation of random effects. Stat Sci 6(1):15–32
Severini TA (2007) Integrated likelihood functions for non-Bayesian inference. Biometrika 94(3):529–542
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Roy Stat Soc: Ser B (Methodol) 58(1):267–288
Valpine PD, Hilborn R (2005) State-space likelihoods for nonlinear fisheries time-series. Can J Fish Aquat Sci 62(9):1937–1952
Wand MP (2003) Smoothing and mixed models. Comput Stat 18(2):223–249
Weiss NA (2005) A course in probability
Wood SN (2020) Inference and computation with generalized additive models and their extensions. TEST 29(2):307–339
Wood SN, Scheipl F, Faraway JJ (2013) Straightforward intermediate rank tensor product smoothing in mixed models. Stat Comput 23(3):341–360
Zheng N, Cadigan N (2021) Frequentist delta-variance approximations with mixed-effects models and tmb. Comput Stat Data Anal 160:107227
Zheng N, Robertson M, Cadigan N, Zhang F, Morgan J, Wheeland L (2020) Spatiotemporal variation in maturation: a case study with American plaice (Hippoglossoides platessoides) on the Grand Bank off Newfoundland. Can J Fish Aquat Sci 77(10):1688–1699
Zheng N, Sutradhar BC (2018) Inferences in semi-parametric dynamic mixed models for longitudinal count data. Ann Inst Stat Math 70(1):215–247
Acknowledgements
Research funding to NC was provided by the Ocean Choice International Industry Research Chair program at the Marine Institute of Memorial University of Newfoundland. Research funding to NC and NZ was provided by the Ocean Frontier Institute, through an award from the Canada First Research Excellence Fund. We are particularly thankful to the anonymous referees for their comments which helped improve the quality of this paper.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Appendices
Appendix A: Proof of Theorem 2
In this setup with \(\Psi \) fixed, the expression for \(\mathrm {Cov}\{\hat{\Psi }(\hat{\theta })|\Psi \}\) involves \(\mathrm {Cov}(\hat{\theta }|\Psi )\), and the formula for \(\mathrm {Cov}(\hat{\theta }|\Psi )\) in turn involves \(\mathrm {Cov}\{\hat{\Psi }(\hat{\theta })|\Psi \}\), so the two cannot be solved for when \(\partial \hat{\Psi }^{\top }(\hat{\theta })/\partial \hat{\theta }\) is not sufficiently small. To circumvent this coupling of \(\mathrm {Cov}(\hat{\theta }|\Psi )\) and \(\mathrm {Cov}\{\hat{\Psi }(\hat{\theta })|\Psi \}\), consider the transformation of the REs
\(\widetilde{\Psi } = \Psi - \dfrac{\partial \hat{\Psi }(\hat{\theta })}{\partial \hat{\theta }^{\top }}\,\theta \)   (A.1)
which possesses the following good properties:
1. The Jacobian of the transformation between \(\widetilde{\Psi }\) and \(\Psi \) is an identity matrix, and hence no extra term is introduced into the likelihood function by this transformation of the REs.
2. The MLEs \(\hat{\theta }\) are not (and should not be) changed by this transformation, and the predictions for the REs are simply \(\hat{\widetilde{\Psi }} = \hat{\Psi } - \dfrac{\partial \hat{\Psi }(\hat{\theta })}{\partial \hat{\theta }^{\top }}\, \hat{\theta }\).
3. \(\dfrac{\partial \hat{\widetilde{\Psi }}^{\top }(\hat{\theta })}{\partial \hat{\theta }} = \dfrac{\partial \hat{\Psi }^{\top }(\hat{\theta })}{\partial \hat{\theta }} - \dfrac{\partial \hat{\Psi }^{\top }(\hat{\theta })}{\partial \hat{\theta }} =0\).
The third property fully decouples \(\mathrm {Cov}(\hat{\theta }|\Psi )=\mathrm {Cov}(\hat{\theta }|\widetilde{\Psi })\) from \(\mathrm {Cov}\{\hat{\widetilde{\Psi }}(\hat{\theta })|\widetilde{\Psi }\}\), so that \(\mathrm {Cov}\{\hat{\widetilde{\Psi }}(\hat{\theta })|\widetilde{\Psi }\}\) no longer involves \(\mathrm {Cov}(\hat{\theta }|\Psi )\).
By the Laplace approximation, the marginal log-likelihood is given by
\(l(\theta ) = c + \tilde{l}_j(\hat{\widetilde{\Psi }}(\theta ), \theta ) - \frac{1}{2}\ln \det \{\widetilde{H}(\theta )\},\)   (A.2)
where c is a constant, and \(\widetilde{H}(\theta ) = -\partial ^2\tilde{l}_j(\widetilde{\Psi },\theta )/\partial \widetilde{\Psi }\partial \widetilde{\Psi }^{\top }|_{\widetilde{\Psi }=\hat{\widetilde{\Psi }}}\). Here the tildes on H and l denote that the REs are reparameterized with \(\widetilde{\Psi }\) by (A.1), while the corresponding quantities using \(\Psi \) do not have tildes. The marginal likelihood \(l(\theta )\) is not influenced by this reparameterization.
The TMB MLE (maximum likelihood estimator) \(\hat{\theta }\) is obtained by solving the score equation based on (A.2)
Note that
because \(\dfrac{\partial \tilde{l}_j(\hat{\widetilde{\Psi }}(\hat{\theta }), \hat{\theta })}{\partial \hat{\widetilde{\Psi }}(\hat{\theta })}=0\) by the definition of \(\hat{\widetilde{\Psi }}(\hat{\theta })\) in TMB, where \(\dfrac{\partial \tilde{l}_j(\hat{\widetilde{\Psi }}, \hat{\theta })}{\partial \hat{\theta }}\) denotes the derivative with respect to \(\hat{\theta }\) with the dependence of \(\hat{\widetilde{\Psi }}(\hat{\theta })\) on \(\hat{\theta }\) neglected. Hence, the score equation becomes
By expanding \(\dot{l}(\hat{\theta })\) at the true value \(\theta _o\) and rearranging terms, we have
where \(\mathcal {I}^{-1} = -\left( \dfrac{\partial ^2 l(\theta _o)}{\partial \theta _o\partial \theta _o^{\top }}\right) ^{-1}\), \(\hat{\widetilde{\Psi }}_o = \hat{\widetilde{\Psi }}(\theta _o)\), and the \(\ln \det (\widetilde{H}(\theta _o))\) term is denoted as \(\Delta \). Further expanding the \(\dfrac{\partial \tilde{l}_j(\hat{\widetilde{\Psi }}_o,\theta _o)}{\partial \theta _o}\) term about the true \(\widetilde{\Psi }\), we obtain
When \(\widetilde{\Psi }\) is MVN with mean \(\mu _{\widetilde{\Psi }}\) and covariance matrix \(\Sigma \), using Eq. (B.1), we obtain
where \(\mathrm {E}( \dfrac{\partial \tilde{l}_c(\widetilde{\Psi },\theta )}{\partial \theta } \,|\, \widetilde{\Psi })=0\) because \(\tilde{l}_c(\widetilde{\Psi },\theta )\) is the true log-likelihood given \(\widetilde{\Psi }\), and the \(\ln \det (\widetilde{H}(\theta _o))\) term is assumed to be non-random after taking the conditional expectation and is denoted by a constant c. Further note that \(\dfrac{\partial ^2 \tilde{l}_j}{\partial \hat{\theta }\partial \hat{\widetilde{\Psi }}^{\top }} \left( \dfrac{\partial ^2 \tilde{l}_j}{\partial \hat{\widetilde{\Psi }}\partial \hat{\widetilde{\Psi }}^{\top }} \right) ^{-1} = -\dfrac{\partial \hat{\widetilde{\Psi }}(\hat{\theta })}{\partial \hat{\theta }^{\top }}=0\). Hence, whenever we apply the estimates \(\hat{\theta }\) and \(\hat{\widetilde{\Psi }}\) to estimate the last term in (A.4), we get 0. Therefore,
According to the law of total variance [35, pages 385–386],
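Written out in the present notation, the law of total variance applied to \(\hat{\theta }\) is the standard identity

```latex
\mathrm{Cov}(\hat{\theta})
  = \mathrm{E}\bigl\{\mathrm{Cov}(\hat{\theta}\,|\,\widetilde{\Psi})\bigr\}
  + \mathrm{Cov}\bigl\{\mathrm{E}(\hat{\theta}\,|\,\widetilde{\Psi})\bigr\},
```

so the average conditional covariance is the marginal covariance minus the covariance of the conditional mean, \(\mathrm {E}\{\mathrm {Cov}(\hat{\theta }\,|\,\widetilde{\Psi })\} = \mathrm {Cov}(\hat{\theta }) - \mathrm {Cov}\{\mathrm {E}(\hat{\theta }\,|\,\widetilde{\Psi })\}\), which is the rearrangement applied below.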
Here we use the concept of ergodicity, which implies that a sufficiently large collection of random samples from a stochastic process can represent the average statistical properties of the entire process (see, e.g., [8]); ergodicity can be regarded as the counterpart for stochastic processes of the law of large numbers. Because the REs and data are all generated from the true model, ergodicity implies that, as T gets large, \(\mathrm {Cov}( \hat{\theta } \,|\, \widetilde{\Psi })\) converges in probability to its expectation \(\mathrm {E}\lbrace \mathrm {Cov}( \hat{\theta } \,|\, \widetilde{\Psi }) \rbrace \); namely, \(\mathrm {E}\lbrace \mathrm {Cov}( \hat{\theta } \,|\, \widetilde{\Psi }) \rbrace = \mathrm {Cov}( \hat{\theta } \,|\, \widetilde{\Psi }) + o_p(T^{-1})\), since \(\mathrm {Cov}( \hat{\theta } \,|\, \widetilde{\Psi })\) is itself \(O_p(T^{-1})\). Therefore,
where
In the above derivation we kept the leading term of \(O(T^{-1})\), and hence the approximation order is \(o(T^{-1})\). In (A.6), \(\mathrm {Cov}\lbrace \hat{\theta } \,|\, \Psi \rbrace \) appears as a correction to the marginal variance \(\mathrm {Cov}( \hat{\theta } )\), with the correction term being the covariance of \(\mathcal {I}^{-1}\dfrac{\partial \tilde{l}_r(\widetilde{\Psi },\theta _o)}{\partial \theta _o}\), namely \(\mathcal {I}^{-1}\, \widetilde{\mathcal {I}}_r\, \mathcal {I}^{-1}\). Therefore \(\mathcal {I}^{-1}\, \widetilde{\mathcal {I}}_r\, \mathcal {I}^{-1}\) should be positive definite. However, because we apply the estimates \(\hat{\theta }\) and \(\hat{\widetilde{\Psi }}\), the estimated \(\mathcal {I}^{-1}\, \widetilde{\mathcal {I}}_r\, \mathcal {I}^{-1}\) may not be positive definite. When it is not, we recommend replacing it with its nearest positive semidefinite matrix [14], obtained by setting its negative eigenvalues to 0.
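The eigenvalue-truncation step described above can be sketched as follows. This is the minimal Frobenius-norm projection onto the positive semidefinite cone; the Higham [14] algorithm for the nearest correlation matrix additionally restores the unit diagonal by alternating projections, which this sketch omits:

```python
import numpy as np

def nearest_psd(A, floor=0.0):
    """Symmetrize A, then clip its negative eigenvalues to `floor`.

    This is the simple eigenvalue-truncation projection: for symmetric A
    it returns the nearest positive semidefinite matrix in Frobenius norm."""
    B = (A + A.T) / 2.0
    w, V = np.linalg.eigh(B)
    return (V * np.clip(w, floor, None)) @ V.T

# An indefinite "covariance-like" matrix (smallest eigenvalue is 1 - sqrt(2) < 0)
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])
P = nearest_psd(A)
print(np.linalg.eigvalsh(A).min())  # negative
print(np.linalg.eigvalsh(P).min())  # non-negative up to rounding
```

A matrix that is already positive semidefinite is returned unchanged, so the projection can be applied unconditionally to the estimated \(\mathcal {I}^{-1}\, \widetilde{\mathcal {I}}_r\, \mathcal {I}^{-1}\).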
Transformation (A.1) indicates that \(\Psi \) is a function of \(\widetilde{\Psi }\) and \(\theta \), and property 1 gives \(\tilde{l}_r(\widetilde{\Psi },\theta ) = l_r(\Psi ,\theta )\). We can then use the chain rule to express \(\widetilde{\mathcal {I}}_r\) in terms of \(\Psi \) instead of \(\widetilde{\Psi }\),
In (A.5), because the randomness from \(\widetilde{\Psi }\) in c is assumed to be negligible, and the expectation of \(\mathrm {E}\left( \hat{\theta } - \theta _o\,|\,\widetilde{\Psi }\right) \) with respect to \(\widetilde{\Psi }\) is approximately 0, we have \(\mathrm {E}\{c\}\approx c\approx 0\). Therefore,
Here the O(1/T) term comes from \((\hat{\theta } - \theta _o)^2\). The bias is of order \(O(T^{-1/2})\).
Appendix B: Approximation for \(\hat{\Psi }(\theta _o)\)
A first-order Taylor series expansion with \(\dot{l}_j(\hat{\Psi }(\theta _o), \theta _o)=0\) gives
Here we assume that the distribution of the REs given the data is approximately MVN, which is necessary for TMB and ADMB to integrate out the REs using the Laplace approximation. This assumption implies that the higher-order terms not shown in Eq. (B.1) are all negligibly small. We assume \(\Psi \) has a multivariate normal (MVN) distribution with mean zero and covariance matrix \(\Sigma \), which is also \(-\ddot{l}_r^{-1}(\Psi ,\theta _o)\), where \(l_r(\Psi ,\theta _o)\) is the log-likelihood of \(\Psi \) (see Eq. 2) and the derivatives are with respect to \(\Psi \). Note that for the MVN, \(\ddot{l}_r(\Psi ,\theta _o)\) does not depend on \(\Psi \). In Eq. (B.1), \(\ddot{l}_{j}(\Psi ,\theta _o)\) equals \(\ddot{l}_{c}(\Psi ,\theta _o)+\ddot{l}_r(\Psi ,\theta _o)\) (see Eq. 2). We assume that \(\mathrm {E}_{D|\Psi }\{\ddot{l}_j(\Psi ,\theta _o)\} \approx \ddot{l}_{c}(\Psi ,\theta _o)+\ddot{l}_r(\Psi ,\theta _o)\) to first order, which holds fairly generally in ecological and fisheries models. For example, if \(f(D|\Psi ,\theta _o)\) is MVN and \(\Psi \) are its means, or more generally if \(f(D|\Psi ,\theta _o)\) belongs to the exponential family and \(\Psi \) are its natural parameters, then \(\ddot{l}_{c}(\Psi ,\theta _o)\) is constant. We use the notation \(\mathcal {J}(\Psi ,\theta _o)=-\mathrm {E}_{D|\Psi }\{\ddot{l}_j(\Psi ,\theta _o)\}\). The approximation we use to derive the covariance approximations of \(\hat{\Psi }(\theta _o)\) is
\(\hat{\Psi }(\theta _o) \approx \Psi + \mathcal {J}(\Psi ,\theta _o)^{-1}\dot{l}_r(\Psi ,\theta _o) + \mathcal {J}(\Psi ,\theta _o)^{-1}\dot{l}_c(\Psi ,\theta _o).\)   (B.2)
Using Eq. (B.2),
However, \(\mathrm {Cov}_{D|\Psi }\{\dot{l}_{j}(\Psi ,\theta _o)\}=\mathrm {Cov}_{D|\Psi }\{\dot{l}_{c}(\Psi ,\theta _o)\}\) using Eq. (2), and because \(l_{c}(\Psi ,\theta _o)\) is the true likelihood when conditional on \(\Psi \),
Hence,
Appendix C: Proof of Theorem 1
The ith element of vector \(\hat{\Psi }(\theta _o)\), \(\hat{\Psi }_i(\theta _o)\), involves only the data associated with \(\Psi _i\) (the ith element of \(\Psi \)) and its close neighbors. As a result, \(\mathrm {Cov}\{ \dot{l}(\theta _o), \hat{\Psi }_i(\theta _o) \,|\,\Psi \}= O(1)\); that is, \(\mathrm {Cov}\{ \dot{l}(\theta _o), \hat{\Psi }(\theta _o) \,|\,\Psi \}= O(1)\). Then we have
since \(\dot{l}(\hat{\theta })\equiv 0\). Because \(\mathcal {I}^{-1}\) is \(O(T^{-1})\), we proved
According to (B.2),
where conditional on \(\Psi \), the first two terms are constant and \(\dot{l}_c(\Psi ,\theta _o)\) is the score. Then for the \(O_p(T^{-1})\) in (5)
\(\{\mathcal {J}(\Psi ,\theta _o)^{-1}\dot{l}_c(\Psi ,\theta _o)\, O_p(T^{-1})\}\) is \(O_p(T^{-1})\), and hence \(\mathrm {E}\lbrace | \mathcal {J}(\Psi ,\theta _o)^{-1}\dot{l}_c(\Psi ,\theta _o) O_p(T^{-1}) | \,|\,\Psi \rbrace \) is \(O(T^{-1})\) if \(\{\mathcal {J}(\Psi ,\theta _o)^{-1}\dot{l}_c(\Psi ,\theta _o)\, O_p(T^{-1})\}\) is uniformly integrable, which we assume to be true. Because \(\mathrm {E}\lbrace \dot{l}_c(\Psi ,\theta _o)\,|\,\Psi \rbrace =0\), it is reasonable that \(\mathrm {E}\{ \mathcal {J}(\Psi ,\theta _o)^{-1}\dot{l}_c(\Psi ,\theta _o)\, O_p(T^{-1}) \,|\,\Psi \}\) is smaller than \(O(T^{-1})\), namely \(o(T^{-1})\). This can be seen further as follows. Let \(A_i\) be the ith element of the vector \(\mathcal {J}(\Psi ,\theta _o)^{-1}\dot{l}_c(\Psi ,\theta _o)\), which involves only the data associated with \(\Psi _i\) and its close neighbors. The \(O_p(T^{-1})\) term is mainly \((\hat{\theta } - \theta _o)^2\). \(\hat{\theta }\) is based on the data in all T units, and hence as T increases \((\hat{\theta } - \theta _o)^2\) becomes less correlated with \(A_i\). As a result,
which implies that \(\mathrm {E}\lbrace A_i\, O_p(T^{-1})\,|\,\Psi \rbrace \) is \(o(T^{-1})\), because \(\mathrm {E}\lbrace O_p(T^{-1})\,|\,\Psi \rbrace \) is \(O(T^{-1})\). Applying this result to (C.2), we find that \(\mathrm {Cov}\lbrace \hat{\Psi }(\theta _o), O_p(T^{-1})\,|\,\Psi \rbrace \) is \(o(T^{-1})\). Therefore,
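The decorrelation argument above can be illustrated with a toy computation (a stand-in construction, not the mixed-effects model of this paper): take \(A_i\) to be a statistic of unit i's data alone and \(\hat{\theta }\) a global mean over all T units. For i.i.d. centered errors \(e\) with third central moment \(\mu _3\), one has \(\mathrm {Cov}(e_1, \bar{e}^2)=\mu _3/T^2\), which is \(o(T^{-1})\) as claimed.

```python
import numpy as np

# Toy illustration (not the paper's model): A_i is a statistic of unit i's
# data, and theta_hat is a global mean over all T units.  For centered
# exponential errors e (third central moment mu3 = 2),
#   Cov(e_1, e_bar^2) = mu3 / T^2 = o(1/T),
# mirroring the claim that Cov{A_i, (theta_hat - theta_o)^2 | Psi}
# vanishes faster than O(T^{-1}).
rng = np.random.default_rng(1)
T, reps = 20, 200_000
e = rng.exponential(1.0, size=(reps, T)) - 1.0  # centered, mu3 = 2
A_i = e[:, 0]                                   # "local" statistic of unit 1
ebar_sq = e.mean(axis=1) ** 2                   # (theta_hat - theta_o)^2 analogue
cov = np.mean(A_i * ebar_sq)                    # E(e_1) = 0, so this is the covariance
print(T**2 * cov)                               # should be near mu3 = 2
```

Scaling the empirical covariance by \(T^2\) recovers a value near \(\mu _3=2\), confirming the \(O(T^{-2})=o(T^{-1})\) rate in this toy setting.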
Appendix D: Proof of Corollary 2
It is more convenient to discuss the MSE in terms of \(\widetilde{\Psi }\) defined in (A.1), because \(\partial \hat{\widetilde{\Psi }}(\hat{\theta })/\partial \hat{\theta }^{\top }=0\) and hence \(\hat{\widetilde{\Psi }}(\hat{\theta })\) is fully decoupled from \(\hat{\theta }\).
according to (B.2). Here the \(O(1/T)\) term comes from \(\mathrm {E}\{ (\hat{\theta } - \theta _o)^2 \,|\,\Psi \}\). The \(\mathrm {E}( \hat{\theta } - \theta _o \,|\,\Psi )\) term is neglected because \(\partial \hat{\widetilde{\Psi }}(\hat{\theta })/\partial \hat{\theta }^{\top }=0\); that is, whenever \(\partial \hat{\widetilde{\Psi }}(\hat{\theta })/\partial \hat{\theta }^{\top }=0\) is used to estimate \(\partial \hat{\widetilde{\Psi }}(\theta _o)/\partial \theta _o^{\top }\), the estimate is 0. An estimate of the squared bias is
where \(\mathrm {E}\{\Psi \, O(1/T)\}=0\) can be proved with an argument similar to that in Appendix A.2 of Zheng and Cadigan [38]. The conditional MSE of \(\hat{\widetilde{\Psi }}(\hat{\theta })\) is
which is equal to the marginal MSE of \(\hat{\widetilde{\Psi }}(\hat{\theta })\) as in (4) since \(\partial \hat{\widetilde{\Psi }}(\hat{\theta })/\partial \hat{\theta }^{\top }=0\).
For \(\hat{\theta }\), according to (A.8), the squared bias can be estimated by
The conditional MSE of \(\hat{\theta }\) is
according to (A.6).
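The conditional-MSE decomposition underlying (D.1) and (D.3) can be checked in a minimal toy model (an illustration only, with hypothetical values, not the model of this paper): conditional on a random effect \(u\), the sample mean has conditional bias \(u\) and conditional MSE \(\sigma ^2/T + u^2\), while averaging over \(u\) recovers the unconditional MSE, echoing Corollary 2.

```python
import numpy as np

# Toy check of MSE(theta_hat | u) = Var(theta_hat | u) + bias^2:
# conditional on a random effect u, data are x_i ~ N(mu + u, sigma^2)
# and theta_hat = x_bar estimates mu, so
#   MSE(theta_hat | u) = sigma^2/T + u^2.
rng = np.random.default_rng(0)
T, reps = 25, 100_000
mu, sigma, u = 1.0, 2.0, 0.7                 # u: one fixed realization
x = rng.normal(mu + u, sigma, size=(reps, T))
theta_hat = x.mean(axis=1)
mse_cond = np.mean((theta_hat - mu) ** 2)    # Monte Carlo conditional MSE
print(mse_cond, sigma**2 / T + u**2)         # the two should be close
```

Averaging \(\sigma ^2/T + u^2\) over \(u\sim N(0,\sigma _u^2)\) gives \(\sigma ^2/T + \sigma _u^2\), the unconditional MSE, which is the sense in which the conditional and marginal MSEs coincide on average.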
Because \(\partial \hat{\widetilde{\Psi }}(\hat{\theta })/\partial \hat{\theta }=0\), the covariance between \(\hat{\theta }\) and \(\hat{\widetilde{\Psi }}(\hat{\theta })\) is negligible. Therefore, (D.1) and (D.3) imply that \(\mathrm {MSE}(\hat{\widetilde{\Omega }}\,|\,\widetilde{\Psi })=\mathrm {MSE}(\hat{\widetilde{\Omega }})\) with \(\hat{\widetilde{\Omega }}=(\hat{\widetilde{\Psi }}^{\top },\hat{\theta }^{\top })^{\top }\). This proves that the TMB-reported variance can be used for the conditional MSE given \(\Psi \). For example, the conditional MSE of \(\hat{\Omega }=(\hat{\Psi }(\hat{\theta })^{\top },\hat{\theta }^{\top })^{\top }\) can be evaluated with the Delta method as
which proves (11). Here the \(I\)'s are identity matrices and the \(\mathbf {0}\)'s are zero matrices.
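The block structure of such a Delta-method evaluation can be sketched as follows. This is a schematic illustration rather than the exact Eq. (11): the matrix `G` is a hypothetical placeholder for the Jacobian block \(\partial \hat{\Psi }(\theta )/\partial \theta ^{\top }\), and the covariance of the decoupled pair \((\hat{\widetilde{\Psi }},\hat{\theta })\) is taken block diagonal.

```python
import numpy as np

# Schematic Delta-method assembly: with the decoupled stacked estimator
# having block-diagonal covariance S = diag(S_psi, S_theta), and a Jacobian
# with identity blocks and a (hypothetical) cross block G,
#   MSE = J S J^T.
def delta_mse(S_psi, S_theta, G):
    q, p = G.shape
    J = np.block([[np.eye(q), G],
                  [np.zeros((p, q)), np.eye(p)]])
    S = np.block([[S_psi, np.zeros((q, p))],
                  [np.zeros((p, q)), S_theta]])
    return J @ S @ J.T

rng = np.random.default_rng(0)
q, p = 3, 2
S_psi, S_theta = np.eye(q), 0.1 * np.eye(p)
G = rng.standard_normal((q, p))          # placeholder for d Psi_hat / d theta^T
M = delta_mse(S_psi, S_theta, G)
# Top-left block is S_psi + G S_theta G^T; off-diagonal blocks are G S_theta.
```

The resulting top-left block, \(S_{\Psi } + G S_{\theta } G^{\top }\), shows how uncertainty in \(\hat{\theta }\) propagates into the MSE of the RE predictor.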
Appendix E: Derivation of the Semiparametric Regression Example
The marginal log-likelihood is obtained by integrating out \(\gamma \),
For a given value of the dispersion parameter \(\theta \), He and Severini [13] estimated \(\beta \) with generalized least-squares, which is the same as the MMLE,
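As a numerical sketch of this generalized least-squares estimator, \(\hat{\beta }=(X^{\top }V^{-1}X)^{-1}X^{\top }V^{-1}Y\), the following computes it two equivalent ways: the direct formula and Cholesky whitening. The matrices are illustrative stand-ins, with the marginal covariance written \(V=\varOmega _{\phi }+\varSigma _{\lambda }\) as suggested by the identities used below.

```python
import numpy as np

# GLS estimator beta_hat = (X^T V^{-1} X)^{-1} X^T V^{-1} Y, with
# V = Omega + Sigma (illustrative stand-in matrices), computed via
# (a) the direct formula and (b) ordinary least squares on whitened data.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
Omega = np.eye(n)                              # stand-in for Omega_phi
B = rng.standard_normal((n, n))
Sigma = B @ B.T / n + np.eye(n)                # stand-in for Sigma_lambda
V = Omega + Sigma

beta_direct = np.linalg.solve(X.T @ np.linalg.solve(V, X),
                              X.T @ np.linalg.solve(V, Y))
L = np.linalg.cholesky(V)                      # V = L L^T
Xw = np.linalg.solve(L, X)                     # whiten: L^{-1} X
Yw = np.linalg.solve(L, Y)                     # whiten: L^{-1} Y
beta_whiten, *_ = np.linalg.lstsq(Xw, Yw, rcond=None)
# The two computations agree to numerical precision.
```

The whitening form makes explicit why GLS coincides with ordinary least squares after transforming the data so that the errors have identity covariance.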
Here \(\theta \) can be replaced by any efficient and consistent estimator, including the MMLE or the restricted maximum likelihood (REML) estimate adopted by He and Severini [13]. They emphasized that the Gaussian process with covariance \(\varSigma _{\lambda }\) is only a technical device to integrate out \(g_{\gamma }\); therefore, \(\lambda \) does not exist and hence has no true value. They denoted the asymptotic value of the REML estimate of \(\theta \) by \(\theta ^*\). In the framework of this paper, we assume that \(\theta \) can be estimated with a consistent estimator \(\theta _o+O(n^{-1/2})\), such as the MLE, and then Theorem 4.2 of He and Severini [13] gives, asymptotically,
where \(\mathcal {I}=\partial ^2 l(\beta _o,\theta _o)/\partial \beta _o\partial \beta _o^{\top }=X^{\top }V(\theta _o)^{-1}X\) according to Eq. (E.1). For simplicity, we consider inference only about \(\beta \), which is the primary interest of this regression model, and we assume that the dispersion parameter \(\theta \) can be estimated by some efficient and consistent estimator \(\hat{\theta }\) converging to \(\theta _o\). The log-likelihood based on the joint distribution of Y and \(\gamma \) is
The predictor of \(g_{\gamma }\) obtained by maximizing \(l_j\) at \(\hat{\beta }\) and \(\hat{\theta }\) is
which is the same as the Best Linear Predictor (BLP) of \(g_{\gamma }\) in He and Severini [13]. We used \(\left( \varOmega _{\hat{\phi }}^{-1} + \varSigma _{\hat{\lambda }}^{-1} \right) ^{-1} = \varSigma _{\hat{\lambda }}(\varOmega _{\hat{\phi }}+\varSigma _{\hat{\lambda }})^{-1}\varOmega _{\hat{\phi }}\), which can easily be verified by taking the inverse of the right-hand side. Therefore, applying Theorem 2 with \(\theta \) fixed at its true value \(\theta _o\), we obtain after some algebra,
which is the same as Eq. (E.3), the result given in He and Severini [13]. Even though this result is somewhat obvious from directly evaluating the covariance of Eq. (E.2) conditional on \(\gamma \), this example tests the applicability of our results to semiparametric regression models. Furthermore, direct evaluation based on (12) and (E.2) reveals the bias \(\mathrm {E}(\hat{\beta }\,|\,\gamma )-\beta _o=(X^{\top }V(\theta _o)^{-1}X)^{-1}X^{\top }V(\theta _o)^{-1}g_{\gamma }\), which agrees with the bias obtained using Theorem 2.
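Both the matrix identity and the conditional-bias expression used in this appendix can be verified numerically. The matrices below are illustrative stand-ins (not the paper's data); the bias check uses only the facts that \(\hat{\beta }=AY\) with \(A=(X^{\top }V^{-1}X)^{-1}X^{\top }V^{-1}\), \(\mathrm {E}(Y\,|\,\gamma )=X\beta _o+g_{\gamma }\), and \(AX=I\).

```python
import numpy as np

# (i) Check (Omega^{-1} + Sigma^{-1})^{-1} = Sigma (Omega + Sigma)^{-1} Omega
# for arbitrary symmetric positive-definite matrices.
rng = np.random.default_rng(3)
n, p = 12, 2
B1 = rng.standard_normal((n, n)); Omega = B1 @ B1.T + n * np.eye(n)
B2 = rng.standard_normal((n, n)); Sigma = B2 @ B2.T + n * np.eye(n)
lhs = np.linalg.inv(np.linalg.inv(Omega) + np.linalg.inv(Sigma))
rhs = Sigma @ np.linalg.inv(Omega + Sigma) @ Omega
assert np.allclose(lhs, rhs)

# (ii) Check the RE-conditional bias of the GLS estimator:
#   E(beta_hat | gamma) - beta_o = A g_gamma,  A = (X^T V^{-1} X)^{-1} X^T V^{-1},
# since E(Y | gamma) = X beta_o + g_gamma and A X = I.
X = rng.standard_normal((n, p))
V = Omega + Sigma
beta_o = np.array([1.0, -0.5])
g = rng.standard_normal(n)                   # one fixed realization of g_gamma
A = np.linalg.solve(X.T @ np.linalg.solve(V, X), X.T) @ np.linalg.inv(V)
assert np.allclose(A @ X, np.eye(p))         # A X = I
bias = A @ (X @ beta_o + g) - beta_o         # E(beta_hat | gamma) - beta_o
assert np.allclose(bias, A @ g)              # matches the displayed formula
```

The second check makes transparent why the RE-conditional bias is linear in the realized \(g_{\gamma }\): the fixed-effect part is reproduced exactly by \(AX=I\), leaving only the projection \(Ag_{\gamma }\).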
Zheng, N., Cadigan, N. Frequentist Conditional Variance for Nonlinear Mixed-Effects Models. J Stat Theory Pract 17, 3 (2023). https://doi.org/10.1007/s42519-022-00304-5