1 Introduction

Nonlinear mixed-effects models have been widely implemented to address complex multivariate correlation structures in data (see, e.g., [10, 11]; among many others) and cover a broad spectrum of statistical models. In some applications, the fixed effects, such as the regression parameters, are of primary interest, while the random effects (REs) are introduced only to account for the complex dependencies in the data (e.g., [16, 40]). However, in many other applications, REs or functions of REs represent quantities of practical significance and hence are also important to predict, and the correlations among REs are used to improve statistical inferences at spatiotemporal locations with few data (e.g., [3, 39]).

According to the definition of conditional probability, mixed-effects models (linear or nonlinear) can be written in the form of \(f(D,\Psi |\theta )=f(D|\Psi ,\theta )f(\Psi |\theta )\) (see, e.g., [20, 29]), where the vector of data D is assumed to have a multivariate probability density/mass function (pdf/pmf) \(f(D|\Psi ,\theta )\), given values of the vectors of fixed-effects parameters \(\theta \) and REs \(\Psi \). The marginal distribution of \(\Psi \) is \(f(\Psi |\theta )\). A specific example of the mixed-effects model is the fisheries state-space population dynamics model where \(f(\Psi |\theta )\) is the process model describing how the latent population processes evolve over time and/or space and \(f(D|\Psi ,\theta )\) is the observation model linking data to the latent processes (e.g., [33]). Nonlinear mixed-effects models have numerous applications in many fields including fisheries, ecology, environmental sciences, econometrics and engineering (e.g., [17]). The implementation of these models in fisheries and ecological studies relies heavily on software packages including Automatic Differentiation Model Builder (ADMB, Fournier et al. [9]) and Template Model Builder (TMB, Kristensen et al. [19]). Therefore, in this paper we study inference for nonlinear mixed-effects models as implemented with TMB or ADMB. These packages use the maximum marginal likelihood estimator (MMLE) to estimate the fixed effects \(\theta \).

The marginal distribution of D is

$$\begin{aligned} f(D|\theta ) = \idotsint _{q}{f(D|\Psi ,\theta )f(\Psi |\theta )}d\Psi _1,\ldots ,d\Psi _q, \end{aligned}$$
(1)

where \(\Psi _1,\ldots ,\Psi _q\) are the elements of the \(q \times 1\) vector \(\Psi \). For simplicity, this q-fold integral is denoted as \(\int f(D|\Psi ,\theta )\) \(f(\Psi |\theta )d\Psi \). The MMLE of \(\theta \) is the value \(\hat{\theta }\) that maximizes \(f(D|\theta )\); throughout this paper, we use \(\hat{\theta }\) to denote the MMLE of \(\theta \). The integral in Equation (1) will usually not have a closed form; however, TMB can approximate the marginal likelihood via the Laplace approximation quickly for possibly many (i.e., tens of thousands of) REs by efficiently utilizing the sparseness of the joint distribution \(f(D,\Psi |\theta )\) with respect to \(\Psi \). The REs \(\Psi \) can be predicted with the conditional mean \(\hat{\Psi }_{\mathrm {E}}(\hat{\theta })=\int {\Psi f(\Psi |D, \hat{\theta })d\Psi }\), which is also the empirical Bayes predictor of \(\Psi \) in the Bayesian framework (e.g., [18]). McCulloch and Neuhaus [23] showed, for generalized linear mixed models, that \(\mathrm {E}\{\Psi |D,\theta \}\) is the best predictor in the sense of minimizing the overall mean squared error (MSE) of prediction. REs can also be predicted with the posterior mode \(\hat{\Psi }(\hat{\theta })\) that maximizes the joint distribution \(f(D,\Psi |\theta )\), or equivalently the posterior \(f(\Psi |D,\theta )\), when \(\theta =\hat{\theta }\). Note the difference between the posterior mean \(\hat{\Psi }_{\mathrm {E}}\) and the posterior mode \(\hat{\Psi }\). In linear mixed models, the posterior mode RE predictor \(\hat{\Psi }\) is known as the empirical best linear unbiased predictor (EBLUP; Robinson [30]). In generalized linear mixed models, Jiang et al.
[15] called \(\hat{\Psi }\) the maximum posterior estimate (MPE) of \(\Psi \), and proved that, given sufficient information about the REs, a restricted version of the MPE is overall consistent regardless of the values of the dispersion parameters of the RE distribution at which \(\hat{\Psi }\) is evaluated, even though the prediction of an individual RE is biased. In this paper, we use the posterior mode \(\hat{\Psi }\) to predict REs in a more general situation where there may not be sufficient data for all the REs, and in particular there may be no data for some subset of REs. When the joint pdf \(f(D,\Psi |\theta )\) is unimodal and approximately symmetric in \(\Psi \), then \(\hat{\Psi }_{\mathrm {E}}\) and \(\hat{\Psi }\) are approximately the same. The focus of our research is statistical inference with TMB and ADMB, which apply the Laplace approximation by assuming \(f(D,\Psi |\theta )\) is approximately multivariate normal (MVN) in \(\Psi \). Under such circumstances, \(\hat{\Psi }_{\mathrm {E}}\) is approximately equivalent to \(\hat{\Psi }\), and hence the favorable properties of \(\hat{\Psi }_{\mathrm {E}}\) also hold for \(\hat{\Psi }\).
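To make the Laplace approximation concrete, the following minimal sketch (our own toy Poisson-normal model with a single scalar RE; TMB is not used, and the parameter values are illustrative) approximates the marginal likelihood in Eq. (1) by maximizing the joint loglikelihood over \(\Psi \) and applying a Gaussian curvature correction, then compares the result with direct numerical integration:

```python
import numpy as np
from scipy import integrate, optimize

rng = np.random.default_rng(1)

# Toy model: a single scalar RE psi ~ N(0, sigma_psi^2) and Poisson counts
# y_j | psi ~ Poisson(exp(theta + psi)); theta and sigma_psi are illustrative.
theta, sigma_psi = 1.0, 0.5
psi_true = rng.normal(0.0, sigma_psi)
y = rng.poisson(np.exp(theta + psi_true), size=10)

def joint_loglik(psi, th):
    # l_j = l_c + l_r, dropping the -log(y!) constant
    eta = th + psi
    l_c = np.sum(y * eta - np.exp(eta))
    l_r = -0.5 * psi**2 / sigma_psi**2 - 0.5 * np.log(2 * np.pi * sigma_psi**2)
    return l_c + l_r

def laplace_marginal(th):
    # maximize l_j over psi, then apply the Gaussian curvature correction
    res = optimize.minimize_scalar(lambda p: -joint_loglik(p, th))
    psi_hat, h = res.x, 1e-5
    curv = (joint_loglik(psi_hat + h, th) - 2.0 * joint_loglik(psi_hat, th)
            + joint_loglik(psi_hat - h, th)) / h**2   # numerical l_j''
    return joint_loglik(psi_hat, th) + 0.5 * np.log(2.0 * np.pi) - 0.5 * np.log(-curv)

def quad_marginal(th):
    # "exact" marginal loglikelihood by one-dimensional quadrature
    val, _ = integrate.quad(lambda p: np.exp(joint_loglik(p, th)), -5.0, 5.0)
    return np.log(val)

print(laplace_marginal(theta), quad_marginal(theta))
```

For this nearly Gaussian posterior the two values agree closely; TMB performs the analogous construction for high-dimensional \(\Psi \) using sparse automatic differentiation.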

We consider a conceptual frequentist inferential setting where the REs, \(\Psi \), are drawn once from the process model \(f(\Psi |\theta )\) and then fixed at these values during repeated data generations from the observation model \(f(D|\Psi ,\theta )\). This is a realistic inferential setting since, in many cases, an effect is treated as random only because it is unobservable and high-dimensional, not because it is truly random. For instance, the popular lasso/\(L_1\) regularization [32] for addressing high-dimensional (HD) linear regression parameters is equivalent to introducing a double exponential (Laplace) marginal distribution (or prior in a Bayesian interpretation) on the HD coefficients, and then estimating the HD parameters using the posterior mode [2]. In fisheries state-space assessment models, the annual population abundance and fishing mortality rates are frequently modeled as REs (e.g., [5, 28]). Even though there are process errors in how these effects are modeled, there is only one set of process errors and only one time-series of true population abundance and fishing mortality rates to make statistical inferences for. That is, the yearly time-series of unknown population abundance and mortality rates may be one draw from a larger population (i.e., they are random variables), but once realized, they behave like high-dimensional parameters that stay constant during repeated sampling (i.e., catches) in different months of the year or at different locations. Under such circumstances, it is more appropriate to make statistical inferences conditional on the unknown REs. In this conditional inferential setting, rather than the marginal mean and covariance, we should evaluate the conditional mean \(\mathrm {E}\{\cdot |\Psi \}\) and covariance \(\mathrm {Cov}\{\cdot |\Psi \}\).
The marginal statistical properties are different from the conditional properties, and this can lead to misinterpretation of confidence intervals (CIs) and possibly wrong fisheries management decisions if the conditional setting is actually appropriate. For example, in the marginal setting the parameter estimators and RE predictors are all approximately unbiased [38]; however, in the conditional setting their biases are not negligible (see Sect. 2.2). In this paper, we investigate the biases and covariances of parameter estimators and RE predictors in the conditional setting, and we also examine the CI coverage properties using simulation studies.

The marginal inferential setting may be mis-specified, which is often revealed when simulation testing the efficacy of state-space models. In the marginal setting, in each simulation run the REs \(\Psi \) need to be generated from \(f(\Psi |\theta )\), which frequently results in unrealistic REs, the extinction of the simulated fish stock, and unusable simulation data. A commonly used procedure to address this problem is repeated sampling of D from \(f(D|\hat{\Psi },\hat{\theta })\) (e.g., [5, 26, 28]); namely, the REs are fixed at \(\hat{\Psi }\) instead of being randomly generated in each simulation. In stock assessment, this is referred to as a simulation self-test [6]. This simulation setup is much closer to our conditional setting than to the marginal setting, and hence a study based on the conditional setting can reveal and explain the difference between marginal inference and self-tests, and improve the interpretation of self-test results. However, the distribution of the RE predictor \(\hat{\Psi }\) is different from that of the RE \(\Psi \), namely \(f(\Psi |\theta )\), and thus the results in this paper for conditioning on true REs may not be fully applicable to self-tests. This issue will be further clarified in Sect. 4.

The results in this paper are also generally applicable to the Gaussian process semiparametric regression model of He and Severini [12, 13] and to the type of integrated likelihood [1, 31] used for primary model parameters (e.g., regression coefficients) in which the unknown nuisance parameters, even though fixed, are integrated out by technically assuming some distribution for them, usually MVN. We will illustrate this application with an example in Sect. 2.3.

2 Materials and Methods

2.1 Notation and Background

Consider a nonlinear mixed-effects model for random response data, which are collected in an \(n \times 1\) vector D and are assumed to have a multivariate pdf \(f(D|\Psi ,\theta )\). The means and covariances of D depend on the fixed-effects parameters \(\theta \) (\(p \times 1\)) and the random effects \(\Psi \) (\(q \times 1\)), possibly via nonlinear functions of \(\theta \), \(\Psi \), and covariates, which we do not develop notation for and leave implicit in \(f(D|\Psi ,\theta )\). The pdf of \(\Psi \) is \(f(\Psi |\theta )\). We denote the joint loglikelihood of \(\theta \) and \(\Psi \) as

$$\begin{aligned} l_{j}(\Psi ,\theta )=l_{c}(\Psi ,\theta )+l_r(\Psi ,\theta ), \end{aligned}$$
(2)

with the conditional data loglikelihood \(l_{c}(\Psi ,\theta )=\ln \{f(D|\Psi ,\theta )\}\) and the loglikelihood of the REs \(l_r(\Psi ,\theta )=\ln \{f(\Psi |\theta )\}\). The marginal distribution of D is given by Eq. (1), and the marginal loglikelihood is denoted as \(l(\theta )\). The true parameters \(\theta _o\) are estimated with the MMLE \(\hat{\theta }\), and the REs \(\Psi \) are predicted with the mode of \(l_{j}(\Psi ,\hat{\theta })\) with respect to \(\Psi \), which is denoted as \(\hat{\Psi }(\hat{\theta })\). Here the unknown true parameters \(\theta _o\) are replaced with the MMLE \(\hat{\theta }\). \(\hat{\Psi }(\theta )\) denotes the mode of \(l_{j}(\Psi ,\theta )\) with respect to \(\Psi \) for general \(\theta \) and can be found by solving the equation

$$\begin{aligned} \dot{l}_{j}(\Psi ,\theta )|_{\Psi =\hat{\Psi }(\theta )}=\left. \frac{\partial l_{j}(\Psi ,\theta )}{\partial \Psi }\right| _{\Psi =\hat{\Psi }(\theta )}=0. \end{aligned}$$
(3)

If the joint pdf \(f(D,\Psi \,|\,\theta )=f(D|\Psi ,\theta )f(\Psi |\theta )\) is unimodal and approximately symmetric in \(\Psi \), then \(\hat{\Psi }(\theta )\) is a good approximation for the conditional mean of the REs given the data, \(\mathrm {E}\{\Psi \,|\,D,\theta \}\).
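For a Gaussian toy model, the score equation (3) is linear in \(\Psi \), so the mode is obtained from a single sparse linear solve; the sketch below (our own illustrative random-walk state-space setup, with one observation per time step) makes this explicit:

```python
import numpy as np

rng = np.random.default_rng(7)
T, sig_psi, sig_eps = 20, 1.0, 0.5

# Gaussian state-space toy: psi follows a random walk, y_t = psi_t + eps_t
# (a flat prior is used for psi_1 in this sketch).
psi = np.cumsum(rng.normal(0.0, sig_psi, T))
y = psi + rng.normal(0.0, sig_eps, T)

# l_j(psi) = -sum((y - psi)^2)/(2 sig_eps^2) - sum(diff(psi)^2)/(2 sig_psi^2).
# Its gradient is linear in psi, so eq. (3) reduces to the sparse
# tridiagonal system A psi = b, with A = -l_j'' (the joint Hessian).
D = np.diff(np.eye(T), axis=0)               # first-difference matrix
A = np.eye(T) / sig_eps**2 + D.T @ D / sig_psi**2
b = y / sig_eps**2
psi_hat = np.linalg.solve(A, b)              # posterior mode (= mean here)

print(np.max(np.abs(psi_hat - psi)))
```

For non-Gaussian models this same solve becomes one step of Newton's method, iterated to convergence; the sparsity of \(A\) is what TMB exploits for efficiency.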

When deriving approximation orders, we assume that there are \(i=1,\ldots ,T\) observational units and that there are \(n_i\) observations in the ith unit that share the same subset of REs. For example, in a time-series setting, T may indicate the number of years and \(n_t\) the number of observations in year t. Our approximation orders will be conservative in some cases.

One of the main results in Zheng and Cadigan [38] is given here as a proposition for future reference.

Proposition 1

(adapted from Eqs. (13) and (14) of Zheng and Cadigan [38]) If the conditional distribution of \(\Psi \) given data D is approximately MVN, the mean squared error (MSE) of RE predictors and parameter estimators can be estimated with

$$\begin{aligned} \mathrm {Cov}\left\{ \left[ \begin{array}{c} \hat{\Psi }(\hat{\theta })-\Psi \\ \hat{\theta } \\ \end{array} \right] \right\}&\approx \left[ \begin{array}{cc} -\ddot{l}_{j}^{-1} &{} 0 \\ 0 &{} 0 \end{array} \right] + \left[ \begin{array}{c} \dfrac{\partial \hat{\Psi }(\hat{\theta })}{\partial \hat{\theta }^{\top }} \\ I \end{array} \right] \mathrm {Cov}(\hat{\theta }) \left[ \begin{array}{cc} \dfrac{\partial \hat{\Psi }^{\top }(\hat{\theta })}{\partial \hat{\theta }}&I \end{array} \right] , \end{aligned}$$
(4)

where \(\ddot{l}_{j}= \partial ^{2}l_{j}(\Psi ,\theta )/\partial \Psi \partial \Psi ^{\top }|_{\theta =\hat{\theta },\Psi =\hat{\Psi }}\), I is a \(p\times p\) identity matrix, \(\partial \hat{\Psi }(\hat{\theta })/\partial \hat{\theta }^{\top }\) denotes \(\partial \hat{\Psi }(\theta )/\partial \theta ^{\top }|_{\theta =\hat{\theta }}\) and \(\mathrm {Cov}(\hat{\theta }) = -\ddot{l}^{-1}(\hat{\theta })\) which is the matrix inverse of the Hessian of the negative marginal loglikelihood evaluated at \(\hat{\theta }\).

TMB uses Eq. (4) combined with the generalized delta method to calculate the prediction standard errors (SEs) for user-specified differentiable functions of REs and parameters (\(g(\Psi ,\theta )\); see Eq. 15 in Zheng and Cadigan [38]). Hence, TMB implicitly assumes that the conditional distribution of \(\Psi \) given data D is approximately normal, which is also required for the Laplace approximation to be accurate for the marginal likelihood in Eq. (1). TMB generalized delta SEs implicitly assume that both D and \(\Psi \) are random. In the next section, we provide \(\Psi \)-conditional covariances that can be used with the generalized delta method to derive SEs that are appropriate when only D is considered to be random and \(\Psi \) is fixed.
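The generalized delta method itself is a simple covariance propagation; the following sketch applies it to a hypothetical three-dimensional \((\hat{\Psi },\hat{\theta })\) vector with a made-up joint covariance matrix (none of these numbers come from TMB output, and the function g is our own illustrative choice):

```python
import numpy as np

# Generalized delta method: for a differentiable g(psi, theta), the variance
# of g evaluated at the estimates is approximated by grad' Cov grad, where
# Cov is the joint covariance from eq. (4). Toy 3-dim example: two REs and
# one parameter; this covariance matrix is made up for illustration.
cov = np.array([[0.20, 0.05, 0.01],
                [0.05, 0.30, 0.02],
                [0.01, 0.02, 0.10]])

def g(v):
    # e.g., the log of a summed biomass-like quantity plus a parameter
    return np.log(np.exp(v[0]) + np.exp(v[1])) + v[2]

v_hat = np.array([1.0, 2.0, 0.5])            # point estimates (illustrative)
h = 1e-6
grad = np.array([(g(v_hat + h * e) - g(v_hat - h * e)) / (2 * h)
                 for e in np.eye(3)])        # central-difference gradient
se = np.sqrt(grad @ cov @ grad)              # delta-method standard error
print(se)
```

TMB computes the gradient by automatic differentiation rather than finite differences, but the propagation step is the same.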

2.2 Conditional Covariance and MSE

We consider the inferential setting where \(\Psi \) are randomly generated from the true model \(f(\Psi |\theta _o)\) only once, and then fixed in the subsequent generations of the data D from \(f(D\,|\,\Psi ,\theta _o)\) as the basis for frequentist inference. Throughout this paper, we use the subscript “\(_o\)” to denote the true value. The conditional covariance \(\mathrm {Cov}(\hat{\Psi }\,|\,\Psi )\) measures the variability of \(\hat{\Psi }\) when only D is re-sampled from \(f(D\,|\,\Psi ,\theta _o)\). We derive an approximation of \(\mathrm {Cov}(\hat{\Psi }\,|\,\Psi )\) using a first-order Taylor series expansion of \(\hat{\Psi }(\hat{\theta })\) about \(\hat{\theta }=\theta _o\), which gives

$$\begin{aligned} \hat{\Psi }(\hat{\theta }) = \hat{\Psi }(\theta _o) + \left. \dfrac{\partial \hat{\Psi }(\theta )}{\partial \theta ^{\top }}\right| _{\theta =\theta _o}(\hat{\theta }-\theta _o) + O_p(T^{-1}). \end{aligned}$$
(5)

The \(O_p(T^{-1})\) in Eq. (5) comes from the second- and higher-order expansion terms in \(\hat{\theta }-\theta _o\). We use the \(O(\cdot )\) and \(o(\cdot )\) notations in a matrix sense, such that they apply to each element of \((\cdot )\). Based on Eq. (5), we can show that

$$\begin{aligned} \mathrm {Cov}\{ \hat{\Psi }(\hat{\theta }) \,|\, \Psi \}&= \mathrm {Cov}\{ \hat{\Psi }(\theta _o)\,|\, \Psi \} + \dfrac{\partial \hat{\Psi }(\theta _o)}{\partial \theta _o^{\top }}\mathrm {Cov}(\hat{\theta }\,|\, \Psi ) \dfrac{\partial \hat{\Psi }^{\top }(\theta _o)}{\partial \theta _o}+ o(T^{-1}),\\ \mathrm {Cov}\{ \hat{\Psi }(\hat{\theta }), \hat{\theta } \,|\, \Psi \}&= \dfrac{\partial \hat{\Psi }(\theta _o)}{\partial \theta _o^{\top }}\mathrm {Cov}(\hat{\theta }\,|\, \Psi ) + o(T^{-1}), \end{aligned}$$

where \(\mathrm {Cov}\{ \hat{\Psi }(\hat{\theta }), \hat{\theta } \,|\, \Psi \}\) denotes the conditional covariance between vectors \(\hat{\Psi }(\hat{\theta })\) and \(\hat{\theta }\), and the approximation orders come from \(\mathrm {Cov}\{\hat{\theta },\hat{\Psi }(\theta _o)\,|\,\Psi \}=o(T^{-1})\), \(\mathrm {Cov}\{\hat{\Psi }(\theta _o),O_p(T^{-1})\,|\,\Psi \}=o(T^{-1})\) and \(\mathrm {Cov}\{\hat{\theta },O_p(T^{-1})\,|\,\Psi \}=o(T^{-1})\), which are proved in Appendix C. These results can be summarized in the following matrix form.

Theorem 1

The conditional covariance of RE predictors and parameter estimators is given by

$$\begin{aligned} \begin{aligned} \mathrm {Cov}\left\{ \left. \left[ \begin{array}{c} \hat{\Psi }(\hat{\theta }) \\ \hat{\theta } \\ \end{array} \right] \right| \Psi \right\}&= \left[ \begin{array}{cc} \mathrm {Cov}\lbrace \hat{\Psi }(\theta _o)\,|\, \Psi \rbrace &{} 0 \\ 0 &{} 0 \end{array} \right] \\&\quad + \left[ \begin{array}{c} \dfrac{\partial \hat{\Psi }(\theta _o)}{\partial \theta _o^{\top }} \\ I \end{array} \right] \mathrm {Cov}(\hat{\theta }\,|\,\Psi ) \left[ \begin{array}{cc} \dfrac{\partial \hat{\Psi }^{\top }(\theta _o)}{\partial \theta _o}&I \end{array} \right] +o(T^{-1}). \end{aligned} \end{aligned}$$
(6)

With this formula and the subsequent approximations, the generalized delta method can be used to evaluate the conditional covariance of the estimate of a differentiable function of \(\theta \) and \(\Psi \).

Define \(\widetilde{\Psi } = \Psi - \{\partial \hat{\Psi }(\hat{\theta })/\partial \hat{\theta }^{\top }\}\theta \), where \(\partial \hat{\Psi }(\hat{\theta })/\partial \hat{\theta }^{\top }\) is treated as a constant matrix. Also, let \(\tilde{l}_r(\widetilde{\Psi },\theta )\) be \(l_r(\Psi ,\theta )\) in (2) with the variables \((\Psi ,\theta )\) transformed to \((\widetilde{\Psi },\theta )\). For the conditional bias and covariance of the MMLE given \(\Psi \), in Appendix A we prove the following theorem.

Theorem 2

If the marginal distribution (1) can be well evaluated with the Laplace approximation, then the bias of MMLEs of \(\theta \) conditional on the REs \(\Psi \) is given by

$$\begin{aligned} \mathrm {E}( \hat{\theta } - \theta _o\,|\,\Psi )&= \mathcal {I}^{-1}\dfrac{\partial \tilde{l}_r(\widetilde{\Psi },\theta _o)}{\partial \theta _o}+O(T^{-1}), \end{aligned}$$
(7)

and the conditional covariance is given by

$$\begin{aligned} \begin{aligned} \mathrm {Cov}(\hat{\theta }\,|\,\Psi )&= \mathrm {Cov}(\hat{\theta }) - \mathrm {Cov}\left\{ \mathrm {E}(\hat{\theta } \,|\,\Psi ) \right\} + o(T^{-1})\\&=-\ddot{l}^{-1}(\theta _o) - \ddot{l}^{-1}(\theta _o) \, \widetilde{\mathcal {I}}_r\, \ddot{l}^{-1}(\theta _o) + o(T^{-1}), \end{aligned} \end{aligned}$$
(8)

where

$$\begin{aligned} \begin{aligned}&\dfrac{\partial \tilde{l}_r(\widetilde{\Psi },\theta _o)}{\partial \theta _o} = \dfrac{\partial l_r}{\partial \theta _o} + \dfrac{\partial \hat{\Psi }^{\top }(\hat{\theta })}{\partial \hat{\theta }}\dfrac{\partial l_r}{\partial \Psi },\\&\widetilde{\mathcal {I}}_r = -\dfrac{\partial ^2 l_r(\Psi ,\theta _o)}{\partial \theta _o\partial \theta _o^{\top }} - \dfrac{\partial ^2 l_r(\Psi ,\theta _o)}{\partial \theta _o\partial \Psi ^{\top }}\dfrac{\partial \hat{\Psi }(\hat{\theta })}{\partial \hat{\theta }^{\top }} - \dfrac{\partial \hat{\Psi }^{\top }(\hat{\theta })}{\partial \hat{\theta }}\dfrac{\partial ^2 l_r(\Psi ,\theta _o)}{\partial \Psi \partial \theta _o^{\top }}\\&\qquad - \dfrac{\partial \hat{\Psi }^{\top }(\hat{\theta })}{\partial \hat{\theta }}\dfrac{\partial ^2 l_r(\Psi ,\theta _o)}{\partial \Psi \partial \Psi ^{\top }}\dfrac{\partial \hat{\Psi }(\hat{\theta })}{\partial \hat{\theta }^{\top }}. \end{aligned} \end{aligned}$$

When the estimator \(\ddot{l}^{-1}(\hat{\theta }) \, \widetilde{\mathcal {I}}_r(\hat{\theta },\hat{\Psi })\, \ddot{l}^{-1}(\hat{\theta })\) for the second term in (8) is not positive definite, we recommend using its nearest positive definite matrix [14], as discussed in the paragraph following Eq. (A.6) in the Appendix. Note that \(\mathrm {Cov}(\hat{\theta })\) in Eq. (8) involves expectations with respect to the marginal distribution of the random response variables, namely the data D, while the outer \(\mathrm {Cov}\) in \(\mathrm {Cov}\lbrace \mathrm {E}(\hat{\theta } \,|\,\Psi ) \rbrace \) involves expectations with respect to the distribution of \(\Psi \). The conditional bias in (7) is of order \(O(T^{-1/2})\).

For evaluating \(\mathrm {Cov}\{ \hat{\Psi }(\hat{\theta })\,|\,\Psi \}\), following Theorem 1 and Eq. (B.5) in Appendix B, we have the following corollary.

Corollary 1

When the distribution of the REs given data is approximately MVN, then

$$\begin{aligned} \begin{aligned} \mathrm {Cov}\{ \hat{\Psi }(\hat{\theta })\,|\,\Psi \}&\approx -\ddot{l}_{j}(\Psi ,\theta _o)^{-1} + \ddot{l}_{j}(\Psi ,\theta _o)^{-1}\ddot{l}_r(\Psi ) \ddot{l}_{j}(\Psi ,\theta _o)^{-1}\\&\quad + \dfrac{\partial \hat{\Psi }(\theta _o)}{\partial \theta _o^{\top }}\mathrm {Cov}(\hat{\theta }\,|\,\Psi ) \dfrac{\partial \hat{\Psi }^{\top }(\theta _o)}{\partial \theta _o}. \end{aligned} \end{aligned}$$

If REs are also MVN, then

$$\begin{aligned} \begin{aligned} \mathrm {Cov}\{ \hat{\Psi }(\hat{\theta })\,|\,\Psi \}&\approx -\ddot{l}_{j}(\Psi ,\theta _o)^{-1} - \ddot{l}_{j}(\Psi ,\theta _o)^{-1}\Sigma ^{-1} \ddot{l}_{j}(\Psi ,\theta _o)^{-1}\\&\quad + \dfrac{\partial \hat{\Psi }(\theta _o)}{\partial \theta _o^{\top }}\mathrm {Cov}(\hat{\theta }\,|\,\Psi ) \dfrac{\partial \hat{\Psi }^{\top }(\theta _o)}{\partial \theta _o}. \end{aligned} \end{aligned}$$
(9)

Here, if the REs are MVN, then \(\Sigma ^{-1}=-\ddot{l}_r(\Psi )\). Note that \(\ddot{l}_{j}(\Psi ,\theta _o)^{-1}\Sigma ^{-1} \ddot{l}_{j}(\Psi ,\theta _o)^{-1}\) will be a positive definite matrix, so the diagonal variances of \(\mathrm {Cov}\lbrace \hat{\Psi }(\hat{\theta })\,|\,\Psi \rbrace \) will be smaller than the diagonal variances of \(\mathrm {Cov}\lbrace \hat{\Psi }(\hat{\theta })-\Psi \rbrace \) when \(\Psi \) is random (i.e., compare Eqs. 9 and 4). This is intuitive, since fixing \(\Psi \) removes a source of variability. However, the difference will be small when the data are highly informative about the REs (i.e., \(\ddot{l}_{j}(\hat{\Psi },\hat{\theta })^{-1}\rightarrow 0\) in some sense, e.g., Fahrmeir and Kaufmann [7]), and estimates of these effects are then statistically like fixed-effects parameters.

When \(\Psi \) are actually fixed effects that are treated as random for smoothing purposes, \(\hat{\Psi }\) is a biased estimator of \(\Psi \). For this bias, we have the following evaluation.

Theorem 3

If the conditional distribution of \(\Psi \) given data D is approximately MVN, then

$$\begin{aligned} \begin{aligned} \mathrm {E}\{ \hat{\Psi }(\hat{\theta })\,|\,\Psi \} - \Psi&= -\ddot{l}_{j}(\Psi ,\theta _o)^{-1} \ddot{l}_r(\Psi ) [\Psi -\mathrm {E}\{\Psi \}] + O(T^{-1/2}). \end{aligned} \end{aligned}$$

If \(\Psi \) is also MVN with covariance matrix \(\Sigma \), then

$$\begin{aligned} \begin{aligned} \mathrm {E}\{ \hat{\Psi }(\hat{\theta })\,|\,\Psi \} - \Psi&= \ddot{l}_{j}(\Psi ,\theta _o)^{-1} \Sigma ^{-1} [\Psi -\mathrm {E}\{\Psi \}] + O(T^{-1/2}). \end{aligned} \end{aligned}$$
(10)

Here the leading term for \(\Psi _i\), the ith element of \(\Psi \), is of order \(O(1/(n_i+1))\), with \(n_i\) being the sample size associated with \(\Psi _i\). This theorem can be proved easily using Eqs. (5), (A.8) and (B.2) in the Appendices.
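Theorem 3 can be checked by Monte Carlo in the simplest normal-normal case. The sketch below (our own one-dimensional toy model, with \(\theta \) fixed at its true value so that only the leading shrinkage term of Eq. (10) appears) re-samples the data with \(\Psi \) held fixed and compares the empirical bias of the posterior-mode predictor with the theoretical leading term:

```python
import numpy as np

rng = np.random.default_rng(3)
sig_psi, sig_eps, n = 1.0, 0.5, 5
psi = 1.3                       # one fixed draw of the RE (conditional setting)

# Posterior-mode predictor of psi with theta known: a shrinkage estimator
# with posterior precision n/sig_eps^2 + 1/sig_psi^2.
prec_post = n / sig_eps**2 + 1 / sig_psi**2

# Monte Carlo: re-sample only the data D, keeping psi fixed.
y = rng.normal(psi, sig_eps, size=(200_000, n))
psi_hat = (y.sum(axis=1) / sig_eps**2) / prec_post
mc_bias = psi_hat.mean() - psi

# Leading term of eq. (10): l_j''^{-1} Sigma^{-1} (psi - E{psi}), with E{psi}=0.
theory_bias = -(1 / sig_psi**2) / prec_post * psi
print(mc_bias, theory_bias)
```

The empirical bias matches the theoretical shrinkage toward the prior mean; the bias shrinks at rate \(O(1/(n+1))\) as the per-RE sample size grows, consistent with the remark above.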

Theorem 2 says that, conditional on \(\Psi \), \(\hat{\theta }\) is also a biased estimator of \(\theta _o\), with a bias of order \(O(T^{-1/2})\). Let \(\Omega =(\Psi ^{\top },\theta _o^{\top })^{\top }\) and \(\hat{\Omega }=(\hat{\Psi }(\hat{\theta })^{\top },\hat{\theta }^{\top })^{\top }\). Based on the results in this section, in Appendix D we prove the following corollary.

Corollary 2

If \(\Psi \) is MVN with covariance matrix \(\Sigma \) and the conditional distribution of \(\Psi \) given data D is approximately MVN, then

$$\begin{aligned} \begin{aligned} \mathrm {MSE}( \hat{\Omega }\,|\,\Psi )&= \mathrm {Cov}( \hat{\Omega }\,|\,\Psi ) + \mathrm {E}( \hat{\Omega }-\Omega \,|\,\Psi ) \mathrm {E}( \hat{\Omega }-\Omega \,|\,\Psi )^{\top }\\&\approx \mathrm {Cov}( \hat{\Omega }\,|\,\Psi ) + \mathrm {E}\left\{ \mathrm {E}(\hat{\Omega }-\Omega \,|\,\Psi )\, \mathrm {E}(\hat{\Omega }-\Omega \,|\,\Psi )^{\top } \right\} \\&= \left[ \begin{array}{cc} -\ddot{l}_j^{-1} + \dfrac{\partial \hat{\Psi }(\hat{\theta })}{\partial \hat{\theta }^{\top }} \mathrm {Cov}( \hat{\theta } ) \dfrac{\partial \hat{\Psi }^{\top }(\hat{\theta })}{\partial \hat{\theta }} &{} \dfrac{\partial \hat{\Psi }(\hat{\theta })}{\partial \hat{\theta }^{\top }} \mathrm {Cov}( \hat{\theta } ) \\ \mathrm {Cov}( \hat{\theta } ) \dfrac{\partial \hat{\Psi }^{\top }(\hat{\theta })}{\partial \hat{\theta }} &{} \mathrm {Cov}( \hat{\theta } ) \end{array} \right] +o(1/T). \end{aligned} \end{aligned}$$
(11)

Equation (11) is equal to Eq. (4), namely the unconditional MSE of \(\hat{\Omega }\), \(\mathrm {MSE}( \hat{\Omega } )\). Here, because \(\Psi \) is unknown and can only be estimated with a bias of order O(1) as indicated by Eq. (10), we use \(\mathrm {E}\lbrace \mathrm {E}(\hat{\Omega }-\Omega \,|\,\Psi )\, \mathrm {E}(\hat{\Omega }-\Omega \,|\,\Psi )^{\top } \rbrace \) to give an overall estimate of the squared conditional bias \(\mathrm {E}( \hat{\Omega }-\Omega \,|\,\Psi ) \mathrm {E}( \hat{\Omega }-\Omega \,|\,\Psi )^{\top }\).

2.3 Semiparametric Regression Example

As a partial validation and application of the theoretical results in Sect. 2.2, we consider the Gaussian process semiparametric regression studied in He and Severini [13],

$$\begin{aligned} Y_i&= x_i^{\top }\beta + \gamma (z_i) + \epsilon _i, i=1,\ldots ,n, \end{aligned}$$
(12)

where \(x_1,\ldots ,x_n\) are \(p\times 1\) covariate vectors, \(\epsilon _1,\ldots ,\epsilon _n\) are unobserved independent normal random variables, each with mean 0 and standard deviation \(\sigma >0\), \(\beta \) is a \(p\times 1\) vector of unknown regression parameters, \(z_1,\ldots ,z_n\) are observed constants taking values in a set \(\mathcal {Z}\), and \(\gamma \) is an unknown real-valued function on \(\mathcal {Z}\). He and Severini [13] further denoted \(Y=(Y_1,\ldots ,Y_n)^{\top }\), \(X=(x_1^{\top },\ldots ,x_n^{\top })^{\top }\), \(\epsilon =(\epsilon _1,\ldots ,\epsilon _n)^{\top }\), and \(g_{\gamma } = (\gamma (z_1),\ldots ,\gamma (z_n))^{\top }\), and wrote the model as \(Y=X\beta + g_{\gamma } + \epsilon \). The covariance matrix of \(\epsilon \) is denoted as \(\varOmega _{\phi }\) and assumed to have a parametric form with parameter \(\phi \). The regression coefficients \(\beta \) are of primary interest, and \(g_{\gamma }\) are nuisance effects. Although \(g_{\gamma }\) is actually fixed, He and Severini [13] technically treated it as a mean-zero Gaussian process with \(n\times n\) covariance matrix \(\varSigma _{\lambda }\), parameterized by \(\lambda \), so that \(g_{\gamma }\) can be integrated out to obtain the marginal likelihood.

2.4 Random Walk Simulation Example

We also illustrate Eqs. (8), (9) and (11) using a simple random-walk example. The random walk is \(\Psi _t|\Psi _{t-1} {\mathop {\sim }\limits ^{indep}} N(\Psi _{t-1},\sigma ^{2}_{\Psi })\) for \(t= 2,\ldots ,T\), and \(\Psi _{1} = \beta \) is an unknown parameter to estimate. Here \(N(\mu ,\sigma ^2)\) denotes the normal distribution with mean \(\mu \) and variance \(\sigma ^2\). At each time-step, there are n independent observations of the process, \(Y_{t,i}|\Psi _t {\mathop {\sim }\limits ^{i.i.d}} N(\Psi _t,\sigma ^{2}_{\epsilon }), i=1,\ldots ,n\) and \(t=1,\ldots ,T\). The parameters are \(\theta = (\beta , \sigma _{\Psi },\sigma _{\epsilon })^{\top }\) and the REs are \(\Psi = (\Psi _{2},\ldots ,\Psi _{T})^{\top }\), a \((T-1) \times 1\) vector. This process can be regarded as a specific realization of the Gaussian process semiparametric regression described in the previous section, with regression parameter \(\beta \), \(\Psi = \gamma \), and \(\sigma _{\epsilon }\) and \(\sigma _{\Psi }\) corresponding to the dispersion parameters \(\phi \) and \(\lambda \), respectively. In Sect. 3.1, we show that \(\mathrm {Cov}\lbrace \hat{\beta } \,|\, \Psi \rbrace \) can be correctly evaluated by Eq. (8), and that the RE predictor obtained by maximizing the joint loglikelihood is the best linear predictor (BLP; e.g., Robinson [30]) used in He and Severini [13]. In this example, we demonstrate the statistical properties of \(\hat{\Psi }\) and the parameter estimators using a simulation study.

We generated response data from the random-walk model with \(\beta =0\), \(\sigma _{\Psi } = 1\), \(\sigma _{\epsilon } = 0.5\), and two choices each for n (2 and 5) and T (50 and 200).
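A minimal generator for one dataset from this random-walk model (our own sketch; the model fitting itself would be done with TMB or ADMB) is:

```python
import numpy as np

# Generate one dataset from the random-walk model of this section:
# Psi_1 = beta; Psi_t | Psi_{t-1} ~ N(Psi_{t-1}, sigma_psi^2), t = 2..T;
# Y_{t,i} | Psi_t ~ N(Psi_t, sigma_eps^2), i = 1..n.
def simulate(beta=0.0, sigma_psi=1.0, sigma_eps=0.5, n=2, T=50, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    steps = rng.normal(0.0, sigma_psi, T - 1)
    psi = beta + np.concatenate([[0.0], np.cumsum(steps)])   # latent states
    y = rng.normal(psi[:, None], sigma_eps, size=(T, n))     # observations
    return psi, y

# In the conditional setting, psi is drawn once and held fixed while y is
# re-generated across simulation runs; in the marginal setting, both are drawn.
psi, y = simulate(rng=np.random.default_rng(11))
print(psi.shape, y.shape)
```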

2.5 Stock Assessment Example

The Schaefer form of the state-space surplus production model (SPM, e.g., Meyer and Millar [24]) gives latent total stock biomass in year t (i.e., \(B_t \ge 0\)) as a function of the biomass in the previous year plus production (births\(+\)growth−natural deaths) minus the fishery catch (\(C_t\), tonnes), with production modeled as a quadratic function of biomass, that is,

$$\begin{aligned} B_t = B_{t-1} + rB_{t-1}(1-B_{t-1}/K) - C_{t-1}, \end{aligned}$$
(13)

where the parameter r controls the intrinsic rate of biomass increase at low population size and K is the carrying capacity. We assume that there is measurement error (ME) in catches and include a catch model, \(C_t = H_t B_t\), where \(H_t \ge 0\) is based on a random walk. The stochastic population dynamics model is

$$\begin{aligned} \begin{aligned} \ln (B_t)&= \ln \lbrace B_{t-1} + rB_{t-1}(1-B_{t-1}/K) - H_{t-1}B_{t-1} \rbrace + \delta _{Bt},\\ \ln (H_t)&= \ln (H_{t-1}) + \delta _{Ht}, \end{aligned} \end{aligned}$$
(14)

where \(t=1,\ldots ,T\), \(\delta _{H1},\ldots ,\delta _{HT} {\mathop {\sim }\limits ^{iid}} N(0,\sigma ^{2}_{\delta H})\), and \(\delta _{B1},\ldots ,\delta _{BT} {\mathop {\sim }\limits ^{iid}} N(0,\sigma ^{2}_{\delta B})\).
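The process model (14) can be forward-simulated directly; the following sketch uses illustrative parameter values of our own choosing, not the fitted values reported for this stock:

```python
import numpy as np

# Forward-simulate the stochastic surplus production dynamics of eq. (14):
# log-biomass and log-harvest-rate each receive iid normal process errors.
def simulate_spm(B0, H0, r, K, sig_B, sig_H, T, rng):
    B, H = np.empty(T + 1), np.empty(T + 1)
    B[0], H[0] = B0, H0
    for t in range(1, T + 1):
        H[t] = np.exp(np.log(H[t - 1]) + rng.normal(0.0, sig_H))
        prod = B[t - 1] + r * B[t - 1] * (1.0 - B[t - 1] / K)
        B[t] = np.exp(np.log(prod - H[t - 1] * B[t - 1]) + rng.normal(0.0, sig_B))
    return B, H

rng = np.random.default_rng(5)
B, H = simulate_spm(B0=50.0, H0=0.05, r=0.4, K=100.0,
                    sig_B=0.05, sig_H=0.1, T=59, rng=rng)
print(B[-1], H[-1])
```

Note that a simulated trajectory with a large harvest rate can drive the argument of the log negative (stock extinction), which is exactly the problem with marginal-setting simulations discussed in the Introduction and in Sect. 2.5.1.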

We apply this model to data for a flatfish species off the east coast of Canada. The available data include annual estimates of total fishery catches of American plaice in Northwest Atlantic Fisheries Organization Subdiv. 3Ps during 1960–2019 (see Table 2 and Fig. 1 in Morgan et al. [25]). Another common data source is a time-series of average catch from research surveys, commonly referred to as stock size indices. The American plaice assessment uses indices derived from stratified random surveys since 1980. Our state-space model observation equations for the time-series of survey indices (I) and the catch observations (\(C_{ot}\)) are:

$$\begin{aligned} \begin{aligned} \ln (C_{ot})&= \ln (H_t) + \ln (B_t) + \epsilon _{Ct},\\ \ln (I_t)&= \left\{ \begin{array}{cl} \ln (q_E)+\ln (B_t) + \epsilon _{It} &{} \mathrm {if \, year}\, \le 1995 \\ \ln (q_C)+\ln (B_t) + \epsilon _{It} &{} \mathrm {if \, year}\, > 1995. \end{array} \right. \end{aligned} \end{aligned}$$
(15)

SPM times \(t=0,\ldots ,T\) correspond to years 1960–2019. Both \(q_E\) and \(q_C\) are survey catchability parameters to estimate. These differ because there was a major change in survey gears and stratification in Subdiv. 3Ps beginning with the 1996 survey. The MEs \(\epsilon _{It} {\mathop {\sim }\limits ^{iid}} N(0,\sigma ^{2}_{\epsilon I})\) and \(\epsilon _{Ct} {\mathop {\sim }\limits ^{iid}} N(0,\sigma ^{2}_{\epsilon C})\). We assume that \(\sigma _{\epsilon C} = 0.1\) and do not estimate this parameter, so that the model fits the catches closely, consistent with Morgan et al. [25]. The total set of parameters to estimate for the state-space model is \(H_0, r, K, \sigma _{\delta H}, \sigma _{\delta B}, q_E, q_C\), and \(\sigma _{\epsilon I}\), along with RE predictions for \(B_t\) and \(H_t, t=1961,\ldots ,2019\), and \(B_0\) for 1960. We also assume that the initial biomass is random,

$$\ln (B_0/K) \sim N\left\{ \ln (0.75),0.125^2\right\} ,$$

which is broadly similar to the prior for \(B_0\) in Morgan et al. [25]. More details about the development of this model are provided in the Supplementary Information.

2.5.1 SPM Simulations

The SPM is much slower to fit than the random walk model in Sect. 2.4, so we only generated 250 datasets conditional on a random value of \(\Psi \) drawn from \(f(\Psi \,|\,\theta )\); we repeated this procedure with 250 randomly generated \(\Psi \)’s from \(f(\Psi \,|\,\theta )\) to examine the average behavior over different \(\Psi \)’s, for a total of 62,500 simulations. Also, as mentioned in the Introduction, this procedure will often generate datasets that are unrealistically different from the observed data. In fact, some values of \(\Psi \) even result in stock extinction and model estimation errors. In particular, the random walk standard deviation for \(\delta _{Ht}\) (i.e., \(\sigma _{\delta H}\) in Table 2) is large and produces many unrealistic simulated harvest rate series. To avoid these problems, we fixed the \(\delta _{Ht}\) REs at their predicted values in the simulations and only generated random \(\Psi \)’s for the model process errors (i.e., \(\delta _{Bt}\) in Eq. 14).
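The two-level simulation design can be sketched generically. In the sketch below, a Gaussian random walk stands in for \(f(\Psi \,|\,\theta )\) and Gaussian MEs stand in for \(f(D\,|\,\Psi ,\theta )\); the dimensions and variances are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

n_psi, n_data, T = 250, 250, 60    # 250 Psi draws x 250 datasets each
sigma_psi, sigma_obs = 0.2, 0.1    # illustrative process / ME scales

datasets = []
for _ in range(n_psi):
    # Outer level: draw one realization of the REs from f(Psi | theta);
    # here a Gaussian random walk stands in for the process model.
    psi = np.cumsum(rng.normal(0.0, sigma_psi, T))
    # Inner level: generate n_data datasets conditional on this fixed Psi,
    # i.e., from f(D | Psi, theta).
    y = psi + rng.normal(0.0, sigma_obs, (n_data, T))
    datasets.append(y)

total = sum(d.shape[0] for d in datasets)
print(total)  # 250 x 250 = 62500 simulated datasets
```

Fixing a subset of the REs (as done for \(\delta _{Ht}\)) amounts to reusing the same draw for that subset in every outer iteration.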

In many of the simulations, the estimates of the process error variance (\(\sigma ^{2}_{\delta B}\)) hit a very small lower bound indicating that process errors were not needed to fit the simulation data well. This resulted in long simulation run times and problems computing \(V_f\) and \(V_r\). To avoid these problems, we fixed \(\sigma ^{2}_{\delta B}\) at the value in Table 2.

3 Results

3.1 Semiparametric Regression Example

The MMLE of \(\beta \) is

$$\begin{aligned} \hat{\beta }&= \left( X^{\top }V(\theta )^{-1}X \right) ^{-1} X^{\top }V(\theta )^{-1}Y, \end{aligned}$$
(16)

which agrees with the estimate of He and Severini [13] obtained using generalized least-squares. Here \(V(\theta )=\varOmega _{\phi }+\varSigma _{\lambda }\) and \(\theta =(\phi ,\lambda )^{\top }\). The predictor of \(g_{\gamma }\) obtained by maximizing the joint likelihood is

$$\begin{aligned} \begin{aligned} \hat{g}_{\gamma }&= \varSigma _{\hat{\lambda }} V(\hat{\theta })^{-1}\left( Y-X\hat{\beta } \right) , \end{aligned} \end{aligned}$$

which is the same as the BLP of \(g_{\gamma }\) in He and Severini [13]. Applying Theorem 2, we obtain

$$\begin{aligned} \begin{aligned} \mathrm {Cov}( \hat{\beta } \,|\, \gamma )&= \mathcal {I}^{-1}X^{\top }V(\theta _o)^{-1}\varOmega _{\phi _o}V(\theta _o)^{-1}X\mathcal {I}^{-1}, \end{aligned} \end{aligned}$$
(17)

which is the same as the result in Theorem 4.2 of He and Severini [13]. Furthermore, Theorem 2 gives a bias \(\mathrm {E}(\hat{\beta }\,|\,\gamma )-\beta _o=(X^{\top }V(\theta _o)^{-1}X)^{-1}X^{\top }V(\theta _o)^{-1}g_{\gamma }\), which is also consistent with the results of He and Severini [13]. The detailed derivations of all these results are provided in Appendix E.
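As a numerical check of Eq. (16) and the joint-likelihood predictor, the following sketch builds a toy version of the model. The covariance components are placeholders: an exponential covariance for the Gaussian process \(\varSigma _{\lambda }\) and a diagonal ME covariance \(\varOmega _{\phi }\) are assumptions made here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
T, p = 50, 2

X = np.column_stack([np.ones(T), rng.normal(size=T)])  # design matrix
t = np.linspace(0.0, 1.0, T)

# Placeholder covariance components: Omega_phi (ME covariance) and
# Sigma_lambda (Gaussian-process covariance for g).
Omega = 0.1 * np.eye(T)
Sigma = np.exp(-np.abs(t[:, None] - t[None, :]) / 0.2)
V = Omega + Sigma                       # V(theta) = Omega_phi + Sigma_lambda

beta_true = np.array([1.0, -0.5])
g = np.linalg.cholesky(Sigma + 1e-9 * np.eye(T)) @ rng.normal(size=T)
Y = X @ beta_true + g + rng.multivariate_normal(np.zeros(T), Omega)

Vi = np.linalg.inv(V)
I_mat = X.T @ Vi @ X                    # X' V^{-1} X

# Eq. (16): generalized least-squares / MMLE estimate of beta.
beta_hat = np.linalg.solve(I_mat, X.T @ Vi @ Y)

# Joint-likelihood predictor of g (the BLP).
g_hat = Sigma @ Vi @ (Y - X @ beta_hat)

print(beta_hat.shape, g_hat.shape)
```

The conditional covariance (17) and the conditional bias follow by substituting these matrices into the sandwich and bias formulas above.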

3.2 Random Walk Simulation Example

The data from an arbitrary simulation with \(n=2\) and \(T=50\) are illustrated in Fig. 1, along with predictions of \(\Psi _t\) and 95% confidence intervals (CIs) based on the conditional-\(\Psi \) MSE in Eq. (11) and on the conditional-\(\Psi \) variance in Eq. (9), which we denote as Vc. The conditional-\(\Psi \) MSE (11) is equal to the random-\(\Psi \) MSE (4) that TMB provides; we therefore denote the conditional-\(\Psi \) MSE as Vr. The Vr-based (i.e., TMB) CIs cover the true values of \(\Psi \) in \(92\%\) of the years, which is close to the nominal \(95\%\) coverage. The Vc-based CIs cover in \(86\%\) of the years. However, this is based on only one simulated set of y’s. We therefore repeated the simulation 1000 times conditional on the true \(\Psi _t\) values in Fig. 1; that is, we generated 1000 datasets from \(f(D|\Psi ,\theta )\).
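For the linear-Gaussian random-walk case, the conditional-mean predictor and the Vc-style coverage check have closed forms, so this conditional simulation can be sketched directly. The parameter values below are illustrative, not those used in the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
T, n = 50, 2
sigma_d, sigma_e = 0.3, 0.5   # illustrative process / ME scales

# Random-walk prior covariance: Cov(Psi_s, Psi_t) = sigma_d^2 * min(s, t).
s = np.arange(1, T + 1)
Sig = sigma_d**2 * np.minimum.outer(s, s)
Sig_inv = np.linalg.inv(Sig)

psi = np.cumsum(rng.normal(0.0, sigma_d, T))   # one fixed true Psi

# Conditional-mean predictor is linear: posterior covariance is a
# Vc analogue, and psi_hat = post_cov * (n / sigma_e^2) * ybar.
post_cov = np.linalg.inv(Sig_inv + (n / sigma_e**2) * np.eye(T))
se = np.sqrt(np.diag(post_cov))

nsim, hits = 1000, np.zeros(T)
for _ in range(nsim):
    ybar = psi + rng.normal(0.0, sigma_e / np.sqrt(n), T)  # mean of n reps
    psi_hat = post_cov @ ((n / sigma_e**2) * ybar)
    hits += (np.abs(psi_hat - psi) <= 1.96 * se)

coverage = hits / nsim   # per-time conditional coverage of Vc-style CIs
print(round(coverage.mean(), 3))
```

Because the coverage here is conditional on a single \(\Psi \), it varies with the particular draw of \(\Psi \), mirroring the behavior described for Fig. 1.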

Fig. 1
figure 1

Simulated y data (points) and predictions of \(\Psi _t\) (heavy solid line) for the random-walk example. The grey line indicates the true values of \(\Psi _t\) used to generate the responses. The white shaded region indicates \(95\%\) confidence intervals based on Vc (Eq. 9), which almost cover the blue shaded region that indicates \(95\%\) confidence intervals based on Vr (Eq. 11)

We computed the average of the 1000 \(\hat{\Psi }_t\)’s at each time point; these are shown in the top panel of Fig. 2. The \(\hat{\Psi }_t\)’s are nearly unbiased but a little smoother than the true \(\Psi _t\)’s, so that the average \(\hat{\Psi }_t\)’s do not exactly match the peaks and valleys of the \(\Psi _t\)’s. This is typical of smoothing estimators. The biases are shown in the middle panel of Fig. 2. The simulation average of the estimated bias using the approximation in Eq. (10) is reasonably accurate. Note that the grey points in this panel are based on the true value of \(\Psi \) in the leading term of Eq. (10) and are usually very close to the actual simulated bias. The bias estimates using the \(\hat{\Psi }_t\)’s (heavy black lines) differ more from the simulated bias. This demonstrates that plug-in estimates of the bias may themselves be biased, since the bias is of order O(1) in this case. The bottom panel of Fig. 2 demonstrates that the Vc estimates from Eq. (9) are a little low for this example, while the MSE estimates from Eq. (11) give the average level of the simulation-based MSEs, which vary substantially across time because Eq. (11) was derived by taking the expectation of the squared bias over \(\Psi \). The MSE estimates are also a little low on average.
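The accuracy of the leading term of the bias approximation can be verified exactly in the linear-Gaussian case, where \(\ddot{l}_{j} = -(\Sigma ^{-1} + (n/\sigma ^{2}_{\epsilon })I)\) and the conditional bias of the linear predictor is available in closed form. The sketch below uses illustrative values:

```python
import numpy as np

rng = np.random.default_rng(4)
T, n = 50, 2
sigma_d, sigma_e = 0.3, 0.5

s = np.arange(1, T + 1)
Sig = sigma_d**2 * np.minimum.outer(s, s)   # random-walk prior covariance
Sig_inv = np.linalg.inv(Sig)
psi = np.cumsum(rng.normal(0.0, sigma_d, T))

# For a Gaussian model the joint log-likelihood Hessian in Psi is
# l''_j = -(Sig^{-1} + (n / sigma_e^2) I), so the leading bias term
# l''_j^{-1} Sig^{-1} psi equals the conditional bias of psi_hat.
H = -(Sig_inv + (n / sigma_e**2) * np.eye(T))
bias_formula = np.linalg.solve(H, Sig_inv @ psi)

# Exact conditional bias of the linear conditional-mean predictor:
# E(psi_hat | psi) - psi = -post_cov @ Sig^{-1} @ psi.
post_cov = np.linalg.inv(Sig_inv + (n / sigma_e**2) * np.eye(T))
bias_exact = -post_cov @ (Sig_inv @ psi)

print(np.allclose(bias_formula, bias_exact))  # True
```

In nonlinear models the identity no longer holds exactly, which is why plugging \(\hat{\Psi }\) into the leading term can itself introduce bias.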

Fig. 2
figure 2

Top panel: True \(\Psi _t\)’s (grey) and simulation average predictions (black). Middle panel: Grey lines indicate the bias in \(\hat{\Psi }_t\), the black lines indicate the bias estimates from Eq. (10) using \(\hat{\Psi }\) for \(\Psi \), and the dark grey points indicate the bias based on \(\ddot{l}_{j}(\hat{\Psi },\hat{\theta })^{-1} \hat{\Sigma }^{-1} \Psi \). Bottom panel: The red and blue lines indicate the simulation standard deviation and root mean squared error of \(\hat{\Psi }_t\), respectively. The grey lines are simulation average standard errors based on Vr (Eq. 11) and the black lines are based on Vc (Eq. 9). Horizontal dashed lines indicate the time-series average

We conducted many more simulations and the patterns were similar to Fig. 2, although the results depended on the specific \(\Psi _t\)’s in each simulation. We therefore averaged results (see Fig. 3) over 500 randomly generated sets of random-walk \(\Psi _t\)’s, each with 500 simulated datasets conditional on the \(\Psi _t\)’s, for different choices of n and T. The results demonstrate that when T is large the variance estimator (9) is reliable on average, and the MSE estimator (11), which is also the TMB estimator (4), represents an average level of the MSE for RE predictions. However, the 90% quantile ranges in Fig. 3 demonstrate that for specific values of \(\Psi \) the Vc-based estimates of the SEs of \(\hat{\Psi }\) and the Vr-based estimates of the RMSEs can differ considerably from the simulation-based results. The quantiles and means are computed across the 500 \(\Psi \) cases: for each \(\Psi \), as in the lower panel of Fig. 2, we computed the averages of the 500 Vc- and Vr-based SEs over the 500 simulated datasets, along with the sample standard deviations and RMSEs of \(\hat{\Psi }\).

Fig. 3
figure 3

The thick red and blue lines indicate the average (i.e., over 500 \(\Psi \)’s) simulation standard deviations (i.e., based on 500 data sets for each \(\Psi \); columns 1 and 2) and root mean squared errors (columns 3 and 4) of \(\hat{\Psi }_t\), respectively. The thick grey lines are simulation average standard errors based on Vr (Eq. 11) and the thick black lines are based on Vc (Eq. 9). Color corresponding thin lines indicate medians among the 500 \(\Psi \) cases, and shaded regions indicate 5% and 95% quantiles. T is the number of time points and n is the number of samples at each time

Fig. 4
figure 4

The thick grey lines are average (i.e., over 500 \(\Psi \)’s) simulation coverage (i.e., based on 500 datasets for each \(\Psi \)) of \(95\%\) Vr-based (i.e., RMSE) CIs, and the thick black lines are for Vc-based (i.e., SE) CI coverage probabilities. The horizontal dashed line indicates \(95\%\). Color corresponding thin lines indicate medians among the 500 \(\Psi \) cases, and shaded regions indicate 5% and 95% quantiles. T is the number of time points and n is the number of samples at each time

We also investigated the coverage properties of \(95\%\) confidence intervals (CIs) for \(\Psi \) based on a normal distribution assumption for \(\hat{\Psi }_t\), with Vr or Vc estimates of the variance. The simulation average coverage of the Vr CIs (see Fig. 4) was close to the \(95\%\) nominal level, whereas the probability that the Vc CIs contained \(\Psi \) was somewhat lower than \(95\%\) even when \(T=200\) and \(n=5\). The \(\Psi \)-conditional bias in \(\hat{\Psi }\) contributes to the reduced coverage of the Vc-based CIs. However, for specific values of \(\Psi \) the Vr-based CIs could also differ somewhat from the nominal 95% level, with simulated coverages less than 95% for approximately 50% of the random values of \(\Psi \). Hence, our results indicate that Vr-based CIs are somewhat inaccurate and not simply conservative, and Vc-based CIs are less accurate than Vr-based CIs, which is expected because of the conditional bias in \(\hat{\Psi }\).

The GAM (generalized additive model) literature suggests that GAM confidence intervals are accurate when averaged over the smoothing covariate (e.g., [21, 36]), which is time in the random-walk example. For each fixed \(\Psi \), we calculated the average CI coverage over time points and the 500 simulated datasets. The Vr-based CI coverage was accurate and, in the worst case (\(n=2\) and \(T=50\)), ranged from 0.93 to 0.95 over the 500 \(\Psi \) cases. The coverages were closer to 0.95 for the other choices of n and T. The Vc-based CI coverage was less accurate and, in the worst case (\(n=2\) and \(T=50\)), ranged from 0.88 to 0.93.

The simulation standard deviations (SDs) and RMSEs for the random-walk parameter estimates, along with the averages of the Vr-based and Vc-based SEs, are shown in Table 1. Keeping in mind the simulation approximation errors, the Vc-based SEs are reasonably accurate estimates of the simulation SDs, and the Vr-based SEs are reasonably accurate estimates of the RMSEs. These results were more accurate for the variance of \(\hat{\beta }\).

3.3 Stock Assessment Example

Our SPM parameter estimates and stock assessment results (Table 2) are broadly similar to those of Morgan et al. [25]. Here the Vc- and Vr-based standard errors are shown as coefficients of variation, CVc and CVr, respectively. The CVc’s are usually substantially smaller than the CVr’s, but because of the conditional bias in the parameter estimators and RE predictors, the CVr’s provide more reliable confidence intervals. This is what we found in the simulations in Sect. 3.2, and it is also the case for the SPM simulation results we provide later in this section. More real-data analyses are provided in the Supplementary Information.

Table 1 Simulation standard deviations (SD) and root mean squared errors (RMSE) for the random-walk model parameter estimates, along with the averages of the Vr- and Vc-based SEs
Table 2 Estimates (Est) of parameters (Par) and some important derived assessment results (SAR) from the surplus production model for American plaice in NAFO Subdivision 3Ps

3.3.1 SPM Simulations

Simulation assessment results (Fig. 5; top panels) demonstrate that the Vc-based SEs were accurate for the \(\Psi \)-averaged simulation standard deviations in most years. The Vr-based SEs were also usually accurate for the \(\Psi \)-averaged simulation RMSEs, although they were slight over-estimates in the first half of the assessment time series for biomass and harvest rates. However, especially for RMSE, for some values of \(\Psi \) the Vr SEs were substantially different from the simulation RMSE values, as indicated by the much wider blue shaded regions compared to the grey regions. The thin blue lines indicate the median RMSEs across the 250 random sets of \(\Psi \), and they show that the Vr SEs were larger than the RMSEs in slightly more than 50% of the \(\Psi \) cases. The simulation coverage probabilities (CPs) of 95% confidence intervals (CIs) based on the Vr SEs (middle panels) were much more accurate than those based on the Vc SEs. When averaged over the 250 random \(\Psi \)’s, the Vr CIs had CPs close to 0.95. However, the CIs were too wide for more than half of the sets of \(\Psi \)’s, and for a small number of \(\Psi \)’s the CIs were inaccurate, with CPs \(\ll 0.95\). Hence, we conclude that Vr CIs are usually, but not always, conservative. For some \(\Psi \sim f(\Psi |\theta )\) the Vr CIs may be unreliable. It would be practically useful to have some indication of when this problem occurs, which is a useful area for future research. Vf-based CIs were unreliable because they do not account for the assessment model estimation bias, which can be large relative to the SEs (bottom panels).

Fig. 5
figure 5

Simulation average (i.e., based on 250 datasets for each \(\Psi \)) results for the surplus production model. All panels: Thick lines are averages across the 250 \(\Psi \)’s. Color corresponding thin lines indicate medians (i.e., over 250 \(\Psi \)’s) of the simulation averages and shaded regions indicate 5% and 95% quantiles. Top panels: Standard deviations (red) and root mean squared error (RMSE, blue) of simulated biomass and harvest rate values. Vr- and Vc-based standard errors are in grey and black, respectively. Middle panels: Coverage of \(95\%\) Vr-based (grey) and Vc-based (black) confidence intervals. The horizontal dashed lines indicate \(95\%\). Bottom panels: Standardized absolute bias (i.e., divided by standard error)

This example illustrates problems caused by data limitations; that is, there is only one data point per year before 1980 and two data points per year since 1980, while there are two REs to predict for each year. In this case, \(\ddot{l}_{j}(\Psi ,\theta _o)\) in Eq. (10) is close to \(-\Sigma ^{-1}\), since its \(\ddot{l}_{c}(\Psi ,\theta )\) component in (2) is small relative to the case when there are many data points each year; here the double dots denote second-order derivatives with respect to \(\Psi \). Hence, the biases of the RE predictions are close to minus the true RE values, \(-\Psi \) (or minus the deviation of the true RE values from the marginal means of the REs). Because there are fewer data for each year before 1980 than after 1980, the biases in the RE predictions before 1980 are generally larger than after 1980, but the corresponding SEs tend to be smaller, as indicated in the top panels of Fig. 5, because the model predictions of the REs cluster around their marginal means due to the lack of data. Therefore, there is a substantial decrease in the Bias/RE ratio after 1980 in the lower panels of Fig. 5. Also due to the data limitations, the MSEs of the RE predictors are close to the marginal variances \(\Sigma \) of the REs (i.e., \(\Psi \)), as suggested by Eq. (11). Hence, combined with the previous result that the bias approaches \(-\Psi \), the distribution of bias/RMSE is close to standard normal, especially before 1980, as evidenced in the bottom panels of Fig. 5.

We calculated the average CI simulation coverage probability (CP) over years and the 250 simulated datasets. Unlike the random-walk simulations, the SPM annual-average CPs depended on \(\Psi \). The range for Vr CPs was 0.334–0.997 for biomass (5% and 95% quantiles: 0.829–0.993) and 0.385–0.995 (quantiles: 0.841–0.992) for harvest rates, although the \(\Psi \)-averages were accurate (0.945 for biomass and 0.947 for harvest rates). Hence, for some values of \(\Psi \) the Vr CIs can have CPs very different than the nominal 0.95 value, even when averaged over years. However, the \(\Psi \)-medians were 0.971 for biomass and 0.969 for harvest rates so for more than 50% of the \(\Psi \)-cases the CIs were conservative when averaged over years. The Vc-based CI CPs were less accurate than Vr CIs. The range was 0.138–0.910 for biomass and 0.169–0.918 for harvest rates, and the \(\Psi \)-averages were 0.738 for biomass and 0.767 for harvest rates.

Simulated parameter estimates (Table 3) demonstrate that the Vc SEs are accurate for the simulation SDs, and the Vr SEs are accurate for the simulation RMSEs. An exception is \(\sigma _{\delta H}\); however, the \(\delta _{Ht}\)’s were fixed at the model predictions when generating the \(250\times 250\) simulated datasets, and we expect that in this situation the Vr SEs of \(\sigma _{\delta H}\) can be inaccurate.

Table 3 Simulation results for the surplus production model parameters (Par)

4 Discussion

We developed frequentist variance approximations for predictions of REs (i.e., \(\Psi \)) in nonlinear mixed-effects models, and functions of these predictions and estimates of model parameters, \(\theta \). We focused on maximum likelihood estimators of \(\theta \) (i.e., \(\hat{\theta }\)) and the conditional mean predictor of \(\Psi \) given data (i.e., \(\hat{\Psi }\)). Our main contribution is a variance approximation for the inferential setting involving repeated sampling of the data when \(\Psi \), although unknown, are assumed to be fixed. This setting is more appropriate when we are particularly interested in statistical inferences based on the specific \(\Psi \) values that generated the data. This setting is also relevant when \(\Psi \) is not a RE but a high dimensional parameter that is modeled as a RE for nonparametric smoothing. There are well-known connections between smoothing methods and RE models (e.g., [4, 34, 37]). This is the case for some state-space fish stock assessment models in which subsets of \(\Psi \) will usually be time-dependent and could be considered to be complex but smooth functions of time.

We demonstrated in simulations that our Vc variance approximation (Eq. 9) is reasonably accurate. The more commonly used Vr variance approximation (Eq. 4, or equivalently Eq. 11), which is the MSE of \(\hat{\Psi }\) when \(\Psi \) is random, represents an average level of the MSE when \(\Psi \) is treated as fixed. This “average” interpretation of the MSE also appears in the generalized additive model (GAM) literature, such as Marra and Wood [21] and Wood [36]. However, for particular values of \(\Psi \) the MSE of \(\hat{\Psi }\) may differ substantially from Vr, depending on the magnitude of the difference between \(\mathrm {E}(\hat{\Psi }|\Psi )\) and \(\Psi \), i.e., the bias. We also developed an accurate approximation for the bias (Eq. 10). We recommend Vr (Eq. 11) for statistical inferences about \(\Psi \) because it reflects an average level of the MSE, although for specific values of \(\Psi \) the MSE may be quite different from the Vr estimate. Although Vc (Eq. 9) is a good approximation of \(\mathrm {Cov}(\hat{\Psi }|\Psi )\), confidence intervals (CIs) based on Vc have worse coverage probabilities than CIs based on Vr because of the conditional bias in \(\hat{\Psi }\).

The statistical properties of estimates of model quantities derived from nonlinear mixed-effects models can be complex, especially when these quantities are functions of both parameters and REs. In practice, the types of confidence intervals illustrated in our stock assessment model example may be over-interpreted by managers or other users of the results for decision making. For users without extensive statistical expertise, the Vr-based confidence intervals may be better and more simply described from the GAM perspective, which is that they are usually slightly conservative when averaged over a large number of years; that is, they are reliable only in the annual-average sense [21, 36]. However, this was not always the case in our surplus production model (SPM) simulations, and in some cases (i.e., for some choices of \(\Psi \)) the simulation CI coverage probability could be quite different from the nominal 95% value, even when averaged over years. We suggest that users be cautioned that confidence intervals may be unreliable in specific instances, such as for a particular year. In the stock assessment context, managers may make fishing quota decisions based on the estimated probability that stock size in the most recent year is greater than a reference value, or that the harvest rate was less than a reference value. Our analyses indicate that these probabilities may be substantially inaccurate in some cases, and it is difficult to know when this occurs. However, we anticipate that better assessment model formulation and the inclusion of more available data can improve the accuracy of CIs.

It is well known that the parameters of SPMs may be poorly identified and poorly estimated with the available data. In this case the biases of parameter estimates and other derived quantities may be highly uncertain, which can produce unreliable statistical inferences. It may be possible to improve inferences by improving the SPM formulation. For our American plaice case study, this could involve better modeling of how harvest rates change over years in light of major changes in fisheries management, such as a fishing moratorium. There is also information available about the differences in the two index catchability parameters (i.e., \(q_E\) and \(q_C\)) that could be used to improve model estimation. There are other formulations of SPMs that may be better (e.g., [27]). However, major improvements will likely involve very different assessment model formulations. There are substantial data about the length structure of the American plaice stock and fisheries that are informative about how reproduction rates have varied over time. There is also historic age information available that is informative about body growth rates. This information could be utilized within an integrated size-structured assessment model (e.g., [22]) to produce more precise and reliable estimates and inferences about the stock population dynamics. However, such assessment model formulation research is beyond the scope of this paper.

When developing the formulas in Sect. 2.2, we assumed that \(\Psi \sim f(\Psi |\theta )\) is correct. However, in simulation self-tests (described in Sect. 1), which are commonly used to examine the reliability of stock assessment models, the data are generated using \(\hat{\Psi }\), and when the distribution of \(\hat{\Psi }\) is considerably different from \(f(\Psi |\theta )\) our Vc approximation may not be reliable. For instance, the covariance matrix of \(\hat{\Psi }\) is given by Eq. (9) in the conditional setting and by Eq. (11) of Zheng and Cadigan [38] in the marginal setting, and both can differ substantially from that of \(f(\Psi |\theta )\). A complication arises for models that include some \(\Psi \) with little or no corresponding data, in which case \(\hat{\Psi }\) may be close to zero and simulation self-tests will not reflect uncertainty about the value of \(\Psi \). The SPM simulation results in Sect. 3.3.1 illustrated this case. Even though the true SPM was fit to the simulated data, the biases of the RE predictions were substantial. This raises concerns about the reliability of simulation self-tests that generate data with these highly biased RE predictions, whereby correct models may be identified as inadequate. We will investigate simulation self-tests in a separate paper.

Even though we demonstrated in Eq. (17) that the conditional covariance in Eq. (8) gives the asymptotic variance of the regression parameter estimators for Gaussian-process semiparametric regression [13], we also found that the parameter estimates are biased at order \(O(T^{-1/2})\) in finite samples in the conditional setup, which is quite a general situation for semiparametric inference, and in some cases the bias can be larger (see, e.g., Eq. 79 in Zheng and Sutradhar [40]). Therefore, we suggest that the MSE (4) be used for constructing conditional CIs in finite samples.

The conditional inference discussed in this paper shares some similarity with the GAM framework, and hence the two share many similar results. We have seen this in the interpretation of CI coverage. As another instance, Eq. (4) gives the posterior covariance matrix \(\mathbf {V}_{\beta }\) for the optimal coefficients \(\mathbf {\beta }\) of the basis functions on page 327 of Wood [36]. Nevertheless, GAMs are a large topic and their inference approaches can differ from the maximum likelihood method we consider here. Therefore, we do not attempt to build a full connection between GAMs and our results in this paper. We leave the study of statistical inference for GAMs and GAMMs (generalized additive mixed models) implemented with TMB, based on the results in this paper, as future research.