Frequentist Conditional Variance for Nonlinear Mixed-Effects Models

Nonlinear mixed-effects models are commonly used in fisheries and ecological studies to account for complex relationships and dependencies in data. These models involve both fixed parameters to estimate and random effects (REs) to predict. This paper addresses the inferential setting involving repeated sampling of the data conditional on the unknown REs. This setting is more appropriate when the focus is on statistical inferences based on the specific values of the REs that generated the data. Assuming the Laplace approximation is appropriate for deriving the marginal likelihood, and following a frequentist framework, this work derives RE-conditional bias approximations for maximum likelihood parameter estimators and empirical Bayes RE predictors, as well as the conditional covariance and mean squared error (MSE) among parameter estimators and RE predictors. It is shown that the RE-conditional MSE can be approximated with the unconditional MSE. Simulation studies demonstrate that the variance and MSE approximations are reasonably accurate for relevant sample sizes. Considering the finite-sample RE-conditional biases in the parameter estimates and RE predictions, the MSE is more appropriate for constructing confidence intervals (CIs), and the CI coverage of the REs should be interpreted as the average coverage over a range of REs or over repeated generation of the REs.


Introduction
Nonlinear mixed-effects models have been widely implemented to address complex multivariate correlation structures in data (see, e.g., [10,11]; among many others) and cover a broad spectrum of statistical models. In some applications, the fixed effects, such as the regression parameters, are of primary interest, while the random effects (REs) are introduced only to account for the complex dependencies in the data (e.g., [16,40]). However, in many other applications, REs or functions of REs represent quantities of practical significance and hence are also important to predict, and the correlations among REs are used to improve statistical inferences at spatiotemporal locations with few data (e.g., [3,39]).
According to the definition of conditional probability, mixed-effects models (linear or nonlinear) can be written in the form f(D, ψ | θ) = f(D | ψ, θ) f(ψ | θ) (see, e.g., [20,29]), where the vector of data D is assumed to have a multivariate probability density/mass function (pdf/pmf) f(D | ψ, θ), given values of the vector of fixed-effects parameters θ and the vector of REs ψ. The marginal distribution of ψ is f(ψ | θ). A specific example of the mixed-effects model is the fisheries state-space population dynamics model, where f(ψ | θ) is the process model describing how the latent population processes evolve over time and/or space, and f(D | ψ, θ) is the observation model linking the data to the latent processes (e.g., [33]). Nonlinear mixed-effects models have numerous applications in many fields including fisheries, ecology, environmental sciences, econometrics and engineering (e.g., [17]). The implementation of these models in fisheries and ecological studies relies heavily on software packages including Automatic Differentiation Model Builder (ADMB, Fournier et al. [9]) and Template Model Builder (TMB, Kristensen et al. [19]). Therefore, in this paper we study inference for nonlinear mixed-effects models as implemented with TMB or ADMB. These packages use the maximum marginal likelihood estimator (MMLE) to estimate the fixed effects θ.
The marginal distribution of D is

f(D | θ) = ∫ ⋯ ∫ f(D | ψ, θ) f(ψ | θ) dψ_1 ⋯ dψ_q,   (1)

where ψ_1, …, ψ_q are the elements of the q × 1 vector ψ. For simplicity, this q-fold integral is denoted as ∫ f(D | ψ, θ) f(ψ | θ) dψ. The MMLE of θ is the value θ̂ that maximizes f(D | θ). Throughout this paper, we use θ̂ to denote the MMLE of θ. The integral in Equation (1) will usually not have a closed form; however, TMB can approximate the marginal likelihood via the Laplace approximation quickly for possibly many (i.e., tens of thousands of) REs by efficiently utilizing the sparseness of the joint distribution f(D, ψ | θ) with respect to ψ. REs can be predicted with the conditional mean ψ̂_E(θ̂) = ∫ ψ f(ψ | D, θ̂) dψ, which is also the empirical Bayes predictor of ψ in the Bayesian framework (e.g., [18]). McCulloch and Neuhaus [23] showed, for generalized linear mixed models, that E{ψ | D, θ} is the best predictor in the sense of minimizing the overall mean squared error (MSE) of prediction. REs can also be predicted with the posterior mode ψ̂(θ̂) that maximizes the joint distribution f(D, ψ | θ̂) or, equivalently, the posterior f(ψ | D, θ) when θ = θ̂. Note the difference between the posterior mean ψ̂_E and the posterior mode ψ̂. In linear mixed models, the posterior-mode RE predictor ψ̂ is known as the empirical best linear unbiased predictor (EBLUP; Robinson [30]). In generalized linear mixed models, Jiang et al. [15] called ψ̂ the maximum posterior estimate (MPE) of ψ, and proved that, given sufficient information about the REs, a restricted version of the MPE exhibits overall consistency no matter the value of the dispersion parameters of the RE distribution at which ψ̂ is evaluated, even though the prediction of an individual RE is biased. In this paper, we use the posterior mode ψ̂ to predict REs under a more general situation where there may not be sufficient data for all the REs, and especially where there may be no data for some subset of the REs. When the joint pdf f(D, ψ | θ) is unimodal and approximately symmetric about ψ̂, then ψ̂_E and ψ̂ are approximately the same.
The focus of our research is statistical inference with TMB and ADMB, which apply the Laplace approximation by assuming f(D, ψ | θ) is approximately multivariate normal (MVN). Under such circumstances, ψ̂_E is approximately equivalent to ψ̂, and hence the good properties of ψ̂_E are also valid for ψ̂.
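To make the Laplace construction concrete, the following sketch (ours, not from the paper) computes the Laplace-approximated marginal loglikelihood for a toy single-RE model, y_i | u ~ N(μ + u, σ²) with u ~ N(0, τ²). The function name and parameterization are illustrative assumptions; because the joint loglikelihood is exactly quadratic in u here, the approximation reproduces the exact marginal loglikelihood.

```python
import numpy as np

def laplace_marginal_loglik(y, theta):
    # Toy one-RE model: y_i | u ~ N(mu + u, sigma^2), u ~ N(0, tau^2).
    mu, sigma, tau = theta
    n = len(y)
    # Posterior mode of u (closed form because l_j is quadratic in u)
    u_hat = tau**2 * np.sum(y - mu) / (n * tau**2 + sigma**2)
    # Joint loglikelihood l_j(u_hat, theta) = l_c + l_r at the mode
    lj = (-0.5 * n * np.log(2 * np.pi * sigma**2)
          - 0.5 * np.sum((y - mu - u_hat)**2) / sigma**2
          - 0.5 * np.log(2 * np.pi * tau**2)
          - 0.5 * u_hat**2 / tau**2)
    # Curvature of the negative joint loglikelihood at the mode (q = 1)
    H = n / sigma**2 + 1 / tau**2
    # Laplace approximation: l(theta) ~ l_j + (q/2) log(2 pi) - (1/2) log|H|
    return lj + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(H)
```

For linear-Gaussian models the Laplace approximation is exact, which is why the approximation is accurate whenever the joint loglikelihood is close to quadratic in ψ.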
We consider a conceptual frequentist inferential setting where the REs, ψ, are drawn once from the process model f(ψ | θ) and then fixed at these values during repeated data generations from the observation model f(D | ψ, θ). This is a realistic inferential setting, since in many cases an effect is treated as random only because it is unobservable and high-dimensional, and not because it is truly random. For instance, the popular lasso/L1 regularization [32] for addressing high-dimensional (HD) linear regression parameters is equivalent to introducing a double exponential (Laplace) marginal distribution (or prior, in a Bayesian interpretation) on the HD coefficients, and then estimating the HD parameters using the posterior mode [2]. In fisheries state-space assessment models, the annual population abundance and fishery mortality rates are frequently modeled as REs (e.g., [5,28]). Even though there are process errors in how these effects are modeled, there is only one set of process errors and only one time-series of true population abundance and fishing mortality rates to make statistical inferences for. That is, the yearly time-series of unknown population abundance and mortality rates may be one draw from a larger population (i.e., they are random variables), but after they have been established, they behave like high-dimensional parameters, staying constant during repeated sampling (i.e., catches) in different months of the year or at different locations. Under such circumstances, it is more appropriate to make statistical inferences conditional on the unknown REs. In this conditional inferential setting, rather than the marginal mean and covariance, we should evaluate the conditional mean E{· | ψ} and covariance Cov{· | ψ}. The marginal statistical properties are different from the conditional properties, and this can lead to mis-interpretation of CIs, and possibly to wrong fisheries management decisions, if the conditional setting is actually appropriate.
For example, in the marginal setting the parameter estimators and RE predictors are all approximately unbiased [38]; however, in the conditional setting their biases are not negligible (see Sect. 2.2). In this paper, we investigate the biases and covariances of parameter estimators and RE predictors in the conditional setting, and we also examine the CI coverage properties using simulation studies.
The marginal inferential setting may be mis-specified, which is often revealed when simulation testing the efficacy of state-space models. In the marginal setting, in each simulation run the REs need to be generated from f(ψ | θ), which frequently results in unrealistic REs, the extinction of the simulated fish stock, and unusable simulation data. A commonly used procedure to address this problem is repeated sampling of D from f(D | ψ̂, θ̂) (e.g., [5,26,28]); namely, the REs are fixed at ψ̂ instead of being randomly generated in each simulation. In stock assessment, this is referred to as a simulation self-test [6]. This simulation setup is much closer to our conditional setting than to the marginal setting, and hence a study based on the conditional setting can reveal and explain the difference between marginal inference and self-tests, and improve the interpretation of self-test results. However, the distribution of the RE predictor ψ̂ is different from that of the RE ψ, namely f(ψ | θ), and thus the results in this paper for conditioning on the true REs may not be fully applicable to self-tests. This issue will be further clarified in Sect. 4.
The results in this paper are also generally applicable to the Gaussian process semiparametric regression model of He and Severini [12,13] and the type of integrated likelihood [1,31] used for primary model parameters (e.g., regression coefficients) while the unknown nuisance parameters, even though fixed, are integrated out by technically assuming some distribution which is usually MVN. We will illustrate this application with an example in Sect. 2.3.

Notation and Background
Consider a nonlinear mixed-effects model for random response data which are collected in an n × 1 vector D and are assumed to have a multivariate pdf f(D | ψ, θ). The means and covariances of D depend on the fixed-effects parameters θ (p × 1) and the random effects ψ (q × 1), possibly via nonlinear functions of θ, ψ, and covariates, for which we do not develop notation and leave implicit in f(D | ψ, θ). The pdf of ψ is f(ψ | θ). We denote the joint loglikelihood of θ and ψ as

l_j(ψ, θ) = l_c(ψ, θ) + l_r(ψ, θ),   (2)

with the conditional data loglikelihood l_c(ψ, θ) = ln{f(D | ψ, θ)} and the loglikelihood of the REs l_r(ψ, θ) = ln{f(ψ | θ)}. The marginal distribution of D is given by Eq. (1), and the marginal loglikelihood is denoted as l(θ). The true parameters θ_o are estimated with the MMLE θ̂, and the REs are predicted with the mode of l_j(ψ, θ̂) with respect to ψ, which is denoted as ψ̂(θ̂). Here the unknown true parameters θ_o are replaced with the MMLE θ̂. More generally, ψ̂(θ) denotes the mode of l_j(ψ, θ) with respect to ψ for general θ and can be found by solving the equation

l̇_j(ψ, θ) = ∂l_j(ψ, θ)/∂ψ = 0.   (3)

If the joint pdf f(D, ψ | θ) = f(D | ψ, θ) f(ψ | θ) is unimodal and approximately symmetric for ψ, then ψ̂(θ) is a good approximation for the conditional mean of the REs given the data, E{ψ | D, θ}. When deriving approximation orders, we assume that there are i = 1, …, T observational units and that there are n_i observations in the ith unit that share the same subset of REs. For example, in a time-series setting T may indicate the number of years, and n_t the number of observations in year t. Our approximation orders will be conservative in some cases.
One of the main results in Zheng and Cadigan [38] is given here as a proposition for future reference.
Proposition 1 (adapted from Eqs. (13) and (14) of Zheng and Cadigan [38]) If the conditional distribution of ψ given data D is approximately MVN, the mean squared error (MSE) of the RE predictors and parameter estimators can be estimated with

MSE(ψ̂) ≈ −l̈_j(ψ̂, θ̂)^{-1} + {∂ψ̂(θ)/∂θ} Cov(θ̂) {∂ψ̂(θ)/∂θ}ᵀ,   (4)

where ∂ψ̂(θ)/∂θ denotes ∂ψ̂(θ)/∂θ |_{θ=θ̂} and Cov(θ̂) = −l̈^{-1}(θ̂), which is the matrix inverse of the Hessian of the negative marginal loglikelihood evaluated at θ̂.
TMB uses Eq. (4) combined with the generalized delta method to calculate the prediction standard errors (SEs) for user-specified differentiable functions of REs and parameters (g(ψ, θ); see Eq. 15 in Zheng and Cadigan [38]). Hence, TMB implicitly assumes that the conditional distribution of ψ given data D is approximately normal, which is also required for the Laplace approximation to be accurate for the marginal likelihood in Eq. (1). TMB generalized delta SEs implicitly assume that both D and ψ are random. In the next section, we provide ψ-conditional covariances that can be used with the generalized delta method to derive SEs that are appropriate when only D is considered to be random and ψ is fixed.
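As a small illustration of the generalized delta method, the sketch below propagates a parameter covariance matrix through a differentiable function g. The function name and the finite-difference gradient are our assumptions for illustration; TMB obtains the gradient by automatic differentiation instead.

```python
import numpy as np

def delta_method_se(g, theta_hat, cov, eps=1e-6):
    # Generalized delta method: SE{g(theta_hat)} ~ sqrt(grad' Cov grad),
    # with the gradient of g obtained by central finite differences.
    theta_hat = np.asarray(theta_hat, dtype=float)
    grad = np.zeros_like(theta_hat)
    for k in range(len(theta_hat)):
        step = np.zeros_like(theta_hat)
        step[k] = eps
        grad[k] = (g(theta_hat + step) - g(theta_hat - step)) / (2 * eps)
    return float(np.sqrt(grad @ cov @ grad))
```

For a linear function g the first-order approximation is exact; for nonlinear g it is a first-order Taylor-series approximation, in the same spirit as Eq. (4).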

Conditional Covariance and MSE
We consider the inferential setting in which ψ is randomly generated from the true model f(ψ | θ_o) only once, and then fixed in the subsequent generations of the data D from f(D | ψ, θ_o) as the basis for frequentist inference. Throughout this paper, we use the subscript "o" to denote the true value. The conditional covariance Cov(ψ̂ | ψ) measures the variability of ψ̂ when only D is re-sampled from f(D | ψ, θ_o). We derive an approximation of Cov(ψ̂ | ψ) using a first-order Taylor series expansion of ψ̂(θ̂) about θ̂ = θ_o, which gives

ψ̂(θ̂) = ψ̂(θ_o) + {∂ψ̂(θ)/∂θ |_{θ=θ_o}} (θ̂ − θ_o) + O_p(T^{-1}).   (5)

The O_p(T^{-1}) in Eq. (5) comes from the (θ̂ − θ_o)² and higher-order expansion terms. We use the O(·) and o(·) notations in a matrix sense, such that they apply to each element of (·). Based on Eq. (5), we can expand Cov(ψ̂ | ψ) in terms of Cov{ψ̂(θ_o) | ψ}, Cov(θ̂ | ψ), and the conditional cross-covariance Cov{ψ̂(θ_o), θ̂ | ψ} between the vectors ψ̂(θ_o) and θ̂; the approximation orders of the remainder terms are proved in Appendix C. These results can be summarized in the following matrix form.

Theorem 1 The conditional covariance of the RE predictors and parameter estimators, Cov{(ψ̂ᵀ, θ̂ᵀ)ᵀ | ψ}, is given in block form by the component covariances Cov{ψ̂(θ_o) | ψ}, Cov(θ̂ | ψ), and Cov{ψ̂(θ_o), θ̂ | ψ} derived above.

With this formula and the subsequent approximations, the generalized delta method can be used to evaluate the conditional covariance of the estimate of a differentiable function of θ and ψ. Define Δ = ψ − {∂ψ̂(θ)/∂θ}θ, where ∂ψ̂(θ)/∂θ is used as a constant matrix. Also, let l̃_r(Δ, θ) be l_r(ψ, θ) in (2) with the variables (ψ, θ) transformed to (Δ, θ). For the conditional bias and covariance of the MMLE given ψ, in Appendix A we prove the following theorem.

Theorem 2 If the marginal distribution (1) can be evaluated well with the Laplace approximation, then the bias of the MMLE of θ conditional on the REs, E(θ̂ − θ_o | ψ), is given by Eq. (7), and the conditional covariance Cov(θ̂ | ψ) is given by Eq. (8).
When the estimator l̈^{-1}(θ̂) I_r(θ̂, ψ̂) l̈^{-1}(θ̂) for the second term in (8) is not positive definite, we recommend using its nearest positive definite matrix [14], as discussed in the paragraph following Eq. (A.6) in the Appendix. Note that Cov(θ̂) in Eq. (8) involves expectations with respect to the marginal distribution of the random response variables, namely the data D, while the Cov part of Cov{E(θ̂ | ψ)} involves expectations with respect to the distribution of ψ. The conditional bias in (7) is of order O(T^{-1/2}). For evaluating Cov{ψ̂(θ_o) | ψ}, following Theorem 1 and Eq. (B.5) in Appendix B we have the following corollary.

Corollary 1 When the distribution of the REs given the data is approximately MVN, then

Cov{ψ̂(θ_o) | ψ} ≈ −l̈_j(ψ, θ_o)^{-1} + l̈_j(ψ, θ_o)^{-1} l̈_r(ψ) l̈_j(ψ, θ_o)^{-1}.   (9)

If the REs are also MVN, then

Cov{ψ̂(θ_o) | ψ} ≈ −l̈_j(ψ, θ_o)^{-1} − l̈_j(ψ, θ_o)^{-1} Σ^{-1} l̈_j(ψ, θ_o)^{-1}.

Here, if the REs are MVN with covariance matrix Σ, then Σ^{-1} = −l̈_r(ψ). Note that l̈_j(ψ, θ_o)^{-1} Σ^{-1} l̈_j(ψ, θ_o)^{-1} will be a positive definite matrix, so that the diagonal variances of Cov{ψ̂(θ_o) | ψ} will be smaller than the diagonal variances of Cov{ψ̂(θ_o) − ψ} when ψ is random (i.e., compare Eqs. 9 and 4). This also makes sense because of the restriction on randomization when ψ is fixed. However, the difference will be small when the data are highly informative about the REs (i.e., l̈_j(ψ̂, θ̂)^{-1} → 0 in some sense, e.g., Fahrmeir and Kaufmann [7]), and estimates of these effects are statistically like fixed-effects parameters.
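The shrinkage relationship between the conditional variance and the unconditional MSE can be checked analytically in the simplest conjugate normal model (our toy example, not from the paper): y_i | u ~ N(u, σ²) with u ~ N(0, τ²), where the posterior mode is û = k ȳ with shrinkage factor k, and every quantity has a closed form.

```python
def shrinkage_moments(n, sigma_obs, tau):
    # One-RE normal model: y_i | u ~ N(u, sigma_obs^2), u ~ N(0, tau^2).
    # Posterior-mode predictor: u_hat = k * ybar, shrinking toward the marginal mean 0.
    v = sigma_obs**2 / n          # Var(ybar | u)
    k = tau**2 / (tau**2 + v)     # shrinkage factor
    Vc = k**2 * v                 # conditional variance Var(u_hat | u), Eq. (9) analogue
    Vr = k * v                    # unconditional MSE E(u_hat - u)^2, Eq. (4) analogue
    return k, Vc, Vr
```

Here Vc = k·Vr < Vr, matching the corollary, and the conditional MSE given u is Vc + (1 − k)²u²; averaging the squared bias over u ~ N(0, τ²) adds (1 − k)²τ² and recovers Vr exactly, previewing Corollary 2 below.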
When ψ are actually fixed effects that are treated as random for smoothing purposes, ψ̂ is a biased estimator of ψ. For this bias, we have the following evaluation.

Theorem 3 If the conditional distribution of ψ given data D is approximately MVN, then

E(ψ̂ − ψ | ψ) ≈ −l̈_j(ψ, θ_o)^{-1} l̇_r(ψ, θ_o).

If ψ is also MVN with covariance matrix Σ, then

E(ψ̂ − ψ | ψ) ≈ l̈_j(ψ, θ_o)^{-1} Σ^{-1} ψ.   (10)

Here the leading term for the bias of ψ̂_i, the ith element of ψ̂, is of order O(1/(n_i + 1)), with n_i being the sample size associated with ψ_i. This theorem can be proved easily from the preceding equations. Based on the results in this section, in Appendix D we prove the following corollary.

Corollary 2 If ψ is MVN with covariance matrix Σ and the conditional distribution of ψ given data D is approximately MVN, then

MSE(ψ̂ | ψ) ≈ Cov(ψ̂ | ψ) + E{E(ψ̂ − ψ | ψ) E(ψ̂ − ψ | ψ)ᵀ}.   (11)

Equation (11) is equal to Eq. (4), namely the unconditional MSE of ψ̂, MSE(ψ̂).
Here, because ψ is unknown and can only be estimated with a bias of order O(1), as indicated by Eq. (10), we use E{E(ψ̂ − ψ | ψ) E(ψ̂ − ψ | ψ)ᵀ}, the average over ψ, to give an overall estimate of the conditional bias squared E(ψ̂ − ψ | ψ) E(ψ̂ − ψ | ψ)ᵀ.

Semiparametric Regression Example
As a partial validation and application of the theoretical results in Sect. 2.2, we consider the Gaussian process semiparametric regression studied in He and Severini [13],

y_i = x_iᵀβ + γ(z_i) + ε_i,  i = 1, …, n,

where x_1, …, x_n are p × 1 covariate vectors, ε_1, …, ε_n are unobserved independent normal random variables each with mean 0 and standard deviation σ > 0, β is a p × 1 vector of unknown regression parameters, z_1, …, z_n are observed constants taking values in a set Z, and γ is an unknown real-valued function on Z. He and Severini [13] further defined g_γ = {γ(z_1), …, γ(z_n)}ᵀ and wrote the model as Y = Xβ + g_γ + ε. The covariance matrix of ε is denoted as Ω_φ and assumed to have a parametric form with parameter φ. The regression coefficients β are of primary interest, and g_γ are nuisance effects. Even though g_γ is actually fixed, He and Severini [13] technically treated it as a mean-zero Gaussian process with n × n covariance matrix Σ_λ parameterized by λ, so that g_γ can be integrated out to obtain the marginal likelihood.

Random Walk Simulation Example
We also illustrate Eqs. (8), (9) and (11) using a simple random-walk example. The REs follow ψ_t = ψ_{t−1} + δ_t with δ_t ~ N(0, σ_ψ²) for t = 2, …, T, and ψ_1 = β is an unknown parameter to estimate. Here N(μ, σ²) denotes the normal distribution with mean μ and variance σ². At each time-step, there are n independent observations y_{it} ~ N(ψ_t, σ_y²), i = 1, …, n and t = 1, …, T. The parameters are θ = (β, σ_y, σ_ψ)ᵀ and the REs are ψ = (ψ_2, …, ψ_T)ᵀ, which is a (T − 1) × 1 vector. This process can actually be regarded as a specific realization of the Gaussian process semiparametric regression described in the previous section, with regression parameter β, ψ = g_γ, and σ_y and σ_ψ corresponding to the dispersion parameters φ and λ, respectively. In Sect. 3.1, we showed that Cov{β̂ | ψ} can be correctly evaluated by Eq. (8), and that the RE predictor obtained by maximizing the joint loglikelihood is the Best Linear Predictor (BLP; e.g., Robinson [30]) used in He and Severini [13]. In this example, we demonstrate the statistical properties of ψ̂ and the parameter estimators using a simulation study.
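A minimal sketch of the posterior-mode predictor for this random-walk model, assuming the variances are known and ψ_1 = β is left unpenalized; the function and variable names are ours. The mode solves a penalized least-squares problem whose normal equations form a tridiagonal system:

```python
import numpy as np

def rw_posterior_mode(y, sigma_obs, sigma_rw):
    # y: (T, n) array of observations; random walk psi_t = psi_{t-1} + delta_t.
    # The posterior mode minimizes
    #   sum_{t,i} (y_it - psi_t)^2 / (2 sigma_obs^2)
    # + sum_{t>=2} (psi_t - psi_{t-1})^2 / (2 sigma_rw^2),
    # with no penalty on psi_1 (the fixed parameter beta).
    T, n = y.shape
    A = np.zeros((T, T))
    b = np.zeros(T)
    w_obs = n / sigma_obs**2
    w_rw = 1.0 / sigma_rw**2
    for t in range(T):
        A[t, t] += w_obs
        b[t] += y[t].sum() / sigma_obs**2
    for t in range(1, T):          # random-walk difference penalties
        A[t - 1, t - 1] += w_rw
        A[t, t] += w_rw
        A[t - 1, t] -= w_rw
        A[t, t - 1] -= w_rw
    return np.linalg.solve(A, b)
```

As sigma_rw grows, the penalty vanishes and the predictor approaches the per-time sample means; as it shrinks, the predictor is smoothed toward a common level, which is the shrinkage behavior discussed in Sect. 2.2.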

Stock Assessment Example
The Schaefer form of the state-space surplus production model (SPM, e.g., Meyer and Millar [24]) gives the latent total stock biomass in year t (i.e., B_t ≥ 0) as a function of the biomass in the previous year plus production (births + growth − natural deaths) minus the fishery catch (C_t, tonnes), with production modeled as a quadratic function of biomass, that is,

B_{t+1} = B_t + r B_t (1 − B_t/K) − C_t,

where the parameter r controls the intrinsic rate of biomass increase at low population size and K is the carrying capacity. We assume that there is measurement error (ME) in catches and include a catch model, C_t = H_t B_t, where the harvest rate H_t ≥ 0 is based on a random walk. The stochastic population dynamics model adds process errors δ_Bt and δ_Ht to the biomass and harvest-rate equations, respectively (Eq. 14). We apply this model to data for a flatfish species off the east coast of Canada. The data available include annual estimates of total fishery catches of American plaice in Northwest Atlantic Fisheries Organization Subdiv. 3Ps during 1960-2019, along with biomass indices from annual research surveys since 1980. Our state-space model observation equations link the time-series of survey indices (I_t) and the catch observations (C_ot) to the latent states; SPM times t = 0, …, T correspond to years 1960-2019. Both q_E and q_C are survey catchability parameters to estimate. These are different because there was a major change in survey gears and stratification in Subdiv. 3Ps beginning with the 1996 survey. We assume that σ_C = 0.1 and do not estimate this parameter, so that the model fits the catches closely, consistent with Morgan et al. [25]. The total set of parameters to estimate for the state-space model is H_0, r, K, σ_δH, σ_δB, q_E, q_C, and σ_I, along with RE predictions for B_t and H_t, t = 1961, …, 2019, and B_0 for 1960. We also assume that the initial biomass is random, which is broadly similar to the prior for B_0 in Morgan et al. [25]. More details about the development of this model are provided in the Supplementary Information.
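The deterministic core of the Schaefer update can be sketched as follows (a simplified illustration without the process errors or the observation equations; the function name is ours):

```python
def schaefer_step(B, r, K, C):
    # One-year Schaefer update: biomass + quadratic surplus production - catch,
    # truncated at zero to respect the constraint B_t >= 0.
    return max(B + r * B * (1.0 - B / K) - C, 0.0)
```

With no catch, B = K is an equilibrium (production is zero at carrying capacity), and surplus production peaks at B = K/2, which is why catches exceeding production drive the simulated stock toward extinction, as noted in the SPM simulations below.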

SPM Simulations
The SPM is much slower to fit than the random walk model in Sect. 2.4, so we only generated 250 datasets conditional on a random value of ψ drawn from f(ψ | θ); we repeated this procedure with 250 randomly generated ψ's from f(ψ | θ) to see the average effect over different ψ's, for a total of 62 500 simulations. Also, as mentioned in the Introduction, this procedure will often generate datasets that are unrealistically different from the observed data. In fact, some values of ψ even result in stock extinction and model estimation errors. In particular, the random walk standard deviation for δ_Ht (i.e., σ_δH in Table 2) is large and results in many unrealistic simulated harvest rate series. To avoid these problems, we simply fixed the δ_Ht REs at their predicted values in the simulations, and only generated random ψ's for the model process errors (i.e., δ_Bt in Eq. 14).
In many of the simulations, the estimates of the process error variance (σ²_δB) hit a very small lower bound, indicating that process errors were not needed to fit the simulated data well. This resulted in long simulation run times and problems computing Vf and Vr. To avoid these problems, we fixed σ²_δB at the value in Table 2.

Semiparametric Regression Example
The MMLE of β is

β̂ = {Xᵀ V(θ̂)^{-1} X}^{-1} Xᵀ V(θ̂)^{-1} Y,

which agrees with the generalized least-squares estimate of He and Severini [13]. Here V(θ) = Ω_φ + Σ_λ and θ = (φ, λ)ᵀ. The predictor of g_γ obtained by maximizing the joint likelihood is

ĝ_γ = Σ_λ V(θ̂)^{-1} (Y − X β̂),

which is the same as the BLP of g_γ in He and Severini [13]. Applying Theorem 2, we obtain the conditional covariance of β̂ given g_γ, which is the same as the result in Theorem 4.2 of He and Severini [13]. Furthermore, our conditional covariance for ĝ_γ is also consistent with the results of He and Severini [13]. The detailed derivations of all these results are provided in Appendix E.
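The β̂ formula above can be checked numerically with a small generalized least-squares sketch (the function name is ours):

```python
import numpy as np

def gls_beta(X, Y, V):
    # Generalized least squares: beta_hat = (X' V^-1 X)^-1 X' V^-1 Y,
    # where V = Omega_phi + Sigma_lambda is evaluated at the MMLE of (phi, lambda).
    Vi_X = np.linalg.solve(V, X)
    Vi_Y = np.linalg.solve(V, Y)
    return np.linalg.solve(X.T @ Vi_X, X.T @ Vi_Y)
```

When V is the identity, this reduces to ordinary least squares, so an intercept-only design returns the sample mean of Y.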

Random Walk Simulation Example
The data from an arbitrary simulation with n = 2 and T = 50 are illustrated in Fig. 1, along with predictions of ψ_t and 95% confidence intervals (CIs) based on the conditional MSE in Eq. (11) and the conditional variance in Eq. (9), which we denote as Vc. The conditional MSE (11) is equal to the random-ψ MSE (4) that TMB provides; therefore, we denote the conditional MSE as Vr. The Vr-based (i.e., TMB) CIs cover the real values of ψ in 92% of the years, which is close to the nominal 95% coverage of the CIs. The Vc-based CIs cover ψ in 86% of the years. However, this is only based on one simulated set of y's. We repeated the simulation 1000 times, conditional on the true ψ_t values in Fig. 1. That is, we generated 1000 datasets from f(D | ψ, θ).
We computed the average of the 1000 ψ̂_t's at each time point, shown in the top panel of Fig. 2. The ψ̂_t's are nearly unbiased, but a little smoother than the true ψ_t's, such that the average ψ̂_t's do not exactly match the peaks and valleys of the ψ_t's. This is typical of smoothing estimators. The biases are shown in the middle panel of Fig. 2. The simulation average of the estimated bias using the approximation in Eq. (10) is reasonably accurate. Note that the grey points in this panel are based on the true value of ψ in the leading term of Eq. (10) and are usually very close to the real simulated bias. The bias estimates using the ψ̂_t's (heavy black lines) differ more from the simulated bias. This demonstrates that plug-in estimates of the bias can be less accurate. The bottom panel of Fig. 2 demonstrates that the Vc estimates using Eq. (9) are a little low for this example, while the MSE estimates using Eq. (11) give average levels of the simulation-based MSEs, which vary substantially across time because Eq. (11) was derived by taking the expectation of the bias squared over ψ. The MSE estimates are also a little low on average.
We conducted many more simulations and the patterns were similar to Fig. 2. However, the results depended on the specific ψ_t's in each simulation. We averaged results (see Fig. 3) over 500 randomly generated sets of random-walk ψ_t's, each with 500 simulations of the data conditional on the ψ_t's, and for different choices of n and T. The results demonstrate that when T is large the variance estimator (9) is reliable on average, and the MSE estimator (11), which is also the TMB estimator (4), represents an average level of the MSE for RE predictions. However, the 90% quantiles in Fig. 3 demonstrate that for specific values of ψ the Vc-based estimates of the SEs of ψ̂ and the Vr-based estimates of the RMSEs can be considerably different from the simulation-based results. The quantiles and means are based on the average variance estimates (i.e., the Vc-based and Vr-based SEs) for each of the 500 ψ cases, where for each ψ, as in the lower panel of Fig. 2, the averages of the 500 Vc- and Vr-based SEs over the 500 simulated datasets are computed, along with the sample standard deviations and RMSEs of ψ̂.

Fig. 3 The thick red and blue lines indicate the average (i.e., over 500 ψ's) simulation standard deviations (i.e., based on 500 datasets for each ψ; columns 1 and 2) and root mean squared errors (columns 3 and 4) of ψ̂_t, respectively. The thick grey lines are simulation-average standard errors based on Vr (Eq. 11) and the thick black lines are based on Vc (Eq. 9). Thin lines of corresponding color indicate medians among the 500 ψ cases, and shaded regions indicate the 5% and 95% quantiles. T is the number of time points and n is the number of samples at each time
We also investigated the coverage properties of 95% confidence intervals (CIs) for ψ based on a normal distribution assumption for ψ̂_t, with Vr or Vc estimates of the variance. The simulation average coverage of the Vr CIs (see Fig. 4) was close to the 95% nominal level, whereas the probability that the Vc CIs contained ψ was somewhat lower than 95%, even when T = 200 and n = 5. The ψ-conditional bias in ψ̂ contributes to the reduced coverage of the Vc-based CIs. However, for specific values of ψ the Vr-based CIs could differ somewhat from the nominal 95% level, with the simulated coverages less than 95% for approximately 50% of the random values of ψ. Hence, our results indicate that the Vr-based CIs are somewhat inaccurate and not simply conservative, and that the Vc-based CIs are less accurate than the Vr-based CIs, which is expected because of the conditional bias in ψ̂.

Fig. 4 The thick grey lines are the average (i.e., over 500 ψ's) simulation coverage (i.e., based on 500 datasets for each ψ) of 95% Vr-based (i.e., RMSE) CIs, and the thick black lines are for Vc-based (i.e., SE) CI coverage probabilities. The horizontal dashed line indicates 95%. Thin lines of corresponding color indicate medians among the 500 ψ cases, and shaded regions indicate the 5% and 95% quantiles. T is the number of time points and n is the number of samples at each time
The GAM (generalized additive model) literature suggests that GAM confidence intervals are accurate when averaged over the smoothing covariate (e.g., [21,36]), which is time in the random-walk example. For each fixed ψ, we calculated the average CI coverage over time points and the 500 simulated datasets. The Vr-based CI coverage was accurate, and in the worst case (n = 2 and T = 50) ranged from 0.93 to 0.95 over the 500 ψ cases. The coverages were closer to 0.95 for the other choices of n and T. The Vc-based CI coverage was less accurate, and in the worst case (n = 2 and T = 50) ranged from 0.88 to 0.93. The simulation standard deviations (SDs) and RMSEs for the random-walk parameter estimates, along with the averages of the Vr-based and Vc-based SEs, are shown in Table 1. Keeping in mind the simulation approximation errors, the Vc-based SEs are reasonably accurate for the simulation SDs of the estimates, and the Vr-based SEs are reasonably accurate for the RMSEs. These approximations were most accurate for the variance of β̂.
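The coverage summaries above amount to averaging interval-containment indicators over simulations and time points; a minimal sketch (ours) is:

```python
import numpy as np

def coverage(psi_true, psi_hat, se, z=1.959964):
    # Empirical coverage of normal-approximation intervals psi_hat +/- z*se,
    # averaged over all simulation-by-time cells.
    psi_true, psi_hat, se = map(np.asarray, (psi_true, psi_hat, se))
    covered = np.abs(psi_hat - psi_true) <= z * se
    return float(covered.mean())
```

Averaging the indicator over time (for fixed ψ) gives the GAM-style "across-the-function" coverage, while averaging over repeated draws of ψ gives the ψ-averaged coverage reported in Figs. 3 and 4.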

Stock Assessment Example
Our SPM parameter estimates and stock assessment results (Table 2) are broadly similar to those of Morgan et al. [25]. Here the Vc- and Vr-based standard errors are shown as coefficients of variation, CVc and CVr, respectively. The CVc's are usually substantially smaller than the CVr's but, because of the conditional bias in the parameter estimators and RE predictors, the CVr's provide more reliable confidence intervals. This is what we found from the simulations in Sect. 3.2, and it will also be the case for the SPM simulation results we provide later in this section. More real-data analyses are provided in the Supplementary Information.

SPM Simulations
Simulation assessment results (Fig. 5; top panels) demonstrate that the Vc-based SEs were accurate for the ψ-averaged simulation standard deviations in most years. The Vr-based SEs were also usually accurate for the ψ-averaged simulation RMSEs, but they were slight over-estimates in the first half of the assessment time-series for biomass. It would be practically useful to have some indication of when this problem occurs, which is a useful area for future research. Vf-based CIs were unreliable because they do not account for the assessment model estimation bias, which can be large relative to the SEs (bottom panels). This example illustrates problems caused by data limitations; that is, there is only one data point per year before 1980 and two data points per year since 1980, while there are two REs to predict for each year. In this case, l̈_j(ψ, θ_o) in Eq. (10) is close to −Σ^{-1}, since its l̈_c(ψ, θ) component in (2) is small relative to the case where there are many data each year, and hence the biases for the RE predictions are close to minus the true RE values, −ψ (or minus the deviation of the true RE values from the marginal means of the REs). Here all the double dots denote second-order derivatives with respect to ψ. Because there are fewer data for each year before 1980 than after 1980, the biases in the predictions of the REs before 1980 are in general larger than after 1980, but the corresponding SEs tend to be smaller, as indicated in the top panels of Fig. 5, because the model predictions of the REs tend to cluster around their marginal means due to the lack of data. Therefore, there is a substantial decrease in the Bias/RE ratio after 1980 in the lower panels of Fig. 5. Also due to the data limitations, the MSEs of the RE predictors are close to the marginal variances of the REs (i.e., Σ), as suggested by Eq. (11).
Hence, combined with the previous result that the bias approaches −ψ, the distribution of bias/RMSE is close to standard normal, especially before 1980, as evidenced in the bottom panels of Fig. 5. We calculated the average CI simulation coverage probability (CP) over years and the 250 simulated datasets. Unlike the random-walk simulations, the SPM annual-average CPs depended on ψ. The range for the Vr CPs was 0.334-0.997 for biomass (5% and 95% quantiles: 0.829-0.993) and 0.385-0.995 (quantiles: 0.841-0.992) for harvest rates, although the ψ-averages were accurate (0.945 for biomass and 0.947 for harvest rates). Hence, for some values of ψ the Vr CIs can have CPs very different from the nominal 0.95 value, even when averaged over years. However, the ψ-medians were 0.971 for biomass and 0.969 for harvest rates, so for more than 50% of the ψ's the coverage exceeded the nominal level. Simulated parameter estimates (Table 3) demonstrate that the Vc SEs are accurate for the simulation SEs, and the Vr SEs are accurate for the simulation RMSEs. An exception is σ_δH, but the δ_Ht's were fixed at the model predictions when generating the 250 × 250 simulated datasets, and we expect that in this situation the Vr SEs of σ_δH can be inaccurate.

Discussion
We developed frequentist variance approximations for predictions of the REs, γ, in nonlinear mixed-effects models, and for functions of these predictions and of the estimates of the model parameters, θ. We focused on maximum likelihood estimators of θ (i.e., θ̂) and the conditional mean predictor of γ given the data (i.e., γ̂). Our main contribution is a variance approximation for the inferential setting involving repeated sampling of the data when γ, although unknown, is assumed to be fixed. This setting is more appropriate when we are particularly interested in statistical inferences based on the specific values of γ that generated the data. It is also relevant when γ is not a RE but a high-dimensional parameter that is modeled as a RE for nonparametric smoothing; there are well-known connections between smoothing methods and RE models (e.g., [4,34,37]). This is the case for some state-space fish stock assessment models, in which subsets of γ will usually be time-dependent and can be considered complex but smooth functions of time.
We demonstrated in simulations that our Vc variance approximation (Eq. 9) is reasonably accurate. The more commonly used Vr variance approximation (Eq. 4, or equivalently Eq. 11), which is the MSE of γ̂ when γ is random, represents an average level of the MSE when γ is treated as fixed. This "average" interpretation of the MSE also appears in the generalized additive model (GAM) literature, such as Marra and Wood [21] and Wood [36]. However, for particular values of γ the MSE of γ̂ may differ substantially from Vr, depending on the magnitude of the difference between E(γ̂ | γ) and γ, i.e., the bias. We also developed an accurate approximation for this bias (i.e., Eq. 10). We recommend Vr (Eq. 11) for statistical inferences about γ because it reflects an average level of the MSE, although for specific values of γ the MSE may be quite different from the Vr estimate. Although Vc (Eq. 9) is a good approximation of Cov(γ̂ | γ), confidence intervals (CIs) based on Vc have worse coverage probabilities than CIs based on Vr because of the conditional bias in γ̂.
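The relationship among Vc, the conditional bias, and Vr can be made concrete with a minimal toy sketch (an illustrative shrinkage model, not the paper's formulation): conditionally on γ, MSE(γ̂ | γ) equals the Vc-analog plus the squared bias, and its γ-average equals the Vr-analog, which a specific γ can exceed substantially.

```python
import numpy as np

# Illustrative model: n replicate observations per RE, y_i = gamma + eps_i,
# so gamma_hat = k * ybar with k = sg2 / (sg2 + se2 / n).  Conditionally on gamma:
#   Vc-analog: Var(gamma_hat | gamma) = k^2 * se2 / n
#   bias:      E(gamma_hat | gamma) - gamma = (k - 1) * gamma
#   MSE(gamma) = Vc + bias^2, whose gamma-average is Vr = (1 - k) * sg2
rng = np.random.default_rng(1)
sg2, se2, n, nsim = 1.0, 1.0, 4, 200_000
k = sg2 / (sg2 + se2 / n)            # 0.8 here
Vc = k**2 * se2 / n                  # 0.16
Vr = (1 - k) * sg2                   # 0.20

gamma = 2.0                          # a specific fixed RE value
ybar = gamma + rng.normal(0.0, np.sqrt(se2 / n), nsim)
ghat = k * ybar
var_mc = ghat.var()                          # approximates Vc
mse_mc = np.mean((ghat - gamma) ** 2)        # approximates Vc + bias^2 = 0.32 > Vr
```

For this γ the conditional MSE (0.32) is well above Vr (0.20), while for γ = 0 it would be below (0.16); only the γ-average recovers Vr, which is the "average" interpretation noted above.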
The statistical properties of estimates of model quantities derived from nonlinear mixed-effects models can be complex, especially when these quantities are functions of both parameters and REs. In practice, the types of confidence intervals illustrated in our stock assessment model example may be over-interpreted by managers or whoever uses the results for decision making. For users without extensive statistical expertise, the Vr-based confidence intervals may be better and more simply described from the GAM perspective: they are usually slightly conservative when averaged over a large number of years; that is, they are reliable only in the annual-average sense [21,36]. However, this was not always the case in our surplus production model (SPM) simulations, and in some cases (i.e., for some choices of γ) the simulation CI coverage probability could be quite different from the nominal 95% value, even when averaged over years. We suggest that users be cautioned that confidence intervals may be unreliable in specific instances, such as for a particular year. In the stock assessment context, managers may make fishing quota decisions based on the estimated probability that the stock size in the most recent year is greater than a reference value, or that the harvest rate was less than a reference value. Our analyses indicate that these probabilities may be substantially inaccurate in some cases, and it is difficult to know when this occurs. However, we anticipate that better assessment model formulation and the inclusion of more of the available data can improve the accuracy of CIs.
It is well known that the parameters of SPMs may be poorly identified and poorly estimated with the available data. In this case the biases of the parameter estimates and other derived quantities may be highly uncertain, which can produce unreliable statistical inferences. It may be possible to improve inferences by improving the SPM formulation. For our American plaice case study, this could involve better modeling of how harvest rates change over years in light of major changes in fisheries management, such as a fishing moratorium. There is also information available about the differences between the two index catchability parameters (i.e., q_E and q_C) that could be used to improve model estimation. There are other formulations of SPMs that may be better (e.g., [27]). However, major improvements will likely involve very different assessment model formulations. There are substantial data about the length structure of the American plaice stock and fisheries that are informative about how reproduction rates have varied over time. There is also historical age information available that is informative about body growth rates. This information can be utilized within an integrated size-structured assessment model (e.g., [22]) to hopefully produce more precise and reliable estimates and inferences about the stock population dynamics. However, this assessment model formulation research is beyond the scope of this paper.
When developing the formulas in Sect. 2.2, we assumed that γ ∼ f(γ | θ) was correct. However, in simulation self-tests (described in Sect. 1), which are commonly used to examine the reliability of stock assessment models, the data are generated using γ̂, and when the distribution of γ̂ is considerably different from f(γ | θ), our Vc approximation may not be reliable. For instance, the covariance matrix for γ̂ is given by Eq. (9) in this conditional setting and by Eq. (11) of Zheng and Cadigan [38] in the marginal setting, and both can be fairly different from that of f(γ | θ). A complication arises for models that include some γ's with little or no corresponding data, in which case γ̂ may be close to zero and simulation self-tests will not reflect the uncertainty about the value of γ. The SPM simulation results in Sect. 3.3.1 illustrated this case. Even though the true SPM was fit to the simulated data, the biases for the RE predictions were substantial. This raises concerns about the reliability of simulation self-tests that generate data with these highly biased RE predictions, whereby correct models may be identified as inadequate. We will investigate simulation self-tests in a separate paper.
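Why self-tests can understate RE uncertainty can be sketched in the same toy shrinkage model used for illustration (not the paper's assessment model; all names and values are assumptions): the self-test treats the shrunken γ̂ as "truth", so the apparent prediction error within the self-test is smaller than the actual error against the γ that generated the data.

```python
import numpy as np

# Toy model, one observation per RE: real data y = gamma + eps with
# gamma ~ N(0, sg2); EB predictor ghat = k*y with k = sg2/(sg2 + se2).
# A self-test generates y* = ghat + eps* and re-predicts ghat* = k*y*.
rng = np.random.default_rng(2)
sg2, se2, nsim = 1.0, 1.0, 200_000
k = sg2 / (sg2 + se2)                       # 0.5

gamma = rng.normal(0.0, np.sqrt(sg2), nsim)
ghat = k * (gamma + rng.normal(0.0, np.sqrt(se2), nsim))
true_mse = np.mean((ghat - gamma) ** 2)     # actual MSE, (1-k)*sg2 = 0.5

# The self-test treats the shrunken ghat as "truth" when generating new data
ghat_star = k * (ghat + rng.normal(0.0, np.sqrt(se2), nsim))
self_mse = np.mean((ghat_star - ghat) ** 2) # (1-k)^2*Var(ghat) + k^2*se2 = 0.375
```

Here the self-test's apparent MSE (0.375) is below the actual MSE (0.5) because the γ̂ "truths" have variance k·sg2 < sg2; the effect becomes more extreme as the data become weaker and γ̂ collapses toward zero.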
Even though we demonstrated in Eq. (17) that the conditional covariance Eq. (8) gives the asymptotic variance of the regression parameter estimators for Gaussian process semiparametric regression [13], we also found that the parameter estimates are biased to order O(T^{-1/2}) in finite samples under the conditional setup, which is a quite general case for semiparametric inference, and in some cases the bias can be larger (see, e.g., Eq. 79 in Zheng and Sutradhar [40]). Therefore, we suggest that the MSE (Eq. 4) be used for constructing conditional CIs based on finite samples.
The conditional inference discussed in this paper shares some similarity with the GAM framework, and hence the two share many similar results. We have seen this in the interpretation of CI coverage. As another instance, Eq. (4) gives the posterior covariance matrix, V_β, for the optimal coefficients β of the basis functions on page 327 of Wood [36]. Nevertheless, GAM is a large topic and its inference approaches can differ from the maximum likelihood method we are considering, so we do not attempt to build a full connection between GAMs and our results in this paper. We leave statistical inference for GAM and GAMM (generalized additive mixed model) implementations in TMB, based on the results in this paper, as future research.
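The correspondence with the Bayesian posterior covariance V_β can be checked exactly in a stripped-down penalized-regression analogue (a ridge-type sketch under assumed notation, not Wood's full GAM machinery): with a Gaussian prior on the coefficients, the prior-averaged MSE of the penalized estimator equals the posterior covariance σ²(X'X + λI)^{-1}.

```python
import numpy as np

# Ridge analogue: y = X beta + e, e ~ N(0, sig2 I), prior beta ~ N(0, (sig2/lam) I).
rng = np.random.default_rng(0)
n, p, lam, sig2 = 30, 4, 2.0, 1.5
X = rng.normal(size=(n, p))
B = X.T @ X + lam * np.eye(p)
Binv = np.linalg.inv(B)

# Bayesian posterior covariance of beta given y (V_beta analogue)
V_post = sig2 * Binv

# Frequentist MSE of beta_hat = Binv X' y, averaged over the prior:
# beta_hat - beta = (A X - I) beta + A e, with A = Binv X'
A = Binv @ X.T
M = A @ X - np.eye(p)
Sigma_beta = (sig2 / lam) * np.eye(p)
V_avg = M @ Sigma_beta @ M.T + sig2 * (A @ A.T)
```

The identity follows because A X − I = −λ B^{-1}, so the two terms recombine into σ² B^{-1}; this is the penalized-regression version of the "Vr equals average MSE" interpretation discussed above.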
… quantities using γ do not have tildes. The marginal likelihood l(θ) is not influenced by this reparameterization.
The TMB MLE (maximum likelihood estimator) θ̂ is obtained by solving the score equation based on (A.2). Note that ∂l_j(γ̂(θ), θ)/∂γ = 0 by the definition of γ̂(θ) in TMB, and ∂l_j(γ̂, θ)/∂θ denotes the derivative with respect to θ that neglects the dependence of γ̂(θ) on θ. Hence, the score equation becomes l̇(θ̂) = 0 with the derivative taken holding γ̂ fixed. Expanding l̇(θ̂) around the true value θ_o and rearranging terms yields an expansion for θ̂ − θ_o. When γ is MVN with mean μ and covariance matrix Σ, using Eq. (B.1) we obtain the conditional covariance Cov(θ̂ | γ). According to the law of total variance [35, pages 385-386], Cov(θ̂) = E{Cov(θ̂ | γ)} + Cov{E(θ̂ | γ)}. Here we use the concept of ergodicity, which implies that a sufficiently large collection of random samples from a stochastic process can represent the average statistical properties of the entire process (see, e.g., [8]); ergodicity can be regarded as the counterpart of the law of large numbers for stochastic processes. Because the REs and the data are all generated from the true model, by ergodicity, as T gets large the mean over the T units of Cov(θ̂ | γ) converges in probability to the population mean E{Cov(θ̂ | γ)}. In the above derivation we kept the leading term of O(T^{-1}), and hence the approximation order is o(T^{-1}). In (A.6), Cov{θ̂ | γ} appears as a correction to the marginal variance Cov(θ̂), with the correction term being the covariance of I^{-1} ∂l_r(γ, θ_o)/∂θ_o, namely I^{-1} I_r I^{-1}. Therefore I^{-1} I_r I^{-1} should be positive definite. However, because we plug in the estimates θ̂ and γ̂, the computed I^{-1} I_r I^{-1} may not be positive definite. When it is not, we recommend using its nearest positive definite matrix [14], in which the negative eigenvalues are set to 0. Transformation (A.1) indicates that γ̃ is a function of γ and θ, and its property 1 gives l_r(γ̃, θ) = l_r(γ, θ). We can then use the chain rule to express I_r in terms of γ instead of γ̃. In (A.5), because the randomness from γ in c is assumed to be negligible, and the expectation of E{θ̂ − θ_o | γ} with respect to γ is approximately 0, we have E{c} ≈ c ≈ 0. Therefore, the bias is of order O(T^{-1/2}).
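The nearest-positive-definite repair recommended above can be sketched as a simple eigenvalue truncation (the function name and example matrix are illustrative; [14] gives the fuller treatment):

```python
import numpy as np

def nearest_psd(S):
    """Eigenvalue-truncation projection of a symmetric matrix: symmetrize,
    then set negative eigenvalues to 0, as recommended when the computed
    I^{-1} I_r I^{-1} is not positive definite."""
    S = 0.5 * (S + S.T)                  # guard against round-off asymmetry
    w, V = np.linalg.eigh(S)             # eigendecomposition, ascending eigenvalues
    return (V * np.clip(w, 0.0, None)) @ V.T

# Example: an indefinite symmetric matrix (eigenvalues 3 and -1)
S = np.array([[1.0, 2.0], [2.0, 1.0]])
S_psd = nearest_psd(S)
```

Zeroing the negative eigenvalues gives the closest positive semi-definite matrix in the Frobenius norm; in practice a small positive floor can be used instead of 0 if strict positive definiteness is needed.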

Appendix B: Approximation for γ̂(θ_o)
A first-order Taylor series expansion of l̇_j gives Eq. (B.1). Here we assume that the distribution of the REs given the data is approximately MVN, which is necessary for TMB and ADMB to integrate out the REs using the Laplace approximation. This assumption implies that the higher-order terms not shown in Eq. (B.1) are all negligibly small. We assume γ has a multivariate normal (MVN) distribution with mean zero and covariance matrix Σ, which is also −l̈_r^{-1}(γ, θ_o), where l_r(γ, θ_o) is the log-likelihood of γ (see Eq. 2) and the derivatives are with respect to γ. Note that for the MVN, l̈_r(γ, θ_o) does not depend on γ. In Eq. (B.1), l̈_j(γ, θ_o) is equal to l̈_c(γ, θ_o) + l̈_r(γ, θ_o) (see Eq. 2). We assume that E_{D|γ}{l̈_j(γ, θ_o)} ≈ l̈_c(γ, θ_o) + l̈_r(γ, θ_o) in the first-order approximation, which is a fairly general case in ecological and fisheries science modeling. For example, if f(D | γ, θ_o) is MVN and γ are its means, or more generally f(D | γ, θ_o) belongs to the exponential family and γ are its natural parameters, then l̈_c(γ, θ_o) is constant. We use the notation J(γ, θ_o) = −E_{D|γ}{l̈_j(γ, θ_o)}. The approximation we use to derive the covariance approximations of γ̂(θ_o) is … (γ, θ_o), which should involve only the data associated with unit i and its close neighbors. The O_p(T^{-1}) term is mainly (θ̂ − θ_o)^2. θ̂ is based on the data in all T units, and hence as T increases (θ̂ − θ_o)^2 becomes less correlated with A_i. As a result, … applying this result to (C.2), we have that Cov{γ̂ … at the true value θ_o, we obtain after some algebra the same as Eq. (E.3), the result given in He and Severini [13]. Even though this result is somewhat obvious by directly evaluating the covariance of Eq. (E.2) conditional on γ, this example tested the applicability of our results to semiparametric regression models. Furthermore, direct evaluation based on (12) and (E.2) reveals the bias E(β̂ | γ) − β_o = (X'V(θ_o)^{-1}X)^{-1} X'V(θ_o)^{-1} g_γ, which agrees with the bias obtained using Theorem 2.
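The conditional bias formula for the GLS regression coefficients can be verified numerically in a small stand-in example (the design matrix, error covariance, and realized smooth term g_γ below are all illustrative assumptions, not those of He and Severini [13]): with y = Xβ_o + g_γ + e and Cov(e) = V, the GLS estimator has conditional bias (X'V^{-1}X)^{-1}X'V^{-1}g_γ.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, nsim = 40, 2, 20_000
X = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
beta_o = np.array([0.5, -1.0])
g = np.sin(4.0 * np.linspace(0.0, 1.0, n))   # stand-in for the realized g_gamma
V = 0.2 * np.eye(n)                           # error covariance, assumed known
Vi = np.linalg.inv(V)
H = np.linalg.inv(X.T @ Vi @ X) @ X.T @ Vi    # GLS "hat" matrix for beta

# Repeated sampling of the data conditional on the fixed g_gamma
L = np.linalg.cholesky(V)
eps = rng.normal(size=(nsim, n)) @ L.T
est = (X @ beta_o + g + eps) @ H.T            # GLS estimates across simulations
bias_mc = est.mean(axis=0) - beta_o           # Monte Carlo conditional bias
bias_formula = H @ g                          # (X' V^{-1} X)^{-1} X' V^{-1} g_gamma
```

Because H X = I, the error term contributes no bias and the Monte Carlo bias matches H g_γ; this is the direct-evaluation version of the Theorem 2 result mentioned above.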