Greater Than the Sum of its Parts: Computationally Flexible Bayesian Hierarchical Modeling

We propose a multistage method for making inference at all levels of a Bayesian hierarchical model (BHM) using natural data partitions to increase efficiency by allowing computations to take place in parallel form using software that is most appropriate for each data partition. In the first stage, the data component of the full hierarchical model is approximated by a product of independent normal distributions fitted to the separate data partitions. In the second stage, the Bayesian maximum a posteriori (MAP) estimator is found by maximizing the approximated posterior density with respect to the parameters. If the parameters of the model can be represented as normally distributed random effects, then the second-stage optimization is equivalent to fitting a multivariate normal linear mixed model. We consider a third stage that updates the estimates of distinct parameters for each data partition based on the results of the second stage. The method is demonstrated with two ecological data sets and models: a generalized linear mixed model (GLMM) and an integrated population model (IPM). The multistage results were compared to estimates from models fit in a single stage to the entire data set. In both cases, the multistage results were very similar to a full MCMC analysis. Supplementary materials accompanying this paper appear online.


Introduction
Bayesian hierarchical models (BHMs) have become ubiquitous in many fields of scientific enquiry because of their ability to flexibly model complex natural systems yet remain relatively easy to build and modify for the questions at hand (Hobbs and Hooten, 2015). Berliner (1996) clarified the conceptualization of a BHM based on three submodels: (1) data, (2) process, and (3) parameter models. The data model accounts for measurement error or other differences between observable data and the true natural process for which the researcher would like to make inference. The process model describes the state of nature using a random process to account for its unknown intricacies. Finally, the parameter model describes our uncertainty about the parameters governing this random process. The full model is built by conditioning each submodel on the one above it. Traditionally, inference for BHMs has been conducted using some form of Markov chain Monte Carlo (MCMC) because the model structure is ideally suited to it (Gelfand and Smith, 1990; Gelfand and Ghosh, 2015). If not for the development of MCMC methods, widespread use of BHMs may never have come to pass (Green et al., 2015). With a large amount of data or complex submodels, however, MCMC can be computationally challenging or infeasible to implement (Hooten et al., 2021; Wikle, 2003). Herein, we propose a multistage method for making inference at all levels of a BHM using various software formats in each stage. By breaking the estimation procedure into multiple stages based on a partition of the full data, it becomes computationally efficient. By allowing use of the most suitable software platform in the initial stage, it can be made more efficient yet, not only in computing time but also in development time saved by using software designed to facilitate each specific initial-stage analysis.
To alleviate the computational challenges of using MCMC to fit BHMs in "big data" situations, several avenues of approach have developed in the literature. One of these is the development of so-called "2-stage" methods that seek to break the computation into pieces based on partitions of the data (Goudie et al., 2019; Hooten et al., 2016, 2021; Lunn et al., 2013; Mesquita et al., 2020; Scott et al., 2016). The first stage involves fitting separate data models to the partitions using MCMC; then, in the second stage, the separate MCMC samples are combined to produce an MCMC sample as if the partitions had been simultaneously analyzed with the full BHM. A similar divide-and-conquer approach (Srivastava et al., 2018) averages posterior distributions from data partitions to approximate an analysis of the full data.
An alternative to default MCMC for computationally challenging BHMs is to use optimization methods for exploring the posterior parameter space rather than stochastic sampling (Green et al., 2015). Optimization of the posterior density aims to find the maximum a posteriori (MAP) estimate rather than the posterior mean as a point estimate. The main benefit is that optimizing the posterior with respect to the parameters often requires fewer evaluations of the posterior density. That benefit comes at a cost, however, because asymptotic results are usually required to describe parameter uncertainty (Van der Vaart, 2000).
In an effort to provide a computationally efficient and flexible method for fitting a general class of BHMs, conditionally independent hierarchical models (CIHMs; Kass and Steffey 1989), we consider combining the benefits of meta-analysis, 2-stage MCMC, and posterior optimization into a multistage method that can be used to quickly fit BHMs in big data situations or when the data-level models are themselves complex. Although meta-analysis and 2-stage MCMC are both accomplished in multiple stages, 2-stage MCMC was initially designed to analyze a BHM, whereas meta-analysis seeks to summarize results from various analyses by forming them into a BHM, namely, a multivariate normal linear mixed model (Gasparrini et al., 2012). We can use the meta-analysis approach to approximate the second-stage combination of initial-stage inference in a 2-stage MCMC. This allows us to use multiple approaches to approximate the posterior inference in the initial stage, including optimization methods, deterministic sampling, or MCMC if it is convenient.
In what follows, we provide notation and a working definition of the CIHM as we will use it. We show how the full posterior can be approximated using multivariate normal approximations to first-stage posterior densities. Then we demonstrate how the first-stage results can be combined using the linear mixed model approach of standard meta-analysis (Gasparrini et al., 2012; Higgins et al., 2009) when the first-stage parameters are random effects. An optional third stage is introduced that retroactively revisits the first-stage analysis and can improve estimates after the second-stage results are obtained. The method is then demonstrated for two different types of environmental data, using a simple generalized linear mixed model (GLMM) and an integrated population model (IPM) for demographic parameter inference.

Conditionally independent hierarchical model
The description of BHMs by Berliner (1996) is very general; we consider only a specific, but still very broad, class of hierarchical models (CIHMs; Gelfand and Ghosh 2015; Kass and Steffey 1989). The general form of a CIHM is

Data-level model: y_i | θ_i, γ_i, η ∼ [y_i | θ_i, γ_i, η], independently for i = 1, ..., n,
Unit-level model: θ_1, ..., θ_n | ψ ∼ [θ_1, ..., θ_n | ψ],
Parameter model: [ψ] [η] ∏_{i=1}^n [γ_i],    (1)

where "[A|B]" is used to represent the conditional probability density or distribution function of A given B (Gelfand and Smith, 1990), y_i is the observed data set (or data partition) for the ith unit (i = 1, ..., n), θ_i are random unit-level parameters, ψ are population-level parameters for θ_i, γ_i are distinct (relative to each y_i) unit-level parameters for modeling the ith data set, and η are global (fixed) population-level parameters. Typically, ψ and η are the most scientifically interesting; however, there can be interest in each of the θ_i and γ_i depending on the situation. Note that unit-level effects are allowed to be correlated, which is slightly more general than Kass and Steffey (1989).
To place the development of the multistage methodology in context, we demonstrate its use on the analysis of two ecological data sets. The first uses a generalized linear mixed model (GLMM) to assess the effect of wolf (Canis lupus) presence on the occurrence of moose (Alces alces) browse in willow stands. Although these data are not large in the modern sense, we demonstrate a 2-stage analysis of them because the model can easily be fit in its full hierarchical form for comparison. Another common group of models in the ecological literature are integrated population models (IPMs). Initially proposed by Besbeas et al. (2002), IPMs are routinely used to combine information in multiple sparse data sets for improved inference on shared parameters. To demonstrate the 2-stage meta-analysis approach to IPM model fitting, we analyze the classic northern lapwing (Vanellus vanellus) data from Besbeas et al. (2002).

Moose browse in the presence of wolves (GLMM)
This example involves n = 2,914 binary responses, y_ij, which indicate presence of moose browsing on tree j = 1, ..., n_i of plantation i = 1, ..., 24. The original study assessed whether or not tree browsing is reduced in plantations exposed to high wolf activity (van Beeck Calkoen et al., 2018). In addition, the study accounted for inherent moose preferences related to tree height. To perform a similar analysis, we use a random effects model that treats each plantation as a unit to learn about the fixed effect of high wolf presence on moose browsing while accounting for natural variation in moose response across the 24 plantations. The CIHM is given by

y_ij | p_ij ∼ Bernoulli(p_ij),
logit(p_ij) = δ_0 + δ_wolf w_i + u_i + (δ_h + v_i) h_ij,
u_i ∼ N(0, σ_1^2),   v_i ∼ N(0, σ_2^2),

where h_ij equals the height of the jth tree in the ith plantation and w_i is a binary indicator of high wolf presence in the ith plantation.
To use the 2-stage approach we must hierarchically center the model because δ_0 and δ_wolf w_i are not identifiable within plantation units. Using hierarchical centering (Gelfand et al., 1996), the model can be rewritten as

logit(p_ij) = θ_0i + θ_hi h_ij,
θ_0i ∼ N(δ_0 + δ_wolf w_i, σ_1^2),   θ_hi ∼ N(δ_h, σ_2^2).

In terms of the CIHM specification (1), ψ = (δ', σ_1^2, σ_2^2)'. The parameters γ and η are not used.

Lapwing demographic modeling (IPM)
The typical IPM (Schaub and Abadi, 2011) has the CIHM form

y_i | γ_i, η ∼ [y_i | γ_i, η], independently for i = 1, ..., n,

where γ_i are data-type-distinct parameters and η are common parameters. Traditionally, there are no random unit-level effects (θ_i) in IPM models. Multiple data sets are used to obtain better estimates of the common parameters, η, than can be obtained from any single y_i.
The lapwing data consist of n = 2 data sets: a ring-recovery data set, y_1, and an index measure of the abundance of adult female birds obtained from a generalized additive model of counts, y_2. These data have been analyzed previously to demonstrate IPM methodology for ecological data (Besbeas et al., 2002; Besbeas and Morgan, 2019; Brooks et al., 2004; Goudie et al., 2019).
Ring-recovery models are based on marking animals and releasing them back into the wild.In subsequent years, the marked animals are then recovered after they have died.
The likelihood for this type of model is a product-multinomial in which the numbers of ring recoveries in the years following release are multinomially distributed with cell probabilities that are functions of φ_t (for t = 1963, ..., 1997), the probability of survival from time t to t + 1, and λ_t, the probability that an animal that died between t − 1 and t is recovered. To assess the influence of weather on survival and trends in recovery, the ring-recovery parameters are modeled with

logit(φ_yt) = η_y0 + η_y1 x_t,
logit(φ_at) = η_a0 + η_a1 x_t,
logit(λ_t) = γ_λ0 + γ_λ1 t,

where φ_yt is the survival of first-year birds, φ_at is adult female survival, and x_t is the number of days below freezing.
The census index data consist of noisy measures of female adult lapwing abundance in the study area. To make inference about population dynamics we use the state-space model of Besbeas et al. (2002),

N_{t+1} | N_t ∼ N(T_t N_t, Q_t),   y_t | N_t ∼ N(N_at, σ_y^2),

where N_t = (N_yt, N_at)' is the number of yearling and adult females, respectively, in year t, y_t is the noisy count of adult females, and Q_t and σ_y^2 are process and observation variances. The transition matrix, T_t, is parameterized with φ_t and the rate of offspring production, ρ_t. The survival parameters are the same as in the ring-recovery model, but production is modeled on the log scale using

log(ρ_t) = γ_ρ0 + γ_ρ1 t.

Full model details are provided in online appendix S2.
The IPM melds the information in the two data sets to allow inference for all parameters of the joint model,

[η, γ_1, γ_2 | y_1, y_2] ∝ [y_1 | γ_1, η] [y_2 | γ_2, η] [γ_1] [γ_2] [η].

Random unit-level θ parameters are not traditionally used, but see the analysis in Section 4.2 for an alternate version. In each of the previous studies that analyze these data, bespoke code was written in software such as Matlab, JAGS (Plummer, 2003), NIMBLE (de Valpine et al., 2017), or some other language to optimize the joint likelihood (integrated over latent abundance) or execute an MCMC. For a rather simple IPM like this one, the model is straightforward to code. But for more complicated capture-recapture or abundance models this can become an impediment for practitioners. Many of the individual submodels in IPMs have software available for fitting them separately. For the ring-recovery data in this analysis, there has been production-level software to fit the model for over 20 years. Thus, we should be able to utilize this substantial development in capture-recapture data analysis.

Methods
We propose a 2-stage method (extended to an optional third stage) for analyzing CIHM models (1). In the first stage, each data set, y_i, is analyzed individually according to its data-level model in the CIHM using the best method and software available for that particular data type. Then, in the second stage, the results are combined (melded) using the unit-level and population-level models to approximate the inference as if the entire CIHM were fit in one step (e.g., via MCMC). In an optional third stage, we can revisit estimates of the distinct unit parameters, γ_i, to see if further updating is possible.

Two-stage approximate Bayesian inference
If we were to fit the CIHM model in the traditional way, we would use summaries from the full posterior distribution

[θ, γ, ψ, η | y] ∝ { ∏_{i=1}^n [y_i | θ_i, γ_i, η] [γ_i] } [θ | ψ] [ψ] [η],    (2)

where θ = (θ_1', ..., θ_n')', γ = (γ_1', ..., γ_n')', and y = (y_1', ..., y_n')'. Suppose, however, we only possess a single y_i. Then we might estimate ϑ_i = (θ_i', γ_i', η')' using the posterior distribution

[ϑ_i | y_i] ∝ [y_i | ϑ_i] [ϑ_i],    (3)

where [ϑ_i] is a prior distribution that would be used in the absence of additional individuals. Using these individual posterior distributions, we can rewrite the full hierarchical model posterior as

[θ, γ, ψ, η | y] ∝ { ∏_{i=1}^n ([ϑ_i | y_i] / [ϑ_i]) [γ_i] } [θ | ψ] [ψ] [η].    (4)

Stage I of the method we propose begins by finding estimates ϑ̂_i and covariance matrices Ŝ_i such that we can approximate

[ϑ_i | y_i] ≈ N(ϑ_i | ϑ̂_i, Ŝ_i)    (5)

for each data set or partition. This approximation is motivated by standard large-sample asymptotic results. For large-sample theory to motivate the approximation, [y_i | ϑ_i] should satisfy the standard regularity conditions on the likelihood function, namely, the "true" ϑ_i is not on the edge of the parameter space and [y_i | ϑ_i] is continuously twice differentiable (Kass and Steffey, 1989; Le Cam and Yang, 2000). However, if the log-likelihood is approximately quadratic in ϑ_i, then the approximation can work well even for modest sample sizes (Geyer, 2005).
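As a concrete sketch of the stage I normal approximation, the snippet below forms the MLE and inverse-information covariance for a single simulated logistic-regression partition. It is written in Python with illustrative simulated data; the paper's examples use R tooling, and the model and numbers here are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Illustrative data partition: one unit's logistic-regression data model.
n = 500
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * x)))
y = rng.binomial(1, p)

def negloglik(beta):
    """Bernoulli negative log-likelihood for intercept/slope beta."""
    eta = beta[0] + beta[1] * x
    return np.sum(np.log1p(np.exp(eta)) - y * eta)

# MLE (equivalently, the posterior mode under a flat temporary prior).
fit = minimize(negloglik, np.zeros(2), method="BFGS")
theta_hat = fit.x

# Large-sample covariance: inverse of the observed information, computed
# here from the analytic logistic information matrix X' W X.
mu = 1.0 / (1.0 + np.exp(-(theta_hat[0] + theta_hat[1] * x)))
X = np.column_stack([np.ones(n), x])
S_hat = np.linalg.inv(X.T @ (X * (mu * (1 - mu))[:, None]))
```

The pair (theta_hat, S_hat) defines the normal approximation to this partition's posterior, which is exactly the summary that stage II consumes.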
A natural estimate and covariance matrix for use in (5) are the maximum likelihood estimate (MLE) of ϑ_i and, for Ŝ_i, the negative inverse Hessian of the log-likelihood. For now we assume that the size of y_i is large enough, and the parameters identifiable enough, that ϑ̂_i is well defined and Ŝ_i^{-1} is nonnegative definite. The main benefit of using the MLE is that there is virtually always software available for finding the MLE and large-sample covariance matrix for common data-level models.
If ϑ_i is only weakly identified in [y_i | ϑ_i], then the MLE may be difficult to calculate. Moreover, for nearly unidentifiable parameters in [ϑ_i | y_i], asymptotic approximations can be poor even for large samples because, no matter how large the data set is, the posterior will be nearly (or exactly) equal to the prior distribution. Therefore, because we are free to choose the temporary prior to aid stage I model fitting, we suggest using a normal prior

[ϑ_i] = N(ϑ_i | m_i, V_i).    (6)

A reparameterization of ϑ_i might be necessary so that a normal density is appropriate as a prior, but this will help the accuracy of the normal approximation in stage I even for poorly identified parameters. Note that the support of the temporary prior needs to be the same as that of the desired parameter model. With the use of the temporary prior, the approximation becomes

[ϑ_i | y_i] / [ϑ_i] ≈ N(ϑ_i | ϑ̌_i, Š_i) / N(ϑ_i | m_i, V_i) ∝ N(ϑ_i | ϑ̂_i, Ŝ_i),    (7)

where ϑ̌_i and Š_i are the posterior mode (or mean) and negative inverse Hessian (or covariance matrix) of [ϑ_i | y_i], and

Ŝ_i^{-1} = Š_i^{-1} − V_i^{-1},   ϑ̂_i = Ŝ_i (Š_i^{-1} ϑ̌_i − V_i^{-1} m_i).

Substituting (7) into (4) yields the stage II approximate posterior

[θ, γ, ψ, η | y] ≈ { ∏_{i=1}^n N(ϑ_i | ϑ̂_i, Ŝ_i) [γ_i] } [θ | ψ] [ψ] [η].    (8)

This ratio was also used by Goudie et al. (2019) in an approximate 2-stage MCMC method.
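When both the stage I posterior summary and the temporary prior are normal, dividing one density by the other reduces to subtraction on the precision scale. A minimal numpy sketch with toy numbers (not from the paper):

```python
import numpy as np

# Stage I posterior mode/covariance obtained under a temporary normal
# prior N(m, V); dividing the normal approximation of the posterior by
# the temporary prior leaves another (unnormalized) normal whose
# precision and mean are obtained by subtracting precisions.
post_mean = np.array([0.8, -0.3])          # toy posterior mode
post_cov = np.array([[0.05, 0.01],
                     [0.01, 0.08]])        # toy posterior covariance
prior_mean = np.array([0.0, 0.0])          # temporary prior mean m
prior_cov = np.eye(2) * 4.0                # temporary prior covariance V

post_prec = np.linalg.inv(post_cov)
prior_prec = np.linalg.inv(prior_cov)

# "Likelihood-only" precision and mean: the summary passed to stage II.
S_inv = post_prec - prior_prec
S_hat = np.linalg.inv(S_inv)
theta_hat = S_hat @ (post_prec @ post_mean - prior_prec @ prior_mean)
```

Recombining (theta_hat, S_hat) with the temporary prior recovers the original posterior summary, which makes a convenient unit test when implementing this step.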

Normal random unit-level effects
While there is no requirement that the unit-level random effects, θ, be normally distributed in (8), this is the case in the vast majority of applications, or θ_i can often be parameterized such that a normal distribution is appropriate. Thus, for the remainder of the paper, we assume that [θ | ψ] = N(θ | µ, Σ), where ψ = (µ', σ')' and Σ is a covariance matrix parameterized by σ. When this is the case, stage II becomes equivalent to fitting a multivariate normal random effects linear model.
First, we reparameterize the model using θ_i = µ + u_i, where u_i ∼ N(0, Σ) and u = (u_1', ..., u_n')'. Substituting this into (8) gives the stage II model

[µ, σ, η, γ, u | y] ∝ { ∏_{i=1}^n N(ϑ̂_i | ϑ_i, Ŝ_i) [γ_i] } N(u | 0, I_n ⊗ Σ) [µ] [σ] [η],    (9)

which has the form of the multivariate linear mixed model

ϑ̂_i = X_i β + Z_i u_i + ε_i,   ε_i ∼ N(0, Ŝ_i),    (10)

where β = (µ', η', γ')' and X_i and Z_i are known design matrices. In the analysis of the two ecological data sets, we use the R package TMB to fit (9), but it is just a multivariate linear mixed model, so any software that can account for the parameter prior distributions can be used. TMB uses Laplace's method to find the MAP of ψ = (µ, σ), η, and γ using the marginal posterior [µ, σ, η, γ | y]; then a generalized delta method (Kass and Steffey, 1989) is employed to estimate u while accounting for uncertainty in the other parameters (Kristensen et al., 2016; Skaug and Fournier, 2006). At first it might seem as if we are adding another layer of approximation by using the Laplace method; however, given the normal approximation already made in (8) and the normality of the unit-level parameters, the Laplace approximation to the marginal posterior density of (µ, σ, η, γ) | y obtained from (9) is actually exact because the log-posterior is quadratic in u (Goutis and Casella, 1999). After finding the MAP, traditionally, the inverse Hessian of (9) is used to estimate the covariance matrix of the full model, which is what we use in the examples that follow.
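The paper fits stage II with TMB in R; as a scalar illustration of the same structure, the Python sketch below treats simulated stage I summaries as responses in a normal random-effects model and maximizes the marginal likelihood θ̂_i ∼ N(µ, σ² + S_i), the classic meta-analytic form. All data and parameter values here are simulated assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)

# Simulated stage I output: unit-level estimates and their (known) variances.
n_units = 40
mu_true, sigma_true = 1.0, 0.5
theta = rng.normal(mu_true, sigma_true, n_units)    # latent unit effects
S = rng.uniform(0.02, 0.2, n_units)                 # stage I variances
theta_hat = theta + rng.normal(0, np.sqrt(S))       # stage I estimates

def neg_marg_loglik(pars):
    """Negative marginal log-likelihood of theta_hat_i ~ N(mu, sigma^2 + S_i)."""
    mu, log_sigma = pars
    v = np.exp(2 * log_sigma) + S
    return 0.5 * np.sum(np.log(v) + (theta_hat - mu) ** 2 / v)

fit = minimize(neg_marg_loglik, np.array([0.0, 0.0]), method="BFGS")
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])

# Stage II "prediction" of each unit: precision-weighted shrinkage of the
# stage I estimate toward the estimated population mean.
w = sigma_hat**2 / (sigma_hat**2 + S)
theta_tilde = w * theta_hat + (1 - w) * mu_hat
```

The shrinkage step mirrors how the stage II mixed model pools information across partitions: units with large stage I variance S_i are pulled hardest toward the population mean.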

Stage III: revisiting stage I
Using the description to this point, we are set to perform 2-stage inference for a CIHM.
However, we noticed in our initial analysis of the lapwing data (forthcoming Section 4.2) that when Ŝ_i^{-1} ≈ 0 (relative to ϑ_i) for a particular unit in stage I, updating γ̂_i in stage II might not be as effective as desired. Therefore, we propose revisiting stage I estimation to update γ_i given the results of the second stage, using the generalized delta method initially proposed by Kass and Steffey (1989). Note that we are not proposing that stage III is always necessary, but that it is an option, especially if γ_i is not well identified in the data model alone.
For a heuristic example of the lack of updating of γ̂_i, consider the simple 2-parameter model structure

[y_1 | η] [y_2 | η, γ_2] [η] [γ_2].

Further, suppose that η and γ_2 are jointly only vaguely identifiable in [y_2 | η, γ_2], whereas η is fully identifiable in [y_1 | η]. In stage I, we can obtain well-informed estimates η̂_1 and Ŝ_1.
However, for y_2, Ŝ_2^{-1} will be small. In the extreme case that Ŝ_2^{-1} → 0, one can show (through the generalized least-squares estimate of β = (η, γ_2)' in (10)) that η̂ → η̂_1 for any η̂_2, whereas γ̂_2 remains unchanged from stage I. So, we see satisfactory updating of η̂_2, but no updating of γ̂_2, in stage II. This case is admittedly pathological, but it illustrates that the Hessian matrix of [y_2 | η, γ_2] evaluated at the stage I estimates may not be the best representation of the likelihood shape at the full MAP.
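This limiting behavior can be checked numerically. The toy numpy sketch below (illustrative numbers, with a diagonal Ŝ₂ so the off-diagonal route for updating γ̂₂ is absent) combines the two stage I summaries by generalized least squares:

```python
import numpy as np

# Toy stage I summaries: eta is well identified in y1; the (eta, gamma2)
# summary from y2 carries precision c, which we let shrink toward zero.
eta1_hat, S1 = 0.00, 0.01            # informative estimate of eta from y1
eta2_hat, gamma2_hat = 0.80, -0.50   # weakly informed estimates from y2

def gls_combine(c):
    """GLS estimate of beta = (eta, gamma2) with Shat2^{-1} = c * I."""
    z = np.array([eta1_hat, eta2_hat, gamma2_hat])
    X = np.array([[1.0, 0.0],   # eta1_hat measures eta
                  [1.0, 0.0],   # eta2_hat measures eta
                  [0.0, 1.0]])  # gamma2_hat measures gamma2
    W = np.diag([1.0 / S1, c, c])   # precision weights
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ z)

strong = gls_combine(100.0)   # y2 informative: eta pools toward eta2_hat
weak = gls_combine(1e-8)      # Shat2^{-1} -> 0: eta collapses to eta1_hat
```

As the weight c on the y₂ summary shrinks, the pooled η̂ collapses to η̂₁ while γ̂₂ sits at its stage I value; with a diagonal Ŝ₂⁻¹ there is no off-diagonal precision through which y₁ could move it, matching the discussion of Ŝ_i^{-1} below.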
Stage III proceeds by first marginalizing over γ_i. This can be done analytically if possible; however, given the initial normal approximation, after ϑ̂_i and Ŝ_i are calculated the rows and columns associated with γ_i can simply be ignored when moving to stage II. Let ξ represent all parameters except γ_i. After stage II, we obtain the marginal posterior approximation

[ξ | y] ≈ N(ξ | ξ̃, V_ξ),

where V_ξ is the negative inverse of the Hessian of the log-posterior distribution represented by the stage II model (9) and the associated prior distributions. If γ_i is a nuisance parameter and of little interest, we can stop here and there is no need for stage III. However, if an estimate of γ_i is desired, we can obtain an updated mean and covariance matrix for the full posterior approximation such that

[γ_i, ξ | y] ≈ N((γ_i', ξ')' | (γ̃_i', ξ̃')', Ṽ),

where γ̃_i is the value that maximizes the conditional posterior [γ_i | θ̃_i, η̃, y_i] and Ṽ is an updated posterior covariance matrix. To calculate Ṽ, we need to define the matrices

C_i = −H_i^{-1}   and   J_i = ∂γ̃_i(ξ)/∂ξ,

where H_i is the Hessian of the log conditional posterior of γ_i evaluated at (γ̃_i, ξ̃), and γ̃_i(ξ) is the conditional maximizer as a function of ξ. Now the full covariance matrix can be approximated by (Kass and Steffey, 1989; Skaug and Fournier, 2006)

Ṽ = [ C_i + J_i V_ξ J_i'    J_i V_ξ ]
    [ V_ξ J_i'              V_ξ    ].

This may not be applicable in all situations, because one must be able to evaluate [γ_i | θ̃_i, η̃, y_i] up to a proportionality constant, as well as its gradient and Hessian, for various parameter values. This may not be practical for most production software used in stage I.
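When the joint posterior is exactly normal, a covariance assembly of this conditional-plus-delta-method form is exact. The Python sketch below (toy covariance matrix, scalar γ) recovers the marginal variance of γ from the conditional variance plus the delta-method term J V_ξ Jᵀ:

```python
import numpy as np

# Toy joint posterior for (gamma, xi): exactly multivariate normal,
# with gamma in the first coordinate and xi in the remaining two.
Sigma = np.array([[1.0, 0.6, 0.2],
                  [0.6, 2.0, 0.5],
                  [0.2, 0.5, 1.5]])
S_gx = Sigma[:1, 1:]                 # Cov(gamma, xi)
S_xx = Sigma[1:, 1:]                 # V_xi: marginal covariance of xi

# Quantities the stage III machinery would supply:
# J = sensitivity of the conditional mode gamma_tilde(xi) to xi,
# C = conditional variance of gamma given xi (inverse curvature).
J = S_gx @ np.linalg.inv(S_xx)
C = Sigma[:1, :1] - J @ S_gx.T

# Delta-method assembly of the marginal variance of gamma.
V_gamma = C + J @ S_xx @ J.T
```

For a truly normal posterior, V_gamma reproduces the exact marginal variance; the error of the stage III update therefore comes only from departures of the conditional posterior from normality.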
One could question whether θ_i might suffer the same lack of sufficient updating in stage II as γ_i does if the likelihood is generally flat. However, this is much less likely because there is a unit-level model, [θ | ψ], which is explicitly fit in the second stage. This allows information in other data sets to directly influence and update θ_i at stage II. In contrast, y_j, j ≠ i, can only contribute to updates of γ̂_i via the off-diagonal elements of Ŝ_i^{-1}.

Moose browse GLMM
To effectively use the 2-stage approach in this example, it was necessary to use a stage I temporary prior, (6) and (7), because browsing is rare (or absent) in some of the plantations. We used a normal g-prior (Hanson et al., 2014), a normal prior with mean zero and covariance proportional to (H_i' H_i)^{-1}, where H_i is the usual design matrix with a column for the intercept and another containing tree height (h_ij). This prior produces p_ij that are approximately uniform(0,1) distributed for a randomly selected h_ij. The effect of this prior was removed prior to stage II using (7).
For stage I model fitting we used two approaches. In the first version (A), we fitted each individual model using the R package mgcv (Wood, 2011) to obtain the posterior mode and Hessian-based covariance matrix from [ϑ_i | y_i]. In the second version (B), we used the approximate posterior mean and covariance matrix obtained from a deterministic posterior sampling procedure detailed in Johnson et al. (2011) (see online appendix S1).
This deterministic procedure estimates the posterior mean and variance without asymptotic results. For version B, the likelihood must be calculated with bespoke code. We are not aware of any production-level software that returns only the value of the posterior for logistic regression, which makes this version more challenging in practice than the first, but we explore it here nonetheless. In stage II we used TMB (Kristensen et al., 2016) to fit the linear mixed model in (10). Code for this example, as well as for the IPM example, is provided in an online supplement.
We examined the predictive estimates θ̂_i = X_i β̂ + ũ_i, fixed effects β̂, and variance parameters (log σ̂_1, log σ̂_2) with respect to a full MCMC analysis. Both versions of the 2-stage approach produced results similar to a gold-standard MCMC analysis using all the data simultaneously (Figures 1 and 2). Heuristically, for both θ_i and Σ, the mean and covariance matrix obtained from stage I, version B (deterministic posterior sample), produced estimates more similar to the full-model MCMC estimates than those derived from the Bayesian MAP fitting with mgcv, but only slightly so. Scientific inference from any of the three methods would be identical, especially for the fixed parameters β.

Lapwing IPM
To demonstrate how we might utilize production software, in the first stage we fitted the ring-recovery model using MARK (White and Burnham, 1999) via the user-interface R package RMark (Laake, 2013). MARK provides the MLE, α̂_1 = (η̂_1', γ̂_1')', and the large-sample covariance matrix, Ŝ_1. For the state-space model of the female count data, we used bespoke model code written for analysis with the R package TMB (Kristensen et al., 2016). However, the state-space model is the simpler of the two data models in terms of programming complexity. From TMB we obtain the MAP estimate and large-sample covariance matrix, α̂_2 = (η̂_2', γ̂_2')' and Ŝ_2. Because the survival and production parameters are not well identified in the state-space model, we used vaguely informative temporary stage I prior distributions. For (η_y0, η_y1) and (η_a0, η_a1) we used independent normal g-priors to induce approximately uniform priors on φ_yt and φ_at (Hanson et al., 2014). For the production parameters, (γ_ρ0, γ_ρ1), we used a normal g-prior such that ρ_t lies within [0.01, 2] with prior probability ≈ 0.95. Based on other analyses, this prior covers a fairly broad range of values for the average number of females produced per female. For the remaining parameters, temporary priors were not used. In stage II, we used the linear mixed model in (10) along with the population prior distributions [η_y0, η_y1, η_a0, η_a1] = N(0, 10^2 I), [γ_λ0, γ_λ1] = N(0, 10^2 I), [γ_ρ0, γ_ρ1] = N(0, 10^2 I), [log N_y0, log N_a0] = N(0, 1000^2 I), and [log σ] ∝ constant (improper).
We also consider an additional complexity beyond the traditional IPM specification.
One of the tenets of the IPM is that the multiple data models share common parameters of scientific interest, in our case, the survival parameters, η. It seems reasonable, however, that differences in data collection procedures or model specification might imply separate "true" values of the survival parameters; that is, η̂_i is actually estimating some η_i ≠ η. So, one might consider each [y_i | η, γ_i] to be an imperfect source of information about η. This is how meta-analysts would traditionally envision combining these separate studies.
Therefore, we also fit a model with random effects θ_i ∼ N(η, σ_θ^2 I). If the available data suggest common survival parameters between the two data sets, then σ̂_θ will be small and all the stage II estimates will satisfy θ̂_i ≈ η̂. However, if σ̂_θ is well away from 0, then a fixed-η model is probably inappropriate and there are possible biases in one or both of the data sets. Thus, σ_θ serves as a regulator for assessing the degree to which the data sets support a fixed η value. For this model we used [σ_θ] = Exponential(1) in addition to the previous prior specifications.
To judge the quality of the posterior approximation, we compared our results to those presented by Besbeas and Morgan (2019), hereafter B&M. Table 2 (p. 483) of that publication provides the results of a full joint analysis for several different models and inference methods. Our specification is most similar to their "MLE KF" model, that is, a model where log N_y0 and log N_a0 are explicitly estimated. We compare our results only to that model (Figure 3), but the other results are very similar. All of the multistage approximate fitting methods produced results very similar to the B&M results (Figure 3).
A full table of the multistage results is available in Appendix B. For the 2- and 3-stage fixed-η models, estimates of η were virtually indistinguishable from those given by B&M, in both point estimates and standard deviations. The band-reporting parameters, γ_λ, are also nearly identical. From stage II to stage III, γ̂_ρ1 became more similar to the B&M results while γ̂_ρ0 became less similar, although both are close to the B&M results overall. A marked improvement in estimates from stage II to III is evident in the log σ, log N_y0, and log N_a0 parameters, with all of them becoming nearly equivalent to the B&M estimates after stage III.
We now turn our attention to the random effects version of the model. The estimate of σ_θ was σ̂_θ = 0.017 (0.006–0.049), which is small but not trivial. Thus, there is some evidence that each data set contains different information about the true survival process. The population-level effect of frost days on survival becomes slightly steeper and more uncertain (Figure 3) for young animals. For adult animals, frost days have less effect on survival, but this too is more uncertain. The increase in uncertainty and the change in the frost effect are plainly evident when examining survival on the natural scale (Figure 4).
Although they did not fit separate survival processes, Goudie et al. (2019) noted this discrepancy as well in their 2-stage MCMC analysis of these same data.

Discussion
We have proposed a multistage method for estimating parameters in a broad class of BHMs. Our approach is computationally efficient and flexible, and it draws inspiration from traditional meta-analysis, in which individual studies using differing methods contribute unit-level parameter estimates that are then combined using a CIHM format. In stage I, practitioners can use software expressly designed for a particular data model. We use traditional GLM/GAM software in the first example, and a mixture of production software (MARK/RMark) and template-based software (TMB) in the IPM example. This increases the efficiency of the analysis in development time because software expressly designed for the data at hand is being used. Analysts do not have to create bespoke MCMC code that will often be inferior to production-level software in terms of ease of use and computational speed. In addition, as with other 2-stage methods, the stage I analyses are "embarrassingly parallel," so they can be executed in parallel if necessary, further increasing efficiency. In stage II, the multivariate linear mixed model is very quick to analyze because the dimension of the problem is substantially reduced from that of the data models. The linear mixed model represents a compression of the information in the various data sets about the parameters. Finally, the generalized delta method allows one to revisit the estimates of distinct parameters, which is usually not possible in traditional meta-analysis because the raw data are not available.
This efficiency comes at a cost, however, because it relies on approximations in two places: the normal approximation in stage I and the normal approximation of the stage II posterior distribution. We do not view this as a detriment because this method is designed for computationally challenging situations. These usually occur in settings with a large amount of data and hence a substantial amount of information about the parameters, that is, situations for which the large-sample approximations were developed. Both of the models in our examples could be fitted readily in a single-stage BHM analysis. Even so, the multistage method produced estimates very close to the "gold standard" full BHM estimates. The 2-stage approach may not be optimal in all settings; thus, if one is capable of fitting the full CIHM in one step, that would be the recommended course of action. We suggest our 2-stage approach for instances where computation or development of software to fit the full model at once is impractical.
Even though the approximations produced good results in our examples, we should examine the seriousness of these approximations and what options practitioners have for checking the degree of error they introduce. The first approximation is the Gaussian approximation of the data-level likelihood shape in the first stage. This approximation is justified by the standard large-sample results that coincide for the MLE and posterior modes (MAP) and means. There are some structural situations in stage I models where the assumptions underlying those results would be questionable. Gelman et al. (2013, Section 4.2) present several scenarios where the assumptions of large-sample theory fail to be met. Most of these situations can be fixed with appropriate reparameterization or prior selection, for example, specifying parameters on the log or logit scale, or using the Gaussian temporary prior in (6).
If the data-level models are sufficiently regular, one may still want to assess whether the sample size is sufficient to warrant a Gaussian approximation. We note that in a variety of previous studies this assumption has often worked well with little technical justification (e.g., Geyer 2005; Goudie et al. 2019; Scott et al. 2016). If the validity of the MLE or MAP estimates and the Hessian-derived covariance are in question, one can use bootstrap bias and variance corrections to improve the results (Geyer, 2005; Scott et al., 2016). One can also increase data size by combining the original data partitions where appropriate. To ascertain whether the sample size is too small for the normal density shape to be appropriate, one could also perform an MCMC analysis of some of the smallest units to see how the posterior density appears. If the approximating normal posterior is used as a proposal distribution in a simple independence Metropolis-Hastings algorithm, then the acceptance rate can be used to judge the closeness of the approximation; for example, an acceptance rate of 0.9 (i.e., 90% of proposals accepted) would imply that the stage I approximation is very close to the actual stage I posterior shape (see also Michelot et al., 2019). If some partitions remain poorly approximated, the stage II model can incorporate weights to reflect this; although it is then no longer a traditional linear mixed model, it is still relatively easy to fit because the weights are known.
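The acceptance-rate diagnostic is easy to script. The Python sketch below uses a Gamma(50, 1) density as a stand-in stage I posterior and its moment-matched normal approximation as an independence Metropolis-Hastings proposal; the target and tuning values are illustrative assumptions, not from the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Stand-in "stage I posterior": Gamma(50, 1), nearly quadratic on the
# log scale at this effective sample size, so the normal should do well.
target = stats.gamma(a=50.0)
mode = 49.0                      # Gamma(a, 1) mode is a - 1
sd = np.sqrt(50.0)               # curvature-based scale
proposal = stats.norm(mode, sd)  # the stage I normal approximation

# Independence Metropolis-Hastings using the normal approximation as
# the proposal; the acceptance rate gauges closeness of approximation.
draws, accept = [mode], 0
for _ in range(5000):
    cur = draws[-1]
    cand = proposal.rvs(random_state=rng)
    log_ratio = (target.logpdf(cand) - target.logpdf(cur)
                 + proposal.logpdf(cur) - proposal.logpdf(cand))
    if np.log(rng.uniform()) < log_ratio:
        draws.append(cand)
        accept += 1
    else:
        draws.append(cur)

rate = accept / 5000   # high rate => normal approximation is close
```

A rate near 1 indicates the normal approximation tracks the posterior closely; a low rate flags partitions that may need reparameterization, more data, or a temporary prior.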
In stage II, when using the generalized delta method for estimating θ and the stage III method for reassessing γ, we make the additional assumption that [µ, σ, η, γ|y] is approximately normal. Given that we have already assumed the stage I posteriors are approximately normal, this new approximation does not add a substantial new element of error to the analysis: for a given value of the random-effect variance parameters σ, with normal random effects and the original normal assumption, the stage II posterior is multivariate normal (assuming normal priors on β). If needed, however, a bootstrap, MCMC, or other method could be used for the stage II inference instead of the MAP and Hessian.
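The delta-method step can be sketched numerically. In the toy example below (hypothetical stage II estimates and Hessian-based covariance for $(\eta, \log\sigma)$; not the authors' code), the variance of a derived quantity is obtained from the gradient of the transformation:

```python
import numpy as np

# Hypothetical stage II MAP estimates and covariance for (eta, log_sigma);
# the numbers are illustrative only.
est = np.array([1.0, -0.7])
V = np.array([[0.02, 0.001],
              [0.001, 0.05]])

def g(pars):
    # Derived quantity of interest: sigma itself, back-transformed.
    eta, log_sigma = pars
    return np.exp(log_sigma)

# Delta method: Var[g] ~ grad(g)^T V grad(g), with a numerical gradient.
eps = 1e-6
grad = np.array([(g(est + eps * e) - g(est - eps * e)) / (2 * eps)
                 for e in np.eye(2)])
var_g = grad @ V @ grad
print(g(est), np.sqrt(var_g))
```

The same recipe applies to any smooth function of the stage II parameters, such as the unit-level θ estimates.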
We developed the proposed multistage method under the assumptions of the CIHM (1); however, some of the independence assumptions in the CIHM can be relaxed at the data level and the method remains valid in its current form. First, the data do not need to be strictly independent at the unit level. That is, the multistage method can still be applied when the data model can be partitioned as
$$[y \mid \theta] = \prod_{i=1}^{n} [y_i \mid y_{-i}, \theta_i],$$
where $y_{-i} = \{y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n\}$. The important part of any multistage analysis is not a meaningful partition of the data per se, but the ability to partition the data-level model (Hooten et al., 2021). In stage I, these dependent data models can be fit individually in the same way.
Hierarchical Bayesian models have brought a wealth of analytical capability for extracting scientific signal from complex and noisy data. Moreover, they provide an honest assessment of uncertainty by accounting for many different forms of stochasticity and relationships in the observable data. By assembling relatively simple components, a complex model can emerge. With the advent of modern data collection capabilities, however, default methods for inference have started to become obsolete. Gelfand and Ghosh (2015) note that BHMs have been a great benefit to science, but the current rate of data acquisition and model complexity are "beginning to stretch our computational capabilities," and future inference will have to be the result of "an artful mixture of model specification and approximation." Hierarchical conditioning provides a beneficial way to build large, complex models, so it is sensible that inferential computing for these models may also be best accomplished in a hierarchical fashion (McCaslin et al., 2021).
Using the stage I normal approximation $N(\hat\theta_i \mid \vartheta_i, \hat S_i)$ for $[\vartheta_i \mid y_i]/[\vartheta_i]$, the desired full posterior density function for all the parameters can be approximated (up to a proportionality constant) by
$$[\psi, \eta, \gamma, \theta \mid y] \approx \prod_{i=1}^{n} N(\hat\theta_i \mid \vartheta_i, \hat S_i)\,[\theta \mid \psi]\,[\gamma]\,[\psi, \eta]. \tag{8}$$
Note that in (8) we have put $\vartheta_i$ in the mean position of the normal specification. By the symmetry of the normal density function this is equivalent, but written this way $\hat\theta_i$ can be interpreted as an observation. In stage II we can make full inference for all model parameters by maximizing the approximate full posterior (8) to obtain the posterior mode $(\tilde\psi, \tilde\eta, \tilde\gamma, \tilde\theta)$, or maximum a posteriori (MAP) estimate; the negative inverse Hessian of the log posterior density estimates the posterior covariance matrix. Hats are used to represent stage I estimates and tildes to represent stage II estimates. For the remainder of the paper we will assume that some measure of location, $\hat\theta_i$, and scale, $\hat S_i$, are available, where $\hat S_i^{-1}$ is nonnegative definite. These can come from any combination of MLEs, MAPs, or MCMC samples; whatever software is most appropriate or available for the data model can be used, and different methods can be used for different stage I units.
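As an illustration of maximizing (8), consider the simplest case of scalar unit parameters with a normal process model $\theta_i \sim N(\eta, \sigma^2)$. In the sketch below (hypothetical stage I summaries, flat priors on $\eta$ and $\log\sigma$; an illustration, not the authors' code), normality lets the random effects integrate out analytically, so the stage II MAP reduces to a two-parameter optimization, after which the $\tilde\theta_i$ are recovered as precision-weighted (shrinkage) combinations:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical stage I summaries (theta_hat_i, S_hat_i) for n = 5 units,
# as would be produced by independent unit-level fits.
theta_hat = np.array([0.5, 1.1, 1.6, 0.3, 1.4])
S_hat = np.array([0.05, 0.04, 0.06, 0.08, 0.05])

def neg_log_post(pars):
    # Marginal likelihood: theta_hat_i ~ N(eta, S_hat_i + sigma^2),
    # with flat priors on eta and log_sigma assumed for this sketch.
    eta, log_sigma = pars
    v = S_hat + np.exp(2 * log_sigma)
    return 0.5 * np.sum(np.log(v) + (theta_hat - eta) ** 2 / v)

fit = minimize(neg_log_post, x0=np.array([theta_hat.mean(), np.log(0.5)]),
               method="BFGS")
eta_tilde, log_sigma_tilde = fit.x
post_cov = fit.hess_inv  # approximate Hessian-based posterior covariance

# Stage II random-effect estimates: shrink each stage I estimate toward
# the estimated population mean, weighting by the two precisions.
w = (1 / S_hat) / (1 / S_hat + np.exp(-2 * log_sigma_tilde))
theta_tilde = w * theta_hat + (1 - w) * eta_tilde
print(eta_tilde, np.exp(log_sigma_tilde), theta_tilde)
```

Units with small $\hat S_i$ keep estimates close to their own $\hat\theta_i$, while noisy units are pulled toward $\tilde\eta$, which is the usual mixed-model behavior the paper appeals to.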

Figure 1: Random effect inference for Example I using 2-stage and full MCMC methods.

Figure 2: Fixed effect and variance parameter inference for Example I using 2-stage and full MCMC methods.

Figure 4: Estimates of year- and age-specific survival for lapwings. The different age classes are shown in the rows (adult and first year), and the two models in the columns. The first column is the traditional IPM, where the survival parameters are common between the data sets (i.e., $\eta$ is fixed). In the second column we modeled $\theta_i \sim N(\eta, \sigma_\theta^2 I)$, where $\sigma_\theta$ is estimated in the second stage.