Assessing Ecosystem State Space Models: Identifiability and Estimation

Hierarchical probability models are being used more often than non-hierarchical deterministic process models in environmental prediction and forecasting, and Bayesian approaches to fitting such models are becoming increasingly popular. In particular, models describing ecosystem dynamics with multiple states that are autoregressive at each step in time can be treated as statistical state space models (SSMs). In this paper, we examine this subset of ecosystem models, embed a process-based ecosystem model into an SSM, and give closed form Gibbs sampling updates for latent states and process precision parameters when process and observation errors are normally distributed. Here, we use simulated data from an example model (DALECev) and study the effects changing the temporal resolution of observations on the states (observation data gaps), the temporal resolution of the state process (model time step), and the level of aggregation of observations on fluxes (measurements of transfer rates on the state process). We show that parameter estimates become unreliable as temporal gaps between observed state data increase. To improve parameter estimates, we introduce a method of tuning the time resolution of the latent states while still using higher-frequency driver information and show that this helps to improve estimates. Further, we show that data cloning is a suitable method for assessing parameter identifiability in this class of models. Overall, our study helps inform the application of state space models to ecological forecasting applications where (1) data are not available for all states and transfers at the operational time step for the ecosystem model and (2) process uncertainty estimation is desired.


Introduction
Many ecological prediction and forecasting applications use mechanistic process-based models to simulate the dynamics of ecosystems (Luo et al., 2011).These process models are typically discretizations of ordinary or partial differential equations describing how the system dynamics evolve over time and space.Further they are often linearly conditioned on the value at the previous time step.These models are especially important in applications where available data limits the use of empirical models or predictions are being made into novel conditions not captured in existing data.However, process-based models can be challenging to calibrate due to the large numbers of parameters, and thus robust uncertainty estimation is also difficult (Luo et al., 2009).
While the importance of quantifying uncertainty in process models is recognized (Dietze et al., 2018), it can be challenging to fully account for all important sources of uncertainty.
Research to date has largely focused on estimating and reducing uncertainty that arises from initial conditions, parameters, and observational noise, usually through data integration techniques (see Jiang et al., 2018;White et al., 2019;Baracchini et al., 2020, for examples).Estimating and propagating these sources of uncertainty without process uncertainty assumes that the process model perfectly describes the temporal evolution of the ecosystem up to error in the collection of observations.To estimate process uncertainty, state space modeling frameworks (Hamilton, 1994) are increasingly used to account for stochastic elements in the system evolution.Since many ecosystem process models already account for initial condition uncertainty, parameter uncertainty, and observational uncertainty, it is straightforward to convert them into a state space framework by adding an uncertainty structure to the underlying process model.
The Bayesian state space paradigm is a well-suited approach to estimate distributions of parameters and latent states (model states that are not directly observed) in ecosystem models using observations.As a result it has seen a growing use in ecological forecasting applications (see Thomas et al., 2017;Dowd and Meyer, 2003, for examples).State space models treat all forecast terms as probability distributions and allow for more effective quantification, partitioning, and propagation of uncertainty in models.Prior distributions on parameters allow ecosystem scientists to enforce strict upper and lower bounds on parameters and incorporate biological information into the modeling process in a principled way.This focus on uncertainty and process precision estimation prompted us to choose a Bayesian framework over a point estimate focused method such as the Kalman filter (Kalman, 1960).
The parameter and latent state posterior distributions in Bayesian state space models are typically estimated using Markov Chain Monte Carlo (MCMC).
The added flexibility of the state space model does come with drawbacks.Analyzing ecosystem models as state space models increases the number of parameters that require estimation (i.e., parameters describing the distribution of the process uncertainty) and adds the requirement to estimate latent states for each ecosystem model state.The addition of latent state estimation can add anywhere from tens to hundreds of thousands of additional parameters to estimate because ecosystem models commonly use a daily time step that requires an additional parameter per state for each day of the simulation.Additionally, a parameter is required for describing the process variance for each of the states in the model.
Finally, data for ecosystem models may only be available at timescales that are less frequent than the model timestep (i.e., annual or greater).These large gaps between observed data may present a challenge to constraining both the latent states and process variances.
Furthermore, identifiability (or equifinality (Luo et al., 2009)) is a common concern when using Bayesian methods to estimate parameters in ecosystem process models, which often have many highly correlated parameters.Although problems with parameter identifiability do not necessarily impair latent state or process variance estimation, these parameters correspond to properties of ecosystems and often have important physical interpretation.
Therefore, it is crucial to ensure that they can be successfully identified and estimated.
Data cloning has been used to assess identifiability of parameters in both phylogenetic models (Ponciano et al., 2012) and ecological models (Lele et al., 2007).DC is done by applying Bayesian inference to a dataset that is constructed by duplicating the initial dataset r times.As r increases, the resulting posterior parameter estimates approach the maximum likelihood solution.Data cloning can used to determine whether parameters are nonestimable or unidentifiable through a visual investigation of posterior plots with increasing values of r (Ponciano et al., 2012).Parameters are said to be non-estimable when there argmax Θ L(θ|X), i.e. there are multiple sets of parameter values that maximize the likelihood function (Lele et al., 2010;Rannala, 2002;Rothenberg, 1971;Ponciano et al., 2012;Cole, 2020).Ponciano et al. (2012) introduce terminology for various situations where we have parameters in the model that are not estimable: non-separability, lack of information, nonidentifiable, and identifiable but not estimable.We use Ponciano et al. (2012)'s definitions for these terms throughout the remainder of this paper and so we introduce these definitions here.Non-separability occurs when the model is structured such that it is not possible to separate parameters from one another, and may be due to parameter redundancy (see also Cole, 2020).Parameters that are non-estimable due to non-separability are referred to as nonidentifiable (NI).Lack of information occurs when the dataset does not contain sufficient information about the parameters to estimate them, resulting in wide posterior distributions that have not been properly constrained by observed data.Parameters that are nonestimable due to lack of information are referred to as INE (identifiable but not estimable).A combination of data cloning and effective sample size (ESS) estimation leads to a thorough analysis procedure for ecosystem state space models that has not commonly been applied.
To address challenges estimating latent states and parameters when applying Bayesian state space modeling frameworks to ecosystem models, we present a simulation study using a forest ecosystem state space model that predicts carbon cycling among multiple states (a.k.a. "stocks" in the carbon cycling model).Using synthetic data with introduced data gaps, we address four questions focused on estimating variance parameters, latent states, and process parameters. 1) Under what temporal gaps in data does it become difficult to estimate the process variance?2) Is it possible to recover the true latent states under different temporal gaps in data and under scenarios for availability of data describing the transfers among states (i.e., fluxes)? 3) Can we increase the sampling efficiency for the variance of latent states in these cases by disconnecting the time step of the latent states from the time step of the model?and 4) Can we determine when model parameters are identifiable using data cloning?Our study is designed to help inform the application of ecosystem state space models to ecological forecasting applications where process uncertainty estimation is desired, and where data is not available for all stocks and transfers at the ecosystem model operational time-step (i.e., a daily time-step), such as data collected through the U.S. National Ecological Observatory Network (NEON).

Process Model
We used the Data Assimilation Linked Ecosystem Carbon model designed for simulating forests composed of evergreen trees (DALECev, Williams et al., 2005).It is a simple model describing carbon dynamics (Figure 1) and is similar to other ecosystem models used in data assimilation and carbon stock forecasting applications, for example PnET (Aber and Federer, 1992), 3PG (Landsberg and Waring, 1997), TECOS (Xu et al., 2006).The model can be written as a set of equations that are approximately linear and autoregressive in time.
While DALECev has been widely used (Williams et al., 2005;Smallman et al., 2017;Fox et al., 2009;Bloom and Williams, 2015) it is not traditionally fit as a state space model as we do here by adding a process error term into the model.).The DALECev model includes 11 process parameters, p i , for i ∈ 1 . . .11, where each parameter represents the daily rate of an ecological process (such as turnover, decomposition, or soil organic matter mineralization), an allocation of a particular flux (transfer rates between stocks), or a parameter used in the calculation of a flux.
Fluxes represent a number of physical processes that move carbon through the ecosystem, including respiration (R), photosynthetic allocation (A), turnover (L), and transfer to another stock (D).The model uses a submodel, the Aggregated Canopy Model (ACM) from Williams et al. (1997), to simulate the input of carbon through gross photosynthetic production (GPP; G in Equations 1 -3).Following Fox et al. (2009), all parameters in the ACM submodel were fixed except for p 11 , therefore G is a function of p 11 and meteorological driver inputs D (t) .D (t) includes maximum and minimum temperatures, radiation, and atmospheric carbon dioxide for day t.For the DALECev model, a given carbon stock C s at time t can be generically expressed as the carbon at time t − 1 minus the respiration (carbon lost from the system) or transfer to another stock, plus carbon gained by through the allocation of photosynthesis (carbon gained from outside the system) or transfer from another stock, where for stock s: Here t−1,s is an error term that acknowledges that our knowledge of the physical process is not perfect, and that there are normally-distributed stochastic deviations from the mean behavior.
Thus, our specific system of equations that simulates how model states change through time can be written as: where T (t) is the average temperature for day t.These updates are referred to as the process model, f (•).For any carbon stock C the process model can be written in the form where A t , b t are coefficients that can vary with time.Any stocks that cannot be written in this way can be approximately written in this form using a linearization.
The eleven different fluxes, the building blocks for Equations 1 -5 in the DALECev model are: t) , p 11 )p 2 (7) These fluxes can be combined to form net fluxes.Unlike some of the individual fluxes or stocks, some of these net fluxes are measurable.One important net flux is net ecosystem exchange (NEE) that is measured using eddy covariance techniques (Baldocchi, 2014) and deployed in many ecological observation networks (Metzger et al., 2019).NEE is the net of G and Equations 7, 14, and 15 and is given by: Additionally, soil respiration, S r , is a net flux that is commonly measured in ecosystem studies.S r is the net of autotrophic respiration by roots (a component of Equation 7) and heterotrophic respiration by soil micro-organisms(Equations 14 and 15): For this study we have fixed the value of c to be 0.3.In practice c must either be specified or given a very strong prior, as it can be challenging to constrain by other available data.

State Space Model
We estimate the stocks for the DALECev model using a state space model (Hamilton, 1994;Petris et al., 2009;Durbin and Koopman, 2012;Auger-Méthé et al., 2020).In the state space framework, we treat the five carbon stocks as components of the state vector, and the additional flux data collected on respiration, photosynthetic allocation, turnover, and transfers as operations on the state vector.Let C denote the vector of carbon stocks from the model and C obs denote the observations of the stock, with observations at a subset of time points I ⊂ {1, ..., T }.Then, Equations 1 -5 can be written compactly using matrix notation as: where To relate our observations for an arbitrary carbon stock s,obs to the latent carbon stock s , we assume the following relationship: In an ecological context, we are assuming that our observed carbon stock is normally distributed and unbiased, with a center at the true (latent) carbon stock, and a fixed precision τ s .Similarly to how adding the error term for the DALECev model was an acknowledgement of imperfect process knowledge, adding an error term for the observations is an acknowledgement of measurement error in the data that we observe.
The state space model has two key assumptions: the state process is first order Markov, and the observation model is independent conditional on the latent states.Using normally distributed error terms for the process model and the observations, we can write these assumptions using the matrix notation above as: with all τ parameters assumed to be known.This assumption is not uncommon in terrestrial carbon models, as the measurement error is generally well understood.Fixing the measurement error can also lead to better estimation of other precisions, process parameters, and states (see Auger-Méthé et al., 2016, for a thorough discussion on this).The combination of a linear process model, normally distributed process error, and normally distributed measurement error means that we are fitting DALECev as a Normal Dynamic Linear Model (NDLM) (West and Harrison, 1997).
The fluxes are modelled with a simple observation model.For a given flux F j , with flux data collected at a subset I j ⊂ {1, ..., T } and observation F j,obs , we assume the relationship With models assigned for our physical process, observations, and fluxes, we can write the full likelihood of the model as: Many prior studies using terrestrial carbon models do not fit process error (see Jiang et al. (2018) for example), instead using observational uncertainty in the estimation of the process model likelihood.The state space approach used here is designed specifically to help isolate the process uncertainty from observational and parameter uncertainty.This partitioning of uncertainty is critical in understanding the system because no one source or type of uncertainty is likely to dominate total model uncertainty across all of ecological applications (Dietze et al., 2018) and they influence forecast uncertainty in different ways, e.g., process uncertainty propagates from one timestep to another while observation uncertainty does not.
The Bayesian state space paradigm outlined here allows for quantification of multiple sources of uncertainty (process, initial condition, and observation) in the context of temporal gaps in observations, and the state space model gives a natural setting to leverage observed data with process based models.

Inference for parameters and latent states
We estimate the stocks, process precisions, and process parameters for the DALECev model using a Bayesian state space model (Hamilton, 1994;Durbin and Koopman, 2012;Petris et al., 2009;Auger-Méthé et al., 2020).Process parameters, process precisions, and latent states were estimated using MCMC (Robert and Casella, 2005).MCMC is a flexible method that uses Markov chains to generate samples of the parameters from their posterior distribution.Parameter uncertainty is usually inherently included in MCMC methods, and the samples from the posterior can be used to generate credible intervals for where the true values of the parameters lie.Equipped with the model likelihood from 21, we need to choose prior distributions for the process parameters, process precisions, and initial conditions.Prior choices for the process parameters are assumed to be uniform with limits informed by the range of values gathered from expert opinion in the Reflex project supplemental material (Fox et al., 2009) to reflect the change in sites to the Talladega National Forest (see description below).The values for p (L) and p (U ) can be found in Table 1.Each process precision was given a univariate conjugate Jeffreys prior (Jeffreys, 1946), to allow for closed form Gibbs sampling of the process precision parameters.Thus the priors are given by Building on the likelihood and priors, we can derive the full conditional distributions for all latent stocks and and precision parameters.The full condition distributions for latent carbon stocks at interior (between the initial and final) time steps with observed data can be written as: The full condition distributions for latent carbon stocks at interior time steps without observed data at those time points can be written as: The full conditional distributions for the initial latent state and final latent states are: where 1 T ∈I is an indicator function that is equal to 1 if there is an observation for C k at the final time point, and 0 otherwise.Finally, the full conditional distributions for the precisions are: where Γ is the univariate gamma distribution using the rate parameterization.
We estimated the posterior distributions of the parameters using MCMC in the R programming language (R Core Team, 2016), specifically the Random Walk Metropolis Hastings (RWMH) (Metropolis et al. (1953); Hastings (1970)) with a block updating algorithm.We sampled process parameters using a univariate proposal distribution during the initial burnin phase (2,000 iterations).After burn in, we sampled parameters for 500 iterations where we jointly sampled highly correlated process parameters using a truncated normal proposal that accounts for their covariance.We recalculated the empirical covariances used in the block updates every 500 iterations.We updated the analytic full conditional distributions given in Equations 25 -29 for the latent states, process precisions, and initial conditions using a Gibbs sampler (Geman and Geman, 1984).Process parameters were updated using the RWMH.Including burn-in, 10,000 total posterior samples were collected.
Careful attention was paid to the initialization of the latent states for the MCMC.We initialized the unobserved latent states in one of two ways.For small data gaps a Gaussian process (Kennedy and O'Hagan, 2001) was fit using the laGP package (Gramacy, 2016) in R (R Core Team, 2016).For large data gaps a particle filter was used (as implemented in Gordon et al. (1993)).The Gaussian process has the advantage of not requiring initial estimates for the process parameters and process precisions at the cost of not incorporating knowledge of the ecosystem model and certain properties of the carbon stocks, such as positivity.However, with low observation data frequency, the particle filters use of the process model helps to provide better initial estimates of latent states, speeding up convergence.To initialize the particle filter, maximum likelihood estimates for the process parameters are computed from the flux data and stock observations.In the event of minimal flux data, a space filling design, in this case a Latin Hypercube sample (LHS) (McKay et al., 1979), of the process parameters is used to create sets of process parameters.These sets of process parameters are then run through a bootstrap particle filter using the initial condition distributions, with the process precision set to a fixed percentage of the mean of the initial condition distribution.
The combination of process parameters and latent states with the highest likelihood from the corresponding LHS and bootstrap filtering are then used for their respective initializations.

Simulation study design
We designed a simulation study to evaluate the ability of standard MCMC methods to estimate process precisions, latent states, and process parameters for the DALECev model, and to identify and address potential problems that may arise when using these methods for the types of (sometimes sparse) data that are available.More specifically, we had three primary objectives.The first was to look at when we can expect precision estimates to break down, and whether changing the timestep of the latent states can help to alleviate issues with estimation.The second was to examine how well the model recovers the latent carbon stocks with only annual observed data, which is the type of data that we can expect to collect to parameterize models like DALECev.Third we wanted to assess parameter identifiability (via data cloning) when fitting the models to different data that are available and use this information to help inform data collection schemes.
We began by generating a set of synthetic datasets for use in our analysis.Our simulation study was created to emulate conditions at the Talladega National Forest in Alabama, U.S.A (32.95046 • N, -87.39327 • W).We chose this site for two reasons.First, the site has a canopy dominated by evergreen tree species (longleaf pine (Pinus palustris), loblolly pine (Pinus taeda), and slash pine (Pinus elliotti )) that matches the canopy type expected by the DALECev model.Second, the site is part of the National Ecological Observatory Network (NEON) and thus has ongoing data collection that can be used in future applications of the methods described here.For the synthetic data set, initial conditions and driver data for reflective of what would be expected at Talladega.Exact values for the process parameters used in the simulations can be found in Table 2. Random initial conditions were drawn from their respective prior distributions.At each time step process noise is added to the states, with observational noise added to the latent states at the end of the model run to create a dataset of observations.Data gaps for synthetic datasets were created by removing observations that are not in the analysis time step.

Effects on estimates of precision and the latent states of varying the state process time resolution
The ability to estimate process precision is crucial in ecosystem models, as that is often the main source of uncertainty (Dietze et al., 2018;Thomas et al., 2017).Poor estimation of process precision may lead to more uncertainty in estimates of process parameters and of latent states, which can then affect forecasts and make them unreliable.For models like DALECev, gaps between observations of the states are commonly greater than 1 year, a much slower time scale than the assumed process dynamics, resulting in many unobserved states.In order to reliably apply DALECev in practice, process precision and latent state estimation should be robust to annual or longer data gaps.
To determine when the estimation of process precision becomes difficult, we examined three different observation scenarios: daily state observations, monthly state observations, and yearly state observations, each with daily flux observations.Initial conditions were drawn from the prior distributions, and driver data from the site was used to run each model for two years.We repeated the generation of the synthetic dataset 15 times.For each dataset, observations were removed to introduce synthetic data gaps that matched the different observation scenarios.
We evaluate the performance of precision estimation using the effective sample size (ESS) (Robert and Casella, 2009) of the Markov chains for the precision parameters as a metric.
Samples from Markov chains are auto-correlated, meaning that they are correlated with previous samples in the chain.The ESS gives us a metric for determining how efficiently our chain has estimated parameters/states, with low ESS informing us that our samples are highly correlated and to proceed cautiously with using the posterior samples for prediction or interpretation.
To evaluate ESS under different gaps in data, we used MCMC (as described above) to estimate posterior parameter distributions for each dataset and analyzed the process precision MCMC chains using ESS as a measure of sampling efficiency.We identified the data gaps where a large degradation in sampling efficiency occurred.
It also can be difficult to obtain information about latent states when there are large gaps between observed data points.We explored whether changing the latent state timestep is a solution for alleviating problems with low effective sample sizes without introducing estimation problems for process parameters.These problems may arise from differences in the flux data and observation data likelihoods having different time steps.To analyze these differences, the model was run with a daily latent state resolution and was compared to the model being run with a monthly latent state resolution.
Consider an NDLM on a daily process model for an arbitrary carbon stock C, with a process model of the form: Let T * = {t i , i = 1, .., I} be a proper subset of the timesteps of the model.For an NDLM with a process model of the form in Equation ( 30), state transitions can be rewritten in the form This allows the stocks of the model to operate on a different time step than the fluxes, so that daily flux information can be used without requiring estimation of a large number of latent states that have very little data to constrain them.It also gives the model more flexibility, allowing models to change time steps for inference purposes, to decrease computational costs, and to allow for varying time steps across the stocks themselves.For values for A that are constant or similar through time and approximately uniformly spaced entries of T * , we may treat the precision in Equation ( 31) as a fixed value φ * .Using this approach, we follow the advice given in Auger-Méthé et al. ( 2020), which says "make simplifying assumptions when data are limited".
To examine how sampling efficiency is influenced by changing the time step at which latent states are estimated, we used two different synthetic datasets.The first synthetic dataset is a set of of 30 synthetic time-series that have monthly carbon stock observations and daily flux observations.The second synthetic dataset is a set of 30 synthetic time-series that have annual carbon stock observations and daily flux observations.Each synthetic dataset was analyzed using both a daily time step and a monthly time step in the state process model.We compared the effective sample sizes of process precisions between the different observation regimes to assess whether changing the time step of latent state estimation increases sampling efficiency of process precision parameters.

Impact of observation gaps on estimation of latent states
While efficient estimates of process precision are important, the primary goal of an ecological state space model is to accurately track the evolution of the latent states through time.This involves both latent state estimation and estimation of the underlying process parameters.
A particular focus was placed on annual carbon stock availability, as it is the most realistic case when working with actual data.
Latent state estimation was assessed by creating synthetic datasets with daily, monthly, and annual carbon stock data availability.Posterior estimates of the la.The latent state estimates were compared to their true (synthetic) values.Synthetic datasets were analyzed using both a daily model time step and the monthly model time step for the latent states.
Coverage rates of latent state posterior distributions were assessed to confirm that they reach roughly the nominal rate.This was achieved by checking if the true latent state values were contained in the 95% highest posterior density intervals, and taking the average for each stock over all time steps.
We also assessed convergence of process parameters.MCMC inference was conducted on identical datasets with four different starting conditions for the process parameters.We then used the Gelman Rubin test statistic (Gelman and Rubin, 1992) across the MCMC chains to assess convergence of process parameters.

Parameter identifiability under differing data availability
Identifiability of parameters was assessed using data cloning to analyze three synthetic datasets with annual carbon stock observations.Each dataset had different levels of flux data observations: 1) all fluxes observed; 2) fluxes available from NEON with GPP data; 3) only fluxes available from NEON with NEE data.These were chosen so that we could compare the ideal case to data that would be commonly available for terrestrial carbon models.Our MCMC inference proceedure was performed on each of the synthetic datasets with r = 1, 5, 25 data cloning replicates, and posterior distributions of p 2 , p 3 , p 4 , p 11 were analyzed across datasets and data cloning replicates.
Revisiting Equations ( 9) and ( 10) from Section I, we see that A r and A w give additional information for p 2 , p 3 , and p 4 , a set of parameters that are highly correlated due to their entanglement in the carbon update equations.The absence of one or more of these fluxes, like in the case of using NEON data only, may make these parameters difficult to identify.For scientists, a data cloning analysis can serve more purposes than just assessing identifiability of parameters.Simulated data can be used a priori to determine what data is most important to collect in their experiment.

Estimating process precision under data gaps
As would be expected, we found that increasing the time between observations leads to a noticeable decrease in MCMC sampling efficiency (Figure 2).The criteria we used to assess whether or not the posterior samples size was large enough for for inference was an ESS of 200 or higher.By this metric, both monthly and annual observation data give ESS's that are too low to make credible posterior inference.In particular, yearly observed data produces precision estimates with very low effective sample sizes.This is particularly concerning, as annual observations are the most realistic of the three data availability scenarios we explored.This finding motivates the need to alter the sampling methods in order to increase the sampling efficiency of these precision parameters.

Comparison of Precision Estimation by Observation Frequency
Observation Frequency Log Effective Sample Size

Effects on estimates of precision and the latent states of varying the state process time resolution
In Figure 3 we compare the effective sample size of several simulations with different timesteps.
We found that changing the model timestep from daily to monthly improves the estimation of process precision for both monthly and annual data gaps.This showcases the increased performance for the change of the latent state timestep as observation sparsity increases.
For both monthly and annual data observations, the monthly timestep model performs significantly better.As expected, effective sample sizes for precisions decrease across timestep type as observation sparsity increases.An important goal of analyzing ecosystem process models from a state space framework is to track and predict the evolution of latent states through time.In Figure 4 we show posterior latent state estimates for carbon stock data observed annually, with all flux data observed daily and the model running on a monthly time step.Nearly all of the posterior distributions of the process precisions in Figure 4 show right skewness.This may come from a trade-off between model precisions, where one precision being over-estimated leads to another precision being under-estimated.Another possibility is that our independence assumption between carbon stocks is being violated.

Monthly
Some carbon stock updates, including C lit and C som , involve other carbon stocks.This may be causing an underestimation of precision, which in turn affects the estimation of other process precision parameters.

Parameter identifiability under differing data availability
Our analysis of the data cloning results involves the following set of considerations.First, for identifiable parameters, we expect that as r increases, the variance of the resulting estimate decreases.This can be seen when the resulting posteriors grow tighter around the mean as values of r get larger.Second, identifiable but non-estimable parameters (INE) are parameters that may be identifiable, but do not have a necessary amount of data to tease out the precise values.These are characterized by relatively flat posterior distributions (Ponciano et al., 2009).Finally, parameters that are non identifiable (NI) tend to have multi-modal posterior distributions, indicating that there are several values of the parameter that produce a high value of the likelihood.Functions of multiple non identifiable parameters can be estimable, but the individual parameters themselves are not.See (Ponciano et al., 2012) for a simple example.
We found that data cloning served as an effective way to assess identifiability of parameters.The simulation study showed that we expect to be able to identify p 3 and p 4 using data that is available to ecosystem modelers via NEON.More specifically, in Figure 5 (top row) we show that the posterior narrows as r increases for p 3 -that is, it is identifiable with the observed flux data.We can also see that without additional flux data to help constrain p 4 , the parameter was identifiable but non-estimable (INE) (Figure 5, middle row).Although the uncertainty for both p 3 and p 4 is considerably lower in the NEON GP P case (Figure 5, right two columns), the parameters are not centered near the true values.The NEON GP P case does show that p 11 can be estimated from data, but it is not estimable with only the NEON N EE data.As r, the number of times the data is cloned, was increased to 25 replicates, the posterior estimates for p 4 started to become more similar.However, it was not centered around the true values used for the simulation, which may be due to a lack of flux data available to constrain them.p 4 is likely to be INE -it is identifiable, but is non estimable given the collected data.p 11 continues to be tightly centered around the true value for the NEON GP P case when as r increased.
The results of our data cloning analysis demonstrate that NEON flux data will require additional flux observations in order to estimate both the p 3 and p 4 parameters.Posterior estimates of these parameters from the initial dataset are quite different, with the NEON GP P case having comparatively tighter estimates.A potential cause for these differences may be due to NEON GP P 's ability to estimate p 11 well from the start and not getting stuck near a local minimum.As the number of data cloning replicates increases, the posteriors begin to look very similar.Under the NEON GP P case, p 11 is identifiable, while with flux data from NEON N EE, p 11 appears to be identifiable but non estimable, with the posterior remaining relatively uniform even with r = 25 replicates.Our findings illustrate one of the shortfalls of using data cloning: though we are able to determine whether parameters are estimable or identifiable, we cannot be sure that our chains have converged to the true value.

Discussion
Estimating the posterior distributions of parameters in multi-state state space models can be challenging when observations of the states are not readily available.This is especially apparent in ecological models where states are often sampled at more coarse intervals than the model time-step, resulting in many latent states without direct constraint from observations.Here we introduce a method for changing the timestep used for generating latent states in processes model so that it is coarser than the operational timestep of the process model (i.e., timestep of the difference equations).Our analysis revealed a large increase in the efficiency of estimation of the posterior distributions of the process precision parameters when using a coarser time-step of latent state estimation, while producing latent state estimates that achieved the nominal coverage rate.One strength of the approach we present is that it preserves the operational time of the process model used to simulate the ecosystem dynamics.As a result, no adjustments were required to the process model.
Another strength is that the equations used to change the timestep of the latent states do not require the new timesteps to be equally spaced, giving the flexibility to allow the latent states of the model to operate on any time scale.However, changes to the latent state timestep do influence the interpretation of the process uncertainty parameters because they represent the distribution of process error that propagates from one latent state to the nextlonger time-intervals between latent states will likely lead to process error distributions with larger variance.
Beyond data gaps in time, gaps in data where particular states and/or fluxes are never observed presents challenges in the ability to estimate the posterior distributions of parameters (identifiablility).To examine identifiablility in the focal ecosystem model (DALECev), we confirmed that we were able to successfully recover process parameters and process precisions when all states and fluxes were observed at all time steps and then in the case where there were annual temporal gaps in the observations of states.This indicates that under ideal data collection the parameters are identifiable using the approach presented here.However, a lack of identifiability occurred when a subset of the flux data was not available to constrain model parameters, as is the case in applications using real observations.In this case, our approach had difficulty recovering multiple process model parameters in the DALECev model.
In particular, p 2 , p 3 , and p 4 were difficult to estimate without all of the related fluxes used to constrain them.These parameters govern the proportional allocation of photosynthesis (GPP) to respiration, foliage, and roots (Equations 1 -3), thus requiring observations of their individual production in order to constrain the individual parameters.
Our inference about identifiability of process parameters was based on the application of data cloning.Other methods of assessing identifiability were considered, including Hessian methods (Viallefont et al., 1998;Little et al., 2009) and symbolic algebra methods (Cole, 2019;Cole and McCrea, 2016).Hessian methods and symbolic algebra methods were deemed unfit due to the time series of the model being long, which would lead to problems with numerical stability using the Hessian method (Bulla and Berzel, 2008) and lack of computational resources to perform the symbolic algebra calculations in MAPLE (Cole, 2019).
Identifiability is a problem that has long plagued ecological and biological modeling (Luo et al., 2009), and data cloning is a simple method that can be used with simulated data prior to the design of an experiment to assist the design data collection schemes that mitigate identifiability challenges.Our simulation study used data cloning with observed flux data that would be available from NEON, and showed that additional flux data is required to constrain a subset of model parameters.It also illustrated that while some parameters are shown to be identifiable through data cloning, they are not necessarily centered at the true parameter values.The types of data measured at a NEON site is not atypical for a terrestrial ecosystem study, particularly those in the Ameriflux and Fluxnet networks.With simulation results showing that some parameters are not identifiable or identifiable but nonestimable, it is crucial that scientists have access to methods to help them assess whether they can trust the results they obtain from their modeling framework.
Data cloning has other uses aside from assessing identifiability and aiding in experimental data collection for ecosystem modelers.A well documented problem with soliciting prior distributions for parameters in Bayesian analyses is that non-informative prior distributions on one scale may become highly informative prior distributions when transformed (see Lele (2020) and references therein for a thorough treatment).These falsely non-informative priors can lead to biases in parameter estimation and prediction, in turn leading to incorrect decisions made by stakeholders and policy makers (Lele, 2020).Data cloning methods can be used to expose accidental biases introduced through using these priors that are non-informative on one scale, as the data cloning posterior will approach the maximum likelihood solution as r increases, and maximum likelihood estimation is invariant to re-parameterizations.
Our study focused on the development and evaluation of methods, and sets the foundation for future work.First, while simulation studies with data synthesized from the DALEC model is necessary to test our methods, it is important to test the performance of these methods using real observations.Second, the results for latent state updates discussed here have been for the univariate case where covariance between states is not considered, though the states can be updated en bloc with multivariate normal Gibbs updates.Multivariate latent state updates could be complemented with conjugate Gibbs updates for the covariance matrix, allowing full estimation of the covariance structure and (potentially) better latent state estimates.The methods presented here could be extended to non-linear Markovian process models by linearizing the update equations, though we have not presented any examples here.Third, the Gibbs updates shown in this paper are only applicable to state space models where both the observation and process model errors are normally distributed.While this is a common assumption in terrestrial carbon models, other applications may have error structures that do not meet this assumption.For example, error structures may be needed to maintain ecological realism, such as positivity of a particular latent state.More complex error structures or model dynamics require alternative fitting methods, such as particle methods (see Kantas et al., 2014, for a thorough review) or Gaussian Process Regression (Turner and Deisenroth, 2010).Finally, we have shown that it may not be possible to identify or recover process parameters for the DALECev model under yearly data gaps when using only data available from NEON.However, it is likely that assimilating additional data not observed by NEON, such as satellite-derived leaf area index (e.g., MODIS LAI), and incorporating stronger priors that reflect general ecological principles will help to constrain model parameters further (Bloom and Williams, 2015).
In conclusion, to address the growing popularity of state space modeling in the ecological forecasting research, we propose methods that help to assess and fix problems with process precision estimation and identifiability of process parameters that frequently arise in ecosystem state space modeling when observations are scarce.The state space framework augmented with data cloning to assess identifiability of parameters presented here is flexible enough to be adapted and applied to a broad range of problems including non normal-normal error structures, non-linear process models, and spatio-temporal models.The methods discussed here will allow practitioners to more effectively and efficiently address and overcome common suites of problems that arise when using state space models.

Figure 1 :
Figure 1: A schematic of the DALECev model with boxes denoting stocks of carbons and arrows denoting fluxes of carbon.
where δ j is a known precision.This specification for the fluxes assumes that flux observations are unbiased but contain measurement error.Each individual flux has different observation time points I j to account for the fact fluxes are measured using different methods, and the methods may work on different timescales, as well as giving flexibility in the case of data collection failure.

Figure 2 :
Figure 2: A plot of log average ESS of estimated precision parameters versus observation frequency.In each case we used the daily time step model and averaged over fifteen MCMC runs of 10,000 iterations of each model configuration.The dotted line indicates the cut-off of an ESS of 200 (5.30).

Figure 3 :
Figure3: A plot of observation frequency vs. log average effective sample sizes for the estimated precision parameters (inverse variance) for different observation collection regimes using the daily step and the monthly step models.In each case we averaged over fifteen MCMC runs of each model configuration.The dotted line indicates the cut-off of anESS  of 200 (5.30).Precision log ESS is pooled over all five precisions corresponding to their respective carbon stocks.

Figure 4 :
Figure 4: (a-e) Estimates for latent states for a monthly timestep with annual data observations (Left) along with histograms of corresponding marginal estimated process precision post burn-in (Right).Each panel corresponds to a particular carbon stock: C f represents foliage carbon, C w represents wood carbon, C r represents root carbon, C lit represents litter carbon, and C som represents soil organic matter carbon.

Figure 5 :
Figure 5: Results of data cloning for selected process parameters under the two NEON data flux scenarios, with 1 data cloning replicate and 25 data cloning replicates.Marks on the x axis denote the upper and lower bounds of the uniform priors given to the process parameters.

Table 2 :
Parameter values used to generate synthetic data sets of carbon stocks and fluxes using the DALECev model.Parameter values were chosen such that carbon stock data, leaf area index (LAI), and NEE were reflective of what would be expected at Talladega.
(Fox et al., 2009)ions are taken from the REFLEX project supplemental material(Fox et al., 2009)the carbon stocks were derived from NEON data National Ecological Observatory Network (2020), with specified initial mean and initial uncertainty.Process parameter values for simulations were chosen such that carbon stock data, leaf area index (LAI), and NEE were

Table 3 :
).Each panel corresponds to a particular carbon stock: C f represents foliage carbon, C w represents wood carbon, C r represents root carbon, C lit represents litter carbon, and C som represents soil organic matter carbon.Average empirical coverage and percent of precisions contained in the 95% HPD credible interval, averaged over 30 model runsIn Table3we show the average performance of the monthly time step model over 30 different model runs.Coverage rate of posterior credible intervals for the latent states are used to quantify how well latent states are being tracked.For each latent variable, the average probability that the true process precision lies within the 95% credible interval is also given.