Likelihood-free inference in state-space models with unknown dynamics

Likelihood-free inference (LFI) has been successfully applied to state-space models, where the likelihood of observations is not available but synthetic observations generated by a black-box simulator can be used for inference instead. However, much of the research up to now has been restricted to cases in which a model of state transition dynamics can be formulated in advance and the simulation budget is unrestricted. These methods fail to address the problem of state inference when simulations are computationally expensive and the Markovian state transition dynamics are undefined. The approach proposed in this manuscript enables LFI of states with a limited number of simulations by estimating the transition dynamics and using state predictions as proposals for simulations. In the experiments with non-stationary user models, the proposed method demonstrates significant improvement in accuracy for both state inference and prediction, where a multi-output Gaussian process is used for LFI of states and a Bayesian neural network as a surrogate model of transition dynamics.


Introduction
Likelihood-free inference (LFI) methods [1][2][3] estimate parameters θ of a statistical model, given an observed measurement x * and a black-box simulator g θ .These methods use synthetic observations x θ ∼ g θ (x | θ) produced by the simulator to assist the inference without requiring an analytical formulation of the likelihood p(x | θ).LFI has been successfully applied to identifying parameters of complex real-world systems, such as financial markets [4][5][6], species populations [7][8][9] and cosmology models [10][11][12].A special type of applications of LFI are time-dependent systems, which can be described using state-space models (SSMs) [13,14] where observed measurements x t ∈ R n are emitted given a series of latent variables, the states θ t ∈ R m , as illustrated in Figure 1.Typically, state-space inference methods [13,15,16] require an observation model g θ in the form of the likelihood p(x t | θ t ) to find the posterior distribution p(θ 1:T | x 1:T ).When the observation model is unavailable, state-space learning methods [17,18] are commonly used to infer g θ from the observed time-series data.However, when g θ is inferred, the states become very difficult to interpret for domain experts, since the states are no longer informed by a known model.An alternative solution to this problem is to use a simulator in place of g θ .LFI methods are able to infer the states and avoid learning g θ by using a simulator as the observation model.Simulators are widespread in SSM settings [19][20][21] since they enable incorporation of additional prior knowledge about data-generating mechanisms without the need for a tractable likelihood p(x t | θ t ).In this paper, we focus on LFI for SSMs, which fall under the category of approximate methods in the broader context of SSM inference.
An essential aspect of SSMs that is often overlooked in the LFI literature is the complexity of transition dynamics h θt .Current LFI methods for SSMs [22,23] proceed by assuming dynamics to be either too simplistic (e.g.linear) or readily available for sampling, which cannot be applied to complex phenomena in meteorology [24,25], cosmology [26,27] or behavioral sciences [21,28].The underlying dynamics in these domains are often too complex and may greatly vary from one case to another (e.g.behaviour of different people).Unless the true data-generating process of transition dynamics are both linear and Gaussian, these LFI approaches would lead to poor state estimates and predictions.Even though there have been advances in LFI, for instance, in developing more efficient sampling-based methods [29], proposing generation mechanisms of better-matching statistics [30], and establishing theoretical convergence guarantees [23,31,32], they do not address this fundamental limitation.
In this paper, we introduce a method capable of likelihood-free state inference and state prediction in discrete-time SSMs.Our method operates in a LFI setting, where a time-series of observations x t and a simulator g θ capable of replicating these observations are provided.The goal of the method is to infer the states θ 1:T = {θ 1 , ..., θ T } which can produce the observed time-series x 1:T = {x 1 , ..., x T }, using as few simulations as possible to reduce their potentially high computational cost.This setting is broader than is typically assumed by traditional LFI methods, since we do not assume the transition dynamics h θt to be known (neither in its closed-form, nor its function family) or available for sampling, and also because the number of simulations can be limited to be small.Instead of assuming the transition dynamics, we learn a non-parametric model and use it as their surrogate (or replacement) in state approximation and prediction.
This paper contains three main contributions.First, we propose a solution to the previously unaddressed problem of state prediction in SSMs with unknown transition dynamics and limited simulation budget.We use samples from LFI approximations of state posteriors p(θ t | x t ) to accurately model the state transition dynamics, with accuracy shown by empirical comparisons with state-of-the-art SSM inference techniques.Second, focusing on problems where LFI has to be sample-efficient, i.e. the number of simulations needs to be reduced as much as possible, we improve upon the current LFI methods for the state inference task by leveraging time-series information.This is done by using a multi-objective surrogate for the consecutive states (e.g. for time-steps j and j + 1) and sampling from a transition dynamics model to determine where to next run simulations.Lastly, we demonstrate that the proposed method is needed to tackle the crucial case of user modelling, where user models are non-stationary because user's beliefs, preferences and abilities change over time.
x θ have a smaller discrepancy than a user-defined threshold , then they were produced by simulator parameters θ that could plausibly replicate the observed measurement x * .This assumption, which is common for ABC approaches, results in the following approximations of the likelihood function L(•) and the posterior p(θ|x * ): Here κ (•) is the kernel with the maximum at zero, and whose bandwidth acts as an acceptance/rejection threshold.For instance, in ABC with rejection sampling, κ (δ(θ)) = ξ [0, )](δ(θ)) , where ξ [0, )](δ(θ)) equals one if δ(θ) ∈ [0, ) and zero otherwise.Unfortunately, ABC approaches need to simulate a lot of synthetic observations to get an accurate approximation of the posterior and therefore, unsuitable for inference with computationally heavy simulators.
Since many applications (including those considered in this paper) need to minimize the number of simulations, other methodologies, such as Bayesian optimization for LFI (BOLFI) [37] have emerged.In BOLFI, a Gaussian process (GP) surrogate is used for a discrepancy measure δ(θ), where the minimum of the GP surrogate mean function µ(θ) can be used as and a Gaussian CDF F (( − µ(θ))/ ν(θ) + σ 2 ) with mean 0 and variance 1 as Here, ν(θ) + σ 2 is the posterior variance of the GP surrogate.
A main advantage of modelling the discrepancy with a GP is the uncertainty estimation.The GP's predictive mean µ(θ (i) ) and variance ν(θ (i) ) are used to calculate the utility (e.g., Expected Improvement, Brochu et al. [38]) of sampling the objective function at the next candidate point θ (i+1) , where i denotes the number of a simulation.By maximizing this so-called acquisition function A(•) with respect to θ, one chooses where to run simulations next.
Because BOLFI actively chooses where to run simulations, its posterior approximation requires much fewer synthetic observations than other LFI methods that do not use active learning.However, BOLFI was not designed for SSMs, and hence does not make use of any temporal information as would be typical for SSMs to improve the quality of inference.
An alternative approach to sample-efficient LFI is global sequential neural estimation (SNE), which proceeds by learning the statistical relationship between observations and simulator parameters directly, through a neural network surrogate.This surrogate does not require retraining when the observation changes, if trained with a large enough sample set, which makes these SNE-methods especially suitable for a sequence of related inference tasks, such as required in time-series prediction.The SNE neural network can be used as a surrogate for the posterior, likelihood or likelihood ratio, resulting in SNPE [39][40][41], SNLE [42], and SNRE [43,44] methods respectively.These SNE methods address a more difficult problem than we do, of learning a model across all possible tasks (i.e.observed datasets).The price is that they require significantly more simulations than Bayesian optimization (BO) approaches, see Section 4.3 in [45].
Similarly to SNE approaches, Fengler et al. [46] introduce likelihood approximation networks (LANs), which approximate the likelihood for time-dependent generative models and apply them in dynamical systems in cognitive neuroscience.The key distinction of their approach is the assumption that the time-component is one of the inputs of the observation model, which allows them to learn the observation model at an arbitrary time-step.This assumption shifts the role of the dynamics onto the observation model, which is often beneficial for diffusion models [47,48], but not for the models of human behaviour [49][50][51].In contrast, our approach does not rely on the explicit dependency of the observation model on time, which enables state predictions in cases when the transition dynamics are unknown at the cost of amortization.
The issue of handling non-linear transition dynamics, in general, has been primarily addressed outside of the LFI literature.This large and growing set of methods includes extended Kalman filters [15,16], GP-SSMs [17,18], sequential Monte Carlo [52][53][54] and Bayes filtering [55,56].Although they are not directly applicable to the LFI setting considered in this paper, we summarised them in Table 1 along with the relevant LFI literature to highlight important connections.
3 Likelihood-free inference in state-space models In this section, we introduce a multi-objective approach to LFI in SSMs, which improves sample-efficiency of existing methods by using the model for discrepancy shared across consecutive states while also learning the model of the transition dynamics.The main elements of the solution are presented in Figure 2. To estimate state points θ t , given x t , we employ a multi-objective surrogate δ θ for discrepancies, and then approximate the posterior over states p(θ t | x t ) with (1).At the same time, we randomly pair consecutive posterior samples (θ j , θ j+1 ) and train a non-parametric surrogate for the state transition h θt , whose predictive posterior p(θ t+1 | x t ) proposes candidates for future simulations.We summarize our approach in Algorithm 1, where θ * denotes simulator parameter points shared across all time-steps.
Figure 2: An overview of our approach, in which the δ θ surrogate is used for LFI of states and h θt for the unknown transition dynamics.The δ θ models the corresponding discrepancies δ t ≡ δ t (θ * ) of several observations (green) inside a moving window (here, with the size of two), from which posteriors are extracted according to Equation (1) in P. h θt is trained with paired samples D from posteriors of consecutive states (grey); its predictive samples are used as proposals (orange) for simulations S. , and can be either amortized (do not need to be retrained when the observations change) or non-amortized (valid only for the observed data).These methods also vary in the number of simulations required for inference and the type of dynamics these methods assume.We report the single observation budget (per a time-step) for non-amortized methods and the total budget for amortized methods.

Multi-objective state inference
As an extension to BOLFI, we employ a multi-objective surrogate model for the discrepancies δ t (θ * ) = ρ(x t , x θ ) at different t, thus considering multiple discrepancy objectives simultaneously and leveraging information between consecutive states.More specifically, we pass discrepancies of the consecutive states to the surrogate separately (e.g. ), but through the use of shared parameters of the multi-objective surrogate they become associated.This approach allows using a discrepancy model of the previous state to infer the current state, instead of simply discarding it.Moreover, it allows having a much more flexible surrogate for LFI of states than the traditional GP used in BOLFI.These changes do not need any additional data to fit the surrogates, because all synthetic observations x θ for discrepancy objectives can be shared across all states (therefore, we use θ * instead of θ t in the context of simulations).When we consider a new observation x t+1 , we simply need to recalculate the discrepancy values for all synthetic observations.Once we have a trained surrogate for discrepancy objectives, we infer state posteriors p(θ t |x t ), similarly as in BOLFI.
There is an additional challenge for adapting multi-objective surrogates in SSMs, namely high computational cost associated with considering too many objectives.The time-series can potentially have hundreds of time-points, and expanding the number of considered objectives may be detrimental for the performance of the surrogate.We avoid this problem by limiting the number of objectives the surrogate can have.Instead of considering all available time-steps as objectives, we propose to consider only L recent objectives by gradually including new objectives and discarding old ones that have little impact on current states.The size of this moving window depends on how rapidly the transition dynamics change.As the size of the window L grows, the model becomes less sensitive to the noise from the dynamics, at the cost of increased computations and decreased adaptability to the most recent state transitions.Overall, the moving window reduces the number of objectives L considered at a time, making multi-objective modelling in the SSM setting feasible.In Appendix A, we further investigate the influence of the moving window size hyperparameter on state inference and prediction, and show that having only two objectives (L = 2) is the most beneficial choice in terms of the quality of posterior approximations and low computational time.

Learning state transition dynamics
While we gradually improve LFI posterior approximations p(θ t |x t ) by acquiring new simulations, we use empirical samples from the latest available approximations to learn a stochastic model of transition dynamics.This model should be able to learn from noisy samples of LFI posterior approximations p(θ t |x t ), and be flexible enough to fit arbitrary function families the dynamics may follow.In addition, it should be able to handle uncertainty associated with samples outside the training distribution, as samples from posterior approximations tend to be concentrated around the main mode of the learning data.For these reasons, the appropriate transition model should be Bayesian and non-parametric (or semi-parametric).Such a model would take the uncertainty associated with posterior approximations into account and be flexible enough to follow possibly non-linear transition dynamics.
We propose to train this model in an autoregressive fashion, by forming a training set of K randomly paired sample points from posteriors (e.g.p(θ t−1 | x t−1 ), p(θ t | x t )).More specifically, we assume the Markov property in the transition dynamics and use pairs of states instead of their whole trajectories.For each SSM time interval, we group consecutive state posterior samples in a training set, and expand it when new state posteriors become available (as we move forward in time).Thus, the transition model does not need to be retrained when new observations present themselves and can be actively used throughout state inference for determining where to run simulations next.This can be done by sampling the predictive posterior p(θ T +1 |x T ) from the trained model h θ T : This way, the state transition model h θt does not inform state posteriors directly, but only provides simulation candidates for the LFI surrogate.Ultimately, accumulating more simulations improves the discrepancy surrogate for the LFI of states and, by extension, quality of posterior samples, while higher-quality posterior samples allow more accurate learning of state transition dynamics.

Computational complexity and model choices
The proposed multi-objective approach to LFI requires some choice of the surrogate models.Here, we present the model choices for our approach, followed by a resulting complexity analysis of the Algorithm 1.
Following the requirements for the surrogates from Sections 3.1 and 3.2, we chose a linear model of coregionalization (LMC) [61] as a multi-objective surrogate for discrepancies and a Bayesian Neural Network (BNN) [62,63] as a surrogate for state transition dynamics.LMC is one of the simplest multi-objective models; it expresses each of its L outputs f l as a linear combination f l (θ * ) = Q q=1 a l,q u q , as shown in Figure 3, where the u q ∼ GP (0, ν(θ * )) are latent GPs and the a l,q are linear coefficients that need to be solved.As for BNN, it can be represented as an ensemble of neural networks, where each has its weights ω (h) drawn from a shared, learnt probability distribution [64] with , where µ (h) and χ (h) are the hyperparameters that need to be learned.Previously, neural networks have been successfully applied in SSM settings for either modelling the transition dynamics or the observation model [65,66].
Given the aforementioned model choices, the resulting computational complexity of the Algorithm 1 is dominated by three major stages: training of the multi-objective surrogate δ θ , posterior extraction from discrepancy surrogates (Equation ( 1)) and training of the transition dynamics model h θt .Both LMC and BNN are trained by minimizing the variational evidence lower bound (see more details about in Appendix C).Starting with the cost of training δ θ , it depends on the number of synthetic observations |S| (the cardinality of S), on the size of the moving window L and on the user-specified number M of inducing points [67] for the LMC, resulting in the complexity O(|S|LM 2 ), as opposed to O(|S|M 2 ) by traditional GPs used in BOLFI.Moving on to the posterior extraction, this stage consists of finding the appropriate (e.g. by minimizing the GP mean function), and then applying Equation (1), which is bounded by the complexity of calculating the variance of the surrogate for each of the I samples from the posterior over states, resulting in O(LM 2 I).Finally, if we apply variational inference [68] to train the transition dynamics model h θt , the computational cost is linear in the number W of BNN parameters -O(W KES p ), where K is the overall amount of training data for h θt , E is the number of epochs, and S p is the number of parameter samples from the posterior distribution that is required to get the distribution of outputs.Depending on the choice of hyperparameters, the computational complexity of the Algorithm 1 is bounded by either O(|S|LM 2 ), O(LM 2 I) or O(W KES p ).Most of these parameters are common in LFI (e.g.|S|, I), and the rest are specific to surrogate choices, which can be replaced with fewer-parameter alternatives if needed.We provide recommendations for choosing these hyperparameters in Appendix C.

Theoretical properties
Convergence Here, we show that our method learns a reasonable approximation of states and transition dynamics when provided with sufficient data.The state approximations for p(θ t |x t ) are obtained through the unnormalized likelihood function in (1), which was shown in Proposition 1 of Gutmann and Corander [37] to be a non-parametric approximation of the true likelihood: Proposition 1. Maximizing the synthetic likelihood L(θ * ) in (1) corresponds to maximizing a lower bound of a non-parametric approximation of the log likelihood, when the kernel function κ It is straightforward to show that for our LMC model Proposition 1 also holds when its kernel is a Gaussian CDF and the model simulates the empirical discrepancy.Corollary 1. Proposition 1 holds for the LMC model of discrepancy, assuming the Gaussian CDF kernel Proof.According to properties of the Gaussian CDF, the kernel F (•) is convex, which does not change the inequality expression in Proposition 1.This inequality holds since the Jensen's inequality yields a lower bound for both L(θ * ) and its logarithm when the functions are convex.The argument of κ (•) also remains the same, since the LMC models the empirical discrepancy.
As for the approximations of state transitions p(θ t+1 |θ t ), its convergence follows from the universal approximation theorem of neural networks Hornik et al. [69], according to which every non-linear function can be represented by a neural network containing a single hidden layer composed of neurons whose transfer function is bounded.This also applies in our case, since we use the BNN model for transition dynamics.Following the central limit theory, the expectation of our approximation converges to the target distribution provided sufficient number of parameters and data.
Restrictions on the class of systems for modelling Our model choices impose additional limitations on the class of systems that are difficult to model with our framework.The first limitation is high predictive variance when learning systems with long-term dependencies, as we train the BNN using a single trajectory of few observations (50 in our experiments from Section 4).The BNNs are very flexible, however, such lack of data can make them a poor option when it comes to modelling systems with non-linear dynamics and long-term dependencies.Aside from that, BNNs do not pose additional theoretical restrictions on the class of systems for modelling.The second limitation is on the type of the observational distribution, which is modeled in our framework by the LMC.While LMCs are much more flexible than vanilla GPs, they may also struggle with modelling asymmetric, skewed or multimodal noise in the observation model.The reason for that is that these GP-based surrogates assume all noise to be Gaussian, making the state posterior approximations less reliable in cases where this assumption is violated.The third limitation also stems from using GPs in LFI, which suffers from the curse of dimensionality limiting the dimension of observation model to be lower than 10.On other hand, GPs allow our framework to be very sample-efficient, requiring only few synthetic observations to approximate the likelihood.Nevertheless, when the simulation budget for the observation model is not restricted to the order of hundred simulations, we recommend using more complex surrogates, such as SNEs or LANs from Table 1, along with our approach to modelling state transitions.

Experiments
We assess the quality of our method for state inference and prediction tasks in a series of SSM experiments, where a simulator serves as the observation model g θ .In the experiments, our method uses the surrogate choices of LMC and BNN, as was described in Section 3. 3. We demonstrate that it can accurately learn state transition dynamics and improve upon existing LFI methods for the state inference task.Moreover, we investigate sample-efficiency of the proposed method and demonstrate its effectiveness in non-stationary user modelling case studies.We compare our method against traditional SSM methods in cases with available closed-form likelihoods and against LFI methods when only a simulator is available and traditional methods cannot be applied.

Experimental setup
We simulated time series of observations based on single sampled trajectories from ground-truth transition dynamics (available for the evaluation purposes, but unknown to the methods) of five SSMs, described in Section 4.2.Our goal was to estimate the simulator parameters that likely produced these observations, and learn the model of transition dynamics for state prediction based on the sampled trajectory.
For the state inference task, we compare the quality of state estimates by our approach against other LFI methods: BOLFI [37], SNPE [39], SNLE [42], and SNRE [43].We use a fixed simulation budget for all these methods, with 20 simulations to initialize the models, and then with 2 additional simulations for each new time-step.For the SNE approaches (SNPE , SNLE, SNRE), we provided all simulations at once, since that is their intended mode of operation.As for the prediction task, we sampled state trajectories from the transition model and evaluated them against trajectories from the ground-truth dynamics.We performed these experiments in SSMs with simulators that have tractable likelihoods p(x t | θ t ), providing the closed-form of the ground-truth likelihoods to the state-of-the-art SSM inference methods GP-SSM [17,60] and PR-SSM [59], while our method was still doing LFI.For all methods in the prediction task, we provided 50 observations, and then sampled trajectories that had the same length of 50 time-steps.
We also compared two variants of our method that differ only in the way the next simulations are sampled: LMC-BLR, where samples were taken from Bayesian linear regression (BLR) models that linearized the transition dynamics along 50 observed time-steps; and LMC-qEHVI, where a popular acquisition function for multitask BO, q-expected hypervolume improvement (qEHVI) [70], was used to provide samples.The role of these variants was to evaluate how the choice of the future simulations impacts the quality of state inference and prediction.
All models were assessed in terms of the root-mean-squared error (RMSE) between the state estimates and their groundtruths.The experiments were repeated 30 times with different random seeds.Additional details on implementation of the methods can be found in Appendix C; all code for replicating the experiments is included in the Supplement.

The state-space models
In this section, we present two case studies with non-stationary user models and three SSMs with tractable likelihoods.In user modelling experiments, we simulated behavioural data from humans who completed a certain task in two different experiments, described in Sections 4.2.1 and 4.2.2.For the first task, the user evaluated dataset embeddings for a classification problem, and the evaluation score was used as behavioural data.During the second task, the user searched for a target on a display and the search time was measured.Our task in the experiments was to track the changing parameters of user models and learn their dynamics.
In addition to non-stationary user models, we also experimented with three models with tractable likelihoods, common in SSM literature: linear Gaussian (LG), non-linear non-Gaussian (NN) and stochastic volatility (SV) models.In the LG model, the state transition dynamics and observation model are both linear, with high observational white noise.The NN model is a popular non-linear SSM [71], where each observation has two unique solutions.Lastly, we used the SV model [72], which is used for predicting volatility of asset prices in stock markets [73,74].We provide full descriptions of these SSMs in Appendix B.

UMAP parameterization
With the first non-stationary user model, we infer uniform manifold approximation and projection (UMAP) [75] parameters that best satisfy the human subject's needs.We assume that the subject uses UMAP to generate lowdimensional data representations for a classification task, and that their preferences change with time.Specifically, in the beginning, the subject does not have any prior knowledge of the dataset, so data exploration takes the priority.As they become more familiar with the data, the priority gradually shifts from exploration to maximizing classification accuracy of their model.Our hypothesis is that by modelling the change in subject's preferences, we can anticipate the preferences and propose good UMAP parameter candidates faster to the human user.
For this experiment, the subject's needs are simulated by an evaluation function that takes the embedding of a handwritten digit dataset [76] as an input and assigns a corresponding preference score.The evaluation function is based on the weighted sum of the density based cluster validity (DBCV) score U(•) [77] and the c-support vector classification (SVC) [78,79] accuracy P(•).Both functions depend on the UMAP parameter values θ * = {θ d , θ dist , θ n } which need to be inferred: the dimension of the target reduced space θ d , an algorithmic parameter θ dist that controls how densely the points packed, and the neighbourhood size θ n to use for local metric approximation.By regulating the weight w t , one can prioritize data exploration U or maximization of a classification accuracy P with 25) . ( We make a few simplifying modelling choices which are justified by cognitive science literature [80,81], as discussed below.We assume the subject cannot explicitly specify the ideal embedding, and hence the SSM observations x t are latent.But since the subject has the ability to evaluate embeddings with the preference score, we model the preference score δ t directly as an objective.In summary, the UMAP algorithm serves as an observation model and the transition dynamics are unknown, but implicitly regulated through (3).We use uniform priors for states throughout the experiments: Note, that the ground truth values for this SSM are unknown, and we approximate them with ABC with rejection sampling allowing it use significantly more simulations than our methods uses.This procedure along with details on the implementation of the SSM are described in Appendix C.

Eye movement control for gaze-based selection
For the second non-stationary user model, we infer properties of human gaze in a series of simulated eye movement control trials [82,83].In these trials, the user model of the human searches for a target on a 2D screen by performing eye movements (saccades), based on its beliefs about the target location and information from peripheral vision.When the gaze location matches the location on the screen, the task is considered complete and a new target appears.As the human subject performs more tasks, they get weary, which results in an increasing latency between saccades θ l .We believe that modelling the dynamics of eye movement latency will allow robust inference of the individual characteristics that are independent of weariness.
The subject needs several moves to reach the target because the actions and observations are corrupted with two noise sources: the ocular motor noise θ om and the spatial noise of peripheral vision θ s .The quantitative values for these two variables vary for each individual, while the saccade latency θ l varies between different trial instances.We assess the performance of the subject based on total time x ∈ R it took for the gaze to reach the target, Here, the x t are SSM observations, θ * = {θ om , θ s , θ l } are state values that need to be inferred, Â(e) is the saccade amplitude function, and e is the saccade index.The increase of the latency θ l serves as state transition dynamics h θ that needs to be modelled for prediction, while eye movement control task environment is considered as an observation model g θ .To produce behavioural data, ground truth values of 0.01 and 0.09 for θ om and θ s respectively were used in the observation model.Finally, we calculate a Euclidean distance directly applied to x as a discrepancy and use the following priors for the states: The user model was implemented with reinforcement learning by [84].More details on implementation can be found in Appendix C.

Results and analysis
The results for the inference and prediction tasks are presented in Tables 2 and 3, respectively.The lower the RMSE, the better is the quality of estimation.In the inference task, the proposed LMC-based methods clearly outperformed BOLFI and SNE approaches.This indicates that considering multiple objectives at the same time was beneficial for the state inference, and that the model actually leverages information from consecutive states without hindering the performance.Additionally, it can be seen that all LMC-based variants performed differently, which can only be attributed to how the next simulations were chosen, since the surrogate was exactly the same in all three methods.As the results show, having BNN as a model for state transition was beneficial for experiments with non-stationary user models, while having BLR was more preferable for the simpler models.This suggests that BLR is expressive enough to replicate simple transitions, but struggles with more complex ones, for which BNN was more suitable.
The comparisons with GP-SSMs and PR-SSMs for learning transition dynamics show that our method learns accurate dynamics or, at least, relative to the SSM method baselines.The SSMs methods showed worse results than BLR and BNN approaches.This can be explained by the lack of observations for learning state transitions by the SSM methods, which also explains the high variance in the sampled trajectories from these methods.As for comparisons between BLR and BNN, BLR performs better only in LG and SV models, while BNN performs better in more complex case studies.Moreover, it should be noted that trajectory sampling from BLR is possible only by retaining all local linearizations of the dynamics, which is a far more limiting approach than having one single model.Therefore, BNN is a more preferable transition dynamics model.
The empirical time costs for running the LFI methods are shown in Table 4.It can be seen that the SNPE method was the fastest for the computationally cheap simulators (of SSMs with tractable likelihoods), while the LMC-qEHVI required the least amount of time for the non-stationary user models.This is expected, since the SNEs learn the model only once, and then simply use it for all observations, which is suitable for the computationally cheap simulations with simple LFI solutions.However, for the non-stationary user models, where there are no closed-form likelihoods available, learning a single model actually requires much more time.To summarize, the LMC variants are clearly preferable for the computationally heavy simulators, which dominate the cost of training a transition dynamics model and a multi-objective surrogate.
Finally, Figure 4 shows how the performance of the LFI models changes with different simulation budgets: 2, 5, and 10 simulations per each time-step.As expected, in general, all methods improved their performance with increased budgets.However, there is little difference in how these methods compare with respect to each other.This indicates that the results are not sensitive to the precise simulation budget.
In all experiments, we attribute the success of the proposed LMC-BNN method to a more flexible multi-output surrogate, and a more efficient way of choosing simulation candidates.The LMC allows multi-fidelity modelling (e.g.decomposing a stochastic process into processes with different length-scales), which allows leveraging information from multiple consecutive time-steps, unlike standard GPs.At the same time, samples from the transition model provide better candidates for simulations than the alternatives.The flexible surrogate along with adaptive acquisition make our method particularly suitable for online settings, where only a handful of samples are possible per time-step.

Discussion
We proposed an approach for state inference and prediction in the challenging SSM setting, where the transition dynamics are unknown and observations can only be simulated.Importantly, our model of transition dynamics was obtained with few simulations, making it suitable for cases with computationally expensive simulators.This is important because typically sample-efficient LFI approaches discard any temporal information from observed time-series, and cannot do state prediction, which is necessary for choosing the next simulations when simulation budget is limited.We proposed a solution for both these challenges: we use a multi-objective surrogate model for the discrepancy measure between observed and synthetic data, which connects the consecutive states through shared parameters, and we train an additional surrogate for state transitions with samples from LFI state posteriors.Additionally, our method does not restrict the family of admissible solutions for the state transitions to be linear or Gaussian, unlike existing LFI methods for SSMs [29,31], making it more widely applicable.
Although our method uses a more flexible surrogate for LFI of states, we demonstrated that it requires neither additional data nor significantly more training time than traditionally used GP surrogates.We reached the sample-efficiency goal by sharing synthetic observations across all discrepancy objectives, allowing the method to use the same simulations an indefinite amount of times.As for the decreased training time, we proposed a moving window approach that allowed the surrogate to focus only on a few recent SSM time-steps at a time.In conclusion, having a more flexible surrogate improved state inference and provided better samples from state posteriors for learning the unknown dynamics.
The main limitation of our approach is that the proposed transition dynamics model does not account for long-term state dependencies.Our state transition surrogate considers only the most recent state as an input, assuming the Markov property, and therefore cannot forecast far into the future.The resulting predictions have very low variance and have a tendency to converge to similar values, which can be explained by training only on a single trajectory.This limits our method to cases where observations are highly informative.

A Moving window experiments
We summarize the performance of our multi-objective LFI approach with various moving window lengths in Tables 5,  6 and 7, where the moving window size is shown in parentheses after the method's name.The main finding of these experiments is that including only two objectives inside the moving window is sufficient to get the most performance benefits, while having more objectives leads to increased computation time and inconsistent performance results.The increase in computation (Table 7) is evident from the complexity analysis in Section 3.3, as it makes the multi-objective surrogate training time to grow linearly.As for the inconsistent performance, the results become worse when transition dynamics have a rapid change rate, which makes the consequent objectives dissimilar and, hence, multi-objective modelling more difficult.For example, LMC-BNN and LMC-qEHVI improved for the LG and NN cases (see Tables 5,  6), which had mostly smooth trajectories, but struggled with the SV case, which had occasional erratic transitions, as shown in Figure 5c.These results indicate that one should choose the moving window size according to the transition dynamics' change rate, which is difficult to determine when dynamics are unknown.Therefore, because any supposed improvements of having bigger moving windows are small and computationally more costly, when dealing with unknown dynamics we recommend setting moving window size to two objectives.
It is also noteworthy to mention how different acquisition methods are impacted by having additional objectives.Starting with LMC-BNNs, with more objectives, the transition dynamics model gets samples from more frequently updated state posterior approximations, since each objective stays longer inside a moving window and their corresponding posteriors are updated every time the window moves.This is beneficial for simple transition dynamics where the multi-objective surrogate maintains a good fit with high number of objectives (e.g.LG, NN), but not for the SV case, where the surrogate struggles with modelling a lot of objectives.At the same time, these performance losses and gains remain negligible.The next acquisition method, LMC-qEHVI, is not impacted by the quality of the extracted posteriors.Instead, it prioritizes optimization of objectives inside the moving window without taking into the account future states.Specifically, it works only when future objectives are very similar to the current ones, and completely fails when transition dynamics have erratic behaviour (as in SV, UMAP and Gaze).Lastly, the BLR acquisition linearizes the transition dynamics locally and when this locality increases by including more objectives, the considered locality becomes less and less linear.As a result, increasing the moving window size hinders the performance of the LMC-BLR in all cases.In conclusion, small window size for all three methods is preferable.   .
where σ (o,e) and σ (b,e) are observation and belief uncertainties respectively, A (e) is the amplitude, and E (e) is the eccentricity of the saccade at the episode e.The user model was trained on 10 000 episodes using the Proximal Policy Optimization (PPO) algorithm [85], provided by the Stable Baselines3 library [86].We used default parameters, a multilayer perceptron policy and a clipping parameter set to 0.15.The environment was implemented by Chen et al. [84] in Python with the Open AI gym framework [87].

C.2 UMAP parameterization
In the UMAP parameterization model, the ground truth for states is not available, because the human-subject cannot specify the ideal embeddings.Therefore, we approximate the ground truth by using ABC with rejection sampling.Usually, this requires running millions of simulations for each time-step, but we make use of the weighted form of the preference score that allows us to use the same simulations across all time-steps.We simulate 1, 500, 000 embeddings with state points sampled from the prior, and then calculate their corresponding U and P from Equation 4.1 of the main text.For each time-step t, we calculate the preference scores δ and retain only 0.06% of those states that showed the lowest δ values.Then, we use a Gaussian kernel density estimator (KDE) on the retained state points, and maximize the corresponding PDFs to find the estimations of the ground truth.The bandwidth b of the kernel was calculated according to a Scott's rule [88] , where n is the number of data points and d is the number of dimensions.
The preference score was computed as a weighted sum between the relative validity U and the classification accuracy P on the validation set.The relative validity U is an approximation of the Density Based Cluster Validity (DBCV) score [77], which is often used as a quality measure of clustering.Intuitively, it shows how separable all the clusters are.We use the HDBSCAN* package [89] and the HDBSCAN Boruvka KDTree [90] algorithm to cluster the points of the embeddings.We set the smallest size grouping to 60, the density parameter of clusters to 10 and leave the rest parameters to their default values.The classification accuracy P was calculated for the C-Support Vector Classifier (SVC) [78,79] with the Scikit-learn package [91] and default parameters.
The embeddings for the UMAP parameterization model were computed for the handwritten digit dataset [76].The dataset was randomly split on the training and validation sets, resulting in 1198 and 599 8x8 pixel images of digits in 10 digit classes.The UMAP algorithm was implemented in the UMAP-learn package [92].

C.3 Implementation details of methods
All methods in the following subsections were integrated in the Engine for Likelihood-Free Inference (ELFI) [93] with the code available for application and further development through the link: https://github.com/AaltoPML/LFI-in-SSMs-with-UD.

C.3.1 BOLFI
For BOLFI, we used GP surrogate in BO with the LCBSC acquisition function.The posterior samples were extracted with importance weighted resampling.Below are the technical details for each of these BOLFI components: GP surrogate.The implementation for the GP surrogate was provided by the GPy package [94].It was initiated with zero mean function, and with the following priors for the RBF kernel lengthscale l, its variance σ 2 k , and the variance of the bias term σ 2 b : where θ min , θ max are the lower and upper bounds for θ (i) , (θ (i) , δ (i) ) are initial training points, and 1 is an all-ones vector.The GP was minimized by using the SCG optimizer [95] on the GP negative log-likelihood with a maximum number of iterations of 50.All inputs θ (i) for the GP were centred and normalized.
Posterior sampling.The BOLFI posterior was obtained by weighting the prior samples and using corresponding likelihoods as weights with F (•) being a Gaussian CDF with the mean 0 and the variance 1, where is the minimum of the GP surrogate mean function µ(•) obtained by using the L-BFGS-B optimizer [97] with a maximum number of iterations of 1000.

C.3.2 Muti-objective LFI with transition model
For our method, we used an LMC surrogate with the proposals for simulations coming from a Bayesian Neural Network (BNN) transition model.In the experiments, we also used the qEHVI multi-objective acquisition function and the Bayesian Linear Regression (BLR) alternatives for the transition model component.The technical details for the main components as well as for their alternatives are described below along with a recommended default set of parameters, which performed consistently well in all experiments.Algorithm 2 follows the algorithm from the main section, but provides a lower-level description of the method.
LMC surrogate.The LMC model was implemented in BoTorch [98].Its latent GPs were initialized with linear mean functions µ(x) = kx + b, and RBF kernels.The lengthscales of the kernels were parameterized in log scale, and initialized with 0 along with the constant b of the mean function.For the RBF kernel, ARD was also enabled to assign separate lengthscales for each input dimension.Furthermore, GP latents were initialized with 50 inducing points uniformly sampled inside the input bounds for each latent GP.The LMC training used Adam optimizer from the tensor computation package PyTorch [99] with a learning rate of 0.1 to minimize the variational evidence lower bound (ELBO).The optimized parameters included LMC coefficients, inducing points locations, and hyperparameters for the mean functions and kernels.Lastly, we used the default training step size and 1000 epochs for training with all inputs for the LMC centred and normalized.
qEHVI acquisition.The qEHVI acquisition function used a Quasi-MC sampler [100] with scrambled Sobol sequences [101] and a sample size of 128.The reference point that was used for calculating the hypervolume for each objective was set to -5.The acquisition points were acquired sequentially using successive conditioning [102] with a maximum number of restarts of 20, a batch size limit of 10, and a maximum number of iterations for the optimizer of 200.It was also implemented in BoTorch.
Bayesian neural network (BNN) [64] was built from stacked Bayesian layers implemented in torch zoo (BLiTZ) [63].We used an architecture with 2 hidden layers, where each layer had 256 nodes.During the training process, stochastic gradient descent [103] was used with a learning rate of 0.001 for minimizing the ELBO loss with squared L2 norm.The loss was calculated based on 10 samples from the model, for each of 100 batches of training data in a single epoch.The training data was randomly selected from previously stored approximated posterior samples from all states (1000 samples per each state) with replacement, resulting in a total of 1 000 000 points, where one point was a pair of consecutive state points.The training data was updated each time the moving window moved.

Figure 1 :
Figure 1: Graphical representation of an SSM.Latent states θ t (orange) produce observations x t through the observation simulator g θ (blue) and follow the Markovian transition dynamics h θt (red).

Figure 3 :
Figure 3: Graphical representation of the LMC.The discrepancy outputs δ t ≡ δ t (θ * ) are modelled as a linear combination of latent functions u q .The model shares the same parameter values θ * between all objectives.

Figure 4 :
Figure 4: The performance of LFI methods for the state inference tasks with various simulation budgets in two non-stationary user modelling experiments.The box plots were computed from 30 repetitions with different random seeds.The horizontal line on box plots shows the median, the bar shows upper and lower quartiles, and the whiskers indicate the rest of the quartiles.The diamond points indicate outliers.

Figure 5 : 2 (
Figure 5: Transition trajectories (different colours) of states (y-axis) sampled from three SSM transition dynamics across 50 time-steps (x-axis) with different random seeds.The complexity of the dynamics gradually increases in SSMs, starting with the smooth LG (a) dynamics, where the difference between consecutive states is very small, followed by NN (b) with more erratic behaviour, and finishing with the SV model (c), whose dynamics has occasional drastic changes in state values.

Table 1 :
Comparison of inference methods in SSMs with references to selected representative works.LFI methods use simulators to infer states (or simulator parameters)

Table 2 :
Comparison of LFI methods (rows) in different SSMs (columns) for the state inference task.The performance was measured with 95% confidence interval (CI) of the RMSE between estimated vs ground truth state values for 50 time-steps.The best results in each column are highlighted in bold.± 1.7 17.93 ± 1.34 57.15 ± 2.35 75.85 ± 1.26 80.75 ± 1.35

Table 3 :
Comparison of transition dynamics models (rows) in different SSMs (columns).The performance was measured with 95% CI of the RMSE between sampled vs ground truth trajectories of length 50.The best results in each column are highlighted in bold.

Table 4 :
Time comparison of LFI methods (rows) in different SSMs (columns) for training 50 time-steps.The running time is shown in minutes along with 95% CI.The best results in each column are highlighted in bold.

Table 5 :
Comparison of LFI methods (rows) in different SSMs (columns) for the state inference task with different moving window lengths (a number in parentheses).The performance was measured with 95% confidence interval (CI) of the RMSE between estimated vs ground truth state points for 50 time-steps.The best results in each column are highlighted in bold.

Table 7 :
Time comparison of LFI methods (rows) in different SSMs (columns) with different moving window lengths (a number in parentheses) for training 50 time-steps.The running time is shown in minutes along with 95% CI.The best results in each column are highlighted in bold.