1 Introduction

Likelihood-free inference (LFI) methods (Sunnåker et al. 2013; Sisson et al. 2018; Cranmer et al. 2020) estimate the parameters \(\varvec{\theta }\) of a statistical model, given an observed measurement \(\textbf{x}_*\) and a black-box simulator \(g_{\varvec{\theta }}\). These methods use synthetic observations \(\textbf{x}_{\varvec{\theta }} \sim g_{\varvec{\theta }} (\textbf{x}\mid \varvec{\theta })\) produced by the simulator to assist the inference without requiring an analytical formulation of the likelihood \(p(\textbf{x}\mid \varvec{\theta })\). LFI has been successfully applied to identifying parameters of complex real-world systems, such as financial markets (Peters et al. 2012; Barthelmé and Chopin 2014; Ong et al. 2018), species populations (Beaumont et al. 2002b; Beaumont 2010; Bertorelle et al. 2010) and cosmology models (Schafer and Freeman 2012; Alsing et al. 2018; Jeffrey et al. 2021). An important class of LFI applications concerns time-dependent systems, which can be described using state-space models (SSMs) (Kalman et al. 1960; Koller and Friedman 2009), where observed measurements \(\textbf{x}_t \in \mathbb {R}^n\) are emitted given a series of latent variables, the states \(\varvec{\theta }_t \in \mathbb {R}^m\), as illustrated in Fig. 1.

Compared to traditional Bayesian estimation, in a simulator-based setting our primary aim is to understand how latent states evolve in relation to both the logic of the simulator and real-world observed data. Typically, state-space inference methods (Kalman et al. 1960; Anderson and Moore 2012; Zerdali and Barut 2017) require an observation model \(g_{\varvec{\theta }}\) in the form of the likelihood \(p(\textbf{x}_t \mid \varvec{\theta }_t)\) to find the posterior distribution \(p(\varvec{\theta }_{1:T} \mid \textbf{x}_{1:T})\). When the observation model is unavailable, state-space learning methods (Frigola et al. 2014; Melchior et al. 2019) are commonly used to infer \(g_{\varvec{\theta }}\) from the observed time-series data. However, when \(g_{\varvec{\theta }}\) is inferred, the states become difficult for domain experts to interpret, since they are no longer informed by a known model. An alternative solution to this problem is to use a simulator in place of \(g_{\varvec{\theta }}\): LFI methods can infer the states while avoiding learning \(g_{\varvec{\theta }}\) by using a simulator as the observation model. Simulators are widespread in SSM settings (Ghassemi et al. 2017; Shafi et al. 2018; Georgiou and Demiris 2017), since they enable the incorporation of additional prior knowledge about data-generating mechanisms without the need for a tractable likelihood \(p(\textbf{x}_t \mid \varvec{\theta }_t)\). In this paper, we focus on LFI for SSMs, which falls under the category of approximate methods in the broader context of SSM inference.

An essential aspect of SSMs that is often overlooked in the LFI literature is the complexity of the transition dynamics \(h_{\varvec{\theta }_t}\). Current LFI methods for SSMs (Toni et al. 2009; Dean et al. 2014) assume the dynamics to be either simple (e.g., linear) or readily available for sampling. In contrast, our approach is especially valuable when the transition dynamics are complex, non-linear, and not known in advance, particularly under a limited simulation budget for the observation model. Such complexities in state transitions, which deviate from simple linear or Gaussian norms, are frequently observed in diverse domains such as meteorology (Errico et al. 2013; Zeng et al. 2020), cosmology (Lange et al. 2019; He et al. 2019) and behavioural sciences (Gimenez et al. 2007; Georgiou and Demiris 2017). In meteorology, for example, intricate dynamics (Kalnay 2003) are driven by a vast web of interconnected factors shaping weather patterns. In the behavioural sciences (Kahneman and Tversky 2013; Fiske and Taylor 2013), human decision-making is a prime example of such complexity: choices are shaped not only by an individual’s past experiences but also by their current emotional states and cognitive biases. One instance is how new information can sway subsequent decisions, a phenomenon we examine in our later experiments. Traditional LFI methods, when not tailored to these non-linear and non-Gaussian dynamics, often yield suboptimal state estimates and predictions. While there have been notable advances in LFI, such as more efficient sampling-based methods (Jasra et al. 2012), innovative statistic-matching generation mechanisms (Martin et al. 2019), and theoretical convergence guarantees (Dean et al. 2014; Martin et al. 2014; Calvet and Czellar 2015), these advances still fall short of addressing this core challenge.

Fig. 1 Graphical representation of an SSM. Latent states \(\varvec{\theta }_t\) (orange) produce observations \(\textbf{x}_t\) through the observation simulator \(g_{\varvec{\theta }}\) (blue) and follow the Markovian transition dynamics \(h_{\varvec{\theta }_t}\) (red)

In this paper, we introduce a method capable of likelihood-free state inference and state prediction in discrete-time SSMs. Our method operates in an LFI setting, where a time-series of observations \(\textbf{x}_t\) and a simulator \(g_{\varvec{\theta }}\) capable of replicating these observations are provided. The goal of the method is to infer the states \(\varvec{\theta }_{1:T} = \{ \varvec{\theta }_1,..., \varvec{\theta }_T \}\) that can produce the observed time-series \(\textbf{x}_{1:T} = \{ \textbf{x}_1,..., \textbf{x}_T \}\), using as few simulations as possible to reduce their potentially high computational cost. This setting is broader than is typically assumed by traditional LFI methods, since we do not assume the transition dynamics \(h_{\varvec{\theta }_t}\) to be known (neither in closed form nor by function family) or available for sampling, and since the simulation budget may be limited to a small number. Instead of assuming the transition dynamics, we learn a non-parametric model and use it as their surrogate (or replacement) in state approximation and prediction.

This paper contains three main contributions. First, we propose a solution to the previously unaddressed problem of state prediction in SSMs with unknown transition dynamics and a limited simulation budget. We use samples from LFI approximations of state posteriors \(p(\varvec{\theta }_t \mid \textbf{x}_t)\) to accurately model the state transition dynamics, as demonstrated by empirical comparisons with state-of-the-art SSM inference techniques. Second, focusing on problems where LFI has to be sample-efficient, i.e., where the number of simulations needs to be reduced as much as possible, we improve upon current LFI methods for the state inference task by leveraging time-series information. This is done by using a multi-objective surrogate for consecutive states (e.g., for time-steps j and \(j+1\)) and sampling from a transition dynamics model to determine where to run simulations next. Lastly, we demonstrate that the proposed method is needed to tackle the crucial case of user modelling, where user models are non-stationary because users’ beliefs, preferences, and abilities change over time.

2 Background

Approximate Bayesian computation (ABC) (Beaumont et al. 2002a; Csilléry et al. 2010; Sunnåker et al. 2013) is arguably the most popular family of LFI methods. In its simplest variant, ABC with rejection sampling (Tavaré et al. 1997; Pritchard et al. 1999), the simulator parameters are repeatedly sampled from the prior \(p(\varvec{\theta })\) to generate synthetic observations \(\textbf{x}_{\varvec{\theta }}\). These synthetic observations are then compared to the observed measurement \(\textbf{x}_*\) using the so-called discrepancy measure \(\delta (\varvec{\theta }) = \rho (\textbf{x}_*, \textbf{x}_{\varvec{\theta }})\), where \(\rho (\cdot , \cdot )\) is a distance function, e.g. Euclidean. If synthetic observations \(\textbf{x}_{\varvec{\theta }}\) have a discrepancy smaller than a user-defined threshold \(\epsilon \), then they are considered to be produced by simulator parameters \(\varvec{\theta }\) that could plausibly replicate the observed measurement \(\textbf{x}_*\). This common assumption in ABC approaches results in the following approximations of the likelihood function \(\mathcal {L}(\cdot )\) and the posterior \(p(\varvec{\theta }\, |\, \textbf{x}_*)\):

$$\begin{aligned} \mathcal {L}(\varvec{\theta }) \approx \mathbb {E} [\kappa _\epsilon ( \delta (\varvec{\theta }) )], \quad p( \varvec{\theta }\, | \,\textbf{x}_*) \propto \mathcal {L}(\varvec{\theta }) \cdot p(\varvec{\theta }). \end{aligned}$$
(1)

Here \(\kappa _\epsilon (\cdot )\) is a kernel with its maximum at zero, whose bandwidth \(\epsilon \) acts as an acceptance/rejection threshold. For instance, in ABC with rejection sampling, \(\kappa _\epsilon (\delta (\varvec{\theta })) = \xi _{[0, \epsilon )}(\delta (\varvec{\theta }))\), where \(\xi _{[0, \epsilon )}(\delta (\varvec{\theta }))\) equals one if \(\delta (\varvec{\theta }) \in [0, \epsilon )\) and zero otherwise. Unfortunately, ABC approaches need many simulations to accurately approximate the posterior, making them unsuitable for inference with computationally intensive simulators.
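To make the rejection-sampling variant concrete, the following minimal Python sketch implements it for a toy Gaussian simulator. The simulator, summary statistics, prior bounds, and threshold \(\epsilon \) are illustrative assumptions, not taken from this paper.

```python
# Minimal ABC rejection-sampling sketch; all model choices here are toy assumptions.
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta, n=50):
    # Black-box simulator: we can sample from it, but treat its likelihood as unknown.
    return rng.normal(loc=theta, scale=1.0, size=n)

x_star = simulator(2.0)                                      # observed measurement x_*
summary = lambda x: np.array([x.mean(), x.std()])            # summary statistics
rho = lambda a, b: np.linalg.norm(summary(a) - summary(b))   # Euclidean discrepancy

eps, accepted = 0.2, []
for _ in range(20000):
    theta = rng.uniform(-5.0, 5.0)                           # draw from the prior p(theta)
    x_theta = simulator(theta)                               # synthetic observation
    if rho(x_star, x_theta) < eps:                           # uniform kernel: accept iff delta < eps
        accepted.append(theta)

posterior_samples = np.array(accepted)                       # approximate draws from p(theta | x_*)
```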

2.1 Bayesian optimisation for LFI

Since many applications, including those considered in this paper, aim to minimise the number of simulations, other methodologies have emerged, such as Bayesian optimisation for LFI (BOLFI) (Gutmann and Corander 2016). In BOLFI, a Gaussian process (GP) surrogate is used for the discrepancy measure \(\delta (\varvec{\theta })\): the minimum of the GP surrogate mean function \(\mu (\varvec{\theta })\) can be used as \(\epsilon \), and the standard normal CDF \(F( (\epsilon - \mu (\varvec{\theta })) / \sqrt{\nu (\varvec{\theta }) + \sigma ^2})\) as \(\mathbb {E}[\kappa _\epsilon (\cdot )]\) in Eq. (1). Here, \(\nu (\varvec{\theta }) + \sigma ^2\) is the posterior variance of the GP surrogate.

A main advantage of modelling the discrepancy with a GP is the ability to estimate uncertainty. The GP’s predictive mean \(\mu (\varvec{\theta }^{(i)})\) and variance \(\nu (\varvec{\theta }^{(i)})\) are used to calculate the utility (e.g., expected improvement, Brochu et al. 2010) of sampling the objective function at the next candidate point \(\varvec{\theta }^{(i+1)}\), where i indexes the simulations. Maximising this so-called acquisition function \(\mathcal {A}(\cdot )\) with respect to \(\varvec{\theta }\) determines where to run simulations next. Because BOLFI actively chooses where to run simulations, its posterior approximation requires far fewer synthetic observations than LFI methods that do not use active learning. However, BOLFI was not specifically designed for SSMs and hence does not exploit the temporal information typical of SSMs to enhance inference quality.
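The sketch below illustrates such a loop under simplifying assumptions: a generic GP regressor models the discrepancy, a lower-confidence-bound rule stands in for BOLFI’s own acquisition function, and the Eq. (1) approximation is evaluated on a parameter grid at the end.

```python
# BOLFI-style active-learning sketch; the toy simulator, kernel, grid search and
# exploration weight are illustrative assumptions, not BOLFI's exact choices.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)

def discrepancy(theta):                          # delta(theta) = rho(x_*, x_theta)
    x_theta = rng.normal(theta, 1.0, size=50)    # one toy simulator run
    return abs(x_theta.mean() - 2.0)             # observed summary assumed to equal 2.0

grid = np.linspace(-5, 5, 401).reshape(-1, 1)
thetas = list(rng.uniform(-5, 5, size=5))        # initial design
deltas = [discrepancy(t) for t in thetas]

def fit_gp():
    return GaussianProcessRegressor(RBF() + WhiteKernel()).fit(
        np.array(thetas).reshape(-1, 1), deltas)

for _ in range(30):                              # acquisition loop
    mu, std = fit_gp().predict(grid, return_std=True)
    theta_next = grid[np.argmin(mu - 2.0 * std), 0]   # favour low mean / high variance
    thetas.append(theta_next)
    deltas.append(discrepancy(theta_next))

mu, std = fit_gp().predict(grid, return_std=True)
lik = norm.cdf((mu.min() - mu) / std)            # Eq. (1): synthetic likelihood on the grid
```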

2.2 Sequential neural estimation

An alternative approach to sample-efficient LFI is global sequential neural estimation (SNE), which learns the statistical relationship between observations and simulator parameters directly through a neural network surrogate. If trained with a sufficiently large sample set, this surrogate does not need retraining when the observation changes, making SNE methods particularly suitable for a sequence of related inference tasks, such as those required in time-series prediction. Although there exist amortised versions of neural approximation methods, the specific sequential variants highlighted here are not naturally amortised. In our dynamic framework, these methods are employed to address separate LFI problems across different time-steps, ensuring the use of consistent priors throughout. The SNE neural network can be used as a surrogate for the posterior, likelihood, or likelihood ratio, resulting in the SNPE (Papamakarios and Murray 2016; Goncalves et al. 2018; Greenberg et al. 2019), SNLE (Papamakarios et al. 2019), and SNRE (Durkan et al. 2020; Hermans et al. 2020) methods, respectively. These SNE methods address a more difficult problem than we do: learning a model across all possible tasks (i.e., observed datasets). The price is that they require significantly more simulations than Bayesian optimisation (BO) approaches, as seen in Sect. 4.3 of Aushev et al. (2020).

2.3 Likelihood approximation networks

Likelihood approximation networks (LANs), introduced by Fengler et al. (2021), share similarities with SNE approaches. LANs approximate the likelihood for time-dependent generative models in dynamical systems within cognitive neuroscience. Their key distinction is the assumption that the time component is one of the inputs of the observation model, allowing them to learn the observation model at an arbitrary time-step. This assumption shifts the role of the dynamics onto the observation model, which is often beneficial for diffusion models (Reynolds and Rhodes 2009; Wieschen et al. 2020), but not for models of human behaviour (Schall 2019; Futrell et al. 2020; Pothos and Chater 2002). In contrast, our approach does not rely on the explicit dependency of the observation model on time, enabling state predictions when the transition dynamics are unknown at the cost of amortisation.

2.4 Non-linear dynamics in non-LFI methods

The issue of handling non-linear transition dynamics, in general, has been primarily addressed outside of the LFI literature. This large and growing set of methods includes extended Kalman filters (Anderson and Moore 2012; Zerdali and Barut 2017), GP-SSMs (Frigola et al. 2014; Melchior et al. 2019), sequential Monte Carlo (Doucet et al. 2001; Smith 2013; Septier et al. 2013) and Bayes filtering (Smidl and Quinn 2008; Karl et al. 2016). Although they are not directly applicable to the LFI setting considered in this paper, we summarise them in Table 1 alongside relevant LFI literature to highlight important connections.

Table 1 Comparison of inference methods in SSMs with references to selected representative works

3 Likelihood-free inference in state-space models

In this section, we introduce a multi-objective approach to LFI in SSMs, which improves the sample-efficiency of existing methods by sharing a discrepancy model across consecutive states while also learning a model of the transition dynamics. The main elements of the solution are presented in Fig. 2. To estimate state points \(\varvec{\theta }_t\), given \(\textbf{x}_t\), we employ a multi-objective surrogate \(\widetilde{\delta }_{\varvec{\theta }}\) for discrepancies and then approximate the posterior over states \(p(\varvec{\theta }_t \mid \textbf{x}_t)\) with Eq. (1). At the same time, we randomly pair consecutive posterior samples \((\varvec{\theta }_j, \varvec{\theta }_{j+1})\) and train a non-parametric surrogate for the state transition \(\widetilde{h}_{\varvec{\theta }_t}\), whose predictive posterior \(p(\varvec{\theta }_{t+1} \mid \textbf{x}_t)\) proposes candidates for future simulations. We summarise our approach in Algorithm 1, where \(\varvec{\theta }_*\) denotes simulator parameter points shared across all time-steps. For in-depth details, please refer to “Appendix C (Section C.3.2)”.

Fig. 2 An overview of our approach, in which the \(\widetilde{\delta }_{\varvec{\theta }}\) surrogate is used for LFI of states and \(\widetilde{h}_{\varvec{\theta }_t}\) for the unknown transition dynamics. The \(\widetilde{\delta }_{\varvec{\theta }}\) models the corresponding discrepancies \(\delta _t \equiv \delta _t(\varvec{\theta }_*)\) of several observations (green) inside a moving window (here, with the size of two), from which posteriors are extracted according to Eq. (1) in \(\mathcal {P}\). \(\widetilde{h}_{\varvec{\theta }_t}\) is trained with paired samples \(\mathcal {D}\) from posteriors of consecutive states (grey); its predictive samples are used as proposals (orange) for simulations \(\mathcal {S}\)

Algorithm 1
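As a concrete illustration of the loop in Algorithm 1, the toy sketch below runs the whole pipeline on an invented one-dimensional SSM. It is a deliberate simplification, not the paper’s implementation: a single-objective GP stands in for the LMC window of Sect. 3.1, plain linear regression stands in for the BNN of Sect. 3.2, and the simulator and ground-truth dynamics are fabricated for illustration.

```python
# End-to-end toy version of the main loop: infer states with a GP discrepancy
# surrogate, pair consecutive posteriors, fit a transition model, and use its
# predictions to propose the next simulations (two per time-step, as in Sect. 4).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
simulate = lambda th: th + rng.normal(0, 0.3, size=np.shape(th))   # observation model g

# Ground-truth trajectory (hidden from the method): theta_{t+1} = 0.9 theta_t + 0.5.
T, theta = 10, 0.0
states = [theta := 0.9 * theta + 0.5 + rng.normal(0, 0.05) for _ in range(T)]
obs = simulate(np.array(states))

S_theta = list(rng.uniform(-1, 6, size=20))        # shared simulation inputs theta_*
S_x = [simulate(th) for th in S_theta]             # stored synthetic observations
prior = rng.uniform(-1, 6, size=2000)              # prior samples for importance resampling
pairs_X, pairs_y, post_prev = [], [], None

def state_posterior(x_t):                          # Eq. (1) with a GP discrepancy surrogate
    deltas = np.abs(np.array(S_x) - x_t)           # discrepancies reuse stored simulations
    gp = GaussianProcessRegressor(RBF() + WhiteKernel()).fit(
        np.array(S_theta).reshape(-1, 1), deltas)
    mu, std = gp.predict(prior.reshape(-1, 1), return_std=True)
    w = norm.cdf((mu.min() - mu) / std)            # synthetic-likelihood weights
    return rng.choice(prior, size=500, p=w / w.sum())

for t in range(T):
    post_t = state_posterior(obs[t])
    if post_prev is not None:
        pairs_X += list(post_prev[:100])           # randomly paired consecutive samples
        pairs_y += list(post_t[:100])
        h = LinearRegression().fit(np.array(pairs_X).reshape(-1, 1), pairs_y)
        proposals = h.predict(post_t[:2].reshape(-1, 1)) + rng.normal(0, 0.1, size=2)
        S_theta += list(proposals)                 # run the simulator at proposed states
        S_x += [simulate(p) for p in proposals]
    post_prev = post_t
    print(f"t={t}  true={states[t]:.2f}  est={post_t.mean():.2f}")
```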

3.1 Multi-objective state inference

As an extension to BOLFI, we employ a multi-objective surrogate model for the discrepancies \(\delta _{t}(\varvec{\theta }_*) = \rho (\textbf{x}_t, \textbf{x}_{\varvec{\theta }})\) at different t, considering multiple discrepancy objectives simultaneously and leveraging information between consecutive states. More specifically, we pass discrepancies of consecutive states to the surrogate separately (e.g., \(\delta _{t-1}(\cdot ), \delta _{t}(\cdot ))\), but through the shared parameters of the multi-objective surrogate they become associated. This allows the discrepancy model of the previous state to inform inference of the current state, instead of simply being discarded. Moreover, it allows for a much more flexible surrogate for LFI of states than the traditional GP used in BOLFI. These changes do not need any additional data to fit the surrogates, because all synthetic observations \(\textbf{x}_{\varvec{\theta }}\) for discrepancy objectives can be shared across all states (therefore, we use \(\varvec{\theta }_*\) instead of \(\varvec{\theta }_t\) in the context of simulations). When we consider a new observation \(\textbf{x}_{t+1}\), we simply need to recalculate the discrepancy values for all synthetic observations. Once we have a trained surrogate for the discrepancy objectives, we infer state posteriors \(p(\varvec{\theta }_{t} \,|\, \textbf{x}_{t})\), as in BOLFI. This can be achieved, for example, through importance resampling, where prior samples are weighted according to the likelihood function \(\mathcal {L}(\varvec{\theta })\) from Eq. (1).

3.1.1 Moving window approach

There is an additional challenge in adapting multi-objective surrogates in SSMs: the high computational cost associated with considering too many objectives. Time-series can potentially have hundreds of time-points, and expanding the number of considered objectives may be detrimental to the performance of the surrogate. We avoid this problem by limiting the number of objectives the surrogate can have. Instead of considering all available time-steps as objectives, we propose to consider only L recent objectives by gradually including new ones and discarding old ones that have little impact on current states. The size of this moving window depends on how rapidly the transition dynamics change. As the size of the window L grows, the model becomes less sensitive to the noise from the dynamics, at the cost of increased computations and decreased adaptability to the most recent state transitions. Overall, the moving window reduces the number of objectives L considered at a time, making multi-objective modelling in the SSM setting feasible. In “Appendix A”, we further investigate the influence of the moving window size hyperparameter on state inference and prediction and show that having only two objectives (\(L=2\)) is the most beneficial choice in terms of the quality of posterior approximations and low computational time.
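A minimal sketch of the bookkeeping behind the moving window, assuming a scalar summary and an absolute-difference discrepancy: stored simulations are reused, and only the discrepancy columns are recomputed and rotated when a new observation arrives.

```python
# Moving-window bookkeeping (L = 2): new objectives are added and old ones
# dropped, without any additional simulator calls.
import numpy as np

rng = np.random.default_rng(3)

thetas = rng.uniform(-5, 5, size=30)             # shared parameter points theta_*
x_sims = thetas + rng.normal(0, 0.5, size=30)    # stored synthetic observations x_theta

L, window = 2, []                                # window holds (t, delta_t) columns

def add_observation(t, x_t):
    delta_t = np.abs(x_sims - x_t)               # recompute discrepancies from x_sims
    window.append((t, delta_t))
    if len(window) > L:                          # discard the oldest objective
        window.pop(0)

for t, x_t in enumerate([0.3, 0.7, 1.2, 1.6]):
    add_observation(t, x_t)
# window now holds delta_2 and delta_3, ready for the multi-objective surrogate.
```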

3.2 Learning state transition dynamics

While we progressively improve LFI posterior approximations \(p(\varvec{\theta }_t \,| \,\textbf{x}_t)\) by acquiring new simulations, we use empirical samples from the latest available approximations to learn a stochastic model of the transition dynamics. This model should be able to learn from noisy samples of the LFI posterior approximations \(p(\varvec{\theta }_t\, |\, \textbf{x}_t)\), and be flexible enough to fit whatever function family the dynamics may follow. In addition, it should be able to handle the uncertainty associated with samples outside the training distribution, as samples from posterior approximations tend to be concentrated around the main mode of the training data. For these reasons, the appropriate transition model should be Bayesian and non-parametric (or semi-parametric). Such a model accounts for the uncertainty associated with posterior approximations and is flexible enough to follow possibly non-linear transition dynamics.

We propose to train this model in an autoregressive fashion by forming a training set of K randomly paired sample points from posteriors (e.g., \(p(\varvec{\theta }_{t-1} \mid \textbf{x}_{t-1})\), \(p(\varvec{\theta }_t \mid \textbf{x}_t)\)). More specifically, we assume the Markov property in the transition dynamics and use pairs of states instead of their whole trajectories. For each SSM time interval, we group consecutive state posterior samples in a training set, and expand it when new state posteriors become available (as we move forward in time). Thus, the transition model does not need to be retrained when new observations present themselves and can be actively used throughout state inference to determine where to run simulations next. This can be done by sampling the predictive posterior \(p(\varvec{\theta }_{T+1} \,|\, \textbf{x}_{T})\) from the trained model \(\widetilde{h}_{\varvec{\theta }_T}\):

$$\begin{aligned} p(\varvec{\theta }_{T+1} \,|\, \textbf{x}_{T}) \approx \int \widetilde{h}_{\varvec{\theta }_T}(\varvec{\theta }_{T+1} \,|\, \varvec{\theta }_{T}) \cdot p(\varvec{\theta }_{T} \,|\, \textbf{x}_{T}) d\varvec{\theta }_{T}. \end{aligned}$$
(2)

The posterior described above should be recognized as an approximate representation, informed by the data and model, rather than an exact reflection of the true posterior. All later mentions of the posterior pertain to this approximation. Within this framework, the state transition model \(\widetilde{h}_{\varvec{\theta }_t}\) influences state posteriors indirectly, primarily serving as a source of simulation candidates for the LFI surrogate. Ultimately, accumulating more simulations improves the discrepancy surrogate for the LFI of states and, by extension, the quality of posterior samples, while higher-quality posterior samples allow for more accurate learning of state transition dynamics.
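The sketch below spells out Eq. (2) as a Monte Carlo estimate, with the random pairing of consecutive posterior samples described above; a bootstrap ensemble of small MLPs serves as a crude stand-in for the variationally trained BNN.

```python
# Monte Carlo version of Eq. (2); the MLP ensemble is a stand-in for the BNN.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)

# Toy posterior samples for two consecutive states (in practice: LFI output).
post_prev = rng.normal(1.0, 0.1, size=200)        # samples of p(theta_{T-1} | x_{T-1})
post_T = rng.normal(1.4, 0.1, size=200)           # samples of p(theta_T | x_T)

K = 200                                           # randomly paired training points
idx_a, idx_b = rng.integers(0, 200, K), rng.integers(0, 200, K)
X, y = post_prev[idx_a].reshape(-1, 1), post_T[idx_b]

ensemble = []
for _ in range(10):                               # bootstrap for predictive spread
    boot = rng.integers(0, K, K)
    m = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000).fit(X[boot], y[boot])
    ensemble.append(m)

# Eq. (2): push draws from p(theta_T | x_T) through the transition surrogate.
theta_T = rng.choice(post_T, size=500)
preds = np.stack([m.predict(theta_T.reshape(-1, 1)) for m in ensemble])
theta_next = preds[rng.integers(0, 10, 500), np.arange(500)]   # draws of p(theta_{T+1} | x_T)
```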

3.3 Computational complexity and model choices

In this section, we discuss the model choices and the resulting complexity analysis for the proposed multi-objective approach to LFI, as illustrated in Algorithm 1.

3.3.1 Model choices for surrogates

To meet the requirements for the surrogates as stated in Sects. 3.1 and 3.2, we have chosen a linear model of coregionalization (LMC) (Fanshawe and Diggle 2012) for discrepancies and a Bayesian neural network (BNN) (Kononenko 1989; Esposito 2020) for state transition dynamics.

1. Linear model of coregionalization. The LMC is one of the simplest multi-objective models. It expresses each of its L outputs \(f_l\) as a linear combination \(f_l(\varvec{\theta }_*) = \sum _{q=1}^Q \text {a}_{l,q} u_q\), as shown in Fig. 3, where the \(u_q\sim GP(0, \nu (\varvec{\theta }_*))\) are latent GPs and the \(\text {a}_{l,q}\) are linear coefficients that need to be estimated (see the covariance sketch after this list).

2. Bayesian neural network. A BNN can be represented as an ensemble of neural networks, where each network has its own weights \(\omega ^{(h)}\) drawn from a shared, learned probability distribution (Blundell et al. 2015), with \(\omega ^{(h)} \sim \mathcal {N}(\mu ^{(h)}, \log (1 + \exp (\chi ^{(h)})))\), where \(\mu ^{(h)}\) and \(\chi ^{(h)}\) are the hyperparameters that need to be learned. Previously, neural networks have been successfully applied in SSM settings for modelling either the transition dynamics or the observation model (Rivals and Personnaz 1996; Bonatti and Mohr 2021).
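The covariance structure implied by the LMC can be written out in a few lines of NumPy: with \(f_l = \sum _{q} \text {a}_{l,q} u_q\), the cross-covariance between outputs l and \(l'\) is \(\sum _q \text {a}_{l,q} \text {a}_{l',q} k_q(\cdot , \cdot )\). The RBF latent kernels and the choice of two outputs and two latent processes below are illustrative assumptions.

```python
# Covariance of an LMC with L = 2 outputs and Q = 2 latent GPs, built as
# K = sum_q (a_q a_q^T) kron K_q; a joint draw yields correlated surfaces
# for the discrepancies of consecutive states.
import numpy as np

rng = np.random.default_rng(5)

def rbf(X, lengthscale):
    d = X[:, None] - X[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

X = np.linspace(-5, 5, 40)               # shared parameter points theta_*
L_out, Q = 2, 2
A = rng.normal(size=(L_out, Q))          # linear coefficients a_{l,q}

K = sum(np.kron(np.outer(A[:, q], A[:, q]), rbf(X, ls))
        for q, ls in zip(range(Q), [0.5, 2.0]))

jitter = 1e-8 * np.eye(L_out * len(X))
f = rng.multivariate_normal(np.zeros(L_out * len(X)), K + jitter)
delta_prev, delta_curr = f[:40], f[40:]  # correlated draws for two objectives
```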

Fig. 3 Graphical representation of the LMC. The discrepancy outputs \(\delta _t \equiv \delta _t(\varvec{\theta }_*)\) are modelled as a linear combination of latent functions \(u_q\). The model shares the same parameter values \(\varvec{\theta }_*\) between all objectives

3.3.2 Complexity analysis

Given the aforementioned model choices, the resulting computational complexity of Algorithm 1 is primarily influenced by three main stages: training the multi-objective surrogate \(\widetilde{\delta }_{\varvec{\theta }}\), extracting the posterior from discrepancy surrogates (Eq. 1) and training the transition dynamics model \(\widetilde{h}_{\varvec{\theta }_t}\). Both LMC and BNN are trained by minimising the variational evidence lower bound (see more details in “Appendix C”).

1. Training the multi-objective surrogate. The cost of training \(\widetilde{\delta }_{\varvec{\theta }}\) depends on the number of synthetic observations \(|\mathcal {S}|\) (the cardinality of \(\mathcal {S}\)), on the size L of the moving window and on the user-specified number M of inducing points (Alvarez and Lawrence 2011) for the LMC. This results in a complexity of \(\mathcal {O}(|\mathcal {S}| L M^2)\), compared to \(\mathcal {O}(|\mathcal {S}| M^2)\) for the traditional GPs used in BOLFI.

2. Posterior extraction. This stage consists of finding an appropriate \(\epsilon \) (e.g., by minimising the GP mean function) and then applying Eq. (1). The complexity of this step is bounded by the calculation of the surrogate variance for each of the I samples from the posterior over states, resulting in \(\mathcal {O}(L M^2 I)\).

3. Training the transition dynamics model. When employing variational inference (Zhang et al. 2018) to train the transition dynamics model \(\widetilde{h}_{\varvec{\theta }_t}\), the computational cost is linear in the number W of BNN parameters, resulting in \(\mathcal {O}(W K E S_p)\). Here, K is the total amount of training data for \(\widetilde{h}_{\varvec{\theta }_t}\), E is the number of epochs, and \(S_p\) is the number of parameter samples drawn from the posterior distribution to obtain the distribution of outputs.

Depending on the choice of hyperparameters, the computational complexity of Algorithm 1 is bounded by either \(\mathcal {O}(|\mathcal {S}| L M^2)\), \(\mathcal {O}(L M^2 I)\) or \(\mathcal {O}(W K E S_p)\). Most of these parameters are common in LFI (e.g., \(|\mathcal {S}|, I\)), and the rest are specific to surrogate choices, which can be replaced with fewer-parameter alternatives if needed. We provide recommendations for choosing these hyperparameters in “Appendix C”.

3.4 Theoretical properties

In this section, we analyse the convergence properties and limitations of our LFI method for state-space models. We discuss how our method approximates states and transition dynamics, and outline the restrictions imposed by our choice of models on the class of systems that can be effectively modelled using our method. While the approach discussed in this section provides a robust framework for state inference in SSMs, it is vital to note that the results are approximate in nature, owing to inherent limitations such as the finite moving window. This section primarily aims to lay out the conditions under which our method can be seen as offering a good approximation rather than an exact solution.

3.4.1 Convergence

In the convergence analysis, we examine the ability of our method to learn a suitable approximation of states and transition dynamics when provided with sufficient data. The state approximations for \(p( \varvec{\theta }_t \,|\, \textbf{x}_t )\) are obtained through the likelihood function in Eq. (1), which Proposition 1 of Gutmann and Corander (2016) identifies as a non-parametric approximation of the true likelihood:

Proposition 1

Maximising the synthetic log-likelihood \(\text {log} \mathcal {L}(\varvec{\theta }_*)\) in Eq. (1) corresponds to maximising a lower bound of a non-parametric approximation of the log likelihood when the kernel function \(\kappa _\epsilon (\cdot )\) is convex.

$$\begin{aligned} \text {log }\mathcal {L}(\varvec{\theta }_*) \ge \text {log } \kappa _\epsilon ( \mathbb {E} [ \delta _t(\varvec{\theta }_*)] ) \end{aligned}$$

For our LMC model, we can demonstrate that Proposition 1 holds when the kernel is a Gaussian CDF, as specified below, and \(\epsilon \) is the minimum of the GP surrogate mean function:

Corollary 1

Assuming the Gaussian CDF kernel \(F( (\epsilon - \mu (\varvec{\theta }_*)) / \sqrt{\nu (\varvec{\theta }_*) + \sigma ^2})\) from Sect. 2 and \(\epsilon = \min _{\varvec{\theta }_*} \mu (\varvec{\theta }_*)\), Proposition 1 holds for the LMC model of discrepancy.

Proof

The Gaussian CDF kernel \(F(\cdot )\) is known to be convex on the interval \((-\infty , 0]\). By setting \(\epsilon \) as the minimum of the GP surrogate mean function, the argument of \(F(\cdot )\) is restricted to the range \((-\infty , 0]\) with the maximum at 0 (note that since \(\mu (\cdot )\) models discrepancy, it is always non-negative). Consequently, the inequality expression in Proposition 1 is preserved, while Jensen’s inequality ensures a lower bound for both \(\mathcal {L}(\varvec{\theta }_*)\) and its logarithm when the functions are convex. \(\square \)

As for the approximations of state transitions \(p(\varvec{\theta }_{t+1} \,|\, \varvec{\theta }_{t})\), their convergence follows from the universal approximation theorem for neural networks (Hornik et al. 1989). This theorem states that every continuous function can be approximated by a neural network with a single hidden layer of neurons whose transfer function is bounded. Our use of the BNN model for transition dynamics complies with this theorem. Under certain conditions, such as the availability of sufficient parameters and data, the central limit theorem guarantees that the expectation of our approximation converges to the target distribution.

3.4.2 Restrictions on modelling classes

Our choice of models imposes additional limitations on the class of systems that may be challenging to model using our method. The first limitation concerns high predictive variance when learning systems with long-term dependencies. While our method is robust across a variety of applications, it encounters challenges when dealing with time series that possess long memory. If the size of the moving window is shorter than the memory inherent in the time series, our method may fail to capture crucial long-term dependencies. Although it can handle abrupt changes to a certain extent, effectively addressing the complexities presented by long-memory dynamics is a topic for future development. Given that the training of the BNN involves a single trajectory consisting of a limited number of observations (50 in our experiments in Sect. 4), the flexibility offered by BNNs might be insufficient to accurately model systems characterised by both non-linear dynamics and significant long-term dependencies. It is important to note, however, that BNNs do not introduce additional theoretical restrictions on the class of systems that can be modelled.

The second limitation concerns the type of observation distribution that our method models through the LMC. Although LMCs offer greater flexibility than vanilla GPs, they may have difficulty modelling asymmetric, skewed, or multimodal noise in the observation model when the simulation budget is constrained. This issue is prevalent among LFI methods in general, as they often rely on models, such as GP-based surrogates, that make simplifying assumptions, for instance Gaussian noise. These assumptions can compromise the reliability of state posterior approximations when they are violated.

The third limitation stems from using GPs in LFI, which are subject to the curse of dimensionality, restricting the observation model’s dimensionality to fewer than 10. This constraint, however, is intrinsically connected to our method’s sample-efficiency, a significant advantage, as it requires only a few synthetic observations to approximate the likelihood. If the simulation budget for the observation model is not limited to the order of a hundred simulations, we recommend using more complex surrogates, such as SNEs or LANs from Table 1, alongside our approach to modelling state transitions.

4 Experiments

We assess the quality of our method for state inference and prediction tasks in a series of SSM experiments, where a simulator serves as the observation model \(g_{\varvec{\theta }}\). In the experiments, our method uses the surrogate choices of LMC and BNN, as described in Sect. 3.3. We demonstrate that it can accurately learn state transition dynamics and improve upon existing LFI methods for the state inference task. Moreover, we investigate the sample-efficiency of the proposed method and demonstrate its effectiveness in non-stationary user modelling case studies. We compare our method against traditional SSM methods in cases with available closed-form likelihoods and against LFI methods when only a simulator is available and traditional methods cannot be applied.

4.1 Experimental setup

We simulated time series of observations based on single-sampled trajectories from ground-truth transition dynamics (available for evaluation purposes but unknown to the methods) of five SSMs, described in Sect. 4.2. Our goal was to estimate the simulator parameters that likely produced these observations, and learn the model of transition dynamics for state prediction based on the sampled trajectory.

4.1.1 Comparison methods

For the state inference task, we compare the quality of state estimates by our approach against other LFI methods: BOLFI (Gutmann and Corander 2016), SNPE (Papamakarios and Murray 2016), SNLE (Papamakarios et al. 2019), and SNRE (Durkan et al. 2020). We use a fixed simulation budget for all these methods, with 20 simulations to initialise the models and then two additional simulations for each new time-step. For the SNE approaches (SNPE, SNLE, and SNRE), we provided all simulations at once since that is their intended mode of operation. As for the prediction task, we sampled state trajectories from the transition model and evaluated them against trajectories from ground-truth dynamics. We performed these experiments in SSMs with simulators that have tractable likelihoods \(p(\textbf{x}_t \mid \varvec{\theta }_t)\), providing the closed-form of the ground-truth likelihoods to the state-of-the-art SSM inference methods GP-SSM (Frigola et al. 2014; Ialongo et al. 2019) and PR-SSM (Doerr et al. 2018), while our method was still doing LFI. For all methods in the prediction task, we provided 50 observations and then sampled trajectories that had the same length of 50 time-steps.

We also compared two variants of our method that differ only in the way the next simulations are sampled: LMC-BLR, where samples were taken from Bayesian linear regression (BLR) models that linearized the transition dynamics along 50 observed time-steps; and LMC-qEHVI, where a popular acquisition function for multitask BO, q-expected hypervolume improvement (qEHVI) (Daulton et al. 2020), was used to provide samples. The role of these variants was to evaluate how the choice of future simulations impacts the quality of state inference and prediction.

All models were assessed in terms of the root mean squared error (RMSE) between the state estimates and their ground-truths. The experiments were repeated 30 times with different random seeds. Additional details on the implementation of the methods can be found in “Appendix C”; all code for replicating the experiments is included in the Supplement.

4.2 The state-space models

In this section, we present two case studies with non-stationary user models and three SSMs with tractable likelihoods. In the user modelling experiments, we simulated behavioural data from users completing a task in two different experiments, described in Sects. 4.2.1 and 4.2.2. It is worth noting that, unlike in typical experiments, where the underlying dynamics can be assumed stationary, user preferences and behaviour can change over time, making them harder to model with traditional approaches. In the first task, the user evaluated dataset embeddings for a classification problem, and the evaluation score was used as behavioural data. In the second task, the user searched for a target on a display, and the search time was measured. Our task in the experiments was to track the changing parameters of the user models and learn their dynamics.

In addition to the non-stationary user models, we also experimented with three models with tractable likelihoods, common in the SSM literature: linear Gaussian (LG), non-linear non-Gaussian (NN), and stochastic volatility (SV) models. In the LG model, the state transition dynamics and the observation model are both linear, with high observational white noise. The NN model is a popular non-linear SSM (Kitagawa 1996) in which each observation has two unique solutions. Lastly, we used the SV model (Barndorff-Nielsen and Shephard 2002), which is used for predicting the volatility of asset prices in stock markets (Taylor 1994; Shephard 1996). For in-depth details of these models, and for a report on the auto-correlation function (Parzen 1963; Brockwell and Davis 2009) across all five SSMs, which sheds light on the intricacies of their transition dynamics, please refer to “Appendix B”.
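For reference, the commonly used form of the NN benchmark is straightforward to simulate. The sketch below follows the standard textbook parameterisation (Kitagawa 1996); the exact noise scales used in our experiments are reported in “Appendix B” and may differ.

```python
# Standard non-linear non-Gaussian benchmark (Kitagawa 1996): the quadratic
# observation model maps +theta and -theta to the same mean, so each
# observation has two solutions, as noted in the text.
import numpy as np

rng = np.random.default_rng(6)

def nn_ssm(T=50, q=np.sqrt(10.0), r=1.0):
    thetas, xs = np.zeros(T), np.zeros(T)
    theta = rng.normal(0.0, 1.0)
    for t in range(T):
        theta = (0.5 * theta + 25 * theta / (1 + theta ** 2)
                 + 8 * np.cos(1.2 * t) + q * rng.normal())   # transition h
        x = theta ** 2 / 20 + r * rng.normal()               # observation g
        thetas[t], xs[t] = theta, x
    return thetas, xs

states, observations = nn_ssm()
```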

4.2.1 UMAP parameterisation

In this study, we utilise the first non-stationary user model to observe and analyse how an individual’s preferences evolve over time while interacting with data. Consider a scenario where an individual is engaged in data categorization without prior familiarity with the data. Initially, their primary focus might be on exploring and understanding the data. However, as they become acquainted with the data, their attention shifts towards enhancing the accuracy of their categorizations. By employing uniform manifold approximation and projection (UMAP) (McInnes et al. 2018a), we adapt the data presentation to align with the user’s shifting needs. The ultimate objective is to accurately predict and apply the most suitable UMAP settings for the user at different times.

User modelling

Drawing insights from cognitive science (Slovic et al. 2002; Lichtenstein and Slovic 2006), it is evident that individuals can possess preferences over various data presentations, even if they are unable to articulate their specific desires explicitly.

We quantify these preferences through an evaluation function, which assigns scores to different presentations of handwritten digit data (Alpaydin and Kaynak 1998). The assigned scores are influenced by two primary metrics: the density-based cluster validity (DBCV) score (\(\mathcal {U}(\cdot )\)) (Moulavi et al. 2014) and the c-support vector classification (SVC) accuracy (\(\mathcal {P}(\cdot )\)) (Boser et al. 1992; Cortes and Vapnik 1995). A weight parameter, \(w_t\), adjusts the balance between these metrics, thereby allowing emphasis on either data exploration (\(\mathcal {U}\)) or classification accuracy (\(\mathcal {P}\)):

$$\begin{aligned} \delta _{t}(\varvec{\theta }_{*,t})&= (1 - w_t) \cdot \mathcal {U}(\varvec{\theta }_{*,t}) + w_t \cdot \mathcal {P}(\varvec{\theta }_{*,t}) , \end{aligned}$$
(3)
$$\begin{aligned} w_t&= \frac{1}{1 + e^{-0.1\cdot (t - 25)} }. \end{aligned}$$
(4)

Both \(\mathcal {U}(\cdot )\) and \(\mathcal {P}(\cdot )\) are dependent on the time-variant UMAP settings, denoted as \(\varvec{\theta }_{*,t}\).
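Eqs. (3)-(4) translate directly into code. In the sketch below, dbcv_score and svc_accuracy are hypothetical stand-ins for the DBCV and SVC computations on a UMAP embedding.

```python
# Direct transcription of Eqs. (3)-(4); dbcv_score and svc_accuracy are
# hypothetical placeholders for the real embedding evaluations.
import numpy as np

def preference_weight(t):
    return 1.0 / (1.0 + np.exp(-0.1 * (t - 25)))     # Eq. (4): crossover near t = 25

def evaluation_score(theta, t, dbcv_score, svc_accuracy):
    w_t = preference_weight(t)
    return (1 - w_t) * dbcv_score(theta) + w_t * svc_accuracy(theta)   # Eq. (3)

# Early on, the user weighs exploration (DBCV); later, classification accuracy.
print(round(preference_weight(1), 2), round(preference_weight(49), 2))  # 0.08 0.92
```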

State-space modelling framework

  • Observation model: The observation model combines the UMAP algorithm with the evaluation function (as defined in Eq. (3)). The UMAP algorithm provides a low-dimensional data representation (or embedding) whose parameters are dictated by the latent states, while the evaluation function serves as a subjective lens through which data presentations are assessed. The scores of the evaluation function then serve as observations. To cater to evolving human preferences, adjustments to the UMAP settings are needed.

  • Transition dynamics: In our framework, transition dynamics are unknown; we lack precise knowledge regarding the necessary adjustments to the UMAP parameters since the evaluation function is beyond our direct control.

  • Latent states: The latent states within our model are represented by the time-dependent UMAP parameters: \(\varvec{\theta }_{*,t} = \{ \theta _{d, t}, \theta _{dist, t}, \theta _{n, t} \}\). These include the dimension of the reduced space (\(\theta _{d}\)), the point density-dictating parameter (\(\theta _{dist}\)), and the neighbourhood size parameter for local metric approximation (\(\theta _{n}\)). The priors for these parameters are:

    $$\begin{aligned} \theta _{d, t}&\sim \text {Unif}(1, 64) \in \mathbb {Z}, \\ \theta _{dist, t}&\sim \text {Unif}(0, 0.99) \in \mathbb {R}, \\ \theta _{n, t}&\sim \text {Unif}(2, 200) \in \mathbb {Z}. \end{aligned}$$

It is crucial to note that the ideal UMAP settings (referred to as ground truth states) remain elusive for our task. To address this, we generated 1,500,000 embeddings using parameter settings sampled from the prior, calculated their corresponding \(\mathcal {U}\) and \(\mathcal {P}\) values, and retained only a very small number of parameter settings (0.06%) exhibiting the best preference score values at each time step. Subsequently, we applied a Gaussian kernel density estimator to these parameter settings, allowing us to derive estimates of the ground truth for evaluating the performance of the methods. More implementation details can be found in “Appendix C”.
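A scaled-down sketch of this ground-truth estimation procedure, with synthetic placeholder scores in place of the real evaluations of 1,500,000 UMAP embeddings:

```python
# Keep the top 0.06% of sampled parameter settings by score, then fit a
# Gaussian KDE over them; scores here are random placeholders.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(7)

n = 100_000                                          # scaled down from 1,500,000
params = np.column_stack([rng.integers(1, 65, n),    # theta_d
                          rng.uniform(0.0, 0.99, n), # theta_dist
                          rng.integers(2, 201, n)])  # theta_n
scores = rng.normal(size=n)                          # placeholder preference scores

keep = scores >= np.quantile(scores, 1 - 0.0006)     # retain the top 0.06%
kde = gaussian_kde(params[keep].T)                   # density over retained settings
ground_truth_mode = params[keep][np.argmax(kde(params[keep].T))]
```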

4.2.2 Eye movement control for gaze-based selection

In our next study, we seek to delve deeper into the mechanics of human eye movement during target-search tasks on a two-dimensional (2D) screen, as previously explored by Zhang et al. (2010) and Schuetz et al. (2019). The study observes individuals as they engage in repeated tasks, noting an improvement in their task performance as they develop beliefs about the target location. However, fatigue sets in over time, leading to increased latency between eye movements. This latency masks the true individual characteristics of the human gaze, such as ocular motor noise and the spatial noise of peripheral vision, making it challenging to identify consistent gaze features as fatigue develops. Our goal is to create a robust model that accurately depicts eye movement latency, helping identify consistent gaze features amidst signs of fatigue.

User modelling

The user model serves as a surrogate for human behaviour within a simulated environment, aiming to locate a target on a 2D screen. This environment comprises:

  • Reinforcement learning agent: Acting as a surrogate for the human subject, the agent learns to locate and focus on a target on a 2D display. The agent’s training uses ground truth values for ocular motor noise (\(\theta _{om, t}\)), spatial noise of peripheral vision (\(\theta _{s, t}\)), and varied values for eye movement latency (\(\theta _{l, t}\)). Given that both actions and observations experience noise from \(\theta _{om, t}\) and \(\theta _{s, t}\), the agent needs multiple attempts to locate the target accurately:

    $$\begin{aligned} x_t&= \sum _{e} (2.7 \cdot \hat{A}^{(e)}(\theta _{om}, \theta _{s}) + \theta _{l, t}). \end{aligned}$$
    (5)

    where \(\hat{A}^{(e)}\) represents the eye movement amplitude function, e is the eye movement index, and \(x_t\) represents the total time recorded for the agent to locate the target. The training process, spanning 10,000 episodes, utilises a multilayer perceptron policy derived from the proximal policy optimisation (PPO) algorithm (Schulman et al. 2017).

  • Virtual environment: Within this simulated space, the agent acts, observes, and updates its beliefs, with all elements (including the target’s location) represented by coordinates \({c_1, c_2}\) in the range of \(-1\) to 1. As the agent assimilates noisy target observations, it adjusts its gaze and updates its belief system through a predefined mechanism. Once the agent’s gaze aligns with the target location, the task is deemed complete. This virtual environment was created by Chen et al. (2021) using the OpenAI Gym framework (Brockman et al. 2016).

State-space modelling framework

  • Observation model: The simulation that drives the eye movement task functions as the observation model, detailed in Eq. (5). However, due to the inherent complexity and opaque characteristics of the reinforcement learning components used within, it remains ambiguous how the parameters of the reinforcement learning policy (or the latent states) influence the time it takes for the agent to locate the target \(x_t\). These timings \(x_t\) subsequently form the observations for the SSM.

  • Transition dynamics: Our experiments assume a specific form for eye movement latency’s transition dynamics, which remains unknown to the methods:

    $$\begin{aligned} \theta _{l,t}&= 12 \cdot \log (t + 1) + 37. \end{aligned}$$
    (6)

    It is crucial to note that while latency changes, the way it influences the user model’s policy remains elusive, making the inference of other latent states non-trivial.

  • Latent states: The properties of human gaze behaviour serve as the latent states in our SSM, denoted by \(\varvec{\theta }_{*,t} = \{ \theta _{om, t}, \theta _{s,t}, \theta _{l,t} \}\). These parameters, which also shape the reinforcement learning policy, have the following prior distributions:

    $$\begin{aligned} \theta _{om, t}&\sim \text {Unif}(0, 0.2) \in \mathbb {R}, \\ \theta _{s, t}&\sim \text {Unif}(0, 0.2) \in \mathbb {R}, \\ \theta _{l, t}&\sim \text {Unif}(30, 60) \in \mathbb {R}. \end{aligned}$$

For generating observed data, the observation model utilised ground truth values of 0.01 for \(\theta _{om, t}\) and 0.09 for \(\theta _{s, t}\) across all t. The \(\theta _{l,t}\) value, however, was changing according to the aforementioned transition dynamics. Detailed information about the implementation is available in “Appendix C”.

4.3 Results and analysis

Table 2 Comparison of LFI methods (rows) in different SSMs (columns) for the state inference task
Table 3 Comparison of transition dynamics models (rows) in different SSMs (columns)
Table 4 Time comparison of LFI methods (rows) in different SSMs (columns) for training 50 time-steps
Fig. 4 The performance of LFI methods for state inference tasks with various simulation budgets in two non-stationary user modelling experiments. The box plots were computed from 30 repetitions with different random seeds. The horizontal line on each box shows the median; the box spans the upper and lower quartiles; the whiskers indicate the rest of the distribution; and the diamond points indicate outliers

The results for the inference and prediction tasks are presented in Tables 2 and 3, respectively. The lower the RMSE, the better the quality of estimation. In the inference task, the proposed LMC-based methods clearly outperformed the BOLFI and SNE approaches. This indicates that considering multiple objectives at the same time was beneficial for state inference and that the model actually leverages information from consecutive states without hindering performance. Additionally, it can be seen that all LMC-based variants performed differently, which can only be attributed to how the next simulations were chosen since the surrogate was exactly the same in all three methods. As the results show, having BNN as a model for state transition was beneficial for experiments with non-stationary user models, while having BLR was more preferable for simpler models. This suggests that BLR is expressive enough to replicate simple transitions but struggles with more complex ones, for which BNN is more suitable.

4.3.1 Learning transition dynamics

The comparisons with GP-SSMs and PR-SSMs for learning transition dynamics show that our method learns accurate dynamics, at least relative to the SSM baselines. The SSM methods showed worse results than the BLR and BNN approaches. This can be explained by the lack of observations for learning state transitions in the SSM methods, which also explains the high variance in the trajectories sampled from them. As for comparisons between BLR and BNN, BLR performs better only in the LG and SV models, while BNN performs better in the more complex case studies. Moreover, trajectory sampling from BLR is possible only by retaining all local linearizations of the dynamics, a far more limiting approach than having a single model. Therefore, the BNN is the preferable transition dynamics model.

4.3.2 Empirical time costs

The empirical time costs for running the LFI methods are shown in Table 4. It can be seen that the SNPE method was the fastest for the computationally cheap simulators (of SSMs with tractable likelihoods), while the LMC-qEHVI required the least amount of time for the non-stationary user models. This is expected since the SNEs learn the model only once and then simply use it for all observations, which is suitable for the computationally cheap simulations with simple LFI solutions. However, for non-stationary user models, where there are no closed-form likelihoods available, learning a single model actually requires much more time. To summarise, the LMC variants are clearly preferable for the computationally heavy simulators, which dominate the cost of training a transition dynamics model and a multi-objective surrogate.

4.3.3 Simulation budget impact

Finally, Fig. 4 shows how the performance of the LFI models changes with different simulation budgets: 2, 5, and 10 simulations per time-step. As expected, in general, all methods improved their performance with increased budgets. However, there is little difference in how these methods compare with respect to each other. This indicates that the results are not sensitive to the precise simulation budget.

4.3.4 Key findings of experiments

In all experiments, we attribute the success of the proposed LMC-BNN method to a more flexible multi-output surrogate and a more efficient way of choosing simulation candidates. The LMC allows multi-fidelity modelling (e.g., decomposing a stochastic process into processes with different length-scales), which makes it possible to leverage information from multiple consecutive time-steps, unlike standard GPs. At the same time, samples from the transition model provide better candidates for simulations than the alternatives. The flexible surrogate, along with adaptive acquisition, makes our method particularly suitable for online settings, where only a handful of simulations are possible per time-step.

While the experiments underline the strengths of our LMC-BNN method, it is equally important to be candid about its drawbacks. As highlighted earlier in this section, for more basic models, opting for BLR proves to be a more streamlined and beneficial approach than BNN. Moreover, as detailed in Sect. 4.3.2, in certain contexts, when the simulations are computationally cheap, SNPE methods outpace our method in terms of performance speed. For a comprehensive dive into the specific constraints and limitations of our modelling technique, please consult Sect. 3.4.2.

5 Discussion

We proposed an approach for state inference and prediction in the challenging SSM setting, where the transition dynamics are unknown and observations can only be simulated. Importantly, our model of transition dynamics was obtained with few simulations, making it suitable for cases with computationally expensive simulators. This is important because typically sample-efficient LFI approaches discard any temporal information from observed time-series and cannot do state prediction, which is necessary for choosing the next simulation when the simulation budget is limited. We proposed a solution for both of these challenges: we use a multi-objective surrogate model for the discrepancy measure between observed and synthetic data, which connects the consecutive states through shared parameters, and we train an additional surrogate for state transitions with samples from LFI state posteriors. Additionally, our method does not restrict the family of admissible solutions for the state transitions to being linear or Gaussian, unlike existing LFI methods for SSMs (Jasra et al. 2012; Martin et al. 2014), making it more widely applicable.

Although our method uses a more flexible surrogate for the LFI of states, we demonstrated that it requires neither additional data nor significantly more training time than traditionally used GP surrogates. We reached the sample-efficiency goal by sharing synthetic observations across all discrepancy objectives, allowing the method to use the same simulations indefinitely. As for the decreased training time, we proposed a moving window approach that allowed the surrogate to focus only on a few recent SSM time-steps at a time. In conclusion, having a more flexible surrogate improved state inference and provided better samples from state posteriors for learning the unknown dynamics.

The main limitation of our approach is that the proposed transition dynamics model does not account for long-term state dependencies. Our state transition surrogate considers only the most recent state as an input, assuming the Markov property, and therefore cannot forecast far into the future. The resulting predictions have a very low variance and a tendency to converge to similar values, which is expected when training on a single trajectory. Despite this limitation, our method remains effective in cases where the observation model serves as the primary source of information, while the transition dynamics model still plays a complementary role in state inference.