Efficient inference in state-space models through adaptive learning in online Monte Carlo expectation maximization

Expectation maximization (EM) is a technique for estimating maximum-likelihood parameters of a latent variable model given observed data by alternating between taking expectations of sufficient statistics, and maximizing the expected log likelihood. For situations where sufficient statistics are intractable, stochastic approximation EM (SAEM) is often used, which uses Monte Carlo techniques to approximate the expected log likelihood. Two common implementations of SAEM, Batch EM (BEM) and online EM (OEM), are parameterized by a “learning rate”, and their efficiency depend strongly on this parameter. We propose an extension to the OEM algorithm, termed Introspective Online Expectation Maximization (IOEM), which removes the need for specifying this parameter by adapting the learning rate to trends in the parameter updates. We show that our algorithm matches the efficiency of the optimal BEM and OEM algorithms in multiple models, and that the efficiency of IOEM can exceed that of BEM/OEM methods with optimal learning rates when the model has many parameters. Finally we use IOEM to fit two models to a financial time series. A Python implementation is available at https://github.com/luntergroup/IOEM.git.


Introduction
Expectation Maximization (EM) is a general and widely used technique for estimating maximum likelihood parameters of latent variable models (Dempster et al. 1977). It involves iterating two steps: computing the expected log-likelihood marginalizing over the latent variable conditioned on parameters and data (the E step), and optimizing parameters to maximize this expected log-likelihood (the M step). In important special cases the E-step is analytically tractable; examples include linear systems with Gaussian noise (Shumway and Stoffer 1982) and finite-state hidden Markov models (Baum 1972). In general however, Monte Carlo techniques such as Stochastic EM (SEM; Celeux and Diebolt 1985;Celeux et al. 1995) and Monte Carlo EM (MCEM; Wei and Tanner 1990) are necessary to approximate the required integral. The stochastic nature of Monte Carlo techniques result in noisy parameter estimates, and to address this, methods such as Stochastic Approximation EM (SAEM ;Nowlan 1991;Celeux and Diebolt 1992;Delyon et al. 1999) were developed that make smaller incremental updates parameterized by a learning rate γ or learning schedule {γ t }.
In this paper we focus on models where the latent variable has a longitudinal structure and follows a Markov model (see e.g. Lopes and Tsay 2010 for examples in financial econometrics). For such models, the required samples from the posterior distribution can be generated using Sequential Monte Carlo (SMC) techniques (see Doucet et al. 2001;Doucet and Johansen 2009 and references therein). In one approach, the Batch EM (BEM) algorithm processes a contiguous chunk of data to generate latent variable samples from the posterior, which are used in the M step to update parameters. An alternative approach is online EM (OEM; Mongillo and Denève 2008;Cappé 2009), in which parameters are continuously updated as data are processed. Analogous to SAEM, OEM algorithms have a parameter γ controlling the learning rate, an idea apparently first introduced in this context by Jordan and Jacobs (1993). Several recent papers have addressed related problems. For instance Yildirim et al. (2013) use a particle filter to implement an online EM algorithm for change point models (see also Fearnhead 2006;Fearnhead and Vasileiou 2009), which uses a pre-specified learning schedule (called "step-size sequence" in their work) to control convergence. Le Corff and Fort (2013) introduced a "block online" EM algorithm for hidden Markov models that combines online and batch ideas, controlling convergence through a block size sequence τ k .
All these algorithms thus require choosing tuning parameters in the form of a batch size, block sequence, learning rate or a learning schedule. It turns out that this choice can strongly influences the performance of these algorithms. For instance, for BEM, very large batch sizes lead to inaccurate estimates because of slow convergence, whereas very small batch sizes lead to imprecise estimates due to the inherent stochasticity of the model within a small batch of observations. The optimal batch size in BEM or the optimal learning rate in OEM depends on the particularities of the model. This raises the question of how to choose this tuning parameter. Several authors have proposed adaptive acceleration techniques for EM methods that obviate the need for choosing tuning parameters (Jamshidian and Jennrich 1993;Lange 1995;Varadhan and Roland 2008), but these methods require that the E-step is analytically tractable. In the context of (stochastic) gradient descent optimization (Bottou 2012), several influential adaptive algorithms have recently been proposed (Zeiler 2012;Kingma and Ba 2015;Mandt et al. 2016;Reddi et al. 2018) that have few or no tuning parameters. In principle, these methods can be used to find maximum likelihood parameters, but unless data is processed in batches, applying these methods to state-space models with a sequential structure is not straightforward. In addition, EM approaches enjoy several advantages over gradient descent methods, including automatic guarantees of parameter constraints and increased numerical stability (Xu and Jordan 1996;Cappé 2009;Kantas et al. 2009;Chitralekha et al. 2010).
Here we introduce a novel algorithm, termed Introspective Online EM (IOEM), which removes the need for setting the learning rate by estimating optimal parameterspecific learning rates from the data. This is particularly helpful when inferring parameters in a high dimensional model, since the optimal learning rate may differ between parameters. IOEM can be applied to inference in state-space models with observations Y t and state variables X t governed by transition probability function f (x t+1 |x t , θ) and observation probability function g(y t |x t , θ), for which f (x t |x t−1 , θ)g(y t |x t , θ) belongs to an exponential family with sufficient statistic s(x t−1 , x t , y t ). Broadly, IOEM works by estimating both the precision and the accuracy of parameters in an online manner through weighted linear regression, and uses these estimates to tune the learning rate so as to improve both simultaneously.
The outline of this paper is as follows. Section 2 introduces BEM, OEM, and a simplified version of IOEM in the context of a one-parameter autoregressive state-space model. Section 3 introduces the complete IOEM algorithm required for inference in the full 3-parameter autoregressive model. Section 4 discusses simulation results of the algorithms for these two models. In addition we consider a 2-dimensional autoregressive model to show the benefit of the proposed algorithm when inferring many parameters, and we demonstrate desirable performance in the stochastic volatility model, an important case as it is nonlinear and hence relevant to actual applications of SAEM. In Sect. 5 we apply IOEM with the autoregressive and stochastic volatility models to a financial time series, and we end the paper with a brief discussion.

EM algorithms for a simplified autoregressive model
Here we review BEM (Dempster et al. 1977), OEM (Cappé 2009) and SMC (Doucet and Johansen 2009), and present the IOEM algorithm in a simplified context. This illustrates the main ideas behind IOEM before presenting the full algorithm in Sect. 3.
We consider a simple noisily-observed autoregressive model with one unknown parameter, equivalent to an ARMA(1,1) model. We observe the sequence of random variables Y 1:t := {Y k } k=1,...,t that depend on the unobserved sequence X 1:t := {X k } k=1,...,t as follows: where W t and V t are i.i.d. standard normal variates, a = 0.95 and σ 2 w = 1 are known parameters, and σ 2 v is unknown. Under this model, we have the following transition and emission densities: We have chosen σ 2 v as the unknown parameter as it is the most straightforward to estimate, allowing us to introduce the idea of IOEM while avoiding certain complications that we address in Sect. 3. As f and g are members of the exponential family of distributions, the M step of EM can be done using sufficient statistics, and the E step amounts to calculating their expectation. In this model, the parameter σ 2 v has the sufficient statistic The estimate of σ 2 v is obtained by settingσ 2 v,t =Ŝ t . More generally, for an unknown parameter θ ,θ t = Λ(Ŝ t ) where Λ is a known function mapping sufficient statistics to parameter estimates.
To estimate S t , we use sequential Monte Carlo (SMC) to simulate particles X (i) 1:t and their associated weights w(X (i) approximates the distribution p(X 1:t |Y 1:t , θ). The standard MCEM approximation of p(X 1:t |Y 1:t ,θ) would require storage of all observations Y 1:t and simulation of X (i) 1:t each timeθ is updated, and ideally an increasing Monte Carlo sample size as the parameter estimates near convergence. To avoid this, we employ SAEM (Celeux and Diebolt 1992) which effectively averages over previous parameter estimates as an alternative to generating a new Monte Carlo sample every time an estimate is updated, and hence is more suitable to online inference. This method as proposed in Cappé and Moulines (2009) approximates the expectation in (2) recursively. The outline of the SMC with EM algorithm we consider in this paper is as follows (Doucet and Johansen 2009): Algorithm 1 Sequential Importance Resampling (bootstrap filter) For time t ≥ 1: 2. Compute normalized weights satisfying w t (X (i) t is shorthand for the t th coordinate of X (i) 1:t . In models with multiple unknown parameters, each parameter is updated in step 3 of the algorithm, however we will refer only to a single parameter θ to keep the notation simple.
Throughout this paper we follow common practice in using the fixed-lag technique in order to reduce the mean square error between S t andŜ t (Cappé and Moulines 2005;Cappé et al. 2007). We choose a lag Δ > 0 and at time t, using particles X (i) 1:t shaped by data Y 1:t , we estimate the t − Δ th term of the summation in (2). We will use X (i) 1:t (t − Δ) to denote the t − Δ th coordinate of the particle X (i) 1:t , but we will continue to write X (i) t as a shorthand for X (i) 1:t (t). (See Table 1 for an overview of notation used in this paper).
The fixed-lag technique involves making the approximation where we assume that S t can be written as This allows S t to be updated in an online manner by computing the component-wise sufficient statisticss with some learning schedule γ t ; in (3) γ t = 1/(t − Δ). This approach is slightly different from that of Cappé and Moulines (2005); see Sect. 7.1 for a discussion.
Choosing a large value of Δ allows SMC to use many observations to improve the posterior distribution of X t−Δ . However the cost of a large Δ is an increased path degeneracy due to the resampling procedure, which increases the sample variance. The optimal choice for Δ balances the opposing influences of the forgetting rate of the model and the collapsing rate of the resampling process due to the divergence between the proposal distribution and the posterior distribution. For the examples in this paper we chose Δ = 20 as recommended by Cappé and Moulines (2005), which seems to be a reasonable choice for our models.
There are various other techniques to improve on this basic SMC method, including improved resampling schemes (Douc and Cappé 2005;Olsson et al. 2008;Doucet and Johansen 2009;Cappé et al. 2007), and choosing better sampling distributions through lookahead strategies or resample-move procedures (Pitt and Shephard 1999;Lin et al. 2013; Doucet and Johansen 2009), which are not discussed further here. Instead, in the remainder of this paper, we focus on the process of updating the parameter estimateŝ θ t . The remainder of this section describes the options for step 3 of Algorithm 1.

Batch expectation maximization
Batch Expectation Maximization (BEM) processes the data in batches. Within a batch of size b, the parameter estimate stays constant (θ t =θ t−1 ) and the update to the sufficient statistics is collected at each iteration t. At the end of the mth batch we have t = mb, at which timeŜ The batch size determines the convergence behavior of the estimates. For a fixed computational cost, choosing b too small will result in noise-dominated estimates and low precision, whereas choosing b too large will result in precise but inaccurate estimates due to slow convergence.

Online expectation maximization
BEM only makes use of the collected evidence at the end of each batch, missing potential early opportunities for improving parameter estimates. OEM addresses this issue by updating the parameter estimate at every iteration. The approximation of S at time t is a running average of {s k } k=Δ+1,...,t , weighted by a pre-specified learning schedule. The choice of learning schedule determines how quickly the algorithm "forgets" the earlier parameter estimates. In OEM at time t, where {γ t } t=1,2,... is the chosen learning schedule, typically of the form for a fixed choice of c ∈ (0.5, 1] (Cappé 2009). Note that when using lag Δ, γ t = (t − Δ) −c for t ≥ Δ. This update rule ensures that at time t,Ŝ O E M is a weighted sum of {s k } k=Δ+1,...,t where the terms k has weight Although this method can outperform BEM as parameters are updated continuously, its performance remains strongly dependent on the parameter c determining the learning schedule γ t , and a suboptimal choice can reduce performance by orders of magnitude. At one extreme, the estimates will depend strongly only on the most recent data, resulting in noisy parameter estimates and low precision. At the other extreme, the estimates will average out stochastic effects but be severely affected by false initial estimates, resulting in more precise but less accurate estimates. Again, the best choice depends on the model.
A pragmatic approach to the problem of choosing a tuning parameter in OEM takes inspiration from Polyak (1990). In this method, a learning schedule that emphasizes incoming data is used to ensure quick initial convergence, while imprecise estimates are avoided at later iterations by averaging all OEM estimates beyond a threshold t 0 .θ Choosing an appropriate threshold t 0 can be more straightforward than choosing c for γ t = t −c , but it still requires the user to have an intuition for how the estimates for each parameter will behave. We will refer to this method as AVG, use c = 0.6, and set t 0 = 50,000 which is half the total iterations for our examples.

Introspective online expectation maximization
We now introduce IOEM to address the issue of having to pre-specify a learning schedule {γ t } t=1,... . The algorithm is similar to OEM, but instead of pre-specifying γ t , we estimate the precision and accuracy in the sufficient statistic updates {s k } k=Δ+1,...,t and use these to determine γ t+1 . More precisely, we keep online estimates of a weighted regression on the dependent variables {s k } k=Δ+1,...,t where k −t serves as the (shifted) explanatory variable:s where k ∼ N (0, σ 2 ), and data point (k −t,s k ) has weight (6) as before. This weighted regression results in intercept and slope estimatesβ 0 ,β 1 and variance estimatesσ 2 0 , σ 2 1 , where at convergenceβ 0 is the sought-after estimate andβ 1 0. We do not use standard weighted regression, in which weights are inversely proportional to the variance of the observation, as this assumption is not justified here and would lead to biased estimates ofσ 2 0,1 . Instead we assume that observations share an unknown variance, and we use the weights to modulate the influence of each observation to the regression estimates, to reduce the impact of the bias in earlier observations; see Sect. 7.2 for details.
We next use the regression coefficients to estimate the past iteration where the drift term |β 1 |(k − t) is of the same order as the uncertaintyσ 0 in the main estimateβ 0 : whereσ 1 ensures that division by zero does not occur, and α tunes the algorithms's sensitivity to model misfit due to underlying parameter changes; we use α = 1 unless stated otherwise. We propose a learning rate γ reg t+1 that results in a characteristic forgetting time 1/γ reg t+1 matching this distance: This choice ensures that a substantial slope estimate |β 1 | indicating thatβ 0 has low accuracy puts large weight on the incoming statistic, improving accuracy, whereas a largeσ 0 reflecting low precision in estimateβ 0 results in a small weight, smoothing out successive estimates and improving precision. We impose restrictions on γ t+1 which keep it between the most extreme valid learning schedules for OEM. Taken together, the update step for γ becomes where c > 0.5 is chosen to be very close to 0.5 and guarantees convergence. These restrictions ensure that our algorithm satisfies the assumptions of Theorem 1 of Cappé and Moulines (2009), namely that 0 < γ t < 1, ∞ t=1 γ t = ∞, and ∞ t=1 γ 2 t < ∞. Hence for any model for which f and g satisfy the assumptions guaranteeing convergence of the standard OEM estimator, the IOEM algorithm is also guaranteed to converge. The precise conditions are detailed in Assumption 1, Assumption 2, and Theorem 1 of Cappé and Moulines (2009).

Algorithm 3 Introspective Online Expectation Maximization for a simplified autoregressive model
For time t ≥ 1: 1. Simulate and calculate weights of particles using SMC with parameterθ t−1 2. Collect sufficient statistics 4. Perform weighted regression ons to calculate γ t+1 via (9-10).

The IOEM algorithm for the full autoregressive model
The adapting learning schedule {γ t } t=1,... sets IOEM apart from OEM. However, the way γ t is calculated in Algorithm 3 only works in the special case that a single sufficient statistic and the single parameter of interest coincide (here,σ 2 v,t =Ŝ t ). In general, the sufficient statisticsŜ are mapped to parameter estimatesθ by a function Λ, leading to a more involved setup that we explore here. To this end, we now consider the full noisily-observed autoregressive model AR(1) with master equations as in (1), but now with unknown parameters a, σ w , and σ v . We define four sufficient statistics, Then, in BEM and OEM, we update the parameter estimates tô whereŜ t is an approximation of S t . In most cases, as above, the function Λ mappingŜ t toθ t is nonlinear, and requires multiple sufficient statistics as input. To avoid bias, we want all sufficient statistics that inform one parameter estimate to share a learning schedule {γ t } t=1,2,... . We therefore estimate an adapting learning schedule for each parameter independently, by performing the regression on the level of the parameter estimates (Algorithm 4), rather than on the level of the sufficient statistics. We will calculateŜ t as in OEM (4) using our adapting learning schedule instead of a user specified learning schedule. Because the adapting learning schedule is specific to each parameter, we will have multiple estimates of certain summary sufficient statistics. In this case S 1,t and S 2,t are estimated byŜ a 1,t andŜ a 2,t for (11) and byŜ σ w 1,t andŜ σ w 2,t for (12). Simply regressing onθ 1:t with respect to t would correspond to regression onŜ 1:t , nots 1:t . AsŜ is a running average, there is a strong correlation betweenŜ t−1 andŜ t and hence also a strong dependence betweenθ t−1 andθ t . In order to perform the regression on the parameters we must "unsmooth"θ 1:t to create pseudo-independent parameter updatesθ t (see Algorithm 4). This is accomplished by taking linear combinations, where the coefficients are chosen so as to minimize the covariance between successive updates, justifying the term pseudo-independent. The resulting updates correspond with the unsmoothed sufficient statistics updatess t used in Sect. 2.3. See Sect. 7.3 for further details on this step.

Simulations
We performed inference on different models using the BEM, OEM and IOEM algorithms as described above. For BEM we used batch sizes from 100 to 10,000, and for OEM we used learning schedules γ t = t −c with c ranging from 0.6 to 0.9. In all cases the bootstrap filter was run with N = 100 particles, and the algorithm was run from t = 1 to t = 100,000. For all parameter choices, 100 independent replicates were generated, and we show the distribution of inferred parameter values across these replicates.

Inference with the simplified IOEM algorithm
We first applied the simplified IOEM algorithm (Algorithm 3) to the problem of inferring σ 2 v in model (1), with all other parameters assumed known, and compared the results with the BEM and OEM algorithms (Fig. 1). The choice of tuning parameter in BEM and OEM makes a significant difference to the precision of the estimate even after 100,000 observations. IOEM was able to recognize that behavior similar to BEM with b = 10,000 or OEM with c = 0.9 was optimal. The accuracy and precision of IOEM are comparable with those of the post-OEM averaging technique (AVG) with parameters c = 0.6 and t 0 = 50,000.

Inference with the complete IOEM algorithm
We next treated all four parameters of the AR(1) model (1) as unknown, and inferred them using the full IOEM algorithm (Algorithm 4). Estimates for the a parameter under different EM methods are presented in Fig. 2; for the other parameter inferences see Sect. 7.5, Fig. 6.
In the AR(1) model, IOEM outperforms most other EM methods when estimating the a parameter, while AVG for the chosen parameter settings (c = 0.6, t 0 = 50,000) provides slightly more precise estimates at similar accuracy. It is worth noting that in this case, OEM with c = 0.6 substantially outperforms OEM with c = 0.9, in contrast to the results shown in Fig. 1. This is a result of the bad initial estimates. OEM with c = 0.6 forgets the earlier simulations much faster than OEM with c = 0.9 and hence is able to move its estimates of a, σ w , and σ v much more quickly. Here IOEM recognizes that it should have similar behavior to OEM with c = 0.6, whereas in the inference displayed in Fig. 1 IOEM chose behavior similar to OEM with c = 0.9. IOEM can indeed adapt to the model.

Inference of multiple parameters
Next we investigated a model with a larger number of parameters and varying accuracy of initial parameter estimates. One of the advantages of the IOEM algorithm over OEM is its ability to adapt to each parameter independently. To highlight this, we applied IOEM to a simple 2-dimensional autoregressive model. For this model we consider the sequences {Y A , Y B } 1:t as observed, while {X A , X B } 1:t are unobserved, where Note that Y A and Y B are uncoupled, and that their master equation have independent parameters except for a shared parameter σ v . By giving component A good initial estimates and B bad initial estimates, we can see how the different EM methods cope with a combination of accurate and inaccurate initializations. IOEM is able to identify the set with good initial estimates (a A , σ A w ) and quickly start smoothing out noise. To IOEM, the other parameters appear to not have converged (σ B w and σ v because they are at the wrong value, a B because it will be changing to compensate for σ B w and σ v ). Figure 3 shows the inference of σ v , which due to its dependence on components A and B, suffers the most from a blanket choice of tuning parameter in BEM or OEM. OEM with c = 0.6 and OEM with c = 0.9 both suffer in this model as they are both well suited to parameter estimation in one of the components, but not the other. AVG provides precise but biased estimates in this case, because of its reliance on a fast-forgetting initial OEM stage which again is suited to only one of the model components. IOEM on the other hand is able to capture the best of both worlds, striving for precision in component A and initially foregoing precision in favour of accuracy in component B.
,000 is plotted for 100 replicates, N = 100 The inference of the other parameters and comparisons with a different choice of AVG threshold are shown in Sect. 7.5,Figs. 7,8,9,10.

Inference of parameters of a stochastic volatility model
The previous sections have demonstrated IOEM is comparable to choosing the optimal tuning parameter in OEM or BEM in certain models. However, the models shown have all been based on the noisily observed autoregressive model, which is a linear Gaussian case where in practice analytic techniques would be preferred over SAEM. We now examine the behaviour of these algorithms when inferring the parameters of a non-linear stochastic volatility model defined by transition and emission densities We define four summary sufficient statistics, Then the set of parameters that maximises the likelihood at step t arê Again IOEM results in similar estimates to the optimal BEM/OEM and the online averaging technique with a well-chosen threshold (see Fig. 4 and Sect. 7.5, Fig. 11).

Application to financial time series
We next applied our approach to daily log returns for US dollar to UK pound exchange rates, obtained from oanda.com. Between 18/05/2010 to 2/3/2016, roughly the period between the 2010 flash crash and the Brexit referendum, rates were fairly stable and might be described by an ARMA(1,1) model equivalent to (1). To assess confidence in estimates, we independently inferred model parameters 24 times from day-on-day log returns measured at every full hour (Fig. 5). We note that these time series are not fully comparable due to intraday seasonalities (Cornett et al. 1995), an effect that may be expected to increase the observed variation between the 24 time series, which would lead to conservative confidence estimates. Results suggest a weak negative correlation of successive daily log returns (a < 0), which is supported by a direct fit of an ARMA(1,1) model to the data (Fig. 12). Although the ARMA(1,1) model assumes fixed parameters and in particular constant volatility, running inferences strongly indicate volatility variations (Figs. 5 and 13), suggesting model (16) might be appropriate. Inferred values of φ indicate substantial day-to-day inertia in volatility. Running estimates of parameters are fairly constant in time, although those for β show that the model has difficulty tracking the two sudden drops in volatility that occurred in this period, indicating model misfit.

Conclusion
Stochastic Approximation EM is a general and effective technique for estimating parameters in the context of SMC. However, convergence can be slow, and improving convergence speed is of particular interest in this setting. We have shown that IOEM produces accurate and precise parameter estimates when applied to continuous state-space models. Across models, and across varying levels of accuracy of the initial estimates, the efficiency of IOEM matches that of BEM/OEM with the optimal choice of tuning parameter. The AVG procedure also shows good behaviour, but like BEM/OEM it has tuning parameters, and when these are chosen suboptimally performance is not as good as IOEM (Figs. 9 and 10). BEM/OEM/AVG all make use of a single learning schedule {γ t }, and for more complex models a single learning schedule generally cannot achieve optimal convergence rates for all parameters, as we have shown for the 2-dimensional AR example. In addition, AVG works by post-hoc averaging of noisy estimates, and since the inferences depend on the noisy estimates themselves, this implicitly relies on the model being sufficiently linear around the true parameter value. We expect IOEM to be more resilient to strong nonlinearities than AVG, but we have not explored this idea further here. IOEM finds parameter-specific learning schedules, resulting in better performance than standard methods with a single learning rate parameter are able to achieve. IOEM can be applied with minimal prior knowledge of the model's behavior, and requires no user supervision, while retaining the convergence guarantees of BEM/OEM, therefore providing an efficient, practical approach to parameter estimation in SMC methods. While not the focus of this paper, application to a financial time series suggests that IOEM may be useful in informally assessing model fit; it would be interesting to investigate whether this could be made rigorous.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Fixed-lag technique
Our fixed-lag technique is slightly different than that proposed in the literature (Cappé and Moulines 2005;Olsson et al. 2008). Compared to the existing approach it uses less intermediate storage. Recall that the approximation we aim to evaluate iŝ where the sufficient statistic is written explicitly as a sum over the path traced out by the particle X (i) 1:t . The drawback is that for u t the paths will have collapsed due to resampling, increasing the variance for those contributions to S. The solution proposed in Cappé and Moulines (2005) is to use instead the approximation This requires storing the quantities ...,t for each sufficient statistic and each particle. This storage can be expensive if large numbers of sufficient statistics are tracked. Instead, at iteration t we use the approxi-mationŜ By disregarding terms involving s u for u > t − Δ and switching the summation in this way, we can now updateŜ at each iteration by adding the contribution of the current particles to a single summary statistic at a distance Δ, without requiring per-particle storage other than each particle's recent history.

Weighted regression
The term "weighted regression" usually refers to regression where the errors are independent and normally distributed with zero mean and known variance (up to a multiplicative constant), and the data is weighted inversely proportionally to its variance. In our case, the data is assumed to drift, contributing an additional, non-independent term to the error. Weights are used to only focus on recent data where the drift contributes an error of the same order of magnitude as the normally distributed noise, while discounting the impact of data points further away. In this setup we are interested both in estimating the regression coefficients, and the error in these estimates. Perry Kaufman's adaptive moving average (AMA) (Kaufman 1995) is a similar averaging technique which reacts to the trends and volatility (jointly referred to as the behavior) of the sequence. The difference lies in the measure of the behavior. AMA relies on a user specified window length n. The n most recent data points are used to measure the behavior. This would be equivalent to using equally-weighted linear regression over the last n points. By using weighted regression, the contribution of points to the behavior measures is also influenced by the previously observed behavior. For example, a sharp trend will effectively employ a smaller n value as we have lost interest in the behavior before that trend.
Let X be the 2 × n matrix consisting of a column of 1s and a column with the dependent variable, let y be the vector of observations, let β be the two coefficients, and the vector of errors, with k ∼ N (0, σ 2 ). Finally let w be a vector of weights. We estimate β by minimizing where X w and y w are defined as Setting the derivative ∂s 2 /∂β = 2(X w β − y w ) X w to zero and solving for β results in weighted regression estimatorβ = (X w X w ) −1 X w y w , or explicitlŷ From this expression we can see thatβ can be updated in an online manner as k increases simply by updating the above summations. The variance inβ can be estimated as follows: If w 2 k = 1 this simplifies to the usual varβ = σ 2 (X X ) −1 . Writing out the expression for varβ explicitly shows that it is again possible to find online updates for the relevant terms.

Pseudo-independent parameter updates
In order to perform our regression on the level of the parameters, we need to map from s (t) toŜ (t) and then toθ (t) . We do not wish to regress onθ (1:t) , asθ (t−1) andθ (t) are highly correlated. Instead we want a sequence defined in the parameter space where the correlations resemble those ins (1:t) . We define this sequence as Here we show thatθ i andθ j are uncorrelated for all i = j, under the assumption that s i ands j are uncorrelated (i = j). Define {η t k } k=0,...,t to be the sequence that satisfieŝ S t = t k=0 η t ks k and t k=0 η t k = 1. Note that η t t = γ t , η t t−1 = γ t−1 (1 − γ t ), and so on. Now, Hence, ifs i ands j are independent for all i = j, thenθ i andθ j are uncorrelated (i = j), justifying the term "pseudo-independent updates" forθ i .

Notation reference
See Table 1.

Figures
See Figs. 6,7,8,9,10,11,12,13,14,15 and 16.  (1 − a 2 )σ 2 v /σ 2 w ) and ρ AR M A(1,1) = 1 + 2φθ + θ 2 /(φ + θ)(1 + φθ), where the AR coefficient φ, a governs the asymptotic falloff in correlation in the two parameterizations, and θ is the coefficient of the moving average component. Results indicate a non-significant trend for a negative day-to-day correlation of log returns across the period studied. Inferred values of ρ across 24 different hourly offsets are themselves correlated but not identical between the two models, as expected as IOEM emphasizes recent observations over earlier ones, in contrast to the global optimization used for fitting the ARMA(1,1) model

Fig. 13
Memory length (γ −1 t ) for the three parameters of model (1) for the 24 daily time series. The data does not appear to support a stationary model, particularly for σ w and σ v , as can be seen from the collapse of the memory length γ −1 Fig. 14 Memory length (γ −1 t ) for the three parameters of model (16) for the 24 daily time series. The inferred parameters (see Fig. 5) appear to support a stationary model, although the collapse in memory length for parameter β due to apparent drift indicates the model's difficulty in tracking sudden changes in volatility around t = 650 and t = 1150 Fig. 15 Running estimates of parameters of model (1) on daily GBP/USD log returns, and memory length, inferred using α = 1 Fig. 16 Running estimates of parameters of model (16) on daily GBP/USD log returns, and memory length, inferred using α = 1