Composable models for online Bayesian analysis of streaming data

Data is rapidly increasing in volume and velocity and the Internet of Things (IoT) is one important source of this data. The IoT is a collection of connected devices (things) which are constantly recording data from their surroundings using on-board sensors. These devices can record and stream data to the cloud at a very high rate, leading to high storage and analysis costs. In order to ameliorate these costs, the data is modelled as a stream and analysed online to learn about the underlying process, perform interpolation and smoothing and make forecasts and predictions. Conventional state space modelling tools assume the observations occur on a fixed regular time grid. However, many sensors change their sampling frequency, sometimes adaptively, or get interrupted and re-started out of sync with the previous sampling grid, or just generate event data at irregular times. It is therefore desirable to model the system as a partially and irregularly observed Markov process which evolves in continuous time. Both the process and the observation model are potentially non-linear. Particle filters therefore represent the simplest approach to online analysis. A functional Scala library of composable continuous time Markov process models has been developed in order to model the wide variety of data captured in the IoT.


Introduction
Observations across sensor networks are highly heterogeneous, such as weather and temperature, counts of cars and bikes or tweets from Twitter.In order to model this data we need a wide variety of observation models, not just the Normal distribution.Dynamic Linear Models (DLMs) are often applied to modelling time series data.They are discrete time, latent variable models which can be applied to a wide variety of problems, provided the data is Normally distributed and the model is linear.The Kalman filter is an analytical solution to find the distribution of the latent variables of a DLM, which can be used to perform forecasting (Kalman, 1960).The closed form solution can be found in the case of a linear-Gaussian model because of special properties of the Gaussian distribution; the sum of two Gaussian distributions is a Gaussian distribution and a linear translation of a Gaussian distribution is still Gaussian with predictable mean and variance.
In order to model non-linear systems, the extended (EKF) and later unscented Kalman filter (UKF) have been developed (Julier and Uhlmann, 1997).The extended Kalman filter linearises at the current time step, by calculating the Jacobian of the transition and observation functions at the given time and using these in the Kalman filter equations.The extended Kalman filter becomes unstable when applied to highly non-linear problems, so the unscented transform was introduced.
Particle filters can determine the state space of a non-linear, non-Gaussian latent variable model, with minimal modification between models.The UKF is more computationally efficient than the simulation based particle filter and can be more accurate when models are near-linear (Bellotto and Hu, 2007).However, particle filters allow more flexibility of model choice and as such are considered when performing inference for the composable models presented in this paper.As the number of particles in the particle filter is increased, the estimation error tends to zero, the same is not true of the UKF, since a limited number of sigma points are used (Simon, 2006).So with access to more computing power, the particle filter is preferred for accurate inference.Gordon et al. (1993) developed the theory of the bootstrap particle filter and compared the performance of the new filter to the Extended Kalman Filter (EKF).They found the performance of the bootstrap filter to be greatly superior to the EKF using a highly non-linear model.A target tracking model was also considered, where the bootstrap filter again outperforms the EKF.
Partially observed Markov processes (Ionides et al., 2006) with irregular observations are used in order to model the variety of data streaming from sensor networks in the internet of things.The evolution of the unobserved system state is governed by a diffusion process, a continuous-time Markov process.A diffusion process is the solution to a stochastic differential equation.The Markov property of the diffusion process can be used to forward simulate the state between irregular observations given the previous state estimate and time different between observations.POMP models are flexible in the choice of state distribution and observation distribution, hence they can be used to model a wide variety of processes, such as counting processes.For a full Bayesian analysis of a POMP model, the full-joint posterior distribution of these models can be determined using the Particle Marginal Metropolis Hastings algorithm (PMMH) (Andrieu et al., 2010).The PMMH algorithm requires a batch of observations to determine the parameters.On-line filtering and forecasting can be carried out using a particle filter with per-determined parameters, and occasional retraining of the model can be carried out as required.On-line learning of model parameters is an active research area not pursued here; see Carvalho et al. (2010) for more details.
It is useful to reuse and combine existing models when exploring new data and building more complex models.The POMP models considered in this paper can be composed together to create new models.A software package has been written in Scala (Odersky and al., 2004) which allows users to build and compose models and perform inference using the bootstrap particle filter and the PMMH algorithm.
There are other software packages for performing inference on POMP models: LibBi (Murray, 2015) implements inference for general state-space models using particle Monte Carlo Markov chain (pMCMC) methods and sequential Monte Carlo (SMC).It is optimised for parallel hardware and utilises CUDA (a parallel programming platform by NVIDIA) for GPU (Graphics Processing Unit) programming.There is also an R (R Core Team, 2015) package, POMP (King et al., 2015) which implements Frequentist and Bayesian methods of parameter inference for POMP models.However, neither package supports online analysis of streaming data.
The composable POMP models presented in this paper have been developed to analyse data in the Urban Observatory (James et al., 2014), a grid of sensors deployed around the city of Newcastle Upon Tyne.In order to demonstrate the utility of this class of composable POMP models, there are two examples presented in Section 5.The first example consists of arrival times of traffic at specific locations in the city, which is modelled using a Poisson distribution with seasonally varying rate.The second example consists of temperature data collected from urban sensors in Newcastle Upon Tyne (James et al., 2014) and demonstrates the application of continuous time state space POMP models to irregularly observed time series data.
Section 2 introduces Partially Observed Markov Processes for modelling a variety of time evolving data.This includes binary (Bernoulli), count (Poisson), time to event (Log-Gaussian Cox-Process) and seasonal data.Section 3 describes how to compose models together to enable complex models to be developed.Section 4.1 presents the bootstrap particle filter for calculating the latent state of the model, and how it can be used to calculate an estimate of the marginal likelihood of the observations given a set of parameters.Section 4.2 outlines the Particle Marginal Metropolis Hastings algorithm, which can be used to calculate the full joint posterior of a POMP model.

Partially Observed Markov Processes
Streaming data arrives frequently and discretely as an observation, y, and a timestamp, t.In order to model this time series data, a structured model is proposed: (1) Y (t i ) denotes the observation at time t i , where π(y(t i )|η(t i ), θ) is the observation distribution.X(t i ) is the (possibly multidimensional) unobserved state of the process.The state is Markovian and is governed by the transition kernel p(x(t i )|x(t i−1 ), θ) that we assume realisations can be generated from, but in general may not represent a tractable distribution.The prior distribution on the state is given by p(x(t 0 )|θ).F t is a time dependent vector, the application results in γ(t) = F T t x(t), where γ(t) ∈ R. The The model is expressed as a directed graph in Figure 1.Although the latent state evolves continuously, it is indicated only when there is a corresponding observation of the process.

Exponential Family Models
Not all processes have Gaussian observation models.To this end, we propose modelling observations using exponential family models which include, Poisson, Bernoulli, Gamma, Exponential and the Gaussian distribution among others.
The exponential family of dynamic models are parameterised by x(t i ), V (t i ) and three known functions b(Y (t i ), V ), S(Y (t i )), a(x(t i )), the density is given by: x(t i ) is called the natural parameter of the distribution, V > 0 is the scale parameter, S(x(t i )) is the sufficient statistic, b(Y (t i ), V ) is the base measure and a(x(t i )) is known as the log-partition.Our model class is then a generalisation of dynamic generalised linear models (West and Harrison, 1997).

Binary Outcomes: The Bernoulli Distribution
The Bernoulli distribution is used to model a binary outcome.It is parameterised by p, representing the probability of success (or true).In order to represent observations from a Bernoulli distribution within the exponential family of distributions the natural parameter is the logit function, which gives a value for the probability mass function of: The natural parameter x(t i ), is the unobserved latent state and varies in continuous time according to a Markov process.The logistic function is the linking function used to transform x(t i ) into a value suitable to represent a probability, p(t i ) = exp(x(ti))  1+exp(x(ti)) .A model specification with a generalised Brownian motion state (see Section 2.4) is presented below: , The latent state, x(t) is one-dimensional.Figure 2 shows a simulation from the Bernoulli POMP model on a regular integer lattice.

Modelling Count Data: The Poisson Distribution
The Poisson is one distribution used for modelling count data.The Poisson distribution can be represented as an exponential family model; let S(Y Y (ti)! and the scale parameter V = 1.The natural parameter, x(t i ), is unobserved and varies in continuous time according to a Markov Process.The realm of Y (t i ) is Y = Z + , the positive integers including zero.The full specification is given as: The rate of the Poisson distribution, λ(t i ), changes with time, and is constant between time steps.For instance, if count observations are aggregated in non-overlapping one hour windows, the rate is considered constant within each hour.A more realistic model of time-to-event data is presented later in Section 2.3.Figure 3 shows simulation from a Poisson POMP Model (top) with a generalised Brownian Motion latent state (bottom) and the time varying rate of the Poisson distribution (middle).

Modelling Seasonal Data
Many natural phenomena feature predictable periodic changes in the rate of their process.For instance, when monitoring urban traffic, traffic flow is higher around 9am and 5pm as commuters make their way to and from the city.In order to model a periodic process, a time-dependent linear transformation vector of length n is used: , where ω = 2π T is the frequency, T represents the period of the seasonality and h represents the number of harmonics required.The phase and amplitude of the waves are controlled by the value of the latent state, if the h th harmonic at time t i is given by: the phase of the wave is ϕ = arctan(−x 2 (t i )/x 1 (t i )) and the amplitude is, A = x 2 (t i ) 2 + x 1 (t i ) 2 .This model has Normally distributed observations, and is given by: The latent state of the seasonal model is 2h-dimensional.Figure 4 shows a simulation from the seasonal model.

Time to Event Data: The Log-Gaussian Cox-Process
The Log-Gaussian Cox-Process (LGCP) is an inhomogeneous Poisson process whose rate is driven by a log-Gaussian process.In an inhomogeneous Poisson process the hazard, λ(t), varies in time according to a log-Gaussian process.The total number of events in the interval (0, T ] is distributed as: In the Log-Gaussian Cox-Process, the hazard, λ(t) is log-Gaussian.The general form of the LGCP POMP model is given by, where p(x(t i )|x(t i−1 ), θ) is the transition kernel of the Ornstein-Uhlenbeck process or other Gaussian Markov process.Figure 5 shows a simulation from the Log-Gaussian Cox-Process, using an approximate simulation algorithm presented in Appendix A.2.

The Latent State: Diffusion Processes
The system state evolves according to a continuous time Markov process.Any Markov process can be used to represent the state; however in this paper, the focus is on Itô diffusion processes.Diffusion processes are represented by stochastic differential equations (SDE), (see Oksendal (2013) for a detailed treatment).An SDE is a differential equation with a stochastic element, used to represent the continuoustime evolution of a stochastic random variable.An SDE of the form, is used to govern the time evolution of the Markovian latent state.Generally, X ∈ R n and µ : R n → R n is referred to as the drift coefficient, σ : R n → R n×m is the diffusion coefficient and W (t) is a Wiener Process.
If an SDE doesn't have an analytic solution, it can be simulated approximately using the Euler-Maruyama method (Kloeden and Platen, 1992).The Euler-Maruyama approximation gives an approximate solution to a stochastic differential equation, as a Markov chain.Start with a general SDE for a diffusion process: The interval of interest, (0, T ] is partitioned into n even sized sub-intervals, ∆t = T /n.Define a value for x(0), the initial value of the Markov chain, then where W n ∼ N (0, ∆t), are independent and identically distributed Normal random variables with mean 0 and standard deviation ∆t.The approximate transition for any diffusion process can then by written as, for sufficiently small ∆t.The transition to the next state only depends on the current value, hence the Euler-Maruyama approximation scheme for stochastic differential equations is a Markov Process.Analytic solutions of diffusion processes are also Markovian, necessarily.

Composing Models
In order to model more complex phenomena, such as the traffic example presented in section 5.1, it is convenient to compose simple model components to form a more complex model.If the traffic is considered as count data, the observation distribution can be Poisson.Since the traffic data also displays daily and weekly seasonal cycles, then the instantaneous hazard, λ(t) must vary periodically with time.
In order to account for the two periods of seasonality with a Poisson observation model, a seasonal-Poisson model is formed by the composition of a Poisson model and two seasonal models, one with a weekly period and the other with a daily period.Consider an arbitrary model, with its associated functions, distributions, parameters and latent state indexed by j.
Now define the composition of model M 1 and M 2 as M 3 = M 1 M 2 .Firstly consider the composition of observation models, by convention the observation model will be that of the left-hand model and the observations model of the right-hand model will be discarded.As such, the non-linear linkingfunction must be that of the left-hand model, g 1 : R → D. The linking-function ensures the state is correctly transformed into the parameter space for the observation distribution.
In order to compose the latent state, the initial state vectors are concatenated: ) .
The composed model's transition function for the state is given by: 2) ) .
In order to compose the linear deterministic transformation vectors, F (j) (t), j = {1, 2}, the vectors are concatenated.The vector dot-product with the latent state is then computed, so the result of applying the concatenated vector to the state is the same as applying each model vector to their respective state and adding the results: The full composed model, M 3 can then be expressed as follows: ) .
Note that models form a semi-group under this composition operator, however the operator is not commutative.Figure 6 shows a directed acyclic graph of a composed model.Further details on composing models in Scala are presented in appendix A.3.

Example: A Seasonal-Poisson Model
To illustrate model composition, consider a Poisson model with a time dependent rate parameter, λ(t) which varies seasonally.This model is the composition of two models, a single Poisson model, M 1 , and a single seasonal model, M 2 , to make M 3 = M 1 M 2 .This model could represent the flow of traffic through a city, as in the example in Section 5.1.
The Poisson model has a 1-dimensional state, which varies according to generalised Brownian motion.The linking function is the exponential function, g(x) = exp(x), and the linear transformation vector is the identity, F (1) ti = 1.The Poisson model is given by: The latent state of the seasonal model has a dimension of 2h, where h represents the number of harmonics.Generalised Brownian motion is used to represent the time evolution of the 2h-dimensional latent state, hence the drift coefficient µ is 2h×1 vector and the diffusion coefficient, Σ, is a 2h×2h diagonal matrix.The standard seasonal model has the Normal distribution as the observation distribution, and the linking function is the identity function: ti is a 2h × 1 vector of fourier components: Figure 6: A directed acyclic graph representing the composition of two models, the latent state is represented as separate, and advancing according to each model components transition kernel p i ω = 2π/T is the frequency and T is the period of the seasonality.In order to model a daily seasonality, T = 24, if the observation timestamps are in hour units.
In the composed model, the left-hand observation distribution is chosen, in this case the Poisson distribution.This necessitates the left-hand linking function (g(x) = exp(x)) since the rate parameter of the Poisson distribution is greater than zero, λ(t) > 0. The full composed model can be expressed as: The system state is n = 2h+1 dimensional, hence the drift coefficient µ is n×1 vector and the diffusion coefficient is an n × n diagonal matrix.The linear transformation vector F (3) ti is a time-dependent, n × 1 vector, which is the concatenation of the two F (j) ti , j = 1, 2, vectors:

Example Composition in Scala
Firstly, define a Poisson model with a stepFunction representing the solution to a Markov process and the models associated parameters.The LeafParameter contains the parameters for the initial state and Markov transition kernel, the None in the middle corresponds to a scale parameter required by some observation distributions (however not required for the Poisson model).Now define a single seasonal model: The seasonal model has a Gaussian observation model, in order to use the seasonal model by itself the measurement noise for the Gaussian observation model must be specified in the additional parameters (currently set as None).Now, composing the parameters and model: The parameterisedModel can be simulated from, it can be fit to observed data using the PMMH algorithm and particle filter and can used for online filtering as described in Section 4.

Statistical Inference
Consider a POMP model of the form, observed at discrete times, T = {t i |i = 1, . . ., M }: The joint distribution of all the random variables in the model can be written as: Since the latent state is a Markov process, the joint distribution can be written as the product of the distribution over the parameters, p(θ), the initial distribution of the latent state p(x(t 0 )|θ), the transition density, p(x(t i )|x(t i−1 )) and the likelihood π(y(t i )|η(t i ), θ).In general this likelihood is analytically intractable, the bootstrap particle filter introduced in Section 4.1, can be used to calculate the pseudomarginal likelihood and latent state of a POMP model, given observed data and parameters.Particle Marginal Metropolis-Hastings, presented in Section 4.2, is used to determine the full joint-posterior distribution, p(x(t 0:M ), θ|y(t 0:M )), utilising the pseudo-marginal likelihood estimated using the bootstrap particle filter.

State Estimation: The Bootstrap Particle Filter
The bootstrap particle filter (Gordon et al., 1993) is a simple filter able to estimate the latent state of a POMP model.The hidden Markov process has an associated transition kernel X(t i )|x(t i−1 ) ∼ p(x(t i )|x(t i−1 )), from which realisations can be sampled.The transition kernel can either be an exact solution to an SDE or an SDE simulated on a fine grid using the Euler-Maruyama Method.The process is observed through an observation model The bootstrap particle filter is used to calculate the unobserved system state by approximating the filtering distribution, p(x(t i )|y(t 0:i )).The algorithm to determine an empirical approximation of the unobserved system state at each observation time is presented below: 1. Initialise: Simulate N particles from the initial distribution of the latent state, x(t 0 ) (k) ∼ p(x(t 0 )), and set i = 1 2. Advance the particle-cloud until the time of the next observation, 3. Transform the each particle appropriately for the given observation model, 4. Calculate the weight of each particle, by calculating the likelihood of the observation given each particle: The particles and associated normalised weights form a weighted sample from the filtering distribution p(x(t i )|y(t Resample the state, by sampling with replacement N times from a Multinomial distribution with each category representing a particle and the probability of choosing a particle represented by the associated weights, w(t i ) (k) .This gives an approximate random sample from p(x(t i )|y(t 0:i )) 7. Return the random samples from the filtering distribution p(x(t i )|y(t 0:i )).If i < M set i = i + 1 and go to step 2, else stop.
The average of the un-normalised weights at each time point gives an estimate of the marginal likelihood of the current data point given the data observed so far: p(y(t i )|y(t The estimate of the likelihood of the full path is given by: p(y(t 1:M )) = p(y(t 1 )) The estimated marginal likelihood is consistent, meaning that as the number of particles are increased, N → ∞, then the estimate converges in probability to the true value.The estimate is also unbiased, meaning E(p * θ (y(t 1:M ))) = p * θ (y(t 1:M )), see Del Moral ( 2004) for a proof.This marginal likelihood is used in the Particle Marginal Metropolis Hastings (PMMH) algorithm discussed in Section 4.2.
In practice, the bootstrap particle filter can be easily parallelised, advancing the state and calculating the likelihood are naturally parallel.However, Multinomial resampling (as described above) requires communication among multiple threads to sum the value of the weights.Other resampling schemes, such as stratified resampling and systematic resampling are more amenable to parallelisation.An overview of resampling schemes used in the particle filter are considered by Murray et al. (2016).

Implementation (Fold and Scan)
The models and algorithms described in this paper have been implemented in Scala, a functional, objectoriented language which runs on the Java Virtual Machine (JVM).This means the code can be deployed without change across different operating systems and on servers in the cloud which have the JVM installed.Observations arrive as a stream, a stream can be thought of as a lazy list.A list can be represented recursively as a pair, with the first element a single value at the head of the list and the second element another list (called the tail), this is called a cons list.A lazy list, or stream, is also a pair, with the first element being a computed value and the second element a function with no parameters (in Scala, a lazy val).The stream can be forced to evaluate its contents, for instance when performing a side effect such as printing to the screen or writing to a file.In practice, when considering live streaming data, the function in the second element of the pair might be one which polls a web service for the next observation.The approach taken in this paper is similar to Beckman's series of papers on Kalman folding, implementing the Kalman filter on streams in the Wolfram language (Beckman, 2016).
foldLeft is a function which operates on a recursive datatype, such as list or stream.The function signature of foldLeft is given by: foldLeft is used to apply a function to elements of a Foldable (Stream is a Foldable data structure) pairwise in order to calculate an aggregated result.The function f takes two arguments, the first of which is of type B (a placeholder for any type) and the second is of type A, the same as the element type of the Stream (whatever that may be).The first application of f, is to the first element of the stream and z (the so called zero of the fold).The result is then passed to the next application of f, along with the second element of the stream.
For instance, consider adding values in a list using foldLeft: The first application of sum2 will add the zero element, 0, to the first element in the list (from the left): sum2(0, 1) = 0 + 1 = 1.Then the next element of the list and the previously evaluation of sum2 is passed to the next evaluation of sum2: sum2(1, 2) = 1 + 2 = 3.This is repeated until the list is exhausted and the value returned is 15.In this way, the list is summed without mutating state in the program.Also note that if A = B and the function f: (A, A) => A is associative, the fold can be computed in parallel via a binary tree reduction.
In the application of the bootstrap particle filter there is an internal state which propagates as the stream of data arrives.The internal state includes the particle cloud, the time of the most recent observation and the log-likelihood.The function f in foldLeft can be implemented as a single step of the particle filter.An illustrative implementation of f, called filterStep is implemented in the code block 1. Firstly a trait is defined containing abstract implementations of three important functions in the particle filter, which will be implemented when applying the particle filter in practice.stepFunction and dataLikelihood are implemented in each model and can be specified in a concrete class for a specific particle filter implementation.The resample function is not model specific, and Multinomial resampling is typically employed.Then the states are resampled to get an approximate unweighted random sample from p(x(t i )|y(t 1:i )).Each particle is advanced independently by applying the function stepFunction to each particle using the higher-order function map which has the following signature: map is used to apply a function to the inner type of the Vector, in this case A. Each particle weight is calculated independently using map and dataLikelihood.In practice, we compute the log-likelihood to avoid arithmetic underflow from multiplying many small values.The value of the log-likelihood, pθ (y(t 1:M )), is updated by adding the log of the mean of the weights to the previous value.
In order to calculate the log-likelihood (ie a single value) of a path, the function foldLeft can be employed: Listing 2: Scala Code to calculate the log-likelihood of a set of observations the value initState is implemented by sampling 1,000 times (equivalent to 1,000 particles) from the initial state distribution of the model and initialising the weights to be 1/1,000.The initial time t0 is taken to be 0.0 and the initial value of the log-likelihood is set at zero by convention.The log-likelihood is extracted by appending .ll on the call to foldLeft.
A closely related function to foldLeft is scanLeft which will return an aggregated stream of reduced values using a provided function, the signature is: In order to understand scanLeft, consider the application of sum2 to a stream of natural numbers: then an infinite stream containing the cumulative sum will be returned, 0, 1, 3, 6, ....The following code block runs a particle filter on a stream of data, where filterStep, initState and data are the same as supplied above in 2.

scanLeft ( data , initState ) ( filterStep )
this code accumulates the particle cloud (states and associated weights) and the running log-likelihood into a stream, using the reduction function filterStep.

Filtering for the Log-Gaussian Cox-Process
When considering observations from the Log-Gaussian Cox-Process (LGCP), the filtering needs to be performed slightly differently.The likelihood for the LGCP is given by: π(y(t)|λ(t), Λ(t)) = λ(t) exp(−Λ(t)), notice that the likelihood depends on the instantaneous hazard λ(t) and the cumulative hazard Λ(t) = t 0 λ(s)ds.In practice the log-likelihood is calculated to avoid arithmetic overflow and to simplify calculations, the log likelihood is given by, = log(λ(t)) − Λ(t).The state must be augmented to include Λ(t) in addition to λ(t).In practice the value of Λ(t) is often calculated approximately, using numerical integration.

Parameter Estimation
The Particle Marginal Metropolis-Hastings algorithm (PMMH) (Andrieu et al., 2010) is an offline Markov chain Monte Carlo (MCMC) algorithm which targets the full joint posterior p(x(t 0:M ), θ|y(t 1:M )) of a partially observed Markov process.Consider a POMP model given by, where θ represents the parameters to be estimated using the PMMH algorithm.The parameters include the measurement noise in the observation distribution, π(y(t i )|η(t i ), θ), the parameters of the Markov transition kernel for the system state, p(x(t i )|x(t i−1 ), θ) and the parameters of the initial state distribution p(x(t 0 )|θ).The data, y(t), is observed discretely.In order to simulate a Markov chain which targets the full posterior, p(θ, x|y), firstly a new set of parameters θ * is proposed from a proposal distribution q(θ * |θ).Then the bootstrap particle filter (see Section 4.1), is run over all of the observed data up to time t using the newly proposed θ * .The output of running the filter with the new set of parameters, θ * is used to estimate the marginal likelihood, pθ * (y) = n i=1 pθ * (y(t i )|y(t i−1 )) and, optionally, to sample a new proposed system state, x * from the conditional distribution p(x * |θ * , y).The pair (θ * , x * ) are accepted with probability min(1, A), where A is given by: the distribution p(θ), represents the prior distribution over the parameters.It is shown in Andrieu et al. (2010) that this algorithm has as its target the exact posterior p(θ|y), or optionally, p(θ, x|y).
The Metropolis-Hastings Kernel, can be simplified in the case of a symmetric proposal distribution.For a symmetric proposal distribution q(θ * |θ) = q(θ|θ * ).Commonly, the proposal is chosen to be a Normal distribution centered at the previously selected parameter, this is know as a random walk proposal, q(θ * |θ) = N (θ, σ), where σ is a parameter controlling the step size of the random walk.If a flat prior distribution is chosen, then the ratio in Equation 9 can be simplified further to: The full-joint posterior distribution is explored by performing many iterations of the PMMH algorithm, discarding burn-in iterations and possibly thinning the iterations to get less correlated samples from the posterior distribution.

Implementation (MCMC as a stream)
The Particle Marginal Metropolis-Hastings algorithm must be applied to a batch of data.Window functions, such as grouped, can be applied to a stream of data to aggregate observations into a batch.grouped accepts an integer, n, and groups each observation into another (finite) stream of size n.
The PMMH algorithm can then be applied to the aggregated group using map.Iterations from the PMMH algorithm are naturally implemented as a stream.In the Scala standard library there is a method for producing infinite streams from an initial seed: iterate applies the function f to the starting value, then passes on the result to the next evaluation of f.For example, to create an infinite stream of natural numbers: Iterations of an MCMC algorithm can be generated using iterate, by starting with an initial value of the required state (at a minimum the likelihood and the initial set of parameters) and applying the Metropolis-Hastings update at each iteration.Inside of each application of f, a new value of the parameters is proposed, the marginal likelihood is calculated using the new parameters (using the bootstrap particle filter) and the Metropolis-Hastings update is applied.
An illustrative example of a single step in the PMMH algorithm using the Metropolis Kernel is in code block 3. Three important functions are given abstract implementations in the MetropolisHastings trait, proposal, prior and logLikelihood.The proposal: Parameters => Rand[Parameters] is a function representing the (symmetric) proposal distribution, Rand is a representation of a distribution which can be sampled from by calling the method draw.logLikelihood: Parameters => LogLikelihood is a particle filter, with the observed data and number of particles fixed, which outputs an estimate of the log-likelihood for a given value of the parameters.prior: Parameters => LogLikelihood represents the prior distribution over the parameters.These three functions will be implemented in a concrete class extending the MetropolisHastings trait and correspond to specific implementation of the PMMH algorithm.In order to generate a stream of iterations, use iterate: Where initParams are drawn from the prior distribution and the initial value of the log-likelihood is chosen to be very small so the first iteration of the PMMH is accepted.
Built in stream operations can be used to discard burn-in iterations and thin the iterations to reduce auto-correlation between samples.The stream can be written to a file or database at each iteration, so the PMMH algorithm implemented as a stream uses constant memory as the chain size increases.

Examples
In this section two datasets are considered, the first consists of traffic arrival time at various stations around Newcastle Upon Tyne.The second consists of urban temperature measurements recorded on a sensor network also deployed around Newcastle Upon Tyne in the Urban Observatory (James et al., 2014).

UTMC Traffic Data: Seasonal-Poisson Model
The Urban Traffic Management Control1 , operates a grid of sensors deployed across the five districts of Tyne and Wear with the aim to analyse traffic flow in order to prevent congestion.The sensor locations are plotted on the map in Figure 7.
The raw data consists of a timestamp indicating the time the vehicles passed a monitoring station along with their direction, lane and speed.A useful way to summarise the data is by aggregating the count of cars per hour; counts of cars per hour at a station on Stamfordham Road are plotted on the right of Figure 7.There is a strong daily seasonal component, with a slightly less pronounced weekly component.There is a distinct systematic drop in traffic on weekends, this coincides with fewer commuters on the road.Along with the daily and weekly seasonality, there are two peaks each day, coinciding with rush hour in the morning and the evening.Grouping the data this way naturally leads to a Poisson observation model.
In order to model this data, a composition of three models is used.The first model in the composition is Poisson: The observation distribution is Poisson, hence the linking function is the exponential function, the linear-transformation vector is the identity, F (1) t = 1, the latent state is one-dimensional and evolves continuously according to generalised Brownian motion.
The second model in the composition is a seasonal model, representing daily seasonality: The linking function is the identity function, the observation distribution is Gaussian.The state is 2h 1 -dimensional, where h 1 is the chosen number of harmonics, and evolves according to a multivariate Ornstein-Uhlenbeck process.The linear-transformation vector is given by: , where ω 1 = 2π/T 1 and T 1 = 24, since the data is aggregated in hourly intervals.There are three harmonics in the model.The third model component in the composition is another seasonal model, representing weekly seasonality.This model component is the same as the model in equation 11, but the period of the seasonality is T 2 = 24 × 7, hence ω 2 = 2π/T 2 .
The model is then composed, with the observation distribution and linking-distribution taken from the left-hand (Poisson) model: The latent state from each model is concatenated, and each models state advances according to its own transition kernel.The vector, F t is the concatenation of the three linear-transformation vectors from Poisson, daily seasonal and weekly seasonal model: The parameter posterior distribution is determined by running the PMMH algorithm on a batch of 1,000 data points (each data point is the count of vehicles passing station 62 in an hour), consisting of approximately six weeks of data.
The mean of the parameter posterior distributions can be used to run a particle filter on future observations to determine the filtering distribution, p(η(t i )|y(t i ), θ).The mean of the filtering distribution is plotted alongside previously unseen observations in Figure 8.It is also possible to draw from the parameter posterior when performing forecasting, in order to take into account uncertainty about the parameters, θ.

Irregular Observations: Temperature Sensors in the Urban Observatory
One of the strengths of partially observed Markov process models with continuous time latent state is modelling irregularly observed data.The Urban Observatory Environment data James et al. (2014) contains many Temperature sensors, which vary in sampling rate and sometimes stop sampling altogether.Table 1 shows the first six observations in July from a station outside the Key building in Newcastle Upon Tyne.The difference between most observations is approximately one minute, however the difference between observations four and five is three minutes.This is typical of sensor data, as some sensors sample adaptively, or lose power due to battery failure.
The linear transformation vector is the identity, F (1) t = 1 and the linking function is also the identity.The second model is a seasonal model with the Ornstein-Uhlenbeck process representing the evolution of the state space: The linear transformation vector is used to represent the seasonality in the model and contains three harmonics: The model composition is then: The latent state is concatenated and the state from each model component evolves according to its transition kernel.The linking-equation is the identity, since the mean of the observation distribution (Gaussian) is free to vary over the entire real line.F t is a (2 * h + 1) × 1 vector, the seasonal model will have 3 harmonics, hence F t is a 7 × 1 vector: This model shows the ability of the POMP models to accurately model irregularly observed data.Figure 10 shows a plot of the one-step forecasts (stepped forward to the time of the next observation).The mean of the parameter posterior distribution was used to perform the one-step forecast.

Conclusion
Composable Markov process models have many applications.The class of composable models with the composition operator form a semi group, making it easy to build complex models by combining simple models.The use of the particle filter for simulation based inference has allowed for a flexible choice of Further, by using a continuous time Markov process to represent the latent state, the requirement to simulate the state on a regular grid is relaxed.If there is an exact solution to the Markov process, then the latent state can be simulated forward to the next time of interest using the time increment and the previous realisation of the state.This allows for faster computation when using real world data, since many sensors sample at an adaptive rate, or consumption must be slowed down due to limited computing resources.
Incoming observations are represented as functional streams, which can represent live observations from a webservice or other source.This allows for a flexible and composable model framework which can be tested on simulated data and the same code deployed for real world applications.The particle filter is implemented as a functional fold, a higher-order function allowing recursive data structures (such as the cons-list or cons-stream) to be reduced pairwise without side-effecting.The PMMH algorithm is written using iterate (often called unfold) which starts with a seed value and applies a function to it to create an infinite stream of iterations which can be manipulated using built-in stream processing functions.This allows for easy implementation of parallel chains, online-monitoring and constant-time memory usage, when used with asynchronous-IO.
The traffic and temperature sensors considered in the examples in Section 5 are placed in close proximity to each other, hence they have highly correlated observations.The models presented in this paper currently consider the observations at each station as independent, but sharing information between sensors could mean more accurate inference.There is also a potential to determine faulty sensors, which do not record similar readings as those in its neighbourhood.
In some cases, results are needed quickly and accuracy is not of paramount concern.In this case, approximate inference could be performed using the Extended Kalman Filter and Unscented Kalman Filter mentioned in Section 1.
Unusual events in the sensor network (for instance the presence of the world cup on traffic, pollution) can cause estimated parameters from past events to be ineffective to forecast future observations of the process.In this case it may be useful to have time varying parameters and estimate the parameters online.
All of the code used in this paper is available on GitHub2 .
1.A stochastic observation function: 6.A likelihood function: val dataLikelihood: (Eta, Observation) => LogLikelihood The observation method is a function from the transformed state η(t) to a distribution over the observations y.The distribution is a monad and can be sampled from by calling the method draw.The Rand monad, along with many common discrete and continuous distributions are implemented in Scala Breeze3 .
The linking function ensures the value of η(t) is appropriate for the observation distribution.The Markov transition kernel of the state space is represented by (stepFun) is a function from State and TimeIncrement to the next Rand [State].x0 returns a Rand monad, with a draw method, representing a random draw from the distribution of the initial state of the system.
An important property of functional programming is referential transparency, this means that any function applied to a fixed set of arguments can be replaced by it's evaluation without affecting the final output of the program.This means purely functional programs are easier to reason about.Functions which produce a pseudo-random output are not pure and violate the principal of referential transparency.Hence, the observation function returns the Rand monad which is manipulated via inference algorithms then sampled from at the last possible moment when the main program is run.
Probabilistic programming is an active area of research, Ścibior et al. (2015) have developed an approach to Bayesian Inference using monads in Haskell (a popular general purpose functional programming language).They develop a representation of a probability distribution as a Generalised Algebraic Data Type (GADT), then implement the particle MCMC algorithm Particle Independent Metropolis Hastings.

A.3.1 A Binary Operation for Composing Models
In order to compose models, define an associative binary operation called combine which combines two models at a time.combine accepts two un=parameterised models, (parameterisation is described in Section A.3.2) m1 and m2 and returns a third model representing the sum of the two models.In this way, the models form a semi-group, since the combine operation is associative and closed.
Further, an identity model, e, can be defined is defined such that for any model M a , e M a = M a = M a e. Un-parameterised models now form a monoid.A monoid is an algebraic data type commonly found in functional programming.
In order to define the combine operation, firstly consider combining the state space of two models: A binary tree is used in order to represent the state space of a model, this is depicted in figure 11.For a single model, the state space is called a LeafState.In order to represent the state of two models, the state of each model is combined into a BranchState with the left and right branch corresponding to the LeafState of each model.A binary tree for the state is defined in Scala as:

Figure 1 :
Figure 1: Representation of a POMP model as a Directed Acyclic Graph (DAG)

Figure 4 :
Figure 4: (Top) Simulated values from the Normal seasonal model, observed discretely at integer time points, (Middle) Transformed State, mean of the Normal Observations, (Bottom) Generalised Brownian Motion latent state

Figure 5 :
Figure 5: (Top) Simulated event times from the Log-Gaussian Cox-Process with µ = 0.1 and σ = 0.5, (Middle) The unobserved hazard, (Bottom) The unobserved state evolving according to generalised Brownian motion with positive drift val poissonMod : UnparamMod = PoissonModel ( stepFunction ) val poissonParam : Parameters = LeafParameter ( InitialStateParameters , None , M a r k o v P a r a m e t e r s ) val seasonalMod : UnparamMod = SeasonalModel ( period = 24 , harmonics = 3 , stepFunction ) val seasonalParam : Parameters = LeafParameter ( InitialStateParameters , None , M a r k o v P a r a m e t e r s ) val combine dParams : Parameters = poissonParam |+| seasonalParam val combinedModel : UnparamModel = poissonMod |+| seasonalMod val p a r a m e t e r i s e d M o d e l : Model = combinedModel ( combined Params )
trait M e t r o p o l i s H a s t i n g s { import M e t r o p o l i s H a s t i n g s ._ import math ._ import breeze .stats .distributions .{Rand , Uniform } val prior : Parameters = > LogLikelihood val proposal : Parameters = > Rand [ Parameters ] val logLikelihood : Parameters = > LogLikelihood val stepMetrop : MetropState = > MetropState = s = > { val propParams = proposal ( s .params ) .draw val propll = logLikelihood ( propParams ) val a = propll + prior ( propParams ) -s .ll -prior ( s .params ) if ( log ( Uniform (0 , 1) .draw ) < a ) MetropState ( propll , propParams ) else s } } object M e t r o p o l i s H a s t i n g s { type Parameters = Vector [ Double ] type LogLikelihood = Double case class MetropState ( ll : LogLikelihood , params : Parameters ) } Listing 3: Illustrative implementation of the PMMH, with an implementation of stepMetrop for use in Stream.iterate

Figure 7 :
Figure 7: (Left) sensor number 62, located on Stamfordham Road (Centre) All traffic sensors in Newcastle Upon Tyne (Right) Count of cars arriving per hour during part of March and April 2015 at station 62 on Stamfordham Road

Figure 8 :
Figure 8: The mean of the filtering distribution and associated 99% credible intervals are plotted alongside the actual observations, the mean of the filtering distribution very closely aligns with the actual observations, which are assumed to be drawn from a Poisson distribution with this mean

Figure 10 :
Figure 10: Future observations of the temperature at the Key building in Newcastle Upon Tyne overlaid with a one-step forecast for the temperature model val observation: Eta => Rand[Observation]    2. A deterministic, non-linear linking function: val link: Gamma => Eta 3. A deterministic linear transformation function: val f: (State, Time) => Gamma 4. A stochastic function to advance the state: val stepFun: (State, TimeIncrement) => Rand[State] 5. A definition of the distribution of the initial state: val x0: Rand[State]

Table 1 :
The first 6 temperature measurements considered at the start of July, displaying the irregular arrival of observations from sensors. Figure 9: Temperature recorded during the first week of July at the Key building in Newcastle Upon Tyne Figure 9 shows a plot of the temperature recorded in first week of July 2016.A suitable model to fit to this data is a seasonal model with a local-level trend.The first model in the composition is a linear Gaussian model with generalised Brownian motion representing the evolution of the latent state: sealed trait State case class LeafState ( x : Vector [ Double ]) extends State case class BranchState ( left : State , right : State ) extends State Since the LeafState and BranchState both extend the State, when a function accepts a parameter of type State it can be either a LeafState or a BranchState.Pattern matching is used in order to decompose the state trees and perform functions on the values contained in the leaf nodes.