Deep state-space Gaussian processes

This paper is concerned with a state-space approach to deep Gaussian process (DGP) regression. We construct the DGP by hierarchically putting transformed Gaussian process (GP) priors on the length scales and magnitudes of the next level of Gaussian processes in the hierarchy. The idea of the state-space approach is to represent the DGP as a non-linear hierarchical system of linear stochastic differential equations (SDEs), where each SDE corresponds to a conditional GP. The DGP regression problem then becomes a state estimation problem, and we can estimate the state efficiently with sequential methods by using the Markov property of the state-space DGP. The computational complexity scales linearly with respect to the number of measurements. Based on this, we formulate state-space MAP as well as Bayesian filtering and smoothing solutions to the DGP regression problem. We demonstrate the performance of the proposed models and methods on synthetic non-stationary signals and apply the state-space DGP to the detection of gravitational waves in LIGO measurements.


Introduction
Gaussian processes (GPs) are popular models for probabilistic non-parametric regression, especially in the machine learning field (Rasmussen and Williams, 2006). As opposed to parametric models, such as deep neural networks (Goodfellow et al., 2016), GPs put prior distributions on the unknown functions. As the mean and covariance functions characterize a GP entirely, the design of these two functions determines how well the GP learns the structure of the data. However, GPs using, for example, radial basis function (RBF) or Matérn class covariance functions are stationary, and hence such conventional GPs are limited in learning non-stationary structures in data. Heteroscedastic GPs (Le et al., 2005; Lazaro-Gredilla and Titsias, 2011) are designed to tackle non-stationarity in the measurement noise. To model the non-stationarity of the unknown process itself, we often need to manipulate the covariance function to obtain non-stationary GPs.
One approach to construct non-stationary GPs is to transform the domain/input space by compositions. For example, Wilson et al. (2016a,b) and Al-Shedivat et al. (2017) transform the inputs by deterministic deep architectures and then feed them to GPs, where the deep transformations are responsible for capturing the non-stationarity in the data. The resulting GP posterior distribution is in closed form. Similarly, Calandra et al. (2016) transform the input space to manifold feature spaces. Damianou and Lawrence (2013) construct deep Gaussian processes (DGPs) by feeding the outputs of GPs to another layer of GPs as (transformed) inputs. However, the posterior inference requires complicated approximations and does not scale well with a large number of measurements (Salimbeni and Deisenroth, 2017a).
Another commonly used non-stationary GP construction is to have input-dependent covariance function hyperparameters, so that the resulting covariance function is non-stationary (Sampson and Guttorp, 1992; Higdon et al., 1999; Paciorek and Schervish, 2004). For example, one can parametrize the length scale as a function of time. This method grants GPs the capability of changing behaviour depending on the input. However, one needs to be careful to ensure that the construction leads to valid (positive definite) covariance functions (Paciorek and Schervish, 2004). It is also possible to put GP priors on the covariance parameters (Tolvanen et al., 2014; Lazaro-Gredilla and Titsias, 2011; Heinonen et al., 2016; Roininen et al., 2019; Monterrubio-Gómez et al., 2020). For example, Salimbeni and Deisenroth (2017b) model the length scale as a GP by using the non-stationary covariance function of Paciorek and Schervish (2006), and they approximate the posterior density via the variational Bayes approach.
The idea of putting GP priors on the hyperparameters of a GP (Heinonen et al., 2016; Roininen et al., 2019; Salimbeni and Deisenroth, 2017b) can be continued hierarchically, which leads to one type of deep Gaussian process (DGP) construction (Dunlop et al., 2018; Emzir et al., 2019, 2020). Namely, the GP is conditioned on another GP, which again depends on another GP, and so forth. It is worth emphasizing that this hyperparameter-based (or covariance-operator) construction of DGPs is different from the compositional DGPs introduced by Damianou and Lawrence (2013) and Duvenaud et al. (2014). In these composition-based DGPs, the output of each GP is fed as an input to another GP. Despite the differences, these two types of DGP constructions are similar in many aspects and are often analyzed under the same framework (Dunlop et al., 2018). This paper focuses on hyperparameter-based (or covariance-operator) temporal DGPs. In particular, we introduce state-space representations of DGPs by using non-linear stochastic differential equations (SDEs). The SDEs form a hierarchical non-linear system of conditionally linear SDEs, which results from the property that a temporal GP can be constructed as a solution to a linear SDE (Hartikainen and Särkkä, 2010; Särkkä and Solin, 2019). More generally, this is related to the connection between Gaussian fields and stochastic partial differential equations (SPDEs, Lindgren et al., 2011). (D)GP regression then becomes equivalent to the smoothing problem on the corresponding continuous-discrete state-space model. Additionally, by using the SDE representations of DGPs we can avoid explicitly choosing/designing the covariance function.
However, the posterior distribution of (state-space) DGPs does not admit a closed-form solution as a plain GP does. Hence we need to use approximations such as maximum a posteriori (MAP) estimates, Laplace approximations, Markov chain Monte Carlo (MCMC, Heinonen et al., 2016; Brooks et al., 2011; Luengo et al., 2020), or variational Bayes methods (Lazaro-Gredilla and Titsias, 2011; Salimbeni and Deisenroth, 2017a; Chang et al., 2020). These methods are often computationally heavy. Another benefit of using state-space DGPs is that we can use Bayesian filtering and smoothing solvers, which are particularly efficient for solving temporal regression/smoothing problems (Särkkä, 2013).
In short, we formulate the (temporal) DGP regression as a state-estimation problem on a non-linear continuous-discrete state-space model. For this purpose, various well-established filters and smoothers are available, for example, the Gaussian (assumed density) filters and smoothers (Särkkä, 2013;Särkkä and Sarmavuori, 2013;Zhao et al., 2021). For temporal data, the computational complexity of using filtering and smoothing approaches is O(N ), where N is the number of measurements.
The contributions of the paper are as follows. 1) We construct a general hyperparameter-based deep Gaussian process (DGP) model and formulate a batch MAP solution for it as a standard reference approach. 2) We convert the DGP into a state-space form consisting of a system of stochastic differential equations. 3) For the state-space DGP, we formulate the MAP and Bayesian filtering and smoothing solutions, whose computational complexity scales linearly with respect to the number of measurements. 4) We prove that for a class of DGP constructions and Gaussian approximations of the DGP posterior, certain nodes of the DGP (e.g., the magnitude σ of a Matérn GP) are asymptotically not updated from the measurements. 5) We conduct experiments on synthetic data and also apply the methods to gravitational wave detection.

Non-stationary Gaussian Processes
We start by reviewing the classical Gaussian process (GP) regression problem (Rasmussen and Williams, 2006). We consider the model f(t) ~ GP(0, C(t, t')), where f : T → R is a zero-mean GP on T = {t ∈ R : t ≥ t_0} with a covariance function C. The observation y_k := y(t_k) of f(t_k) is contaminated by Gaussian noise r_k ~ N(0, R_k), and we let R = diag(R_1, ..., R_N). Given a set of N measurements y_{1:N} = {y_1, ..., y_N}, GP regression aims at obtaining the posterior distribution, which is again Gaussian with closed-form mean and covariance functions (Rasmussen and Williams, 2006). In this model, the choice of the covariance function C is crucial, as it determines, for example, the smoothness and stationarity of the process. Typical choices, such as radial basis or Matérn covariance functions, give stationary GPs. However, it is difficult for a stationary GP to handle non-stationary data. The main problem arises from the covariance function (Rasmussen and Williams, 2006), as the value of a stationary covariance function only depends on the difference of its inputs. That is to say, the correlation between any pair of inputs is the same whenever their difference is the same. This is not beneficial for non-stationary signals, where the correlation structure may vary over the input domain.
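As a concrete reference point, the closed-form GP regression equations can be sketched as follows. This is a minimal numpy example with an exponential (Matérn ν = 1/2) covariance and synthetic data; all function names and data are illustrative, not from the paper.

```python
import numpy as np

def matern12(t1, t2, ell=1.0, sigma=1.0):
    # Stationary Matern nu = 1/2 (exponential) covariance function.
    return sigma**2 * np.exp(-np.abs(t1[:, None] - t2[None, :]) / ell)

def gp_posterior(t_train, y, t_test, R=0.1, ell=1.0, sigma=1.0):
    # Closed-form GP regression posterior mean and covariance
    # (Rasmussen & Williams, 2006, Ch. 2).
    K = matern12(t_train, t_train, ell, sigma) + R * np.eye(len(t_train))
    Ks = matern12(t_test, t_train, ell, sigma)
    Kss = matern12(t_test, t_test, ell, sigma)
    mean = Ks @ np.linalg.solve(K, y)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov

t = np.linspace(0.0, 5.0, 50)
y = np.sin(t) + 0.1 * np.random.default_rng(0).standard_normal(50)
m, P = gp_posterior(t, y, t)
```

Note that the posterior variance is never larger than the prior variance, since a positive semi-definite term is subtracted from the prior covariance.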
A solution to this problem is to use a non-stationary covariance function (Higdon et al., 1999; Paciorek and Schervish, 2004, 2006; Lindgren et al., 2011). That grants the GP the capability of adapting to data by learning hyperparameter functions. However, one needs to carefully design the non-stationary covariance function so that it is positive definite. Recent studies by, for example, Heinonen et al. (2016), Roininen et al. (2019), and Monterrubio-Gómez et al. (2020) propose to put GP priors on the covariance function hyperparameters. In this article, we follow these approaches to construct a hierarchy of GPs, which becomes the construction of the deep GP model.

Deep Gaussian Process Construction
We define a deep Gaussian process (DGP) from a general perspective as follows. Suppose that the DGP has L layers, and each layer (i = 1, ..., L) is composed of L_i nodes. Each node of the DGP is conditionally a GP, denoted by u^i_{j,k}, where k = 1, 2, ..., L_i. We give three indices to each node. The indices i and k specify the layer and the position of the GP, respectively; as an example, u^i_{j,k} is located in the i-th layer of the DGP and is the k-th node of that layer. The index j indicates the conditional connection to its unique child node on the previous layer. That is to say, u^i_{j,k} is the child of the nodes u^{i+1}_{k,k'} for all suitable k'. The terminologies "child" and "parent" follow the graphical model conventions (Bishop, 2006). To keep the notation consistent, we also use u^1_{1,1} := f for the top-layer GP. The nodes u^{L+1}_{j,k} outside of the DGP are treated as degenerate random variables (i.e., constants or trainable hyperparameters). Remark that every node in the DGP is uniquely indexed by i and k, whereas j only serves the purpose of showing the dependency instead of indexing.
We call the vector process U : T → R^{∑_i L_i} the DGP, where each element of U corresponds (one-to-one and onto) to an element of the set of all nodes {u^i_{j,k} : i = 1, ..., L, k = 1, 2, ..., L_i}. Similarly, each element of the vector U^i : T → R^{L_i} corresponds to an element of the set of nodes of the i-th layer. We denote by U^i_{k,·} = {u^i_{k,k'} : for all suitable k'} the set of all parent nodes of u^{i-1}_{j,k}. (In the example of Figure 1, the variable y is the measurement and the nodes in U^4 are degenerate random variables.)
In this tree-like general construction of the DGP U, there are ∑_{i=1}^{L} L_i nodes in total. Every u^i_{j,k} is independent of the other nodes in the same i-th layer and depends on the nodes U^{i+1}_{k,·} of the next, (i+1)-th, layer. When there is only one layer, the DGP reduces to a conventional GP. Figure 1 illustrates the DGP construction.
The realization of the DGP depends on how each of the conditionally GP nodes is constructed. In the following sections, we discuss two realizations of this DGP, by either constructing the conditional GPs by specifying the mean and covariance functions, or by stochastic differential equations. These two constructions lead to DGP regression in batch and sequential forms, respectively.

A Batch Deep Gaussian Process Regression Model
In this section, we present a batch DGP construction which uses the non-stationary GPs of Paciorek and Schervish (2006) to form the DGP. To emphasize the difference from the SDE construction, which is the main topic of this article, we call this the batch-DGP. Let us assume that every conditional GP in the DGP has zero mean and that we observe the top GP f with additive Gaussian noise. We write down the DGP regression model as

y_k = f(t_k) + r_k,
u^i_{j,k} | U^{i+1}_{k,·} ~ GP(0, C^i_k(t, t')),    (2)

where each covariance function C^i_k : T × T → R is parameterized by the next layer's (conditional) GPs. That is to say, the covariance function C^i_k takes the nodes in U^{i+1}_{k,·} as parameters. This DGP construction requires a positive definite covariance function at each node. One option is the non-stationary exponential covariance function, which has the form (cf. Paciorek and Schervish, 2006)

C_NS(t, t') = σ(t) σ(t') √(2 ℓ(t) ℓ(t') / (ℓ(t)² + ℓ(t')²)) exp(-√2 |t - t'| / √(ℓ(t)² + ℓ(t')²)).

In the above covariance function C_NS, the length scale ℓ(t) and magnitude σ(t) are functions of the input t. Paciorek and Schervish (2006) also generalize C_NS to the Matérn class. For the DGP construction in (2) we need to ensure the positivity of the hyperparameter functions. For that purpose we introduce a wrapping function g : R → (0, ∞), which is positive and smooth, and we put ℓ(t) = g(u_ℓ(t)) and σ(t) = g(u_σ(t)), where u_ℓ and u_σ are the conditionally Gaussian processes from the next layer. The exponential or squaring functions are typical options for g. In Example 1, we show a two-layer DGP using the covariance function C_NS.
In this case, we have the so-called length scale ℓ^2_{1,1} = g(u^2_{1,1}) and magnitude σ^2_{1,2} = g(u^2_{1,2}). Also, U = [f, u^2_{1,1}, u^2_{1,2}]ᵀ and U^2 = U^2_{1,·} = [u^2_{1,1}, u^2_{1,2}]ᵀ. Given a set of measurements y_{1:N} = {y_1, y_2, ..., y_N}, the aim of DGP regression is to obtain the posterior density p(U(t) | y_{1:N}) for any input t ∈ T. Moreover, by the construction of the DGP (conditional independence) we have

p(U) = p(f | U^2) ∏_{i=2}^{L} ∏_{k=1}^{L_i} p(u^i_{j,k} | U^{i+1}_{k,·}),    (3)

where each u^i_{j,k} | U^{i+1}_{k,·} is a GP as defined in (2). We isolate f | U^2 from the above factorization because we are particularly interested in the observed f. It is important to remark that the distribution of U is (usually) not Gaussian, because of the non-Gaussianity induced by the conditional hierarchy of Gaussian processes, which depend on each other non-linearly.
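The non-stationary exponential covariance above can be checked numerically. The following is a minimal sketch; the wrapping function g(u) = u² + c and the toy hyperparameter functions (sin and cos standing in for the inner GPs) are assumptions for illustration only.

```python
import numpy as np

def g(u, c=0.1):
    # Positive wrapping function ensuring valid hyperparameters.
    return u**2 + c

def C_NS(t1, t2, u_ell, u_sigma):
    # 1D non-stationary exponential covariance (cf. Paciorek & Schervish, 2006):
    # prefactor sqrt(2 l(t) l(t') / (l(t)^2 + l(t')^2)) times
    # exp(-sqrt(2) |t - t'| / sqrt(l(t)^2 + l(t')^2)).
    l1, l2 = g(u_ell(t1))[:, None], g(u_ell(t2))[None, :]
    s1, s2 = g(u_sigma(t1))[:, None], g(u_sigma(t2))[None, :]
    denom = l1**2 + l2**2
    pref = s1 * s2 * np.sqrt(2.0 * l1 * l2 / denom)
    return pref * np.exp(-np.sqrt(2.0) * np.abs(t1[:, None] - t2[None, :])
                         / np.sqrt(denom))

t = np.linspace(0.0, 1.0, 20)
K = C_NS(t, t, u_ell=np.sin, u_sigma=np.cos)
```

On the diagonal the prefactor collapses to σ(t)², so the pointwise variance is g(u_σ(t))², and the resulting Gram matrix is symmetric positive semi-definite by construction.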

Batch MAP Solution
The maximum a posteriori (MAP) estimate gives a point estimate of U as the maximizer of the posterior distribution (3). Let us denote f_{1:N} = [f(t_1), ..., f(t_N)]ᵀ, and let U_{1:N} collect the values of all nodes at the measurement times. We target the posterior density p(U_{1:N} | y_{1:N}) evaluated at t_1, ..., t_N. The MAP estimate is then obtained by

U^MAP_{1:N} = arg min_{U_{1:N}} L_BMAP(U_{1:N}; y_{1:N}),    (5)

where L_BMAP is the negative logarithm of the unnormalized posterior distribution, given (up to an additive constant) by

L_BMAP(U_{1:N}; y_{1:N}) = (1/2) (y_{1:N} - f_{1:N})ᵀ R⁻¹ (y_{1:N} - f_{1:N}) + (1/2) f_{1:N}ᵀ C⁻¹ f_{1:N} + (1/2) log det C + (1/2) ∑_{i,k} [(u^i_{j,k})ᵀ_{1:N} (C^i_k)⁻¹ (u^i_{j,k})_{1:N} + log det C^i_k].    (6)

In the above Equation (6), C and C^i_k are the covariance matrices formed by evaluating the corresponding GP covariance functions at (t_1, ..., t_N) × (t_1, ..., t_N). The computational complexity of evaluating (6) is O(∑_{i=1}^{L} L_i N³), which scales cubically with the number of measurements.
It is important to recall from (2) that the covariance matrices also depend on the other GP nodes (i.e., f_{1:N} and U_{1:N} enter C^i_k). Therefore the objective function L_BMAP is non-quadratic. Additional non-linear terms are also introduced by the determinants of the covariance matrices. However, quasi-Newton methods (Nocedal and Wright, 2006) can be used to solve the optimization problem. The required gradients of (6) are provided in Appendix A.
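A quasi-Newton optimization of this kind can be sketched as below with L-BFGS from scipy. For brevity this sketch marginalizes f in closed form and optimizes only a single length-scale node, so it is a simplification of the full joint objective (6); the synthetic data, priors, and constants are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 30)
y = np.sin(12 * t**2) + 0.1 * rng.standard_normal(30)   # non-stationary toy signal
R = 0.01 * np.eye(30)

def C_f(u):
    # Non-stationary exponential covariance with length scale l(t) = g(u(t)).
    l = u**2 + 0.05
    l2 = l[:, None]**2 + l[None, :]**2
    pref = np.sqrt(2.0 * l[:, None] * l[None, :] / l2)
    return pref * np.exp(-np.sqrt(2.0) * np.abs(t[:, None] - t[None, :])
                         / np.sqrt(l2))

# GP prior on the length-scale node (stationary exponential covariance).
C_u = np.exp(-np.abs(t[:, None] - t[None, :]) / 0.5)
C_u_inv = np.linalg.inv(C_u + 1e-8 * np.eye(30))

def neg_log_post(u):
    # Gaussian marginal likelihood of y plus the GP prior penalty on u.
    S = C_f(u) + R
    _, logdet = np.linalg.slogdet(S)
    return 0.5 * (y @ np.linalg.solve(S, y) + logdet + u @ C_u_inv @ u)

u0 = np.zeros(30)
res = minimize(neg_log_post, u0, method="L-BFGS-B")
```

Each objective evaluation inverts N-dimensional matrices, which is the cubic cost noted above.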
There are two major challenges in this MAP solution. Firstly, the optimization of (5) is not computationally cheap: it requires evaluating and storing ∑_{i=1}^{L} L_i inversions of N-dimensional matrices at every optimization iteration. This prevents the use of the DGP on large-scale datasets and large models. Moreover, Paciorek and Schervish (2006) state that the optimization of (5) is difficult and prone to overfitting, which we also confirm in the experiment section. Another problem is uncertainty quantification and prediction (interpolation) with the MAP estimate, which is a degenerate point estimate.

Deep Gaussian Processes in State-space
Stochastic differential equations (SDEs) are common models for constructing stochastic processes (Friedman, 1975; Rogers and Williams, 2000a,b; Särkkä and Solin, 2019). Instead of constructing the process by specifying, for example, its mean and covariance functions, an SDE characterizes the process by describing its dynamics with respect to a Wiener process. In this section, we show how a DGP as defined in Section 2.2 can be realized using a hierarchy of SDEs. To highlight the difference from the previous batch-DGP realization, we call this the SS-DGP. The regression problem for this class of DGPs can be seen as a state estimation problem.

Gaussian Processes as Solutions of Linear SDEs
Consider a linear time-invariant (LTI) SDE

d\bar{f}(t) = A \bar{f}(t) dt + L dW_f(t),    (7)

where the coefficients A ∈ R^{d×d} and L ∈ R^{d×S} are constant matrices, W_f(t) ∈ R^S is a Wiener process with unit spectral density, and \bar{f}(t_0) is a Gaussian initial condition with zero mean and covariance P_∞, which is obtained as the solution to the stationary Lyapunov equation

A P_∞ + P_∞ Aᵀ + L Lᵀ = 0.    (8)

When the stationary covariance P_∞ exists, the vector process \bar{f} is a stationary Gaussian process with the (cross-)covariance function

C(t, t') = P_∞ exp((t' - t) A)ᵀ,  t' ≥ t.    (9)

It now turns out that we can construct matrices A, L, and H such that f = H \bar{f} is a Gaussian process with a given covariance function (Hartikainen and Särkkä, 2010; Särkkä and Solin, 2019). The marginal covariance of f can be extracted by Cov[f(t), f(t')] = H C(t, t') Hᵀ. In order to construct non-stationary GPs, we can let the SDE coefficients (i.e., A and L) be functions of time.
In particular, if f is a Matérn GP, then we can select the state \bar{f} = [f, Df, ..., D^α f]ᵀ and the corresponding H = [1 0 0 ···], where D is the time derivative, α is the smoothness factor, and the dimension is d = α + 1. We can also generalize the results to spatio-temporal Gaussian processes, in which case the corresponding SDEs become stochastic partial differential equations (SPDEs, Särkkä and Hartikainen, 2012). When constructing a GP using SDEs, we sometimes need to select the SDE coefficients suitably so that the resulting covariance function (9) admits a desired form (e.g., Matérn). One way to proceed is to find the spectral density function of the GP covariance function (via the Wiener-Khinchin theorem) and translate the resulting transfer function into state-space form (Hartikainen and Särkkä, 2010). The results are known for many classes of GPs, for example, the Matérn and periodic GPs (Särkkä and Solin, 2019).
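For instance, the Matérn ν = 3/2 case can be sketched as follows. The companion-form matrices and the Lyapunov solve are the standard construction of Hartikainen and Särkkä (2010); the helper names are ours.

```python
import numpy as np

def matern32_ss(ell=1.0, sigma=1.0):
    # State-space (companion) form of the Matern nu = 3/2 GP:
    # state f_bar = [f, df/dt], H = [1, 0].
    kappa = np.sqrt(3.0) / ell
    A = np.array([[0.0, 1.0], [-kappa**2, -2.0 * kappa]])
    L = np.array([[0.0], [1.0]])
    q = 4.0 * sigma**2 * kappa**3        # spectral density of the driving noise
    return A, L, q

def stationary_cov(A, L, q):
    # Solve the Lyapunov equation A P + P A^T + q L L^T = 0 via vectorization.
    d = A.shape[0]
    I = np.eye(d)
    lhs = np.kron(I, A) + np.kron(A, I)
    P = np.linalg.solve(lhs, -(q * L @ L.T).ravel())
    return P.reshape(d, d)

A, L, q = matern32_ss(ell=0.7, sigma=1.3)
P_inf = stationary_cov(A, L, q)
H = np.array([1.0, 0.0])
```

Here H @ P_inf @ H recovers the stationary marginal variance σ², and the off-diagonal entries of P_inf vanish because the process and its derivative are uncorrelated at a fixed time.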
Compared to the batch-GP construction in Section 3, the SDE approach offers more freedom to a certain extent, because the corresponding covariance functions are positive definite and non-stationary by construction. It is also computationally beneficial in regression, as we can leverage the Markov properties of the SDEs in the computations.

Deep Gaussian Processes as Hierarchy of SDEs
So far, we have only considered the SDE construction of a single stationary/non-stationary GP. To realize a DGP as defined in Section 2.2, we need to formulate a hierarchical system composed of linear SDEs. Namely, we parametrize the SDE coefficients as functions of other GPs in a hierarchical structure. Following the SDE expression (7) of the GP f, let us similarly define the state \bar{u}^i_{j,k} for each node u^i_{j,k}. We then construct the DGP by finding the SDE representation for each u^i_{j,k} to yield the system

d\bar{f}(t) = A(\bar{U}^2_{1,·}(t)) \bar{f}(t) dt + L(\bar{U}^2_{1,·}(t)) dW_f(t),
d\bar{u}^i_{j,k}(t) = A^i_k(\bar{U}^{i+1}_{k,·}(t)) \bar{u}^i_{j,k}(t) dt + L^i_k(\bar{U}^{i+1}_{k,·}(t)) dW^i_k(t),    (11)

where W_f ∈ R^S and W^i_k ∈ R^S for all i and k are mutually independent standard Wiener processes. Note that the above SDE system (11) is non-linear, as the coefficients are state-dependent. We denote by \bar{U}^{i+1}_{k,·} the collection of the states of all parent nodes of u^i_{j,k}. For example, if u^2_{1,1} is conditioned on u^3_{1,1} and u^3_{1,2}, then \bar{U}^3_{1,·} = [(\bar{u}^3_{1,1})ᵀ, (\bar{u}^3_{1,2})ᵀ]ᵀ. To further condense the notation, we rearrange the above SDEs (11) into

d\bar{U}(t) = Λ(\bar{U}(t)) dt + β(\bar{U}(t)) dW(t),    (12)

where \bar{U} is the SDE state of the entire DGP (the stacked vector of \bar{f} and all \bar{u}^i_{j,k}), \bar{U}(t_0) is the Gaussian initial condition, D = d ∑_{i=1}^{L} L_i is the total dimension of the state, and W is the stacked vector of the Wiener processes. The drift Λ : R^D → R^D and dispersion β : R^D → R^{D×D} functions are obtained by stacking the corresponding drift and dispersion terms of (11), respectively.
The above SDE representation of the DGP is general in the sense that the SDE coefficients of each GP and the number of layers are free. However, they cannot be completely arbitrary, as we at least need to require that the SDE has a weakly unique solution. A classical sufficient condition is to have coefficients that are globally Lipschitz continuous with at most linear growth (Friedman, 1975; Xu et al., 2008; Mao, 2008; Øksendal, 2003). These restrictive conditions can be weakened, for example, to local Lipschitz continuity (Friedman, 1975, Ch. 5) and weaker growth conditions (Shen et al., 2006, Theorem 2.2). Alternatively, requiring the coefficients to be Borel measurable and locally bounded is enough for a unique solution (Rogers and Williams, 2000b, Theorem 21.1 and Equation 21.9).
It is also worth remarking that the SDE system (12) and hence the DGP is a well-defined Itô diffusion, provided that the coefficients are regular enough (Definition 7.1.1, Øksendal, 2003). This feature is valuable, as being an Itô diffusion offers many fruitful properties that we can use in practice, for example, continuity, Markov property, and the existence of infinitesimal generator (Øksendal, 2003). The Markov property is needed to ensure the existence of transition density and also enables the use of Bayesian filtering and smoothing for regression. The infinitesimal generator can be used to discretize the SDEs as we do in Section 5.
It is also possible to extend the SDE representations of temporal DGPs to stochastic partial differential equation (SPDE) representations of spatio-temporal DGPs, given the following result. Suppose v : X × T → R is a spatio-temporal stationary GP on a suitable domain, such that v(x, t) ~ GP(0, C(x, x', t, t')). Then v(x, t) can be constructed as a solution to an evolution-type SPDE

∂\bar{v}(x, t)/∂t = A \bar{v}(x, t) + B w(x, t),

where \bar{v}(x, t) is the state of v, A and B are spatial operators, and w(x, t) is spatio-temporal white noise. See also Lindgren et al. (2011), which provides another path to the spatio-temporal case.

Deep Matérn Process
In this section, we present a Matérn construction of the SS-DGP (12). The coefficients are chosen such that each SDE corresponds to a conditional GP with a Matérn covariance function. The idea is to find an equivalent SDE representation for each Matérn GP node, and then to parametrize the covariance parameters (i.e., length scale ℓ and magnitude σ) with another layer of Matérn GPs. We are interested in a GP with the Matérn covariance function

C(t, t') = σ² (2^{1-ν} / Γ(ν)) (κ |t - t'|)^ν K_ν(κ |t - t'|),    (13)

where K_ν is the modified Bessel function of the second kind and Γ is the Gamma function. We denote κ = √(2ν)/ℓ and ν = α + 1/2. As shown by Hartikainen and Särkkä (2010), one possible SDE representation of the Matérn GP u^i_{j,k} in Equation (13) is

d\bar{u}^i_{j,k}(t) = A^i_k \bar{u}^i_{j,k}(t) dt + L^i_k dW^i_k(t),    (15)

where the state is \bar{u}^i_{j,k} = [u^i_{j,k}, D u^i_{j,k}, ..., D^α u^i_{j,k}]ᵀ, and the SDE coefficients A^i_k and L^i_k admit the companion form

A^i_k = [0 1 ··· 0; ⋮ ⋱ ⋮; 0 0 ··· 1; -(α+1 choose 0) κ^{α+1}  -(α+1 choose 1) κ^α  ···  -(α+1 choose α) κ],
L^i_k = [0, ..., 0, b]ᵀ,    (16)

respectively. Above, (α+1 choose n) denotes a binomial coefficient, b is a constant chosen so that the spectral density of the driving noise matches (13), and W^i_k ∈ R is a standard Wiener process. Next, to construct the deep Matérn process, we need to parametrize the length scale ℓ and the magnitude σ by the states of the parent GPs and build the system as in Equation (11). For example, if we want to use u^3_{1,1} and u^3_{1,2} to model the length scale and magnitude of u^2_{1,1}, then ℓ^2_{1,1} = g(u^3_{1,1}) and σ^2_{1,1} = g(u^3_{1,2}). The wrapping function g : R → (0, ∞) is mandatory to ensure the positivity of the Matérn parameters. The minimal requirement on g is that it be positive and Borel measurable. For instance, we can let g(u) = exp(u) + c or g(u) = u² + c for some c > 0. Another choice is g(u) = 1/(u² + c), which is bounded and Lipschitz on R; this makes the deep Matérn process an Itô diffusion, so that the SS-DGP is well-defined (Øksendal, 2003).
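The companion-form coefficients of a Matérn node, parameterized by a parent node through the wrapping function g, can be sketched as follows. The binomial-weighted last row follows Hartikainen and Särkkä (2010); the spectral-density constant is the standard Matérn one, and the function names are ours.

```python
import numpy as np
from math import comb, gamma, sqrt, pi

def g(u, c=0.1):
    # Positive wrapping function for length-scale / magnitude nodes.
    return u**2 + c

def matern_drift(u_ell, alpha=1):
    # Companion-form drift A(l) of a Matern node with l = g(u_ell) supplied by a
    # parent node; the last row holds binomial-weighted powers of kappa.
    nu = alpha + 0.5
    kappa = sqrt(2.0 * nu) / g(u_ell)
    d = alpha + 1
    A = np.diag(np.ones(d - 1), k=1)
    A[-1, :] = [-comb(d, n) * kappa**(d - n) for n in range(d)]
    return A

def matern_diffusion_q(u_ell, u_sigma, alpha=1):
    # Spectral density of the driving white noise for a Matern GP:
    # q = 2 sigma^2 sqrt(pi) kappa^(2 alpha + 1) Gamma(alpha+1) / Gamma(alpha+1/2).
    nu = alpha + 0.5
    kappa = sqrt(2.0 * nu) / g(u_ell)
    s2 = g(u_sigma)**2
    return 2.0 * s2 * sqrt(pi) * kappa**(2 * alpha + 1) * gamma(alpha + 1) / gamma(alpha + 0.5)

A = matern_drift(0.5, alpha=1)
```

For α = 1 the last row reduces to [-κ², -2κ], i.e., the Matérn-3/2 companion matrix, and the spectral density reduces to 4σ²κ³.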
Note that the resulting state-space model composed of (15) has a canonical form from control theory (Glad and Ljung, 2000), and its dimensionality is determined by the smoothness parameter α. Moreover, the coefficient matrix A^i_k is Hurwitz, because all of its eigenvalues have strictly negative real parts. The stability of such systems is studied, for example, in Khasminskii (2012).

State-space Deep Gaussian Process Regression
In this section, we formulate sequential state-space regression by DGPs. Using the result in Equation (12), the state-space regression model is

d\bar{U}(t) = Λ(\bar{U}(t)) dt + β(\bar{U}(t)) dW(t),
y_k = H \bar{U}(t_k) + r_k,    (17)

where the initial condition \bar{U}(t_0) ~ N(0, P_0) is independent of W(t) for t ≥ 0, and H \bar{U}(t_k) = f(t_k) extracts the top GP f from the state. We also assume that the functions Λ and β are selected suitably such that the SDE (17) has a weakly unique solution and the Markov property (Friedman, 1975). The deep Matérn process and Example 2 satisfy these two conditions, provided that the function g is chosen properly. Suppose we have a set of observations y_{1:N} = {y_1, y_2, ..., y_N}; then the posterior density of interest is

p(\bar{U}(t) | y_{1:N})    (18)

for any t_1 ≤ t ≤ t_N. Since we have discrete-time measurements, let us denote \bar{U}_k := \bar{U}(t_k) for k = 1, 2, ..., N and write \bar{U}_{1:N} for the collection of states at the measurement times. It would also be possible to extend the regression to classification by using a categorical measurement model (Rasmussen and Williams, 2006; García-Fernández et al., 2019).

SDE Discretization
To obtain the posterior density with discrete-time observations, we need the transition density of the SDE, such that \bar{U}_{k+1} ~ p(\bar{U}_{k+1} | \bar{U}_k). It is known that the transition density is the solution to the Fokker-Planck-Kolmogorov (FPK) partial differential equation (PDE, Särkkä and Solin, 2019). However, solving a PDE is not computationally cheap and does not scale well with the state dimension. It is often more convenient to discretize the SDE and approximate the continuous-discrete state-space model (17) with a discretized version

\bar{U}_{k+1} = a(\bar{U}_k) + q(\bar{U}_k),    (19)

where a : R^D → R^D is a function of the state, and q(\bar{U}_k) is a zero-mean random variable whose distribution depends on the state. One of the most commonly used methods to derive a and q is the Euler-Maruyama scheme (Kloeden and Platen, 1992), which gives a(\bar{U}_k) = \bar{U}_k + Λ(\bar{U}_k) Δt and q(\bar{U}_k) ~ N(0, β(\bar{U}_k) β(\bar{U}_k)ᵀ Δt). Unfortunately, Euler-Maruyama is not applicable to many DGP models, as the covariance of q would be singular. As an example, the SDE representation of a smooth Matérn (α ≥ 1) GP gives a singular β(\bar{U}_k) β(\bar{U}_k)ᵀ (see Equation (16)), and thus the transition density p(\bar{U}_{k+1} | \bar{U}_k) is ill-defined. The Taylor moment expansion (TME) is one way to proceed instead of Euler-Maruyama (Zhao et al., 2021; Kessler, 1997; Florens-Zmirou, 1989). This method requires that the SDE coefficients Λ and β are differentiable and that an infinitesimal generator exists for the SDE (Zhao et al., 2021). The deep Matérn process satisfies these conditions provided that the wrapping function g is chosen suitably.
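The Euler-Maruyama moments and the singularity issue can be illustrated as follows for a Matérn-3/2 node with unit length scale and magnitude (the diffusion constant 2κ^{3/2} is the square root of the spectral density 4κ³; the helper names are ours).

```python
import numpy as np

def em_step_moments(U, Lam, beta, dt):
    # Euler-Maruyama moments of U_{k+1} = U_k + Lam(U_k) dt + beta(U_k) dB_k:
    # conditional mean a(U_k) and conditional covariance Q = beta beta^T dt.
    a = U + Lam(U) * dt
    B = beta(U)
    Q = B @ B.T * dt
    return a, Q

kappa = np.sqrt(3.0)                     # Matern-3/2 node, unit length scale
Lam = lambda U: np.array([[0.0, 1.0], [-kappa**2, -2.0 * kappa]]) @ U
beta = lambda U: np.array([[0.0], [2.0 * kappa**1.5]])
a, Q = em_step_moments(np.array([0.1, 0.0]), Lam, beta, 0.01)
```

Because the dispersion acts only on the last state component, Q is rank-deficient (its determinant is zero), so the Gaussian transition density p(U_{k+1} | U_k) does not exist; this is exactly the degeneracy that motivates TME-type discretizations.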
We remark at this point that we have formed an approximation of the SS-DGP in order to use its Markov property. This is different from the batch-DGP model, where we do not utilize the Markov property for regression. In summary, we approximate the transition density as a non-linear Gaussian, where a discretization such as Euler-Maruyama or TME is used. With the transition density formulated, we can now approximate the posterior density (18) of the SS-DGP using sequential methods in state-space.

State-space MAP Solution
The MAP solution for the SS-DGP model is fairly similar to that of the batch-DGP model, except that we factorize the posterior density using the Markov property. Suppose that we are interested in the posterior density p(\bar{U}_{0:N} | y_{1:N}) at the N discrete observation points; then we factorize it as

p(\bar{U}_{0:N} | y_{1:N}) ∝ p(\bar{U}_0) ∏_{k=1}^{N} p(y_k | \bar{U}_k) p(\bar{U}_k | \bar{U}_{k-1}).    (20)

Taking the negative logarithm of both sides of Equation (20) gives the objective

L_SMAP(\bar{U}_{0:N}; y_{1:N}) = -log p(\bar{U}_0) - ∑_{k=1}^{N} [log p(y_k | \bar{U}_k) + log p(\bar{U}_k | \bar{U}_{k-1})] + const.    (22)

The corresponding gradient of (22) is given in Appendix B. The computational complexity of this SS-MAP solution is linear in N per objective evaluation. We thus see that the state-space MAP solution has an advantage on large datasets, as the computational complexity is linear with respect to the number of data points N.
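An O(N) evaluation of this objective can be sketched as below. A linear-Gaussian toy transition stands in for the discretized DGP transition density; all model matrices and data are assumptions for illustration.

```python
import numpy as np

def neg_log_gauss(x, m, P):
    # Negative log-density of N(x; m, P).
    d = x - m
    _, logdet = np.linalg.slogdet(2.0 * np.pi * P)
    return 0.5 * (d @ np.linalg.solve(P, d) + logdet)

def L_smap(U, y, f_trans, Q, H, R, m0, P0):
    # L_SMAP(U_{0:N}; y_{1:N}) = -log p(U_0) - sum_k log p(U_k | U_{k-1})
    #                            - sum_k log p(y_k | U_k); one pass over the data.
    val = neg_log_gauss(U[0], m0, P0)
    for k in range(1, len(U)):
        val += neg_log_gauss(U[k], f_trans(U[k - 1]), Q)
        val += neg_log_gauss(np.atleast_1d(y[k - 1]), np.atleast_1d(H @ U[k]), R)
    return val

rng = np.random.default_rng(0)
N, dt = 50, 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])                      # toy transition
Q = 0.5 * np.array([[dt**3 / 3, dt**2 / 2], [dt**2 / 2, dt]])
H = np.array([1.0, 0.0])
R = np.array([[0.1]])
U = np.zeros((N + 1, 2)); y = np.zeros(N)
for k in range(1, N + 1):
    U[k] = F @ U[k - 1] + rng.multivariate_normal(np.zeros(2), Q)
    y[k - 1] = H @ U[k] + 0.3 * rng.standard_normal()
val = L_smap(U, y, lambda u: F @ u, Q, H, R, np.zeros(2), np.eye(2))
```

The loop touches each time step once, which is the linear-in-N cost claimed above, in contrast to the cubic batch objective (6).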
The state-space MAP method also has the problem that it is inherently a point estimate. One way to proceed is to use a Bayesian filter and smoother instead of the MAP estimate (Särkkä, 2013).

Bayesian Filtering and Smoothing Solution
Recall the original SS-DGP model (17). The estimation of the state from an observed process is equivalent to computing the posterior distribution (18) which in turn is equivalent to a continuous-discrete filtering and smoothing problem (Jazwinski, 1970;Särkkä and Solin, 2019). Compared to the MAP solution, the Bayesian smoothing approaches offer the full posterior distribution instead of a point estimate.
The core idea of Bayesian smoothing is to utilize the Markov property of the process and approximate the posterior density recursively at each time step. In particular, we are interested in the filtering posterior p(\bar{U}_k | y_{1:k}) and the smoothing posterior p(\bar{U}_k | y_{1:N}) for k = 1, 2, ..., N. There are many well-known methods to obtain these posterior densities, such as the Kalman filter and the Rauch-Tung-Striebel smoother for linear Gaussian state-space models. Typical methods for non-linear SS-DGP models are the Gaussian (assumed density) filters and smoothers (Särkkä and Sarmavuori, 2013; Kushner, 1967; Itô and Xiong, 2000). Some popular examples are the extended Kalman filter and smoother (EKF/EKS), and the unscented or cubature Kalman filters and smoothers (UKF/UKS/CKF/CKS). A significant benefit of Gaussian filters and smoothers is their computational efficiency, as they scale linearly with the number of measurements.
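A single extended Kalman filter step for a model of the discretized form (19) with a linear measurement y = H U + r can be sketched as follows (the toy dynamics and all constants are illustrative assumptions, not the full DGP transition).

```python
import numpy as np

def ekf_step(m, P, y, f, F_jac, Q, H, R):
    # One EKF step: linearize the transition f at the current mean (predict),
    # then do a standard linear-Gaussian update with the scalar measurement y.
    m_pred = f(m)
    F = F_jac(m)
    P_pred = F @ P @ F.T + Q
    S = H @ P_pred @ H + R               # innovation variance (H is a row vector)
    K = P_pred @ H / S                   # Kalman gain
    m_new = m_pred + K * (y - H @ m_pred)
    P_new = P_pred - np.outer(K, K) * S
    return m_new, P_new

f = lambda x: np.array([x[0] + 0.1 * x[1], x[1] - 0.1 * np.sin(x[0])])
F_jac = lambda x: np.array([[1.0, 0.1], [-0.1 * np.cos(x[0]), 1.0]])
Q = 0.01 * np.eye(2)
H = np.array([1.0, 0.0])
m1, P1 = ekf_step(np.zeros(2), np.eye(2), 0.5, f, F_jac, Q, H, 0.1)
```

Each step costs a constant amount of work, so filtering N measurements is O(N); the update never increases the variance of the observed component.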
Remark 1 Although the Gaussian filters and smoothers are beneficial choices in terms of computation, there are certain limitations when applying them to DGP regression. We elucidate this peculiar characteristic in Section 6.
Instead of Gaussian filters and smoothers, we can use particle filters and smoothers in a more general setting of DGPs (Godsill et al., 2004; Andrieu et al., 2010). Typical choices are the bootstrap particle filter (PF, Gordon et al., 1993) with resampling procedures (Kitagawa, 1996) and the backward-simulation particle smoother (PF-BS, Godsill et al., 2004). However, particle filters and smoothers usually do not scale well with the dimension of the state space, as more particles are needed to represent the probability densities in higher dimensions. Other non-Gaussian assumed density filters and smoothers may also apply, for example, the projection filter and smoother (Brigo et al., 1998; Koyama, 2018).
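One bootstrap particle filter step can be sketched as follows; the scalar toy dynamics, likelihood, and particle count are assumptions for illustration.

```python
import numpy as np

def bootstrap_pf_step(particles, y, f_sample, loglik, rng):
    # One bootstrap PF step (Gordon et al., 1993): propagate particles through
    # the transition, weight by the measurement likelihood, then resample.
    prop = f_sample(particles, rng)
    logw = loglik(y, prop)
    w = np.exp(logw - logw.max())        # stabilized, unnormalized weights
    w /= w.sum()
    idx = rng.choice(len(prop), size=len(prop), p=w)   # multinomial resampling
    return prop[idx]

rng = np.random.default_rng(0)
parts = rng.standard_normal((500, 1))                  # prior particles
f_sample = lambda p, rng: 0.9 * p + 0.1 * rng.standard_normal(p.shape)
loglik = lambda y, p: -0.5 * (y - p[:, 0])**2 / 0.1    # Gaussian likelihood
parts = bootstrap_pf_step(parts, 1.0, f_sample, loglik, rng)
```

After observing y = 1, the resampled particle cloud shifts from the zero-mean prior toward the measurement, illustrating the weight-and-resample mechanism; the cost per step is linear in the number of particles, but the number of particles needed grows with the state dimension.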

Analysis on Gaussian Approximated DGP Posterior
Gaussian filters are particularly efficient methods which approximate the DGP filtering posterior (23) and the predictive density p(\bar{U}_k | y_{1:k-1}) as Gaussians (Itô and Xiong, 2000). Under linear additive Gaussian measurement models, the posterior density is approximated analytically by applying Gaussian identities. However, we are going to show that this type of Gaussian approximation is not useful for all constructions of DGPs. In particular, we show that the estimated posterior covariance between the observed GP f(t) and an inner GP u_σ(t) approaches zero as t → ∞. As a consequence, the Gaussian filtering update for u_σ(t) does not use information from the measurements as t → ∞.
Hereafter, we restrict our analysis to a certain construction of DGPs and a class of Gaussian approximations (filters) for which we can prove the covariance vanishing property. Therefore, in Section 6.1 we define a construction of DGPs, and in Algorithm 1 we formulate a type of Gaussian filters. The main result is revealed in Theorem 1.
We organize the proofs as follows. First, we show that at every time step the prediction step of the DGP yields a vanishing prior covariance (Lemma 1). Then we show that the Gaussian filter update step also shrinks the covariance, and finally we prove the vanishing posterior covariance by mathematical induction over all time steps (Theorem 1).

Preliminaries and Assumptions
Let f : T → R and u_σ : T → R be the solutions to the pair of SDEs (25) for t ≥ t_0, starting from random initial conditions f(t_0) and u_σ(t_0) which are independent of the Wiener processes W_f(t) ∈ R and W_σ(t) ∈ R. In addition, u_ℓ : T → R and u_v : T → R are two independent processes driving the SDEs (25), which are also independent of W_f(t) and W_σ(t) for t ≥ t_0. Let y(t_k) = f(t_k) + r(t_k) be a noisy observation of f(t_k), where r(t_k) ~ N(0, R_k) and k = 1, 2, .... Also let y_{1:k} = {y_1, ..., y_k} and Δt = t_k - t_{k-1} > 0 for all k. We make the following assumptions.
Assumption 3. There exist constants C_µ < 0 and C_a < 0 such that the functions µ and a are bounded above by C_µ and C_a, respectively, almost surely.
Assumption 5. There exists a constant C_R > 0 such that R_k ≥ C_R for all k = 1, 2, . . ., or there exists a k such that R_k = 0.
The solution existence in Assumption 1 is a prerequisite for the analysis of the SDEs (25) (Kuo, 2006; Øksendal, 2003). Assumption 2 ensures that the SDEs start from a reasonable initial condition, which is used in Lemma 1. Assumption 3 postulates negativity of the functions µ and a. It implies that the sub-processes f and u_σ stay near zero, and the negativity also guarantees the positivity of the length scales (e.g., the length scale of f(t) is −µ(u_ℓ(t))). Assumption 4 yields a lower bound on the variance of f, as stated in Corollary 1. Finally, Assumption 5 means that the measurement noise admits a uniform lower bound, which is used in Theorem 1; this assumption also allows for perfect measurements (i.e., no measurement noise).
The above SDEs (25) together with Assumptions 1–5 correspond to a particular type of DGP construction. In particular, f is a conditional GP given u_σ and u_ℓ, and u_σ is another conditional GP given u_v. The processes u_ℓ and u_v are two independent processes that drive f and u_σ. The Matérn DGP in Example 2 satisfies the above assumptions if we choose Gaussian initial conditions and a regular wrapping function, for example, g(u) = u² + c with c > 0.

Theoretical Results
The following Lemma 1 shows that the covariance of f(t) and u_σ(t) approaches zero as t → ∞.
Lemma 1 Under Assumptions 1 to 3,

lim_{t→∞} Cov[f(t), u_σ(t)] = 0.    (26)

Proof By Itô's lemma (see, e.g., Theorem 4.2 of Särkkä and Solin, 2019), we obtain an SDE for the product f(t) u_σ(t), given in Equation (27). To analyze the relation between f and u_σ, we need to fix the information from u_v and u_ℓ. Hence, let F^v_t and F^ℓ_t be the filtrations generated by u_v(t) and u_ℓ(t), respectively. Taking conditional expectations in Equation (27) gives an expression for the conditional expectation of f(t) u_σ(t) whose dominant factor is exp(∫_{t_0}^{t} µ(u_ℓ(s)) + a(u_v(s)) ds).
Using the same approach, we derive analogous expressions for the conditional expectations of f(t) and u_σ(t), involving the factors exp(∫_{t_0}^{t} µ(u_ℓ(s)) ds) and exp(∫_{t_0}^{t} a(u_v(s)) ds).
Then, by the law of total expectation, we recover Cov[f(t), u_σ(t)] as a combination of expectations of such exponential factors; this is Equation (29).
Taking the limit of Equation (29) as t → ∞ leaves three limits on the right-hand side, all of which turn out to be zero. Let us first focus on the term involving exp(∫_{t_0}^{t} µ(u_ℓ(s)) ds). By Jensen's inequality (see, e.g., Theorem 7.9 of Klenke, 2014), we obtain an upper bound on this term for t ∈ T. Then, by Hölder's inequality (see, e.g., Theorem 7.16 of Klenke, 2014), the above inequality continues in terms of E[exp(∫_{t_0}^{t} µ(u_ℓ(s)) ds)].

Now, by Assumption 3, there exists a constant C_µ < 0 such that µ(u_ℓ(s)) ≤ C_µ almost surely for all t > t_0, and therefore the bound above tends to zero as t → ∞. Similarly, we obtain zero limits for the rest of the terms in Equation (30). Thus the limit (26) holds.
The almost sure negativity of the functions µ(·) and a(·) (i.e., Assumption 3) is the key condition needed for the covariance to vanish in the limit of infinite time. These conditions typically hold in an SDE representation of a DGP, because µ(·) and a(·) ensure the positivity of the length scales.
Before analyzing the posterior covariance, we need to construct a positive lower bound on the variance of f (t), which is given in Lemma 2 and Corollary 1.
Corollary 1 Under Assumptions 1 and 4, there exist ε > 0 and C_F(∆t) > 0 such that Equation (33) holds.

Proof From Lemma 2, we know that for any ε > 0 there is ζ > 0 such that Equation (31) holds. By Assumption 4, Equation (33) then follows. Note that the inequality (33) only depends on ∆t and some fixed parameters of the SDEs.
The following Algorithm 1 formulates a partial procedure for estimating the posterior density using a Gaussian approximation. In particular, Algorithm 1 gives an approximation to the posterior covariance for k = 1, 2, . . .. In order to do so, we need to make predictions through the SDEs (25) based on different starting conditions at each time step. Hence, let us introduce the following notation: we denote by Cov[f(t), u_σ(t)](c_0) and Var[f(t)](s_0) the solutions of Equations (29) and (32) starting from initial values c_0 ∈ R and s_0 ∈ (0, +∞) at t_0, respectively.

Remark 2
The above Algorithm 1 is an abstraction of continuous-discrete Gaussian filters (Itô and Xiong, 2000; Särkkä and Solin, 2019), except that the predictions through the SDEs (25) are carried out exactly in Equations (34) and (35). The derivation of Equation (36) is given in Appendix C. Note that in practice the predictions may also involve various types of Gaussian approximations and even numerical integration (e.g., sigma-point methods).
Theorem 1 Suppose that Assumptions 1 to 5 hold. Assume further that |Cov[f(t), u_σ(t)](c_0)| ≤ |c_0| for all c_0 ∈ R. Then Algorithm 1 gives

lim_{k→∞} P^{f,σ}_k = 0.    (37)

Proof We use induction over k to prove the claim

|P^{f,σ}_k| ≤ |P^{f,σ}_0| ∏_{i=1}^{k} M_i,    (38)

where the factors M_i come from the update step of Algorithm 1. The base case is trivial, and the induction step follows from the prediction bound |Cov[f(t), u_σ(t)](c_0)| ≤ |c_0|, used for any k, together with the update Equation (36), which satisfies the induction claim (38). Thus Equation (38) holds. By Corollary 1, Assumption 5, and a fixed non-zero ∆t, we know that the predicted variances P̄^{f,f}_k are lower bounded uniformly over all k, and thus lim_{k→∞} ∏_{i=1}^{k} M_i = 0. Hence, taking the limit in Equation (38) establishes Equation (37). The theorem also holds trivially if R_k = 0 for some k or if P^{f,σ}_0 = 0, because then M_k = 0 for all k = 1, 2, . . ..

Remark 3
Note that in Theorem 1 the initial bounding assumption |Cov[f(t), u_σ(t)](c_0)| ≤ |c_0| for all c_0 ∈ R is needed because it does not always follow from Lemma 1. On the other hand, for any choice of c_0 ∈ R there always exists a threshold η > 0 such that |Cov[f(t), u_σ(t)](c_0)| ≤ |c_0| for all t > η, as a consequence of Lemma 1.
Under the bounded Var[f(t)] result of Corollary 1, the consequence of the vanishing posterior covariance in Theorem 1 is that the so-called Kalman gain for u_σ(t) approaches zero asymptotically. This entails that the Kalman update for u_σ(t) uses no information from the measurements as t → ∞. In the experiment shown later in Figure 8, we indeed see that the corresponding estimates of u_σ(t) and of the covariance rapidly stabilize at zero.
Theorem 1 is formulated in a general sense and applies to any DGP method that uses Algorithm 1 and satisfies Assumptions 1 to 5. A concrete case is given in the following Example 3.
Example 3 Consider a system of SDEs starting from a Gaussian initial condition f(t_0), u_σ(t_0), where the constants satisfy µ < 0, a < 0, and b > 0. The conditions of Theorem 1 are now satisfied, and thus lim_{k→∞} P^{f,σ}_k = 0.
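A stylized numerical illustration of this example can be given as follows. All constants here are assumed, and the predicted variance of f is frozen at a fixed stand-in lower bound for simplicity (in the full filter it evolves, but stays bounded below by Corollary 1). With constant drifts, the exact moment prediction scales the cross-covariance by exp((µ + a)∆t), and the Gaussian-filter update scales it by R/(P̄^{f,f} + R), as read from the update derived in Appendix C; the posterior cross-covariance therefore collapses geometrically.

```python
import math

# Stylized recursion illustrating Theorem 1 for Example 3.
mu, a = -1.0, -0.5      # drift constants, both negative (Assumption 3)
dt, R = 0.1, 0.1        # step size and measurement noise variance (assumed)
P_ff = 1.0              # stand-in lower bound for the predicted variance of f
p_fs = 0.5              # initial cross-covariance Cov[f, u_sigma]
history = [p_fs]
for _ in range(200):
    p_fs *= math.exp((mu + a) * dt)   # exact moment prediction through the SDEs
    p_fs *= R / (P_ff + R)            # Gaussian-filter update shrinkage
    history.append(p_fs)
```

After a few hundred steps the cross-covariance is numerically zero, matching the conclusion lim_{k→∞} P^{f,σ}_k = 0.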

Experiments
In this section we evaluate the proposed methods numerically. The specific objectives of the experiments are as follows. First, we show the advantages of using DGPs over conventional GPs or non-stationary GPs (one-layer DGPs) in non-stationary regression. Then, we compare the batch and state-space constructions of DGPs. Finally, we examine the efficiency of the different DGP regression methods.
We prepare four regression models as shown in Figure 2. These models are the conventional GP (Rasmussen and Williams, 2006), the non-stationary GP (NS-GP, Paciorek and Schervish, 2006), a two-layer DGP (DGP-2), and a three-layer DGP (DGP-3). The DGP-2 and DGP-3 models are constructed using both the batch and state-space approaches, as formulated in Sections 3 and 4, respectively. In particular, we consider a Matérn type of GP construction, which has only two hyperparameters (i.e., the length scale ℓ and magnitude σ). That is to say, we use the non-stationary Matérn covariance function (Paciorek and Schervish, 2006) for the NS-GP and batch-DGP models, and the deep Matérn process for the SS-DGP model. For the wrapping function g, we choose g(u) = exp(u). For the discretization of the SS-DGP, we use the 3rd-order TME method (Zhao et al., 2021). We control the smoothness of f and of the hyperparameter processes by using α = 1 and α = 0, respectively (see Equation (14)). In addition, we show samples drawn from the DGP priors in Appendix D.
Fig. 3: Demonstration of the magnitude-varying rectangle signal in Equation (42) with 500 samples.
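To make the construction concrete, the following is a minimal Euler–Maruyama sketch of drawing one sample path from a two-layer deep Matérn-style prior with the wrapping g(u) = exp(u). This uses a crude Euler scheme and a Matérn-1/2 (Ornstein–Uhlenbeck) form rather than the TME-3 discretization and the exact smoothness settings of the experiments; all constants are illustrative assumptions.

```python
import numpy as np

# Illustrative two-layer "deep Matérn" prior sample via Euler-Maruyama.
rng = np.random.default_rng(0)
dt, n = 0.01, 1000
u_ell = np.zeros(n)   # inner GP driving the length scale
u_sig = np.zeros(n)   # inner GP driving the magnitude
f = np.zeros(n)       # observed-layer process
for k in range(1, n):
    # Inner layer: stationary OU priors on u_ell and u_sig.
    u_ell[k] = u_ell[k-1] - u_ell[k-1] * dt + np.sqrt(dt) * rng.standard_normal()
    u_sig[k] = u_sig[k-1] - u_sig[k-1] * dt + np.sqrt(dt) * rng.standard_normal()
    ell, sig = np.exp(u_ell[k]), np.exp(u_sig[k])   # wrapping g(u) = exp(u)
    # Outer layer: OU process with time-varying length scale and magnitude.
    f[k] = (f[k-1] - f[k-1] / ell * dt
            + sig * np.sqrt(2.0 / ell) * np.sqrt(dt) * rng.standard_normal())
```

The resulting path of f wanders between smooth, large-magnitude regimes and rough, small-magnitude regimes as the inner processes vary, which is the qualitative behaviour the DGP prior is designed to express.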
The models contain unknown hyperparameters. We use a maximum likelihood estimation (MLE) routine to optimize the hyperparameters of the GP and NS-GP models, which have closed-form likelihood functions and gradients. For the DGP models, we find the hyperparameters by grid search because the gradients are non-trivial to derive. The resulting hyperparameters are detailed in Appendix E.
As for the batch-DGP models, we use the proposed batch maximum a posteriori (B-MAP) method in Section 3.1. Similarly for the SS-DGP, we apply the state-space MAP (SS-MAP), Gaussian filters and smoothers (Särkkä, 2013), and a bootstrap particle filter (PF, Andrieu et al., 2010;Doucet et al., 2000) and a backward-simulation particle smoother (PF-BS, Godsill et al., 2004).
We use the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS, Nocedal and Wright, 2006) optimizer for the MLE and MAP optimizations. For the Gaussian filters and smoothers, we use the commonly employed first-order linearization (EKFS) and the spherical cubature method (CKFS) (Särkkä, 2013). For the PF and PF-BS, we use 200,000 particles and 1,600 backward simulations.
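For reference, the spherical cubature rule underlying CKFS (Särkkä, 2013) uses 2n symmetric unit points ±√n e_i with equal weights 1/(2n), mapped through a Cholesky factor of the covariance. A brief sketch (the helper name `cubature_points` is ours):

```python
import numpy as np

def cubature_points(m, P):
    """Third-degree spherical cubature sigma points for N(m, P)."""
    n = m.size
    L = np.linalg.cholesky(P)
    # 2n unit points: +sqrt(n) e_i and -sqrt(n) e_i, stacked row-wise.
    xi = np.sqrt(n) * np.concatenate([np.eye(n), -np.eye(n)], axis=0)
    points = m + xi @ L.T                   # each row is one sigma point
    weights = np.full(2 * n, 1.0 / (2 * n)) # equal weights
    return points, weights

m = np.array([1.0, -2.0])
P = np.array([[2.0, 0.5], [0.5, 1.0]])
pts, w = cubature_points(m, P)
```

By symmetry the weighted sample mean of the points is exactly m and their weighted covariance is exactly P, which is what makes the rule exact for third-degree polynomials.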
The following experiments, except the real-data application, are computed on the Triton computing cluster at Aalto University, Finland¹. We uniformly allocate 4 CPU cores and 4 gigabytes of memory for each individual experiment. In addition, the PF-BS method is implemented with CPU-based parallelization. All programs are implemented in MATLAB® 2019b.

Regression on Rectangle Signal
In this section, we conduct regression on a magnitude-varying rectangle wave, as shown in Figure 3. The regression model is y(t_k) = f(t_k) + r(t_k), where f is the true function, y is the measurement, and r(t) ∼ N(0, 0.002). We generate evenly spaced samples y(t_1), . . . , y(t_T), where T = 100. The challenge of this type of signal is that the rectangle wave is continuous and flat almost everywhere, while being only right-continuous at a finite number of isolated points. Moreover, the jumps have different heights.
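Equation (42) itself is not reproduced in this excerpt; as a hedged illustration of the kind of signal in Figure 3, a magnitude-varying rectangle wave with per-cycle amplitudes and noisy observations can be generated as follows (the period, the amplitude values, and the grid are all assumed, not taken from the paper):

```python
import numpy as np

# Illustrative magnitude-varying rectangle wave with noisy measurements
# r(t) ~ N(0, 0.002). Amplitudes and period are assumptions.
rng = np.random.default_rng(1)
T = 100
t = np.linspace(0.0, 1.0, T)
amplitudes = np.array([1.0, 0.6, 1.4, 0.8, 1.2])   # one jump height per cycle
cycle = np.minimum((t * len(amplitudes)).astype(int), len(amplitudes) - 1)
# Rectangles of width 0.1 alternating between 0 and the cycle's amplitude.
f = amplitudes[cycle] * np.where((t * 10).astype(int) % 2 == 0, 1.0, 0.0)
y = f + rng.normal(0.0, np.sqrt(0.002), size=T)
```

The piecewise-flat levels with differently sized jumps reproduce the two properties that make the signal hard for stationary GPs.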
We average the RMSE and NLPD as well as the computational time over repeated runs with different random seeds. For visualization, we uniformly choose the results under the same random seed. Figure 4 shows the results of GP and NS-GP regression. Both GP and NS-GP exhibit an overfitting problem on this rectangle signal, while the estimated posterior variance of NS-GP is significantly smaller than that of GP. The outcome of GP is expected, as its covariance function is stationary. Because there are no constraints (e.g., time-continuity) on the parameters of NS-GP, the learnt ℓ^2_{1,1} and σ^2_{1,2} overfit to the likelihood function individually at each time instant (cf. Paciorek and Schervish, 2006). From Table 1 we can see that the RMSE and NLPD of GP and NS-GP are very close.
The results of B-MAP on the batch-DGPs are shown in Figure 5. We can see a slight improvement over GP and NS-GP in terms of overfitting. However, the learnt function f(t) of B-MAP is not smooth enough and is jittery. For B-MAP on DGP-2, the estimated ℓ^2_{1,1} and σ^2_{1,2} change abruptly at the jump points and do not stay at flat levels, especially ℓ^2_{1,1}. On the contrary, the estimated ℓ^3_{1,1} and σ^3_{1,2} in the last layer of DGP-3 stay mostly flat while changing sharply at the jump points. From Figure 5 and the RMSEs in Table 1 we can see that the results of B-MAP on DGP-2 and DGP-3 are almost identical, with only subtle differences.
Compared to the batch-DGP, the SS-DGP approach gives a better fit to the true function. This is demonstrated in Figure 7, where SS-MAP is used. There is no noticeable overfitting in the SS-MAP estimates. The learnt function f is smooth and fits the actual function to a reasonable extent. For SS-MAP on DGP-2, the estimated ℓ^2_{1,1} and σ^2_{1,2} mostly stay at a constant level and change rapidly at the jump points. From the second and third rows of Figure 7 and from Table 1, we see that SS-MAP achieves better results on DGP-3 than on DGP-2. We also find that the learnt parameters ℓ^2_{1,1} and σ^2_{1,2} of DGP-3 appear to be smoother than those of DGP-2.
Apart from the SS-MAP solution to the SS-DGP, we also demonstrate the Bayesian filtering and smoothing solutions in Figures 8, 6, and 9. Figure 8 shows the results of CKFS on DGP-2. The regression result on DGP-2 is acceptable, though the estimate is overly smooth at the jump points. The learnt parameter ℓ^2_{1,1} also changes significantly at the jump points and stays flat elsewhere. Moreover, we find that the estimated log σ^2_{1,2} and Cov[f, σ^2_{1,2} | y_{1:k}] converge to zero very quickly, especially the covariance estimate. This phenomenon resembles the vanishing covariance of Theorem 1. In this case, the estimated log σ^2_{1,2} converges to the prior mean of the SS-DGP, which is zero, due to the vanishing covariance. Therefore, for this experiment and all the following experiments, we treat all the magnitude parameters of the Matérn processes (e.g., σ^2_{1,2}) as trainable hyperparameters learnt by grid search. The results are illustrated in Figure 6. However, we identify a numerical difficulty when applying CKFS to DGP-3: with many hyperparameter settings, CKFS fails due to numerical problems (e.g., singular matrices). The EKFS still works on DGP-3, so we plot its results in the second row of Figure 6. The estimated f of EKFS appears to be overly smooth, especially at the jump points.
Also, the estimated variances of ℓ^2_{1,1} and σ^2_{1,2} are significantly large.
Figure 9 illustrates the results of PF-BS (Fig. 9: PF-BS regression results on model DGP-2 (first row) and DGP-3 (second and third rows); the shaded area stands for the 95% confidence interval). We find that the regression results are reasonably close to the ground truth, and the estimated f is smooth. The estimated parameters ℓ^2_{1,1} and σ^2_{1,2} for PF-BS on DGP-2 show a pattern similar to the results of SS-MAP, CKFS, and EKFS, changing abruptly only at the jump points. However, the ℓ^2_{1,1} of DGP-3 does not generally stay flat, and σ^2_{1,2} does not change significantly at the jump points. In Table 1, the RMSEs of PF-BS on DGP-3 are better than on DGP-2. Also, PF-BS is slightly better than PF.
We now summarize the numerical results in terms of RMSE, NLPD, and computational time from Table 1. Table 1 demonstrates that the DGP methods using MAP, PF, and PF-BS outperform GP and NS-GP on this non-stationary regression problem. Moreover, the RMSEs and NLPDs are improved by using DGP-3 over DGP-2, except for the Gaussian filters and smoothers. Among all regression methods, SS-MAP is the best in terms of RMSE, followed by B-MAP and PF-BS. In terms of NLPD, PF-BS attains the lowest value. However, the NLPD and RMSE results of PF and PF-BS have very large deviations, which are reduced by using DGP-3 instead of DGP-2. We find that the Gaussian filters and smoothers (CKFS and EKFS) are the fastest, followed by GP and NS-GP. We also notice that, for all methods, DGP-3 is more time-consuming than DGP-2. Even though we implemented PF-BS with CPU-based parallelization, its time consumption is still significant.

Regression on Composite Sinusoidal Signal
In this section, we conduct another experiment on a non-stationary composite sinusoidal signal given by

f(t) = sin²(7π t cos(2π t²)) cos(5π t) + 2,  t ∈ [0, 1],

where f is the true function, the measurements are y(t_k) = f(t_k) + r(t_k), and r(t) ∼ N(0, 0.01). This type of signal has been used by, for example, Rudner et al. (2020), Vannucci and Corradi (1999), and Monterrubio-Gómez et al. (2020). A demonstration is plotted in Figure 10. In contrast to the discontinuous rectangle wave in Equation (42), this composite sinusoidal signal is smooth, so it is appropriate to postulate a smooth Matérn prior. The signal is challenging in the sense that its frequencies and magnitudes change rapidly over time.
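A transcription of this test signal together with its noisy measurements can be sketched as follows (the grouping of terms follows our reading of the expression above, and the random seed is arbitrary):

```python
import numpy as np

# Composite sinusoidal test signal with measurement noise r(t) ~ N(0, 0.01).
rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 2000)
f = np.sin(7.0 * np.pi * t * np.cos(2.0 * np.pi * t**2))**2 \
    * np.cos(5.0 * np.pi * t) + 2.0
y = f + rng.normal(0.0, np.sqrt(0.01), size=t.size)
```

Since sin² lies in [0, 1] and the cosine factor in [−1, 1], the clean signal stays within [1, 3], while its instantaneous frequency varies strongly with t.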
The settings of this experiment are the same as in the rectangle wave regression of Section 7.1, except that we generate the signal with 2,000 samples. With this number of measurements, the NS-GP and MAP-based solvers fail because they do not converge in a reasonable amount of time. We also select three other GP models from the literature for comparison, namely the fully independent conditional (FIC, Quiñonero-Candela and Rasmussen, 2005) sparse GP with 500 pseudo-inputs, the warped GP (WGP, Snelson et al., 2004), and the non-stationary GP (HGP) of Heinonen et al. (2016). The results for GP, sparse GP, WGP, and HGP are shown in Figure 11. We find that the estimate of GP overfits to the measurements and is not smooth. On the contrary, the estimate of the sparse GP is underfitted. The result of WGP is similar to that of GP, but its estimated variance is large. The HGP works well except for the part after t > 0.8 s. The ℓ^2_{1,1} and σ^2_{1,2} learnt by HGP are smooth.
Figures 12 and 13 plot the results of EKFS and CKFS, respectively (Fig. 12: EKFS regression results on model (45) using DGP-2 (first row) and DGP-3 (second row); Fig. 13: CKFS regression results on model (45) using DGP-2 (first row) and DGP-3 (second row); the shaded areas stand for 95% confidence intervals). From visual inspection, the Gaussian filter and smoother based DGPs outperform GP, sparse GP, WGP, and HGP. We also find that the estimates from EKFS and CKFS are quite similar, although EKFS gives a smoother estimate of f than CKFS. The learnt ℓ^2_{1,1} and σ^2_{1,2} also adapt to the frequency changes of the signal. It is worth noticing that the estimated ℓ^3_{1,1} in the third layer of DGP-3 is almost flat for both CKFS and EKFS.
The RMSE, NLPD, and computational time are listed in Table 2. The table verifies that the DGPs using Gaussian filters and smoothers (i.e., CKFS and EKFS) outperform the other methods in terms of RMSE, NLPD, and computational time. Also, CKFS gives slightly better RMSE and NLPD than EKFS. For this signal, using DGP-3 yields no improvement in RMSE or NLPD over DGP-2.

Real Data Application on LIGO Gravitational Wave Detection
The theoretical existence of gravitational waves was predicted by Albert Einstein in 1916 from a linearized field equation of general relativity (Hill et al., 2017; Einstein and Rosen, 1937). In 2015, the Laser Interferometer Gravitational-wave Observatory (LIGO) team made the first observation of gravitational waves, originating from the collision of two black holes, known as the event GW150914 (Abbott et al., 2016). The detection was originally done using a matched-filter approach. It is of interest to test whether the GP and DGP approaches can detect the gravitational wave in the LIGO measurements. We formulate the detection as a regression task.
We use the observation data provided by the LIGO Scientific Collaboration and the Virgo Collaboration². As shown in the first panel of Figure 14, the data contain 3,441 measurements sampled at a frequency of 16,384 Hz. We use a time interval of 10^{-5} s to interpolate the data, which results in 10,499 time steps. The reference gravitational waveform computed numerically from general relativity is shown in Figure 14, and we use it as the ground truth for comparison.
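The interpolation step can be sketched as follows. A synthetic sinusoid stands in for the actual strain record, and the grid below is illustrative rather than a reproduction of the exact 10,499-step grid used in the paper:

```python
import numpy as np

# Resample a uniformly sampled strain record onto a finer 1e-5 s grid
# with linear interpolation. The signal here is a placeholder.
fs = 16384.0
t_obs = np.arange(3441) / fs                  # original sample times
strain = np.sin(2 * np.pi * 200.0 * t_obs)    # placeholder strain signal
t_grid = np.arange(0.0, t_obs[-1], 1e-5)      # uniform 1e-5 s grid
strain_interp = np.interp(t_grid, t_obs, strain)
```

Linear interpolation keeps the resampled values within the range of the original samples, which is sufficient for feeding the state-space solvers a regular time grid.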
We use the previously formulated regression models GP and DGP-2, as shown in Figure 2. Unfortunately, the NS-GP and MAP-based solvers are not applicable due to the large number of observations and interpolation steps. Hence, we choose the Gaussian filters and smoothers (i.e., CKFS and EKFS) for the DGP regression.
The detection results are shown in the second and third rows of Figure 14. We find that the DGP-2 model gives a better fit to the gravitational wave than GP; the DGP-2 estimate is almost identical to the numerical relativity result. GP, however, fails because its estimate overfits to the measurements. The outcome of DGP-2 is also explainable by examining the learnt parameter ℓ^2_{1,1}: the length scale adapts to the frequency changes of the gravitational wave, which is an expected feature of the DGP model. The results of CKFS and EKFS are similar, while EKFS gives smoother results. Moreover, the Gaussian filters and smoothers on DGP-2 require significantly less computation time than GP. In a single run of the program, CKFS and EKFS take 1.5 s and 0.4 s, respectively, while GP takes 202.2 s (including hyperparameter optimization).

Summary of Experimental Results
In this section, we summarize the results of the state-space methods presented in the sections above. In the rectangle signal regression experiment, the state-space MAP and particle smoothing methods are better than the Gaussian smoothers (e.g., EKFS and CKFS) in terms of RMSE and NLPD. Based on the results of the composite sinusoidal signal regression experiment, the Gaussian smoothers are particularly efficient in computation. However, Gaussian smoothers may not be suitable solvers for SS-DGP models that have both length scale and magnitude parameters included in the DGP hierarchy. This is proved in Section 6 and is also shown numerically in Figure 8.

Conclusion
In this paper, we have proposed a state-space approach to deep Gaussian process (DGP) regression. The DGP is formulated as a cascaded collection of conditional Gaussian processes (GPs). Using the state-space representation, we cast the DGP into a non-linear hierarchical system of linear stochastic differential equations (SDEs). We then propose maximum a posteriori as well as Bayesian filtering and smoothing solutions to the DGP regression task. The experiments show significant benefits when applying the DGP methods to simulated non-stationary regression problems as well as to a real-data application in gravitational wave detection.
The proposed state-space DGPs (SS-DGPs) have the following major strengths. The DGP priors are capable of modeling larger classes of functions than conventional and non-stationary GPs. In the construction of a state-space DGP, one does not need to manually choose or design valid covariance functions as in Paciorek and Schervish (2006) or Salimbeni and Deisenroth (2017b). In state-space form, DGP regression does not require evaluating the (full) covariance function either. Moreover, state-space methods are particularly efficient for temporal data, as they have linear computational complexity with respect to time.
In addition, we have identified a wide class of SS-DGPs that are not suitable for Gaussian smoothers to solve. More specifically, these are the SS-DGP models that have both their length scale and magnitude parameters modeled as GP nodes under the assumptions in Section 6. When Gaussian smoothers are applied to these SS-DGPs, their Kalman gains converge to zero as time goes to infinity, which makes the Gaussian smoothers use no information from the data when updating their posterior distributions. This is one limitation of SS-DGPs. Although one can use the MAP and particle smoothing methods in place of Gaussian smoothers, these methods can be computationally demanding.
For future investigation, enabling automatic differentiation is of interest. In this paper, we have only applied grid search to a large number of trainable hyperparameters, which results in very crude optimization. By using libraries such as TensorFlow or JAX, we could also obtain Hessians, which can be used to quantify the uncertainty of the MAP estimates.
Another useful future extension is to exploit data-scalable inference methods, such as sparse variational methods. For example, Chang et al. (2020) solve state-space GP regression problems (possibly with non-Gaussian likelihoods) by using a conjugate variational inference method while retaining linear computational complexity in time. Their work is extended by Wilkinson et al. (2021), who introduce sparse inducing points to the said variational state-space GP inference, resulting in a computational complexity that is linear in the number of inducing points. Although these works are mainly concerned with standard state-space GPs (i.e., linear state-space models), it would be possible to apply these methods to SS-DGPs as well, for example, by linearizing the state-space models of the SS-DGPs.
Generalizing the temporal SS-DGPs to spatio-temporal SS-DGPs (see the end of Section 4.2) would be worth studying as well, by extending the methodologies introduced in Emzir et al. (2020).

Appendix A Derivatives of the MAP Objective

We define the derivatives in a set as in Equation (6), where the m-th element of g^i_k ∈ R^N is given in Equation (22). We collect the derivatives of the state in a set ∂L_SMAP/∂U_{1:N} = {∂L_SMAP/∂U_k : k = 0, . . . , N} over all time steps, where each element is a column vector. For the initial condition, its derivative follows analogously.

Appendix B Derivatives of Loss Function
For k = 1, 2, . . . , N − 1, the derivative is given by an expression involving z_k, where z_k is a vector for k = 0, 1, . . . , N − 1. Temporarily writing u^m_k for the m-th component of the state U_k, the m-th element of z_k follows accordingly. Finally, the derivative at the last time step has an analogous form.

Appendix C Derivation of Equation (36)

Let us denote by P̄_k the predictive covariance at step k. Then, by the update step of Gaussian filters (see, e.g., Algorithm 6.3 of Särkkä, 2013), we have K_k = P̄_k H^T S_k^{-1} and S_k = H P̄_k H^T + R_k, where H = [1 0]. Substituting K_k and S_k into P_k = P̄_k − K_k S_k K_k^T gives the updated covariance. Hence, the P^{f,σ}_k element of P_k is P^{f,σ}_k = P̄^{f,σ}_k R_k / (P̄^{f,f}_k + R_k).
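A quick numerical sanity check of this update-step derivation (the predicted covariance values below are arbitrary) compares the direct matrix computation against the closed-form cross-covariance factor:

```python
import numpy as np

# Verify that the Gaussian-filter update with H = [1 0] reduces the
# cross-covariance entry to Pbar_fs * R / (Pbar_ff + R).
Pbar = np.array([[1.3, 0.4],
                 [0.4, 0.9]])           # arbitrary predicted covariance
R = 0.2                                  # measurement noise variance
H = np.array([[1.0, 0.0]])
S = H @ Pbar @ H.T + R                   # innovation variance (1x1)
K = Pbar @ H.T / S                       # Kalman gain (2x1)
P = Pbar - K @ S @ K.T                   # updated covariance
closed_form = Pbar[0, 1] * R / (Pbar[0, 0] + R)
```

Since R/(P̄^{f,f}_k + R_k) < 1 whenever R_k is finite and P̄^{f,f}_k > 0, each update strictly shrinks the cross-covariance, consistent with Theorem 1.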

Appendix D Samples from DGP Priors and Predictions from DGP Posterior Distributions
To demonstrate the non-stationarity of the DGP models, we draw samples from the DGP priors defined in Example 2 and Figure 2c. The samples are drawn using the TME-3 discretization approach (Zhao et al., 2021) on t ∈ [0, 10] with a time interval of ∆t = 0.01 s. We show the samples in Figure 15, where we can clearly see the non-stationary features of the process f(t). The samples also switch between stationary and non-stationary behaviour at random.
It is also of interest to see how a fitted DGP model behaves in the future (i.e., when extrapolated). For this purpose, we select the CKFS-fitted DGP-2 from the sinusoidal experiment as an example. We draw prediction samples starting from the end of the smoothing posterior distribution and predict until t = 4 s. We see that at the beginning (t = 1 s) the samples of f retain features similar to the fitted f. As t approaches the end, f(t) gradually becomes smoother because its length scale approaches the stationary state.

Appendix E Hyperparameter Values Found via Grid Search
For the sake of reproducibility, we list the hyperparameters found by grid search in Table 3. Due to the large number of unknown hyperparameters, the grid search routine assumes that the GP nodes in the last layer share the same hyperparameters. Hereafter we use the notations ℓ and σ to represent the last-layer length scale and magnitude.