Feature Engineering with Regularity Structures

We investigate the use of models from the theory of regularity structures as features in machine learning tasks. A model is a polynomial function of a space–time signal designed to well-approximate solutions to partial differential equations (PDEs), even in low regularity regimes. Models can be seen as natural multi-dimensional generalisations of signatures of paths; our work therefore aims to extend the recent use of signatures in data science beyond the context of time-ordered data. We provide a flexible definition of a model feature vector associated to a space–time signal, along with two algorithms which illustrate ways in which these features can be combined with linear regression. We apply these algorithms in several numerical experiments designed to learn solutions to PDEs with a given forcing and boundary data. Our experiments include semi-linear parabolic and wave equations with forcing, and Burgers’ equation with no forcing. We find an advantage in favour of our algorithms when compared to several alternative methods. Additionally, in the experiment with Burgers’ equation, we find non-trivial predictive power when noise is added to the observations.


Introduction
The aim of this paper is to explore the effectiveness of models from Hairer's theory of regularity structures [Hai14] as feature sets of space-time signals.A model is a collection of polynomial functions of the signal which has been used to great success in the analysis of stochastic partial differential equations (SPDEs).This paper is the first, to our knowledge, to explore its effectiveness in a machine learning context.
One of the motivations for this study comes from the fact that models are a higherdimensional analogue of the path signature, a central object in Lyons' theory of rough paths [Lyo98].The signature is the collection of the iterated integrals of a path, which has a rich mathematical structure; it is known to characterise the path up to a natural equivalence relation [Che58,HL10,BGLY16] and leads to a natural notion of non-commutative moments on pathspace [CL16,CO22].Over the past decade, the ability of the signature to encode information about a path in an efficient and robust way has made it a powerful tool in the analysis of time-ordered data.Examples of applications of signatures include the recognition of handwriting [XSJ + 18, Gra13] and gestures [LZJ17], analysis of financial data [LNA19,KLA20], statistical inference of SDEs [PL11], analysis of psychiatric and physiological data [AGG + 18, MKN + 19], topological data analysis [CNO20], neural networks [KBPA + 19], and kernel learning [KO19,CO22].1 See [CK16] for a gentle introduction to the path signature and some of its early applications.

Our contribution
Our main contribution is to introduce a novel concept of model feature vector (MFV) that provides an extension of path signatures outside the context of time-ordered data, that is, to data parameterised by multi-dimensional space.The MFV is a collection of space-time functions from D to R, where D ⊂ R d for d ≥ 0, that is built from an input signal ξ : D → R K .In the context of solving PDEs, the signal may incorporate a forcing term and boundary data.The motivation for the MFV is the fact that solutions to PDEs with a given forcing term and boundary data should be well-approximated at a space-time point z by components of the corresponding MFV evaluated at the same space-time point; we give further details in Section 2.1.In a machine learning context, the MFV provides a set of features of the original data that can be used in learning algorithms.
In addition to proposing the MFV, we give evidence that these features can carry important information in several test cases.The basic problem on which we test the use of MFV is: Problem 1 For a point z ∈ R d , predict the value u(z), where u solves a PDE with a known forcing ξ and boundary condition u 0 but with unknown coefficients.
We focus on the case that the PDE in question is an evolution equation.To address Problem 1, we propose two algorithms, Algorithms 1 and 2, in Section 2.3 based on elementary linear regression with MFVs in supervised learning tasks.Algorithm 1 is designed to predict u(z) in the presence of a general forcing ξ, while Algorithm 2 is designed to work when there is no forcing (or equivalently ξ = 0) in which case one can leverage the flow property of u to improve predictability.An important feature of Algorithm 2 is that it predicts u(t, x) for all space-time points (t, x), thereby effectively learning the entire function u.
We investigate the effectiveness of Algorithms 1 and 2 in numerical experiments in Section 3. We apply Algorithm 1 to non-linear parabolic and wave equations with forcing and fixed initial conditions, and apply Algorithm 2 to Burgers' equation with no forcing but varying initial condition.
In the case of Burgers' equation, Algorithm 2 performs similarly to an adaptation of the PDE-FIND algorithm [RBPK17] on noiseless data and data with small noise, and outperforms the latter on data with large noise (see Section 3.3.1).In the case of a parabolic equation, Algorithm 1 outperforms some basic off-the-shelf regression algorithms (SVR, K-Nearest Neighbours, Random forests) applied simply by treating the forcing as a large vector.
We emphasise that the definition of MFV in Section 2.2 and the algorithms in Section 2 are presented independently of PDEs and could be applied to learn other functions of the underlying signal, not necessarily the solution of a PDE -see Section 4 for further discussion.

Related works
The MFV is inspired by the notion of a model from [Hai14].The main difference between our definition and that in [Hai14] is that we suitably include the boundary data of the signal as part of the model.The path signature is a special case of MFV (Proposition 2.4).
The idea to apply machine and statistical learning methods to find, predict, or study solutions of PDEs has seen much attention in recent years.See for example [MQdH18, RPK19, SS18, BSHHB19, HJE18, RBPK17] and the references therein.We also mention the works [LM21, LKRY22, ZWJW20, HWZ22, JK20] that, like ours, treat boundary data in machine learning-based solvers.Many works in this direction have focused on new design of learning algorithms.In contrast, our main contribution comes from designing a new set of features which can be used in a range of algorithms.As such, we expect our approach to complement many existing methods.
Since the appearance of this article, several works have built on MFVs (or related approaches) especially in combination with neural networks.See for example [HMC + 22, SLG22].

Model feature vectors and regression algorithms
In this section we motivate and define the "model feature vector" and introduce two algorithms based on models for learning functions of space-time signals.
We denote by N = {0, 1, 2, . . .} the set of non-negative integers and by R the set of real numbers.Assume that we are given a spatial domain D ⊂ R d for d ≥ 0 and a time horizon We will also define the order of a multi-index as |a| := d i=1 a i .Note that ∂ 0 u = u.We let ∂ t denote the partial derivative with respect the time parameter t ∈ [0, T ].

Motivation
Our motivating problem is to learn the solution of a PDE on [0, T ] × D given by where L is a linear differential operator, u 0 : D → R is the initial condition, and take as arguments the partial derivatives of u up to order q (i.e. the jet of u to level q) and are assumed to be smooth and unknown, while (ξ, u 0 ) is known.
Such an equation is often called a PDE with forcing ξ.When ξ is a random function, e.g.space-time white noise, it is also referred to as a stochastic PDE (SPDE).
For this discussion, we assume L = ∂ t − ν∆ is the heat operator with viscosity ν > 0, where ∆ = d i=1 ∂ 2 i is the Laplacian on D ⊂ R d , and µ, σ depend only on u (and not its derivatives).The case when µ, σ depend on ∂ a u with |a| > 0 as well as the case of other choices of L, e.g. the wave operator, are left to the reader.
Under good enough assumptions on the functions µ, σ, u 0 and the forcing ξ, equation (2.1) admits a local in time mild solution.3In particular, Picard's Theorem implies that u is the limit of the following recursive sequence (2.2) where the operators I and I c are defined by 2One might need to include other initial information like an initial speed for the case of the wave equation from Section 3.2.
for functions f : [0, T ] × D → R and g : D → R, subject to the same boundary conditions as in (2.1).
The idea is now to Taylor expand the function µ up to m terms and the function σ up to ℓ terms in the equation for u (n+1) .Define u 0,m,ℓ = I c [u 0 ] and recursively set Then, heuristically, since Taylor's expansion implies u n,m,ℓ → u (n) as m, ℓ → ∞ and since Picard theorem implies u (n) → u as n → ∞, we see that u n,m,ℓ should be a good candidate for approximating u.It is not difficult to see from (2.4) that u n,m,ℓ is a polynomial function of I c [u 0 ] and ξ that involves iterated integrals (i.e.iterated applications of I).Recalling that the unknowns are µ and σ, and thus µ (k) and σ (k) are also unknown, it is sensible to encode as features these polynomials of I c [u 0 ] and ξ and learn the solution map (u 0 , ξ) → u via linear regression.Our definition of "model feature vector" below precisely encodes this collection of polynomials that appear in u n,m,ℓ in a more general setting.These polynomials closely resemble models appearing in the theory of regularity structures, see [Hai14,Sec. 8], which is the motivation behind our terminology.

Model feature vectors
We now generalise and abstract the polynomial features discussed in the previous subsection.Fix for the rest of this section a pair ({u (i) } i∈J , ξ) (an "observed signal") where J is a finite index set (possibly empty) and (2.5) We call ξ the forcing.The case d = 0 corresponds to just ξ : [0, T ] → R K and u (i) : [0, T ] → R. In the experiments in Section 3, we sometimes let u be fixed, so that the signal is only ξ, and sometimes we fix ξ (essentially taking ξ ≡ 0) so that the signal is only u (i) .One should think of {u (i) } i∈J as "boundary conditions", like 2), and where we allow multiple boundary conditions (as needed, e.g. for the wave equation).
Let us fix a linear operator I that maps space-time functions f : [0, T ] × D → R to space-time functions I[f ].For example, I[f ] could be a convolution with some space-time kernel or a solution to some linear PDE with forcing f .Definition 2.1 Consider a tuple of non-negative integers α = (m, ℓ, q) ∈ N 3 and n ∈ N. The model feature set S n α is the finite set of formal symbols defined inductively by4 S 0 α = J and where Here Ξ (k) and D a i are formal symbols.When a i = 0 we simply write D a i τ = τ .
The model feature vector (MFV, or simply model) M n α of (u (i) , ξ) as in (2.5) is the family functions M n α : S n α → R [0,T ]×D that we denote by where We call n the height of a model, m the additive width, ℓ the multiplicative width, and q the differentiation order.Furthermore, • if q = 0, we call M n α a model without derivatives, • if ℓ = 0, we call M n α a model without forcing, and • if J is empty, we call M n α a model without initial conditions.
We will often use an abuse of notation and write f τ ∈ M n α meaning that there exists a symbol τ ∈ S n α such that M n α [τ ] = f τ .The symbols in S n α can be represented as decorated combinatorial rooted trees as in [BHZ19, Sec.2].
Note that additive width m limits how many functions could be multiplied if none of them includes a component of the forcing ξ, while multiplicative width ℓ limits how many functions could be multiplied if one of them is a component of ξ.
To give an example at level n = 2, one of the symbols in We next show precisely how the path signature is generalised by the MFV.Consider n ≥ 1 and a differentiable path
Then W n ⊂ S n α and, for all (i 1 , . . ., i n ) ∈ {1, . . ., K} n , Proof.The inclusion W n ⊂ S n α and that φ : {1, . . ., K} n → W n is a bijection are clear.Equality (2.8) is clearly true for n = 1 and in general follows by induction: where the second equality follows from the inductive hypothesis of (2.8) for n − 1 and the third equality follows readily from the definition (2.7).
In our experiments below, the finite index set J and "boundary conditions" {u (i) } i∈J will be taken as follows: in Section 3.3 J will be a singleton J = {c} and u (c) will be the solution to the linear heat equation with a given initial condition for I c as in Section 2.1; in Section 3.1 J will primarily be empty J = ∅ as we will ignore initial conditions (though see Section 3.1.2for an exception); in Section 3.2 J will contain two elements J = {c, s} and u (c) (resp.u (s) ) will be the solution of the linear wave equation with initial condition u (c)  0 and initial speed 0 (resp.initial condition 0 and initial speed u (s) 0 ), where both u (c) 0 , u (s) 0 are given.While we consider only a space-time setting, Definition 2.1 readily adapts to an purely spatial setting.In this case the linear operator

Regression algorithms
In this subsection, we propose two supervised learning algorithms which use the MFV of an input ({u (i) } i∈J , ξ) to learn an output u.While in principle there is no limitation of the nature of u (vector, classification label, etc.), we will consider the special case where u is a number associated to a space-time point or is a space-time function.In the experiments in Section 3, u will be the solution to a PDE with forcing ξ and a given initial condition.We will furthermore consider henceforth K = 1, so the forcing is R-valued and simply write ξ (1) = ξ; the generalisation K > 1 is left to the reader.

Prediction at one point
In the following algorithm, one should think of the observation u as a quantity which depends on the signal ({u (i) } i∈J , ξ) at a given space-time point

Algorithm 1 (Prediction at one point.)
Parameters: integers n, m, ℓ, q ∈ N and an operator I. Input: • a set of pairs ({v (i) } i∈J , ζ) ∈ U pr for which we want to make a prediction.
Step 1 Let α = (m, ℓ, q).For each (u, {u (i) } i∈J , ξ) ∈ U obs and (resp.each Step 2 Fit a linear regression of u against Step 3 For each ({v (i) } i∈J , ζ) ∈ U pr , construct a prediction u pr using the linear fit constructed from Step 2 and the associated model Recall that our motivating problem is to learn the solution of a PDE (2.1) at a given point (t, x) where (u 0 , ξ) are observed but µ, σ are unknown.Using the notation of Section 2.1, our typical choice for J is a singleton J = {c} with u (c) = I c [u 0 ] (but we use other choices if we wish to encode more or less boundary conditions).The heuristic reason why Algorithm 1 should work for predicting PDEs comes from the fact that functions in M n α constructed from ξ and I c [u 0 ] well approximate the n-th Picard iterate u (n) which itself should converge to the solution of (2.1) for smooth µ and σ.
Remark 2.5 If it is known that the equation (2.1) is additive, i.e. that σ is a constant, then the heuristic of Section 2.1 suggests that one should consider M n α with ℓ = 1.More generally, if it is known that both µ and σ are polynomials, then the heuristic suggests that taking m and ℓ − 1 greater than the respective degrees of µ and σ would likely not improve the accuracy of the above algorithm.These remarks follow from the fact that polynomials agree with their Taylor expansion (for high enough order of expansion).
Note that, in Algorithm 1, we regress against the functions in the model at one input space-time point (t, x) only (see Section 2.1 for the motivation behind this choice in the case of PDEs).In the case of path signatures (Definition 2.3), this corresponds to using only the endpoint T of the signature, i.e. S I 0,T (X), which is common practice (see e.g.[AGG + 18, KO19, CNO20]).There are situations, however, where it is beneficial to use the signature of a path over different segments, i.e. use S I s,t as a feature for different choices of [s, t] ⊂ [0, T ], and the choice of segments is a hyperparameter, see e.g. the sliding window approach of [XSJ + 18].It would be of interest to explore if a similar approach yields any benefit for MFVs.

Prediction using flow property
We will now focus on predicting functions u defined on all space-time points which have a given initial condition and no forcing.Algorithm 2 below is designed to work when u satisfies the time-homogeneous flow property: u(t, x) should depend on u(0, •) in the same way as u(t + h, x) depends on u(h, •).
The algorithm employs a discretisation of time which we assume is equally spaced, i.e. t k = δk where δ = T /N .The observed and predicted functions of this algorithm are both functions Assume further that we are given an additional linear map I c which is an initialising map: We briefly describe the algorithm in words.Suppose that we know or have a prediction for u(t k , x) for some k ∈ {0, . . ., N − 1} and all x ∈ D. Under the time-homogeneous flow property, it is natural to seek an approximation for u(t k+1 , x) using a functional linear regression of the form where a, b : D → R are functions to be learned and The time homogenous flow property implies that a and b τ are expected to only depend on the time step δ and not t k (but a, b τ can depend on x ∈ D).In the training phase, we therefore decompose each observation u : O T × D → R into N 'subobservations' u(t k , •) : D → R, for k = 0, . . ., N − 1, and learn the coefficients a, b τ from these subobservations.The prediction phase then recursively applies the formula (2.9) to predict u from the 'initial condition' u(0, •).In the following, we will sometimes write u t for the function u(t, •) : D → R.

Algorithm 2 (Prediction using flow property.)
Parameters: integers n, m, q ∈ N, operator I, and initialising map I c .

Input:
• a collection {u(t, x)} (t,x)∈O T ×D ∈ U obs of observed functions; • a collection u 0 ∈ U pr of initial conditions u 0 : D → R for which we want to make a prediction.
Step 3 For each . Make a prediction of u pr (t 1 , x) for each x ∈ D based on the fit from Step 2 and (f τ (δ, x)) fτ ∈M n α (u 0 ) .Step 4 Recursive step.For each u 0 ∈ U pr , k ≥ 1, and the predicted u and make a prediction of u pr (t k+1 , x) for each x ∈ D based on a linear fit from Step 2 and ) .
When specific boundary values are given, one might need to enforce these for the predicted function u pr .For example one might set u pr (t k , x) = 0 for x ∈ ∂D and every k = 0, . . .N if this was known.
As remarked earlier, Algorithm 2 effectively converts the size of the training set for the linear fit from Algorithm 2 aims to address a problem similar to that of learning a dynamical system.A different approach to this problem is dynamic mode decomposition, which is based on spectral analysis of the Koopman operator [Sch10, RMB + 09].

Feature selection and hyperparameters
The cardinality of S n α grows exponentially with n.To avoid overfitting or to speed up the learning, it can be important to restrict further the number of elements in S n α .We do this below by introducing a function called degree deg : S n α → R which satisfies deg(ξ) = η and deg(i) = η i , i ∈ J , for some η, η i ∈ R together with the inductive definition (2.10) for some β > 0. This definition is set so that deg of symbols from S n α will be usually larger for bigger n, m, ℓ, q.We will then perform the regression in our algorithms against the functions in the model whose symbol does not exceed a certain degree γ.When used, the degree function and cutoff γ are additional parameters in Algorithms 1 and 2.
We follow a "rule of thumb" of keeping the ratio (number of train cases):(number of predictors) above 10 (see [Har15,Sec. 4.4] and references therein for a discussion about such rules).Thus, one would choose γ so that the number of functions in f τ ∈ M n α with degτ ≤ γ is at least 10 times smaller than number of elements in U obs .
Note that it is much easier to keep the train cases to predictors ratio above 10 for Algorithm 2 because we have N |U obs | train cases compared to only |U obs | train cases in Algorithm 1.Nevertheless, it is still beneficial to use degree for Algorithm 2 for computational reasons.Indeed, the linear regression time complexity Thus, a large size of the model can drastically slow down the learning.
The use of deg and cutoff γ is motivated by analysis of SPDEs,5 but other choices of feature selection are possible and may lead to improved learning.For example, it is possible to consider higher degree features but with a sparsity (i.e.l 0 norm) penalty, which is often employed in dictionary learning, though this choice would still require the computation of a large model M n α , at least on the training data; it would be of significant interest to investigate this form of feature selection (we are not aware of any systematic studies of sparse dictionary learning even for signature features).
In addition to the degree, Algorithms 1 and 2 come with several further hyperparameters, one of which is the 'height' (number of iterated applications of a linear operator) of the model; in the case of path signatures (Definition 2.3), our 'height' is the 'level' of a signature.In all the numerical experiments in Section 3, it was established that using a model with a larger height improves the performance of regression.Another hyperparameter is the linear operator I used in the definition of a model.In the case of Burgers' equation analysed in Section 3.3, we additionally found that I can be 'guessed' from the data, yielding sensible results (the guess for I does not need to be precise but the precision influences the prediction power).Some further discussion is given in Section 4.

Numerical simulations
We present several numerical experiments where we learn the solution of the PDE (2.1) with different choices of operator L and non-linearities µ, σ.In general one needs to specify the boundary conditions of (2.1), i.e. the values of u(t, x) for x ∈ ∂D.For simplicity, we only consider periodic boundary conditions in our experiment, but Dirichlet or Neumann boundary conditions can be easily implemented.As in Section 2.3, we will only consider MFVs (Definition 2.1) with K = 1 and q ≤ 1.
To approximate the continuum, we fix a finite grid O ⊂ [0, T ] × D. We will assume that for some integer N ≥ 1 and O X is a finite grid of points in D. We will work with functions defined on the grid O instead of [0, T ] × D. For this purpose, the operator I, the partial derivatives ∂ i , and I c (whenever it is used) must have approximations on O.
In all experiments below we use an ordinary least squares linear regression.See https://github.com/andrisger/Feature-Engineering-with-Regularity-Structures.git for Python code containing implementation of the model and experiments from this section.

Parabolic PDEs with forcing
In this subsection we will suppose that the differential operator in (2.1) is given by L = ∂ t − ν∆, where ν > 0 is the viscosity and ∆ = d i=1 ∂ 2 i is the Laplacian on 5The notion of degree is similar to the one introduced in [Hai14] and is related to the Hölder regularity of functions in the model which are built from highly oscillatory signals.Remark 3.2 Algorithm 1 does not require knowledge of µ or σ in (2.1).However, in the experiments in this subsection, µ and σ will be polynomials, and we will use knowledge of their degree to choose the hyperparameters m, ℓ.Another hyperparameter is ν since this determines I through (2.3).When µ, σ, ν are completely unknown, these hyperparameters could be chosen, as usual, by splitting the data into training, validation and test sets, and tuning the hyperparameters on the validation set.See also Section 3.3 where a starting point for an approximation of the viscosity ν is derived from the training data.

Multiplicative forcing
Consider the following PDE where ξ is a space-time forcing.Here we discretise space and time respectively in 100 and 1000 evenly distanced points, which we use to define the grids O X and O T .We solve (3.1) for each forcing ξ using a finite difference method on the same discretisation O = O T × O X (see [LPS14, Sec.10.5]).
We take here ξ as approximations of space-time white noise.We performed Algorithm 1 both using the full model M n α from Definition 3.1 with viscosity ν = 1 and the model without the initial conditions (i.e.where J is assumed to be empty in the construction of S n α ).We have found that in practice using the full model did not drastically improve the errors (see Remark 3.3).Therefore, we primarily present results in this subsection for the model without the initial condition.
We construct a model with J = ∅ of height n = 4, additive width m = 3, multiplicative width ℓ = 2,6 and differentiation order q = 0 (because µ and σ do not depend on ∂ i u) so that α = (3, 2, 0).We assign a degree from Section 2.3.3 to satisfy (2.10) with β = 2 and degξ = −1.5.7 In the experiments below we only consider functions f τ ∈ M 4 α with degτ ≤ 5. We randomly sample 1000 realisations of approximations of white noise ξ on O and solve (3.1) for each realisation.We then split the pairs (u, ξ) into training 6See Remark 2.5 for a motivation behind taking these particular widths.7This is motivated by the Hölder regularity of space-time white noise being −1.5 − ε for any small ε > 0 and the fact that the heat operator I increases the Hölder regularity by 2. and test sets of size 700 and 300 respectively.There are only 56 functions in f τ ∈ M 4 α with degree degτ ≤ 5 thus corresponding to a ratio of training cases to the number of predictors of 700/56 = 12.5.In Figure 1, we show results of performing Algorithm 1 with models without initial conditions at various space-time points (t, x).In every subfigure one can see a scatter plot of actual values of u(t, x) from the test set plotted against the predicted values.The error is measured as a relative ℓ 2 error, i.e. for the vector of realisations R and predictions P we set and the relative ℓ 2 error is defined by We also report the R 2 coefficient of determination and the "error standard deviation" which we define as We also report the slope of the regression line between true values and the predicted ones.
In Figure 1 one sees a better fit for a small time t = 0.05 which is explained by the fact that the approximation of u by functions from M 4 α is local because of the Taylor expansions in the Picard iterations (see Section 2.1 and equation (2.4)).For larger times t ∈ {0.5, 1} as well as different spatial points x ∈ {0.5, 0.95} there seem to be no big statistical difference in accuracy.
In Table 1, we show average relative ℓ 2 error, slope of the regression line, and R 2 statistic for Algorithm 1 applied to models of heights 1, 2, 3 and 4. All experiments are performed 1000 times (i.e.splitting the data randomly into training/test sets) and the average values over these experiments are reported.Table 1 demonstrates that increasing height indeed allows for a better overall prediction.A similar result holds true for the width: additive width smaller than 3 (which corresponds to the third power in the non-linearity in (3.1)) gives on average a worse error.Remark 3.3 Note that the error for the middle time t = 0.5 is slightly worse than for the end time t = 1.This could be caused by using a model without initial conditions instead of the full model.Indeed, using the full model as in Definition 3.1 allows to slightly reduce the error for the prediction at (t, x) = (0.5, 0.5) to 7.4% (with the same n = 4 and α = (3, 2, 0)) while making almost no change to the error for the prediction at (t, x) = (1, 0.5) and at (t, x) = (0.05, 0.5).A heuristic reason why the effect of the fixed initial condition could be ignored for the parabolic equations could be a good local structure and dissipative properties of the heat operator.The advantage of using models with J = ∅ for parabolic equations is that such models contain fewer functions, which both improves the speed of the computation and potentially helps with problems of overfitting.
We compare the results from Algorithm 1 with several basic off-the-shelf learning algorithms.To do this, for (u, ξ) ∈ U obs we transform all the space-time points of the forcing ξ into a vector (in this case a vector of 100000 points) and applied support vector regression (SVR), K-nearest neighbours (KNN), and random forest regressions (RFR) to predict the value of u(t, x) for (t, x) = (1, 0.5) ∈ [0, T ] × D. These algorithms were applied with the default settings in the Python sklearn library (e.g.SVR was taken with the RBF kernel) and each algorithm was tested on (t, x) = (0.05, 0.5) (t, x) = (0.5, 0.5) Prediction is performed at space-time points (t, x) = (0.05, 0.5), (0.5, 0.5), (1, 0.5), (1, 0.95).
1000 realisations of the noise with a 700:300 split for training and testing data as before.We give the results in Table 2. None of these algorithms gave better than 35% error.We also subsampled ξ by taking 50, 200 and 1000 evenly sampled space-time points (vs. the full 100000 points) to avoid over-fitting, but these three choices only increased the error for each regressor.This short comparison demonstrates that the MFV captures information that is lost by treating the noise simply as a large vector.

Two-dimensional spatial domain
We consider a similar experiment as in the previous subsection but over a twodimensional domain.Specifically, we consider the PDE where T 2 def = R 2 /Z 2 is the two-dimensional torus that we identify with [0, 1) 2 as a set (i.e.we consider periodic boundary conditions).We take the forcing ξ now as white in time and coloured in space.The precise definition is ξ(t, x, y) = β(t)w(x, y) where β(t) is Brownian motion, and w is independent of β and normally distributed according to N (0, 3 3/2 (−∆ + 49I) −3 ) as in [SLG22,Sec 4.3].Here ∆ is a periodic Laplacian on T 2 and we take its discrete periodic approximation to generate the data.The reason why we take a coloured noise in space instead of space-time white noise is that the equation (3.2) is singular in two spatial dimensions (see [Hai14]) and does not have a classical solution if the noise is white in both space and time.
We discretise space into 64 × 64 points and time into 1000 points.We construct a model with J = {c} a singleton and height n = 2 and remaining parameters (m, ℓ, q and degree cut-off) as in Section 3.1.1.For the corresponding function u c in (2.5) we take u (c) = I c [u 0 ] + I[ξ] where I c and I are as in (2.3).Remark that this definition slightly differs from Definition 3.1; we made this choice to still incorporate the initial condition while keeping the number of functions in M n α relatively small for computational reasons (cf.Remark 3.3).Note though that this implies that the height 0 model already has some non-trivial information (coming from I[ξ]).
We sampled 1000 realisations of ξ and performed Algorithm 1 as in Section 3.1.1(with 700 training and 300 testing samples).Figure 2 shows the outcome for three space-time points.As in Figure 1, we see an decrease in accuracy at larger times, which is expected from theory (as explained in Section 3.1.1).We also mention that we performed the same experiment with no initial conditions (i.e. with J = ∅ as Section 3.1.1),but achieved no meaningful predictive power (see Section 3.2 for a similar outcome for the wave equation), a feature not encountered in the one-dimensional setting of Section 3.1.1.
We furthermore performed the experiment 1000 times, each time resampling the training and testing set, and record the averages of the relative ℓ 2 error, slope of regression line, and R 2 statistic.The results are recorded in Table 3.As in Table 1, we see a sharp rise in predictive power with the height of the model, further demonstrating that non-linearities in the MFV capture important information of the underlying signal.We note that height n = 2 here effectively corresponds to height n = 3 of Section 3.1.1since we included I[ξ] in u (c) .

Additive forcing
We repeat the same experiment for the additive version of the equation (3.1) namely: (3.3) Discretisation of space-time and number of training and test cases is the same as in Section 3.1.1.We perform Algorithm 1 using the model M 5 α from Definition 3.1 with viscosity ν = 1 and α = (3, 1, 0), without initial conditions (J = ∅), and with degree ≤ 7.5, which gives 58 functions.8Note that since multiplicative width is 1 this reduces the number of functions in the model compared to the multiplicative case of Section 3.1.1,which allows us to take a larger height and upper bound for 8See Remark 2.5 for a motivations behind taking these particular widths.the degree.Figure 3 shows the results for space-time points (t, x) = (0.5, 0.5) and (t, x) = (1, 0.5).One can see that the additive equation exhibits a worse prediction for long times compared to the multiplicative equation (Figure 1 and Table 1) but a slightly better prediction for short times (for t = 0.05 error is even better: 0.1% in comparison to ≈ 5% in the multiplicative case).α with α = (3, 1, 0), without initial conditions and of degree ≤ 7.5.The x-axis contains values of u(t, x) for realisations of the forcing ξ from the test set and the y-axis contains predictions of the linear regression.Subplots (a), (b) show predictions at space-time points (t, x) = (0.5, 0.5) and (t, x) = (1, 0.5) respectively.
The prediction error rises in this additive case as t increases.This is expected by a similar reason as mentioned in Section 3.1.1,which is the Taylor expansions in the Picard iterations approximating u (see (2.4)).It would be of interest to extend Algorithm 1 to decrease this error.A potential way to do this is to generalise and combine Algorithms 1 and 2 and compute models on subintervals to learn the 'flow' of the equation, i.e. build models from the initial condition and the forcing over subintervals of [0, T ], use this model to predict the solution over subintervals, and chain the predictions together.(See also the discussion at the end of Section 2.3.1.)We leave this generalisation for a future work.

Wave equation with forcing
We will now consider a wave equation taking L = ∂ 2 t − ∆ in (2.1) and predict solutions of the following non-linear wave equation where ξ is a space-time forcing which we again take as a realisation of white noise.We will compare Algorithm 1 with both models with and without initial conditions.
Discretisation of space-time and number of training and test cases is the same as in Section 3.1.1.Note that, in the general case, the level zero of the full model M 0 α for the wave equation should not only include the contribution of the initial condition u 0 but also the contribution of the initial speed ∂ t u(0, x) = v 0 .This leads to the following definition.
Definition 3.4 Consider an initial condition u 0 : D → R, an initial speed v 0 : D → R, and a forcing ξ : [0, T ] × D → R, as well as n, m, ℓ ≥ 1, q ≤ 1 and ν > 0. Let α = (m, ℓ, q).The model M n α for the wave equation with propagation speed Moreover, for functions f : [0, T ] × D → R the operator I[f ] is defined to be the solution to a wave equation Boundary conditions for the above equation are taken to be the same as boundary conditions for the underlying wave equation (in this case periodic).
For this experiment we choose n = 4, α = (m, ℓ, q) = (2, 2, 0), and ν = 1, and impose the degree to satisfy degξ = −1.5, degu 0 = degv 0 = −0.5 and β = 1.5 in (2.10).We choose only functions of degree ≤ 1.5 which gives 60 functions in M 4 α .Figure 4 shows the importance of using the full model in the case of the wave equation, i.e. the contribution of the initial condition (even fixed and deterministic) can't be ignored.9The model constructed with J = ∅ in the case of the wave equation gives absolutely no predictability contrary to the parabolic case (see Remark 3.3) because the wave operator is not dissipative contrary to the heat operator.
Average relative ℓ 2 errors corresponding to different heights of the model with and without initial speed are presented in the Table 4 for 1000 repeated experiments.Table 4 further shows the importance of including the contributions of both the initial condition and the initial speed in the model when predicting the wave equation.

Burgers' equation
In this subsection, we aim to predict solutions to the following Burgers' equation with no forcing 9This parallels the necessity of including the initial condition in an analogue of the model in [GKO21] where the authors solve a non-linear singular stochastic wave equation in 3 dimensions.4: Average relative ℓ 2 errors, slopes, and R 2 for prediction at space-time point (t, x) = (1, 0.5) for different heights of the model.First column involves models using J = {c, s} and the second column involves models using J = {c} only.
from the knowledge of the initial condition u 0 .That is, given only the initial condition u 0 : [−8, 8] → R, our goal is to reconstruction the entire function u : [0, 10] × [−8.8] → R without explicitly solving the PDE (3.5), i.e., we wish to learn the map u 0 → u.This experiment is partly inspired by [MAAA20].The above equation satisfies the time homogeneous flow property that motivates Algorithm 2, which we use in the experiments below.
Above, (a k ) k=−10,...,10 are sampled as independent and identically distributed (i.i.d.) standard normal random variables and λ is a scaling parameter.We sample 120 such initial conditions with scaling λ = 8, 4, 2 (40 initial conditions for each scaling), which corresponds to u 0 having respectively one, two and four cycles.We then randomly subdivide these initial conditions into training and test sets of sizes 100 and 20 respectively.
We choose height n = 3, additive width m = 2, and differentiation order q = 1 for the model, i.e. α = (2, 0, 1).We assign for the degree degu 0 = −1.5 and β = 2 in (2.10) and only functions f τ with degree degτ ≤ 2.5 are considered.This gives 20 functions of degτ ≤ 2.5 instead of the original 91 functions in M 3 α .Using this degree cutoff speeds up the fitting of the linear regression by around 20 times in addition to a faster computation of the model.
As the number of training and testing cases (100 and 20) is relatively small, we repeated the above experiment 10 times.In the first row of Table 5 we record the performance of Algorithm 2, namely the averages, ranges, and standard deviations of the relative ℓ 2 error over the 10 experiments with 20 test cases each.Here the relative ℓ 2 error for one test case with true solution R and predicted solution P is defined as Figure 5 shows the heat-maps for the true and predicted solutions drawn from two test cases with greater than average error (heat-maps for test cases with error close to the average error appeared indistinguishable to the naked eye; even for Figure 5(a), where the error of 5.9% is above the average, the two solutions appear similar).
We furthermore tested Algorithm 2 on noisy data.We added a 1% error (resp.3%) to the observed data in the following way.Instead of observing the solution u of (3.5), we observe   We find that the results are essentially the same as those with the correct viscosity ν = 0.2 (and in fact have tiny improvements over the latter).

Comparison with PDE-FIND algorithm
We use a version of PDE-FIND algorithm from [RBPK17] to learn the non-linearity first instead of learning the solution.We use linear regression to find the best coefficients a, b such that   We then use finite difference method with the estimated (a, b) (estimating these coefficients separately for each experiment) starting from initial conditions from U pr in order to construct the predicted solution on the full domain [0, 10] × D. Note though that in this finite difference we can discretise time on a finer grid.In fact, we take 2001 points on [0, 10] which is the same number of time points that is used to construct solution to (3.5).We cannot, however, discretise the spatial domain [−8, 8] to a finer grid than 512 points because these are the only points observed for the initial conditions from U pr .
The estimated coefficients a, b are recorded in Table 7.The resulting errors for the predicted solutions after repeating the experiment 10 times with 20 test cases as before are recorded in Table 5.We notice that the performance of Algorithm 2 (with and without estimated viscosity) is similar to that of PDE-FIND, although Algorithm 2 yields a slightly larger average error, but a slightly lower maximum average error and total error.We furthermore performed the same experiments but with 1% noise and 3% noise on observed samples as for Algorithm 2. Here, instead of direct linear regression, we follow the proposal in [RBPK17] and perform polynomial interpolation: for each space-time point z, we fit a polynomial of degree 4 that best matches the observed function in a neighbourhood of radius 20 points around z, and then estimate a, b in (3.7) by taking derivatives of these polynomials and applying linear regression.This is made in order to avoid taking explicit derivatives of ũ via the finite difference method since the noisy data is not differentiable.The resulting errors and estimates for a, b are recorded in Tables 5 and 7 respectively.On 1% noise, the two methods are again comparable (with PDE-FIND demonstrating a slightly lower average error).
However, with 3% noise, we found that there is a noticeable difference.First, the estimated viscosity a in Table 7 is between 0.02 and 0.04, which is significantly lower than the true value 0.2.This caused the predicted solution to blow up on some test cases (due to numerical instability in our finite difference method): in each of the 10 experiments, between 0 and 7 of the 20 test cases blew up (the two extreme values were attained only for one experiment each, and the most common number of blow-ups was 1). Figure 7(b) shows heat-maps for a non-blow-up test case with 3% noise using PDE-FIND, wherein one can see the effect of the low estimated viscosity.In comparison, no test cases for Algorithm 2 blew up.
In Table 5 we only report errors from test cases where PDE-FIND did not blow up -the errors are expected to be even larger if all test cases were included by solving the associated equation with a more sophisticated numerical scheme.Even after removing the test cases for which PDE-FIND blew up, we see a mild advantage of Algorithm 2 over PDE-FIND with polynomial interpolation.
Finally, the reader may wonder if it is fair to compare Algorithm 2 to PDE-FIND given that we input into Algorithm 2 the true viscosity 0.2, while PDE-FIND is required to estimate it.We point out, however, that Algorithm 2 has no knowledge of the non-linearity u∂ x u in (3.5) (though the parameters m, q are chosen with the motivation that the non-linearity is at most quadratic with ∂ x u possibly appearing), while in our implementation of PDE-FIND we do input u∂ x u as the only possible non-linearity.Furthermore, on noiseless data, the viscosity estimated from the data gives results for Algorithm 2 that are comparable to PDE-FIND (see Table 5).We also point out that Algorithm 2 was approximatively 20 times faster to run with a fast Fourier transform method of computing models than PDE-FIND with polynomial interpolation.

Summary and discussion
To summarise, we proposed a new model feature vector (MFV) of a space-time signal that extends to multi-dimensional space the notion of a path signature.We further proposed two regression algorithms, which reveal that MFVs may contain important information about the underlying signal.
We applied Algorithm 1 to both parabolic and hyperbolic equations with forcing and Algorithm 2 to Burgers' equation with varying initial conditions.We did an elementary comparison of our algorithms with other methods.We compared the performance of Algorithm 1 for the parabolic equation with multiplicative forcing against several off-the-shelf methods and found a large advantage in favour of Algorithm 1.We further compared Algorithm 2 for the Burgers' equation against a version of PDE-FIND [RBPK17] (see Section 3.3.1).The two methods were comparable (with PDE-FIND showing a minor advantage) on noiseless and small noise data, while Algorithm 2 showed an advantage over PDE-FIND with larger noise data.We believe the success of Algorithm 2 in this experiment is due to the smoothing properties of the heat operator, which provides considerable robustness.
In terms of the hyperparameters, the experiments with Algorithm 1 in Sections 3.1 and 3.2 show that increasing the height of the model gives better predictability.We also found in Section 3.3 that one can effectively estimate the viscosity parameter ν > 0 at no expense in the error.These experiments demonstrate a potential for the use of MFVs as features for learning PDEs.A more systematic comparison of our algorithms with other methods as well as analysis of the effect of hyperparameters is left for future work.
We conclude by discussing several other directions in which this work could be extended.
• Beyond PDEs.An important next step is to investigate the use of models as features in learning algorithms in contexts beyond PDEs.We believe that natural directions to investigate include analysis of meteorological data [SCW + 15, CLK + 18], image and remote sensing recognition [Kri09,ZNZW15], and applications to fluid dynamics [BNK20,LKT16].Such extensions would parallel the current use of signatures in data science well beyond the scope of ODEs.
• Universality.It would be of interest to understand universality properties of models, i.e. in what sense and under which conditions can one approximate general functions of the input ({u (i) } i∈J , ξ) with linear functions of the model.Beyond their importance in machine learning, such universality properties are of deep mathematical interest; a celebrated result is that linear functions of the signature (see Definition 2.3) approximate, uniformly on compact sets, continuous function of rough paths modulo tree-like equivalence [HL10,BGLY16].
• Further learning algorithms.It will be important to explore the utility of 'model features' when combined with learning algorithms beyond linear regression (the only tool used in this article), such as with neural networks and random forests.Similarly, it would be important to kernelise the model feature vector efficiently.This would allow for use of popular kernel learning methods, such as support vector machines, and of the maximum mean discrepancy (MMD) of [GBR + 12] to compare samples drawn from different distributions.An MMD from the kernelised signature map was used in [CO22] to define a metric on the laws of stochastic process indexed by time, and fast signature kernelisation algorithms were introduced in [KO19]; extending these results to models would be of significant interest.
• Higher dimensions.In order to apply the ideas in this paper to data in high dimensional spaces, it would be important to improve the computation of models.It took10 between 0.2 to 0.5 seconds to compute one model in Sections 3.1 and 3.2, and approximately 90 seconds to perform one run of Algorithm 2 in Section 3.3.The computation time in higher spatial dimensions would be significantly longer.
In this direction, there are a number of works aiming to solve high dimensional PDEs with learning algorithms, such as [HJE18,SS18].Since, with the choice of operator I in our experiments, elements of the models are solutions to special PDEs, it would be interesting to see if these methods could make it feasible to compute the model features in high dimensional spaces.
• Operator I hyperparameter.The operator I in the definition of a model is a hyperparameter which needs to be chosen from a very large space (the infinite-dimensional space of linear operators).In our experiments, we mostly used knowledge of the linear part of the PDE (heat or wave operators) to choose I.However, if the PDE is completely unknown, or if the output u does not come from a PDE at all, then one would need a systematic way to choose this hyperparameter.The same applies to the other hyperparameters, such as n, m, ℓ, q, but these take values in a smaller space for which standard hyperparameter tuning (e.g.cross-validation, or sparse linear regression similar to PDE-FIND [RBPK17]) is feasible.Note that the problem of choosing I does not arise in the context of signatures simply because one hardcodes I as convolution with the Heaviside step function J(t) = 1 t>0 .We could of course likewise hardcode I, e.g., as the inverse of the heat operator (2.3), but if one believes the output u should behave like the solution to a wave equation, this will likely yield poor performance.How to choose I in a general context is therefore an important theoretical and practical question.

D
⊂ R d .This motivates the following definition.Definition 3.1 Fix an initial condition u 0 : D → R and a forcing ξ : [0, T ]×D → R as well as n, m, ℓ ≥ 1 and ν > 0. Let q ≤ 1 and α = (m, ℓ, q).The model M n α for the parabolic equation with viscosity ν is constructed by taking J = {c} with u (c) = I c [u 0 ] where operators I and I c are given by (2.3).

Figure 2 :
Figure 2: Results of linear regression of solution to (3.2).The x-axis contains values of u(t, x, y), where (t, x, y) is the indicated space-time point, from the test set and the y-axis contains predictions of the linear regression.

Figure 3 :
Figure 3: Results of linear regression of solution to (3.3) against the functions in model M 5 α with α = (3, 1, 0), without initial conditions and of degree ≤ 7.5.The x-axis contains values of u(t, x) for realisations of the forcing ξ from the test set and the y-axis contains predictions of the linear regression.Subplots (a), (b) show predictions at space-time points (t, x) = (0.5, 0.5) and (t, x) = (1, 0.5) respectively.

Figure 4 :
Figure 4: Results of linear regression of solution to (3.4) using functions from models M 4α with α = (2, 2, 0) of degree ≤ 1.5 without and with initial conditions.The x-axis contains values of u(t, x) for realisations of the forcing ξ from the test set and the y-axis contains predictions of the linear regression.Subplots (a), (b) show predictions at space-time point (t, x) = (1, 0.5) for models with J = ∅ and J = {c, s} respectively.

Figure 5 :
Figure 5: Heat-maps for the solutions of (3.5) (left) and predictions for two test cases (right) using Algorithm 2 with functions from the model M 3 α with α = (2, 0, 1) of degree ≤ 2.5.The values of λ are 4 for subfigure (a) and 8 for (b).
.7) for u ∈ U obs , k = 0, . . .N − 1, and x ∈ D, where ∂ t u(t k , x) := u(t k+1 ,x)−u(t k ,x)δ is a discrete time derivative, ∂ x is discrete space derivative and x ∈ D are the observed points.

Figure 6 :
Figure 6: Heat-maps for the solutions of (3.5) (left) and predictions for two test cases (right) using Algorithm 2 with 1% error in observed samples.The values of λ are 2 for subfigure (a) and 8 for (b).

Figure 7 :
Figure 7: Heat-maps for the solutions of (3.5) (left) and predictions for two test cases (right) with 3% noise on observed data.Subfigure (a) is for Algorithm 2 and subfigure (b) is for PDE-FIND.The values of λ are 8 for subfigure (a) and 4 for (b).

Table 1 :
Average relative ℓ 2 errors, slopes, and R 2 for linear regression against models of different heights.

Table 3 :
Average relative ℓ 2 errors, slopes, and R 2 for linear regression against models of different heights for equation (3.2).Prediction is performed at the same space-time points as in Figure2.

Table 7 :
Estimated parameters a and −b in (3.7) via PDE-FIND.The true values are a = 0.2 and −b = 1.