Linear/Ridge expansions: Enhancing linear approximations by ridge functions

We consider approximations formed by the sum of a linear combination of given functions enhanced by ridge functions -- a Linear/Ridge expansion. For an explicitly or implicitly given function, we reformulate finding a best Linear/Ridge expansion in terms of an optimization problem. We introduce a particle grid algorithm for its solution. Several numerical results underline the flexibility, robustness and efficiency of the algorithm. One particular source of motivation is model reduction of parameterized transport or wave equations. We show that the particle grid algorithm is able to produce a Linear/Ridge expansion as an efficient nonlinear model reduction.


Introduction
Many (numerical) approximations rely on linear approximation schemes. For some f ∈ H, H a normed space, one seeks a suitable finite-dimensional subspace H_N ⊂ H and an approximation f_N ∈ H_N of f. Examples include finite element, finite volume, spectral or discontinuous Galerkin methods. In many cases, such linear schemes work very well, in particular if the error of an approximation scheme can be shown to be bounded in terms of the error of the best approximation in H_N (e.g. by the famous Céa lemma, [7,28]). If the error of the best approximation then decays fast (e.g. by using a Clément-type operator, [8]), one obtains an efficient (numerical) approximation.
Of course, one would want to determine a "best-possible" approximation. For linear approximation schemes, the worst best-possible error is known as the Kolmogorov N-width d_N(F), which is defined for a class F ⊂ H of elements as ([18,24])

d_N(F) ∶= inf_{H_N ⊂ H, dim(H_N) = N} sup_{f ∈ F} inf_{f_N ∈ H_N} ‖f − f_N‖_H.

The class F could e.g. be a smoothness class (a Sobolev or Besov space) or a set of solutions for certain problems with different data (e.g. a parametric partial differential equation (PPDE) for different parameter values, see below).
The question whether d_N(F) decays "fast" as N → ∞ (e.g. d_N(F) = O(e^{−N})) or "slow" (e.g. d_N(F) ≃ N^{−s}, 0 < s < 1) typically depends on some measure of smoothness of the elements of F, for example the Besov regularity. If F is given as a set of solutions of a problem (such as a PPDE), then the decay of d_N(F) is a property of the problem at hand (not of its discretization). This means that linear approximation schemes are not appropriate for problems with a poor decay of the N-width. Hence, other schemes are needed, which are necessarily nonlinear.
There is a huge variety of nonlinear approximation schemes; an extensive list goes far beyond the scope of this introduction. For reasons described below, we are interested in ridge functions for building nonlinear approximation schemes. More specifically, we aim at constructing an approximation scheme consisting of a linear part enhanced by a nonlinear update in terms of ridge functions. With such an expansion, consisting of a linear and a nonlinear part, we will be able to treat problems with fast and with slow decay of d_N(F) by the same algorithm. We need to collect some notation in order to motivate our choice.
1.1. Linear/Ridge approximation. We consider a given function u ∶ Ω → ℝ, where Ω ⊂ ℝ^d is an open bounded domain; u ∈ L_2(Ω) is the minimal requirement, and sometimes we need to assume more, e.g. u ∈ C^{0,1}(Ω̄), at least piecewise. This function can be given either explicitly or implicitly as the (unknown) solution to a given problem. In order to formulate the approximation problem under consideration, let Φ_N ∶= {ϕ_1, ..., ϕ_N} be a given set of N ∈ ℕ functions (which do not need to be linearly independent). Next, we recall the notion of a ridge function and refer e.g. to [6,17,25] for overviews and details.
Definition 1.1. Let v ∶ ℝ → ℝ, a ∈ ℝ^d ∖ {0} and b ∈ ℝ. A function of the form x ↦ v(a^⊺ x + b), x ∈ Ω, is called ridge function with profile v, direction a and offset b.
Remark 1.2. In practical applications, we shall assume v ∈ L_2(ℝ) ∩ C^{0,1}_pw(ℝ) for the profiles to be discussed below. We will denote the argument of a profile by ξ ∈ ℝ, i.e., v(ξ).
In addition to Φ_N, we assume that we are given a finite number M ∈ ℕ of profiles V_M ∶= {v_1, ..., v_M} and consider the approximation problem

(1.1)  u(x) ≈ u_δ(x) ∶= ∑_{i=1}^{N} α_i ϕ_i(x) + ∑_{j=1}^{M} c_j v_j(a_j^⊺ x + b_j),  x ∈ Ω.

A function u_δ of type (1.1) will be called a Linear/Ridge expansion. Clearly, such a Linear/Ridge approximation depends on the choices of Φ_N and V_M as well as on the coefficients α_i, c_j ∈ ℝ, the directions a_j ∈ ℝ^d and the offsets b_j ∈ ℝ. However, in this paper we view Φ_N and V_M as given (e.g. by some preprocessing training), so that we only consider the dependency of u_δ on the directions, offsets and coefficients. In order to shorten notation, we collect this dependency in one index δ. The main topic of this paper is thus to investigate appropriate choices of coefficients, directions and offsets in (1.1) in order to determine a "good" approximation of a given function u.
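To fix ideas, the following minimal sketch (in Python with NumPy; all names are ours, not from the paper) evaluates a Linear/Ridge expansion of type (1.1) at a set of points, given the families Φ_N and V_M together with coefficients, directions and offsets:

```python
import numpy as np

def ridge_expansion(x, alpha, phis, c, vs, A, b):
    """Evaluate the Linear/Ridge expansion (1.1) at points x.

    x     : (n, d) array of evaluation points in Omega
    alpha : (N,) coefficients of the linear part
    phis  : list of N callables phi_i, each mapping (n, d) -> (n,)
    c     : (M,) coefficients of the ridge part
    vs    : list of M profile callables v_j acting elementwise
    A     : (M, d) array whose rows are the directions a_j
    b     : (M,) offsets b_j
    """
    linear = sum(a_i * phi(x) for a_i, phi in zip(alpha, phis))
    ridge = sum(c_j * v(x @ a_j + b_j)
                for c_j, v, a_j, b_j in zip(c, vs, A, b))
    return linear + ridge
```

Note that the two sums degenerate gracefully for N = 0 or M = 0, which corresponds to a purely nonlinear or purely linear expansion, respectively.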
1.2. Motivation. The main source of motivation for this paper comes from model order reduction of parameterized partial differential equations (PPDEs) by means of the reduced basis method (RBM, [14,15,26]). In order to briefly review it, let L(µ) be a parameterized partial differential operator and f(µ) some given right-hand side, where µ ∈ P ⊂ ℝ^Q is a parameter. One seeks the exact solution u(µ) of L(µ)u = f(µ) in a suitable weak sense. Typically, a suitable (i.e., sufficiently fine; we say "detailed") discretization is available. However, the computation of a numerical detailed approximation u^𝒩(µ) (with 𝒩 ≫ 1 degrees of freedom) is too costly, at least if u(µ) has to be approximated for many different values of µ ("multi-query" context) and/or extremely fast (e.g. in realtime and/or in an embedded system).
The RBM aims at constructing a linear subspace H_N ⊂ H (if the PPDE is posed on H, e.g. the Sobolev space H¹_0(Ω)) of dimension N ∈ ℕ with N ≪ 𝒩, which is typically constructed in an offline training phase. Then, in the online realtime or multi-query environment, a reduced approximation u_N(µ) is computed very rapidly as the (Petrov-)Galerkin projection of the detailed solution u^𝒩(µ) onto H_N. In this context, the above mentioned class reads

F = { u(µ) ∶ µ ∈ P },

and this explains the particular relevance of the decay of d_N(F) for model order reduction, e.g. [1,2]. In particular, it is known that certain elliptic problems admit an exponential decay of d_N(F), [5], whereas parameterized transport and wave equations show poor decay, [13,22], which motivates the construction of nonlinear model reduction (approximation) schemes in particular for such types of problems, [3,4,11,20,21,27], just to mention a few.
We are suggesting a nonlinear update in terms of ridge functions for several reasons. First of all, ridge functions have a simple structure and turn out to be particularly suitable for parameterized transport and wave-type problems, as we shall see below. Second, there is a rich literature on approximation theory with ridge functions, see, e.g., [6,9,17,25]. Finally, ridge functions are closely related to neural networks, which have recently been investigated for nonlinear model reduction, see e.g. [10,12,19].

1.3. Outline. The remainder of this paper is organized as follows. In Section 2, we collect all required preliminaries and introduce the arising optimization problem. Section 3 is devoted to the association of the desired quantities (directions and offsets) with particles, which is the basis for the particle grid algorithm that we introduce in Section 4. Some results of our various numerical experiments are presented in Section 5. We finish with an outlook in Section 6.

Preliminaries
In order to quantify what is to be understood by a 'good' approximation, define the set

(2.1)  U_{N,M} ∶= { u_δ(x) = ∑_{i=1}^{N} α_i ϕ_i(x) + ∑_{j=1}^{M} c_j v_j(a_j^⊺ x + b_j) ∶ α_i, c_j, b_j ∈ ℝ, a_j ∈ ℝ^d },

which is a nonlinear subset of L_2(Ω). We consider the error in the L_2-norm and are interested in the/a best approximation in this norm, abbreviated by ‖⋅‖_0 ≡ ‖⋅‖_{L_2(Ω)}. In order to make (at least an approximation to) a best approximation accessible, we collect all variables by setting α ∶= (α_1, ..., α_N)^⊺, a ∶= (a_1, ..., a_M), b ∶= (b_1, ..., b_M)^⊺, c ∶= (c_1, ..., c_M)^⊺ and

(2.2)  δ ∶= (α, a, b, c),

write u_δ(x) as in (2.1) and consider the cost function

(2.3)  J_u(δ) ∶= ‖u − u_δ‖_0².

Before we continue, let us detail an example which also indicates our particular interest in the approximations considered here.
Example 2.1 (Parametric linear transport problem). Consider the homogeneous linear transport equation on the real line with velocity µ > 0, i.e., ∂_t u(t, x) + µ ∂_x u(t, x) = 0 on (0, T) × ℝ, T > 0, with initial condition u(0, x) = u_0(x), x ∈ ℝ. In a model reduction framework, one would interpret µ as a parameter and would want to approximate the solution u(t, x; µ) = u_0(x − µt) either for many velocities µ and/or in realtime. This is typically done in terms of a linear combination of "snapshots" ϕ_i ∶= u(⋅, ⋅ ; µ^(i)), i = 1, ..., N, where the snapshot parameters µ^(i) are chosen in some offline training phase. However, it is known from [22] that the Kolmogorov N-width decays at best like N^{−1/2}, which makes such a linear model reduction inefficient.
However, choosing the profile v_1 ∶= u_0 and setting a_1 ∶= (−µ, 1)^⊺, b_1 ∶= 0 yields u_δ(t, x) = u_0(x − µt) = u(t, x; µ), the exact solution. Hence, choosing N = 0 and M = 1 already yields an exact Linear/Ridge representation. We stress the fact that we cannot assume in general that the functions v_1, ..., v_M are linearly independent, not even pairwise. There are even cases where v_i = v_j for some distinct indices i ≠ j, as the following example shows.
Example 2.2 (Parametric wave equation). Consider the wave equation ∂_tt u − µ² ∂_xx u = 0 on (0, T) × ℝ with initial conditions u(0, x) = u_0(x) and ∂_t u(0, x) = 0. By d'Alembert's formula, u(t, x; µ) = ½ (u_0(x − µt) + u_0(x + µt)). Hence, choosing v_1 = v_2 ∶= ½ u_0 and b_1 = b_2 ∶= 0 as well as a_1 = (−µ, 1)^⊺ and a_2 = (µ, 1)^⊺ yields the exact solution. This is an example of two identical profiles causing an exact representation using different directions. Besides, also for nonzero initial velocity ∂_t u(0, ⋅) ≠ 0, the solution of the wave equation is a sum of two, but then different, ridge functions.
This problem is also particularly interesting since it is known that projection-based (i.e., linear) model reduction techniques do not work, in the sense that the Kolmogorov N-width decays at best like N^{−1/4}, [13]. However, d'Alembert's solution formula is a sum of ridge functions.
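As a quick numerical sanity check of Example 2.1 (a sketch with our own naming; the Gaussian pulse u_0 below is chosen purely for illustration), one can verify that the transport snapshot is a single ridge function with v = u_0, a = (−µ, 1)^⊺ and b = 0:

```python
import numpy as np

u0 = lambda xi: np.exp(-50.0 * (xi - 0.5) ** 2)  # illustrative initial condition
mu = 0.7
a = np.array([-mu, 1.0])                         # direction (a_t, a_x) = (-mu, 1)

t, x = 0.3, 0.6
# exact transport solution u(t, x; mu) = u0(x - mu*t) equals v(a^T (t, x) + b)
assert np.isclose(u0(x - mu * t), u0(a @ np.array([t, x])))
```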
These examples should motivate the consideration of the optimization problem

(2.4)  J_u(δ) → min!

Lemma 2.3. Let u ∈ L_2(Ω). Then, there exists a minimizer δ* of (2.4) for J_u defined in (2.3).
Proof. Since L_2(Ω) is a Hilbert space and U_{N,M} ⊂ L_2(Ω) is closed, the claim follows by standard arguments.
Note that U_{N,M} is not necessarily convex and therefore the minimizer in Lemma 2.3 is not necessarily unique. We are now going to determine an approximation u_δ ∈ U_{N,M} to the given function u step by step.
Remark 2.4. In practice, we might have access to the function u only indirectly, e.g. through a quantity like the residuum of a PDE. In this case, one would not use the L_2-error as cost function J_u but rather the residuum (if computable) or an appropriate error estimator.

2.1. Given directions and offsets. As a first step, let us assume that the directions a and offsets b associated with the profiles V_M are given. Then, the optimal coefficients α and c for these a, b are easily determined, as we shall see next.
Lemma 2.5. Let a, b be given. Then, optimal coefficients α* = α*(u, a, b) = (α*_1, ..., α*_N)^⊺ and c* = c*(u, a, b) = (c*_1, ..., c*_M)^⊺ are given as the solution of the linear system of equations

(2.5)  G(a, b) (α, c)^⊺ = f(u, a, b),

where G(a, b) is the Gramian of the functions ϕ_1, ..., ϕ_N, v_1(a_1^⊺ ⋅ + b_1), ..., v_M(a_M^⊺ ⋅ + b_M) and the right-hand side is given by f(u, a, b) with entries (u, ϕ_i)_0, i = 1, ..., N, and (u, v_j(a_j^⊺ ⋅ + b_j))_0, j = 1, ..., M.

Proof. Let a, b be fixed and denote δ̃ ∶= (α, c) as well as δ = (α, a, b, c) as in (2.2). We consider the reduced cost function J̃_u(δ̃) ∶= J_u(δ) as a function of δ̃ only. Then, J̃_u is quadratic in δ̃ and its Hessian, the Gramian G(a, b), is positive semi-definite, which concludes the proof.
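Numerically, the system of Lemma 2.5 is the normal equation of a linear least-squares problem for the sampled functions. The following sketch (our notation; we assume point samples at quadrature nodes of equal weight) solves it via a minimum-norm least-squares solve, which also covers linearly dependent columns such as identical profiles:

```python
import numpy as np

def optimal_coefficients(u_vals, x, phis, vs, A, b):
    """Compute (alpha*, c*) of Lemma 2.5 for fixed directions A and offsets b.

    u_vals : (n,) samples of u at the points x of shape (n, d)
    Returns the stacked coefficient vector (alpha_1..alpha_N, c_1..c_M).
    """
    # Design matrix G: one column per basis function phi_i and per ridge
    # function v_j(a_j^T x + b_j); G^T G is the Gramian of the lemma.
    cols = [phi(x) for phi in phis]
    cols += [v(x @ a_j + b_j) for v, a_j, b_j in zip(vs, A, b)]
    G = np.column_stack(cols)
    # Minimum-norm least-squares solution; robust w.r.t. rank deficiency.
    y, *_ = np.linalg.lstsq(G, u_vals, rcond=None)
    return y
```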
This result shows that the coefficients are determined once directions and offsets are given. Setting δ(u, a, b) ∶= (α*(u, a, b), a, b, c*(u, a, b)), we consider the reduced cost function depending solely on directions and offsets,

(2.6)  Ĵ_u(a, b) ∶= J_u(δ(u, a, b)),

along with the reduced optimization problem

(2.7)  Ĵ_u(a, b) → min!

Hence, we are left with finding the directions a and the offsets b.

Directions, offsets and particles
Since the determination of directions and offsets amounts to solving a complex optimization problem, we aim at using a well-known heuristic method, the particle swarm algorithm, see e.g. [16,23]. In order to reduce computational complexity, we arrange our "particles" (which are associated with the collection of all directions a_j ∈ ℝ^d and offsets b_j ∈ ℝ, j = 1, ..., M) in a dynamic grid. Hence, we call the arising scheme a particle grid method. For each profile, we collect the direction and the offset in one vector d_j ∶= (a_j, b_j) ∈ ℝ^{d+1}. These vectors are then associated with some component p_j ∈ (−1, 1)^D =∶ S_D. The vector d = (d_j)_{j=1,...,M} of all directions and offsets is then associated with one particle p = (p_1, ..., p_M) ∈ S_D^M ⊂ ℝ^{DM}. The dimension D is at most d + 1, but can also be smaller if the problem at hand fixes some components of d_j, see Example 3.1 below.
Example 3.1. Consider the linear transport and wave equations in one space dimension from Examples 2.1 and 2.2, respectively. The underlying domain is Ω = ℝ_+ × ℝ and the variables read (t, x) ∈ Ω, indicating time and space. Hence, in both cases, any direction takes the form a = (a_t, a_x)^⊺ and any ridge function reads v(a_t t + a_x x + b) for some offset b ∈ ℝ.
The ridge functions should be consistent with the initial condition u(0, x) = u_0(x), which implies that v ∶= u_0 and a_x ∶= 1 as well as b ∶= 0. This shows that we do not need a full vector d = (a_t, a_x, b) of dimension d + 1 = 3; a single parameter suffices to represent each combination of direction and offset by a particle.
In fact, in both cases, profiles take the form v(x ± µt), which means that we choose a = (±µ, 1) ⊺ and b = 0 with µ ∈ R.
The association of particles with all vectors d_j, j = 1, ..., M (i.e., for all M profiles), is done by constructing an appropriate mapping π ∶ S_D → ℝ^{d+1} with d_j = π(p_j). Of course, one can construct several such mappings.
Example 3.2. For D = d + 1, we frequently used the smooth transformation π defined componentwise by π(p)_s ∶= tan(π/2 p_s), s = 1, ..., d + 1.

Example 3.3 (Transport and wave equation, continued). In order to associate the real-valued parameter µ with a particle p ∈ (−1, 1), we can define π ∶ S_1 → ℝ³ by p ↦ (tan(π/2 p), 1, 0)^⊺, i.e., µ = tan(π/2 p).

This setting now allows us to reformulate the reduced optimization problem (2.7) in terms of the particles, i.e.,

(3.1)  Ĵ_u(p) ∶= Ĵ_u(π(p_1), ..., π(p_M)) → min!  for p = (p_1, ..., p_M) ∈ S_D^M,

where the arguments are to be understood in the sense that each π(p_j) encodes the direction a_j and the offset b_j. We call such a p a particle.
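For the transport problem, the mapping of Example 3.3 is a one-liner; the following sketch (names are ours) turns a particle p ∈ (−1, 1) into the vector (a_t, a_x, b):

```python
import numpy as np

def pi_map(p):
    """Example 3.3: pi(p) = (tan(pi/2 * p), 1, 0), i.e. mu = tan(pi/2 * p)."""
    return np.array([np.tan(np.pi / 2.0 * p), 1.0, 0.0])

# The particle p = 0.5 corresponds to mu = tan(pi/4) = 1.
d = pi_map(0.5)          # d = (a_t, a_x, b) ~ (1, 1, 0)
a, b = d[:2], d[2]
```

Composing such a mapping with the reduced cost (2.6) yields exactly the particle cost Ĵ_u(p) of (3.1).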
Example 3.4. In order to illustrate the challenges in solving the reduced optimization problem (3.1), let us consider three examples where the function u to be approximated is itself given as a sum of ridge functions (i.e., u can even be represented exactly). We consider the two-dimensional case Ω = (0, 1)² and M = 2 profiles; in the first case, for instance, both profiles equal cos(ξ). Given v_1 and v_2, setting b_1 = b_2 = 0 and (a_1)_1 = (a_2)_1 = 1, we scatter S_1² by selecting particles p = (p_1, p_2) ∈ (−1, 1)² for the remaining two unknowns (a_1)_2 and (a_2)_2. Each such particle defines directions for which we define u_δ and compute the value of the cost function, i.e., the error ‖u − u_δ‖_0. The results are depicted in Figure 1.
The functions u are shown in the top row. The reduced cost function Ĵ_u as a function of p ∈ (−1, 1)², sampled on a grid of 129² points, is visualized in the second row. Solving for the two directions and offsets thus amounts to finding a global minimum of the cost functions shown in the second row. As we see, we may face multiple local minima, even multiple global minima (in particular in cases 1 and 2, where v_1 = v_2, so that we see symmetry), steep gradients and several highly localized phenomena. Hence, as Example 3.4 shows, we may face a truly complex optimization problem. In particular, we cannot hope to use gradient-based optimization techniques, at least not in a straightforward manner.

A particle grid algorithm
As motivated above, we are now going to describe an algorithm which has shown good performance in determining at least a good approximation to a global minimum of Ĵ_u in (3.1). Recall that we are facing an optimization problem in S_D^M ≅ (S_D)^M ≅ S_{DM} = (−1, 1)^{DM}. Hence, we define the optimization dimension as P ∶= DM.
The algorithm produces a sequence of particle grids P^(0), P^(1), P^(2), ..., where each grid (i.e., a swarm in the form of a grid) P^(k) consists of m_par particles in S_P. We choose n_par nodes in each dimension of S_P = (−1, 1)^P and initialize the first particle swarm P^(0) by taking the tensor product, yielding a regular grid with m_par = n_par^P particles. Each particle has uniquely defined next neighbors in each (diagonal) direction. This next neighbor relation does not change in the course of the iteration, as we will see in Lemma 4.1. This means that each swarm is a grid whose internal geometry does not change even if the position of each particle may vary. We may associate each particle grid P^(k) with a tensor of dimension P (e.g., a matrix for P = 2). If P^(k) = {p_i^(k) ∶ i = (i_1, ..., i_P) ∈ {1, ..., n_par}^P} and each particle takes the form p_i^(k) = ((p_i^(k))_1, ..., (p_i^(k))_P)^⊺ ∈ (−1, 1)^P, the algorithm can be described as follows. Choose δ ∈ (0, 1/2). Then, for each particle p_i^(k) ∈ P^(k), we consider the value of the particle and of the at most 3^P − 1 surrounding particles p_j^(k) with j = (j_1, ..., j_P), where |i_s − j_s| ≤ 1 for all 1 ≤ s ≤ P. We collect the indices of these particles in a set I(i) ∶= {j = (j_1, ..., j_P) ∈ {1, ..., n_par}^P ∶ |i_s − j_s| ≤ 1, 1 ≤ s ≤ P} and set

(4.1)  q_i^(k+1) ∶= p_{j*}^(k)  with  j* ∶= arg min_{j ∈ I(i)} Ĵ_u(p_j^(k)).

Subsequently, we get the next particle by the step

(4.2)  p_i^(k+1) ∶= p_i^(k) + δ (q_i^(k+1) − p_i^(k)).

For the particles on the boundary of the grid, we need to slightly adapt this procedure. Technically, we view particles at opposite sides of the boundary as being neighbors, which means that, for example in one dimension, the leftmost particle is considered a neighbor of the rightmost particle. For dimension 1 ≤ s ≤ P, a particle p_j^(k) with j = (j_1, ..., j_{s−1}, 1, j_{s+1}, ..., j_P) is a neighbor of p_i^(k) with i = (j_1, ..., j_{s−1}, n_par, j_{s+1}, ..., j_P). This means that we are sticking the boundaries together, so that we get a grid where, in each dimension, each particle has neighbors to the left and to the right, similar to a torus. This concludes the description of one iteration P^(k) → P^(k+1). The iteration is terminated after a predefined number K ∈ ℕ of iterations with the output

(4.3)  p_app ∶= arg min_{i ∈ {1, ..., n_par}^P} Ĵ_u(p_i^(K)),

which is used as an approximation for some minimizer p*. Here, we often choose δ = 1/3, which turned out to be a reasonable choice as it always ensures a positive distance between neighboring particles. We summarize the method in Algorithm 1.

Algorithm 1 Particle grid algorithm
Input: number of particles m_par = n_par^P, maximal number of iterations K, parameter δ ∈ (0, 1/2)
1: Initialize P^(0) as the regular tensor product grid
2: for k = 0, ..., K − 1 do
3:   for all particles p_i^(k) ∈ P^(k) do
4:     Determine the best neighbor q_i^(k+1) as in (4.1)
5:     Set p_i^(k+1) ∶= p_i^(k) + δ (q_i^(k+1) − p_i^(k)) as in (4.2)
6:   end for
7: end for
8: Get p_app as in formula (4.3)
Output: approximation particle p_app

The following observation is then immediate (see Lemma 4.1 below).

4.1. Complexity. Let K be the number of iterations and m_par the number of particles. In order to compute Ĵ_u, we sample Ω ⊂ ℝ^d by n ∈ ℕ quadrature points per dimension for computing the L_2-norm. Hence, we need to compute point values of u and u_δ at n^d points. For each quadrature point, the evaluation of u_δ requires evaluating ϕ_i, i = 1, ..., N, and v_j(a_j^⊺ ⋅ + b_j), j = 1, ..., M, which is a total of Cost(Ĵ_u) ∶= N ⋅ n^d + M ⋅ (d + 1) ⋅ n^d operations for one evaluation of Ĵ_u, and m_par ⋅ Cost(Ĵ_u) operations for one particle grid P^(k). For each iteration, we need to evaluate Ĵ_u for all grid points and, per grid point, we have at most 3^P − 1 comparisons. In total, Algorithm 1 thus requires on the order of K ⋅ m_par ⋅ (Cost(Ĵ_u) + 3^P − 1) operations, where P = DM and D ≤ d + 1. Hence, the algorithm gets to its limits for large particle dimension D and large numbers M of profiles. The dimension N of the linear part only plays a minor role, especially as, with sufficient memory, the values of the ϕ_i can be computed once and stored.

4.2. Parallelization. The particle grid algorithm has the advantage that it can easily be parallelized. In fact, in a shared memory environment, only the current grid P^(k) and the given particle p_i^(k) are needed for determining the position p_i^(k+1) in the next iteration. Hence, all such computations can be performed in parallel without further communication, which leads to linear speedup w.r.t. the number of processors. For distributed memory, one would pass the positions of the neighbors of p_i^(k) to the processor handling this particle.
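A compact reference implementation of Algorithm 1 might look as follows (a sketch in Python/NumPy under our own naming; in particular, placing the initial nodes strictly inside (−1, 1) and computing the step (4.2) in the embedded coordinates across glued boundaries are our assumptions):

```python
import numpy as np
from itertools import product

def particle_grid(J, P, n_par, K, delta=1.0/3.0):
    """Algorithm 1: minimize a cost J over S_P = (-1, 1)^P.

    J     : callable, maps a point in (-1, 1)^P to the reduced cost
    P     : optimization dimension P = D*M
    n_par : nodes per dimension, i.e. m_par = n_par**P particles
    K     : number of iterations
    delta : step parameter in (0, 1/2); delta = 1/3 keeps neighbors apart
    """
    nodes = np.linspace(-1.0, 1.0, n_par + 2)[1:-1]          # interior nodes
    grid = np.stack(np.meshgrid(*[nodes] * P, indexing="ij"), axis=-1)
    offsets = list(product((-1, 0, 1), repeat=P))            # self + <= 3^P - 1 neighbors
    shape = (n_par,) * P

    for _ in range(K):
        cost = np.apply_along_axis(J, -1, grid)              # J at all particles
        new_grid = grid.copy()
        for i in np.ndindex(shape):
            # torus topology: opposite boundaries are glued together
            nbrs = [tuple((i[s] + o[s]) % n_par for s in range(P))
                    for o in offsets]
            q = grid[min(nbrs, key=lambda j: cost[j])]       # best neighbor, (4.1)
            new_grid[i] = grid[i] + delta * (q - grid[i])    # step (4.2)
        grid = new_grid

    cost = np.apply_along_axis(J, -1, grid)
    return grid[np.unravel_index(np.argmin(cost), shape)]    # p_app, (4.3)
```

For instance, `particle_grid(lambda p: float(np.sum(p**2)), P=2, n_par=10, K=50)` drives most particles toward the minimum at the origin, reproducing the qualitative behavior described for Figure 2.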

Numerical experiments
In this section, we present results of some of our numerical experiments. The main focus is on the question of how well a Linear/Ridge expansion is able to approximate certain functions, and on the quantitative performance of the presented particle grid algorithm.
We start by applying our particle grid algorithm to determine a Linear/Ridge approximation of type (1.1), i.e., we seek an approximation in U_{N,M} consisting of the sum of a linear combination of functions Φ_N and profiles V_M for selected choices of such sets Φ_N and V_M.

5.1. Approximation properties. First, we choose functions u which can be written exactly in the form (1.1), i.e., u ∈ U_{N,M}, and approximate the quantities collected in δ in (2.2) by our particle grid algorithm. Doing so, we can monitor the error ‖u − u_δ‖_0.

5.1.1. Polynomials and wavelets. We start with a problem in two dimensions, (t, x) ∈ Ω = [0, 1] × [−1, 1], which might be interpreted as time and space. It seems to be a straightforward choice to use span(Φ_N) = P_r(Ω), i.e., the space of algebraic polynomials of degree at most r ∈ ℕ. Here we choose r = 2, so that N = 6, and for simplicity we use the monomial basis, i.e., Φ_6 = {1, t, x, t², tx, x²}. Of course, we could use a more stable basis Φ_6 (e.g. orthonormal polynomials), but we are also interested in seeing how the algorithm copes with ill-conditioned sets Φ_N, which are possibly even allowed to be linearly dependent.
Concerning the profiles, we choose wavelets, as it is known that dilates and translates of wavelets yield frames or bases of L_2(ℝ²). That makes them good candidates for profiles. Specifically, we take M = 4 and
• v_1 as the Mexican hat wavelet,
• v_2 as the Haar wavelet,
• v_3 as the Morlet wavelet,
• and v_4 as the Hockeystick, which is also known as the ReLU activation function in neural networks,
see Figure 3b. The function u to be approximated is a Linear/Ridge expansion with α_i = 1, i = 1, ..., 6, b_j = 0, c_j = 1, j = 1, ..., 4, and given directions a_j; the arising function is displayed in Figure 3a. As we see, the shape of u is rather complex. For the discretization required to perform the computations, we use h_t = 1/128, h_x = 1/64, so that we have 129 grid points in both variables. For Algorithm 1 we choose as above δ = 1/3 and m_par = 6⁴ = 1296 particles. In Figure 3d, we monitor the L_2-error over the number of iterations. We obtain monotone convergence, without a rate, of course. For example, in order to reach an error smaller than 10^{−4}, about 250 iterations are needed. After about 1000 iterations, the L_2-error reaches machine accuracy, i.e., 9.91 ⋅ 10^{−15}. The approximation u_δ is shown in Figure 3c.
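For reference, common parameterizations of the four profiles just listed read as follows (a sketch; the paper does not state its exact normalizations, so the constants below, e.g. the Morlet center frequency ω = 5, are assumptions):

```python
import numpy as np

def mexican_hat(xi):                  # negative 2nd derivative of a Gaussian
    return (1.0 - xi**2) * np.exp(-xi**2 / 2.0)

def haar(xi):                         # Haar wavelet on [0, 1)
    return np.where((xi >= 0.0) & (xi < 0.5), 1.0,
                    np.where((xi >= 0.5) & (xi < 1.0), -1.0, 0.0))

def morlet(xi, omega=5.0):            # real-valued Morlet: modulated Gaussian
    return np.cos(omega * xi) * np.exp(-xi**2 / 2.0)

def hockeystick(xi):                  # "Hockeystick" = ReLU activation
    return np.maximum(xi, 0.0)
```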
5.1.2. Discontinuous staircase function. Even though the shape of the function in our first example is complex, it is essentially a smooth function (of course, the Haar wavelet is discontinuous, but it is a single, quite simple function to approximate). Hence, we are now going to consider a function which consists of eight jumps along straight lines that are rotated in order to exclude a tensor product approximation. We use again Ω = (0, 1) × (−1, 1). The profiles are chosen as the jump function displayed in Figure 4a, top right. In order to investigate the role of linearly dependent profiles, we take v_1 = ⋯ = v_8 = 𝟙_{ℝ_+}, i.e., M = 8 identical profiles. We forgo the linear part here, i.e., N = 0. The function u to be approximated takes the form (1.1) with N = 0 and M = 8, choosing b_j = 0, c_j = 1, j = 1, ..., 8, and directions according to the angles jπ/9, so that we obtain a staircase-like function as displayed in Figure 4a, bottom left.

For the discretization, we choose as above 129 grid points in both coordinate directions. For the particle grid algorithm, we use δ = 1/3 with m_par = 3⁸ = 6561 and m_par = 4⁸ = 65536 particles, respectively, where we note that the algorithm did not converge for 2⁸ particles. The error decay is shown in Figure 4b. The reason why we show results for both m_par = 3⁸ and m_par = 4⁸ is that we observe a stagnation of the error for m_par = 3⁸ after 50 iterations, whereas m_par = 4⁸ yields machine accuracy after 49 iterations. This shows that one might be forced to use a large number of particles in order to reach high accuracies. After 50 iterations, we achieved an L_2-error of 0. In Figure 4a, we show the staircase-type step function, which is also the computed approximation after 50 iterations.

5.2. Non-exact approximation functions. So far, we used the algorithm to approximate functions that can be represented exactly in terms of the chosen families Φ_N and V_M. This shows how fast (or slowly) the algorithm is able to find the function in terms of directions and offsets. Now, we are going to consider functions u that cannot be represented exactly by u_δ, i.e., inf_δ ‖u − u_δ‖_0 > 0. To this end, we consider for c ≥ 1 a function u(⋅, ⋅ ; c) given as an infinite series, cf. Figure 5.
For the linear part of the approximation, we choose snapshots ϕ_j(t, x) = u(t, x; j) for j = 1, ..., N, and the profiles are chosen as v_i ∶= cos(2π ⋅), i = 1, ..., M, for M ∈ {0, ..., 4}. In Figure 5, we display the L_2-error for u(t, x) = u(t, x; 100) and different values of N and M. For the discretization, we choose as above 129 grid points in both coordinate directions, and for the particle grid algorithm we used δ = 1/3 and m_par = 5^M particles. We do not obtain monotone convergence as N grows, which might be a stability issue. On the other hand, however, the error decreases monotonically for fixed N and increasing M.

5.3. Parametric Partial Differential Equations (PPDEs). Next, we consider two PPDEs that depend on parameters µ ∈ P and investigate the approximation of the solution u(µ) for various parameter values. The first problem is stationary, the second is instationary; both are defined on Ω = (0, 1)², which is interpreted as a time/space domain for the wave equation. For both cases, we use δ = 1/3 and m_par = 11² = 121 particles.
5.3.1. Case 1: Thermal block. The thermal block is a classical problem for model reduction, [14]. We consider the stationary case, which is a Poisson problem with piecewise constant coefficients serving as parameters; to this end, Ω is decomposed into subdomains on which the diffusion coefficient is constant. The right-hand side and boundary data can be retrieved from [14], and the exact solution can easily be derived in closed form for x = (x_1, x_2) ∈ Ω. Due to the specific form of the solution, we use the linear approximation with N = 4 and choose Φ_4 ∶= {ϕ_1, ϕ_2, ϕ_3, ϕ_4} with functions derived from this closed form. We took M = 2 arbitrarily chosen profiles. The reason for this choice is as follows: Since the solution can be approximated quite accurately by the linear combination of the snapshots ϕ_i (which is due to the fact that such snapshots are known from the reduced basis method to yield a very accurate approximation), the particle grid algorithm should automatically detect that no profiles are needed. As we can see from the results in Table 1, this is in fact the case: the algorithm finds the solution up to machine accuracy after only one iteration of the particle grid algorithm. We stress the fact that we tested many more parameter values and always observed this behavior.

We performed analogous experiments for the parametric linear transport and wave equations from Examples 2.1 and 2.2; the results are also shown in Table 1. As expected, the hyperbolic wave equation is a harder problem than linear transport, which in turn is a much harder problem than the thermal block. The numbers of required iterations of the particle grid algorithm are significantly higher in order to reach a desired tolerance. On the other hand, we see that the algorithm is able to dismiss all the linear functions. The algorithm in fact converges and even reaches machine accuracy, at the expense of more iterations.

In model order reduction, the reduced model is typically determined offline within a training phase. In our setting, the sets Φ_N and V_M would thus be determined in an offline training.
In an online environment (typically a multi-query or realtime situation), new parameters µ ∈ P are given and an approximation has to be determined extremely fast, i.e., in online complexity. If we were to determine the online approximation in terms of u_δ in (1.1), we would need to determine (optimal) directions and offsets that are parameter-dependent, i.e., a(µ) and b(µ), which might be computationally costly in the sense that many iterations of the particle grid algorithm would be needed, destroying online efficiency.
In order to speed up this procedure, we suggest training the mapping µ ↦ (a(µ), b(µ)) offline as follows: For a given (or to be determined) set {µ_1, ..., µ_{n_train}} of training parameters, we determine highly accurate approximations a(µ_i), b(µ_i), i = 1, ..., n_train, offline by the particle grid algorithm. Then, we determine an interpolation µ ↦ (ǎ(µ), b̌(µ)) of those offline data such that the interpolation error ‖(a(µ), b(µ)) − (ǎ(µ), b̌(µ))‖ is small, at least for a number of test samples µ. Of course, this interpolation error influences the overall online approximation error. We are going to investigate this influence.
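A minimal sketch of this offline/online split reads as follows (Python with SciPy; for illustration we substitute the known exact direction a(µ) = −µ^{−1/2} of the transport problem considered below for the particle grid output, so the training data here are synthetic):

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Offline: directions a(mu_i) at the training parameters. In practice these
# come from the particle grid algorithm; here we use the exact direction
# a(mu) = -mu**(-1/2) of u_t + mu**(-1/2) u_x = 0 as stand-in data.
mu_train = np.linspace(0.1, 1.0, 10)
a_train = -mu_train ** (-0.5)

a_check = CubicSpline(mu_train, a_train)        # mu -> a-check(mu)

# Online: evaluating the spline replaces the costly optimization; only the
# coefficients c_j remain to be computed via a small linear solve.
mu_new = 0.37
print(a_check(mu_new), -mu_new ** (-0.5))       # interpolated vs. exact
```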
To this end, we consider the solution u(⋅, ⋅ ; µ) of the parametric linear transport problem u_t + µ^{−1/2} u_x = 0, u(0) = u_0, since it is known that this problem also results in poor approximation rates using standard linear model reduction such as the reduced basis method, [22]. In addition, the function µ ↦ µ^{−1/2} is much harder for the interpolation than a polynomial parameter dependence. We choose 10 equidistant training parameters µ_i ∈ [0.1, 1]. Since the offset is zero here, we only determined the corresponding directions a(µ_i), i = 1, ..., 10, which are real numbers here. The left graph in Figure 8 shows a cubic spline interpolation of the obtained data. We also display the exact curve µ ↦ a(µ), obtained by determining high-resolution approximations of optimal directions for 100 parameters µ by the particle grid algorithm. As we see, the error is almost negligible.
Next, we computed an online approximation for a new parameter µ as follows: (1) Evaluate the interpolation to retrieve ǎ_j(µ), and set b_j = 0, N = 0; (2) compute the coefficients c_j and obtain an approximation ǔ_δ as in (1.1). This is to be compared with the function u_δ using the exact direction a_j(µ). On the right of Figure 8, we display the errors ‖ǔ_δ(⋅, ⋅ ; µ) − u_δ(⋅, ⋅ ; µ)‖_0 for different initial conditions u_0, which also serve as the single profile. We used the same knots for the interpolation and see that the quantitative errors depend on the initial condition, or, more precisely, on the derivative of the profile. However, note the range of the vertical axis, which indicates that all errors are in fact of a comparable size.

5.6. Conclusions. As we performed many more experiments than we can report here (due to page limitations), let us collect some observations and our corresponding conclusions.
• Even though the optimization problem arising from the Linear/Ridge approximation is challenging, the particle grid algorithm often works very well.
• For approximating a given function, the performance seems to be better for smooth functions. However, the algorithm also yields good results for discontinuous or multivariate functions, even though machine accuracy is harder to reach in those cases.
• By treating time "just as another variable", the presented approach can handle stationary and instationary problems in the same manner.
• Choosing the number of particles sufficiently large, we were always able to reach machine accuracy.
• The algorithm is able to detect if a linear approximation is already sufficient to reach a desired accuracy. Hence, the scheme is robust in the considered cases of fast and of slow decay of the Kolmogorov N-width. We anticipate that this is restricted to (P)PDEs with linear characteristics.

Outlook
The above described results of our numerical experiments seem to indicate that this path is worth continuing. Of course, we are aware that research in several directions is required, e.g.:
• The introduced particle grid algorithm is based upon the particle swarm heuristic. There is no rigorous convergence analysis, which is a significant drawback, in particular compared to linear RBMs, where online efficiency and a posteriori error control are certified. One might think of using (stochastic) gradient-descent methods in combination with backpropagation/automatic differentiation as known from the training of neural networks. Another option might be to start with the particle swarm algorithm and then use its result as the starting point for a descent method.
• In its current form, the particle grid algorithm is not yet online efficient, which would be required for using it within a multi-query and/or realtime environment. Also here, techniques from neural networks might help.
• Last, but not least, we did not focus on the training of Φ_N and V_M, but merely viewed them as being given.

Figure 1. Cases 1-3 (from left to right): function to be approximated (top row) and corresponding reduced cost functions (bottom row).

Figure 2 shows six stages of Algorithm 1 for one specific example with P = DM = 2 and n_par = 10. We start with a regular grid (top left) and indicate how the algorithm moves the particles by showing the states for iterations k = 1, k = 5, k = 25, k = 56 and k = 90. As we see, not all particles are concentrated in one point, even though this case has a single global minimum. The reason is that sticking the opposite boundaries together prevents a concentration. However, most points are very close to the global minimum or on lines parallel to the coordinate axes through the point of the global minimum.

Lemma 4.1. The next neighbor relation does not change in the course of the iteration, i.e., the index set I(i) is independent of the iteration k.

Figure 3. Approximation of a function given as the sum of quadratic polynomials (N = 6) and M = 4 wavelets of different types.

Figure 4. Approximation of a given staircase-type step function.

Figure 5. Approximation of an infinite series: L_2-error for different sizes of Φ_N and V_M.

Figure 8. Training directions for the parametric linear transport equation: spline interpolation of the directions (left) and errors of the obtained ridge approximation (right).