UvA-DARE (Digital Academic Repository)

An optimal adaptive tensor product wavelet solver of a space-time FOSLS formulation of parabolic evolution problems

In this work, we construct a well-posed first-order system least squares (FOSLS) simultaneously space-time formulation of parabolic PDEs. Using an adaptive wavelet solver, this problem is solved with the best possible rate in linear complexity. Thanks to the use of a basis that consists of tensor products of wavelets in space and time, this rate is equal to that when solving the corresponding stationary problem. Our findings are illustrated by numerical results.



Introduction
After pioneering earlier work in [6,7], in recent years one witnesses a renewed and growing interest in simultaneously space-time solvers for evolutionary PDEs (e.g., [3,15,19,27,30,40]). Instead of reducing the problem to a coupled system of ODEs, as with the method of lines, or to a sequence of stationary PDEs, as with time marching schemes like Rothe's method ([10]), one aims to solve the problem as a whole. The motivations to do so are to obtain a reduction in the computational complexity or storage requirements by exploiting the product structure of the space-time cylinder for tensor product approximation (see [24]); to construct meshes that are optimally adapted to singularities that are local in space and time (improving upon the rather limited possibilities of local time stepping); and, finally, to use the additional time dimension to enhance the possibilities for massive parallelization (e.g., [22]).
When aiming at a convergent, or rather optimally convergent, adaptive solution method, a major obstruction is that a space-time variational formulation does not lead to a bilinear form that is symmetric or coercive. In [21,26,36], considering parabolic PDEs, this problem was tackled as follows: by equipping the (Bochner) spaces w.r.t. which the bilinear form is well-posed with (wavelet) Riesz bases, the variational problem has an equivalent formulation as a well-posed bi-infinite matrix-vector equation. By then forming normal equations, one arrives at a well-posed symmetric positive definite bi-infinite matrix-vector problem whose solution can be approximated at the best possible nonlinear approximation rate using the adaptive wavelet scheme developed in [9,20].
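Schematically, the reduction to a symmetric positive definite problem can be summarized as follows (a sketch in generic notation, not that of [21,26,36]; B denotes the parabolic space-time operator):

```latex
% Space-time variational form: B well-posed, but the associated
% bilinear form is neither symmetric nor coercive.
B \in \mathrm{Lis}(U, V'), \qquad Bu = f .
% Riesz bases \Psi^U, \Psi^V for U and V yield an equivalent
% well-posed bi-infinite matrix-vector equation:
\mathbf{B}\mathbf{u} = \mathbf{f}, \qquad
\mathbf{B} := \bigl[(B\psi^U_\mu)(\psi^V_\lambda)\bigr]_{\lambda\in\vee_V,\,\mu\in\vee_U}
 \in \mathrm{Lis}\bigl(\ell_2(\vee_U), \ell_2(\vee_V)\bigr).
% Normal equations: symmetric positive definite on \ell_2(\vee_U).
\mathbf{B}^{\top}\mathbf{B}\,\mathbf{u} = \mathbf{B}^{\top}\mathbf{f}.
```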
For both the evaluation of the a posteriori error estimator and for solving the arising Galerkin systems, a key ingredient of these schemes is, for the current finite approximate solution vector, the approximate evaluation of the residual of the infinite system of equations within a tolerance that is a fixed multiple of the current error, at a cost such that the overall scheme has optimal computational complexity. In the original scheme, this was performed by approximating both the right-hand side vector and the matrix-vector multiplication separately, each within half this tolerance. For the second task, an ingenious scheme was constructed, the so-called apply-routine, which approximates each column of the matrix with an accuracy dependent on the modulus of the corresponding entry in the input vector. To obtain a proper decay of the matrix entries away from the diagonal, it requires that the wavelets have sufficiently many vanishing moments, dependent on their order. Even for linear operators, the apply-routine is non-linear, and unfortunately it turned out to be quantitatively quite demanding.
In [34], we developed an alternative for the apply-routine. It is based on the observation that if the wavelets are sufficiently smooth, then each entry of the residual vector equals the residual of the equation in mild form integrated against a wavelet. It can be shown that if the current approximation is from the span of the wavelets up to some level, then a sufficiently accurate residual vector approximation is obtained by ignoring all entries of this vector that correspond to wavelets whose levels exceed that level by more than a fixed constant. This observation also applies in non-uniform, adaptive settings, as long as the approximate solutions are sought in spans of wavelets whose indices form trees. For this scheme, the number of vanishing moments of the wavelets does not have to grow with their order, and often one vanishing moment suffices. To compute the remaining entries of the residual vector in linear complexity, i.e., those that cannot be ignored, the current approximate solution is first written in a locally finite single-scale basis, then its residual is integrated against single-scale basis functions, after which the transposed multi-to-single-scale transformation is applied. Without difficulty, the scheme applies to semi-linear PDEs as well.
The key idea behind this alternative residual evaluation scheme is not to split the residual functional into two terms and to approximate them separately, but to represent both terms in one common dictionary before integrating their difference against a wavelet basis. In this way, one benefits from the fact that the norm of this functional decreases proportionally to the error in the approximation, meaning that one is allowed to make a fixed relative error in the second step. The approach requires that the application of the PDO to any wavelet lands in L_2, which for second-order operators means that the (piecewise polynomial) wavelets have to be globally C^1. To avoid this unpleasant condition, in [34] we showed that any well-posed semi-linear second-order PDE can be written as a well-posed first-order system least squares (FOSLS) problem to which the whole machinery can be applied. In this case, it suffices to have bases of globally continuous (piecewise polynomial) wavelets, which are available on arbitrary polytopes.
In the current paper, we apply the approach introduced in [34] to parabolic evolutionary PDEs written in a well-posed simultaneously space-time variational formulation. We transform this problem into a well-posed FOSLS formulation, and equip the arising Bochner spaces with bases that are tensor products of temporal and spatial wavelet bases. The advantage of tensor product approximation is that it allows one to solve the full time evolution at an order of computational complexity equal to that of solving the corresponding stationary problem.
This advantage, however, does not come for free. As a counterpart to the tree approximations with non-tensor product wavelet approximation for stationary problems studied in [34], here we will allow only approximate solutions from spans of tensor product wavelets whose index pairs form multi-trees (more specifically, double trees). With this restriction, we will be able to show that a sufficiently accurate residual vector approximation is obtained as follows: drop each entry of this residual vector that corresponds to a tensor wavelet for which one or both of its levels exceed, by more than a fixed constant, the corresponding level of any tensor wavelet in the multi-tree with which it has an overlapping support.
To compute the remaining entries of the residual vector in linear complexity, we will rely on a generalisation from [25] of an algorithm originally developed in [8] in the (non-adaptive) sparse-grid setting. Note that now we cannot rely on a residual evaluation by writing the current approximation in a locally finite single-scale representation, since the complexity of such a representation generally will not be of the order of the complexity of the tensor-wavelet representation. The unavailability of such a locally finite single-scale representation has the consequence that we are not able to handle non-linear PDEs.
In a non-FOSLS setting, we studied adaptive tensor product wavelet methods for solving simultaneously space-time variational formulations of parabolic PDEs already in [13,36]. In those works, we considered unconstrained approximation, i.e., no multi-tree constraints, and relied on the apply-routine for the approximate residual evaluation. In [36] quantitative aspects were not considered, and in order to get a reasonably fast implementation, in [13] we considered custom-designed wavelets that, for a PDO with constant coefficients on rectangular spatial domains, yield a stiffness matrix that is truly sparse and therefore can be applied exactly in optimal complexity.

This paper is organized as follows: In Section 2, we give a well-posed FOSLS formulation of a parabolic PDE. One of the residuals in this system will be measured in a dual norm. To circumvent its evaluation, we replace this norm by an equivalent sequence norm of this residual integrated against a wavelet basis.
In Section 3, we recall some relevant facts about the adaptive wavelet Galerkin method (awgm) for solving general well-posed operator equations.
In Section 4, we apply the awgm to the FOSLS formulation of the parabolic PDE. We construct tensor product bases for the arising Bochner spaces, specify the multi-tree constraint on the index sets of the bases, and investigate the best possible approximation rate that can be achieved. Most of our efforts will be devoted to the construction of an approximate residual evaluation such that the overall scheme converges with the best possible rate in linear complexity. A number of technical decay estimates will be postponed to the Appendix.
In Section 5, we test our awgm on the heat equation on a two-dimensional L-shaped spatial domain with homogeneous Dirichlet boundary conditions. In two examples where the initial condition satisfies the usual condition of vanishing at the lateral boundary, we observe the best possible convergence rate. In an example where the initial condition violates this lowest order compatibility condition, so that the exact solution is discontinuous along a two-dimensional manifold, a reduced rate is observed. We envisage that the "full" rate will be restored by replacing the isotropic spatial wavelets by (piecewise) tensor product wavelets.
A short conclusion and a brief discussion of issues that are open to further investigations are presented in Section 6.
In this work, by C ≲ D we will mean that C can be bounded by a multiple of D, independently of parameters on which C and D may depend. Obviously, C ≳ D is defined as D ≲ C, and C ≂ D as C ≲ D and C ≳ D.
For normed linear spaces A and B, L(A, B) will denote the space of bounded linear mappings A → B endowed with the operator norm ‖·‖_{L(A,B)}. The subset of invertible operators in L(A, B) with inverses in L(B, A) will be denoted as Lis(A, B).

Well-posed FOSLS formulation of a parabolic PDE
For a bounded domain Ω ⊂ R^n and I := (0, T) for some T > 0, we consider a parabolic PDE of the form ∂u/∂t + N(u) − ∇_x · (A∇_x u) = g on I × Ω, with u = 0 on I × ∂Ω and u(0, ·) = h. Multiplying this equation by smooth test functions v of time and space that vanish on [0, T] × ∂Ω, integrating both sides over time and space, and applying integration by parts in space, we arrive at the variational problem of finding u ∈ U such that the resulting variational identity holds for all test functions v, i.e., G(u) = 0. We assume that the usual well-posedness conditions are satisfied, even with DG(u) ∈ Lis(U, V'), see e.g. [16, Ch. XVIII], [36]. In the linear case N(u) = Nu, G is affine, so DG is constant, and the solution u is unique.
For the verification of the conditions in semi-linear cases, we refer to e.g. [42].
Using the general framework outlined in [34, Sect. 2], we write our second-order PDE as a first-order system least squares (FOSLS) problem: with p := A∇_x u introduced as an additional unknown, for any solution u of G(u) = 0 it holds that (u, p) := (u, A∇_x u) ∈ U × P is a zero of the least squares functional Q, and so a solution of DQ(u, p) = 0. This latter operator equation is well-posed, in the sense of the norm equivalence (2.4), in a neighborhood of (u, p).
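In terms of the three "residuals" made explicit later in Section 4, the least squares functional can be sketched as follows (the precise norms are those of (2.3); in particular, the first residual is measured in a dual norm):

```latex
Q(u, p) := \bigl\|\tfrac{\partial u}{\partial t} + N u - \nabla_x \cdot p - g\bigr\|_{V_1'}^2
 + \bigl\|u(0,\cdot) - h\bigr\|_{L_2(\Omega)}^2
 + \bigl\|p - A \nabla_x u\bigr\|_{L_2(I;L_2(\Omega)^n)}^2 .
```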
Remark 2.2 (linear case cont'd) If N is linear, and thus G is affine, then DQ is affine, and therefore D²Q(u, p)(v, q) = DQ(v, q) − DQ(0, 0); in particular, D²Q is constant. The equation DQ(u, p) = 0 is uniquely solvable, and therefore its solution (u, p) is the zero of Q and u is the solution of G(u) = 0. The relation (2.4) holds globally.
Remark 2.3 In [28,29] a least squares functional similar to (2.3) has been studied with, essentially, the dual norm being replaced by ‖·‖²_{L₂(I;L₂(Ω)ⁿ)}. As a consequence, a norm equivalence as in (2.4) could not be demonstrated, but on the other hand the evaluation of the dual norm is avoided.
To deal with the dual norm in the definition of Q, we equip V₁ with a Riesz basis Ψ^{V₁}, with corresponding analysis operator and, as its adjoint, the synthesis operator. In the definition of the least squares functional Q, we now replace the standard dual norm on V₁' by the equivalent sequence norm of the residual integrated against Ψ^{V₁}. Writing Q in terms of the three residuals H₁, H₂, H₃, this yields DQ(u, p)(w, q) = ⟨DH₁(u, p)(w, q), H₁(u, p)⟩_{ℓ₂} + ⟨DH₂(u, p)(w, q), H₂(u, p)⟩_{L₂(Ω)} + ⟨DH₃(u, p)(w, q), H₃(u, p)⟩_P.
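The replacement of the dual norm rests on the Riesz basis property of Ψ^{V₁} = {ψ_λ : λ ∈ ∨_{V₁}}: for any functional f ∈ V₁' one has the norm equivalence (a sketch; constants are suppressed)

```latex
\|f\|_{V_1'} \eqsim
\Bigl\| \bigl( f(\psi_\lambda) \bigr)_{\lambda \in \vee_{V_1}} \Bigr\|_{\ell_2(\vee_{V_1})},
```

so that the dual norm of a residual can be evaluated, up to a fixed equivalence, by integrating the residual against the wavelets.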
To solve this operator equation DQ(u, p) = 0 we are going to apply the adaptive wavelet Galerkin method.

The adaptive wavelet Galerkin method (awgm)
In this section, we summarize findings about the awgm from [14,34,39]. Let (I) F : H ⊃ dom(F) → H', with H being a separable Hilbert space; (II) F(z) = 0; (III) F be continuously differentiable in a neighborhood of z; and (IV) DF(z) ∈ Lis(H, H'). In our applications, the triple (F, H, z) will read as (DQ, U × P, (u, p)), so that we already know that (I)-(IV) are valid. Let Ψ = {ψ_λ : λ ∈ ∨} be a Riesz basis for H, with analysis operator F : g ↦ g(Ψ) ∈ Lis(H', ℓ₂(∨)), and so synthesis operator F' : v ↦ v^⊤Ψ ∈ Lis(ℓ₂(∨), H). For any Λ ⊂ ∨, we set Ψ_Λ := {ψ_λ : λ ∈ Λ}. For satisfying the forthcoming Condition 3.3 that concerns the computational cost, it will be relevant that Ψ is a basis of wavelet type.
Writing z = F'z, and with F := F ∘ F ∘ F', an equivalent formulation of F(z) = 0 is given by F(z) = 0. We are going to approximate z, and so z, by a sequence of Galerkin approximations from the spans of increasingly larger sets of wavelets, which sets are created by an adaptive process. Given Λ ⊂ ∨, the Galerkin approximation z_Λ, or equivalently z_Λ := F'z_Λ, is the solution of ⟨F(z_Λ), v⟩_{ℓ₂(∨)} = 0 (v ∈ ℓ₂(Λ)), i.e., F(z_Λ)|_Λ = 0, and F(z_Λ)(v_Λ) = 0 (v_Λ ∈ span{ψ_λ : λ ∈ Λ}), respectively.
In order to be able to construct efficient algorithms, it will be needed to consider only sets from a certain subset of all finite subsets of ∨. This collection of so-called admissible sets will be specified later. For the moment, it suffices to know that the union of any two admissible sets is again admissible.
To provide a benchmark to evaluate our adaptive algorithm, for s > 0 we define the nonlinear approximation class A^s: a vector z is in A^s if and only if there exists a sequence of admissible sets (Λ_i)_i such that the corresponding best approximations supported on Λ_i converge to z at rate s. The adaptive wavelet Galerkin method (awgm) defined below produces a sequence of increasingly more accurate Galerkin approximations z_Λ to z. The, generally, infinite residual F(z_Λ) is used as an a posteriori error estimator. This a posteriori error estimator guides an appropriate enlargement of the current set Λ using a bulk chasing strategy, so that the sequence of approximations converges with the best possible rate to z. To arrive at an implementable method that is even of optimal computational complexity, both the Galerkin solution and its residual are allowed to be computed inexactly within sufficiently small relative tolerances.
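The nonlinear approximation class introduced above can be expected to take the standard form (a sketch; apart from the admissibility constraint, this is the definition from [9,20]):

```latex
\mathbf{z} \in \mathcal{A}^s \;\Longleftrightarrow\;
\|\mathbf{z}\|_{\mathcal{A}^s} := \sup_{N \in \mathbb{N}} N^{s}
 \min_{\substack{\Lambda \subset \vee \text{ admissible},\ \#\Lambda \le N}}
 \;\min_{\mathrm{supp}\,\mathbf{v} \subset \Lambda}
 \|\mathbf{z} - \mathbf{v}\|_{\ell_2(\vee)} < \infty .
```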
In step (R), by means of a loop in which an absolute tolerance is decreased, the true residual F(z_i) is approximated within a relative tolerance δ. In step (B), bulk chasing is performed on the approximate residual r_i. The idea is to find a smallest admissible Λ_{i+1} ⊃ Λ_i with ‖r_i|_{Λ_{i+1}}‖ ≥ μ₀‖r_i‖. For reasons of computational efficiency, the condition of having a truly smallest Λ_{i+1} has been relaxed. Finally, in step (G), a sufficiently accurate approximation of the Galerkin solution w.r.t. the new set Λ_{i+1} is determined.
Convergence of the adaptive wavelet Galerkin method, with the best possible rate, is stated in the following theorem.
Theorem 3.1 Let the parameters of the awgm and the neighborhood Z of the solution z all be sufficiently small. Then, for z ∈ A^s, the iterands z_i of the awgm converge to z with this rate s.

The computation of the approximate Galerkin solution z_{i+1} can be implemented by performing a simple fixed point iteration. Taking ω > 0 to be a sufficiently small constant and starting with z^{(0)}_{i+1} = z_i, a fixed number of iterations suffices to meet the condition on the Galerkin residual F(z_{i+1})|_{Λ_{i+1}}. This holds true also when each of the F(·)|_{Λ_{i+1}} evaluations is performed within an absolute tolerance that is a sufficiently small fixed multiple of ‖r_i‖.
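The fixed point iteration described above can be written as follows (a sketch in the notation of this section; ω and the stopping criterion are as in the text):

```latex
% Damped Richardson iteration on the index set \Lambda_{i+1}:
\mathbf{z}^{(j+1)}_{i+1}
 = \mathbf{z}^{(j)}_{i+1}
 - \omega\,\mathbf{F}\bigl(\mathbf{z}^{(j)}_{i+1}\bigr)\big|_{\Lambda_{i+1}},
\qquad \mathbf{z}^{(0)}_{i+1} = \mathbf{z}_i .
```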
Optimal computational complexity of the awgm, meaning that the work to obtain an approximation within a given tolerance ε > 0 can be bounded by some constant multiple of the bound on its support length from Thm. 3.1, is guaranteed under the following two conditions concerning the cost of the "bulk chasing" process and that of the approximate residual evaluation, respectively.
Condition 3.3 (Cost condition) For a sufficiently small, fixed ς > 0, there exists a neighborhood Z of the solution z of F(z) = 0, such that for all admissible Λ ⊂ ∨, z_Λ ∈ ℓ₂(Λ) ∩ Z, and ε > 0, there exists an r ∈ ℓ₂(∨) that approximates the residual F(z_Λ) within the required tolerance and cost. Under both conditions, the awgm has optimal computational complexity:

Theorem 3.4 In the setting of Theorem 3.1, and under Conditions 3.2 and 3.3, not only #supp z_i, but also the number of arithmetic operations required by the awgm for the computation of z_i can be bounded optimally.
In the setting of F = DQ, with Q being the FOSLS functional associated to our parabolic time evolution problem, we will be able to verify Condition 3.3 only when DQ is affine, i.e., when our PDO is linear, i.e., N(u) = Nu as given in (2.2). In this case, the requirement that the neighborhood Z be sufficiently small can be dropped.
Secondly, for any Λ ⊂ ∨ and ε > 0, our approximate residual r as meant in Condition 3.3 will be of the form of an approximate system matrix A_{Λ,ε} applied to z_Λ, minus an approximate right-hand side. Consequently, by taking ζ sufficiently small, this A_{Λ,ε} can be used for solving the arising Galerkin problems by a Krylov iteration, assuming that the initial residual is computed sufficiently accurately (see [20, Thm. 2.5] for details). This method is much more efficient than the simple fixed point iteration applicable in the general nonlinear case.

Expression for the residual
As announced before, we apply the awgm to solving DQ(u, p) = 0 where, in order to be able to satisfy the cost condition Condition 3.3, we take N(u) = Nu as given in (2.2). Besides the Riesz basis Ψ^{V₁} for V₁ introduced earlier, let Ψ^U := {ψ^U_λ : λ ∈ ∨_U} and Ψ^P := {ψ^P_λ : λ ∈ ∨_P} be Riesz bases for U and P, respectively; their union is then a Riesz basis for U × P, with ∨_{U×P} := ∨_U ∪ ∨_P. Since P equals the n-fold product of L₂(I; L₂(Ω)), we select Ψ^P of the form {ψ_λ e_i : (λ, i) ∈ ∨_P × {1, ..., n}}, with {ψ_λ : λ ∈ ∨_P} being a Riesz basis for L₂(I; L₂(Ω)). For the application of the awgm, for each [w q] supported on an admissible (and thus finite) subset Λ of ∨_{U×P}, we have to construct a computationally efficient approximation for the residual DQ([w q]), where DQ := F DQ F'. For that goal, we impose the condition that Ψ^P ⊂ L₂(I; H¹(Ω)). Then, with (w, q) := (w^⊤Ψ^U, q^⊤Ψ^P), we obtain an expression for the residual, where we applied (4.1) and the zero boundary conditions satisfied by Ψ^U. Note that the residual consists of three terms, each of them being essentially one of the three terms of the least squares functional (2.3) in strong form integrated against a wavelet basis.
In view of (2.4), for the current application of the awgm the cost condition can be reformulated as follows.

Tensor product bases
In view of the definitions of U, V₁, and P as Bochner spaces, their bases will consist of tensor products of functions from collections of temporal and spatial functions, apart from a normalisation in the case of U. We assume that the temporal collection for U is contained in H¹(I), and that the temporal collections are, after suitable normalisation, Riesz bases for L₂(I) and H¹(I), respectively. Here, with Σ/‖Σ‖_{L₂(I)} and similarly for other normalisations or collections, we mean the correspondingly renormalised collection. Likewise, the spatial collections are, after suitable normalisation, Riesz bases for H⁻¹(Ω) and H¹₀(Ω), (4.4), respectively; an interpolation argument shows that, consequently, they also renormalise to Riesz bases for the intermediate spaces. Under the above assumptions, the resulting tensor product collections are Riesz bases for V₁, P, and U, with index sets ∨_* = Δ_* × ♦_* for * being V₁, P, or U, respectively. For the last statement we refer to [23].

Piecewise polynomial spatial and temporal wavelets
For * ∈ {U, P, V₁}, we collect a number of (standard) assumptions on the spatial wavelet collections Σ_* = {σ^*_λ : λ ∈ ♦_*} on Ω. To each λ ∈ ♦_*, we associate a value |λ| ∈ N₀, called the level of λ. We will assume that the elements of Σ_* are locally supported, piecewise polynomial of some degree m w.r.t. dyadically nested partitions, in the sense of assumptions (s₁)-(s₄); in particular, Σ_* has the cancellation property of order 1. Generally, the polynomial degree m will be different for the different bases, but otherwise fixed. The collection O_Ω of polytopes underlying the partitions is shared among all bases.
In addition to (s₁)-(s₄), we assume that Σ_U has the cancellation property of order 2, (s^U₄). Wavelets of in principle arbitrary order that satisfy all these assumptions can be found in e.g. [17,32].

Remark 4.1 In both (s₄) and (s^U₄), supp σ^*_μ could be read as a neighborhood of this support of diameter ≂ 2^{−|μ|}, which requires only a minor adaptation of some proofs.

Definition 4.2 (tiling) A collection T ⊂ O_Ω such that Ω = ∪_{ω∈T} ω, and meas(ω₁ ∩ ω₂) = 0 for ω₁ ≠ ω₂ ∈ T, will be called a tiling. With P_m(T), we denote the space of piecewise polynomials of degree m w.r.t. T. The smallest common refinement of tilings T₁ and T₂ is denoted as T₁ ⊕ T₂.

To be able to find, in linear complexity, a representation of a function, given as a linear combination of wavelets, as a piecewise polynomial w.r.t. a tiling, we will impose a tree constraint on the underlying set of wavelet indices: to each λ ∈ ♦_*, we associate some neighbourhood S(σ^*_λ) of supp σ^*_λ, with diameter ≂ 2^{−|λ|}, such that S(σ^*_λ) ⊂ S(σ^*_μ) when λ is a child of μ. We call a finite Λ ⊂ ♦_* a tree if it contains all λ ∈ ♦_* with |λ| = 0, as well as the parent of any λ ∈ Λ with |λ| > 0.
Remark 4.4 Note that we have parent-child relations on the set O_Ω of polytopes as well as on the index sets ♦_* (and similarly, later, on the index sets Δ_*). We trust that no confusion will arise.
A proof of the following proposition, as well as an algorithm to apply the multi-to-single-scale transformation that is mentioned, is given in [39, §4.3].

Proposition 4.6 (tree-to-tiling) Given a tree Λ ⊂ ♦_*, there exists a tiling T(Λ) with span Σ_*|_Λ ⊂ P_m(T(Λ)). Moreover, equipping P_m(T(Λ)) with a basis of functions, each of which is supported in ω for one ω ∈ T(Λ), the representation of this embedding, known as the multi-to-single-scale transform, can be applied in O(#Λ) operations.
Conversely, given a tiling, we define an element-tree and, given an integer k, a wavelet-tree:

Definition 4.7 (tiling-to-tree) Given a tiling T ⊂ O_Ω, let t(T) ⊂ O_Ω be its enlargement by adding all ancestors of all ω ∈ T. Given a k ∈ N₀, we define the k-neighborhood ♦_*(T, k) of T in ♦_* accordingly.

Proposition 4.8 The set ♦_*(T, k) is a tree, and #♦_*(T, k) ≲ #T (dependent on k ∈ N₀).

Remark 4.9 The idea behind the definitions related to tilings is the following: with the application of the awgm to FOSLS formulations of PDEs where the wavelet bases are of non-tensor product form, as studied in [34], the residual consists of terms of the form ⟨Σ_*, g⟩_{L₂(Ω)}, where, for some tiling T, g ∈ P_m(T) because it is from the span of a set of wavelets with indices from a tree. Now estimates of the form lim_{k→∞} sup_{0≠g∈P_m(T)} ‖⟨Σ_*, g⟩_{L₂(Ω)}|_{♦_*\♦_*(T,k)}‖ / ‖g‖ = 0 were shown, meaning that, in order to approximate ⟨Σ_*, g⟩_{L₂(Ω)} within some given relative error, it is sufficient to compute this vector on a k-neighborhood of T in ♦_*, where k is a suitable constant. Furthermore, with the aid of multi-to locally single-scale transformations, this computation can be performed in linear complexity. Such results, together with analogous ones for temporal wavelets, will be the basis for the residual approximation in the current setting of the application of tensor product wavelets, where we will restrict to approximations from spans of sets of wavelets with indices that form multi-trees.
Finally, in this subsection, we add one more assumption on our PDE: we assume that its coefficients are piecewise polynomial w.r.t. the underlying partitions, (4.6).

Alpert wavelets
Recall the least squares functional Q from (2.3). It consists of three "residuals", ∂w/∂t + Nw − ∇_x·q − g, w(0, ·) − h, and q − A∇_x w (not to be confused with the residual DQ([w q])), whose norms are minimized. A main ingredient of our approximate evaluation of DQ([w q]) will consist of representing all terms in each of the three "residuals" in a common dictionary. If w and q were from spans of sets of non-tensor product wavelets whose index sets form trees, then such a dictionary could consist of the piecewise polynomials of some degree w.r.t. a tiling whose cardinality is of the order of the sum of the cardinalities of both trees. This is the setting considered in [34] for the solution of stationary PDEs. In the current setting of tensor product approximation, such a "single-scale" representation of optimal cardinality does not exist, unless we put conditions on the wavelet index sets that are so restrictive that the advantages of tensor product approximation concerning favourable approximation rates are lost. Instead, focusing on the first and third "residual", we employ a representation in terms of tensor products of temporal and spatial Alpert wavelets. Unfortunately, this procedure does not apply to nonlinear terms, which is the reason for our restriction to N(u) = Nu from (2.2). We set Σ^a as the tensor product of the temporal and spatial Alpert wavelet collections.

Multi-tree approximation
We need a definition of admissible subsets of the index set of our basis for U × P that, on the one hand, is sufficiently restrictive to allow for the evaluation of the approximate residuals in linear complexity and, on the other hand, yields the favourable approximation rates known from unconstrained tensor product approximation. For that goal, we consider multi-trees as a substitute for the concept of a tree in the non-tensor product case: sets S of index pairs with the property that if (i, j) ∈ S then {(max(i − 1, 0), j), (i, max(j − 1, 0))} ⊂ S. Examples of such multi-trees are index sets corresponding to 'full' or 'sparse' grids, see [5].
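On the level of the pairs of levels, the double-tree property can be checked directly; the following is a minimal illustrative sketch (the function name and the encoding of index pairs as tuples are our own, not the paper's):

```python
def is_double_tree(S):
    """Check the double-tree property: if (i, j) is in S, then
    (max(i-1, 0), j) and (i, max(j-1, 0)) must also be in S,
    i.e., S is downward closed in each coordinate."""
    S = set(S)
    return all(
        (max(i - 1, 0), j) in S and (i, max(j - 1, 0)) in S
        for (i, j) in S
    )

# A 'sparse-grid' type index set |i| + |j| <= L is admissible:
sparse = {(i, j) for i in range(4) for j in range(4) if i + j <= 3}
# Removing an interior pair destroys the downward closedness:
broken = sparse - {(1, 1)}
```

For instance, `is_double_tree(sparse)` holds, while `is_double_tree(broken)` fails, since (2, 1) then lacks its coordinate-wise predecessor (1, 1).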
Concerning the efficient computation of residuals, we recall the following result from [25] that builds on earlier work from [8] dealing with sparse grids: let a be a bilinear form that is a sum of products of temporal and spatial forms a_i, where the a_i are local, i.e., a_i(u_i, v_i) = 0 when |supp u_i ∩ supp v_i| = 0, and such that for ω ∈ O_Ω or ω ∈ O_I and p, q ∈ P_m(ω) the evaluation of a_i(p, q) can be performed in O(1) operations. Then for *, • ∈ {U, V₁, P, a}, multi-trees Λ_* ⊂ ∨_*, Λ_• ⊂ ∨_•, and w ∈ ℓ₂(Λ_*), the matrix-vector product corresponding to a can be computed in linear complexity. Although the definitions in [25] of a tree, and therefore of a multi-tree, are slightly different from the current definitions, the results from [25] carry over to the current setting without much difficulty. For details we refer to [33].

Best possible rate
Let the temporal and spatial bases for U and P be of certain orders, which determine the best possible approximation rate s_max. Note that when limited by the orders of the spatial wavelets, the value s_max for the approximation rate of the time-dependent problem is equal to the approximation rate for the corresponding stationary problem, which is the major advantage of tensor product approximation.
The value of s_max has been derived using "sparse-grid type" multi-trees, assuming sufficient smoothness of the solution. Using multi-trees that are adapted to local singularities of the solution, it can be expected that this rate can be attained for a much larger class of functions. Characterizations of approximation classes corresponding to unconstrained tensor product approximation (no admissibility condition) as anisotropic Besov spaces can be found in [31,41]. Based on results on non-tensor product tree approximation ([11]), we anticipate that the multi-tree constraint on the index sets makes the approximation classes only "slightly" smaller. See also [18] for results on multi-tree tensor product approximation of the solution of elliptic PDEs.
Although Besov regularity theory for the heat equation can be found in [1], to the best of our knowledge corresponding anisotropic Besov regularity results suitable for tensor product approximation are not yet available.
The first statement of the following lemma shows that with (w, q) := (w^⊤Ψ^U, q^⊤Ψ^P) it holds that ∂w/∂t + Nw − ∇_x·q ∈ span Σ^a|_{∨_a(Λ,0)}. The second and third statements of this lemma will imply a similar statement for the third "residual", being q − A∇_x w.

Lemma 4.16 Let Λ ⊂ ∨_{U×P} be admissible. Then the above inclusions w.r.t. ∨_a(Λ, 0) hold.

Proof The first statement follows by the L₂-orthogonality of the Alpert wavelets and the fact that, using assumption (4.6), the elements of the collections (∂/∂t + N)Ψ^U and ∇_x·Ψ^P are piecewise polynomials. The proofs of the second and third statements are similar.
The term ⟨Ψ^U(0, ·), w(0, ·) − h⟩_{L₂(Ω)} in DQ([w q]), resulting from the second "residual" w(0, ·) − h, is computed by means of an operator E whose transpose represents the trace mapping at t = 0 w.r.t. the bases for U and L₂(Ω), respectively. Since w is finitely supported, E^⊤w can be computed exactly in optimal complexity. The function w(0, ·) is piecewise polynomial w.r.t. some tiling T. Given an ε > 0, h will be approximated within tolerance ε by a piecewise polynomial h_ε w.r.t. some tiling T(ε), so that w(0, ·) − h_ε is piecewise polynomial w.r.t. the tiling T ⊕ T(ε). The vector ⟨Σ_U, w(0, ·) − h⟩_{L₂(Ω)} ∈ ℓ₂(♦_U) will then be approximated by restricting it to a k-neighborhood of T ⊕ T(ε) (cf. Def. 4.7). The remaining issue of how to approximate the application of E is dealt with in the following lemma.
Lemma 4.17 For k ∈ N₀, define E_k as the corresponding truncation of E.

Proof By the inequalities (4.8) and (4.9), which will be demonstrated below, the trace inequality, and, finally, the definition of Σ_U, we obtain the claimed bound. From this result, one easily infers the statements of the lemma.
From (s₄), we obtain a bound that shows the first inequality in (4.9); the second one follows from the normalisation. The first inequality in (4.8) is the inverse inequality for polynomials; thanks to (t₄), so is the second one. Note that here (t₄) has been used also for v ∈ H¹₀(I), cf. Remark 4.10.
Remark 4.19 Collections ∨_a(ε) and T(ε) as in the statement of the theorem exist, and so the condition about them concerns their actual construction. Indeed, when [u p] ∈ A^s, then, given an ε > 0, there exists an admissible Λ_ε with an approximation error below a proper multiple of ε. Using that g = ∂u/∂t + Nu − ∇_x·p, we infer that, by selecting a proper constant C, and with #∨_a(Λ_ε, 0) ≲ #Λ_ε ≲ ε^{−1/s}, a suitable ∨_a(ε) exists.
Similarly, since h = u(0, ·), with (Λ_ε)↓ defined similarly as in (S4), a suitable T(ε) can be constructed: the collection (Λ_ε)↓ is a tree in ♦_U, and so h_ε ∈ P_m(T(ε)) with, thanks to Proposition 4.6, #T(ε) ≲ #Λ_ε.

Proof of Theorem 4.18 The expression for DQ([w q]) given in (4.2), and the definitions of r_i and r̃_i for i ∈ {1/2, 1, 3/2, 2, 3}, show the claimed splitting of the error. By the boundedness of E (cf. Lemma 4.17) and, by the Riesz basis properties of Ψ^U, Ψ^{V₁}, and Ψ^P, that of the corresponding synthesis operators, it remains to show that all five terms at the right-hand side are bounded by a multiple of the right-hand side of (4.10).
For g_ε ∈ span Σ^a|_{∨_a(ε)}, we split the first residual accordingly. From the first statement of Lemma 4.16, we know that ∂w/∂t + Nw − ∇_x·q − g_ε ∈ span Σ^a|_{∨_a(Λ,ε)}. An application of the forthcoming Theorem A.3 shows that consequently the norm of the first term is ≲ 2^{−k/2} times the norm of this difference. From (S1), we obtain a bound on the second term. Applications of the forthcoming Corollaries A.7 and A.9 show that r_1 − r̃_1 is bounded accordingly. For h_ε ∈ P_m(T(ε)), we split the second residual similarly. Note that w(0, ·) ∈ span Σ_U|_{Λ↓}, and that Λ↓ is a tree in ♦_U. Using that Σ_U satisfies (s₁)-(s₄), analogously to [34, Prop. A4 (first statement)] one shows that the norm of the first term is ≲ 2^{−k/2}‖w(0, ·) − h_ε‖_{L₂(Ω)}. From Σ_U/‖Σ_U‖_{L₂(Ω)} being a Riesz basis, it follows that the norm of the second term is bounded as well. An application of Lemma 4.17 shows that r_2 − r̃_2 is bounded accordingly. From the second and third statements of Lemma 4.16, we know that q − A∇_x w and A^⊤(A∇_x w − q) are in ⊕_{i=1}^n span Σ^a|_{∨_a(Λ,0)} e_i. Now an application of the forthcoming Corollary A.11 shows that ‖r_3 − r̃_3‖ ≲ 2^{−k/2}‖q − A∇_x w‖_{L₂(I×Ω)ⁿ}, which completes the proof of (4.10).
The computation of r_{1/2}, r̃_1, r_{3/2}, r̃_2 or r̃_3 requires a number of operations of the order of …, respectively. Each of these numbers can be bounded by a multiple (dependent on k) of #Λ + ε^{−1/s}, which proves the statement about the total complexity of computing r.
The statement about the cost of computing r_{1/2} follows by expressing ∂w/∂t + Nw − ∇_x · q − g_ε in terms of a|_{∨_a(·,ε)}, and then applying the statement about the cost of the evaluation of (4.7). A similar argument applies for the computation of r̃_1 when …, as well as for the computation of r̃_3. The evaluation of r_{3/2} requires first the computation of E w, which takes O(#Λ) operations; then h_ε needs to be subtracted, taking O(#Λ↓ + #T(ε)) operations; and finally … (cf. Remark 4.9). The statement about the cost of the evaluation of r̃_2 follows directly from the definition of E_k.
Finally, recall that, in addition to the cost condition, Condition 3.3*, that has been verified in Theorem 4.18, another condition, Condition 3.2, is required to conclude by Theorem 3.4 that the awgm is optimal. This condition requires the determination of an admissible Λ̃ ⊃ Λ with essentially quasi-minimal #(Λ̃ \ Λ) such that ‖r|_Λ̃‖ ≥ μ_0‖r‖. Unfortunately, in our setting of multi-tree approximation, we are not aware of an algorithm that is guaranteed to yield such a Λ̃. In our numerical experiments, as a first step, we constructed some set Λ̂ ⊃ Λ with quasi-minimal #(Λ̂ \ Λ) such that ‖r|_Λ̂‖ ≥ μ_0‖r‖ by applying a bucket sort procedure to the entries of r. Secondly, we enlarged Λ̂ to an admissible set Λ̃. In the experiments, we observed that #(Λ̃ \ Λ̂) is at most a small multiple of #(Λ̂ \ Λ), which suggests that Condition 3.2 is satisfied, but we do not have a proof of this.
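The bucket-sort selection step described above can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: the function name, the number of buckets, and the dense representation of the residual are assumptions, and the subsequent enlargement to an admissible (multi-tree) set is not shown.

```python
import numpy as np

def bucket_sort_bulk(r, mu0, nbuckets=30):
    """Approximate bulk chasing: select indices S with
    sum_{i in S} r_i^2 >= mu0^2 * ||r||_2^2, scanning dyadic buckets
    of entry magnitudes from large to small instead of fully sorting r."""
    a = np.abs(np.asarray(r, dtype=float))
    amax = a.max()
    target = mu0**2 * np.sum(a**2)
    # bucket b holds the entries with amax * 2^{-(b+1)} < |r_i| <= amax * 2^{-b}
    bucket = np.full(a.shape, nbuckets - 1, dtype=int)
    nz = a > 0
    bucket[nz] = np.minimum(
        np.floor(np.log2(amax / a[nz])).astype(int), nbuckets - 1)
    selected, acc = [], 0.0
    for b in range(nbuckets):            # largest entries first
        for i in np.flatnonzero(bucket == b):
            selected.append(i)
            acc += a[i] ** 2
            if acc >= target:
                return np.array(selected)
    return np.array(selected)
```

The point of bucketing is that it runs in linear time in the number of entries, whereas an exact sort would cost an extra logarithmic factor; within a bucket the selection order is arbitrary, which is why the resulting set is only quasi-minimal.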

Numerical results
We consider the heat equation, i.e., A = Id and N = 0, on the L-shaped domain Ω = (0, 1)² \ [1/2, 1)² and I = (0, 1). For convenience, we take g = 1, and consider three different initial conditions h = 0, h = 1, and h(x, y) = 50x(x − 1)(x − 1/2)y(y − 1)(y − 1/2). The resulting solutions will exhibit a spatial singularity caused by the re-entrant corner, as well as, for the last two initial conditions, temporal-spatial singularities caused by the incompatibility of the initial and boundary conditions (most severely for h = 1) at the intersection of the bottom and lateral boundary. We select the spatial wavelet collections Σ^* for * ∈ {U, V_1, P} such that σ^*_λ with |λ| = ℓ is piecewise polynomial w.r.t. the triangulation of Ω indicated in Fig. 1, which specifies the collection O_Ω. We take Σ^{V_1} to be the continuous piecewise linear three-point wavelet basis from [38], satisfying homogeneous boundary conditions, and normalized such that it is a Riesz basis for H^1_0(Ω); for P we select this three-point wavelet basis, now without boundary conditions, and normalized such that it is a Riesz basis for L_2(Ω). For U we take a continuous piecewise quadratic wavelet basis that will be described in [35], and that, when normalized in H^1(Ω) or in H^{−1}(Ω), is a Riesz basis for H^1_0(Ω) or H^{−1}(Ω), respectively. As temporal bases, for V_1 and P we take a discontinuous, L_2(I)-orthonormal, piecewise linear wavelet basis, and for U the continuous piecewise linear three-point wavelet basis (without boundary conditions).
These collections satisfy all conditions (s_1)–(s_4), (s^U_4), and (t_1)–(t_4), and using them we build the tensor product wavelet bases for V_1, P and U as in (4.5).
It holds that d^U_t = d^P_t = d^P_x = 2 and d^U_x = 3, meaning that the best possible rate is s_max = 1.
The auxiliary temporal and spatial Alpert wavelet bases were taken of degrees 1 and 2, respectively, which are the smallest values for which the inclusions from Lemma 4.16 hold.
Recall that DQ is affine, so that its (symmetric positive definite) linear part is given by D²Q : [w q] ↦ DQ[w q] − DQ[0 0]. We have approximated the condition number of this system matrix by restricting it to the square block corresponding to all wavelets with indices λ ∈ ∨_U, (μ, i) ∈ ∨_P × {1, 2} with |λ|, |μ| less than some integer. Even these Galerkin matrices cannot be evaluated exactly, because they are of the form A_1^⊤A_1 + A_2 + A_3, where the rows of A_1 run over the infinite index set ∨_{V_1}. Similarly to step (S2) in Algorithm 2, we made an approximation by omitting all rows corresponding to indices γ ∈ ∨_{V_1} for which |γ| exceeds any value of |λ| or |μ| by more than a constant k, which was taken sufficiently large that it hardly affected the computed condition numbers. These numbers, illustrated in Fig. 2, indicate that the condition number of the infinite system matrix can be expected to be of the order of 700. In [34], for an analogous FOSLS formulation of the corresponding stationary operator, i.e., the Laplace operator, and with the same spatial wavelets (thus without temporal wavelets), we found a condition number of the order of 550.
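The truncation experiment can be mimicked in a few lines. The following is a hedged sketch: the matrix below is a toy SPD stand-in with entries 2^{−|i−j|}, not the actual Galerkin matrix, and `kms_block` and `truncated_condition_numbers` are illustrative names.

```python
import numpy as np

def kms_block(J):
    """Toy SPD stand-in for a finite section of the system matrix:
    a Kac-Murdock-Szego matrix with entries 2^{-|i-j|}, of size 2^J."""
    i = np.arange(2 ** J)
    return 0.5 ** np.abs(i[:, None] - i[None, :])

def truncated_condition_numbers(make_block, max_level):
    """Condition numbers of growing finite sections of an SPD operator;
    if these level off, the infinite matrix is well conditioned."""
    conds = []
    for J in range(1, max_level + 1):
        ev = np.linalg.eigvalsh(make_block(J))   # SPD: all eigenvalues > 0
        conds.append(ev[-1] / ev[0])             # largest / smallest
    return conds
```

For this toy matrix the condition numbers increase with the section size (by Cauchy interlacing for nested principal submatrices) but stay below the uniform bound 9 = ((1 + 1/2)/(1 − 1/2))², mirroring how the computed numbers in Fig. 2 level off, there near 700.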
We applied the awgm given in Algorithm 1. In step (R) of this algorithm, instead of performing a loop, we simply computed the approximate residual by one application of Algorithm 2. In step (B), we applied bulk chasing as explained in the last paragraph of Section 4, with parameter μ = 0.5. The Galerkin matrices in step (G) were approximated using Algorithm 2, and approximately solved, with parameter γ = 0.2, using the CG iteration explained in the last paragraph of Section 3.
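The overall loop can be sketched in idealized form. This is a minimal Python sketch assuming a dense SPD matrix and exact residuals and Galerkin solves in place of the inexact evaluations of Algorithms 1 and 2; the names, the simple sort-based bulk selection, and the absence of the multi-tree admissibility constraint are all simplifying assumptions.

```python
import numpy as np

def select_bulk(r, mu):
    """Smallest index set capturing a fraction mu of the residual norm."""
    order = np.argsort(-np.abs(r))
    csum = np.cumsum(r[order] ** 2)
    k = np.searchsorted(csum, mu**2 * csum[-1]) + 1
    return order[:k]

def awgm(A, f, n_iter, mu=0.5):
    """Idealized adaptive (wavelet) Galerkin loop: residual (R),
    bulk chasing (B), Galerkin solve on the active index set (G)."""
    active = np.array([], dtype=int)
    u = np.zeros_like(f)
    for _ in range(n_iter):
        r = f - A @ u                                      # (R): exact here
        active = np.union1d(active, select_bulk(r, mu))    # (B)
        sub = np.ix_(active, active)
        u[:] = 0.0
        u[active] = np.linalg.solve(A[sub], f[active])     # (G): exact solve
    return u, active
```

In the actual method, the residual in (R) is only evaluated approximately (at linear cost), the bulk selection in (B) must respect multi-tree admissibility, and the Galerkin systems in (G) are solved inexactly by CG up to a relative tolerance governed by γ.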
In the left pictures in Figs. 3, 4 and 5, for right-hand side g = 1 and initial conditions h = 0, h = 1, and h(x, y) = 50x(x − 1)(x − 1/2)y(y − 1)(y − 1/2), respectively, the ℓ_2-norm of the (approximate) residual is given vs. the number of wavelets selected from the basis. This norm is equivalent to the U × P-norm of the error of the approximation to (u, p) = (u, ∇_x u). In the right pictures, one finds the centers of the supports of the tensor product wavelets that were selected by the adaptive method.
For h = 0 and h(x, y) = 50x(x − 1)(x − 1/2)y(y − 1)(y − 1/2), one observes that the awgm converges with the best possible rate s = 1. Moreover, thanks to the tensor product approximation, not only the rate but also the absolute errors are rather close to the results we found in [33, Fig. 4] for the corresponding stationary problem, with the errors in (u, ∇_x u) measured in H^1_0(Ω) × L_2(Ω)². This means that one obtains the additional time dimension nearly for free. For h = 1, the observed rate indicates that the exact solution is only in A^{1/2}. This reduction in the best approximation rate can be understood as follows: with this initial condition, the solution u is discontinuous at the full intersection of the lateral boundary and the bottom of the space-time cylinder, inducing strong refinements near this intersection, as illustrated in the right picture of Fig. 4. Although the solution is smooth in the direction tangential to this intersection, since the spatial wavelets are isotropic, the method cannot benefit from this smoothness, which causes the reduced best approximation rate.
We expect that if we had applied a spatial piecewise tensor product wavelet basis as constructed in [12], this reduction would not have occurred, and also for this problem we would have obtained rates as if we were solving a one-dimensional problem. We have, however, chosen the current isotropic spatial wavelets because of their (relatively) easy construction, and because they apply to any polygon. Moreover, parabolic problems are usually studied assuming that the data satisfy the lowest-order compatibility condition of a vanishing initial condition at the homogeneous Dirichlet boundary. Note that the initial condition h(x, y) = 50x(x − 1)(x − 1/2)y(y − 1)(y − 1/2) does vanish at ∂Ω, but does not satisfy the next compatibility condition −Δh = g (= 1) at ∂Ω. Nevertheless, it seems to give rise to the best possible rate allowed by the polynomial orders that were applied.

Conclusion
In this paper, an optimal adaptive wavelet solver has been developed for a simultaneously space-time FOSLS formulation of parabolic PDEs. Thanks to the use of tensor products of wavelets in space and time, the whole time evolution can be solved at the complexity of solving the corresponding stationary problem, as has been illustrated by numerical results.
A theoretical issue that has not yet been satisfactorily solved is that of bulk chasing under a multi-tree constraint (cf. the last paragraph of Section 4). It may require a generalisation to multi-trees of the tree approximation routines given in [4].
Unlike in our preceding paper [34], which dealt with stationary PDEs and non-tensor-product approximation, in order to construct an approximate residual evaluation of linear complexity we had to restrict ourselves to linear PDOs. It would be interesting to circumvent this restriction.
In [37], we constructed well-posed simultaneously space-time saddle-point formulations of the instationary Stokes and Navier-Stokes equations. Using the approach from [34], these formulations can be recast as well-posed FOSLSs, so that, modulo the treatment of the nonlinear term in the NSE, the adaptive scheme from the current work applies.
Adaptive finite element (afem) schemes usually have better quantitative properties than adaptive wavelet schemes. To the best of our knowledge, for simultaneously space-time variational formulations there are currently no afem schemes available that are proven to converge, let alone to be optimal or to give rates as for stationary problems. Our FOSLS formulation might give an opening towards such results, because it yields a well-posed symmetric positive definite problem.

Appendix: Decay estimates
In this appendix, we prove the technical results Theorem A.3 and Corollaries A.7, A.9, and A.11 that were used in the proof of Theorem 4.18.
The following lemma is an application of Schur's lemma that is often used to bound the spectral norm of a matrix whose row and column indices run over index sets of multi-level bases.
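For reference, the classical Schur test on which such lemmas are based reads as follows. This is the generic weighted form; the precise statement of Lemma A.1, with its level-dependent parameters, is not reproduced here.

```latex
% Weighted Schur test: for a matrix B = (b_{\lambda\mu}) and any
% positive weights (w_\lambda), the spectral norm satisfies
\[
  \|B\| \le
  \Big( \sup_{\lambda} \sum_{\mu} \frac{w_\mu}{w_\lambda}\,|b_{\lambda\mu}| \Big)^{\!1/2}
  \Big( \sup_{\mu} \sum_{\lambda} \frac{w_\lambda}{w_\mu}\,|b_{\lambda\mu}| \Big)^{\!1/2}.
\]
```

In the multilevel applications below, the weights are taken as powers of two of the level, so that off-diagonal decay of the entries, combined with the limited number of non-zeros per pair of levels, yields bounds such as ‖B_r‖ ≲ 2^{−k}.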
The number of non-zero entries in each column or row of … is ≲ 2^{ξ(ℓ−ℓ′)} or 1, respectively. Using … .
The next lemma concerns the near-sparsity of a generalized mass matrix corresponding to two temporal wavelet bases.
Proof Using that Θ^* satisfies (t_1)–(t_4), being the counterparts for the temporal wavelets of the conditions (s_1)–(s_4) for the spatial wavelets, we split the matrix into B_r + B_s, where B_r contains all its entries ⟨θ^*_{λ′}, θ^•_λ⟩_{L_2(I)} for which supp θ^*_{λ′} is contained in ω for some ω ∈ O_I with |ω| = |λ| (the 'regular' entries), and where B_s contains the remaining ('singular') entries.
The number of non-zero entries with |λ′| = ℓ′ and |λ| = ℓ in each column or row of B_r is ≲ 2^{ℓ−ℓ′} or 1, respectively. Thanks to (t_4), for each of these entries we have …
The following theorem provides the main ingredient for bounding ‖r_{1/2} − r̃_{1/2}‖.
Theorem A.3 Let Λ_a ⊂ ∨_a be a multi-tree, and r ∈ span a|_{Λ_a}. For k ∈ N_0, it holds that …
Proof We write r = Σ_{(λ,μ)∈Λ_a} r_{λμ} θ^a_λ ⊗ σ^a_μ, and let … Here, we could insert the factor δ_λ(μ′) in the second sum for the following reason: …, where we used that Σ^{V_1} is a Riesz basis for H^1_0(Ω), and that Θ^a is a Riesz basis for L_2(I).
If Σ^a were a Riesz basis for H^{−1}(Ω), then in the proof of Theorem A.3 it would have been natural to write σ… In that case, the approach of inserting the factor δ_λ(μ′) would have given the bound … Although in the current setting, where …, this estimate is of little use, for other collections this result, formulated in the next proposition, is going to be useful.
Proof For proving the first inequality, we split the matrix into B_r + B_s, where B_r contains all its entries for which supp σ^U_μ is contained in ω for some ω ∈ O_Ω with |ω| = |μ| (the 'regular' entries), and where B_s contains the remaining ('singular') entries. Thanks to (s^U_4), for the regular entries we can estimate …, where we used ‖σ…‖ … An application of Lemma A.1 with ξ = n and ϱ = 1 shows that ‖B_r‖ ≲ 2^{−k}. Since the wavelets σ^{V_1}_μ are piecewise polynomial functions in H^1(Ω), they are contained in W^1_∞(Ω). Using (s_4), for the remaining singular entries we estimate … An application of Lemma A.1 with ξ = n − 1 and ϱ = 1/2 shows that ‖B_s‖ ≲ 2^{−k/2}. Moving to the second inequality, we split the matrix into B_r + B_s, where B_r contains all its entries for which supp σ^U_μ is contained in ω ∩ Ω for some ω ∈ O_Ω with |ω| = |μ| (the 'regular' entries), and where B_s contains the remaining ('singular') entries.
Thanks to (s_4), for the regular entries we can estimate …, where we used that ‖σ^U_μ‖_{L_2(Ω)} … An application of Lemma A.1 with ξ = n and ϱ = 2 shows that ‖B_r‖ ≲ 4^{−k}. For the remaining singular entries we estimate … An application of Lemma A.1 with ξ = n − 1 and ϱ = 1/2 shows that ‖B_s‖ ≲ 2^{−k/2}.
Lemma A.6 For k ∈ N_0, it holds that …
Proof We split the matrix into B_r + B_s, where B_r contains all ('regular') entries for which supp θ^U_λ is contained in ω ∩ I for some ω ∈ O_I with |ω| = |λ| (so that in particular θ^U_λ vanishes on ∂I), and where B_s contains the remaining ('singular') entries.
For the regular entries, we can estimate …, where we used (t_4), Poincaré's inequality, an inverse inequality, and … ≲ 1. An application of Lemma A.1 with ξ = 1 and ϱ = 2 shows that ‖B_r‖ ≲ 4^{−k}. For the remaining singular entries, we estimate …
From Θ^U/‖Θ^U‖_{L_2(I)}, Θ^a, Σ^U/‖Σ^U‖_{H^1(Ω)}, and Σ^a being Riesz bases for L_2(I), L_2(I), H^1_0(Ω), and L_2(Ω), respectively, we have (c). A subset of the arguments that showed the second inequality gives the third one.
Lemma A.8 For k ∈ N_0 and 1 ≤ i ≤ n, it holds that …
Proof We split the matrix into B_r + B_s, where B_r contains all its entries ⟨σ^P_{μ′}, ∂_i σ^{V_1}_μ⟩_{L_2(Ω)} for which supp σ^P_{μ′} is contained in ω ∩ Ω for some ω ∈ O_Ω with |ω| = |μ| (the 'regular' entries), and where B_s contains the remaining ('singular') entries.
For the regular entries, using (s_4), we can estimate … An application of Lemma A.1 with ξ = n and ϱ = 1 shows that ‖B_r‖ ≲ 2^{−k}. Since the wavelets σ^{V_1}_μ are piecewise polynomial functions in H^1(Ω), they are contained in W^1_∞(Ω). For the remaining singular entries we estimate … An application of Lemma A.1 with ξ = n − 1 and ϱ = 1/2 shows that ‖B_s‖ ≲ 2^{−k/2}.
The following corollary will be used to bound ‖(r_1 − r̃_1)|_{∨_P}‖.

Fig. 2 Approximate condition numbers of the Galerkin matrices vs. their dimension

Fig. 3 Norm of the residual vs. the number of wavelets (dashed line has slope −1), and centers of the supports of the 4500 selected wavelets, for h = 0 and g = 1

Fig. 4 Norm of the residual vs. the number of wavelets (dashed line has slope −1/2), and centers of the supports of the 4131 selected wavelets, for h = 1 and g = 1