# Adaptive Near-Optimal Rank Tensor Approximation for High-Dimensional Operator Equations


## Abstract

We consider a framework for the construction of iterative schemes for operator equations that combine low-rank approximation in tensor formats and adaptive approximation in a basis. Under fairly general assumptions, we conduct a rigorous convergence analysis, where all parameters required for the execution of the methods depend only on the underlying infinite-dimensional problem, but not on a concrete discretization. Under certain assumptions on the rates for the involved low-rank approximations and basis expansions, we can also give bounds on the computational complexity of the iteration as a function of the prescribed target error. Our theoretical findings are illustrated and supported by computational experiments. These demonstrate that problems in very high dimensions can be treated with controlled solution accuracy.

### Keywords

Low-rank tensor approximation · Adaptive methods · High-dimensional operator equations · Computational complexity

### Mathematics Subject Classification

41A46 · 41A63 · 65D99 · 65J10 · 65N12 · 65N15

## 1 Introduction

### 1.1 Motivation

Any attempt to recover or approximate a function of a large number of variables with the aid of classical low-dimensional techniques is inevitably impeded by the *curse of dimensionality*. This means that, assuming only classical smoothness (e.g., in terms of Sobolev or Besov regularity) of order \(s>0\), the computational work needed to realize a desired target accuracy \(\varepsilon \) in \(d\) dimensions scales like \(\varepsilon ^{-d/s}\), i.e., one faces an exponential increase with the spatial dimension \(d\). This can be ameliorated by *dimension-dependent* smoothness measures. In many high-dimensional problems of interest, the approximand has bounded high-order mixed derivatives, which under suitable assumptions can be used to construct sparse grid-type approximations where the computational work scales like \(C_d \varepsilon ^{-1/s}\). Under such regularity assumptions, one can thus obtain a convergence rate independent of \(d\). In general, however, the constant \(C_d\) will still grow exponentially in \(d\). This has been shown to hold even under extremely restrictive smoothness assumptions in [30] and has been observed numerically in a relatively simple but realistic example in [13].

Hence, in contrast to the low-dimensional regime, regularity is no longer a structural property sufficient to ensure computational feasibility; a further, low-dimensional structure of the sought high-dimensional object is required. Such a structure could be the dependence of the function on a much smaller (unknown) number of variables; see, for example, [12]. It could also mean *sparsity* with respect to *some* (a priori unknown) dictionary. In particular, dictionaries composed of *rank-one tensors* \(g(x_1,\ldots , x_d)=g_1(x_1)\cdots g_d(x_d)=: (g_1\otimes \cdots \otimes g_d)(x)\) open very promising perspectives and have recently attracted substantial attention.

*nonlinear parametrization* of a reference basis – breaks the curse of dimensionality; the second one obviously does not.

*well approximable* by relatively short sums of rank-one tensors. By this we mean that for some norm \(\Vert \cdot \Vert \) we have

This argument, however, already indicates that good approximability in the sense of (2) is not governed by classical regularity assumptions. Instead, the key is to exploit an approximate global low-rank structure of \(u\). This leads to a highly nonlinear approximation problem, where one aims to identify suitable lower-dimensional tensor factors, which can be interpreted as a \(u\)-dependent dictionary.

This discussion, though admittedly somewhat oversimplified, immediately raises several questions that we will briefly discuss as they guide subsequent developments.

*Format of approximation*: the hope that \(r(\varepsilon )\) in (2) can be rather small is based on the fact that the rank-one tensors are allowed to “optimally adapt” to the approximand \(u\). The format of the approximation used in (2) is sometimes called *canonical* since it is a formal direct generalization of classical Hilbert–Schmidt expansions for \(d=2\). However, a closer look reveals a number of well-known pitfalls. In fact, they are already encountered in the *discrete* case. The collection of sums of rank-one tensors of a given length is not closed, and the best approximation problem is not well-posed; see, for example, [35]. There appears to be no reliable computational strategy that can be proven to yield near-minimal rank approximations for a given target accuracy in this format. In this work, we therefore employ different tensor formats that allow us to obtain provably near-minimal rank approximations, as explained later.

*A two-layered problem*: Given a suitable tensor format, even if a best tensor approximation is known in the infinite-dimensional setting of a continuous problem, the resulting lower-dimensional factors still need to be approximated. Since finding these factors is part of the solution process, the determination of efficient discretizations for these factors will need to be intertwined with the process of finding low-rank expansions. We have chosen here to organize this process through selecting low-dimensional orthonormal wavelet bases for the tensor factors. However, other types of basis expansions would be conceivable as well.

The issue of the total complexity of tensor approximations, taking into account the approximation of the involved lower-dimensional factors, is addressed in [19, 34].

### 1.2 Conceptual Preview

The problem of finding a suitable format of tensor approximations has been extensively studied in the literature over the years, however, mainly in the discrete or finite-dimensional setting; see, for example, [17, 22, 25, 31, 33]. Some further aspects in a function space setting have been addressed, for example, in [14, 39, 40]. For an overview and further references we also refer the reader to [20] and the recent survey [18]. A central question in these works is: given a tensor, how can one obtain in a stable manner low-rank approximations, and how accurate are they when compared with best tensor approximations in the respective format?

The focus of the present paper is instead on the *continuous infinite-dimensional* setting, i.e., on sparse tensor approximations of a *function* that is a priori not given in any finite tensor format but that one may expect to be well approximable by simple tensors in a way to be made precise later. We shall not discuss here the question of the concrete conditions under which this is actually the case. Moreover, the objects to be recovered are not given explicitly but only *implicitly*, as the solution to an operator equation

The main contribution of this work is to put forward a strategy that addresses the main obstacles identified previously and results in an algorithm that, under mild assumptions, can be rigorously proven to provide for any target accuracy \(\varepsilon \) an approximate solution of *near-minimal* rank and representation complexity of the involved tensor factors. Specifically, (i) it is based on stable tensor formats relying on optimal subspaces; (ii) successive solution updates involve a *combined refinement* of ranks and factor discretizations; (iii) (near-)optimality is achieved, thanks to (i), through accompanying suitable *subspace correction* and *coarsening* schemes.

A first conceptual step is the choice of a *universal* basis for functions of a *single variable* in \(H_i\). Here, we focus on wavelet bases, but other systems, like the trigonometric system for periodic problems, are conceivable as well. As soon as functions of a single variable, in particular the factors in our rank-one tensors, are expanded in such a basis, the whole problem of approximating \(u\) reduces to approximating its *infinite* coefficient tensor \(\mathbf{u}\) induced by the expansion

*Riesz basis* for \(V\). This, in turn, together with the fact that \(\kappa _{V\rightarrow V'}(A):= \Vert A\Vert _{V\rightarrow V'}\Vert A^{-1}\Vert _{V'\rightarrow V}\) is finite, allows one to show that \(\kappa _{\ell _2\rightarrow \ell _2}(\mathbf {A})\) is finite; see [11]. Hence one can find a positive \(\omega \) such that \(\Vert \mathbf{I}- \omega \mathbf {A}\Vert _{\ell _2\rightarrow \ell _2}\le \rho < 1\), i.e., the operator \(\mathbf{I}- \omega \mathbf {A}\) is a contraction, so that the iteration

Of course, (8) is only an idealization because the full coefficient sequences \(\mathbf{u}_k\) cannot be computed. Nevertheless, adaptive wavelet methods can be viewed as realizing (8) *approximately*, keeping as few wavelet coefficients “active” as possible while still preserving enough accuracy to ensure convergence to \(\mathbf{u}\) (e.g., [9, 10]).
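To make the role of the contraction property concrete, the following sketch runs the damped iteration for a finite symmetric positive definite matrix, with each update perturbed up to a geometrically decreasing tolerance, mimicking an inexact realization of (8). The diagonal test matrix, the damping parameter \(\omega\), and the tolerance schedule are illustrative assumptions of ours, not the paper's algorithm.

```python
import numpy as np

def perturbed_richardson(A, f, omega, eps0=1.0, theta=0.5, tol=1e-8, max_iter=500):
    """Damped Richardson iteration u_{k+1} = u_k + omega*(f - A u_k),
    where each exact update is replaced by one known only up to a
    tolerance eps_k (modeled here by a random perturbation of norm eps_k)
    that is decreased geometrically, eps_{k+1} = theta * eps_k."""
    rng = np.random.default_rng(0)
    u = np.zeros_like(f)
    eps = eps0
    for _ in range(max_iter):
        residual = f - A @ u
        # inexact update: applying A and adding f is realized only up to eps
        noise = rng.standard_normal(len(f))
        noise *= eps / np.linalg.norm(noise)
        u = u + omega * residual + noise
        eps *= theta  # tighten the tolerance for the next sweep
        if np.linalg.norm(f - A @ u) < tol:
            break
    return u

# spectrum in [1, 2], so ||I - omega*A|| <= rho = 1/3 < 1 for omega = 2/3
n = 20
B = np.diag(np.linspace(1.0, 2.0, n))
f = np.ones(n)
u = perturbed_richardson(B, f, omega=2.0 / 3.0)
```

Since the tolerances decay slower than the contraction factor, the perturbed iterates still converge to the exact solution, which is the mechanism exploited by the adaptive schemes discussed above.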

*reduction* operators, and the \(\varepsilon _i(k)\), \(i=1,2\), are suitable tolerances that decrease with increasing \(k\).

More precisely, the purpose of \(\mathrm{P}_{\varepsilon }\) is to “correct” the current tensor expansion and, in doing so, reduce the rank subject to an accuracy tolerance \(\varepsilon \). We shall always refer to such a rank reduction operation as a *recompression*. For this operation to work as desired, it is essential that the employed tensor format be stable in the sense that the best approximation problem for any given rank is well-posed. As explained earlier, this excludes the use of the canonical format. Instead, we use the so-called *hierarchical Tucker* (HT) format since on the one hand it inherits the stability of the Tucker format [14], as a classical best subspace method, while on the other hand it better ameliorates the curse of dimensionality that the Tucker format may still be prone to. In Sect. 2 we collect the relevant prerequisites. This draws to a large extent on known results for the finite-dimensional case but requires proper formulation and extension of these notions and facts for the current sequence space setting. The second reduction operation \(\mathrm{C}_{\varepsilon }\), in turn, is a *coarsening* scheme that reduces the number of degrees of freedom used by the wavelet representations of the tensor factors, again subject to some accuracy constraint \(\varepsilon \).
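For \(m=2\), a recompression operator of the kind \(\mathrm{P}_{\varepsilon }\) reduces to truncating a singular value decomposition at the smallest rank whose discarded tail stays below \(\varepsilon\). A minimal numpy sketch of this special case (our own illustration, not the HT recompression scheme developed in Sect. 3):

```python
import numpy as np

def truncation_rank(s, eps):
    """Smallest r such that the discarded tail (sum_{k>=r} s_k^2)^(1/2)
    of the decreasingly ordered singular values s is at most eps."""
    tails = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]  # tails[k] = ||(s_k, s_{k+1}, ...)||
    keep = np.nonzero(tails > eps)[0]
    return int(keep[-1] + 1) if keep.size else 0

def recompress(M, eps):
    """Rank reduction P_eps for an order-2 tensor: SVD truncation with
    guaranteed error ||M - P_eps M|| <= eps at near-minimal rank."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    r = truncation_rank(s, eps)
    return (U[:, :r] * s[:r]) @ Vt[:r]

rng = np.random.default_rng(1)
M = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))  # exact rank 3
M_eps = recompress(M, eps=1e-10)
```

For \(m=2\), the truncated SVD is in fact the best approximation of the given rank; the point of the stable formats discussed below is that analogous SVD-based truncations remain quasi-optimal for \(m>2\).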

### 1.3 What Is New?

The use of rank reduction techniques in iterative schemes is in principle not new; see, for example, [3, 5, 6, 21, 24, 26, 28] and the further references given in [18]. To our knowledge, corresponding approaches can be subdivided roughly into two categories. In the first one, iterates are always truncated to a fixed tensor rank. This allows one to control the complexity of the approximation, but convergence of such iterations can be guaranteed only under very restrictive assumptions (e.g., concerning highly effective preconditioners). In the second category, schemes achieve a desired target accuracy by instead prescribing an error tolerance for the rank truncations, but the corresponding ranks arising during the iteration are not controlled. A common feature of both groups of results is that they operate on a *fixed discretization* of the underlying continuous problems.

In contrast, the principal novelty of the present approach can be sketched as follows. The first key element is to show that, based on a known error bound for a given approximation to the unknown solution, a judiciously chosen recompression produces a near-minimal rank approximation to the *solution of the continuous problem* for a slightly larger accuracy tolerance. Moreover, the underlying projections are stable with respect to certain sparsity measures. As pointed out earlier, this reduction needs to be intertwined with a sufficiently accurate but possibly coarse approximation of the tensor factors. A direct coarsening of the full wavelet coefficient tensor would face the curse of dimensionality and, thus, would be practically infeasible. The second critical element is therefore to introduce certain lower-dimensional quantities, termed tensor *contractions*, from which the degrees of freedom to be discarded in the coarsening are identified. This notion of contractions also serves to define suitable sparsity classes with respect to wavelet coefficients, facilitating a computationally efficient, rigorously founded combination of tensor recompression and coefficient coarsening.
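The flavor of such contractions can be conveyed on a small dense tensor: for each mode \(i\), one records the \(\ell_2\) norm of the slice with the \(i\)th index fixed, which bounds the error committed by discarding that index. The dense computation below is for illustration only; in the actual method such quantities would be obtained from the low-rank representation, without ever forming the full coefficient tensor.

```python
import numpy as np

def contractions(u):
    """For each mode i, the l2 norms of the slices of u with the ith index
    fixed (i.e., the row norms of the mode-i matricization). These d
    one-dimensional arrays govern which indices a coarsening may discard."""
    return [np.sqrt((np.moveaxis(u, i, 0).reshape(u.shape[i], -1) ** 2).sum(axis=1))
            for i in range(u.ndim)]

rng = np.random.default_rng(2)
u = rng.standard_normal((6, 6, 6))
pi = contractions(u)

# discarding one index of mode 0 commits exactly the corresponding
# contraction value as error:
v = u.copy()
v[5, :, :] = 0.0
err = np.linalg.norm(u - v)
```

Note that \(\Vert \pi^{(i)}\Vert = \Vert \mathbf{u}\Vert\) for every mode \(i\), since the squared slice norms sum to the squared total norm; this is what makes the contractions a faithful, low-dimensional proxy for coarsening decisions.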

These concepts culminate in the main result of this paper, which can be summarized in an admittedly oversimplified way as follows.

**Meta-Theorem:** *Whenever the solution to* (7) *has certain tensor-rank approximation rates and the involved tensor factors have certain best \(N\)-term approximation rates, a judicious numerical realization of the iteration* (9) *realizes these rates. Moreover, up to logarithmic factors, the computational complexity is optimal. More specifically, for the smallest \(k\) such that the approximate solution \(\mathbf{u}_k\) satisfies \(\Vert \mathbf{u}_k -\mathbf{u}\Vert _{\ell _2}\le \tau \), the iterate \(\mathbf{u}_k\) has HT ranks that can be bounded, up to multiplication by a uniform constant, by the smallest possible HT ranks needed to realize accuracy \(\tau \).*

In the theorem that we will eventually prove we admit classes of operators with unbounded ranks, in which case the rank bounds contain a factor of the form \(|\log \tau |^c\), where \(c\) is a fixed exponent.

To our knowledge, this is the first result of this type, where convergence to the solution of the *infinite-dimensional* problem is guaranteed under realistic assumptions, and all ranks arising during the process remain proportional to the respective smallest possible ones. A rigorous proof of *rank near optimality*, using an iteration of the preceding type, is to be contrasted with approaches based on greedy approximation as studied, for example, in [7], where approximations in the (unstable) canonical format are constructed through successive greedy updates. In principle, such a procedure does not seem to offer much hope for finding minimal or near-minimal rank approximations, as the greedy search operates far from orthonormal bases, and errors committed early in the iteration cannot easily be corrected. Although variants of the related *proper generalized decomposition*, as studied in [16], can alleviate some of these difficulties, for example by employing different tensor formats, the basic issue of controlling ranks in a greedy procedure remains.

### 1.4 Layout

The remainder of the paper is devoted to developing the ingredients, and the accompanying complexity analysis, needed to make the statements of the preceding metatheorem precise. Carrying out this program raises some issues that we briefly address now, since they guide the subsequent developments.

After collecting some preliminaries in Sect. 2, we devote Sect. 3 to a pivotal element of our approach, namely, the development and analysis of suitable *recompression and coarsening schemes* that yield an approximation in the HT format, that is, for a given target accuracy, of *near-minimal* rank with possibly sparse tensor factors (in a sense to be made precise later).

Of course, one can hope that the solution of (4) is particularly tensor sparse in the sense that relatively low HT ranks already provide high accuracy if the data \(f\) are tensor sparse and if the operator \(A\) (resp. \(\mathbf {A}\)) is tensor sparse in the sense that its application does not increase ranks too drastically. Suitable models of operator classes that allow us to properly weigh tensor sparsity and wavelet expansion sparsity are introduced and analyzed in Sect. 4. The approximate application of such operators with certified output accuracy builds on the findings in Sect. 3.

Finally, in Sect. 5 we formulate an *adaptive iterative algorithm* and analyze its complexity. Starting from the coarsest possible approximation \(\mathbf{u}^0 =0\), approximations in the tensor format are built successively, where the error tolerances in the iterative scheme are updated for each step in such a way that two goals are achieved. On the one hand, the tolerances are sufficiently stringent to guarantee the convergence of the iteration up to any desired target accuracy. On the other hand, we ensure that at each stage of the iteration, the approximations remain sufficiently coarse to realize the metatheorem formulated earlier. Here we specify concrete tensor approximability assumptions on \(\mathbf{u}\), \(\mathbf {f}\), and \(\mathbf {A}\) that allow us to make its statement precise.

## 2 Preliminaries

In this section we set the notation and collect the relevant ingredients for stable tensor formats in the infinite-dimensional setting. In the remainder of this work, for simplicity’s sake, we use the abbreviation \(||\cdot ||:=||\cdot ||_{\mathrm{\ell }_{2}}\), where the \(\mathrm{\ell }_{2}\)-space is taken over the appropriate index set.

Our basic assumption is that we have a Riesz basis \(\{\varPsi _\nu \}_{\nu \in \nabla ^d}\) for \(V\), where \(\nabla \) is a countable index set. In other words, we require that the index set have a Cartesian product structure. Therefore, any \(u\in V\) can be identified with its basis coefficient sequence \({\mathbf {u}} := {(u_\nu )_{\nu \in \nabla ^d}}\) in the unique representation \(u=\sum _{\nu \in \nabla ^d}u_\nu \varPsi _\nu \), with uniformly equivalent norms. Thus, \(d\) will in general correspond to the spatial dimension of the domain of functions under consideration. In addition, it can be important to reserve the option of grouping some of the variables in a possibly smaller number \(m\le d\) of portions of variables, i.e., \(m\in \mathrm{I}\!\mathrm{N}\) and \(d = d_1 + \cdots + d_m\) for \(d_i\in \mathbb {N}\).

A canonical point of departure for the construction of \(\{ \varPsi _\nu \}\) is a collection of Riesz bases for each component Hilbert space \(H_i\) [see (6)], which we denote by \(\{ \psi ^{H_i}_\nu \}_{\nu \in \nabla ^{H_i}}\). To fit in the preceding context, we may assume without loss of generality that all \(\nabla ^{H_i}\) are identical, denoted by \(\nabla \). The precise structure of \(\nabla \) is irrelevant at this point; however, in the case where the \(\psi ^{H_i}_\nu \) are wavelets, each \(\nu =(j,k)\) encodes a dyadic level \(j=|\nu |\) and a spatial index \(k=k(\nu )\). This latter case is of particular interest since, for instance, when \(V\) is a Sobolev space, a simple rescaling of \(\psi ^{H_1}_{\nu _1} \otimes \cdots \otimes \psi ^{H_d}_{\nu _d}\) yields a Riesz basis \(\{ \varPsi _\nu \}\) for \(V\subseteq H\) as well.

A simple scenario would be \(V=H=\mathrm{L}_{2}((0,1)^d)\), which is the situation considered in our numerical illustration in Sect. 6. A second example is given by elliptic diffusion equations with stochastic coefficients. In this case, \(V= \mathrm{H}^{1}_0(\varOmega ) \otimes \mathrm{L}_{2}((-1,1)^\infty )\) and \(H = \mathrm{L}_{2}(\varOmega \times (-1,1)^\infty )\). Here a typical choice of basis for \(\mathrm{L}_{2}((-1,1)^\infty )\) consists of tensor products of polynomials on \((-1,1)\), while one can take a wavelet basis for \(\mathrm{H}^{1}_0(\varOmega )\), obtained by rescaling a standard \(\mathrm{L}_{2}\) basis. A third representative scenario concerns diffusion equations on high-dimensional product domains \(\varOmega ^d\). Here, for instance, \(V = \mathrm{H}^{1}(\varOmega ^d)\) and \(H = \mathrm{L}_{2}(\varOmega ^d)\). We shall comment on some additional difficulties that arise in the application of operators in this case in Remark 18.

*orthonormal mode frames*. It will be convenient to use the notational convention \({\mathsf {k}} = (k_1,\ldots , k_t)\), \({\mathsf {n}} = (n_1,\ldots ,n_t)\), \({\mathsf {r}} = (r_1,\ldots ,r_t)\), and so forth, for multiindices in \(\mathrm{I}\!\mathrm{N}^t_0\), \(t\in \mathrm{I}\!\mathrm{N}\). Defining for \({\mathsf {r}} \in \mathrm{I}\!\mathrm{N}_0^m\)

*core tensor*. Moreover, when the \(\mathbf{U}^{(i)}_k\), \(k\in \mathrm{I}\!\mathrm{N}\), are bases for all of \(\ell _2(\nabla ^{d_i})\), that is, \({\mathsf {K}_m}({\mathsf {r}}) =\mathrm{I}\!\mathrm{N}^m\), one has, of course, \(\mathrm{P }_\mathbb {U}\mathbf{u}=\mathbf{u}\), while for any \({\mathsf {s}}\le {\mathsf {r}}\), componentwise, the “box truncation”

At this point, we do not require the *ith mode frame* \(\mathbf{U}^{(i)}\) to have pairwise orthonormal column vectors \(\mathbf {U}^{(i)}_k\in \mathrm{\ell }_{2}(\nabla ^{d_i})\), \(k = 1,\ldots ,r_i\). However, these columns can always be orthonormalized, which results in a corresponding modification of the core tensor \(\mathbf {a} = (a_{{\mathsf {k}}})_{{\mathsf {k}}\in {\mathsf {K}_m}({\mathsf {r}})}\); for fixed mode frames, the latter is uniquely determined.

When writing sometimes for convenience \((\mathbf{U}^{(i)}_k)_{k\in \mathrm{I}\!\mathrm{N}}\), although the \(\mathbf{U}^{(i)}_k\) may be specified through (13) only for \(k\le r_i\), it will always be understood to mean \(\mathbf{U}^{(i)}_k=0\) for \(k> r_i\).

If the core tensor \(\mathbf {a}\) is represented directly by its entries, then (13) corresponds to the so-called *Tucker format* [37, 38] or *subspace representation*. The *hierarchical Tucker* format [22], as well as the special case of the *tensor train* format [33], corresponds to representations of the form (13) as well but uses a further structured tensor decomposition of the core tensor \(\mathbf {a}\) that can exploit a stronger type of *information sparsity*. For \(m=2\), the *singular value decomposition* (SVD) or its infinite-dimensional counterpart, the *Hilbert–Schmidt decomposition*, yields \(\mathbf{u}\)-dependent mode frames that even give a diagonal core tensor. Although this is no longer possible for \(m > 2\), the SVD remains the main workhorse behind the Tucker as well as the hierarchical Tucker format. For the reader’s convenience, we summarize in what follows the relevant facts for these tensor representations in a way tailored to present needs.

### 2.1 Tucker Format

It is instructive to consider first the simpler case of the Tucker format in more detail.

#### 2.1.1 Some Prerequisites

*rank vector* \({{\mathrm{rank}}}(\mathbf {u})\) by its entries

*multilinear rank* of \(\mathbf {u}\). We denote by

*rigid* Tucker class

The following result, which can be found, for example, in [14, 20, 39], ensures the existence of best approximations in \( {\mathcal {T}}({\mathsf {r}})\) also for infinite ranks.

**Theorem 1**

*mode-i singular values*.

#### 2.1.2 Higher-Order Singular Value Decomposition

The representation (20) is the main building block of the *higher-order singular value decomposition* (HOSVD) [27] for the Tucker tensor format (13). In the following theorem, we summarize its properties in the more general case of infinite-dimensional sequence spaces, where the SVD is replaced by the spectral theorem for compact operators. These facts could also be extracted from the treatment in [20, Sect. 8.3].

**Theorem 2**

- (i)
For all \(i\in \{1,\ldots ,m\}\) we have \((\sigma ^{(i)}_k)_{k\in \mathrm{I}\!\mathrm{N}}\in \mathrm{\ell }_{2}(\mathrm{I}\!\mathrm{N})\), and \(\sigma ^{(i)}_k \ge \sigma ^{(i)}_{k+1}\ge 0\) for all \(k\in \mathrm{I}\!\mathrm{N}\), where \(\sigma ^{(i)}_k\) are the mode-\(i\) singular values in (20).

- (ii)
For all \(i\in \{1,\ldots ,m\}\) and all \(p,q\in \mathrm{I}\!\mathrm{N}\), we have \(a^{(i)}_{pq} = \bigl |\sigma ^{(i)}_p\bigr |^2 \delta _{pq}\), where the \(a^{(i)}_{pq}\) are defined by (22).

- (iii)
For each \({\mathsf {r}} \in \mathrm{I}\!\mathrm{N}^m_0\), we have
$$\begin{aligned} \Bigg \Vert {\mathbf {u} - \sum _{{\mathsf {k}}\in {\mathsf {K}_m}({\mathsf {r}})} a_{\mathsf {k}} \mathbb {U}_{\mathsf {k}} }\Bigg \Vert \le \Big (\sum _{i=1}^m \sum _{k = r_i + 1}^{\infty } |\sigma ^{(i)}_k|^2\Big )^{\frac{1}{2}} \le \sqrt{m} \inf _{{{\mathrm{rank}}}(\mathbf {w})\le {\mathsf {r}}} ||\mathbf {u} - \mathbf {w}||. \end{aligned}$$
(23)

*Proof*

Property (iii) in Theorem 2 leads to a simple procedure for truncation to lower multilinear ranks, with an explicit error estimate in terms of the mode-\(i\) singular values. In this manner, one does not necessarily obtain the best approximation for a prescribed rank, but the approximation is quasi-optimal in the sense that its error exceeds the error of best approximation with the same multilinear rank by at most a factor of \(\sqrt{m}\).
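A dense-tensor sketch of this truncation procedure (illustrative only; the paper works with infinite coefficient tensors) computes the HOSVD, performs a box truncation, and checks the explicit bound from (23):

```python
import numpy as np

def hosvd(u):
    """HOSVD: orthonormal mode frames and mode-i singular values of u,
    together with the core a such that u = a contracted with the frames
    in every mode."""
    frames, sigmas = [], []
    for i in range(u.ndim):
        mat = np.moveaxis(u, i, 0).reshape(u.shape[i], -1)  # mode-i matricization
        Ui, s, _ = np.linalg.svd(mat, full_matrices=False)
        frames.append(Ui)
        sigmas.append(s)
    core = u
    for i, Ui in enumerate(frames):  # apply U_i^T in every mode
        core = np.moveaxis(np.tensordot(Ui.T, np.moveaxis(core, i, 0), axes=1), 0, i)
    return frames, sigmas, core

def truncate(frames, core, ranks):
    """Box truncation of the HOSVD to multilinear ranks `ranks`."""
    out = core[tuple(slice(r) for r in ranks)]
    for i, Ui in enumerate(frames):
        out = np.moveaxis(np.tensordot(Ui[:, :ranks[i]], np.moveaxis(out, i, 0), axes=1), 0, i)
    return out

rng = np.random.default_rng(3)
u = rng.standard_normal((8, 8, 8))
frames, sigmas, core = hosvd(u)
ranks = (3, 3, 3)
v = truncate(frames, core, ranks)
err = np.linalg.norm(u - v)
# explicit bound from (23): err <= ( sum_i sum_{k >= r_i} sigma_{i,k}^2 )^(1/2)
bound = np.sqrt(sum((s[r:] ** 2).sum() for s, r in zip(sigmas, ranks)))
```

With full ranks, `truncate` reproduces `u` exactly, reflecting that the mode frames span the full mode spaces.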

**Corollary 1**

While projections to subspaces spanned by the \(\mathbb {U}_{\mathsf {k}}(\mathbf{u})\), \({\mathsf {k}}\in {\mathsf {K}_m}({\mathsf {r}})\), do not, in general, realize the best approximation from \({\mathcal {T}}({\mathsf {r}})\) [only from \({\mathcal {T}}(\mathbb {U}(\mathbf{u}),{\mathsf {r}})\)], exact best approximations are still orthogonal projections based on suitable mode frames.

**Corollary 2**

*Proof*

*Remark 1*

### 2.2 Hierarchical Tucker Format

The Tucker format as it stands, in general, still gives rise to an increase of degrees of freedom that is exponential in \(d\). One way to mitigate the curse of dimensionality is to further decompose the core tensor \(\mathbf{a}\) in (13). We now briefly formulate the relevant notions concerning the *hierarchical Tucker format* in the present sequence space context, following essentially the developments in [17, 22]; see also [20].

#### 2.2.1 Dimension Trees

**Definition 1**

- (i)
\(\{1,\ldots ,m\} \in \mathcal {D}_{m}\), and for each \(i\in \{1,\ldots ,m\}\) we have \(\{i\} \in \mathcal {D}_{m}\).

- (ii)
Each \(\alpha \in \mathcal {D}_{m}\) is either a singleton or there exist unique disjoint \(\alpha _1, \alpha _2 \in \mathcal {D}_{m}\), called *children* of \(\alpha \), such that \(\alpha = \alpha _1 \cup \alpha _2\).

We refer to the singletons as *leaves*, to \({0_{m}} := \{1,\ldots ,m\}\) as the *root*, and to elements of \({\mathcal {I}}(\mathcal {D}_{m}) := \mathcal {D}_{m}\setminus \bigl \{{0_{m}} ,\{1\},\ldots ,\{m\} \bigr \}\) as *interior nodes*. The set of leaves is denoted by \(\mathcal{L}(\mathcal {D}_{m})\), where we additionally set \(\mathcal{N}(\mathcal {D}_{m}) := \mathcal {D}_{m}\setminus \mathcal {L}(\mathcal {D}_m) = {\mathcal {I}}(\mathcal {D}_{m})\cup \{ {0_{m}}\}\). When an enumeration of \(\mathcal{L}(\mathcal {D}_{m})\) is required, we shall always assume the ascending order with respect to the indices, i.e., in the form \(\{\{1\},\ldots , \{m\}\}\).

Note that for a binary dimension tree as defined earlier, \(\# \mathcal {D}_{m} = 2m-1\) and \(\#\mathcal{N}(\mathcal {D}_{m}) = m-1\).

*Remark 2*

The restriction to binary trees in Definition 1 is not necessary, but it leads to the most favorable complexity estimates for algorithms operating on the resulting tensor format. With this restriction dropped, the Tucker format (13) can be treated in the same framework, with the \(m\)-ary dimension tree consisting only of root and leaves, i.e., \(\bigl \{ {0_{m}}, \{1\},\ldots ,\{m\} \bigr \}\). In principle, all subsequent results carry over to more general dimension trees (see [15, Sect. 5.2]).
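These tree structures are easy to make concrete. The sketch below (with hypothetical helper names of our own) builds the balanced binary dimension tree as well as the degenerate one underlying the TT format, and recovers the counts \(\#\mathcal {D}_{m} = 2m-1\) and \(\#\mathcal{N}(\mathcal {D}_{m}) = m-1\):

```python
def balanced_tree(modes):
    """Balanced binary dimension tree over the tuple `modes`:
    every non-singleton node is split into two halves."""
    nodes = [modes]
    if len(modes) > 1:
        h = len(modes) // 2
        nodes += balanced_tree(modes[:h]) + balanced_tree(modes[h:])
    return nodes

def linear_tree(modes):
    """Degenerate binary tree: split off one mode at a time
    (the tree shape corresponding to the TT format)."""
    nodes = [modes]
    if len(modes) > 1:
        nodes += [modes[:1]] + linear_tree(modes[1:])
    return nodes

m = 8
tree = balanced_tree(tuple(range(1, m + 1)))
leaves = [a for a in tree if len(a) == 1]
```

Both constructions produce \(2m-1\) nodes in total, of which \(m\) are leaves; they differ only in depth (logarithmic versus linear in \(m\)), which is what drives the complexity estimates mentioned above.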

**Definition 2**

*hierarchical mode frames*. In addition, these are called *orthonormal* if for all \(\alpha \in \mathcal {D}_{m}\setminus \{{0_{m}}\}\) we have \(\langle \mathbf{U}^{(\alpha )}_i, \mathbf{U}^{(\alpha )}_j\rangle = \delta _{ij}\) for \(i,j=1,\ldots ,k_\alpha \), and *nested* if

Again, to express that \(\mathbb {U}\) is associated with the hierarchical format, we sometimes write \(\mathbb {U}^{{\mathcal {H}}}\). Of course, \(\mathbb {U}^{{\mathcal {H}}}\) depends on the dimension tree \(\mathcal{D}_m\), which will be kept fixed in what follows.

*hierarchical rank vector* associated with \(\mathbf{u}\). Since in what follows the dimension tree \(\mathcal {D}_{m}\) will be kept fixed, we suppress the corresponding subscript in the rank vector.

*hierarchical format* with ranks \({\mathsf {r}}\).

The observation that the specific systems of hierarchical mode frames \(\mathbb {U}(\mathbf{u})\) have the following *nestedness* property, including the root element, will be crucial. The following fact was established in a more generally applicable framework of minimal subspaces in [20] (cf. Corollary 6.18 and Theorem 6.31 there).

**Proposition 1**

*compatibility conditions* on the rank vectors \({\mathsf {r}}\). In fact, it readily follows from Proposition 1 that for \(\alpha \in \mathcal{D}_m\setminus \mathcal{L}(\mathcal{D}_m)\) one has \({{\mathrm{rank}}}_\alpha (\mathbf{u})\le {{\mathrm{rank}}}_{c_1(\alpha )}(\mathbf{u}) {{\mathrm{rank}}}_{c_2(\alpha )}(\mathbf{u})\). For necessary and sufficient conditions on a rank vector \({\mathsf {r}}=(r_\alpha )_{\alpha \in \mathcal{D}_m\setminus \{{0_{m}}\}}\) for the existence of corresponding nested hierarchical mode frames, we refer to [20, Sect. 11.2.3]. In what follows we denote by

Following [14, 20], we can now formulate the analog to Theorem 1.

**Theorem 3**

*Example 1*

*Example 2*

A tensor train (TT) representation for \(m=4\) as in Example 1 would correspond to \(\mathcal {D}_{4} = \bigl \{ \{1,2,3,4\}, \{1\}, \{2,3,4\}, \{2\}, \{3,4\}, \{3\}, \{4\} \bigr \}\), i.e., a degenerate instead of a balanced binary tree. More precisely, the special case of the hierarchical Tucker format resulting from this type of tree has also been considered under the name *extended TT format* [32].

#### 2.2.2 Hierarchical Singular Value Decomposition

For any given \(\mathbf{u}\in \ell _2(\nabla ^d)\) the decomposition (36), with \(\mathbf {a}\) defined by (38), can be regarded as a generalization of the HOSVD, which we shall refer to as a *hierarchical singular value decomposition* or \({\mathcal {H}}\)SVD. The next theorem summarizes the main properties of this decomposition in the present setting. The finite-dimensional versions of the following claims were established in [17]. All arguments given there carry over to the infinite-dimensional case, as in the proof of Theorem 2.

**Theorem 4**

- (i)
\(\langle \mathbf {U}^{(i)}_k, \mathbf {U}^{(i)}_l \rangle = \delta _{kl}\) for \(i=1,\ldots ,m\) and \(k,l\in \mathrm{I}\!\mathrm{N}\);

- (ii)
\({{\mathrm{rank}}}_{{0_{m}}}(\mathbf{u}) = 1\), \(||\mathbf {B}^{({0_{m}},1)}|| = ||\mathbf{u}||\), and \(\mathbf {B}^{({0_{m}},k)} = 0\) for \(k>1\);

- (iii)
\(\langle \mathbf {B}^{(\alpha ,k)}, \mathbf {B}^{(\alpha ,l)}\rangle = \delta _{kl}\) for \(\alpha \in {\mathcal {I}}(\mathcal {D}_{m})\) and \(k,l\in \mathrm{I}\!\mathrm{N}\);

- (iv)
for all \(i\in \{1,\ldots ,m\}\) we have \((\sigma ^{(i)}_k)_{k\in \mathrm{I}\!\mathrm{N}}\in \mathrm{\ell }_{2}(\mathrm{I}\!\mathrm{N})\), and \(\sigma ^{(i)}_k \ge \sigma ^{(i)}_{k+1}\ge 0\) for all \(k\in \mathrm{I}\!\mathrm{N}\);

- (v)
for all \(i\in \{1,\ldots ,m\}\) we have \(a^{(i)}_{pq} = \bigl |\sigma ^{(i)}_p\bigr |^2 \delta _{pq}\), \(1\le p,q\le \mathrm{rank}_i(\mathbf{u})\).

#### 2.2.3 Projections

The corresponding *hierarchical* \(\mathbb {V}\)-*rigid* tensor class of rank \({\mathsf {r}}\) is given by

In analogy to (12), we address next a truncation of hierarchical ranks to \({\tilde{{\mathsf {r}}}}\le {\mathsf {r}}\) for elements in \({\mathcal {H}}(\mathbb {V},{\mathsf {r}})\), where \(\mathbb {V}\) is a given system of orthonormal and nested mode frames with ranks \({\mathsf {r}}\). We assume first that \({\tilde{{\mathsf {r}}}}\) belongs also to \(\mathcal {R}_{\mathcal {H}}\). The main point is that an approximation with restricted mode frames can still be realized through an operation represented as a *sequence* of projections involving the given mode frames from \(\mathbb {V}\). However, the order in which these projections are applied now matters.

Specifically, given \(\mathbf{u}\in \ell _2(\nabla ^d)\), we can choose \(\mathbb {V}=\mathbb {U}(\mathbf{u})\) provided by the \({\mathcal {H}}\)SVD; see (30). Hence, \(\mathrm{P }_{\mathbb {U}(\mathbf{u}),{\tilde{{\mathsf {r}}}}}\mathbf{u}\) gives the truncation of \(\mathbf{u}\) based on the \({\mathcal {H}}\)SVD. For this particular truncation an error estimate, in terms of the error of best approximation with rank \({{\tilde{{\mathsf {r}}}}}\), is given in Theorem 5 below.

*Remark 3*

By (40) we have a representation of \(\tilde{\mathbf{u}}:=\mathrm{P }_{\mathbb {U}(\mathbf{u}),{\tilde{{\mathsf {r}}}}} \mathbf{u}\) in terms of a sequence of noncommuting orthogonal projections. When \({\tilde{{\mathsf {r}}}}\le {\mathsf {r}}\) does not belong to \(\mathcal {R}_{\mathcal {H}}\), the operator defined by (40) is still a projection that, however, modifies the mode frames for those nodes \(\alpha \in \mathcal {N}(\mathcal{D}_m)\) for which the rank compatibility conditions are violated. The resulting projected mode frames are then nested, that is, \(\tilde{\mathbf{u}}\) may again be represented in terms of the orthonormal and nested mode frames \(\tilde{\mathbb {U}} := \mathbb {U}(\tilde{\mathbf{u}})\).

The situation is simplified if we consider the projection to a fixed *nested* system of mode frames, without a further truncation of ranks that could entail non-nestedness.

**Lemma 1**

*Proof*

#### 2.2.4 Best Approximation

**Theorem 5**

**Corollary 3**

*Proof*

*Remark 4*

## 3 Recompression and Coarsening

The central tool is a *subspace correction* leading to tensor representations with ranks at least close to minimal ones. This consists in deriving from the *known* \(\mathbf{v}\) a *near-best* approximation to the *unknown* \(\mathbf{u}\), where the notion of near best in terms of ranks is made precise below. Specifically, suppose that \(\mathbf{v}\in \mathrm{\ell }_{2}(\nabla ^d)\) is an approximation of \(\mathbf{u}\in \mathrm{\ell }_{2}(\nabla ^d)\), which for some \(\eta >0\) satisfies

*Remark 5*

As mentioned earlier, for \({\mathcal {F}}= {\mathcal {H}}\) the preceding notions depend on the dimension tree \(\mathcal{D}_m\). Since \(\mathcal{D}_m\) is fixed, we dispense with a corresponding notational reference.

### 3.1 Tensor Recompression

Given \(\mathbf {u}\in \mathrm{\ell }_{2}(\nabla ^d)\), in what follows, by \(\mathbb {U}(\mathbf{u})\) we either mean \(\mathbb {U}^{{\mathcal {T}}}(\mathbf{u})\) or \(\mathbb {U}^{{\mathcal {H}}}(\mathbf{u})\); see (25), (30).

We introduce next two notions of minimal ranks \(\mathrm{r }(\mathbf{u},\eta ), {\bar{\mathrm{r }}}(\mathbf{u}, \eta )\) for a given target accuracy \(\eta \), one for the specific mode frame system \(\mathbb {U}(\mathbf{u})\) provided by either HOSVD or \({\mathcal {H}}\)SVD, and one for the respective best mode frame systems.

**Definition 3**

**Lemma 2**

In other words, the ranks of \({\hat{\mathrm{P }}}_{\kappa _\mathrm{P}(1+\alpha )\eta }\mathbf{v}\) are bounded by the minimum ranks required to realize a somewhat higher accuracy.

*Proof*

Thus, appropriately coarsening \(\mathbf{v}\) yields an approximation to \(\mathbf{u}\) of still the same quality up to a fixed (dimension-dependent) constant, where the rank of this new approximation is bounded by a minimal rank of a *best* Tucker or hierarchical Tucker approximation to \(\mathbf{u}\) for somewhat higher accuracy.

**Definition 4**

*growth sequence*. For a given growth sequence \(\gamma \) we define

*admissible* if

In the particular case where \(\gamma (n)\sim n^{s}\) for some \(s >0\), \(||\mathbf{v}||_{{{\mathcal {A}}_{\mathcal {F}}({\gamma })}}:= ||\mathbf{v}|| + |\mathbf{v}|_{{{\mathcal {A}}_{\mathcal {F}}({\gamma })}}\) is a quasi-norm and \({{\mathcal {A}}_{\mathcal {F}}({\gamma })}\) is a linear space.

*Remark 6*

For the subsequent developments it will be helpful to keep in mind the following way of reading \(\mathbf{v}\in {{\mathcal {A}}_{\mathcal {F}}({\gamma })}\): a given target accuracy \(\varepsilon \) can be realized at the expense of ranks of size \(\gamma ^{-1}(|\mathbf{v}|_{{{\mathcal {A}}_{\mathcal {F}}({\gamma })}}/\varepsilon )\) so that a rank bound of the form \(\gamma ^{-1}(C|\mathbf{v}|_{{{\mathcal {A}}_{\mathcal {F}}({\gamma })}}/\varepsilon )\), where \(C\) is any constant, marks a near-optimal performance.
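For the exponential growth sequences that appear later in Assumption 1, \(\gamma (n)=e^{dn^{1/b}}\), the inverse is explicit: \(\gamma ^{-1}(t) = (d^{-1}\ln t)^{b}\). The following snippet, with hypothetical parameter values, illustrates how slowly the rank bound \(\gamma ^{-1}(|\mathbf{v}|_{{{\mathcal {A}}_{\mathcal {F}}({\gamma })}}/\varepsilon )\) grows as the target accuracy decreases:

```python
import math

def rank_bound(norm_A, eps, d=1.0, b=2.0):
    """gamma^{-1}(norm_A / eps) for gamma(n) = exp(d * n**(1/b)).

    Returns a real-valued bound; the actual rank is its ceiling.
    The parameters d, b are illustrative placeholders.
    """
    return (math.log(norm_A / eps) / d) ** b

r1 = rank_bound(1.0, 1e-2)  # target accuracy 1e-2
r2 = rank_bound(1.0, 1e-4)  # target accuracy 1e-4
# squaring eps doubles ln(1/eps), so the bound grows only by the factor 2**b = 4
assert abs(r2 / r1 - 4.0) < 1e-12
```

Thus the ranks grow only polylogarithmically in \(1/\varepsilon \), in stark contrast to the algebraic growth \(\varepsilon ^{-1/s}\) associated with classical approximation rates.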

**Theorem 6**

*Proof*

### 3.2 Coarsening of Mode Frames

We now turn to a second type of operation for reducing the complexity of given coefficient sequences in tensor representation, an operation that coarsens mode frames by discarding basis indices whose contribution is negligible. We shall use the following standard notions for best \(N\)-term approximations.

**Definition 5**

*approximation classes*. For \(s >0\), we denote by \({{\mathcal {A}}^s}(\nabla ^{\hat{d}})\) the set of \(\mathbf {v}\in \mathrm{\ell }_{2}(\nabla ^{\hat{d}})\) such that

*Remark 7*

The same comment as in Remark 6 applies. Thinking of the growth sequence as being \(\gamma _s(n)=(n+1)^s\), realizing an accuracy \(\varepsilon \) at the expense of \((C ||\mathbf {v}||_{{{\mathcal {A}}^s}(\nabla ^{\hat{d}})} /\varepsilon )^{1/s}\) terms, where \(C\) is a constant independent of \(\varepsilon \), signifies an *optimal work-accuracy balance* over the class \({{\mathcal {A}}^s}(\nabla ^{\hat{d}})\).

We deliberately restrict the discussion to polynomial decay rates here since this corresponds to finite Sobolev or Besov regularity. However, with appropriate modifications, the subsequent considerations can be adapted also to approximation classes corresponding to more general growth sequences.
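The notion of best \(N\)-term approximation underlying \({{\mathcal {A}}^s}\) is easy to state operationally: retain the \(N\) largest coefficients in absolute value. A small NumPy sketch (the sequence and decay exponent are illustrative test data) shows the resulting \(N^{-s}\) error decay for coefficients with \(v_k \sim k^{-(s+1/2)}\):

```python
import numpy as np

def best_n_term(v, N):
    """Keep the N largest-magnitude entries of v (a best N-term approximation in l2)."""
    idx = np.argsort(np.abs(v))[::-1][:N]
    w = np.zeros_like(v)
    w[idx] = v[idx]
    return w

# for v_k ~ k^{-(s+1/2)}, the N-term l2 error decays like N^{-s}, i.e., v lies in A^s
s = 1.0
v = np.arange(1, 10001, dtype=float) ** (-(s + 0.5))
for N in (10, 100, 1000):
    err = np.linalg.norm(v - best_n_term(v, N))
    assert err <= 2.0 * N ** (-s)  # error comparable to N^{-s} up to a constant
```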

#### 3.2.1 Tensor Contractions

An exhaustive search through a (finitely supported) sequence \(\mathbf{u}\in \ell _2(\nabla ^d)\) for its largest entries would suffer from the curse of dimensionality. Being content with *near-best* \(N\)-term approximations, one can get around this by introducing, for each given \(\mathbf{u}\in \ell _2(\nabla ^d)\), the following quantities, formed from certain contractions of the tensor \(\mathbf{u}\otimes \mathbf{u}\), that are given by \(\mathrm{diag}(T^{(i)}_\mathbf{u}(T^{(i)}_\mathbf{u})^*)\).

**Definition 6**

With a slight abuse of terminology, we shall refer to these \(\pi ^{(i)}(\cdot )\) simply as *contractions*. Their direct computation would involve high-dimensional summations over the index sets \(\nabla ^{d-d_i}\). However, the following observations show how this can be avoided. This makes essential use of the particular orthogonality properties of the tensor formats.

**Proposition 2**

- (i)
We have \(||\mathbf{u}|| = ||\pi ^{(i)}(\mathbf{u})||\), \(i=1,\ldots ,m\).

- (ii)Let \(\varLambda ^{(i)}\subseteq \nabla ^{d_i}\); then$$\begin{aligned} ||\mathbf{u}- \mathrm{R }_{\varLambda ^{(1)} \times \cdots \times \varLambda ^{(m)}} \mathbf{u}|| \le \Big (\sum _{i=1}^m \sum _{\nu \in \nabla ^{d_i}\setminus \varLambda ^{(i)}} |\pi ^{(i)}_\nu (\mathbf{u})|^2 \Big )^{\frac{1}{2}} . \end{aligned}$$(55)
- (iii)Let in addition \(\mathbf {U}^{(i)}\) and \(\mathbf {a}\) be mode frames and core tensor, respectively, as in Theorem 2 or 4, and let \((\sigma ^{(i)}_k)\) be the corresponding sequences of mode-\(i\) singular values. Then$$\begin{aligned} \pi ^{(i)}_\nu (\mathbf {u}) = \Big ( \sum _{k} \bigl |\mathbf {U}^{(i)}_{\nu , k}\bigr |^2 \bigl |\sigma ^{(i)}_{k}\bigr |^2 \Big )^{\frac{1}{2}},\quad \nu \in \nabla ^{d_i}. \end{aligned}$$(56)
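Identity (56) and property (i) are easy to verify numerically: \(\pi ^{(i)}(\mathbf{u})\) consists of the \(\ell _2\) norms of the rows of the mode-\(i\) matricization, which, by orthonormality of the right singular vectors, can be computed from the mode frame and singular values alone. A dense-array sketch on random test data:

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.standard_normal((5, 6, 7))

i = 0  # mode index
# mode-i matricization T^{(i)}_u: mode i as rows, remaining modes as columns
mat = np.moveaxis(u, i, 0).reshape(u.shape[i], -1)
# direct contraction: pi^{(i)}_nu are the row norms of the matricization
pi_direct = np.linalg.norm(mat, axis=1)
# via the mode-i SVD, as in (56): pi_nu = (sum_k |U_{nu,k}|^2 |sigma_k|^2)^{1/2}
U, sigma, _ = np.linalg.svd(mat, full_matrices=False)
pi_svd = np.sqrt(np.abs(U) ** 2 @ sigma**2)
assert np.allclose(pi_direct, pi_svd)
# property (i): ||u|| = ||pi^{(i)}(u)||
assert np.isclose(np.linalg.norm(u), np.linalg.norm(pi_direct))
```

This is precisely why the contractions avoid high-dimensional summations: given the orthogonality properties of the format, only the mode frame and the singular values of one mode enter.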

*Proof*

The following subadditivity property is an immediate consequence of the triangle inequality.

**Proposition 3**

**Proposition 4**

*Proof*

*coarsening operator*

*best tensor coarsening* operator

**Lemma 3**

*Proof*

#### 3.2.2 Combination of Tensor Recompression and Coarsening

Recall that we use \(\Vert \cdot \Vert _{{{\mathcal {A}}^s}}\) to quantify the sparsity of the wavelet expansions of mode frames and \(\Vert \cdot \Vert _{{{\mathcal {A}}_{\mathcal {F}}({\gamma })}}\) to quantify low-rank approximability. The following main result of this section applies again to both the Tucker and the hierarchical Tucker format. It extends Theorem 6 in combining tensor recompression and wavelet coarsening and shows that both reduction techniques combined are optimal up to uniform constants and stable in the respective sparsity norms.

**Theorem 7**

*Proof*

Taking (49) in Lemma 2 and the definition (69) into account, the relation (72) follows from the triangle inequality.

The statements in (73) follow from Theorem 6. Note that the additional mode frame coarsening considered here does not affect these estimates.

## 4 Adaptive Approximation of Operators

Whether the solution to an operator equation actually exhibits some tensor and expansion sparsity is expected to depend strongly on the structure of the involved operator. The purpose of this section is to formulate a class of operators that are tensor-friendly in the sense that their approximate application does not increase the rank too much. Making this precise requires some *model assumptions* that at this point we feel are relevant in that a wide range of interesting cases is covered. But of course, many possible variants would be conceivable as well. In that sense the main issue in the subsequent discussion is to identify the essential structural mechanisms that would still work under somewhat different model assumptions.

We first consider operators with an *exact* low-rank structure. Of course, assuming that the operator is a single tensor product of operators acting on functions of a smaller number of variables would be far too restrictive and would also concern a trivial scenario, since ranks would be preserved. More interesting are *sums* of tensor products, such as the \(m\)-dimensional *Laplacian*, which sums tensor products of *identity operators* acting on all but the \(j\)th variable with the second-order partial derivative with respect to the \(j\)th variable. Hence the wavelet representation \(\mathbf {A}\) of \(\varDelta \) in an \(L_2\)-orthonormal wavelet basis has the form

At a second stage it is important to cover also operators that do not have an explicit low-rank structure but can be approximated in a quantified manner by low-rank operators. A typical example are potential terms, such as those arising in electronic structure calculations (e.g., [2] and references cited therein), as well as the rescaled versions of operators of the type (78) mentioned earlier.

### 4.1 Operators with Explicit Low-Rank Form

#### 4.1.1 Tucker Format

*Example 3*

*nearly sparse*, as will be quantified next. To this end, suppose that for each \(\mathbf {A}^{(i)}_{n_i}\) we have a sequence of approximations (in the spectral norm) such that for a given sequence \(\varepsilon ^{(i)}_{n_i,p} \), \(p\in \mathrm{I}\!\mathrm{N}_0\), of tolerances,

*partition* \(\{ \varLambda ^{(i)}_{n_i,[p]} \}_{p\in \mathrm{I}\!\mathrm{N}_0}\) of \(\nabla ^{d_i}\).

*approximations* \( \tilde{\mathbf {A}}\) to \(\mathbf {A}\) of the form

**Lemma 4**

*Proof*

*Remark 8*

**Definition 7**

Let \(\varLambda \) be a countable index set, and let \(s^* > 0\). We call an operator \(\mathbf {B}:\mathrm{\ell }_{2}(\varLambda )\rightarrow \mathrm{\ell }_{2}(\varLambda )\) \(s^*\)-*compressible* if for any \(0 < s < s^*\) there exist summable positive sequences \((\alpha _j)_{j\ge 0}\), \((\beta _j)_{j\ge 0}\) and for each \(j\ge 0\) there exists \(\mathbf {B}_j\) with at most \(\alpha _j 2^j\) nonzero entries per row and column such that \(||\mathbf {B} - \mathbf {B}_j|| \le \beta _j 2^{-s j} \). For a given \(s^*\)-compressible operator \(\mathbf {B}\) we denote the corresponding sequences by \(\alpha (\mathbf {B})\), \(\beta (\mathbf {B})\).
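A concrete instance of Definition 7, sketched under illustrative assumptions: a matrix with exponentially decaying off-diagonal entries (a standard model for wavelet representations of differential operators, cf. [36]) is compressed by keeping a band of half-width \(2^j\), giving \({\mathcal {O}}(2^j)\) nonzeros per row and column and a spectral-norm error below \(2^{-sj}\) for the chosen decay. The decay exponent and \(s=1\) are our choices for the sketch, not values from the paper:

```python
import numpy as np

n, s = 64, 1.0
# model operator with exponential off-diagonal decay |B_{ik}| = 2^{-1.5|i-k|}
i, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
B = 2.0 ** (-1.5 * np.abs(i - k))

def compress(B, j):
    """B_j: keep a band of half-width 2^j around the diagonal
    (at most 2**(j+1) + 1 nonzeros per row and column)."""
    i, k = np.meshgrid(np.arange(B.shape[0]), np.arange(B.shape[1]), indexing="ij")
    return np.where(np.abs(i - k) <= 2**j, B, 0.0)

for j in range(1, 5):
    err = np.linalg.norm(B - compress(B, j), 2)  # spectral norm
    assert err <= 2.0 ** (-s * j)  # ||B - B_j|| <= beta_j 2^{-s j} with beta_j = 1
```

The actual compressibility range \(s^*\) of a wavelet-represented operator depends on the operator and the basis; quantitative statements of this kind are the subject of [9, 36].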

A *family* of operators \(\{ \mathbf {B}(n) \}_n\) is *equi*-\(s^*\)-*compressible* if all \(\mathbf {B}(n)\) are \(s^*\)-compressible with the same choice of sequences \((\alpha _j)\), \((\beta _j)\) and, in addition, for all \(\lambda \in \varLambda \) the number of nonzero elements in the rows and columns of the approximations \(\mathbf {B}(n)_j\) can be estimated jointly for all \(n\) in the form

*Example 4*

If \(\mathbf {B}(n) \in \mathcal {M}_{c(n),\sigma (n),\beta (n)}\) with \(c(n)\) and \(\sigma (n)^{-1},\beta (n)^{-1}\) uniformly bounded, then from the construction in the proof of [9, Proposition 3.4] it can be seen that the \(\mathbf {B}(n)\) are equi-\(s^*\)-compressible with \(s^* = \min \{\inf _n\sigma (n)-1/2, \inf _n\beta (n)-1\}\) since the same set of nonzero matrix entries can be used for each \(n\).

The key property of \(s^*\)-compressible matrices in the context of adaptive methods is that they are not only bounded in \(\ell _2\) but also on the smaller approximation spaces and, thus, preserve sparsity in a quantifiable manner. We wish to establish such concepts next for the tensor setting.

**Lemma 5**

*Proof*

*Remark 9*

**Theorem 8**

*Proof*

*Remark 10*

*Remark 11*

*Proof*

The sorting of entries of \(\pi ^{(i)}(\mathbf {v})\) required to obtain the index sets of best \(2^j\)-term approximations in Theorem 8 can be replaced by an *approximate sorting* by binary binning, requiring only \(\#{{\mathrm{supp}}}_i(\mathbf {v})\) operations, as suggested in [4, 29]. This only leads to a change in the generic constants in the resulting estimates.
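A minimal sketch of such binary binning (our own illustration, not the implementation of [4, 29]): entries are grouped by magnitude into dyadic bins \([2^{-j-1}M, 2^{-j}M)\), which yields an approximate decreasing ordering in which magnitudes within a bin differ by at most a factor of \(2\):

```python
import numpy as np

def approx_sort_binning(values):
    """Approximate decreasing sort by binary binning.

    Entries fall into bin j if their magnitude lies in [2^{-j-1} M, 2^{-j} M),
    M being the maximum magnitude. A counting sort over the ~60 bins would give
    O(n) work; np.argsort over the small integer bin labels is used for brevity.
    """
    a = np.abs(np.asarray(values, dtype=float))
    M = a.max()
    with np.errstate(divide="ignore"):
        j = np.floor(np.log2(np.where(a > 0, M / a, np.inf)))
    j = np.clip(j, 0, 61).astype(int)  # exact zeros land in the last bin
    return np.argsort(j, kind="stable")

v = np.array([0.9, 0.1, 0.5, 0.05, 0.8])
order = approx_sort_binning(v)
# the leading element lies in the top bin, i.e., within a factor 2 of the maximum
assert abs(v[order[0]]) >= 0.5 * np.abs(v).max()
```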

Let \(\mathbf {v}\) have the HOSVD \(\mathbf {v} = \sum _k a_k \mathbb {U}_k\). Then, on the one hand, we need to form the core tensor for the result, which takes \(\prod _{i=1}^m R_i r_i\) operations, and, on the other hand, we need to evaluate the approximations to \(\tilde{\mathbf {A}}^{(i)}_{n_i} \mathbf {U}^{(i)}_{k_i}\) for \(n_i=1,\ldots , R_i\) and \(k_i=1,\ldots ,r_i\). The number of operations for each of these terms can be estimated as in [9], which leads to (104). \(\square \)

As the first term on the right-hand side of (104) shows, the Tucker format still suffers from the curse of dimensionality due to the complexity of the core tensors.

#### 4.1.2 Hierarchical Tucker Format

*Example 5*

*Remark 12*

Comparing the first summand on the right-hand side of (108) to that in (104), we observe a substantial reduction in complexity regarding the dependence on \(m\) (and, hence, \(d\)).

### 4.2 Low-Rank Approximations of Operators

In many applications of interest, the involved operators do not have an explicit low-rank form, but there exist efficient approximations to these operators in low-rank representation.

Such a case can be handled by replacing a given operator \(\mathbf {A}\) by such an approximation and then applying the construction for operators given in low-rank form as in the previous subsections.

We then require the resulting family of low-rank approximations to \(\mathbf {A}\) to be *uniformly* \(s^*\)-*compressible*.

## 5 An Adaptive Iterative Scheme

### 5.1 Formulation and Basic Convergence Properties

**Proposition 5**

Let the step size \(\omega >0\) in Algorithm 5.1 satisfy \(||\mathrm{I}- \omega \mathbf {A}|| \le \rho < 1\). Then the intermediate steps \(\mathbf {u}_{k}\) of Algorithm 5.1 satisfy \(||\mathbf {u}_k - \mathbf {u}|| \le \theta ^k\delta \), and in particular, the output \(\mathbf {u}_\varepsilon \) of Algorithm 5.1 satisfies \(||\mathbf {u}_\varepsilon - \mathbf {u}|| \le \varepsilon \).

*Proof*
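In a finite-dimensional dense setting, the iteration of Proposition 5 reduces to a damped Richardson iteration with an a priori stopping criterion. The following sketch uses an illustrative diagonal SPD matrix; with eigenvalues in \([1,2]\) and \(\omega = 2/3\) one has \(\Vert \mathrm{I} - \omega \mathbf {A}\Vert \le 1/3 < \theta = 1/2 < 1\), so the stated error bound applies:

```python
import numpy as np

def richardson(A, f, omega, eps, delta, theta=0.5):
    """Damped Richardson iteration u_{k+1} = u_k + omega*(f - A u_k).

    If ||I - omega*A|| <= rho <= theta < 1 and ||u_0 - u|| <= delta, the error
    after k steps is at most theta**k * delta; stop once this bound drops below eps.
    """
    u = np.zeros_like(f)
    k, err_bound = 0, delta
    while err_bound > eps:
        u = u + omega * (f - A @ u)
        k += 1
        err_bound = theta**k * delta
    return u

# illustrative SPD example with known solution u = (1, 2, 2)
A = np.diag([1.0, 1.5, 2.0])
f = np.array([1.0, 3.0, 4.0])
u_exact = np.array([1.0, 2.0, 2.0])
u = richardson(A, f, omega=2 / 3, eps=1e-6, delta=np.linalg.norm(u_exact))
assert np.linalg.norm(u - u_exact) <= 1e-6
```

In Algorithm 5.1 each such step is additionally followed by recompression and coarsening, which is where the rank and sparsity control of Sect. 3 enters.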

### 5.2 Complexity

Quite in the spirit of adaptive wavelet methods we analyze the performance of the foregoing scheme by comparing it to an *optimality benchmark* addressing the following question: suppose an unknown solution exhibits a certain (unknown) rate of tensor approximability where the involved tensors have a certain (unknown) best \(N\)-term approximability with respect to their wavelet representations. Does the scheme automatically recover these rates? Thus, unlike the situation in wavelet analysis, we are dealing here with *two* types of approximation, and the choice of corresponding rates as a benchmark model should, of course, be representative for relevant application scenarios. For the present complexity analysis we focus on growth sequences of a *subexponential or exponential* type for the involved low-rank approximations, combined with an *algebraic* approximation rate for the corresponding tensor mode frames. The rationale for this choice is as follows. Approximation rates in classical methods are governed by the *regularity* of the approximand, which, unless the approximand is analytic, results in algebraic rates suffering from the curse of dimensionality. However, functions of many variables may very well exhibit a high degree of tensor sparsity without being very regular in the Sobolev or Besov sense. Therefore, fast tensor rates, combined with polynomial rates for the compressibility of the mode frames, mark an ideal target scenario for tensor methods since, as will be shown, the curse of dimensionality can be significantly ameliorated without requiring excessive regularity.

The precise formulation of our benchmark model reads as follows.

**Assumption 1**

- (i)
\(\mathbf{u}\in {{\mathcal {A}}_{\mathcal {H}}({\gamma _\mathbf{u}})}\) with \(\gamma _\mathbf{u}(n) = e^{d_\mathbf{u}n^{1/b_\mathbf{u}}}\) for some \(d_\mathbf{u}>0\), \(b_\mathbf{u}\ge 1\).

- (ii)
\(\mathbf {A}\) satisfies (109) for an \(M_\mathbf {A} > 0\), with \(\gamma _\mathbf {A}(n) = e^{d_\mathbf {A} n^{1/b_\mathbf {A}}}\), where \(d_\mathbf {A} >0\), \(b_\mathbf {A} \ge 1\).

- (iii)
Furthermore, let \(\mathbf {f} \in {{\mathcal {A}}_{\mathcal {H}}({\gamma _\mathbf {f}})}\) with \(\gamma _\mathbf {f}(n) = e^{d_\mathbf {f} \, n^{1/b_\mathbf {f}}}\), where \(d_\mathbf {f} = \min \{ d_\mathbf{u}, d_\mathbf {A} \}\) and \(b_\mathbf {f} = b_\mathbf{u}+ b_\mathbf {A}\).

- (iv)
\(\pi ^{(i)}(\mathbf{u}) \in {{\mathcal {A}}^s}\) for \(i=1,\ldots ,m\), for any \(s\) with \(0 < s <s^*\).

- (v)
The low-rank approximations to \(\mathbf {A}\) are uniformly \(s^*\)-compressible in the sense of Sect. 4.2, with \(C_{\mathbf {A}} := \sup _{\eta >0} C_{\mathbf {A},\tilde{\mathbf {A}}} < \infty \), where \(C_{\mathbf {A},\tilde{\mathbf {A}}}\) is defined as in (110) for each value of \(\eta \).

- (vi)
\(\pi ^{(i)}(\mathbf {f}) \in {{\mathcal {A}}^s}\), for \(i=1,\ldots ,m\), for any \(s\) with \(0 < s <s^*\).

Note that the requirement on \(\mathbf{f}\) in (iii) is actually very mild because the data are typically more tensor sparse than the solution.

The following complexity estimates are formulated only for the more interesting case of the hierarchical Tucker format. Similar statements hold for the Tucker format, involving, however, additional terms that depend exponentially on \(m\), which makes this format suitable only for moderate values of \(m\).

*Remark 13*

*Remark 14*

*Remark 15*

We take \({{\mathrm{\textsc {recompress}}}}\) as a numerical realization of \({\hat{\mathrm{P }}}_{\eta }\) as defined in (47). This amounts to the computation of a HOSVD or \({\mathcal {H}}\)SVD, respectively, for which we have the complexity bounds given in Remarks 1 and 4.

Note that under the assumptions of Proposition 5, the iteration converges for any fixed \(\beta \ge 0\). A call to \({{\mathrm{\textsc {recompress}}}}\) (possibly with \(\beta =0\), i.e., without performing an approximation) is in fact necessary in each inner iteration to ensure the orthogonality properties required by \({{\mathrm{\textsc {apply}}}}\).

The main result of this paper is the following theorem. It states that whenever a solution has the approximation properties specified in Assumption 1, the adaptive scheme recovers these rates, and the required computational work has optimal complexity up to logarithmic factors. We have made an attempt to identify the dependencies of the involved constants on the problem parameters as explicitly as possible.

**Theorem 9**

*Remark 16*

Recalling the form of the growth sequence \(\gamma _\mathbf{u}(n)= e^{d_\mathbf{u}n^{1/b_\mathbf{u}}}\), the rank bound (124) can be reformulated in terms of \(\gamma _\mathbf{u}^{-1}\big (C||\mathbf{u}||_{{{\mathcal {A}}_{\mathcal {H}}({\gamma _\mathbf {\mathbf{u}}})}}/\varepsilon \big )\), which, in view of Remark 6, means that up to a multiplicative constant, the ranks remain minimal. On account of Remark 7, the same holds for the bound (125) on the sparsity of the factors.

*Remark 17*

The maximum number of inner iterations \(J\) that arises in the complexity estimate is defined in line 2 of Algorithm 5.1. This value depends on the freely chosen algorithm parameters \(\beta \) and \(\theta \), on the constants \(\omega \) and \(\rho \) that depend only on \(\mathbf {A}\), and on \(\kappa _1\). Thus, \(J\) depends on \(m\): the choice of \(\kappa _1\) in Theorem 9 leads to \(\kappa _1 \sim m^{-1}\) and, hence, \(J\sim \log m\). Note that since \(|\ln \varepsilon |^{c \ln m} = m^{c \ln |\ln \varepsilon |}\), this leads to an algebraic dependence of the complexity estimate on \(m\). Furthermore, the precise dependence of the constant in (128) on \(m\) is also influenced by the problem parameters from Assumption 1, which may contain additional implicit dependencies on \(m\). In particular, as can be seen from the proof, the constant has a linear dependence on \(C_\mathbf {A}^{J/s}\) if \(C_\mathbf {A}> 1\) (cf. Remark 8).

*Proof*

(Theorem 9) By the choice of \(\kappa _1\), \(\kappa _2\), and \(\kappa _3\), we can apply Lemma 7 to each \(\mathbf{u}_i\) produced in line 11 of Algorithm 5.1, which yields the bounds (124), (125), (126), and (127) for the values \(\varepsilon = \theta ^k \delta \), \(k\in \mathrm{I}\!\mathrm{N}\).

It therefore remains to estimate the computational complexity of each inner loop. Note that \({{\mathrm{\textsc {recompress}}}}\) in line 8 does not deteriorate the approximability of the intermediates \(\mathbf{w}_j\) as a consequence of Lemma 3.

*Remark 18*

The preceding results apply directly to problems posed on separable tensor product Hilbert spaces for which tensor product Riesz bases are available. Note that this is *not* the case for standard Sobolev spaces \(\mathrm{H}^{s}(\varOmega ^d)\) since in this case the norm induced by the scalar product is not a cross norm. However, for tensor product domains \(\varOmega ^d\) these spaces can be represented as *intersections* of \(d\) tensor product spaces with induced norms.

This diagonal rescaling, which in the case of finite-dimensional Galerkin approximations corresponds to a preconditioning of \(A\), leads to additional problems in our context: the sequence \((2^{-s \max _i|\nu _i|})_{\nu \in \nabla ^d}\) (as well as possible equivalent alternatives) has infinite rank on the full index set \(\nabla ^d\). Hence, the application of \(\mathbf {A}\) must involve an approximation by low-rank operators, as discussed in Sect. 4.2. Strategies for handling this issue are discussed in more detail in [2]. The complexity analysis of iterative schemes when \(\mathbf {A}\) involves such a rescaling will be treated in a separate paper.

## 6 Numerical Experiments

We choose our example to illustrate the results of the previous section numerically according to several criteria. To arrive at a valid comparison between different dimensions, we choose a problem on \(\mathrm{L}_{2}((0,1)^d)\) that has similar properties for different values of \(d\). The problem has a discontinuous right-hand side and solution, which means that reasonable convergence rates can be achieved only by adaptive approximation. It is also still sufficiently simple such that all constants used in Algorithm 5.1 can be chosen rigorously according to the requirements of the convergence analysis.

Hence, for our particular choice of \(f\) the action of \(A^{-1}\) is close to the identity. It should be emphasized, however, that this only simplifies the interpretation of the results but does not simplify the problem from a computational point of view since our algorithm does not make use of this particularity. We have also chosen a problem that is completely symmetric with respect to all variables to simplify the tests and the comparison between values of \(d\), but we do not make computational use of this symmetry.

For the additional constants arising in the iteration, we choose \(\theta := \frac{1}{2}\) and \(\beta :=1\). For the hierarchical Tucker format we have \(\kappa _\mathrm{P}= \sqrt{2m-3}\) and \(\kappa _\mathrm{C}= \sqrt{m}\) and fix the derived constants \(\kappa _1,\kappa _2,\kappa _3\) as in Theorem 9 by taking \(\alpha :=1\). Furthermore, we have \(\delta = \lambda _\mathbf {A}^{-1}||\mathbf {f}|| = 2\).

*Remark 19*

Since many steps of the algorithm, including the comparatively expensive approximate application of lower-dimensional operators to tensor factors and \(QR\) factorizations of mode frames, can be carried out independently for each mode, an effective parallelization of our adaptive scheme is quite easy to achieve.

In all following examples, we use piecewise cubic wavelets. The implementation was done in C++ using standard LAPACK routines for linear algebra operations. Iterations are stopped as soon as a required wavelet index cannot be represented as a signed 64-bit integer.

We make some simplifications in counting the number of required operations: for each matrix–matrix product, \(QR\) factorization, and SVD we use the standard estimates for the required number of multiplications (e.g., [20]); for the approximation of \(\mathbf {A}\) and \(\mathbf {f}\) we count one operation per multiplication with a matrix entry and per generated right-hand side entry, respectively (note that we thus make the simplifying assumption that all required wavelet coefficients can be evaluated using \({\mathcal {O}}(1)\) operations, which could in principle be realized in the present example but is not strictly satisfied in our current implementation). We thus neglect some minor contributions that play no asymptotic role, such as the number of operations required for adding two tensor representations, and the sorting of tensor contraction values for \({{\mathrm{\textsc {coarsen}}}}\), which is done here by a standard library call for simplicity.

### 6.1 Results with Right-Hand Side of Rank 1

Note that the iteration is stopped a few steps earlier with increasing dimension because slightly stricter error tolerances are applied in the approximation of operator and right-hand side. This means that the technical limit for the maximum possible wavelet level is reached earlier.

### 6.2 Results with Right-Hand Side of Unbounded Rank

We now use the full right-hand side \(f\) as in (133), which leads to a solution with approximately the same exponential decay of singular values as \(f\).

## 7 Conclusion and Outlook

The presented theory and examples indicate that the schemes developed in this work can be applied to very high-dimensional problems, with a rigorous foundation for the type of elliptic operator equations considered here. The results can be extended to more general operator equations, as long as the variational formulation, in combination with a suitable basis, induces a well-conditioned isomorphism on \(\mathrm{\ell }_{2}\). However, when the operator represents an isomorphism between spaces that are not simple tensor products, such as Sobolev spaces and their duals, additional concepts are required, which will be developed in a subsequent publication.

## Notes

### Acknowledgments

This work was funded in part by the Excellence Initiative of the German Federal and State Governments, DFG Grant GSC 111 (Graduate School AICES), the DFG Special Priority Program 1324, and NSF Grant #1222390.

### References

- 1. Alpert, B.: A class of bases in \(L^2\) for the sparse representation of integral operators. SIAM J. Math. Anal. **24**(1), 246–262 (1993)
- 2. Bachmayr, M.: Adaptive low-rank wavelet methods and applications to two-electron Schrödinger equations. Ph.D. thesis, RWTH Aachen (2012)
- 3. Ballani, J., Grasedyck, L.: A projection method to solve linear systems in tensor format. Numer. Linear Algebra Appl. **20**(1), 27–43 (2013)
- 4. Barinka, A.: Fast evaluation tools for adaptive wavelet schemes. Ph.D. thesis, RWTH Aachen (2005)
- 5. Beylkin, G., Mohlenkamp, M.J.: Numerical operator calculus in higher dimensions. Proc. Natl. Acad. Sci. USA **99**(16), 10246–10251 (2002)
- 6. Beylkin, G., Mohlenkamp, M.J.: Algorithms for numerical analysis in high dimensions. SIAM J. Sci. Comput. **26**(6), 2133–2159 (2005)
- 7. Cancès, E., Ehrlacher, V., Lelièvre, T.: Convergence of a greedy algorithm for high-dimensional convex nonlinear problems. Math. Models Methods Appl. Sci. **21**(12), 2433–2467 (2011)
- 8. Cohen, A.: Numerical Analysis of Wavelet Methods. Studies in Mathematics and Its Applications, vol. 32. Elsevier, Amsterdam (2003)
- 9. Cohen, A., Dahmen, W., DeVore, R.: Adaptive wavelet methods for elliptic operator equations: convergence rates. Math. Comput. **70**(233), 27–75 (2001)
- 10. Cohen, A., Dahmen, W., DeVore, R.: Adaptive wavelet methods II: beyond the elliptic case. Found. Comput. Math. **2**(3), 203–245 (2002)
- 11. Dahmen, W.: Wavelet and multiscale methods for operator equations. Acta Numer. **6**, 55–228 (1997)
- 12. DeVore, R., Petrova, G., Wojtaszczyk, P.: Approximation of functions of few variables in high dimensions. Constr. Approx. **33**, 125–143 (2011)
- 13. Dijkema, T.J., Schwab, C., Stevenson, R.: An adaptive wavelet method for solving high-dimensional elliptic PDEs. Constr. Approx. **30**(3), 423–455 (2009)
- 14. Falcó, A., Hackbusch, W.: On minimal subspaces in tensor representations. Found. Comput. Math. **12**, 765–803 (2012)
- 15. Falcó, A., Hackbusch, W., Nouy, A.: Geometric structures in tensor representations. Preprint 9/2013, Max Planck Institute for Mathematics in the Sciences, Leipzig (2013)
- 16. Falcó, A., Nouy, A.: Proper generalized decomposition for nonlinear convex problems in tensor Banach spaces. Numer. Math. **121**, 503–530 (2012)
- 17. Grasedyck, L.: Hierarchical singular value decomposition of tensors. SIAM J. Matrix Anal. Appl. **31**(4), 2029–2054 (2010)
- 18. Grasedyck, L., Kressner, D., Tobler, C.: A literature survey of low-rank tensor approximation techniques. GAMM-Mitt. **36**, 53–78 (2013)
- 19. Griebel, M., Harbrecht, H.: Approximation of two-variate functions: singular value decomposition versus regular sparse grids. INS Preprint No. 1109, Universität Bonn (2011)
- 20. Hackbusch, W.: Tensor Spaces and Numerical Tensor Calculus. Springer Series in Computational Mathematics, vol. 42. Springer, Berlin (2012)
- 21. Hackbusch, W., Khoromskij, B., Tyrtyshnikov, E.: Approximate iterations for structured matrices. Numer. Math. **109**, 119–156 (2008)
- 22. Hackbusch, W., Kühn, S.: A new scheme for the tensor representation. J. Fourier Anal. Appl. **15**(5), 706–722 (2009)
- 23. Hitchcock, F.L.: Multiple invariants and generalized rank of a \(p\)-way matrix or tensor. J. Math. Phys. **7**, 39–79 (1927)
- 24. Khoromskij, B.N., Schwab, C.: Tensor-structured Galerkin approximation of parametric and stochastic elliptic PDEs. SIAM J. Sci. Comput. **33**(1), 364–385 (2011)
- 25. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. **51**(3), 455–500 (2009)
- 26. Kressner, D., Tobler, C.: Preconditioned low-rank methods for high-dimensional elliptic PDE eigenvalue problems. Comput. Methods Appl. Math. **11**(3), 363–381 (2011)
- 27. De Lathauwer, L., De Moor, B., Vandewalle, J.: A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl. **21**(4), 1253–1278 (2000)
- 28. Matthies, H.G., Zander, E.: Solving stochastic systems with low-rank tensor compression. Linear Algebra Appl. **436**(10), 3819–3838 (2012)
- 29. Metselaar, A.: Handling wavelet expansions in numerical methods. Ph.D. thesis, University of Twente (2002)
- 30. Novak, E., Woźniakowski, H.: Approximation of infinitely differentiable multivariate functions is intractable. J. Complex. **25**, 398–404 (2009)
- 31. Oseledets, I., Tyrtyshnikov, E.: Breaking the curse of dimensionality, or how to use SVD in many dimensions. SIAM J. Sci. Comput. **31**(5), 3744–3759 (2009)
- 32. Oseledets, I., Tyrtyshnikov, E.: Tensor tree decomposition does not need a tree. Tech. Rep. 2009-08, RAS, Moscow (2009)
- 33. Oseledets, I.V.: Tensor-train decomposition. SIAM J. Sci. Comput. **33**(5), 2295–2317 (2011)
- 34. Schneider, R., Uschmajew, A.: Approximation rates for the hierarchical tensor format in periodic Sobolev spaces. J. Complex. **30**, 56–71 (2014)
- 35. de Silva, V., Lim, L.-H.: Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM J. Matrix Anal. Appl. **30**(3), 1084–1127 (2008)
- 36. Stevenson, R.: On the compressibility of operators in wavelet coordinates. SIAM J. Math. Anal. **35**(5), 1110–1132 (2004)
- 37. Tucker, L.R.: The extension of factor analysis to three-dimensional matrices. In: Contributions to Mathematical Psychology, pp. 109–127. Holt, Rinehart & Winston, New York (1964)
- 38. Tucker, L.R.: Some mathematical notes on three-mode factor analysis. Psychometrika **31**, 279–311 (1966)
- 39. Uschmajew, A.: Well-posedness of convex maximization problems on Stiefel manifolds and orthogonal tensor product approximations. Numer. Math. **115**, 309–331 (2010)
- 40. Uschmajew, A.: Regularity of tensor product approximations to square integrable functions. Constr. Approx. **34**, 371–391 (2011)