1 Introduction

The following chapter is an outline of Riemannian optimization and integration methods on manifolds of low-rank matrices and tensors. This field is relatively new. While the minimization of functions or the time evolution of dynamical systems under smooth manifold constraints is of course classical, and can be treated in a quite general context, there are specific peculiarities to sets of low-rank matrices and tensors that make Riemannian methods particularly amenable to these sets in actual algorithms. There are at least two main reasons for this.

The first is that manifolds of low-rank matrices or tensors are images of multilinear maps. This not only provides an explicit global parametrization of the manifold itself, but also yields, via the product rule, a simple representation of tangent vectors and tangent space projections. The second reason is the singular value decomposition (SVD), which for matrices has the remarkable property of providing metric projections onto the non-convex sets of bounded rank matrices. As we will see, for certain low-rank tensor manifolds the SVD can be put to similar use.

A classical and powerful set of algorithms for handling low-rank constraints for matrices or tensors is based on eliminating the constraints by using the aforementioned multilinear parametrizations, and then optimizing the block parameters separately, typically in the form of alternating optimization. In contrast, Riemannian methods try to take advantage of the actual geometry of the image, which for instance can overcome problems of ill-conditioning of the typically non-unique multilinear parametrizations. One of the earlier works where the tangent space geometry of non-symmetric fixed rank matrices was quite explicitly exploited in numerical algorithms is [59]. It introduced the dynamical low-rank approximation method for calculating low-rank approximations when integrating a matrix that satisfies a set of ordinary differential equations (ODEs), as we will explain in Sect. 9.5.1. In the context of finding rank bounded feasible points for linear matrix inequalities, a similar exploitation of the tangent space for fixed rank symmetric definite matrices already appeared in [84]. For optimization problems with rank constraints, several Riemannian optimization methods were first presented in [79, 98, 113], each using slightly different geometries of the set of fixed rank matrices. All of them show in great detail how the geometry can be exploited in the algorithms, and [98, 113] also include Riemannian Hessians to obtain superlinear convergence. These algorithms fit in the general framework of optimization on manifolds, summarized in the monograph [2], which however does not deal with manifolds of fixed rank matrices. An influential earlier work using geometrical tools close to the subject of this chapter is [45] on the best low-rank approximation problem for matrices.

The geometric viewpoint on low-rank matrices can be carried over to low-rank tensors as well. Here, some of the main ideas emanated from mathematical physics, specifically spin systems and molecular dynamics, which involve low-rank representations of high-dimensional functions [69]. The embedded geometry of tensor train and hierarchical Tucker manifolds was subsequently worked out in [46, 108] with the goal of providing the tool of Riemannian optimization also for problems of scientific computing and optimization with tensors. Some examples and references for successful applications of such methods will be presented in some detail later.

1.1 Aims and Outline

Our aim in this chapter is to provide a high-level overview of the main ideas and tools for optimization and time integration on low-rank manifolds. For this we decided to avoid formal definitions, assumptions or arguments that we considered too technical, and tried to develop the concepts in a more descriptive manner. As a result the chapter contains few rigorous theorems, but the provided references should enable the reader to look up most of the technical details. We also stick to a quite concrete ‘matrix language’ as much as possible and avoid abstract tensor product spaces. In this sense, a tensor will be just an array of numbers. While this is often sufficient when dealing with practical problems, coordinate-free multilinear algebra can of course be essential for understanding the theoretical foundations; it is, however, out of scope here.

There are several topics that will not be touched at all in this chapter. First of all, for tensors we restrict ourselves to manifolds of tensors with fixed tensor train rank, because this format can be presented quite easily. The two other tensor formats that allow for geometric methods in a similar spirit are the Tucker format (related to the multilinear rank) and its hierarchical version, the hierarchical Tucker format. Another important topic that we do not address is the choice of the rank. While we present methods for optimization and integration on manifolds of fixed rank matrices and tensors, the choice of the rank is quite problem dependent and needs to balance the achievable model error with the numerical complexity. This is often done adaptively. Of course, if a problem at hand does not admit a ‘low-rank’ solution in the first place, the methods presented in this chapter are of limited use, albeit still mathematically interesting. Finding conditions that ensure low-rank solutions to a class of optimization problems or ODEs can be challenging and several questions in this context are still unanswered, especially for tensors.

Finally, the alternating optimization methods mentioned above, like the alternating least squares or DMRG algorithm, will not be further discussed in this chapter. Compared to Riemannian optimization, these classic approaches to low-rank optimization are much better known and have been used in countless applications. For further reading we would like to refer to the several overview articles taking different perspectives on low-rank optimization, see [6, 15,16,17, 37, 61, 100], and the monographs [39, 51, 53].

The chapter is structured as follows. In Sect. 9.2 we provide an elementary outline of the geometry of the set of fixed rank matrices as an embedded submanifold, with a focus on the geometric concepts that are needed in efficient algorithms. In Sect. 9.3 we introduce the tensor train format and show that its geometry shares many similarities with that of the matrix case. The next two sections, Sects. 9.4 and 9.5, are devoted to optimization problems and the integration of ODEs over low-rank matrices and tensor train tensors. In both cases we will show how the geometry that was just derived plays a crucial role. Finally, in Sect. 9.6 we mention typical applications that can be treated well with low-rank tensor techniques and in particular with geometric methods.

2 The Geometry of Low-Rank Matrices

As motivated in the introduction, many approximation and identification problems involving low-rank matrices or tensors can be formulated as nonlinear, rank constrained optimization problems. To design and understand efficient geometric methods for their solution, it is therefore necessary to understand the geometry of sets of matrices and tensors of bounded rank. The most basic ingredients for such methods are the representation of tangent vectors, the computation of tangent space projections and the availability of retractions. In this section we present these concepts for the well known case of low-rank matrices in quite some detail as it features all the core ideas on an easily understandable level. We will then in the next section consider manifolds of tensors in low rank tensor train format as an exemplary case for tensors, since it is a tensor decomposition with many parallels to the matrix case.

We restrict the considerations to the linear space \(\mathbb {R}^{m \times n}\) of real m × n matrices, although most of the following theory can be developed for complex matrices too. The Euclidean structure of this space is given by the Frobenius inner product of two matrices,

$$\displaystyle \begin{aligned} (X,Y)_F = \sum_{i=1}^m \sum_{j=1}^n X(i,j)\, Y(i,j) = \operatorname{trace}(X^T Y), \end{aligned}$$

which induces the Frobenius norm \(\| X \|_F = (X,X)_F^{1/2}\).

As is well known, the rank of a matrix \(X \in \mathbb {R}^{m \times n}\) is the smallest number \(r =\operatorname {\mathrm {rank}}(X)\) such that there exists a decomposition

$$\displaystyle \begin{aligned} X = G H^T, \qquad G \in \mathbb{R}^{m \times r}, \ H \in \mathbb{R}^{n \times r}. \end{aligned} $$
(9.1)

Necessarily, \(r \le \min (m,n)\). We call such a rank revealing decomposition of X the (G, H)-format.

Note that the decomposition (9.1) is not unique, since we may replace G with GA and H with HA T, where A is an invertible r × r matrix. This ambiguity can be removed by requiring additional constraints. A special case is the rank revealing QR decomposition X = QR, where \(Q \in \mathbb {R}^{m \times r}\) has pairwise orthonormal columns, and \(R \in \mathbb {R}^{r \times n}\) is an upper triangular matrix with positive diagonal entries. Such a decomposition can be computed by the column pivoted QR algorithm; see [35].

When m or n are very large, but r is small, it is obviously beneficial in computations to store the matrix X in the (G, H)-format (9.1): instead of storing mn entries of the full matrix X, we only need to know the (m + n)r entries of the matrices G and H. When (m + n)r is much smaller than mn, we may rightfully say that X is of low rank. The key idea of low-rank approximation is that in many applications X may not be of exact low rank, but still can be well approximated by a low-rank matrix.
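
To make the storage comparison concrete, here is a small NumPy sketch (our own illustration, not part of the original text; all names and sizes are made up) that stores a low-rank matrix only through its factors G and H:

```python
import numpy as np

m, n, r = 10_000, 8_000, 20
rng = np.random.default_rng(0)

# (G, H)-format: the product X = G @ H.T is never formed explicitly.
G = rng.standard_normal((m, r))
H = rng.standard_normal((n, r))

full_storage = m * n                 # 80_000_000 entries
factored_storage = (m + n) * r       # 360_000 entries
print(full_storage, factored_storage)

# Entries and matrix-vector products are still available from the factors:
i, j = 123, 4567
x_ij = G[i, :] @ H[j, :]             # single entry X[i, j]
v = rng.standard_normal(n)
Xv = G @ (H.T @ v)                   # X @ v in O((m + n) r) operations
```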

2.1 Singular Value Decomposition and Low-Rank Approximation

The fundamental tool for low-rank approximation is the singular value decomposition (SVD). Let \( \operatorname {\mathrm {rank}}(X) \le r \le \min (m,n)\). Then an SVD of X is a decomposition

$$\displaystyle \begin{aligned} X = U \Sigma V^T = \sum_{\ell = 1}^r \sigma_\ell^{} u_\ell^{} v_\ell^T, \end{aligned} $$
(9.2)

where \(U = [u_1, \dots, u_r] \in \mathbb{R}^{m \times r}\) and \(V = [v_1, \dots, v_r] \in \mathbb{R}^{n \times r}\) have orthonormal columns and \(\Sigma \in \mathbb {R}^{r \times r}\) is a diagonal matrix. Its diagonal entries \(\sigma_1, \dots, \sigma_r\) are called the singular values of X and will always be taken to be nonnegative and ordered: \(\sigma_1 \ge \dots \ge \sigma_r \ge 0\). Note that if \(k =\operatorname {\mathrm {rank}}(X) < r\), then \(\sigma_k > 0\), while \(\sigma_{k+1} = \dots = \sigma_r = 0\).

The discovery of the SVD is usually attributed to Beltrami and Jordan around 1873/1874, with important later contributions by Sylvester, Schmidt, and Weyl; see, e.g., [104] for a history. Its existence is not difficult to show when appealing to the spectral theorem for symmetric matrices. It is enough to consider \(r =\operatorname {\mathrm {rank}}(X)\). The positive semidefinite matrix \(XX^T\) then has r positive eigenvalues and admits an eigenvalue decomposition \(XX^T = U \Lambda U^T\) with \(\Lambda \in \mathbb {R}^{r \times r}\) being a diagonal matrix with a positive diagonal, and \(U^T U = I_r\). The matrix \(UU^T\) is then the orthogonal projector onto the column space of X, and hence \(UU^T X = X\). Now setting \(\Sigma = \Lambda^{1/2}\) and \(V = X^T U \Sigma^{-1}\) we obtain \(U \Sigma V^T = UU^T X = X\), that is, an SVD of X. Note that V indeed has orthonormal columns, as \(V^T V = \Sigma^{-1} U^T X X^T U \Sigma^{-1} = \Sigma^{-1} \Lambda \Sigma^{-1} = I_r\).

The following theorem is the reason for the importance of the SVD in modern applications involving low rank approximation of matrices and—as we will explain later—of tensors.

Theorem 9.1

Consider an SVD (9.2) of a matrix X with σ 1 ≥⋯ ≥ σ r ≥ 0. For any k < r, the truncated SVD

$$\displaystyle \begin{aligned} X_k = \sum_{\ell = 1}^k \sigma_\ell^{} u_\ell^{} v_\ell^T \end{aligned}$$

provides a matrix of rank at most k that is closest in Frobenius norm to X. The distance is

$$\displaystyle \begin{aligned} \| X - X_k \|{}_F = \min_{\operatorname{\mathrm{rank}}(Y) \le k} \| X - Y \|{}_F = \bigg( \sum_{\ell = k+1}^r \sigma_\ell^2 \ \bigg)^{1/2}. \end{aligned} $$
(9.3)

If σ k > σ k+1, then X k has rank k and is the unique best approximation of rank at most k.

This famous theorem is due to Schmidt [96], who proved it in 1907 for compact integral operators. In 1936 it was rediscovered by Eckart and Young [25]. In 1937, Mirsky [80] proved a much more general version of this theorem stating that the same truncated SVD provides a best rank-k approximation in any unitarily invariant norm. A norm ∥⋅∥ on \(\mathbb {R}^{m \times n}\) is called unitarily invariant if ∥X∥ = ∥QXP∥ for all orthogonal Q and P. For such a norm it holds that ∥X∥ = ∥ Σ∥, that is, the norm is entirely determined by the vector of singular values.

The SVD of an m × n matrix can be computed from a symmetric eigenvalue problem or, better, using the Golub–Kahan algorithm [34]. The amount of work in double precision when m ≥ n is \(O(14mn^2 + 8n^3)\); see [35, Chapter 8.6]. For a large matrix X, computing the full SVD is prohibitively expensive if one is only interested in its low-rank approximation \(X_k\) and if \(k \ll \min (m,n)\). To this end, there exist many so-called matrix-free methods based on Krylov subspaces or randomized linear algebra; see, e.g., [43, 67]. In general, these methods are less predictable than the Golub–Kahan algorithm and are not guaranteed to always give (good approximations of) \(X_k\). They can, however, exploit sparsity since they only require matrix-vector products with X and \(X^T\).
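
As a quick numerical illustration of Theorem 9.1 (our own sketch, not part of the original text), one can truncate an SVD computed with NumPy and compare the approximation error with the tail of the singular values in (9.3):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 60, 40, 5

X = rng.standard_normal((m, n))
U, s, Vt = np.linalg.svd(X, full_matrices=False)    # X = U @ diag(s) @ Vt

# Truncated SVD X_k: keep the k dominant singular triplets.
Xk = U[:, :k] * s[:k] @ Vt[:k, :]

err = np.linalg.norm(X - Xk, 'fro')
tail = np.sqrt(np.sum(s[k:] ** 2))                  # right-hand side of (9.3)
print(abs(err - tail))                              # ≈ 0 up to rounding
```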

Observe that the existence of a best approximation of any matrix X by another matrix of rank at most k implies that the set

$$\displaystyle \begin{aligned} \mathcal{M}_{\le k} = \{ X \in \mathbb{R}^{m \times n} \colon \operatorname{\mathrm{rank}}(X) \le k \} \end{aligned} $$
(9.4)

is a closed subset of \(\mathbb {R}^{m \times n}\). Therefore any continuous function \(f \colon \mathbb {R}^{m \times n} \to \mathbb {R}\) with bounded sublevel sets attains a minimum on \(\mathcal {M}_{\le k}\). The formula (9.3) for the distance from \(\mathcal {M}_{\le k}\) implies that a matrix admits a good low-rank approximation in Frobenius norm if its singular values decay sufficiently fast. Consequently, low-rank optimization is suitable for matrix problems whose true solution can be expected to have this property.

2.2 Fixed Rank Manifold

Geometric optimization methods, like the ones we will discuss later, typically operate explicitly on smooth manifolds. The set \(\mathcal {M}_{\le k}\) of matrices of rank at most k is a real algebraic variety, but it is not smooth at points X of rank strictly less than k. The good news is that the set \(\mathcal {M}_{\le k-1}\) of these points has relative Lebesgue measure zero.

The smooth part of the variety \(\mathcal {M}_{\le k}\) is the set

$$\displaystyle \begin{aligned} \mathcal{M}_{k} = \{ X \in \mathbb{R}^{m \times n} \colon \operatorname{\mathrm{rank}}(X) = k \} \end{aligned}$$

of matrices of fixed rank k. It is a folklore result in differential geometry (see, e.g., [66, Example 8.14]) that \(\mathcal {M}_k\) is a \(C^\infty\) smooth embedded submanifold of \(\mathbb {R}^{m \times n}\) of dimension

$$\displaystyle \begin{aligned} \dim(\mathcal{M}_k) = mn - (m-k)(n-k) = (m + n -k)k. \end{aligned} $$
(9.5)

The easiest way to show this is by explicitly constructing \(\mathcal {M}_k\) as the union of level sets of submersions. The idea is as follows.

We partition the matrices in \({\mathbb {R}}^{m \times n}\) as

$$\displaystyle \begin{aligned} X = \begin{bmatrix} A & B \\ C & D \end{bmatrix}, \qquad A \in {\mathbb{R}}^{k \times k}, \quad B \in {\mathbb{R}}^{k \times (n-k)}, \quad C \in {\mathbb{R}}^{(m-k) \times k}, \quad D \in {\mathbb{R}}^{(m-k) \times (n-k)}, \end{aligned}$$

and consider the open set \(\mathcal U\) of all matrices for which the block A is invertible. A matrix X in \(\mathcal U\) then has rank k if and only if the Schur complement \(F(X) = D - CA^{-1}B\) vanishes, that is, \(\mathcal M_k \cap \mathcal U = F^{-1}(0)\). The function F maps \(\mathcal U\) onto \({\mathbb {R}}^{(m - k) \times (n-k)}\) (consider B = C = 0) and is a submersion: its partial derivative with respect to D at any point \(X \in \mathcal U\) is the identity, hence the derivative F′(X) is surjective. By the submersion theorem, the preimage \(\mathcal M_k \cap \mathcal U\) is therefore an embedded submanifold of the specified dimension (9.5), and it remains to note that the full set \(\mathcal {M}_k\) is the finite union of such manifolds \(\mathcal M_k \cap \mathcal U\) over all possible positions of a k × k invertible submatrix A.
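
The Schur complement characterization used in this argument is easy to test numerically. The following sketch (ours, purely illustrative) checks that F(X) = 0 for a random rank-k matrix whose leading k × k block is invertible:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 7, 6, 3

# Random rank-k matrix; with probability one its leading k x k block A is invertible.
X = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
A, B = X[:k, :k], X[:k, k:]
C, D = X[k:, :k], X[k:, k:]

schur = D - C @ np.linalg.solve(A, B)    # F(X) = D - C A^{-1} B
print(np.linalg.matrix_rank(X))          # 3
print(np.linalg.norm(schur))             # ≈ 0: F(X) = 0 exactly when rank(X) = k
```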

As an alternative to the above proof, \(\mathcal {M}_{k}\) can also be described as a smooth quotient manifold as in [82]; see also [1] for an overview.

Another important remark concerning optimization is that for \(k < \min (m,n)\) both the sets \(\mathcal {M}_k\) and \(\mathcal {M}_{\le k}\) are connected. This follows from the rank revealing decomposition (9.1) and the connectedness of the sets of full column rank m × k and n × k matrices when \(k < \min(m,n)\).

2.3 Tangent Space

The explicit knowledge of the tangent spaces and the efficient representation of tangent vectors is crucial for the practical implementation of geometric optimization methods on a manifold. For the fixed rank manifold we have several options for representing tangent vectors.

First of all, it follows from the bilinearity of the map (G, H)↦GH T that matrices of the form

$$\displaystyle \begin{aligned} \xi = \delta G \, H^T + G \, \delta H^T, \qquad \delta G \in \mathbb{R}^{m \times k}, \ \delta H \in \mathbb{R}^{n \times k}, \end{aligned} $$
(9.6)

are tangent vectors to \(\mathcal {M}_k\) at X = GH T. Like the (G, H)-format, this representation of tangent vectors has the disadvantage of not being unique, and it might be sensitive to numerical errors when G or H are ill conditioned.

On the other hand, the representation (9.6) reveals that the tangent vector ξ lies in the sum of two overlapping linear spaces, namely, the subspaces of all matrices whose column (resp. row) space is contained in the column (resp. row) space of X. Based on this observation we can find another representation for ξ. Let \(U \in {\mathbb {R}}^{m \times k}\) and \(V \in {\mathbb {R}}^{n \times k}\) contain orthonormal bases for the column and row space of \(X \in \mathcal {M}_k\). Then X = USV T for some \(S \in {\mathbb {R}}^{k \times k}\) (a possible choice here is the SVD (9.2) of X, that is, S =  Σ). We choose corresponding orthonormal bases \(U_\perp \in {\mathbb {R}}^{m \times (m-k)}\) and \(V_\perp \in {\mathbb {R}}^{n \times (n-k)}\) for the orthogonal complements. Then the tangent vector ξ is an element of the linear space

$$\displaystyle \begin{aligned} T_X \mathcal{M}_k = \left\{ \begin{bmatrix} U & U_\perp \end{bmatrix} \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & 0 \end{bmatrix} \begin{bmatrix} V & V_\perp \end{bmatrix}^T \colon C_{11} \in \mathbb{R}^{k \times k}, \ C_{21} \in \mathbb{R}^{(m-k) \times k}, \ C_{12} \in \mathbb{R}^{k \times (n-k)} \right\}. \end{aligned} $$
(9.7)

Vice versa, it is not too difficult to show that every element in \( T_X \mathcal {M}_k\) can be written in the form (9.6) and hence is a tangent vector. Since the dimension of \(T_X \mathcal {M}_k\) equals that of \(\mathcal {M}_k\), it follows that in fact \(T_X \mathcal {M}_k\) is equal to the tangent space to \(\mathcal {M}_k\) at X.

In (9.7) we have decomposed the tangent space \(T_X \mathcal {M}_k\) into three mutually orthogonal subspaces represented by the three matrices C 11, C 21 and C 12. The orthogonal projection of any matrix \(Z \in {\mathbb {R}}^{m \times n}\) onto \(T_X \mathcal {M}_k\) is hence obtained by projecting on these three spaces separately. This gives

$$\displaystyle \begin{aligned} \mathcal P_X(Z) = P_U Z P_V + (I-P_U) Z P_V + P_U Z (I-P_V), \end{aligned} $$
(9.8)

where \(P_U = UU^T\) and \(P_V = VV^T\) are the orthogonal projections onto the column and row space of X, respectively. Expanding this expression gives the alternative formula

$$\displaystyle \begin{aligned} \mathcal P_X(Z) = P_U Z + Z P_V - P_U Z P_V. \end{aligned} $$
(9.9)
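
The projection formulas (9.8)–(9.9) translate directly into code. The following sketch (our own, with arbitrary small dimensions) implements (9.9) and checks that it acts as a projection and leaves X itself unchanged:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 8, 7, 2

# A point X on the fixed-rank manifold, represented by its thin SVD X = U S V^T.
X = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U, V = U[:, :k], Vt[:k, :].T

def proj_tangent(Z):
    """Orthogonal projection onto T_X M_k, formula (9.9)."""
    PU_Z = U @ (U.T @ Z)
    Z_PV = (Z @ V) @ V.T
    return PU_Z + Z_PV - U @ (U.T @ Z @ V) @ V.T

Z = rng.standard_normal((m, n))
P1 = proj_tangent(Z)
P2 = proj_tangent(P1)
print(np.linalg.norm(P1 - P2))               # ≈ 0 (idempotent)
print(np.linalg.norm(proj_tangent(X) - X))   # ≈ 0 (X lies in its own tangent space)
```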

While the characterization (9.7) of \(T_X \mathcal {M}_k\) is very convenient for theoretical purposes, it is less suitable in calculations when k is small but m or n are very large, since then at least one of the matrices \(U_\perp\) or \(V_\perp\) will be very large. In that situation, the factored representation proposed in [98, 113] is preferable:

$$\displaystyle \begin{aligned} \xi = \delta U \, S V^T + U \, \delta S \, V^T + U S \, \delta V^T, \end{aligned} $$
(9.10)

where

$$\displaystyle \begin{aligned} \delta U \in \mathbb{R}^{m \times k}, \quad \delta S \in \mathbb{R}^{k \times k}, \quad \delta V \in \mathbb{R}^{n \times k}, \qquad U^T \delta U = 0, \quad V^T \delta V = 0. \end{aligned} $$
(9.11)

This only requires storing the smaller matrices \(\delta U\), \(\delta S\), and \(\delta V\). Observe that the columns of \(\delta U\) and \(\delta V\) are required to be orthogonal to the columns of U and V, respectively, which is also called a gauging condition.

To conclude, once U, S and V are chosen to represent \(X = USV^T\), all the factored parametrizations of tangent vectors at X belong to the linear subspace

$$\displaystyle \begin{aligned} \mathcal{H}_{(U,S,V)} = \left\{ (\delta U, \delta S, \delta V) \in \mathbb{R}^{m \times k} \times \mathbb{R}^{k \times k} \times \mathbb{R}^{n \times k} \colon U^T \delta U = 0, \ V^T \delta V = 0 \right\}. \end{aligned}$$

The representation of \(T_X \mathcal M_k\) by \(\mathcal{H}_{(U,S,V)}\) is bijective. One can therefore directly compute the result of the projection \(\mathcal P_X(Z)\) as a factored parametrization:

$$\displaystyle \begin{aligned} \delta S = U^T Z V, \qquad \delta U = (I - P_U) Z V S^{-1}, \qquad \delta V = (I - P_V) Z^T U S^{-T}. \end{aligned} $$
(9.12)

Observe that this requires k matrix vector products with Z and Z T, hence sparsity or a low rank of Z can be exploited nicely.

2.4 Retraction

The other main ingredient for efficient geometric methods is the retraction. A retraction for a manifold \(\mathcal M\) is a smooth map R on the tangent bundle \(T \mathcal M\) that maps, for every X, the tangent space \(T_X \mathcal M\) to \(\mathcal M\). The decisive property of a retraction is that this mapping is exact to first order, that is,

$$\displaystyle \begin{aligned} R_X(\xi) = X + \xi + o( \| \xi \| ). \end{aligned} $$
(9.13)

Obviously, such a map is useful in optimization methods for turning an increment X + ξ on the affine tangent plane into a new point R_X(ξ) on the manifold. For Riemannian manifolds it can be shown that retractions always exist. A very natural choice from a differential geometry viewpoint is the so-called exponential map, which maps along geodesics in the direction of the tangent vector. In practice, however, the exponential map may be very complicated to compute. There are alternative choices. Retractions in our current context seem to have been first introduced in [99]; see also [2] for more details.

For the embedded submanifold \(\mathcal M_k\) (more precisely, for \(\mathcal {M}_{\le k}\)) we are in the fortunate situation that, by Theorem 9.1, we can compute the metric projection (best approximation) in the ambient space, equipped with the Frobenius norm as metric, through the truncated SVD. It hence provides an easy-to-use retraction with respect to this metric. Note that in general for a \(C^m\) smooth embedded submanifold \(\mathcal M\) of a Euclidean space with m ≥ 2 and a point \(X \in \mathcal M\), there exists an open neighborhood of \(0 \in T_X \mathcal M\) on which a metric projection \(\xi \mapsto {\mathcal P}_{\mathcal M}(X+\xi )\) is uniquely defined and satisfies the retraction property

$$\displaystyle \begin{aligned} \| X + \xi - \mathcal P_{\mathcal{M}}(X+\xi) \| = o(\| \xi \|). \end{aligned}$$

In addition, \(\mathcal P_{\mathcal {M}}\) is \(C^{m-1}\) smooth on that neighborhood; see, e.g., [68, Lemma 2.1].

When using the truncated SVD as a retraction for \(\mathcal {M}_k\), the crucial question arises whether it can be computed efficiently. This indeed is the case. If \(X = USV^T\) and \(\xi \in T_X \mathcal M_k\) are represented in the factored form (9.10), we first compute QR decompositions of \(\delta U\) and \(\delta V\),

$$\displaystyle \begin{aligned} \delta U = Q_1 R_1, \qquad \delta V = Q_2 R_2. \end{aligned}$$

It then holds

$$\displaystyle \begin{aligned} X + \xi = \begin{bmatrix} U & Q_1 \end{bmatrix} K \begin{bmatrix} V & Q_2 \end{bmatrix}^T \end{aligned} $$
(9.14)

with the 2k × 2k block matrix

$$\displaystyle \begin{aligned} K = \begin{bmatrix} S + \delta S & S R_2^T \\ R_1 S & 0 \end{bmatrix}. \end{aligned}$$

Since the matrices \([U \ \ Q_1]\) and \([V \ \ Q_2]\) each have orthonormal columns (as before we assume that both U and V have orthonormal columns), we can obtain an SVD of the ‘big’ matrix X + ξ from an SVD of the small matrix K, which can be done in \(O(k^3)\) time.
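
Assuming the factored parametrization \(\xi = \delta U\, S V^T + U\, \delta S\, V^T + U S\, \delta V^T\) with the gauging \(U^T \delta U = V^T \delta V = 0\) as in (9.10)–(9.11), the retraction reduces to two thin QR factorizations and one SVD of the small matrix K. The sketch below (our own, with arbitrary sizes) compares it against a dense truncated SVD of X + ξ:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, k = 50, 40, 4

# Point X = U S V^T with orthonormal U, V.
U, _ = np.linalg.qr(rng.standard_normal((m, k)))
V, _ = np.linalg.qr(rng.standard_normal((n, k)))
S = np.diag(rng.uniform(1.0, 2.0, k))

# Gauged tangent components: U^T dU = 0, V^T dV = 0.
dU = rng.standard_normal((m, k)); dU -= U @ (U.T @ dU)
dV = rng.standard_normal((n, k)); dV -= V @ (V.T @ dV)
dS = rng.standard_normal((k, k))

# Retraction: truncated SVD of X + xi via the 2k x 2k matrix K, cf. (9.14).
Q1, R1 = np.linalg.qr(dU)
Q2, R2 = np.linalg.qr(dV)
K = np.block([[S + dS, S @ R2.T], [R1 @ S, np.zeros((k, k))]])
Uk, sk, Vkt = np.linalg.svd(K)
U_new = np.hstack([U, Q1]) @ Uk[:, :k]
V_new = np.hstack([V, Q2]) @ Vkt[:k, :].T
X_new = U_new * sk[:k] @ V_new.T

# Dense reference: truncated SVD of X + xi formed explicitly.
X = U @ S @ V.T
xi = dU @ S @ V.T + U @ dS @ V.T + U @ S @ dV.T
Ud, sd, Vdt = np.linalg.svd(X + xi)
X_ref = Ud[:, :k] * sd[:k] @ Vdt[:k, :]
print(np.linalg.norm(X_new - X_ref))   # ≈ 0 (best rank-k approximation is generically unique)
```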

3 The Geometry of the Low-Rank Tensor Train Decomposition

In this section we present the tensor train decomposition as a possible generalization of low-rank matrix decomposition to tensors. By tensors we simply mean higher-order analogs of matrices: an n 1 ×⋯ × n d tensor X is an array of this size containing real valued entries X(i 1, …, i d); see Fig. 9.1. Such data structures appear in many applications. Another way to see them is as multivariate functions depending on discrete variables/indices. The tensors of given size form a linear space denoted as \({\mathbb {R}}^{n_1 \times \dots \times n_d}\). The number d of directions is called the order of the tensor. Matrices are hence tensors of order d = 2. As for matrices, it is common to also call the natural Euclidean inner product for tensors,

$$\displaystyle \begin{aligned} \langle X, Y \rangle_F = \sum_{i_1=1}^{n_1} \cdots \sum_{i_d = 1}^{n_d} X(i_1,\dots,i_d) Y(i_1,\dots,i_d), \end{aligned} $$
(9.15)

the Frobenius inner product, and it induces the Frobenius norm.

Fig. 9.1 Tensors of order one (vectors), two (matrices), and three

An n ×⋯ × n tensor has \(n^d\) entries, which can quickly become unmanageable in practice when d is large. This is sometimes called the curse of dimensionality. Among other benefits, low-rank tensor formats provide a tool to circumvent this problem and deal with high dimensional data structures in practice. From a geometric viewpoint a low-rank tensor format defines a nonlinear subset in the space \({\mathbb {R}}^{n_1 \times \dots \times n_d}\), like the sets \(\mathcal M_{\le k}\) from (9.4) in the space of matrices, which can be conveniently represented as the image of a multilinear map. Several choices are possible here.

Let us recall the (G, H)-format (9.1) for a matrix. One way to look at it is as a separation of the variables/indices:

$$\displaystyle \begin{aligned} X(i_1,i_2) = \sum_{\ell = 1}^r G(i_1,\ell) H(i_2,\ell). \end{aligned} $$
(9.16)

The rank is the minimal number r needed for such a separation. A straightforward analog for tensors would be a decomposition

$$\displaystyle \begin{aligned} X(i_1,\dots,i_d) = \sum_{\ell =1}^r C_1(i_1,\ell) \cdots C_d(i_d, \ell) \end{aligned}$$

with factor matrices \(C_\mu \in {\mathbb {R}}^{n_\mu \times r}\), μ = 1, …, d. This tensor format is called the canonical polyadic (CP) format. The minimal r required for such a decomposition is called the (canonical) tensor rank of X. As for matrices, if r is small then storing a tensor in the CP format is beneficial compared to storing all n 1n d entries since one only needs to know the d factor matrices C 1, …, C d.

The CP format has numerous useful applications in data science and scientific computing; see [61] for an overview. One major difference to the matrix case, however, is that the set of all tensors with canonical rank bounded by k is typically not closed. Moreover, while the closure of this set is an algebraic variety, its smooth part is in general not equal to the set of tensors of fixed rank k and does not admit an easy explicit description. An exception is the case of rank-one tensors (k = 1): the set of all outer products X = c 1 ∘⋯ ∘ c d, defined by X(i 1, …, i d) = c 1(i 1)⋯c d(i d), of nonzero vectors \(c_\mu \in {\mathbb {R}}^{n_\mu }\), μ = 1, …, d, is an embedded submanifold of dimension (n 1 + ⋯ + n d) − (d − 1). (It is indeed a special case of manifolds of fixed tensor train rank to be introduced below.) Riemannian optimization in the CP format is hence possible by considering the k-fold sum of rank-one tensors as a manifold, as proposed in [13]. We will, however, not consider this format further in this chapter. Instead, we will present another way to separate the indices of a tensor, which leads to the tensor train format and yields smooth manifolds more similar to the matrix case.

3.1 The Tensor Train Decomposition

The tensor train (TT) format of a tensor \(X \in {\mathbb {R}}^{n_1 \times \dots \times n_d}\) can be derived recursively. First, index i 1 is separated from the others, that is,

$$\displaystyle \begin{aligned} X(i_1,i_2,\dots,i_d) = \sum_{\ell_1=1}^{r_1} G_1(i_1,\ell_1) H_1(\ell_1,i_2,\dots,i_d).\end{aligned} $$
(9.17)

Note that this is a usual matrix decomposition of the form (9.16) when treating the multi-index (i 2, …, i d) as a single index. Next, in the tensor H 1 the indices ( 1, i 2) are separated from the rest, again by a matrix decomposition,

$$\displaystyle \begin{aligned} H_1(\ell_1,i_2,\dots,i_d) = \sum_{\ell_2=1}^{r_2} G_2(\ell_1,i_2,\ell_2) H_2(\ell_2,i_3,\dots,i_d),\end{aligned} $$
(9.18)

yielding

$$\displaystyle \begin{aligned} X(i_1,i_2,\dots,i_d) = \sum_{\ell_1=1}^{r_1} \sum_{\ell_2 = 1}^{r_2} G_1(i_1,\ell_1) G_2(\ell_1,i_2,\ell_2) H_2(\ell_2,i_3,\dots,i_d).\end{aligned} $$
(9.19)

Proceeding in this way, one arrives after d steps at a decomposition of the form

$$\displaystyle \begin{aligned} &\ \ X(i_1, \ldots, i_d) = \\& \sum_{\ell_1=1}^{r_1} \cdots \sum_{\ell_{d-1}=1}^{r_{d-1}} G_1(i_1, \ell_1) G_2(\ell_1,i_2,\ell_2) \cdots G_{d-1}(\ell_{d-2},i_{d-1},\ell_{d-1}) G_d(\ell_{d-1},i_d), \end{aligned} $$
(9.20)

with core tensors \(G_\mu \in \mathbb {R}^{r_{\mu -1} \times n_\mu \times r_\mu }\), μ = 1, …, d, and r 0 = r d = 1. (The third dummy mode was added to G 1 and G d to unify notation.) The core tensors G 1 and G d are hence just matrices, while G 2, …, G d−1 are tensors of order three. A decomposition (9.20) is called a tensor train or TT decomposition of X.

The nested summation in formula (9.20) is in fact a long matrix product. If we denote by G μ(i μ) the r μ−1 × r μ matrix slices of G μ, one gets the compact representation

$$\displaystyle \begin{aligned} X(i_1,\dots,i_d) = G_1(i_1) G_2(i_2) \cdots G_{d-1}(i_{d-1}) G_d(i_d) \end{aligned} $$
(9.21)

of the TT format, which explains the alternative name matrix product state (MPS) of this tensor decomposition common in physics. This formula clearly shows the multilinearity of the TT decomposition with respect to the core tensors. Also it is easy to see from (9.21) that a TT decomposition is never unique: we can insert the identity \(A_\mu ^{} A_\mu ^{-1}\) between any two matrix factors to obtain another decomposition. It will turn out below that this group action is essentially the only ambiguity.

In the numerical analysis community, the TT format was developed by Oseledets and Tyrtyshnikov in [86, 87], with related formats proposed in [36, 40]. In earlier work, it appeared in theoretical physics under a variety of different forms and names, but is now commonly referred to as MPS; see [97] for an overview.

The number of parameters in the TT decomposition (9.20) is bounded by \(dnr^2\), where \(n = \max n_\mu \) and \(r = \max r_\mu \). When \(dnr^2\) is much smaller than \(n_1 \cdots n_d\), this constitutes a great reduction compared to storing the entries of X explicitly. Hence the minimal possible choices for the ‘ranks’ r_μ appearing in the above construction are of interest. The crucial concept in this context is unfoldings of a tensor into matrices.

We define the μth unfolding of a tensor X as the matrix \(X^{\langle \mu \rangle}\) of size (n 1⋯n μ) × (n μ+1⋯n d) obtained by taking the partial multi-indices (i 1, …, i μ) as row indices, and (i μ+1, …, i d) as column indices. In other words,

$$\displaystyle \begin{aligned} X^{\langle \mu \rangle}(i_1,\dots,i_\mu;i_{\mu+1},\dots,i_d) = X(i_1,\dots,i_d) \end{aligned}$$

where the semicolon indicates the separation between the row- and column indices. One can then show the following theorem.

Theorem 9.2

In a TT decomposition (9.20) it necessarily holds that

$$\displaystyle \begin{aligned} r_\mu \ge \operatorname{\mathrm{rank}}(X^{\langle \mu \rangle}), \quad \mu = 1,\dots,d-1. \end{aligned} $$
(9.22)

It is furthermore possible to obtain a decomposition such that equality holds.

To get an insight into why the above statement is true, first observe that, by isolating the summation over the index ℓ μ, the TT decomposition (9.20) is in fact equivalent to the simultaneous matrix decompositions

$$\displaystyle \begin{aligned} X^{\langle \mu \rangle} = G_{\le \mu}^{} G_{\ge \mu+1}^T, \qquad \mu = 1, \dots, d-1, \end{aligned} $$
(9.23)

with ‘partial’ TT unfoldings

$$\displaystyle \begin{aligned}{}[G_{\le \mu}(i_1,\dots,i_\mu;\ell_\mu)] = [G_1(i_1) \cdots G_\mu(i_\mu)] \in {\mathbb{R}}^{n_1\cdots n_\mu \times r_\mu} \end{aligned}$$

and

$$\displaystyle \begin{aligned}{}[G_{\ge \mu+1}(i_{\mu+1},\dots,i_d;\ell_\mu)] = [G_{\mu+1}(i_{\mu+1}) \cdots G_d(i_d)]^T \in {\mathbb{R}}^{n_{\mu+1} \cdots n_d \times r_\mu}. \end{aligned}$$

From (9.23) it follows immediately that the rank condition (9.22) is necessary. Equality can be achieved using the constructive procedure leading to (9.20) with minimal matrix ranks in every step. Let us explain this for the first two steps. Clearly, the first step (9.17) is a rank revealing decomposition of X 〈1〉, so the rank of that matrix can be used as r 1. The minimal admissible r 2 in the second step (9.19) is the rank of the second unfolding \(H_1^{\langle 2 \rangle }\) of the tensor H 1. Let us show that this rank is not larger than the rank of X 〈2〉, and hence both are equal by (9.22). Indeed, if z = [z(i 3, …, i d)] is a vector of length n 3⋯n d such that X 〈2〉z = 0 and \(y = H_1^{\langle 2 \rangle} z\), then (9.17) yields \( 0 = \sum _{\ell _1=1}^{r_1}G_1(i_1,\ell _1)y(\ell _1,i_2), \) which implies y = 0, since G 1 has rank r 1. This implies \( \operatorname {\mathrm {rank}}(H_1^{\langle 2 \rangle }) \le\operatorname {\mathrm {rank}}(X^{\langle 2 \rangle })\). One can proceed with a similar argument for the subsequent ranks r 3, …, r d−1.

Theorem 9.2 justifies the following definition.

Definition 9.3

The vector r = (r 1, …, r d−1) with \(r_\mu =\operatorname {\mathrm {rank}}(X^{\langle \mu \rangle })\), μ = 1, …, d − 1 is called the TT rank of a tensor \(X \in {\mathbb {R}}^{n_1 \times \dots \times n_d}\).
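
For small tensors, the TT rank can be computed directly from Definition 9.3 by forming the unfoldings as reshapes; a minimal sketch (ours) is given below. With NumPy's C-ordering the row multi-index is enumerated with i_1 varying slowest, which is a valid convention since the ranks do not depend on how the row and column multi-indices are enumerated.

```python
import numpy as np

def unfolding(X, mu):
    """mu-th unfolding X^<mu>: rows indexed by (i_1,...,i_mu), columns by the rest."""
    n = X.shape
    return X.reshape(int(np.prod(n[:mu])), int(np.prod(n[mu:])))

rng = np.random.default_rng(6)
X = rng.standard_normal((4, 3, 5, 2))
tt_rank = [np.linalg.matrix_rank(unfolding(X, mu)) for mu in range(1, X.ndim)]
print(tt_rank)    # [4, 10, 2] for a generic tensor of this shape
```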

For matrices, the SVD-like decompositions \(X = USV^T\) with U and V having orthonormal columns are often particularly useful in algorithms since they provide orthonormal bases for the column and row space. This was for instance important for the projection onto the tangent space \(T_X \mathcal M_k\) at X, see (9.8) and (9.9). It is possible to impose similar orthogonality conditions in the TT decomposition. Recall that the TT decomposition of a tensor X is obtained by subsequent rank-revealing matrix decompositions for separating the indices i 1, …, i d from one another. This can actually be done from left to right, from right to left, or from both directions simultaneously, stopping at some middle index i μ. By employing QR (resp. LQ) matrix decompositions in every splitting step, it is not so difficult to show that one can find core tensors U 1, …, U d−1, as well as V 2, …, V d such that for every μ between 1 and d − 1 it holds

$$\displaystyle \begin{aligned} X^{\langle \mu \rangle} = U_{\le \mu}^{} S_\mu^{} V_{\ge \mu+1}^T, \end{aligned} $$
(9.24)

for some \(S_\mu \in {\mathbb {R}}^{r_\mu \times r_\mu }\), and

$$\displaystyle \begin{aligned} U_{\le \mu}^T U_{\le \mu}^{} = V_{\ge \mu+1}^T V_{\ge \mu + 1}^{} = I_{r_{\mu}}. \end{aligned} $$
(9.25)

Note that these orthogonality conditions inductively imply that the unfoldings \(U_\mu ^{\langle 2 \rangle }\) of the cores U μ have orthonormal columns, and that the unfoldings \(V_\mu ^{\langle 1 \rangle }\) of the cores V μ have orthonormal rows. In general, for a given μ, we call a TT decomposition with cores G ν = U ν for ν < μ, G μ(i μ) = U μ(i μ)S μ and G ν = V ν for ν ≥ μ + 1, and satisfying (9.25), a μ-orthogonal TT decomposition of X. It implies (9.24).

One advantage of such a μ-orthogonal TT decomposition is that it provides the orthogonal projections \( U_{\le \mu }^{} U_{\le \mu }^{T} \) and \( V_{\ge \mu +1}^{} V_{\ge \mu + 1}^{T} \) onto the column and row space of \(X^{\langle \mu \rangle}\) in the form of partial TT unfoldings that are hence easily applicable to tensors in TT decomposition. From these projections it will be possible to construct the tangent space projectors onto TT manifolds in Sect. 9.3.4.

Note that if a TT decomposition with some cores G 1, …, G d is already given, a μ-orthogonal decomposition can be obtained efficiently by manipulating the cores in a left-to-right, respectively right-to-left, sweep, where each step consists of elementary matrix operations and QR decompositions and costs \(O(dnr^4)\) operations in total. In particular, for switching from a μ-orthogonal to a (μ + 1)- or (μ − 1)-orthogonal decomposition, only one such step is necessary, costing \(O(nr^4)\). Observe that the costs are linear in the order d and the mode sizes n μ but fourth-order in the ranks r μ. In practice, this means the limit for r μ is about \(10^2\) to \(10^3\), depending on the computing power. We refer to [46, 72, 85] for more details on the implementation and properties of the orthogonalization of TT decompositions.
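
A left-to-right orthogonalization sweep can be sketched in a few lines (our own minimal implementation, not the optimized core-by-core routines of the cited references): each core is reshaped to its left unfolding, a QR decomposition is computed, and the triangular factor is pushed into the next core.

```python
import numpy as np

def left_orthogonalize(cores):
    """Make cores 1,...,d-1 left-orthogonal; the last core absorbs the remaining factor."""
    cores = [G.copy() for G in cores]
    for mu in range(len(cores) - 1):
        r0, nmu, r1 = cores[mu].shape
        Q, R = np.linalg.qr(cores[mu].reshape(r0 * nmu, r1))   # QR of the left unfolding
        rnew = Q.shape[1]
        cores[mu] = Q.reshape(r0, nmu, rnew)
        # push R into the next core: G_{mu+1}(i) <- R @ G_{mu+1}(i)
        cores[mu + 1] = np.einsum('ij,jkl->ikl', R, cores[mu + 1])
    return cores

# quick check on random cores: the partial product U_{<=2} has orthonormal columns
rng = np.random.default_rng(7)
n, r = [3, 4, 2, 5], [1, 2, 3, 2, 1]
cores = [rng.standard_normal((r[m], n[m], r[m + 1])) for m in range(4)]
ocores = left_orthogonalize(cores)
U12 = np.einsum('aib,bjc->ijc', ocores[0], ocores[1]).reshape(n[0] * n[1], -1)
print(np.linalg.norm(U12.T @ U12 - np.eye(U12.shape[1])))   # ≈ 0 up to rounding
```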

We conclude with the general remark that, algorithmically, the TT tensor decomposition is characterized by the concept of sweeping, which means that most operations are performed recursively from left to right, then right to left, and so on. Furthermore, the manipulations of the cores of a TT decomposition are based on basic linear algebra. We have already seen that building the decomposition itself or orthogonalizing a given decomposition can be achieved by a left-to-right sweep involving matrix decompositions only. Next we discuss the important operation of rank truncation, which is also achieved in this recursive way.

3.2 TT-SVD and Quasi Optimal Rank Truncation

Instead of QR decompositions, one can also use singular value decompositions for constructing a μ-orthogonal TT representation (9.24). One then obtains

$$\displaystyle \begin{aligned} X^{\langle \mu \rangle} = U_{\le \mu}^{} \Sigma_\mu^{} V_{\ge \mu+1}^T \end{aligned} $$
(9.26)

with \(\Sigma _{\mu } \in {\mathbb {R}}^{r_\mu \times r_\mu }\) being diagonal. In other words, (9.26) is an SVD of X μ.

The advantage of using SVDs for constructing the TT decomposition is that they can be truncated ‘on the fly’, that is, the index splitting decompositions like (9.17) and (9.19) are replaced by truncated SVDs to enforce a certain rank. Specifically, in a left-to-right sweep, at the μth step, let us assume a partial decomposition

$$\displaystyle \begin{aligned} \widetilde X^{\langle \mu - 1 \rangle} = U_{\le \mu - 1} H_{\mu - 1} \end{aligned}$$

with \(U_{\le \mu - 1}\) having orthonormal columns has been constructed. Here we write \(\widetilde X\), since the tensor may no longer equal X due to previous rank truncations. The next core U μ is then obtained from the left singular vectors of a truncated SVD of \(H_{\mu -1 }^{\langle 2 \rangle }\). This procedure is called the TT-SVD algorithm [86, 88]. Note that since \(U_{\le \mu - 1}\) has orthonormal columns, the truncated SVD of \(H_{\mu -1}^{\langle 2 \rangle }\) is implicitly also a truncated SVD of \(\widetilde X^{\langle \mu \rangle }\).

So if at every step of the TT-SVD algorithm instead of the exact rank r μ a smaller rank k μ is used, the result will be a tensor X k of TT rank (at most) k = (k 1, …, k d−1) in d-orthogonal TT format. It now turns out that this result provides a quasi-optimal approximation of TT rank at most k to the initial tensor X. Thus the TT-SVD algorithm plays a similar role for TT tensors as the SVD truncation for matrices.
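
For small tensors, a dense reference implementation of the TT-SVD algorithm with prescribed target ranks can be written as follows (our own sketch operating on the full array, hence only usable when the full tensor fits in memory):

```python
import numpy as np

def tt_svd(X, ranks):
    """TT-SVD: left-to-right sweep of truncated SVDs; ranks = (k_1, ..., k_{d-1})."""
    d, n = X.ndim, X.shape
    cores, H, r_prev = [], X.reshape(n[0], -1), 1
    for mu in range(d - 1):
        U, s, Vt = np.linalg.svd(H, full_matrices=False)
        k = min(ranks[mu], len(s))
        cores.append(U[:, :k].reshape(r_prev, n[mu], k))
        H = (s[:k, None] * Vt[:k, :]).reshape(k * n[mu + 1], -1)
        r_prev = k
    cores.append(H.reshape(r_prev, n[-1], 1))
    return cores

def tt_to_full(cores):
    X = cores[0]
    for G in cores[1:]:
        X = np.tensordot(X, G, axes=[[-1], [0]])
    return X.reshape(X.shape[1:-1])

rng = np.random.default_rng(8)
X = rng.standard_normal((5, 6, 4, 3))
Xk = tt_to_full(tt_svd(X, (2, 3, 2)))
# truncation error; large for a random tensor, but bounded as in (9.28)-(9.30)
print(np.linalg.norm(X - Xk) / np.linalg.norm(X))
# with the full TT rank (5, 12, 3) the tensor is reproduced exactly (up to rounding)
print(np.linalg.norm(X - tt_to_full(tt_svd(X, (5, 12, 3)))))
```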

To state this result, let us define the sets

$$\displaystyle \begin{aligned} \mathcal{M}_{\le \mathbf{k}} = \{ X \in {\mathbb{R}}^{n_1 \times \dots \times n_d} \colon \operatorname{\mathrm{TT-rank}}(X) \le \mathbf{k} \} \end{aligned}$$

of tensors of TT rank at most k = (k 1, …, k d−1), where the inequality for the rank vector is understood pointwise. By Theorem 9.2, this set is an intersection of low-rank matrix varieties:

$$\displaystyle \begin{aligned} \mathcal{M}_{\le \mathbf{k}} = \bigcap_{\mu = 1}^{d-1} \{ X \in {\mathbb{R}}^{n_1 \times \dots \times n_d} \colon \operatorname{\mathrm{rank}}(X^{\langle \mu \rangle}) \le k_\mu \}. \end{aligned} $$
(9.27)

Since each of the sets in this intersection is closed, the set \(\mathcal {M}_{\le \mathbf {k}}\) is also closed in \({\mathbb {R}}^{n_1 \times \dots \times n_d}\). As a result, every tensor X admits a best approximation by a tensor in the set \(\mathcal {M}_{\le \mathbf {k}}\), which we denote by \(X_{\mathbf {k}}^{\text{best}}\), that is,

$$\displaystyle \begin{aligned} \| X - X_{\mathbf{k}}^{\text{best}} \|{}_F = \min_{\operatorname{\mathrm{TT-rank}}(Y) \le \mathbf{k}} \| X - Y \|{}_F. \end{aligned}$$

The TT-SVD algorithm, on the other hand, can be seen as an alternating projection method for computing an approximation to X in the intersection (9.27).

The following theorem has been obtained in [88].

Theorem 9.4

Let \(X \in {\mathbb {R}}^{n_1 \times \dots \times n_d}\) have TT rank r and k ≤r. Denote by X k the result of the TT-SVD algorithm applied to X with target rank k. Let ε μ be the error in Frobenius norm committed in the μth truncation step. Then the following estimates hold:

$$\displaystyle \begin{aligned} \| X - X_{\mathbf{k}} \|{}_F^2 \le \sum_{\mu=1}^{d-1} \varepsilon_\mu^2, \end{aligned} $$
(9.28)

and

$$\displaystyle \begin{aligned} \varepsilon_\mu^2 \le \sum_{\ell > k_\mu} (\sigma_{\ell}^{\mu})^2 \le \| X - X_{\mathbf{k}}^{\mathrm{best}} \|{}_F^2, \end{aligned} $$
(9.29)

where \(\sigma _{\ell }^{\mu }\) are the singular values of the μth unfolding X μ.

The theorem has two immediate and equally important corollaries. The first of them is that the sequential rank truncation performed by the TT-SVD is, as announced above, a quasi-optimal projection:

$$\displaystyle \begin{aligned} \| X - X_{\mathbf{k}} \|{}_F \le \sqrt{d-1} \| X - X_{\mathbf{k}}^{\text{best}} \|{}_F. \end{aligned} $$
(9.30)

The second corollary is a complete characterization of low-rank approximability in the TT format. Since \(\| X - X_{\mathbf {k}}^{\mathrm {best}} \|{ }_F \le \| X - X_{\mathbf {k}} \|{ }_F\), the above inequalities imply

$$\displaystyle \begin{aligned} \| X - X_{\mathbf{k}}^{\mathrm{best}} \|{}_F^2 \le \sum_{\mu=1}^{d-1} \sum_{\ell_\mu > k_\mu} (\sigma_{\ell_\mu}^{\mu})^2. \end{aligned}$$

A tensor X will therefore admit good approximation by TT tensors of small rank if the singular values of all the unfoldings X 〈1〉, …, X d−1〉 decay sufficiently fast to zero. By (9.29) such a decay is also a necessary condition. Similar to the comment on matrix problems, the low-rank TT format is hence suitable in practice for tensor problems where the solution has such a property. Justifying this a-priori can be, however, a difficult task, especially for very large problems, and will not be discussed.

We now sketch a proof of Theorem 9.4. The main argument is the observation that while the best rank-k truncation of a matrix is a nonlinear operation, for every fixed input it acts as an orthogonal projection that can be realized by multiplying from the left the orthogonal projector onto the subspace spanned by the dominant k left singular vectors of that input. Therefore, before the μth truncation step, the current μth unfolding is the result of the μ − 1 previous orthogonal projections

$$\displaystyle \begin{aligned} \widetilde X^{\langle \mu \rangle} = \widetilde P_{\mu - 1} \cdots \widetilde P_1 X^{\langle \mu \rangle}, \end{aligned} $$
(9.31)

which, however, have all been realized by matrix multiplications from the left (since only the indices i 1, …, i μ−1 have been separated at this point). By comparing with the projected best rank-k μ approximation of X 〈μ〉, it is then easy to prove that \(\widetilde X^{\langle \mu \rangle }\) has no larger distance (in Frobenius norm) to the set of rank-k μ matrices than X 〈μ〉 itself. Hence

$$\displaystyle \begin{aligned} \varepsilon_\mu \le \min_{\operatorname{\mathrm{rank}}(Y^{\langle \mu \rangle}) \le k_\mu} \| X^{\langle \mu \rangle} - Y^{\langle \mu \rangle} \|{}_F \le \min_{\operatorname{\mathrm{TT-rank}}(Y) \le \mathbf{k}} \| X - Y\|{}_F = \| X - X_{\mathbf{k}}^{\text{best}} \|{}_F, \end{aligned}$$

where the second inequality is due to (9.27). Since the squared Frobenius distance of X μ to \(\mathcal {M}_{\le k_\mu }\) equals \(\sum _{\ell > k_\mu } (\sigma _{\ell }^{\mu })^2\), this proves the second statement (9.29) of the theorem.

Showing the first statement (9.28) is more subtle. One writes X k as the result of corresponding d − 1 orthogonal projections in tensor space:

$$\displaystyle \begin{aligned} X_{\mathbf{k}} = \mathcal P_{d-1} \cdots \mathcal P_1 X. \end{aligned}$$

The error can then be decomposed into

$$\displaystyle \begin{aligned} X - X_{\mathbf{k}} = (\mathcal P_{d-2}\cdots \mathcal P_1 X - \mathcal P_{d-1} \cdots \mathcal P_1 X) + (X - \mathcal P_{d-2}\cdots \mathcal P_1 X). \end{aligned}$$

The Frobenius norm of the first term is precisely ε d−1. One now has to show that both terms are orthogonal in order to proceed by induction. Indeed, an easy way to see that for every μ = 1, …, d − 1 the result \(\mathcal P_{\mu } \cdots \mathcal P_1 X\) after the μth truncation is still in the range of the operator \(\mathcal P_{\mu -1} \cdots \mathcal P_1\) is that the rank truncation of \(\widetilde X^{\langle \mu \rangle }\) as given by (9.31) may equally be achieved by multiplying from the right an orthogonal projector onto the span of the dominant k μ right singular vectors. Then it is clear that multiplying \(\widetilde P_{\mu - 1} \cdots \widetilde P_1\) from the left again will have no effect.

We conclude with two remarks. The first is that the TT-SVD algorithm can be implemented very efficiently if X is already given in a μ-orthogonal TT decomposition as in (9.24), say, with μ = 1, with moderate TT rank. Then in a left-to-right sweep it is sufficient to compute SVDs of single cores, which is computationally feasible if ranks are not too large. This is important in practice when using the TT-SVD algorithm as a retraction as explained below.

The second remark is that the target ranks in the TT-SVD procedure can be chosen adaptively depending on the desired accuracies ε μ. Thanks to Theorem 9.4 this gives full control of the final error. In this scenario the algorithm is sometimes called TT-rounding [86].

3.3 Manifold Structure

It may appear at this point that it is difficult to deal with the TT tensor format (and thus with its geometry) computationally, but this is not the case. Tensors of low TT rank can be handled very well by geometric methods in a way remarkably analogous to low-rank matrices. To do so, one first needs to reveal the geometric structure.

Similar to matrices, the set \(\mathcal {M}_{\le \mathbf {k}}\) of tensors of TT rank bounded by k = (k 1, …, k d−1) is a closed algebraic variety but not a smooth manifold. Let us assume that the set of tensors of fixed TT rank k, that is, the set

$$\displaystyle \begin{aligned} \mathcal{M}_{\mathbf{k}} = \{ X \in {\mathbb{R}}^{n_1 \times \dots \times n_d} \colon \operatorname{\mathrm{TT-rank}}(X) = \mathbf{k} \}, \end{aligned}$$

is not empty (the conditions for this are given in (9.32) below). Based on Theorem 9.2 it is then easy to show that \(\mathcal {M}_{\mathbf {k}}\) is relatively open and dense in \(\mathcal {M}_{\le \mathbf {k}}\). One may rightfully conjecture that \(\mathcal {M}_{\mathbf {k}}\) is a smooth embedded manifold in \({\mathbb {R}}^{n_1 \times \dots \times n_d}\). Note that while \(\mathcal {M}_{\mathbf {k}}\) is the intersection of smooth manifolds (arising from taking the conditions \( \operatorname {\mathrm {rank}}(X^{\langle \mu \rangle }) = k_\mu \) in (9.27)), this by itself does not prove that \(\mathcal {M}_{\mathbf {k}}\) is a smooth manifold.

Instead, one can look again at the global parametrization (G 1, …, G d)↦X of TT tensors given in (9.20) but with ranks k μ. This is a multilinear map τ from the linear parameter space \(\mathcal W_{\mathbf {k}} = {\mathbb {R}}^{k_0 \times n_1 \times k_1} \times \dots \times {\mathbb {R}}^{k_{d-1} \times n_d \times k_d}\) (with k 0 = k d = 1) to \({\mathbb {R}}^{n_1 \times \dots \times n_d}\) and its image is \(\mathcal M_{\le \mathbf {k}}\). One can now show that the condition \( \operatorname {\mathrm {TT-rank}}(X) = \mathbf {k}\) is equivalent to the conditions \( \operatorname {\mathrm {rank}}(G^{\langle 1 \rangle }_\mu ) = k_{\mu -1}\) and \( \operatorname {\mathrm {rank}}(G^{\langle 2 \rangle }_\mu ) = k_\mu \) on the unfoldings of core tensors, which defines a subset \(\mathcal {W}_{\mathbf {k}}^*\) of parameters. The conditions

$$\displaystyle \begin{aligned} k_{\mu-1} \le n_\mu k_\mu, \quad k_\mu \le n_\mu k_{\mu-1}, \quad \mu = 1,\dots,d, \end{aligned} $$
(9.32)

are necessary and sufficient for the existence of such cores, and hence for \(\mathcal {M}_{\mathbf {k}}\) being non-empty. Given these conditions the set \(\mathcal {W}_{\mathbf {k}}^*\) is open and dense in \(\mathcal {W}_{\mathbf {k}}\) and its image under τ is \(\mathcal {M}_{\mathbf {k}}\). Yet this parametrization is not injective. From the compact matrix product formula (9.21), we have already observed that the substitution

$$\displaystyle \begin{aligned} G_\mu(i_\mu) \rightarrow A_{\mu-1}^{-1} G_\mu(i_\mu) A_\mu^{}, \end{aligned} $$
(9.33)

where A μ are invertible r μ × r μ matrices, does not change the resulting tensor X. One can show that this is the only non-uniqueness in case that X has exact TT rank k, basically by referring to the equivalence with the simultaneous matrix decompositions (9.23). After removing this ambiguity by suitable gauging conditions, one obtains a locally unique parametrization of \(\mathcal {M}_{\mathbf {k}}\) and a local manifold structure [46].

An alternative approach, that provides a global embedding of \(\mathcal {M}_{\mathbf {k}}\), is to define an equivalence relation of equivalent TT decompositions of a tensor \(X \in \mathcal {M}_{\mathbf {k}}\). The equivalence classes match the orbits of the Lie group \(\mathcal {G}_{\mathbf {k}}\) of tuples (A 1, …, A d−1) of invertible matrices acting on \(\mathcal {W}_{\mathbf {k}}^*\) through (9.33). One can then apply a common procedure in differential geometry and first establish that the quotient space \(\mathcal {W}_{\mathbf {k}}^* / \mathcal G_{\mathbf {k}}\) possesses a smooth manifold structure such that the quotient map \(\mathcal {W}_{\mathbf {k}}^* \to \mathcal {W}_{\mathbf {k}}^* / \mathcal {G}_{\mathbf {k}}\) is a submersion. As a second step, one shows that the parametrization \(\mathcal {W}_{\mathbf {k}}^* / \mathcal {G}_{\mathbf {k}} \to \mathcal {M}_{\mathbf {k}}\) by the quotient manifold is an injective immersion and a homeomorphism in the topology of the ambient space \({\mathbb {R}}^{n_1 \times \dots \times n_d}\). It then follows from standard results (see, e.g., [66, Prop. 8.3]), that \(\mathcal {M}_{\mathbf {k}}\) is an embedded submanifold of \({\mathbb {R}}^{n_1 \times \dots \times n_d}\) and its dimension is

$$\displaystyle \begin{aligned} \dim(\mathcal M_{\mathbf{k}}) = \dim(\mathcal W_{\mathbf{k}}^*) - \dim (\mathcal G_{\mathbf{k}}) = 1 + \sum_{\mu = 1}^d \big( r_{\mu-1}n_\mu r_\mu - r_\mu^2 \big). \end{aligned} $$
(9.34)

The details of this construction can be found in [108].

3.4 Tangent Space and Retraction

In view of the practical geometric methods on the manifold \(\mathcal M_{\mathbf {k}}\) to be described later, we now consider the efficient representation of tangent vectors and the computation of retractions. These are quite analogous to the matrix case. First of all, using, e.g., the compact notation (9.21) for the multilinear and surjective parametrization (G 1, …, G d)↦X of the TT manifold defined on \(\mathcal W_{\mathbf {k}}^*\), it is clear that the tangent space \(T_X \mathcal M_{\mathbf {k}}\) at a point \(X \in \mathcal {M}_{\mathbf {k}}\) with TT cores \((G_1,\dots ,G_d) \in \mathcal W_{\mathbf {k}}^*\) (see Sect. 9.3.3) consists of all tensors ξ of the form

$$\displaystyle \begin{aligned} \xi(i_1, \dots, i_d) = \sum_{\mu=1}^d G_1(i_1) \cdots G_{\mu-1}(i_{\mu-1}) \, \delta G_\mu(i_\mu) \, G_{\mu+1}(i_{\mu+1}) \cdots G_d(i_d), \end{aligned} $$
(9.35)

where the cores \(\delta G_\mu \in {\mathbb{R}}^{r_{\mu-1} \times n_\mu \times r_\mu}\) at position μ can be chosen freely. In view of (9.34), this representation has too many degrees of freedom, even when fixing the TT decomposition G 1, …, G d of X, but this redundancy can be removed by gauging conditions.

A very reasonable way to do this is the following [56, 103]. We assume that the cores U 1, …, U d−1 and V 2, …, V d for the orthogonal decompositions (9.24)–(9.25) are available. Then, since the cores \(\delta G_\mu\) in (9.35) are entirely free, we do not lose generality by orthogonalizing every term of the sum around position μ:

$$\displaystyle \begin{aligned} \xi(i_1, \dots, i_d) = \sum_{\mu=1}^d U_1(i_1) \cdots U_{\mu-1}(i_{\mu-1}) \, \delta G_\mu(i_\mu) \, V_{\mu+1}(i_{\mu+1}) \cdots V_d(i_d). \end{aligned} $$
(9.36)

We now can add the gauging conditions

$$\displaystyle \begin{aligned} \big( U_\mu^{\langle 2 \rangle} \big)^T \delta G_\mu^{\langle 2 \rangle} = 0, \qquad \mu = 1, \dots, d-1, \end{aligned} $$
(9.37)

which remove \(r_\mu ^2\) degrees of freedom from each of the cores \(\delta G_1, \dots, \delta G_{d-1}\). The last core \(\delta G_d\) is not constrained.

What this representation of tangent vectors achieves is that all d terms in (9.36) now reside in mutually orthogonal subspaces T 1, …, T d. In other words, the tangent space \(T_X \mathcal {M}_{\mathbf {k}}\) is orthogonally decomposed:

$$\displaystyle \begin{aligned} T_X \mathcal{M}_{\mathbf{k}} = T_1 \oplus \dots \oplus T_d. \end{aligned}$$

This allows us to write the orthogonal projection onto \(T_X \mathcal M_{\mathbf {k}}\) as a sum of orthogonal projections onto the spaces T 1, …, T d. To derive these projections, consider first the operators that realize the orthogonal projections onto the column and row spaces of the unfoldings \(X^{\langle \mu \rangle}\). They read

$$\displaystyle \begin{aligned} \mathcal P_{\leq \mu}(Z) = \operatorname{\mathrm{Ten}}_\mu ( U_{\leq \mu}^{} U_{\leq \mu}^T Z^{\langle \mu \rangle}) \quad \text{and} \quad \mathcal P_{\geq \mu+1}(Z) = \operatorname{\mathrm{Ten}}_\mu ( Z^{\langle \mu \rangle} V_{\geq \mu+1}^{} V_{\geq \mu+1}^T), \end{aligned} $$
(9.38)

where \( \operatorname {\mathrm {Ten}}_\mu \) denotes the inverse operation of the μth unfolding so that \(\mathcal P_{\leq \mu }\) and \(\mathcal P_{\geq \mu +1}\) are in fact orthogonal projectors in the space \({\mathbb {R}}^{n_1 \times \dots \times n_d}\). Note that \(\mathcal P_{\leq \mu }\) and \(\mathcal P_{\geq \nu }\) commute when μ < ν. Furthermore, \(\mathcal P_{\leq \mu } \mathcal P_{\le \nu } = \mathcal P_{\le \nu }\) and \(\mathcal P_{\geq \nu } \mathcal P_{\ge \mu } = \mathcal P_{\ge \mu }\) if μ < ν.

By inspecting the different terms in (9.36) and taking the gauging (9.37) into account, it is not so difficult to verify that the projection on T 1 is given by

$$\displaystyle \begin{aligned} Z \mapsto (\mathcal I - \mathcal P_{\le 1})\mathcal P_{\ge 2}Z, \end{aligned}$$

the projection on T 2 is given by

$$\displaystyle \begin{aligned} Z \mapsto \mathcal P_{\le 1}(\mathcal I - \mathcal P_{\le 2})\mathcal P_{\ge 3}Z = (\mathcal P_{\le 1} - \mathcal P_{\le 2})\mathcal P_{\ge 3}Z \end{aligned}$$

and so forth. Setting \(\mathcal P_{\le 0} = \mathcal P_{\ge d+1} = \mathcal I\) (identity) for convenience, the overall projector \(\mathcal P_X\) onto the tangent space \(T_X \mathcal {M}_{\mathbf {k}}\) is thus given in one of the two following forms [72]:

$$\displaystyle \begin{aligned} \begin{aligned} \mathcal P_X &= \sum_{\mu=1}^{d-1} (\mathcal P_{\leq \mu-1} - \mathcal P_{\leq \mu}) \mathcal P_{\geq \mu+1} + \mathcal P_{\leq d-1} \\ &= \mathcal P_{\ge 2} + \sum_{\mu = 2}^{d} \mathcal P_{\le \mu-1} (\mathcal P_{\ge \mu+1} - \mathcal P_{\ge \mu}). \end{aligned} \end{aligned} $$
(9.39)

The formulas (9.39) for the projector onto the tangent space are conceptually insightful but still extrinsic. An efficient implementation of this projection, which actually computes the gauged components representing the resulting tangent vector, is possible if Z is itself a TT tensor of small ranks or a very sparse tensor. For example, due to the partial TT structure of the projectors (9.38), when computing \(\mathcal P_{\le \mu +1} Z\) the partial result from \(\mathcal P_{\le \mu }Z\) can be reused, and so on. The full details are cumbersome to explain, so we do not present them here and refer to [73, §7] and [103, §4].
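
For tiny examples, the projector (9.39) can also be checked by brute force with dense linear algebra. The following sketch (ours; it recomputes the bases U_{≤μ} and V_{≥μ+1} from SVDs of the unfoldings instead of from an orthogonalized TT decomposition) verifies that P_X is idempotent and reproduces X:

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = (3, 4, 3, 2), (2, 3, 2)                 # mode sizes and TT rank of X
d = len(n)

# Build a random tensor of exact TT rank k as a product of random cores.
r = (1,) + k + (1,)
cores = [rng.standard_normal((r[m], n[m], r[m + 1])) for m in range(d)]
X = cores[0]
for G in cores[1:]:
    X = np.tensordot(X, G, axes=[[-1], [0]])
X = X.reshape(n)

def unfold(Z, mu):
    return Z.reshape(int(np.prod(n[:mu])), int(np.prod(n[mu:])))

# Orthonormal bases of the column / row spaces of the unfoldings X^<mu>, mu = 1, ..., d-1.
bases = [np.linalg.svd(unfold(X, mu), full_matrices=False) for mu in range(1, d)]
U_le = {mu: bases[mu - 1][0][:, :k[mu - 1]] for mu in range(1, d)}        # U_{<= mu}
V_ge = {mu + 1: bases[mu - 1][2][:k[mu - 1], :].T for mu in range(1, d)}  # V_{>= mu+1}

def P_le(Z, mu):                                # projections (9.38), column spaces
    if mu == 0:
        return Z
    U = U_le[mu]
    return (U @ (U.T @ unfold(Z, mu))).reshape(n)

def P_ge(Z, mu):                                # projections (9.38), row spaces
    if mu == d + 1:
        return Z
    V = V_ge[mu]
    return ((unfold(Z, mu - 1) @ V) @ V.T).reshape(n)

def P_tangent(Z):                               # first formula in (9.39)
    out = P_le(Z, d - 1)
    for mu in range(1, d):
        out = out + P_le(P_ge(Z, mu + 1), mu - 1) - P_le(P_ge(Z, mu + 1), mu)
    return out

Z = rng.standard_normal(n)
PZ = P_tangent(Z)
print(np.linalg.norm(P_tangent(PZ) - PZ))       # ≈ 0 (idempotent)
print(np.linalg.norm(P_tangent(X) - X))         # ≈ 0 (X lies in its tangent space)
```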

It is also interesting to note that the tangent space \(T_X \mathcal M_{\mathbf {k}}\) itself contains only tensors of TT rank at most 2k. This is due to the structure (9.35) of tangent vectors as sums of TT decompositions that each vary in a single core [50]. Since X itself is in \(T_X \mathcal M_{\mathbf {k}}\), we directly write down the TT decomposition of X + ξ, since these are the tensors that need to be retracted in optimization methods. In terms of the left- and right-orthogonal cores U 1, …, U d−1 and V 2, …, V d from (9.25) we have [103]

$$\displaystyle \begin{aligned} (X + \xi)(i_1,\dots,i_d) = W_1(i_1) W_2(i_2) \cdots W_{d-1}(i_{d-1}) W_d(i_d), \end{aligned} $$
(9.40)

with the cores

$$\displaystyle \begin{aligned} W_\mu(i_\mu) = \begin{bmatrix} V_\mu(i_\mu) & 0 \\ \delta G_\mu(i_\mu) & U_\mu(i_\mu) \end{bmatrix} \end{aligned}$$

for μ = 2, …, d − 1, and

$$\displaystyle \begin{aligned} W_1(i_1) = \begin{bmatrix} \delta G_1(i_1) & U_1(i_1) \end{bmatrix}, \qquad W_d(i_d) = \begin{bmatrix} V_d(i_d) \\ U_d(i_d) S_d + \delta G_d(i_d) \end{bmatrix}, \end{aligned}$$

where S d is the matrix from the d-orthogonal decomposition (9.24) of X. The formula (9.40) is the TT analog to (9.14).

Finally, we mention that since \(\mathcal {M}_{\mathbf {k}}\) is a smooth manifold, the best approximation of X + ξ would in principle be a feasible retraction from the tangent space to the manifold. It is, however, not computationally available. The TT-SVD algorithm applied to X + ξ with target ranks k is a valid surrogate, which due to the TT representation (9.40) of tangent vectors is efficiently applicable. As discussed in Sect. 9.3.2, the TT-SVD procedure is essentially a composition of nonlinear projections onto low-rank matrix manifolds, which are locally smooth around a given \(X \in \mathcal M_{\mathbf {k}}\). This provides the necessary smoothness properties of the TT-SVD algorithm when viewed as a projection onto \(\mathcal {M}_{\mathbf {k}}\). On the other hand, the quasi-optimality of this projection as established in (9.30) implies the retraction property (9.13); see [103] for the details.

3.5 Elementary Operations and TT Matrix Format

Provided that the ranks are small enough, the TT representation introduced above allows one to store very high-dimensional tensors in practice and to access each entry individually by computing the matrix product (9.21). Furthermore, it is possible to efficiently perform certain linear algebra operations. For instance, the sum of two TT tensors X and \(\hat X\) with TT cores G 1, …, G d and \(\hat G_1, \dots , \hat G_d\) has the matrix product representation

$$\displaystyle \begin{aligned} (X + \hat X)(i_1,\dots,i_d) = \begin{bmatrix} G_1(i_1) & \hat G_1(i_1) \end{bmatrix} \begin{bmatrix} G_2(i_2) & 0 \\ 0 & \hat G_2(i_2) \end{bmatrix} \cdots \begin{bmatrix} G_{d-1}(i_{d-1}) & 0 \\ 0 & \hat G_{d-1}(i_{d-1}) \end{bmatrix} \begin{bmatrix} G_d(i_d) \\ \hat G_d(i_d) \end{bmatrix}. \end{aligned}$$

Hence the core tensors are simply augmented, and no addition at all is required when implementing this operation. Note that this shows that the TT rank of \(X + \hat X\) is bounded by the (entry-wise) sum of TT ranks of X and \(\hat X\).
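
The core augmentation for the sum of two TT tensors can be implemented directly; the sketch below (ours) checks the block construction against dense addition on a tiny example:

```python
import numpy as np

def tt_add(cores_a, cores_b):
    """Cores of X + X_hat: block concatenation, no floating point additions needed."""
    d, out = len(cores_a), []
    for mu, (A, B) in enumerate(zip(cores_a, cores_b)):
        ra0, n_mu, ra1 = A.shape
        rb0, _, rb1 = B.shape
        if mu == 0:                      # first core: concatenate along the column rank
            C = np.concatenate([A, B], axis=2)
        elif mu == d - 1:                # last core: concatenate along the row rank
            C = np.concatenate([A, B], axis=0)
        else:                            # middle cores: block-diagonal matrix slices
            C = np.zeros((ra0 + rb0, n_mu, ra1 + rb1))
            C[:ra0, :, :ra1] = A
            C[ra0:, :, ra1:] = B
        out.append(C)
    return out

def tt_to_full(cores):
    X = cores[0]
    for G in cores[1:]:
        X = np.tensordot(X, G, axes=[[-1], [0]])
    return X.reshape(X.shape[1:-1])

rng = np.random.default_rng(10)
n, ra, rb = [3, 4, 2], [1, 2, 3, 1], [1, 2, 2, 1]
A = [rng.standard_normal((ra[m], n[m], ra[m + 1])) for m in range(3)]
B = [rng.standard_normal((rb[m], n[m], rb[m + 1])) for m in range(3)]
print(np.linalg.norm(tt_to_full(tt_add(A, B)) - (tt_to_full(A) + tt_to_full(B))))  # ≈ 0
```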

As another example, the Frobenius inner product of X and \(\hat X\) can be implemented by performing the nested summation in

$$\displaystyle \begin{aligned} \langle X, \hat X \rangle_F = \sum_{i_1=1}^{n_1} \cdots \sum_{i_d=1}^{n_d} G_1(i_1) \cdots G_d(i_d) \hat G_d(i_d)^T \cdots \hat G_1(i_1)^T \end{aligned}$$

sequentially: first, the matrix

$$\displaystyle \begin{aligned} Z_d = \sum_{i_d=1}^{n_d} G_d(i_d) \hat G_d(i_d)^T \end{aligned}$$

is computed, then

$$\displaystyle \begin{aligned} Z_{d-1} = \sum_{i_{d-1} = 1}^{n_{d-1}} G_{d-1}(i_{d-1})\, Z_d\, \hat G_{d-1}(i_{d-1})^T \end{aligned}$$

and so on. These computations only involve matrix products and the final result Z 1 will be the desired inner product. The computational complexity for computing inner products is hence \(O(dnr^3)\) with \(n = \max n_\mu \) and \(r = \max \{ r_\mu , \hat r_\mu \}\), where \(\mathbf{r}\) and \(\mathbf {\hat r}\) are the TT ranks of X and \(\hat X\), respectively. As a special case, taking \(\hat X = X\) yields the Frobenius norm of a TT tensor.
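
The nested summation translates directly into code. The following NumPy sketch uses the same (assumed) core-storage convention as above and is meant only to illustrate the \(O(dnr^3)\) recursion, not to be an optimized implementation.

```python
import numpy as np

def tt_inner(G, H):
    """Frobenius inner product of two TT tensors with core lists G and H.

    Cores have shape (r_prev, n, r_next).  The nested sums are evaluated
    from the last mode to the first; the intermediate Z_mu is a small
    r_mu x r_hat_mu matrix.
    """
    d = len(G)
    # Z_d = sum_i G_d(i) * G_hat_d(i)^T, built from r x 1 slices
    Z = sum(G[-1][:, i, :] @ H[-1][:, i, :].T for i in range(G[-1].shape[1]))
    for mu in range(d - 2, -1, -1):
        # Z_mu = sum_i G_mu(i) Z_{mu+1} G_hat_mu(i)^T
        Z = sum(G[mu][:, i, :] @ Z @ H[mu][:, i, :].T
                for i in range(G[mu].shape[1]))
    return Z.item()   # final Z_1 is a 1 x 1 matrix
```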

Obviously, these elementary operations are crucial for applying methods from numerical linear algebra and optimization. However, in many applications the most important operation is the computation of the ‘matrix-vector product’, that is, in our case the action of a given linear operator \(\mathcal {A}\) on a tensor X. In order to use low-rank techniques like Riemannian optimization it is mandatory that the given operator \(\mathcal {A}\) can be applied efficiently. In some applications, sparsity of \(\mathcal A\) makes this possible. More generally, most low-rank formats for tensors come with a corresponding low-rank format for linear operators acting on such tensors that enables their efficient application. For the TT format, the corresponding operator format is called the TT matrix format [86] or matrix product operator (MPO) format [115].

A linear map \(\mathcal {A} \colon {\mathbb {R}}^{n_1 \times \dots \times n_d} \to {\mathbb {R}}^{n_1 \times \dots \times n_d}\) can be identified with an \((n_1 \cdots n_d) \times (n_1 \cdots n_d)\) matrix with entries \([\mathcal A (i_1,\dots ,i_d ; j_1,\dots ,j_d)]\), where both the rows and columns are indexed with multi-indices. The operator \(\mathcal {A}\) is then said to be in the TT matrix format with TT matrix ranks \((R_1, \dots , R_{d-1})\) if its entries can be written as

$$\displaystyle \begin{aligned} \mathcal{A}(i_1,\dots,i_d;j_1,\dots,j_d) = O_1(i_1,j_1) O_2(i_2,j_2) \cdots O_d(i_d,j_d), \end{aligned}$$

where O μ(i μ, j μ) are matrices of size R μ−1 × R μ (R 0 = R d = 1). Clearly, the TT matrix format becomes the usual TT format when treating \(\mathcal A\) as an \(n_1^2 \times \dots \times n_d^2\) tensor.

Note that if \(\mathcal A\) is an operator on matrices, that is, in the case d = 2, then \(O_1(i_1,j_1)\) and \(O_2(i_2,j_2)\) are just vectors of length \(R_1 = R\) (a row and a column vector, respectively), and the formula can be written as

$$\displaystyle \begin{aligned} \mathcal{A}(i_1,i_2;j_1,j_2) = \sum_{\ell=1}^R O_{1,\ell}(i_1,j_1) O_{2,\ell}(i_2,j_2). \end{aligned}$$

In other words, such an operator \(\mathcal A\) is a sum

$$\displaystyle \begin{aligned} \mathcal A = \sum_{\ell=1}^R A_\ell \otimes B_\ell \end{aligned}$$

of Kronecker products of the matrices \([A_\ell(i,j)] = [O_{1,\ell}(i,j)]\) and \([B_\ell(i,j)] = [O_{2,\ell}(i,j)]\).

An operator in the TT matrix format can be efficiently applied to a TT tensor, yielding a result in the TT format again. Indeed, let \(Y = \mathcal A(X)\), then a TT decomposition of Y  can be found using the properties of the Kronecker product ⊗ of matrices [86]:

$$\displaystyle \begin{aligned} Y(i_1,\dots,i_d) &= \sum_{j_1=1}^{n_1} \cdots \sum_{j_d = 1}^{n_d} \mathcal{A}(i_1,\dots,i_d; j_1,\dots,j_d) X(j_1,\dots,j_d) \\ &= \sum_{j_1=1}^{n_1} \cdots \sum_{j_d = 1}^{n_d} \big(O_1(i_1,j_1) \cdots O_d(i_d,j_d) \big) \otimes \big(G_1(j_1) \cdots G_d(j_d) \big)\\ &= \sum_{j_1=1}^{n_1} \cdots \sum_{j_d = 1}^{n_d} \big( O_1(i_1,j_1) \otimes G_1(j_1) \big) \cdots \big( O_d(i_d,j_d) \otimes G_d(j_d) \big) \\ &= \hat G_1(i_1) \cdots \hat G_d(i_d) \end{aligned} $$

with resulting TT cores

$$\displaystyle \begin{aligned} \hat G_\mu(i_\mu) = \sum_{j_\mu=1}^{n_\mu} O_\mu(i_\mu,j_\mu) \otimes G_\mu(j_\mu), \quad \mu = 1,\dots,d. \end{aligned}$$

Forming all these cores has a complexity of \(O(dnr^2R^2)\), where \(R = \max R_\mu \).
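
The construction of the cores \(\hat G_\mu\) is again straightforward to write down in code. The sketch below is an illustration only; the storage conventions (operator cores of shape (R_{μ-1}, n_μ, n_μ, R_μ) and tensor cores of shape (r_{μ-1}, n_μ, r_μ)) are assumptions made for this example.

```python
import numpy as np

def tt_matvec(O, G):
    """Apply a TT matrix (MPO) with cores O to a TT tensor with cores G.

    O[mu][:, i, j, :] is the R_prev x R_next matrix O_mu(i, j) and
    G[mu][:, j, :] is the r_prev x r_next matrix G_mu(j).  The result
    cores are sums of Kronecker products, so the TT ranks multiply.
    """
    d = len(G)
    H = []
    for mu in range(d):
        R1, n, _, R2 = O[mu].shape
        r1, _, r2 = G[mu].shape
        C = np.zeros((R1 * r1, n, R2 * r2))
        for i in range(n):
            # G_hat_mu(i) = sum_j O_mu(i, j) kron G_mu(j)
            C[:, i, :] = sum(np.kron(O[mu][:, i, j, :], G[mu][:, j, :])
                             for j in range(n))
        H.append(C)
    return H
```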

Note that \(\hat G_\mu(i_\mu)\) is a matrix of size \(r_{\mu-1}R_{\mu-1} \times r_\mu R_\mu\), so the TT ranks of \(\mathcal A\) and X are multiplied when applying \(\mathcal A\) to X. In algorithms where this operation is performed several times it can therefore become necessary to apply the TT-SVD procedure to the result as a post-processing step for reducing the ranks again. This is akin to rounding in floating point arithmetic and is therefore also called TT-rounding.

4 Optimization Problems

As we have explained above, the sets of matrices of fixed rank k and tensors of fixed TT rank k are smooth submanifolds \(\mathcal {M}_k \subset \mathbb {R}^{m \times n}\) and \(\mathcal {M}_{\mathbf {k}} \subset \mathbb {R}^{n_1 \times \cdots \times n_d}\), respectively. In this section we will see how to efficiently exploit these smooth structures in optimization problems.

Here and in the following \(\mathbb {V}\) denotes a finite-dimensional real vector space that, depending on the context, can be just \({\mathbb {R}}^{N}\), a space \({\mathbb {R}}^{m \times n}\) of matrices, or a space \({\mathbb {R}}^{n_1 \times \dots \times n_d}\) of tensors.

4.1 Riemannian Optimization

We start with a relatively general introduction to local optimization methods on smooth manifolds; see [2] for a broader but still self-contained treatment of this topic.

Let \(\mathcal {M}\) be a smooth submanifold in \(\mathbb {V}\), like \(\mathcal {M}_k\) or \(\mathcal {M}_{\mathbf {k}}\). Since \(\mathcal {M} \subset \mathbb {V}\), we can represent a point X on \(\mathcal {M}\) as an element of \(\mathbb {V}\). We can do the same for its tangent vectors \(\xi \in T_X \mathcal {M}\) since \(T_X \mathcal {M} \subset T_X \mathbb {V} \simeq \mathbb {V}\). This allows us to restrict any smoothly varying inner product on \(\mathbb {V}\) to \(T_X\mathcal {M}\) and obtain a Riemannian metric (⋅, ⋅)X on \(\mathcal {M}\). For simplicity, we choose the Euclidean metric:

$$\displaystyle \begin{aligned} (\xi, \eta)_X = \xi^T\eta, \qquad \xi,\eta \in T_X \mathcal{M} \subset \mathbb{V}. \end{aligned}$$

Consider now a smooth objective function \(f\colon \mathbb {V} \to \mathbb {R}\). If we restrict its domain to \(\mathcal {M}\), we obtain an optimization problem on a Riemannian manifold:

$$\displaystyle \begin{aligned} \min f(X) \quad \text{s.t.} \quad X \in \mathcal{M}. \end{aligned} $$
(9.41)

The aim of a Riemannian optimization method is to generate iterates X 1, X 2, … that remain on \(\mathcal {M}\) and converge to a (local) minimum of f constrained to \(\mathcal {M}\); see Fig. 9.2. It uses only local knowledge of f, like first and second-order derivatives. It thus belongs to the family of feasible methods for constrained optimization, which is a very useful property in our setting since general tensors or matrices in \(\mathbb {V}\) with arbitrary rank might otherwise be too large to store. A distinctive difference with other methods for constrained optimization is that a Riemannian optimization method has a detailed geometric picture of the constraint set \(\mathcal {M}\) at its disposal.

Fig. 9.2 A Riemannian optimization method generates iterates \(X_\ell\) starting from \(X_1\) to minimize f on a manifold \(\mathcal {M}\). The thin gray lines are level sets of f and \(X_*\) is a (local) minimum of f on \(\mathcal {M}\)

In its most basic form, a Riemannian optimization method is the update formula

$$\displaystyle \begin{aligned} X_{+} = R_{X}(t \, \xi), \end{aligned} $$
(9.42)

that is then repeated after replacing X by X +. The formula (9.42) is defined by the following ‘ingredients’; see also the left panel of Fig. 9.3.

Fig. 9.3 One step of a typical Riemannian optimization method with step direction ξ on the submanifold (left). Example of one step of steepest descent on the pullback (right)

  1. The search direction \(\xi \in T_{X} \mathcal {M}\) that indicates the direction of the update. As in Euclidean unconstrained optimization, the search direction can be obtained from first-order (gradient) or second-order (Hessian) information. Generally, f will locally decrease in the direction of ξ, that is, the directional derivative satisfies f′(X) ξ < 0.

  2. As explained in Sect. 9.2.4, the retraction \(R_{X} \colon T_X \mathcal {M} \to \mathcal {M}\) is a smooth map that carries the usual update X + t ξ from Euclidean space over to the manifold setting. Running over t, we thus replace a straight ray with a curve that (locally) lies on \(\mathcal {M}\) by construction. By the retraction property (9.13), the curve is rigid at t = 0, which means that R X(0) = X and \(\tfrac {d}{dt}R_X(t\xi )|{ }_{t=0} = \xi \) for all \(\xi \in T_X \mathcal {M}\).

  3. The step size t > 0 is usually chosen to guarantee sufficient decrease of f in X +, although non-monotone strategies also exist. Given ξ, the step size is typically found by line search strategies like backtracking, whereas an exact line search would provide a global minimum along direction ξ, if it exists. As an alternative one can use the trust-region mechanism to generate t ξ.

To explain the possible search directions ξ at a point \(X \in \mathcal {M}\), we take a slight detour and consider the pullback of f at X:

$$\displaystyle \begin{aligned} \widehat f_X = f \circ R_{X} \colon T_{X} \mathcal{M} \to \mathbb{R}. \end{aligned}$$

Since \(\widehat f_X\) is defined on the linear subspace \(T_{X} \mathcal {M}\), we can for example minimize it by the standard steepest descent method; see the right panel of Fig. 9.3. Observe that rigidity of R X implies \(\widehat f_X(0) = f(X)\). Hence, the starting guess is the zero tangent vector, which will get updated as

$$\displaystyle \begin{aligned} \xi_+ = 0 - \beta^\ell \beta_0 \, \nabla \widehat f_X(0) \end{aligned}$$

and Armijo backtracking determines the smallest ℓ = 0, 1, … such that

$$\displaystyle \begin{aligned} \widehat f_X(\xi_+) \leq \widehat f_X(0) - c \, \beta^\ell \beta_0 \| \nabla \widehat f_X(0) \|{}_F^2. \end{aligned} $$
(9.43)

Here, β = 1∕2, β 0 = 1, and c = 0.99 are standard choices. We could keep on iterating, but the crucial point is that in Riemannian optimization, we perform such a step only once, and then redefine the pullback function for X + = R X(ξ +) before repeating the procedure.

Formally, the iteration just described is clearly of the same form as (9.42), but it is much more fruitful to regard this procedure from a geometric point of view. To this end, observe that rigidity of R X also implies

$$\displaystyle \begin{aligned} \widehat f_X^{\prime}(0)\,\xi = f'(X)\, R_X^{\prime}(0)\,\xi = f'(X)\,\xi \qquad \text{for all }\xi \in T_X\mathcal{M}. \end{aligned}$$

With \(\mathcal P_{X}\colon \mathbb {V} \to T_X\mathcal {M}\) the orthogonal projection, we thus obtain

$$\displaystyle \begin{aligned} (\nabla \widehat f_X(0), \xi)_F = (\nabla f(X), \mathcal P_{X} (\xi))_F = (\mathcal P_{X} (\nabla f(X)), \xi)_F. \end{aligned} $$
(9.44)

These identities allow us to define the Riemannian gradient of f at X on \(\mathcal {M}\) simply as the tangent vector \(\mathcal P_{X} (\nabla f(X))\). This vector is conveniently also the direction of steepest ascent among all tangent vectors at X of the same length. We can thus define the Riemannian steepest descent method as

$$\displaystyle \begin{aligned} X_+ = R_X(-t \, \mathcal P_{X}(\nabla f(X))), \quad \text{with }t = \beta^\ell \beta_0. \end{aligned} $$
(9.45)

Here, Armijo backtracking again picks the smallest ℓ (since 0 < β < 1) such that

$$\displaystyle \begin{aligned} f(X_+) \leq f(X) - c \, \beta^\ell \beta_0 \| \mathcal P_{X} \nabla f(X) \|{}_F^2. \end{aligned}$$

Observe that we have arrived at the same iteration as above but instead of using a pullback we derived it directly from geometric concepts, where we have benefited from choosing the Euclidean metric on \(T_X \mathcal {M}\) for obtaining the simple formula (9.44) for the Riemannian gradient. Using the notion of second-order retractions, one can in this way derive the Riemannian Newton method either using the Riemannian Hessian with pullbacks or directly with the Riemannian connection. We refer to [2] for details, where also trust-region strategies are discussed.
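
For concreteness, the following sketch assembles the ingredients above (tangent space projection, retraction by truncated SVD, Armijo backtracking) for the fixed-rank matrix manifold \(\mathcal {M}_k\). It operates on small dense matrices for illustration only: the callbacks f and grad_f and the backtracking constants are placeholders, and an efficient implementation would work with the factored representations of X and of tangent vectors instead of dense matrices.

```python
import numpy as np

def proj_tangent(U, V, Z):
    """Orthogonal projection of Z onto the tangent space at X = U S V^T."""
    ZV, ZtU = Z @ V, Z.T @ U
    return ZV @ V.T + U @ ZtU.T - U @ (U.T @ ZV) @ V.T

def retract(Y, k):
    """Metric projection (truncated SVD of rank k) used as retraction."""
    P, s, Qt = np.linalg.svd(Y, full_matrices=False)
    return P[:, :k] @ np.diag(s[:k]) @ Qt[:k, :]

def riemannian_sd(f, grad_f, X, k, steps=100, beta=0.5, t0=1.0, c=1e-4):
    """Riemannian steepest descent X_+ = R_X(-t P_X(grad f(X))) with
    Armijo backtracking; f and grad_f are user-supplied callbacks."""
    for _ in range(steps):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        U, V = U[:, :k], Vt[:k, :].T
        xi = proj_tangent(U, V, grad_f(X))     # Riemannian gradient
        t, fX, g2 = t0, f(X), np.sum(xi**2)
        # backtrack until sufficient decrease holds
        while f(retract(X - t * xi, k)) > fX - c * t * g2 and t > 1e-12:
            t *= beta
        X = retract(X - t * xi, k)
    return X
```

Note that the truncated SVD is applied to X − tξ, which has rank at most 2k, so in a careful implementation it can be computed with direct linear algebra on small factors as discussed in Sect. 9.2.4.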

The ‘recipe’ above leaves a lot of freedom, which can be used to our advantage to choose computationally efficient components that work well in practice. Below we will focus on approaches that are ‘geometric versions’ of classical, non-Riemannian algorithms, yet can be implemented efficiently on a manifold so that they become competitive.

4.2 Linear Systems

We now explain how Riemannian optimization can be used to solve very large linear systems. Given a linear operator \(\mathcal L \colon \mathbb {V} \to \mathbb {V}\) and a ‘right-hand side’ \(B \in \mathbb {V}\), the aim is to calculate any X ex that satisfies the equation

$$\displaystyle \begin{aligned} \mathcal L(X_{\text{ex}}) = B. \end{aligned}$$

Since our strategy is optimization, observe that X ex can also be found as a global minimizer of the residual objective function

$$\displaystyle \begin{aligned} f_{LS}(X) = (\mathcal L(X)-B, \mathcal L(X)-B)_F = \| \mathcal L(X) - B \|{}_F^2. \end{aligned}$$

If in addition \(\mathcal L\) is symmetric and positive semi-definite on \(\mathbb {V}\), the same is true for the energy norm function

$$\displaystyle \begin{aligned} f_{\mathcal L}(X) &= (X,\mathcal L(X))_F - 2 (X,B)_F \\ &= (X-X_{\text{ex}}, \mathcal{L}(X-X_{\text{ex}}))_F - (X_{\text{ex}},\mathcal L(X_{\text{ex}}))_F. \end{aligned} $$

The second identity shows that \(f_{\mathcal L}(X)\) is indeed, up to a constant, the square of the error in the induced \(\mathcal L\)-(semi)norm. In the following, we will assume that \(\mathcal L\) is positive semi-definite and focus only on \(f=f_{\mathcal L}\) since it leads to better conditioned problems compared to f LS.
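
As a small illustration (not taken from the original text), the energy functional and its Euclidean gradient \(2(\mathcal L(X) - B)\), which, up to the factor 2 absorbed into the step size, is the quantity projected onto the tangent space in the Riemannian iterations below, can be evaluated as follows; the symmetric positive semi-definite operator \(\mathcal L\) is assumed to be given as a callback.

```python
import numpy as np

def f_energy(X, L, B):
    """Energy functional f_L(X) = <X, L(X)>_F - 2 <X, B>_F and its
    Euclidean gradient 2 (L(X) - B); L is a symmetric operator callback."""
    LX = L(X)
    value = np.vdot(X, LX) - 2.0 * np.vdot(X, B)
    grad = 2.0 * (LX - B)
    return value, grad
```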

When X ex is a large matrix or tensor, we want to approximate it by a low-rank matrix or tensor. Since we do not know X ex we cannot use the quasi-best truncation procedures as explained in Sect. 9.3. Instead, we minimize the restriction of \(f=f_{\mathcal L}\) onto an approximation manifold \(\mathcal {M}=\mathcal {M}_k\) or \(\mathcal {M}=\mathcal {M}_{\mathbf {k}}\):

$$\displaystyle \begin{aligned} \min f(X) \quad \text{s.t.} \quad X \in \mathcal{M}. \end{aligned}$$

This is exactly a problem of the form (9.41) and we can, for example, attempt to solve it with the Riemannian steepest descent algorithm. With \(X \in \mathcal {M}_{\mathbf {k}}\) and the definition of \(f_{\mathcal L}\), this iteration reads

$$\displaystyle \begin{aligned} X_{+} = R_{X}(-t \, \mathcal P_{X}(\mathcal L(X) - B)). \end{aligned} $$
(9.46)

When dealing with ill-conditioned problems, as they occur frequently with discretized PDEs, it is advisable to include some preconditioning. In the Riemannian context, one way of doing this is by modifying (9.46) to

$$\displaystyle \begin{aligned} X_{+} = R_{X}(-t \, \mathcal P_{X}(\mathcal{Q}(\mathcal L(X) - B))), \end{aligned} $$
(9.47)

where \(\mathcal {Q} \colon \mathbb {V} \to \mathbb {V}\) is a suitable preconditioner for \(\mathcal L\). This iteration is called truncated Riemannian preconditioned Richardson iteration in [65] since it resembles a classical Richardson iteration.

4.3 Computational Cost

Let us comment on which parts of (9.46) are typically the most expensive. Since the retraction operates on a tangent vector, it is cheap both for matrices and for tensors in TT format as long as their ranks are moderate; see Sect. 9.3. The remaining potentially expensive steps are therefore the application of the projector \(\mathcal {P}_{X}\) and the computation of the step size t.

Let \(Z = \mathcal L(X) - B\) be the residual. Recall that the projected tangent vector \(\xi = \mathcal P_{X}(Z)\) will be computed using (9.12) for matrices and (9.36)–(9.37) for TT tensors. As briefly mentioned before, these formulas are essentially many (unfolded) matrix multiplications that can efficiently be computed if Z is a sparse or low rank matrix/tensor.

Sparsity occurs for example in the matrix and tensor completion problems (see Sect. 9.6 later) where \(\mathcal L\) is the orthogonal projector \(\mathcal P_\Omega \) onto a sampling set Ω ⊂{1, …, n 1}×⋯ ×{1, …, n d} of known entries of an otherwise unknown matrix/tensor \(X_{\text{ex}} \in \mathbb {V}\). The matrix/tensor B in this problem is then the sparse matrix/tensor containing the known entries of X ex. Then if, for example, \(X = USV^T \in \mathcal {M}_k\) is a matrix in SVD-like format, the residual \(Z = \mathcal P_\Omega (X) - B\) is also a sparse matrix whose entries are computed as

$$\displaystyle \begin{aligned} Z(i_1,i_2) = \begin{cases} \sum_{\ell=1}^r U(i_1, \ell) S(\ell,\ell) V(i_2,\ell) - B(i_1,i_2) & \text{if }(i_1,i_2) \in \Omega, \\ 0 & \text{otherwise.} \end{cases} \end{aligned}$$

Hence the computation of \(\mathcal P_{X}(Z)\) now requires two sparse matrix multiplications \(ZV\) and \(Z^T U\); see [112]. For tensor completion, a little bit more care is needed but an efficient implementation for applying the tangent space projector exists; see [103, §4.2]. In all cases, the computation becomes cheaper the sparser Z is.
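
The following NumPy sketch illustrates this cost structure for matrix completion: the residual is only evaluated on Ω, and the products \(ZV\) and \(Z^TU\) that enter the tangent space projection are accumulated directly from the sparse triplets. The index arrays and function names are hypothetical, and the further mapping to the gauged parametrization of tangent vectors is omitted; see [112] for the actual implementation.

```python
import numpy as np

def completion_residual_products(U, S, V, Om_i, Om_j, b):
    """Residual Z = P_Omega(U S V^T) - B on the sampling set and the two
    products Z @ V and Z^T @ U needed for the tangent space projection.

    Om_i, Om_j are the row/column indices of the sampled entries and b
    their known values; only |Omega| entries of X are ever formed."""
    US = U @ S
    vals = np.einsum('pj,pj->p', US[Om_i, :], V[Om_j, :]) - b  # Z on Omega
    m, n, k = U.shape[0], V.shape[0], U.shape[1]
    ZV = np.zeros((m, k))
    ZtU = np.zeros((n, k))
    np.add.at(ZV, Om_i, vals[:, None] * V[Om_j, :])    # Z @ V
    np.add.at(ZtU, Om_j, vals[:, None] * U[Om_i, :])   # Z^T @ U
    return vals, ZV, ZtU
```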

If on the other hand \(\mathcal L\) is a low-rank TT matrix operator as explained in Sect. 9.3.5, and B is a low-rank TT tensor, then \(Z=\mathcal L(X) - B\) will also be of low TT rank since \(X\in \mathcal {M}_{\mathbf {k}}\). This makes the subsequent tangent space projection \(\mathcal P_X(Z)\) efficiently computable, as explained before. Operators with TT matrix structure are the most typical situation when TT tensors are used for parametric PDEs and for the Schrödinger equation; see again Sect. 9.6 later.

Regarding the computation of the step size t, we can approximate an exact line search method by minimizing the first-order approximation

$$\displaystyle \begin{aligned} {g}(t) = f(X-t\,\xi) \approx f(R_X(-t\, \xi)). \end{aligned}$$

For quadratic functions f, the function g(t) is a quadratic polynomial in t and can thus be exactly minimized. For instance, with \(f_{\mathcal L}\) it satisfies

$$\displaystyle \begin{aligned} {g}(t) = (\xi,\mathcal L(\xi))_F \, t^2 - 2(\xi,\mathcal L(X)-B)_F \, t + \text{constant}. \end{aligned}$$

Recall that, by (9.40), the matrix or TT rank of a tangent vector ξ is bounded by two times that of X. Hence, in the same situation as for \(\mathcal L\) above, these inner products can be computed very efficiently. It has been observed in [112] that with this initialization of the step size almost no extra backtracking is needed.
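
In code, the minimizer of this quadratic is a single ratio of inner products. The sketch below assumes that the direction ξ, the image \(\mathcal L(\xi)\), and the residual \(\mathcal L(X) - B\) have already been computed (in low-rank or sparse form, as discussed above).

```python
import numpy as np

def exact_step_size(xi, L_xi, residual):
    """Minimizer of g(t) = <xi, L(xi)>_F t^2 - 2 <xi, L(X)-B>_F t + const
    for the update X - t * xi; residual = L(X) - B."""
    num = np.vdot(xi, residual)   # <xi, L(X) - B>_F
    den = np.vdot(xi, L_xi)       # <xi, L(xi)>_F
    return num / den if den > 0 else 0.0
```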

4.4 Difference to Iterative Thresholding Methods

A popular algorithm for solving optimization problems with low-rank constraints, like matrix completion [49] and linear tensor systems [8, 54], is iterative hard thresholding (IHT). It is an iteration of the form

$$\displaystyle \begin{aligned} X_+ = \mathcal P_{\mathcal{M}}(X - t \, \nabla f(X)), \end{aligned}$$

where \(\mathcal P_{\mathcal {M}}\colon \mathbb {V} \to \mathcal {M}\) denotes the (quasi) projection onto the set \(\mathcal {M}\), like the truncated SVD for low-rank matrices and the TT-SVD for tensors as explained in Sects. 9.2.1 and 9.3.2. Variations of this idea also include alternating projection schemes like in [101]. Figure 9.4 compares IHT to Riemannian steepest descent. The main difference between the two methods is the extra tangent space projection \(\mathcal P_{X}\) of the negative gradient −∇f(X) in the Riemannian version. Thanks to this projection, the truncated SVD in the Riemannian case only has to be applied to a tangent vector, which can be done cheaply and reliably with direct linear algebra, as explained in Sects. 9.2.4 and 9.3.4. In IHT, on the other hand, the truncated SVD is applied to a generally unstructured search direction and needs to be implemented with sparse or randomized linear algebra, which is typically less reliable and more expensive.
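
For comparison with the Riemannian iteration sketched earlier, one IHT step in a dense NumPy toy version reads as follows; at realistic sizes, the SVD of the unstructured matrix X − t ∇f(X) is precisely the expensive and less reliable part discussed above.

```python
import numpy as np

def iht_step(X, grad, t, k):
    """One iterative hard thresholding step X_+ = P_M(X - t * grad),
    where P_M is the truncated SVD of rank k."""
    Y = X - t * grad                   # generally unstructured full matrix
    P, s, Qt = np.linalg.svd(Y, full_matrices=False)
    return P[:, :k] @ np.diag(s[:k]) @ Qt[:k, :]
```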

Fig. 9.4 Iterative hard thresholding (IHT) and Riemannian steepest descent (SD) for fixed-rank matrices and tensors

This difference becomes even more pronounced with preconditioning for linear systems \(\mathcal {L}(X)=B\) as in (9.47). As an approximate inverse of \(\mathcal {L}\), the operator \(\mathcal {Q}\) there typically has high TT matrix rank, and so the additional tangent space projector in (9.47) is very beneficial compared to the seemingly simpler truncated preconditioned Richardson method

$$\displaystyle \begin{aligned} X_+ = \mathcal P_{\mathcal{M}}(X - t \, \mathcal{Q} (\mathcal{L}(X)-B)). \end{aligned}$$

The numerical experiments from [65] confirm this behavior. For example, in Fig. 9.5 we see the convergence history when solving a Laplace-type equation with Newton potential in the low-rank Tucker format, which has not been discussed, but illustrates the same issue. Since the Newton potential is approximated by a rank 10 Tucker matrix, applying \(\mathcal {Q}\mathcal {L}\) greatly increases the rank of the argument. Thanks to the tangent space projections, the time per iteration is reduced significantly and there is virtually no change in the number of iterations needed.

Fig. 9.5 Convergence of the Riemannian and non-Riemannian versions of preconditioned Richardson iteration. The approximation quality of the preconditioner is proportional to the k value. (We do not explain the other Riemannian method “approx. Newton”.) Picture taken from [65]. Copyright 2016 Society for Industrial and Applied Mathematics. Reprinted with permission. All rights reserved

There is another benefit of Riemannian algorithms over more standard rank-truncated schemes. Thanks to the global smoothness of the fixed-rank manifolds \(\mathcal {M}\), it is relatively straightforward to accelerate manifold algorithms using non-linear CG or BFGS, and to perform efficient line search. For example, Fig. 9.6 compares the Riemannian non-linear CG algorithm from [64] to a specific IHT algorithm based on nuclear norm relaxation from [101] for the low-rank tensor completion problem as explained in Sect. 9.6.3. We can see that the Riemannian algorithm takes fewer iterations and less time. While this example is again for fixed-rank Tucker tensors, the same conclusion is also valid for fixed-rank matrices and TT tensors; see, e.g., [112, Fig. 5.1].

Fig. 9.6 Relative testing error as a function of the number of iterations (left) and of time (right), for an IHT algorithm (denoted “Hard”) and a Riemannian method (denoted “geomCG”) for different sampling sizes when solving the tensor completion problem. Picture taken from [64]

4.5 Convergence

Theoretical results for Riemannian optimization closely parallel the results from Euclidean unconstrained optimization. In particular, with standard line search or trust-region techniques, limit points are guaranteed to be critical points, and additional Hessian information can enforce attraction to local minimizers; see [2]. For example, when the initial point X 1 is sufficiently close to a strict local minimizer \(X_*\) of f on \(\mathcal {M}\), Riemannian gradient descent will converge exponentially fast. Specifically, if the Riemannian Hessian of f at \(X_*\) has all positive eigenvalues λ p ≥⋯ ≥ λ 1 > 0, then the iterates \(X_\ell\) with exact line search satisfy the following asymptotic Q-linear convergence rate [74]:

$$\displaystyle \begin{aligned} \lim_{\ell \to \infty} \frac{\| X_{\ell+1} - X_* \|{}_F}{\| X_\ell - X_* \|{}_F} = \frac{\lambda_p - \lambda_1}{\lambda_p+ \lambda_1} \leq 1-\kappa, \qquad \text{with }\kappa = \frac{\lambda_1}{\lambda_p}. \end{aligned}$$

With more practical line searches, like those that ensure the Armijo condition (9.43), this rate deteriorates but remains 1 − O(κ); see [2]. As in the Euclidean non-convex case, non-asymptotic results that are valid for arbitrary X 1 can only guarantee algebraic rates; see [12]. If, however, X 1 lies in a region where f is locally convex, then fast exponential convergence is again guaranteed; see [107]. Results of this kind but specific to matrix completion are available in [117].

For particular problems, one can show that gradient schemes converge to the global minimum when started at any X 1. The main idea is that, while these problems are not convex, their optimization landscape is still favorable for gradient schemes in the sense that all critical points are either strict saddle points or close to a global minimum. Strict saddles are characterized as having directions of sufficient negative curvature so that they push away the iterates of a gradient scheme that might be attracted to such a saddle [76]. This property has been established in detail for matrix sensing with RIP (restricted isometry property) operators, which are essentially very well-conditioned when applied to low-rank matrices. Most of the results are formulated for particular non-Riemannian algorithms (see, e.g., [90]), but landscape properties can be directly applied to Riemannian algorithms as well; see [18, 110]. As far as we know, such landscape results have not been generalized to TT tensors but related work on completion exists [93].

4.6 Eigenvalue Problems

Another class of optimization problems arises when computing extremal eigenvalues of Hermitian operators. This is arguably the most important application of low-rank tensors in theoretical physics since it includes the problem of computing ground-states (eigenvectors of minimal eigenvalues) of the Schrödinger equation.

The main idea is similar to the previous section. Suppose we want to compute an eigenvector \(X \in \mathbb {V}\) of a minimal eigenvalue of the Hermitian linear operator \(\mathcal H \colon \mathbb {V} \to \mathbb {V}\). Then, instead of minimizing the Rayleigh quotient on \(\mathbb {V}\), we restrict the optimization space to an approximation manifold:

$$\displaystyle \begin{aligned} \min \rho(X) = \frac{(X, \mathcal H(X))_F}{(X,X)_F} \quad \text{s.t.} \quad X \in \mathcal{M}. \end{aligned}$$

Since ρ is invariant under scaling of X, the normalization (X, X)F = 1 can also be imposed as a constraint:

$$\displaystyle \begin{aligned} \min \widetilde{\rho}(X) = (X, \mathcal H(X))_F \quad \text{s.t.} \quad X \in \widetilde{\mathcal{M}} = \mathcal{M} \cap \{ X \colon (X,X)_F=1 \}. \end{aligned}$$

This intersection is transversal in cases when \(\mathcal {M}\) is a manifold of low-rank matrices or tensors, so \(\widetilde {\mathcal {M}}\) is again a Riemannian submanifold of \(\mathbb {V}\) with a geometry very similar to that of \(\mathcal {M}\); see [91] for details on the matrix case. One can now proceed and apply Riemannian optimization to either problem formulation.
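
A minimal sketch of the resulting gradient computation: given \(\mathcal H(X)\) and a tangent space projection at X (supplied as a callback), the Riemannian gradient of ρ is the projection of the Euclidean gradient \(2(\mathcal H(X) - \rho(X)X)/(X,X)_F\). The function and argument names are placeholders.

```python
import numpy as np

def rayleigh_riemannian_grad(X, H_X, proj_tangent):
    """Rayleigh quotient rho(X) and its Riemannian gradient on a fixed-rank
    manifold; H_X = H(X), proj_tangent projects onto T_X M."""
    nrm2 = np.vdot(X, X)
    rho = np.vdot(X, H_X) / nrm2
    euclid_grad = (2.0 / nrm2) * (H_X - rho * X)   # Euclidean gradient of rho
    return rho, proj_tangent(euclid_grad)
```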

Standard algorithms for eigenvalue problems typically do not use pure gradient schemes. Thanks to the specific form of the problem, it is computationally feasible to find the global minimum of ρ on a small subspace of \(\mathbb {V}\). This allows one to enrich the gradient direction with additional directions in order to accelerate convergence. Several strategies of this type exist, of which LOBPCG and Jacobi–Davidson have been extended to low-rank matrices and tensors. In particular, thanks to the multilinear structure of the TT format, it is feasible to minimize globally over a subspace in one of the TT cores. Proceeding in a sweeping manner, one can mimic the Jacobi–Davidson method for TT tensors; see [91, 92].

5 Initial Value Problems

Instead of approximating only a single (very large) matrix or tensor X by low rank, we now consider the task of approximating a time-dependent tensor X(t) directly by a low-rank tensor Y (t). The tensor X(t) is either given explicitly, or, more interestingly, as the solution of an initial value problem (IVP)

$$\displaystyle \begin{aligned} \dot X(t) = F(X(t)), \qquad X(t_0) = X_0 \in \mathbb{V}, \end{aligned} $$
(9.48)

where \(\dot X\) means dX∕dt. As usual, we assume that F is Lipschitz continuous with constant Λ,

$$\displaystyle \begin{aligned} \| F(X) - F(Z) \| \leq \Lambda \, \| X - Z \| \quad \text{for all }X,Z \in \mathbb{V}, \end{aligned} $$
(9.49)

so that the solution to (9.48) exists at least on some interval [t 0, T]. We take F to be autonomous, which can always be arranged by adding t as an extra variable. For simplicity, we assume that the desired rank for the approximation Y (t) is known and constant. In most applications it will be important, however, that the numerical method that computes Y (t) is robust to overestimation of the rank and/or allows for adapting the rank to improve the accuracy.

The aim is to obtain good approximations of X(t) on the whole interval [t 0, T]. This is usually done by computing approximations \(X_\ell \approx X(t_0 + \ell h)\) with h the time step. Classical time stepping methods for this include Runge–Kutta and BDF methods. Sometimes, one is only interested in the steady-state solution, that is, X(t) for t →∞. This is for example the case for gradient flows, where F is the negative gradient of an objective function \(f\colon \mathbb {V} \to {\mathbb {R}}\). The steady state solution of (9.48) is then a critical point of f, for example, a local minimizer. However, in such situations, it may be better to directly minimize f using methods from numerical optimization as explained in Sect. 9.4.

5.1 Dynamical Low-Rank Approximation

We now explain how to obtain a low-rank approximation to (9.48) without needing to first solve for X(t). Given an approximation submanifold \(\mathcal {M} = \mathcal {M}_k\) or \(\mathcal {M} = \mathcal {M}_{\mathbf {k}}\) of fixed-rank matrices or tensors, the idea is to replace \(\dot X\) in (9.48) by the tangent vector to \(\mathcal {M}\) that is closest to F(X); see also Fig. 9.7. It is easy to see that for the Frobenius norm, this tangent vector is \(\mathcal P_{X}(F(X))\) where \(\mathcal P_{X}\colon \mathbb {V} \to T_{X} \mathcal {M}\) is the orthogonal projection. Applying this substitution at every time t, we obtain a new IVP

$$\displaystyle \begin{aligned} \dot Y(t) = \mathcal P_{Y(t)}(F(Y(t))), \qquad Y(t_0) = Y_0, \end{aligned} $$
(9.50)

where \(Y_0 = \mathcal P_{\mathcal {M}}(X_0)\) is a quasi-best approximation of X 0 in \(\mathcal {M}\). In [59], the IVP (9.50) (or its solution) is aptly called the dynamical low-rank approximation (DLRA) of X(t). Thanks to the tangent space projection, the solution Y (t) will belong to \(\mathcal {M}\) as long as \(\mathcal P_{Y(t)}\) exists, that is, until the rank of Y (t) drops. In the following we assume that (9.50) can be integrated on [t 0, T].

Fig. 9.7 Graphical depiction of the dynamical low-rank approximation Y (t) at time t

The DLRA (9.50) can equivalently be defined in weak form as follows: find, for each t ∈ [t 0, T], an element \(Y(t) \in \mathcal {M}\) such that

$$\displaystyle \begin{aligned} (\dot Y(t) - F(Y(t)),\ \xi)_F = 0 \qquad \text{for all } \xi \in T_{Y(t)} \mathcal{M}, \end{aligned} $$
(9.51)

and \(Y(0) = Y_0 \in \mathcal {M}\). Observe that this can be seen as a time-dependent Galerkin condition since \(T_{Y(t)}\mathcal {M}\) is a linear subspace that varies with t.

In the concrete case of low-rank matrices, DLRA appeared first in [59]. The same approximation principle, called dynamically orthogonal (DO) approximation, was also proposed in [94] for time-dependent stochastic PDEs. It was shown in [32, 83] that DO satisfies (9.50) after discretization of the stochastic and spatial domain. In theoretical physics, the time-dependent variational principle (TDVP) from [41] seems to be the first application of DLRA for simulating spin systems with uniform MPS, a variant of TT tensors. It is very likely that similar ideas appeared well before, since obtaining approximations in a manifold from testing with tangent vectors as in (9.51) goes back as far as 1930 with the works of Dirac [22] and Frenkel [33]. We refer to [69] for a mathematical overview of this idea in quantum physics.

5.2 Approximation Properties

The local error at t of replacing (9.48) by (9.50) is minimized in Frobenius norm by the choice \(\dot Y(t) = \mathcal P_{Y(t)}(F(Y(t)))\); see also Fig. 9.7. In order to quantify the effect of this approximation on the global error at the final time T, the simplest analysis is to assume, as in [57, 58], that the vector field F is ε-close to the tangent bundle of \(\mathcal {M}\), that is,

$$\displaystyle \begin{aligned} \| F(Y(t)) - \mathcal P_{Y(t)} F(Y(t)) \|{}_F \leq \varepsilon \quad \text{for all }t \in [t_0, T]. \end{aligned}$$

A simple comparison of IVPs then gives

$$\displaystyle \begin{aligned} \| Y(t) - X(t) \|{}_F \leq e^{\lambda t} \delta + (e^{\lambda t} - 1){\lambda}^{-1} \varepsilon = O(\varepsilon + \delta), \end{aligned} $$
(9.52)

where \(\|X_0 - Y_0\|_F \leq \delta\) and λ is a one-sided Lipschitz constant of F satisfying

$$\displaystyle \begin{aligned} ( X-Z, F(X) - F(Z) )_F \leq \lambda \, \|X-Z\|{}_F^2 \quad \text{for all }X,Z \in \mathbb{V}. \end{aligned}$$

From (9.52), we observe that Y (t) is guaranteed to be a good approximation of X(t) but only for (relatively) short time intervals when λ > 0.

Alternatively, one can compare Y (t) with a quasi-best approximation \(Y_{\mathrm{qb}}(t) \in \mathcal {M}\) to X(t). Assuming Y qb(t) is continuously differentiable on [t 0, T], this can be done by assuming that \(\mathcal {M}\) is not too curved along Y qb(t). In the matrix case, this means that the kth singular value of Y qb(t) is bounded from below, i.e., there exists ρ > 0 such that σ k(Y qb(t)) ≥ ρ for t ∈ [t 0, T]. Now a typical result from [59] is as follows: Let F be the identity operator and assume \(\|\dot X(t)\|_F \leq c\) and ∥X(t) − Y qb(t)∥F ≤ ρ∕16 for t ∈ [t 0, T]. Then,

$$\displaystyle \begin{aligned} \|Y(t) - Y_{\mathrm{qb}}(t) \|{}_F \leq 2 \beta e^{\beta t} \int_0^t \|Y_{\mathrm{qb}}(s) - X(s) \|{}_F ds \qquad \text{with }\beta = 8 c \rho^{-1} \end{aligned}$$

for \(t_0 \leq t \leq \min (T,0.55 \beta ^{-1})\). Hence, the approximation Y (t) stays close to Y qb(t) for short times. We refer to [59] for additional results that also include the case where F is not the identity. Most of the analysis was also extended to manifolds of fixed TT rank (as well as of fixed Tucker and hierarchical Tucker rank) in [60, 73] and to Hilbert spaces [83].

We remark that these a-priori results only hold for (very) short times. In practice, they are overly pessimistic and in actual problems the accuracy is typically much higher than theoretically predicted; see [57, 71, 72, 83, 94], and the numerical example from Fig. 9.8 further below.

Fig. 9.8 Errors of the dynamical low-rank approximation for (9.55) integrated by a standard explicit Runge–Kutta scheme for (9.53) and the projector-splitting integrator (9.58). Picture taken from [58]. Copyright 2016 Society for Industrial and Applied Mathematics. Reprinted with permission. All rights reserved

5.3 Low-Dimensional Evolution Equations

The dynamical low-rank problem (9.50) is an IVP that evolves on a manifold \(\mathcal {M}\) of fixed-rank matrices or tensors. In relevant applications, the rank will be small and hence we would like to integrate (9.50) by exploiting that \(\mathcal {M}\) has low dimension.

Let us explain how this is done for m × n matrices of rank k, that is, for the manifold \(\mathcal {M}_k\). Then \( \operatorname {\mathrm {rank}}(Y(t)) = k\) and we can write Y (t) = U(t)S(t)V (t)T where \(U(t) \in \mathbb {R}^{m \times k}\) and \(V(t) \in \mathbb {R}^{n \times k}\) have orthonormal columns and \(S(t) \in \mathbb {R}^{k \times k}\). This is an SVD-like decomposition but we do not require S(t) to be diagonal. The aim is now to formulate evolution equations for U(t), S(t), and V (t).

To this end, recall from (9.10) that for fixed U, S, V  every tangent vector has a unique decomposition

$$\displaystyle \begin{aligned} \xi = \delta U\, S V^T + U\, \delta S\, V^T + U S\, \delta V^T \qquad \text{with } U^T \delta U = 0, \quad V^T \delta V = 0. \end{aligned}$$

Since \(\dot Y(t) \in T_{Y(t)} \mathcal{M}_k\), we can isolate \(\dot U\), \(\dot S\), and \(\dot V\) by applying (9.12) with Z = F(Y ). The result is a new IVP equivalent to (9.50) but formulated in the factors:

$$\displaystyle \begin{aligned} \begin{aligned} \dot U &= (I - UU^T)\, F(Y)\, V S^{-1}, \\ \dot S &= U^T F(Y)\, V, \\ \dot V &= (I - VV^T)\, F(Y)^T U S^{-T}. \end{aligned} \end{aligned} $$
(9.53)

(Here, U, S, V  all depend on t.) Observe that this is a set of coupled non-linear ODEs, parametrized using (m + n)k + k 2 entries.
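
A direct NumPy sketch of the right-hand side of (9.53) reads as follows, with U of size m × k, S of size k × k, V of size n × k, and F a callback returning F(Y) as a dense matrix (which is only sensible for small illustrative problems). The explicit inverse of S already hints at the stiffness issue discussed below.

```python
import numpy as np

def dlra_rhs(U, S, V, F):
    """Right-hand side of the factored dynamical low-rank ODEs (9.53)
    for Y = U S V^T; F is a callback returning F(Y) as a dense matrix."""
    FY = F(U @ S @ V.T)
    FYV, FYtU = FY @ V, FY.T @ U
    Sinv = np.linalg.inv(S)                      # source of stiffness
    S_dot = U.T @ FYV
    U_dot = (FYV - U @ (U.T @ FYV)) @ Sinv
    V_dot = (FYtU - V @ (V.T @ FYtU)) @ Sinv.T
    return U_dot, S_dot, V_dot
```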

The ODE (9.53) appeared in [59, Prop. 2.1]. For DO [94], one uses the parametrization Y (t) = U(t)M(t)T where only U has orthonormal columns and obtains

$$\displaystyle \begin{aligned} \dot U = (I - UU^T)\, F(Y)\, M (M^T M)^{-1}, \qquad \dot M = F(Y)^T U. \end{aligned} $$
(9.54)

These two coupled non-linear ODEs are very similar to (9.53) with respect to theoretical and numerical behavior. In particular, they also involve the gauge (normalization) condition \(U^T \dot U = 0\) and an explicit inverse \((M^TM)^{-1}\).

The derivation of these ODEs can be generalized to TT tensors with factored and gauged parametrizations for the tangent vectors. The equations are more tedious to write down explicitly, but relatively easy to implement. We refer to [73] for details. See also [4, 60] for application to the (hierarchical) Tucker tensor format.

For matrices and for tensors, the new IVPs have the advantage of being formulated in low dimensional parameters. However, they both suffer from a major problem: the time step in explicit methods needs to be proportional to the smallest positive singular value of (each unfolding of) Y (t). If these singular values become small (which is typically the case, since the DLRA approach by itself is reasonable for those applications where the true solution exhibits fast decaying singular values), Eq. (9.53) is very stiff. The presence of the terms S −1 in (9.53) and (M TM)−1 in (9.54) already suggests this, and numerical experiments make it very clear. In Fig. 9.8, we report on the approximation errors for DLRA applied to the explicit time-dependent matrix

$$\displaystyle \begin{aligned} A(t) = \exp(t W_1) \, \exp(t) \, D \, \exp(t W_2), \qquad 0 \leq t \leq 1,\end{aligned} $$
(9.55)

with W 1, W 2 being skew-symmetric of size 100 × 100 and D a diagonal matrix with entries 2−1, ⋯ , 2−100; see [58] for details. The left panel shows the results of a Runge–Kutta method applied to the resulting system (9.53). The method succeeds in computing a good low-rank approximation when the step size h is sufficiently small, but becomes unstable when h is larger than the smallest singular value of Y (t). Due to this step-size restriction it hence becomes very expensive when aiming for accurate low-rank approximations. See also [57, Fig. 3] for similar results.

One solution would be to use expensive implicit methods or an ad-hoc regularization of S −1. In the next subsection, a different approach is presented that is based on a splitting of the tangent space projector, and is robust to small singular values.

5.4 Projector-Splitting Integrator

Instead of immediately aiming for an ODE in the small factors U, S, V , the idea of the splitting integrator of [71] is to first apply a Lie splitting to the orthogonal projector \(\mathcal P_Y\) in

$$\displaystyle \begin{aligned} \dot Y = \mathcal P_{Y}(F(Y)), \end{aligned} $$
(9.56)

and then—thanks to some serendipitous observation—obtain low dimensional ODEs at a later stage. For instance, in the matrix case, as stated in (9.9), the projector can be written as

$$\displaystyle \begin{aligned} \mathcal P_Y(Z) = Z VV^T - UU^T Z VV^T + UU^T Z \quad \text{with }Y = U S V^T. \end{aligned} $$
(9.57)

When we integrate each of these three terms consecutively (labeled a, b, c) from t 0 to t 1 = t 0 + h, we obtain the following scheme (all matrices depend on time):

$$\displaystyle \begin{aligned} \begin{aligned} \dot Y_a &= F(Y_a)\, V_a^{} V_a^T, \qquad & Y_a(t_0) &= Y(t_0), \\ \dot Y_b &= -U_b^{} U_b^T\, F(Y_b)\, V_b^{} V_b^T, \qquad & Y_b(t_0) &= Y_a(t_1), \\ \dot Y_c &= U_c^{} U_c^T\, F(Y_c), \qquad & Y_c(t_0) &= Y_b(t_1). \end{aligned} \end{aligned} $$
(9.58)

Here, all U x and V x are matrices with orthonormal columns from the SVD-like decompositions \(Y_x = U_x S_x V_x^T\). Observe the minus sign for Y b. The result Y c(t 1) is an \(O(h^2)\) approximation to Y (t 1). We then repeat this scheme starting at Y c(t 1) and integrate from t 1 to t 2 = t 1 + h, and so on. By standard theory for Lie splittings, this scheme is first-order accurate for (9.56), that is, \(\|Y_c(T) - Y(T)\|_F = O(h)\) where T = ℓh.

To integrate (9.58) we will first write it using much smaller matrices. To this end, observe that with exact integration \(Y_a(t_1)\in \mathcal {M}_k\) since \(\dot Y_a(t) \in T_{Y_a(t)} \mathcal{M}_k\) and \(Y_a(t_0) \in \mathcal {M}_k\). Hence, we can substitute the ansatz \(Y_a(t) = U_a^{}(t) S_a^{}(t) V_a(t)^T\) in the first substep and obtain

$$\displaystyle \begin{aligned} \dot U_a^{} S_a^{} V_a^T + U_a^{} \dot S_a^{} V_a^T + U_a^{} S_a^{} \dot V_a^T = F(Y_a)\, V_a^{} V_a^T. \end{aligned}$$

Judiciously choosing \(\dot V_a(t) = 0\), we can simplify to

$$\displaystyle \begin{aligned} V_a(t) = V_a(t_0), \quad \frac{d}{dt}[U_a(t) S_a(t)] = F(Y_a(t)) V_a(t_0). \end{aligned}$$

Denoting K(t) = U a(t)S a(t), the first substep is therefore equivalent to

$$\displaystyle \begin{aligned} \dot K(t) = F(K(t)\, V_a(t_0)^T)\, V_a(t_0), \qquad K(t_0) = U_a(t_0) S_a(t_0). \end{aligned} $$
(9.59)

Contrary to the earlier formulation, this is an IVP for an n × k matrix K(t). The orthonormal matrix U b for the next substep can be computed in O(nk 2) work by a QR decomposition of K(t 1).

The second and third substeps can be integrated analogously in terms of evolution equations only for S b(t) and L c(t) = V c(t)S c(t). Also note that we can take V b = V a and U c = U b. We thus get a scheme, called KSL, that integrates in the order K, S, and L. A second-order accurate scheme is obtained from the symmetric Strang splitting: one step consists of computing the K, S, L substeps with h∕2 and afterwards the L, S, K substeps, which correspond to the adjoint splitting scheme, with h∕2.
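
As an illustration, one step of the first-order KSL scheme can be sketched as follows. For simplicity each substep is integrated here by a single explicit Euler step; in practice the substeps should be solved by an integrator adapted to F, as discussed below, and all names are placeholders.

```python
import numpy as np

def ksl_step(U, S, V, F, h):
    """One first-order KSL projector-splitting step of size h for
    Y = U S V^T; F is a callback returning F(Y) as a dense matrix."""
    # K-step: K' = F(K V^T) V with K(t0) = U S
    K = U @ S
    K = K + h * F(K @ V.T) @ V
    U1, S_hat = np.linalg.qr(K)
    # S-step (note the minus sign): S' = -U1^T F(U1 S V^T) V
    S_tilde = S_hat - h * U1.T @ F(U1 @ S_hat @ V.T) @ V
    # L-step: L' = F(U1 L^T)^T U1 with L(t0) = V S_tilde^T
    L = V @ S_tilde.T
    L = L + h * F(U1 @ L.T).T @ U1
    V1, S1_t = np.linalg.qr(L)
    return U1, S1_t.T, V1   # Y_+ = U1 S1 V1^T
```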

In both versions of the splitting scheme, care must be taken in the integration of the substeps since they are computationally the most expensive part. Fortunately, the ODEs in the substeps are formally of the same form as the original equation for the vector field F(Y ) since the projected subspace is constant; see, e.g., V a(t 0) in (9.59). This means that one can usually adapt specialized integrators for F(Y ). In [28], for example, the substeps arising from the Vlasov–Poisson equations in plasma physics (see also Sect. 9.6.5) can be integrated by spectral or semi-Lagrangian methods. In addition, when F is linear and has low TT matrix rank, the large matrix K(t)V a(t 0)T in (9.59), for example, does not need to be formed explicitly when evaluating F. As an illustration, for the Lyapunov operator F(Z) = LZ + ZL T, the equation for K becomes

$$\displaystyle \begin{aligned} \dot K(t) = L K(t) + K(t) L_a^T \qquad \text{with } L_a = V_a(t_0)^T L\, V_a(t_0), \end{aligned}$$
where \(L \in \mathbb {R}^{n \times n}\) is large but usually sparse, and \(L_a \in \mathbb {R}^{k \times k}\) is small. Hence, an exponential integrator with a Krylov subspace method is ideally suited to integrate K(t); see, e.g., [70].

Let us finish by summarizing some interesting properties of the splitting integrator for matrices. Let \(Y_\ell\) be the solution after ℓ steps of the scheme explained above with step size h. For simplicity, we assume that each substep is solved exactly (or sufficiently accurately). Recall that X(t) is the solution to the original ODE (9.48) that we approximate with the dynamical low-rank solution Y (t) of (9.50).

  (a) Exactness [71, Thm. 4.1]: If the solution X(t) of the original ODE lies on \(\mathcal {M}_k\), then \(Y_\ell = X(t_\ell)\) when F equals the identity.

  (b) Robustness to small singular values [58, Thm. 2.1]: There is no step size restriction due to small singular values. Concretely, under the same assumptions that lead to (9.52), the approximation error satisfies

    $$\displaystyle \begin{aligned} \|Y_\ell - X(t_0 + \ell h) \|{}_F \leq C (\delta + \varepsilon + h) \quad \text{for all }\ell,h\ \text{such that }t_0 + \ell h \leq T \end{aligned}$$

    where C only depends on F and T. Observe that compared to (9.52) this only introduces the additional time step error h. For the integration error \(\|Y_\ell - Y(t_0 + \ell h)\|_F\), a similar bound exists in [71, Thm. 4.2] but it requires rather technical assumptions on F.

  (c) Norm and energy conservation [70, Lemma 6.3]: If the splitting scheme is applied to complex tensors for a Hamiltonian problem \(F(X) = -i \mathcal H(X)\) with complex Hermitian \(\mathcal H\), the Frobenius norm and energy are preserved:

    $$\displaystyle \begin{aligned} \| Y_\ell \|{}_F = \|Y_0 \|{}_F \quad \text{and} \quad \langle Y_\ell, \mathcal H(Y_\ell) \rangle_F = \langle Y_0, \mathcal H(Y_0) \rangle_F \quad \text{for all }\ell. \end{aligned}$$

    In the case of real tensors, the norm is preserved if \(\langle F(Y), Y \rangle_F = 0\).

Property (a) does not seem very useful but it is key to showing the much more relevant property (b). None of the three properties is shared when solving (9.53) with a standard integrator, like explicit Runge–Kutta. Moreover, properties (a) and (b) are also lost for a different ordering of the splitting scheme, like KLS, even though that would still result in a first-order scheme. We remark that these properties also hold for the solution Y (t) of the continuous problem by (formally) replacing h by 0.

To extend the idea of projector splitting to TT tensors \(Y(t) \in \mathcal {M}_{\mathbf {k}}\), the correct splitting of the tangent space projector \(\mathcal P_Y \colon \mathbb {V} \to T_Y \mathcal {M}_{\mathbf {k}}\) has to be determined. The idea in [72] is to take the sum expression (9.39) and split it as

$$\displaystyle \begin{aligned} \mathcal P_Y = \mathcal P_1^+ - \mathcal P_1^- + \mathcal P_2^+ - \mathcal P_2^- \cdots - \mathcal P_{d-1}^- + \mathcal P_d^+ \end{aligned} $$
(9.60)

where

$$\displaystyle \begin{aligned} \mathcal P_\mu^+(Z) = \mathcal P_{\leq \mu-1}( \mathcal P_{\geq \mu+1} (Z)) \quad \text{and} \quad \mathcal P_\mu^-(Z) = \mathcal P_{\leq \mu} ( \mathcal P_{\geq \mu+1}(Z)). \end{aligned}$$

Observe that \(\mathcal P_\mu ^\pm \) depends on Y  and that this splitting reduces to the matrix case in (9.57) when d = 2. The projector-splitting integrator for TT is now obtained by integrating each term in (9.60) from left to right:

$$\displaystyle \begin{aligned} \dot Y = \mathcal P_1^+(F(Y)), \quad \dot Y = -\mathcal P_1^-(F(Y)), \quad \dot Y = \mathcal P_2^+(F(Y)), \quad \dots, \quad \dot Y = \mathcal P_d^+(F(Y)), \end{aligned} $$
(9.61)

where each substep is integrated consecutively from t 0 to t 1, taking the final value of the previous substep as its initial value.

Quite remarkably, this splitting scheme for TT tensors shares many of the important properties from the matrix case. In particular, it allows for an efficient integration since only one core varies with time in each substep (see [72, Sec. 4]) and it is robust to small singular values in each unfolding (see [58, Thm. 3.1]). We refer to [42] for more details on its efficient implementation and its application to quantum spin systems in theoretical physics.

6 Applications

In this section, we explain different types of problems that have been solved by low-rank matrix and tensor methods in the literature. We will in particular focus on problems that can be approached by the geometry-oriented methods considered in this chapter, either via optimization on low-rank manifolds or via dynamical low-rank integration. Our references to the literature are meant to give a broad and recent view of the usefulness of these methods, but we do not claim they are exhaustive.

6.1 Matrix Equations

In control and systems theory (see, e.g., [3]), a number of applications require solving the following types of matrix equations:

$$\displaystyle \begin{aligned} \begin{aligned} \text{Lyapunov:} & \qquad AX+XA^T = C, \\ \text{Sylvester:} & \qquad AX+XB = C, \\ \text{Riccati:} & \qquad AX+XA^T+XBX = C. \end{aligned} \end{aligned} $$
(9.62)

Here, A, B, C are given matrices and X is the unknown matrix (of possibly different size in each equation). The first two equations are linear, whereas the third is quadratic. For simplicity, we assume that these equations are uniquely solvable, but there exist detailed results about conditions for this.

In large-scale applications, the matrix X is typically dense and too large to store. Under certain conditions, one can prove that X has fast decaying singular values and can thus be well approximated by a low-rank matrix; see [102] for an overview. For the linear equations, one can then directly attempt the optimization strategy explained in Sect. 9.4.2 and minimize the residual function or the energy-norm error. The latter is preferable but only possible when A and B are symmetric and positive definite; see [111] for a comparison. If the underlying matrices are ill-conditioned, as is the case with discretized PDEs, a simple Riemannian gradient scheme will not be effective and one needs to precondition the gradient steps or perform a quasi-Newton method. For example, in case of the Lyapunov equation, it is shown in [113] how to efficiently solve the Gauss–Newton equations for the manifold \(\mathcal {M}_k\). If the Riccati equation is solved by Newton’s method, each step requires solving a Sylvester equation [102]. When aiming for low-rank approximations, the latter can again be solved by optimization on \(\mathcal {M}_k\); see [81]. We remark that while most methods for calculating low-rank approximations to (9.62) are based on Krylov subspaces and rational approximations, there exists a relation between both approaches; see [11].

The matrix equations from above have direct time-dependent versions. For example, the differential Riccati equation is given by

$$\displaystyle \begin{aligned} \dot X(t) = A X(t) + X(t) A^T + G(t, X(t)), \qquad X(t_0) = X_0, \end{aligned} $$
(9.63)

where G(t, X(t)) = C − X(t)BX(t). Uniqueness of the solution X(t) for all t ≥ t 0 is guaranteed when X 0, C, and B are symmetric and positive semi-definite [21]. In optimal control, the linear quadratic regulator problem with finite time horizon requires solving (9.63). In the large-scale case, it is typical that X 0 and C are of low rank and it has been observed [20, 76] that X(t) then has fast decaying singular values, even on infinite time horizons.

Other examples are the differential Lyapunov equation (G(t, X) = 0) and the generalized differential Riccati equation (\(G(t,X(t)) = C + \sum _{j=1}^J D_j^T X(t) D_j - X(t) B X(t)\)); see, e.g., [20, 75] for applications. When the matrices are large, it is important to exploit that applying the right-hand side in (9.63) does not increase the rank of X(t) too much, which is guaranteed here if J is not too large and the matrix C is of low rank. In [89] a low-rank approximation to X(t) is obtained with the dynamical low-rank algorithm. As in the time-independent case, discretized PDEs might need special treatment to cope with the stiff ODEs. In particular, an exponential integrator can be combined with the projector-splitting integrator by means of an additional splitting of the vector field for the stiff part; see [89] for details and analysis.

6.2 Schrödinger Equation

Arguably the most typical example involving tensors of very high order is the time-dependent Schrödinger equation,

$$\displaystyle \begin{aligned} \mathrm{i}\, \partial_t \psi = H \psi, \end{aligned}$$
where H is a self-adjoint Hamiltonian operator acting on a (complex-valued) multi-particle wave function ψ(x 1, …, x d, t) with \(x_\mu \in \mathbb {R}^p\), p ≤ 3. This equation is fundamental in theoretical physics for the simulation of elementary particles and molecules. Employing a Galerkin discretization with i = 1, …, n μ basis functions \(\varphi _i^{(\mu )}\) in each mode μ = 1, …, d, the wave function will be approximated as

$$\displaystyle \begin{aligned} \psi(x_1, \ldots, x_d, t) \approx \sum_{i_1=1}^{n_1} \cdots \sum_{i_d=1}^{n_d} X(i_1,\ldots,i_d; t) \, \varphi_{i_1}^{(1)}(x_1) \cdots \varphi_{i_d}^{(d)}(x_d). \end{aligned}$$

By regarding the unknown complex coefficient X(i 1, …, i d;t) as the (i 1, …, i d)th element of the time-dependent tensor X(t) of size n 1 ×⋯ × n d, we obtain the linear differential equation

$$\displaystyle \begin{aligned} \mathrm{i}\, \dot X(t) = \mathcal H(X(t)), \end{aligned} $$
(9.64)

where \(\mathcal H\) is the Galerkin discretization of the Hamiltonian H. More complicated versions of this equation allow the Hamiltonian to be time-dependent.

The size of the tensor X(t) will be unmanageable for large d but, fortunately, certain systems allow it to be approximated by a low-rank tensor. For example, in the multilayer multiconfiguration time-dependent Hartree model (ML-MCTDH) in [78, 116] for simulating quantum dynamics in small molecules, the wave functions are approximated by hierarchical Tucker tensors. Spin systems in theoretical physics, on the other hand, employ the TT format and can simulate systems of very large dimension (since the n μ are small); see [97] for an overview. For both application domains, the solution of (9.64) can be obtained by applying dynamical low-rank approximation; see, e.g., [77] for MCTDH and [41] for spin systems.

Numerical experiments for (9.64) with the Hénon–Heiles potential were performed in [72]. There the second-order splitting integrator with a fixed time step h = 0.01 and a fixed TT rank of 18 was compared to an adaptive integration of the gauged ODEs, similar to (9.53). In particular, the ML-MCTDH method [10] was used in the form of the state-of-the-art code mctdh v8.4. Except for the slightly different tensor formats (TT versus hierarchical Tucker) all other modeling parameters are the same. For similar accuracy, a 10-dimensional problem is integrated by mctdh in 54 354 s, whereas the TT splitting integrator required only 4425 s. This time difference was mainly due to the ill-conditioned ODEs in the gauged representation. In addition, there was no visible difference in the Fourier transform of the auto-correlation functions; see Fig. 9.9.

Fig. 9.9 Spectrum computed by the second-order splitting integrator and by the MCTDH package. Picture taken from [72]. Copyright 2015 Society for Industrial and Applied Mathematics. Reprinted with permission. All rights reserved

There is an interesting link between the computation of the ground state (eigenvector of the minimal eigenvalue) of \(\mathcal H\) via the minimization of the Rayleigh quotient

$$\displaystyle \begin{aligned} \rho(X) = \frac{(X,\mathcal H(X))_F}{(X,X)_F} \end{aligned}$$

and so-called imaginary time evolution [41] for a scaled version of (9.64) that conserves unit norm. The latter is a formal way to obtain a gradient flow for ρ(X) using imaginary time τ = −it by integrating

$$\displaystyle \begin{aligned} \frac{d}{d\tau} X = -\big( \mathcal H(X) - \rho(X)\, X \big). \end{aligned}$$
For both approaches, we can approximate their solutions with low-rank tensors, as explained before, either via optimization or via dynamical low-rank approximation on \(\mathcal {M}_{\mathbf {k}}\). However, methods based on optimization of the multilinear TT representation of X remain the more popular approach since they easily allow one to reuse certain techniques from standard eigenvalue problems, like subspace corrections, as is done in the DMRG [118] and AMEn [23, 63] algorithms.

For an overview on tensor methods in quantum physics and chemistry we refer to [51, 97, 105].

6.3 Matrix and Tensor Completion

Let Ω ⊂{1, …, m}×{1, …, n} be a sampling set. The problem of matrix completion consists of recovering an unknown m × n matrix M of rank k based only on the values M(i, j) for all (i, j) ∈ Ω. Remarkably, this problem has a unique solution if \(|\Omega | \approx O(\dim \mathcal {M}_k ) = O(k(m+n))\) and under certain randomness conditions on Ω and M; see [14]. If the rank k is known, this immediately suggests the strategy of recovering M by minimizing the least-squares fit

$$\displaystyle \begin{aligned} f(X) = \sum_{(i,j) \in \Omega} (X(i,j) - M(i,j))^2 = \| \mathcal P_\Omega(X-M) \|{}_F^2 \end{aligned}$$

on the manifold \(\mathcal {M}_k\), where \(\mathcal P_\Omega \) is the orthogonal projection onto matrices that vanish outside of Ω. Since \(\mathcal P_\Omega \) is well-conditioned on \(\mathcal {M}_k\) when the iterates satisfy an incoherence property, the simple Riemannian gradient schemes that we explained above perform very well in recovering M; see, e.g., [82, 112].

The problem of matrix completion can be generalized to tensors, and Riemannian methods for tensor completion have been developed for the Tucker format in [64], and for the TT format in [103].

In addition, instead of element-wise sampling, the observations can also be constructed from a general linear operator \(\mathcal {S} \colon \mathbb {V} \to \mathbb {R}^q\). This problem remains well-posed under certain randomness conditions on \(\mathcal {S}\), and Riemannian optimization also performs well if applied to the least-squares version of the problem, for which \(\mathcal {L} = \mathcal {S}^T \mathcal {S}\); see [117].

6.4 Stochastic and Parametric Equations

Other interesting applications for low-rank tensors arise from stochastic or parametric PDEs [7, 9, 24, 26, 31, 54, 55]. For simplicity, suppose that the system matrix of a finite-dimensional linear system Ax = b of equations depends on p parameters ω (1), …, ω (p), that is,

$$\displaystyle \begin{aligned} A(\omega^{(1)}, \ldots, \omega^{(p)}) \, x(\omega^{(1)}, \ldots, \omega^{(p)}) = b. \end{aligned} $$
(9.65)

One might be interested in the solution \(x \in {\mathbb {R}}^n\) for some or all choices of parameters, or, in case the parameters are random variables, in expectation values of certain quantities of interest.

By discretizing each parameter ω (μ) with m μ values, we can gather all the \(m_1 \cdots m_p\) solution vectors x into one tensor

$$\displaystyle \begin{aligned}{}[X(j, i_1,\ldots, i_p)] = [x_j(\omega^{(1)}_{i_1}, \ldots, \omega^{(p)}_{i_p} )] \end{aligned}$$

of order p + 1 and size n × m 1 ×⋯ × m p. When A depends analytically on ω = (ω (1), …, ω (p)), the tensor X can be shown [62] to be well approximated by low TT rank and it satisfies a very large linear system \(\mathcal L(X)=B\). If \(\mathcal L\) is a TT matrix of low rank, we can then approximate X on \(\mathcal {M}_{\mathbf {k}}\) by the optimization techniques we discussed in Sect. 9.4. This is done, for example, in [65] with an additional preconditioning of the gradient.

A similar situation arises for elliptic PDEs with stochastic coefficients. After truncated Karhunen–Loève expansion, one can obtain a deterministic PDE of the same form as (9.65); see [106]. The following time-dependent example with random variable ω,

$$\displaystyle \begin{aligned} \partial_t \psi + v(t, x; \omega) \cdot \nabla \psi = 0, \quad \psi(0,x) = x, \end{aligned}$$

was solved by dynamical low-rank in [32].

6.5 Transport Equations

Transport equations describe (densities of) particles at position \(x \in \mathbb {R}^p\) and velocity \(v \in \mathbb {R}^p\). They are typically more challenging to integrate than purely diffusive problems. For example, the Vlasov equation

$$\displaystyle \begin{aligned} \partial_t u(t,x,v) + v \cdot \nabla_x u(t,x,v) - F(u) \cdot \nabla_v u(t,x,v) = 0 \end{aligned} $$
(9.66)

is a kinetic model for the density u of electrons in a plasma. The function F is a nonlinear term representing the force. These equations can furthermore be coupled with Maxwell's equations, resulting in systems that require specialized integrators to preserve conservation laws in the numerical solution. After spatial discretization on a tensor product grid, Eq. (9.66) becomes a differential equation for a large tensor of order d = 6. In the case of the Vlasov–Poisson and Vlasov–Maxwell equations, [28, 30] show that the splitting integrator gives very good approximations with modest TT ranks, even over relatively large time intervals. In addition, the numerical integration of the substeps can be modified to ensure better preservation of some conservation laws; see [29, 30].

Similar approaches appear for weakly compressible fluid flow with the Boltzmann equation in [27] and stochastic transport PDEs in [32]. The latter also shows that numerical filters can be used in combination with dynamical low-rank to successfully reduce artificial oscillations.

7 Conclusions

In this chapter we have shown how the geometry of low-rank matrices and TT tensors can be exploited in algorithms. We focused on two types of problems: Riemannian optimization for solving large linear systems and eigenvalue problems, and dynamical low-rank approximation for initial value problems. Our aim was to be sufficiently explanatory without sacrificing readability, and we encourage the interested reader to consult the provided references for a more in-depth treatment of these subjects.

Several things have not been discussed in this introductory chapter. The most important issue is arguably the rank adaptation during the course of the algorithms to match the desired tolerance at convergence. For this, truncation of singular values with a target error instead of a target rank can be used both for matrices and TT tensors, but from a conceptual perspective such an approach is at odds with algorithms that are defined on manifolds of fixed rank matrices or tensors. However, it is possible to combine geometric methods with rank adaptivity as in [109] for greedy rank-one optimization and in [42] for a two-site version of the splitting scheme for time integration, yet many theoretical and implementation questions remain. Other important topics not covered are the problem classes admitting a-priori low-rank approximability [19, 44], the application of low-rank formats to seemingly non-high dimensional problems like quantized TT (QTT) [52, 53], the efficient numerical implementation of truly large-scale and stiff problems, schemes with guaranteed and optimal convergence as in [5], and more general tensor networks like PEPS [114].