A Krylov-Schur-like method for computing the best rank-(r1,r2,r3) approximation of large and sparse tensors

The paper is concerned with methods for computing the best low multilinear rank approximation of large and sparse tensors. Krylov-type methods have been used for this problem; here block versions are introduced. For the computation of partial eigenvalue and singular value decompositions of matrices the Krylov-Schur (restarted Arnoldi) method is used. A generalization of this method to tensors is described, for computing the best low multilinear rank approximation of large and sparse tensors. In analogy to the matrix case, the large tensor is only accessed in multiplications between the tensor and blocks of vectors, thus avoiding excessive memory usage. It is proved that if the starting approximation is good enough, then the tensor Krylov-Schur method is convergent. Numerical examples are given for synthetic tensors and sparse tensors from applications, which demonstrate that for most large problems the Krylov-Schur method converges faster and more robustly than higher order orthogonal iteration.

1. Introduction. Many modern applications generate large and sparse data sets that are organized in more than two categories. Such multi-mode data can be represented by tensors. They arise in applications of data sciences, such as web link analysis, cross-language information retrieval and social network analysis, see, e.g., [31] and the survey [30]. The effective analysis of tensor data requires the development of methods that can identify the inherent relations that exist in the data, and that scale to large data sets. Low rank approximation is one such method, and much research has been done in recent years in this area; a few examples that are related to the work of this paper are given in [2,6,9,14,23,28,35,40,41,50]. However, most of these methods are intended for small to medium size tensors. The objective of this paper is to develop an algorithm for low-rank approximation of large and sparse tensors.
We consider the problem of approximating a 3-mode tensor A by another tensor B, where the norm is the Frobenius norm, and B has low multilinear rank-(r_1, r_2, r_3) (for definitions of the concepts used in this introduction, see Section 2). We will assume that A is large and sparse. This problem can be written

(2) min_{U,V,W,F} ‖A − (U, V, W) · F‖,

where F ∈ R^{r_1×r_2×r_3} is a tensor of small dimensions, and (U, V, W) · F denotes matrix-tensor multiplication in all three modes. This is the best rank-(r_1, r_2, r_3) approximation problem. In many applications the tensor is (1,2)-symmetric (e.g., a sequence of adjacency matrices of undirected graphs). Therefore the block Krylov-type methods and the numerical examples are given in terms of such tensors. The method presented in this paper is "like" a Krylov-Schur method for two reasons. The tensor Krylov-type method is not a Krylov method in a strict sense, as it does not build bases for Krylov subspaces [40]. The method is not a real Krylov-Schur method as it does not build and manipulate a Hessenberg matrix; instead it uses a tensor, which is in some ways similar to a Hessenberg matrix. However, this structure is not utilized. For ease of presentation we will sometimes omit "like" and "type".
2.1. Notation. Throughout this paper we use the following notation. Vectors will be denoted by lower case roman letters, e.g., a and b, matrices by capital roman letters, e.g., A and B, and tensors by calligraphic letters, e.g., A and B.
Notice that sometimes we will not explicitly mention the dimensions of matrices and tensors, and assume that they are such that the operations are well-defined. Also, for simplicity of notation and presentation, we will restrict ourselves to tensors of order 3, which are defined in the next paragraph. The generalization to higher order tensors is straightforward. For more general definitions we refer the reader to [3].
A tensor will be viewed as a multidimensional array of real numbers. The order of a tensor is the number of dimensions, also known as modes; e.g., a 3-dimensional array A ∈ R^{l×m×n} is called a tensor of order 3, or a 3-tensor. A fiber is a one-dimensional section of a tensor, obtained by fixing all indices except one; A(i, :, k) is referred to as a mode-2 fiber. A slice is a two-dimensional section of a tensor, obtained by fixing one index; A(i, :, :) is a mode-1 slice or 1-slice. A particular element of a 3-tensor A can be denoted in two different ways, in "matlab-like" notation and with standard subscripts, i.e., A(i, j, k) and a_{ijk}, respectively.

Definition 1. A 3-tensor A ∈ R^{m×m×n} is called (1,2)-symmetric if all its 3-slices are symmetric, i.e., A(i, j, k) = A(j, i, k), i, j = 1, 2, . . . , m, k = 1, 2, . . . , n.
Symmetry with respect to any two modes and for tensors of higher order than 3 can be defined analogously. We use I k for the identity matrix of dimension k.

2.2. Multilinear tensor-matrix multiplication.
We first consider the multiplication of a tensor by a matrix. When a tensor is multiplied by a single matrix in mode i, say, we will call the operation the mode-i multilinear multiplication (or mode-i product) of a tensor by a matrix. For example, the mode-1 product of a tensor A ∈ R^{l×m×n} by a matrix U ∈ R^{p×l} is defined by

B = (U)_1 · A ∈ R^{p×m×n},   B(i, j, k) = Σ_{α=1}^{l} U(i, α) A(α, j, k).

This means that all mode-1 fibers in the 3-tensor A are multiplied by the matrix U. The mode-2 and mode-3 products are defined in a similar way. Let the matrices V ∈ R^{q×m} and W ∈ R^{r×n}; multiplication along all three modes is defined by

(U, V, W) · A ∈ R^{p×q×r},   ((U, V, W) · A)(i, j, k) = Σ_{α,β,γ} U(i, α) V(j, β) W(k, γ) A(α, β, γ).

For multiplication with a transposed matrix X ∈ R^{l×s} it is convenient to introduce a separate notation,

A · (X)_1 := (X^T)_1 · A ∈ R^{s×m×n}.

In a similar way, if x ∈ R^l, then A · (x)_1 ∈ R^{m×n} is a matrix.

2.3. Inner product and norm, contractions. The inner product of two tensors A and B of the same order and dimensions is denoted ⟨A, B⟩ and is computed as a sum of element-wise products over all the indices, that is

⟨A, B⟩ = Σ_{i,j,k} A(i, j, k) B(i, j, k).

The inner product allows us to define the Frobenius norm of a tensor A as

‖A‖ = ⟨A, A⟩^{1/2}.

As in the matrix case, the Frobenius norm of a tensor is invariant under orthogonal transformations, i.e., ‖A‖ = ‖(U, V, W) · A‖ for orthogonal matrices U, V, and W. This follows immediately from the fact that mode-i multiplication by an orthogonal matrix does not change the Euclidean length of the mode-i fibers.
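To illustrate, the mode products and the all-mode multiplication (U, V, W) · A can be sketched in NumPy as follows (the function names `mode_mult` and `tml` are our own, not from any toolbox):

```python
import numpy as np

def mode_mult(A, M, mode):
    """Mode-`mode` product: multiply every mode-`mode` fiber of the
    3-tensor A by the matrix M."""
    return np.moveaxis(np.tensordot(M, A, axes=(1, mode)), 0, mode)

def tml(A, U, V, W):
    """(U, V, W) . A : multilinear multiplication in all three modes."""
    return mode_mult(mode_mult(mode_mult(A, U, 0), V, 1), W, 2)

# Orthogonal invariance of the Frobenius norm: ||(U,V,W) . A|| = ||A||
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5, 6))
U = np.linalg.qr(rng.standard_normal((4, 4)))[0]
V = np.linalg.qr(rng.standard_normal((5, 5)))[0]
W = np.linalg.qr(rng.standard_normal((6, 6)))[0]
print(np.allclose(np.linalg.norm(tml(A, U, V, W)), np.linalg.norm(A)))  # prints True
```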
The inner product is a contraction over all modes. We also define partial contractions that involve fewer than three modes; for example,

(⟨A, B⟩_1)(j, k, j′, k′) = Σ_i A(i, j, k) B(i, j′, k′)

is a contraction in the first mode. We use negative subscripts to denote partial contractions in all but one mode,

(⟨A, B⟩_{-1})(i, i′) = Σ_{j,k} A(i, j, k) B(i′, j, k) = ⟨A(i, :, :), B(i′, :, :)⟩.

The result is a matrix (order 2 tensor) of inner products between the mode-1 slices of the two tensors. For partial contractions only the contracted modes are required to have equal dimensions, so the result matrix may be rectangular.
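The full inner product and the partial contraction ⟨A, B⟩_{-1} map directly onto `np.tensordot` (a small sketch; the helper names are ours):

```python
import numpy as np

def inner(A, B):
    """<A, B>: sum of elementwise products over all indices."""
    return np.tensordot(A, B, axes=3)

def contract_neg1(A, B):
    """<A, B>_{-1}: contract modes 2 and 3; entry (i, i') is the inner
    product of the mode-1 slices A(i,:,:) and B(i',:,:)."""
    return np.tensordot(A, B, axes=((1, 2), (1, 2)))

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4, 5))
B = rng.standard_normal((6, 4, 5))   # only the contracted modes must match
G = contract_neg1(A, B)              # a 3 x 6 (rectangular) matrix
print(np.allclose(inner(A, A), np.linalg.norm(A)**2))  # prints True
```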

2.4. Multilinear rank.
Unlike the matrix case, there is no unique definition of the rank of a tensor. In this paper we consider the concept of multilinear rank defined by Hitchcock [22]. Let A_(i) denote the mode-i unfolding (matricization) of A (using some ordering of the vectors), where the columns of A_(i) are all mode-i fibers [8]. Similarly, let fold_i be the inverse of unfold_i. The multilinear rank of a third order tensor A is an integer triplet (p, q, r) such that

p = rank(A_(1)),   q = rank(A_(2)),   r = rank(A_(3)),

where rank(A_(i)) is the matrix rank. In this paper we will deal only with multilinear rank, and we will use the notation rank-(p, q, r), and rank(A) = (p, q, r). For matrices the rank is obtained via the SVD; see, e.g., [20, Chapter 2]. In exact arithmetic the multilinear rank can be computed using the higher order singular value decomposition (HOSVD) [8].
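In floating point arithmetic the multilinear rank can be estimated from the numerical ranks of the three unfoldings; a small NumPy sketch (the helper names are ours):

```python
import numpy as np

def unfold(A, mode):
    """Mode-`mode` unfolding: the columns are the mode-`mode` fibers."""
    return np.moveaxis(A, mode, 0).reshape(A.shape[mode], -1)

def multilinear_rank(A, tol=1e-10):
    return tuple(int(np.linalg.matrix_rank(unfold(A, m), tol=tol))
                 for m in range(3))

# A tensor constructed to have multilinear rank (2, 3, 2)
rng = np.random.default_rng(2)
F = rng.standard_normal((2, 3, 2))
U = np.linalg.qr(rng.standard_normal((7, 2)))[0]
V = np.linalg.qr(rng.standard_normal((8, 3)))[0]
W = np.linalg.qr(rng.standard_normal((9, 2)))[0]
A = np.einsum('abc,ia,jb,kc->ijk', F, U, V, W)
print(multilinear_rank(A))
```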
2.5. Best rank-(r_1, r_2, r_3) approximation. The problem (1) of approximating a given tensor A ∈ R^{l×m×n} by another tensor B of equal dimensions but of lower rank occurs in many modern applications, e.g., machine learning [33], pattern classification [39], analytical and quantum chemistry [42,27], and signal processing [7]. We assume that rank(B) = (r_1, r_2, r_3), which means that B can be written as a product of a core tensor F ∈ R^{r_1×r_2×r_3} and three matrices,

B = (U, V, W) · F,

where U ∈ R^{l×r_1}, V ∈ R^{m×r_2}, and W ∈ R^{n×r_3} are full column rank matrices. Without loss of generality, we can assume that U, V, and W have orthonormal columns, as any nonorthogonality may be incorporated into F. Therefore the best multilinear low rank problem (1) can be written as

(4) min_{U,V,W,F} ‖A − (U, V, W) · F‖,   U^T U = I_{r_1}, V^T V = I_{r_2}, W^T W = I_{r_3}.

There are a few major differences between the best low rank approximation of matrices and that of tensors of mode 3 and higher. In the matrix case, the explicit solution of the corresponding problem can be obtained from the SVD; see the Eckart-Young property in [20, Theorem 2.4.8]. A simple proof is given in [13, Theorem 6.7]. There is no known closed form solution for the minimization problem (4), but it can be shown that this is a well-defined problem in the sense that for any (r_1, r_2, r_3) a solution exists [10, Corollary 4.5]. Several iterative methods for computing the low rank approximation for small and medium size tensors have been proposed, see [9,14,24,41]. In [9] it is shown that (4) is equivalent to the following maximization problem,

(5) max_{U,V,W} Φ(U, V, W) = ‖A · ((U)_1, (V)_2, (W)_3)‖,   U^T U = I_{r_1}, V^T V = I_{r_2}, W^T W = I_{r_3}.

Since the norm is invariant under orthogonal transformations, it holds that Φ(U, V, W) = Φ(U Q_1, V Q_2, W Q_3) for any orthogonal matrices Q_1 ∈ R^{r_1×r_1}, Q_2 ∈ R^{r_2×r_2}, and Q_3 ∈ R^{r_3×r_3}. Hence (5) is equivalent to a maximization problem over a product of Grassmann manifolds; for optimization on matrix manifolds, see [12,1,23,24].
After computing the optimal U, V, and W, the optimal F can be obtained by considering the minimization in (4) as a linear least squares problem with unknown F.

Lemma 2. Let A ∈ R^{l×m×n} be given along with three matrices with orthonormal columns, U ∈ R^{l×r_1}, V ∈ R^{m×r_2}, and W ∈ R^{n×r_3}, where r_1 ≤ l, r_2 ≤ m, and r_3 ≤ n. Then the least squares problem

min_F ‖A − (U, V, W) · F‖

has the unique solution

F = A · ((U)_1, (V)_2, (W)_3) = (U^T, V^T, W^T) · A.

For a proof, see e.g. [9,40]. The tensor F is a generalization of the matrix Rayleigh quotient.
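Lemma 2 can be checked numerically: since (U, V, W) · F is the orthogonal projection of A onto the subspace determined by the factors, the residual satisfies ‖A − (U, V, W) · F‖² = ‖A‖² − ‖F‖² (a sketch, with einsum standing in for the multilinear products):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 7, 8))
U = np.linalg.qr(rng.standard_normal((6, 2)))[0]
V = np.linalg.qr(rng.standard_normal((7, 3)))[0]
W = np.linalg.qr(rng.standard_normal((8, 2)))[0]

# F = A . ((U)_1, (V)_2, (W)_3): the tensor Rayleigh quotient,
# i.e. the least squares solution of Lemma 2
F = np.einsum('ijk,ia,jb,kc->abc', A, U, V, W)
B = np.einsum('abc,ia,jb,kc->ijk', F, U, V, W)   # (U, V, W) . F

res2 = np.linalg.norm(A - B)**2
print(np.allclose(res2, np.linalg.norm(A)**2 - np.linalg.norm(F)**2))  # prints True
```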
2.6. Gradient on the product manifold. In [14] a Newton-Grassmann method is derived for computing the solution of the maximization problem (5). The constraints on the unknown matrices U, V, and W are taken into account by formulating the problem as an optimization problem on a product of three Grassmann manifolds. In this paper we will need the gradient of Φ in the tangent space of the product manifold for a stopping criterion. This gradient can be expressed in the ambient coordinate system, or in local coordinates. In the context of the new methods presented it is practical and more efficient to use local coordinate representations, see Proposition 4. Let (U U_⊥) denote the enlargement of U to a square orthogonal matrix, and use the analogous notation for V and W. Then, with C^{(1)} = A · ((V)_2, (W)_3) and F = A · ((U)_1, (V)_2, (W)_3), the first component of the Grassmann gradient at (U, V, W) can be written in local coordinates as

U_⊥^T ⟨C^{(1)}, F⟩_{-1},

and analogously for the second and third components. In the context of the HOOI, see Section 6.1, it is more efficient to use global coordinates. For instance, the first component of the gradient can then be computed as

(I − U U^T) ⟨C^{(1)}, F⟩_{-1}.

For more details on coordinate representations for this problem, see [14], [15, Section 3.2]. In the rest of this paper, the concept G-gradient will mean the Grassmann gradient in global or local coordinates.

2.7. Conditioning of the best approximation problem. The best rank-r approximation problem for a matrix A is not unique if the singular values satisfy σ_r(A) = σ_{r+1}(A). The problem is ill-conditioned if the gap is small, i.e. σ_r(A) > σ_{r+1}(A) but σ_r(A) ≈ σ_{r+1}(A); see e.g. [45, Chapter 3], [20, Chapter 8.6]. A similar situation exists for the tensor case [15, Corollary 4.5] (note that the perturbation theory for the SVD is a special case of that for the best rank-(r_1, r_2, r_3) approximation of a tensor). Define

s_i^{(k)} = (λ_i(⟨A, A⟩_{-k}))^{1/2},   i = 1, 2, . . . ,   k = 1, 2, 3,

where the λ's are eigenvalues, in descending order, of the symmetric matrices ⟨A, A⟩_{-k}. We will refer to these quantities as S-values. Then we can define three gaps, one for each mode,

gap_k = s^{(k)}_{r_k} − s^{(k)}_{r_k+1},   k = 1, 2, 3.

(In the matrix case there is only one set of s_i^{(k)}, which are the singular values.) It is shown in [15, Section 5.3] that the gaps can be taken as measures of the conditioning of the best approximation problem. If, for any k, s^{(k)}_{r_k} is considerably larger than s^{(k)}_{r_k+1}, then the approximation problem is well-conditioned with respect to mode k. Conversely, if the gap is small, then the problem is ill-conditioned.
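The S-values can be computed directly from the Gram matrices ⟨A, A⟩_{-k}; for mode k they coincide with the singular values of the unfolding A_(k) (a sketch, helper names ours):

```python
import numpy as np

def s_values(A, mode):
    """S-values of mode `mode`: square roots of the (descending)
    eigenvalues of the symmetric matrix <A, A>_{-mode}."""
    axes = tuple(m for m in range(3) if m != mode)
    G = np.tensordot(A, A, axes=(axes, axes))   # = A_(mode) A_(mode)^T
    lam = np.linalg.eigvalsh(G)[::-1]           # descending order
    return np.sqrt(np.maximum(lam, 0.0))

def gap(A, mode, r):
    s = s_values(A, mode)
    return s[r - 1] - s[r]

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 6, 7))
# Mode-1 S-values equal the singular values of the mode-1 unfolding
sv = np.linalg.svd(A.reshape(5, -1), compute_uv=False)
print(np.allclose(s_values(A, 0), sv))  # prints True
```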
3. Krylov methods for matrices. Krylov subspace methods are the main class of algorithms for the iterative solution of large and sparse matrix problems. Here we give a brief introduction to Krylov methods, illustrating with the Arnoldi method.
For a given square matrix A ∈ R^{n×n} and a nonzero vector u ∈ R^n, the subspace

(7) K_k(A, u) = span{u, Au, A²u, . . . , A^{k−1}u}

is called the Krylov subspace of dimension k associated with A and u [20, Chapter 10]. The Arnoldi process [20, Chapter 10], which is obtained by applying Gram-Schmidt orthogonalization, generates an orthonormal basis for the Krylov subspace (7). The recursive Arnoldi process is equivalent to the matrix equation

(8) A U_k = U_{k+1} Ĥ_k,

where U_k = [u_1, . . . , u_k] has orthonormal columns and Ĥ_k ∈ R^{(k+1)×k} is a Hessenberg matrix containing the orthogonalization coefficients. Using this factorization one can compute an approximate solution of a linear system or an eigenvalue problem with coefficient matrix A by projecting onto the low dimensional subspace spanned by U_k.
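A compact NumPy sketch of the Arnoldi process (modified Gram-Schmidt variant; the function name is ours):

```python
import numpy as np

def arnoldi(A, u, k):
    """k steps of Arnoldi: returns U (n x (k+1)) with orthonormal
    columns and the Hessenberg matrix H ((k+1) x k) such that
    A @ U[:, :k] = U @ H."""
    n = A.shape[0]
    U = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    U[:, 0] = u / np.linalg.norm(u)
    for j in range(k):
        w = A @ U[:, j]
        for i in range(j + 1):            # modified Gram-Schmidt
            H[i, j] = U[:, i] @ w
            w = w - H[i, j] * U[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        U[:, j + 1] = w / H[j + 1, j]
    return U, H

rng = np.random.default_rng(5)
A = rng.standard_normal((30, 30))
U, H = arnoldi(A, rng.standard_normal(30), 10)
print(np.allclose(A @ U[:, :10], U @ H))   # the Arnoldi relation (8): prints True
```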
3.1. The Krylov-Schur approach to computing low rank approximations. The Arnoldi and Lanczos methods (see, e.g., [20, Chapter 10], [45, Chapter 5]) are designed for computing the eigenvalues of large and sparse matrices. Writing the Krylov decomposition (8) in the form

(9) A U_k = U_k H_k + u_{k+1} b_{k+1}^T,

where H_k ∈ R^{k×k} consists of the first k rows of Ĥ_k (for the Arnoldi recursion, b_{k+1} = h_{k+1,k} e_k), approximations are obtained by computing the eigenvalues of H_k (the Ritz values), using an algorithm for dense matrices. A problem with this approach is that when k grows the cost for orthogonalizing a newly generated vector against the columns of U_k increases. In addition, since U_k is dense, the memory requirements may increase too much. In order to save memory and work, an implicitly restarted Arnoldi technique (IRA) was developed [43], and implemented in the highly successful ARPACK package [32]. In [44] Stewart proposed a Krylov-Schur method which is mathematically equivalent to IRA. In the following we give a brief introduction to the Krylov-Schur method, based on [45, Chapter 5].
Consider the Krylov decomposition (9) of A, which is obtained after k steps of the Arnoldi recursion. Let k = r + k_2, where r is the number of A's eigenvalues that we want to compute. The Schur decomposition of H_k is

(10) H_k = Q T_k Q^T,

where Q and T_k are orthogonal and upper triangular matrices, respectively. Partition T_k and Q as

(11) T_k = [ T_11  T_12 ; 0  T_22 ],   Q = (Q_1  Q_2),

where Q_1 ∈ R^{k×r}, Q_2 ∈ R^{k×k_2}, and the eigenvalues have been ordered such that those of T_11 ∈ R^{r×r} are of interest and those of T_22 ∈ R^{k_2×k_2} are not. The Krylov decomposition (9) can now be written

(12) A U_k Q = U_k Q T_k + u_{k+1} b_{k+1}^T Q.

Discarding the last k_2 columns and setting Ũ_r = U_k Q_1 and b̃ = Q_1^T b_{k+1}, we see that

(13) A Ũ_r = Ũ_r T_11 + u_{k+1} b̃^T

is also a Krylov decomposition. The main idea of the Krylov-Schur algorithm is to repeatedly expand (13) to dimension k, using the Arnoldi recursion, and then reduce it back to dimension r. The dimension of H_k is assumed to be much smaller than that of A, which makes it possible to compute the Schur decomposition of H_k using the dense QR algorithm (implemented in LAPACK). A sketch of this method [45, Chapter 5] is given in Algorithm 1.

Algorithm 1 The Krylov-Schur algorithm
Given: matrix A, the number of desired eigenvalues r and the number of steps k. Build a Krylov decomposition (9) of A.

Until convergence
Compute the Schur decomposition (10) and partition as in (11) with the r desired eigenvalues in the block T_11.
Discard the unwanted eigenvalues and save the truncated Krylov-Schur decomposition (13).
Check convergence.
Expand the Krylov decomposition (13) to one of dimension k.
The convergence check is based on the eigenvalue residual [45, Chapter 5].
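One expand-truncate cycle of Algorithm 1 can be sketched as follows. The reordering of the wanted eigenvalues into T_11 is done with an ordered real Schur decomposition (`scipy.linalg.schur` with a `sort` callable); for simplicity this sketch assumes a real spectrum, so no 2x2 Schur block straddles the cut. All names are our own.

```python
import numpy as np
from scipy.linalg import schur

def expand(A, U, H, k):
    """Arnoldi-expand a Krylov decomposition A U_j = U H
    (U: n x (j+1), H: (j+1) x j) to dimension k."""
    n, j = U.shape[0], H.shape[1]
    U = np.hstack([U, np.zeros((n, k - j))])
    Hk = np.zeros((k + 1, k)); Hk[: j + 1, :j] = H; H = Hk
    for m in range(j, k):
        w = A @ U[:, m]
        for i in range(m + 1):            # Gram-Schmidt
            H[i, m] = U[:, i] @ w
            w = w - H[i, m] * U[:, i]
        H[m + 1, m] = np.linalg.norm(w)
        U[:, m + 1] = w / H[m + 1, m]
    return U, H

def truncate(U, H, r):
    """Keep the r eigenvalues of largest real part via Schur reordering."""
    j = H.shape[1]
    Hs, b = H[:j, :j], H[j, :]
    thr = np.sort(np.linalg.eigvals(Hs).real)[::-1][r - 1 : r + 1].mean()
    T, Z, sdim = schur(Hs, output='real', sort=lambda x, y: x > thr)
    Ut = np.hstack([U[:, :j] @ Z[:, :r], U[:, j : j + 1]])
    Ht = np.vstack([T[:r, :r], (b @ Z[:, :r])[None, :]])
    return Ut, Ht      # A @ Ut[:, :r] = Ut @ Ht still holds, cf. (13)

# Demo: dominant eigenvalues of a symmetric matrix
rng = np.random.default_rng(6)
n, k, r = 100, 20, 5
Qm = np.linalg.qr(rng.standard_normal((n, n)))[0]
eigs = np.concatenate([np.linspace(0, 1, n - 5), [1.5, 1.6, 1.7, 1.8, 2.0]])
A = (Qm * eigs) @ Qm.T
u = rng.standard_normal(n)
U, H = expand(A, (u / np.linalg.norm(u))[:, None], np.zeros((1, 0)), k)
for _ in range(8):                        # Krylov-Schur cycles
    U, H = truncate(U, H, r)
    U, H = expand(A, U, H, k)
ritz = np.sort(np.linalg.eigvals(H[:k, :k]).real)[::-1][:r]
print(ritz.round(6))
```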
4. Krylov-type methods for tensors. Krylov-type methods that generalize the Arnoldi method to tensors are proposed in [40]. The methods are called Krylov-type methods, because the recursions are generalizations of matrix Krylov recursions, but no analogues of Krylov subspaces can be identified (to our knowledge). These methods are also inspired by Golub-Kahan bidiagonalization [19]. The bidiagonalization process generates two sequences of orthonormal basis vectors for certain Krylov subspaces. In the tensor case three sequences of orthogonal basis vectors are generated that are used to compute a core tensor that corresponds to the matrix H_k in (9). For matrix Krylov methods, once an initial vector has been selected, the following vectors are determined uniquely; in the tensor case, one can choose different combinations of previously computed basis vectors. So there are different variants of tensor Krylov-type methods. We describe briefly two examples in the following.

4.1. Minimal Krylov recursion.
For a given third order tensor A ∈ R^{l×m×n} and two starting vectors, u_1 ∈ R^l and v_1 ∈ R^m, we can obtain a third mode vector by w_1 = A · (u_1, v_1)_{1,2} ∈ R^n. By using the most recently obtained vectors, three sequences of vectors can be generated. Using the modified Gram-Schmidt process, a newly generated vector is immediately orthogonalized against all the previous ones in its mode. The minimal Krylov recursion [40, Algorithm 3], which can be seen as a generalization of the Golub-Kahan bidiagonalization method, is given in Algorithm 2. The orth function orthogonalizes û against U_i and normalizes it. Using the three orthogonal matrices U_k, V_k, and W_k generated by Algorithm 2, we obtain a rank-(k, k, k) approximation of A as

A ≈ (U_k, V_k, W_k) · C,   C = A · ((U_k)_1, (V_k)_2, (W_k)_3).

Algorithm 2 Minimal Krylov recursion
Given: two normalized starting vectors u_1 and v_1; w_1 = A · (u_1, v_1)_{1,2}, normalized.
for i = 1, 2, . . . do
    u_{i+1} = orth(A · (v_i, w_i)_{2,3}; U_i)
    v_{i+1} = orth(A · (u_{i+1}, w_i)_{1,3}; V_i)
    w_{i+1} = orth(A · (u_{i+1}, v_{i+1})_{1,2}; W_i)
end for

4.2. Maximal Krylov recursion. In the minimal Krylov recursion a new vector u_{i+1} is generated based on the two most recently computed v_i and w_i (and correspondingly for the other modes). But we can choose any other available v_j and w_k that have not been combined before in an operation A · (v_j, w_k)_{2,3}. If we decide to use all available combinations, then this is called the maximal Krylov recursion [40, Algorithm 5]. Given V_j and W_k, all combinations can be computed as

Û = A · (V_j, W_k)_{2,3} ∈ R^{l×j×k}.

Next the mode-1 fibers of the tensor Û have to be orthogonalized. The number of basis vectors generated grows very quickly. In the following subsections we will describe a few modifications of the maximal recursion that avoid computing many of the vectors in the maximal recursion, while maintaining as much as possible of the approximation performance.
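A NumPy sketch of the minimal Krylov recursion of Algorithm 2, under our reading of it (each new vector is a contraction of A with the two most recent vectors in the other modes, orthogonalized against its own mode; all names are ours):

```python
import numpy as np

def contract2(A, x, mode_x, y, mode_y):
    """A . (x, y)_{mode_x, mode_y}: contract A with vectors x and y
    (modes are 0-indexed, mode_x < mode_y)."""
    B = np.tensordot(A, x, axes=(mode_x, 0))
    # after the first contraction the remaining axes shift down by one
    return np.tensordot(B, y, axes=(mode_y - 1, 0))

def orth_append(Q, x):
    """Orthogonalize x against the columns of Q (twice, for stability),
    normalize, and append."""
    for _ in range(2):
        x = x - Q @ (Q.T @ x)
    return np.hstack([Q, (x / np.linalg.norm(x))[:, None]])

rng = np.random.default_rng(7)
l, m, n, steps = 10, 11, 12, 4
A = rng.standard_normal((l, m, n))
u = rng.standard_normal(l); u /= np.linalg.norm(u)
v = rng.standard_normal(m); v /= np.linalg.norm(v)
U, V = u[:, None], v[:, None]
w = contract2(A, u, 0, v, 1)
W = (w / np.linalg.norm(w))[:, None]
for i in range(steps):
    U = orth_append(U, contract2(A, V[:, -1], 1, W[:, -1], 2))
    V = orth_append(V, contract2(A, U[:, -1], 0, W[:, -1], 2))
    W = orth_append(W, contract2(A, U[:, -1], 0, V[:, -1], 1))
print(U.shape, V.shape, W.shape)
```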

4.3. Block-Krylov methods.
For large and sparse tensors it is convenient to use software that implements operations with tensors, in particular tensor-matrix multiplication. In our numerical experiments we use the extension of the Matlab tensor toolbox [3] to sparse tensors [4]; it is natural to base some algorithm design decisions on the use of such software. Other languages and implementations are likely to have similar properties.
There are two main reasons why we choose to use block-Krylov methods. Firstly, it is easier to design and describe modifications of the maximal Krylov recursion in terms of blocks. Secondly, and more importantly, the time required for the computation of sparse tensor-vector and tensor-matrix products is dominated by data movements [4, Sections 3.2.4-3.2.5], where the tensor is reorganized to a different format before the multiplication takes place. Therefore, it is better to reuse the reorganized tensor for several vectors, as is done in a tensor-matrix product (akin to the use of BLAS 3 operations in dense linear algebra). In our experiments with a few sparse tensors of moderate dimensions (approximately 500 × 500 × 500 and 3600 × 3600 × 60) and 6 vectors, the time for repeated tensor-vector multiplication was 3-9 times longer than for the corresponding tensor-matrix block multiplication. One should keep in mind, however, that such timings may also be highly problem-, software-and hardware-dependent.
Let A ∈ R^{l×m×n} be given, and assume that, starting from U_0 ∈ R^{l×r_1}, V_0 ∈ R^{m×r_2}, and W_0 ∈ R^{n×r_3}, three sequences of blocks of orthonormal basis vectors (referred to as ON-blocks) have been computed,

U_0, U_1, . . . , U_{λ−1},   V_0, V_1, . . . , V_μ,   W_0, W_1, . . . , W_ν.

Letting p be a block size, and V̄ ∈ R^{m×p} and W̄ ∈ R^{n×p} be blocks selected out of the V and W sequences, we compute a new block U_λ ∈ R^{l×p²} using Algorithm 3.
The algorithm is written in tensor form, in order to make the operations in Steps (ii)-(iv) look like the Gram-Schmidt orthogonalization that it is. In our actual implementations we have avoided some tensor-matrix restructurings, see Appendix B.
Clearly, after Step (iv) the mode-1 vectors (fibers) of U^{(1)} are orthogonalized and organized in a matrix U_λ ∈ R^{l×p²}, i.e. a "thin" QR decomposition is computed, where the matrix H^1_λ is upper triangular. The tensor H^1_λ = fold_1(H^1_λ) ∈ R^{p²×p×p} contains the columns of H^1_λ, and consequently it has a "quasi-triangular" structure. The mode-1 step can be written

(14) A · (V̄, W̄)_{2,3} = ([U_0 U_1 · · · U_λ])_1 · H_λ,

where H_λ stacks the Gram-Schmidt coefficient tensors along the first mode. This can be seen as a generalization to blocks and tensors of one step of the Arnoldi recursion, cf. (9). The mode-2 and mode-3 block-Krylov steps are analogous. Different variants of BK methods can be derived using different choices of V̄ and W̄, etc. However, we will always assume that the blocks U_0, V_0, and W_0 are used to generate the blocks with subscript 1.
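A sketch of one mode-1 block-Krylov step in the spirit of (14): multiply A by the selected blocks in modes 2 and 3, block-Gram-Schmidt the mode-1 unfolding against the existing basis, and compute a thin QR (function and variable names are ours):

```python
import numpy as np

def block_krylov_step(A, Uprev, Vbar, Wbar):
    """Mode-1 step: C = A . (Vbar, Wbar)_{2,3}; orthogonalize the
    mode-1 unfolding of C against Uprev; a thin QR gives the new block."""
    C = np.tensordot(np.tensordot(A, Vbar, axes=(1, 0)), Wbar, axes=(1, 0))
    C1 = C.reshape(C.shape[0], -1)            # mode-1 unfolding, l x p^2
    for _ in range(2):                        # block Gram-Schmidt, repeated
        C1 = C1 - Uprev @ (Uprev.T @ C1)
    Unew, H1 = np.linalg.qr(C1)               # H1 upper triangular
    return Unew, H1

rng = np.random.default_rng(8)
A = rng.standard_normal((30, 20, 25))
U0 = np.linalg.qr(rng.standard_normal((30, 3)))[0]
Vbar = np.linalg.qr(rng.standard_normal((20, 3)))[0]
Wbar = np.linalg.qr(rng.standard_normal((25, 3)))[0]
U1, H1 = block_krylov_step(A, U0, Vbar, Wbar)
print(U1.shape)   # (30, 9): p^2 = 9 new basis vectors
```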
The recursion (14) and its mode-2 and mode-3 counterparts imply the following simple lemma. It will be useful in the computation of gradients.
Lemma 3. Assume that, starting from three ON-blocks U_0, V_0, and W_0, Algorithm 3 and its mode-2 and mode-3 counterparts have been used to generate ON-blocks U_1, V_1, and W_1, with coefficient tensors as in (14). Then

A · ((U_0)_1, (V_0)_2, (W_0)_3) = H^1_0,   A · ((U_1)_1, (V_0)_2, (W_0)_3) = H^1_1,

where H^1_0 and H^1_1 are the mode-1 Gram-Schmidt coefficient tensors in (14) with V̄ = V_0 and W̄ = W_0. The corresponding identities hold for modes 2 and 3.
Proof. Consider the identity (14) with V̄ = V_0 and W̄ = W_0, multiply by U_j^T in the first mode, and use orthogonality.

Proposition 4. Assume that U_0 ∈ R^{l×r_1}, V_0 ∈ R^{m×r_2}, and W_0 ∈ R^{n×r_3} have orthonormal columns. Let (U_0, V_0, W_0) be a starting point for one block-Krylov step in each mode with A ∈ R^{l×m×n}, giving (U_1, V_1, W_1) and coefficient tensors as in Lemma 3. Then the norm of the G-gradient of (5) at (U_0, V_0, W_0) can be computed from these small coefficient tensors; in particular, with F = A · ((U_0)_1, (V_0)_2, (W_0)_3), the mode-1 component has norm ‖⟨H^1_1, F⟩_{-1}‖.

Proof. By Lemma 3 and orthogonality, the mode-1 gradient at (U_0, V_0, W_0) involves U_⊥ only through the block U_1, and the mode-1 result follows. The proof for the other modes is analogous.
Assume we have an algorithm based on block-Krylov steps in all three modes, and we want to compute the G-gradient to check if a point (U 0 , V 0 , W 0 ) is approximately stationary. Then, by Proposition 4, we need only perform one block-Krylov step in each mode, starting from (U 0 , V 0 , W 0 ), thereby avoiding the computation of U ⊥ , V ⊥ , and W ⊥ , which are usually large and dense. If the norm of the G-gradient is not small enough, then one would perform more block-Krylov steps. Thus the gradient computation comes for free, essentially.
4.3.1. Block-Krylov methods for (1,2)-symmetric tensors. In many real applications, the tensors are (1, 2)-symmetric. This is the case, for instance, if the mode-3 slices of the tensor represent undirected graphs. In the rest of this section we will assume that A ∈ R m×m×n is (1, 2)-symmetric, and we will compute two sequences of blocks U 1 , U 2 , . . . and W 1 , W 2 , . . ., where the U λ blocks contain basis vectors for modes 1 and 2, and the W ν for mode 3.
A new block U_λ is computed from given blocks Ū and W̄ in the same way as in the nonsymmetric case, using Algorithm 3.
To compute a new block W_ν, we use two blocks Ū and Ū′. If Ū ≠ Ū′, then we can use the analogue of Algorithm 3. In the case Ū = Ū′ the product tensor A · (Ū, Ū)_{1,2} is (1,2)-symmetric, which means that almost half of its 3-fibers are superfluous and should be removed. Thus, letting W^{(3)} denote the tensor that is obtained in (iii) of Algorithm 3, we compute the "thin" QR decomposition

triunfold_3(W^{(3)}) = W_ν H_ν,

where triunfold_3(X) denotes the mode-3 unfolding of the (1,2)-upper triangular part of the tensor X. If Ū ∈ R^{m×p}, then W_ν ∈ R^{n×p_ν}, where p_ν = p(p + 1)/2. A careful examination of Lemma 3 for the case of (1,2)-symmetric tensors shows that the corresponding result holds also here. We omit the derivations in order not to make the paper too long.
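When the two mode-1 blocks coincide, the (1,2)-symmetric product tensor has redundant fibers; a `triunfold_3`-style helper that keeps only the fibers X(i, j, :) with i ≤ j can be sketched as follows (a hedged sketch, names ours):

```python
import numpy as np

def triunfold3(X):
    """Mode-3 unfolding of the (1,2)-upper triangular part of X:
    keep only the fibers X(i, j, :) with i <= j."""
    p = X.shape[0]
    i, j = np.triu_indices(p)
    return X[i, j, :].T          # n x p(p+1)/2

rng = np.random.default_rng(9)
p, n = 4, 12
# a (1,2)-symmetric product tensor, as for A . (Ubar, Ubar)_{1,2}
X = rng.standard_normal((p, p, n))
X = X + X.transpose(1, 0, 2)
M = triunfold3(X)
W, H = np.linalg.qr(M)           # thin QR: W has p(p+1)/2 columns
print(M.shape)                   # prints (12, 10)
```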

Min-BK method.
Our simplest block-Krylov method is the (1,2)-symmetric block version of the minimal Krylov recursion of Section 4.1, which we refer to as the min-BK method. Here, instead of using only two vectors in the multiplication û = A · (u_i, w_i)_{2,3}, we use the p first vectors from the previous blocks. Let Ū = U(:, 1 : p) denote the first p vectors of a matrix block U. The parameter s in Algorithm 4 is the number of stages.
Algorithm 4 Min-BK method for (1,2)-symmetric tensor A.
Given U_0 and W_0
for i = 1 : s
    Ū_{i−1} = U_{i−1}(:, 1 : p),  W̄_{i−1} = W_{i−1}(:, 1 : p)
    compute U_i from A · (Ū_{i−1}, W̄_{i−1})_{2,3} % Gram-Schmidt giving U_i
    compute W_i from A · (Ū_i, Ū_i)_{1,2} % Gram-Schmidt giving W_i
end for

The min-BK method is further described in the three diagrams of Table 1. Note that to conform with Proposition 4 we always use Ū_0 = U_0 and W̄_0 = W_0. It is seen that the growth of the number of basis vectors, the k_i parameters, is relatively slow. However, it will be seen in Section 6 that this method is not competitive.

Max-BK method.
The max-BK method is maximal in the sense that in each stage we use all the available blocks to compute new blocks. The algorithm is defined by three diagrams, see Table 2. E.g., in stage 2 we use combinations of the whole blocks U 0 , U 1 , W 0 , and W 1 , to compute U 2 , U 3 , and U 4 (U 1 was computed already in stage 1). The diagram for the W i 's is triangular due to the (1,2)-symmetry of A: A · (U 0 , U 1 ) 1,2 = A · (U 1 , U 0 ) 1,2 , for instance.
It is clear that the fast growth of the number of basis vectors makes this variant impractical, except for small values of r and s, e.g. r = (2, 2, 2) and s = 2. In the same way as in the matrix Krylov-Schur method, we are not interested in too large values of k_1 and k_3, because we will project the original problem to one of dimension (k_1, k_1, k_3), which will be solved by methods for dense tensors. Hence we will in the next subsection introduce a compromise between the min-BK and the max-BK methods.

Table 2: Diagrams of the max-BK method. Top: Combinations of blocks of basis vectors for computing new blocks. Bottom: Number of basis vectors in the stages of the max-BK method. Columns 3 and 5 give the number of basis vectors for r = (2, 2, 2) and r = (7, 7, 7).

BK method.
The BK method is similar to min-BK in that it uses only the p first vectors of each block in the block-Krylov step. In each stage more new blocks are computed than in min-BK, but not as many as in max-BK. Both these features are based on numerical tests, where we investigated the performance of the BK method in the block Krylov-Schur method to be described in Section 5. We found that if we omitted the diagonal blocks in the diagrams in Table 2, then the convergence of the Krylov-Schur method was only marginally impeded. The BK method is described in the three diagrams of Table 3.

Table 3: Diagrams of the BK method. Top: Combinations of blocks of basis vectors for computing the new blocks. In the case when U_0 ∈ R^{m×2} and p ≥ 3, we let W̄_1 = W_1. Bottom: Basis blocks in the stages of the BK method, and the number of vectors for p = 4, and r = (2, 2, 2), r = (7, 7, 7), and r = (10, 10, 10).

It may happen that one of the dimensions of the tensor is considerably smaller than the others. Assume that m ≫ n. Then after a few stages the number of vectors in the third mode may be equal to n, and no more can be generated in that mode. Then the procedure is modified in an obvious way (the right diagram is no longer used) so that only vectors in the other modes (U blocks) are generated. The min-BK and max-BK methods can be modified analogously.

5. A tensor Krylov-Schur-like method.
When tensor Krylov-type methods are used for the computation of low multilinear rank approximations of large and sparse tensors, they suffer from the same weakness as Krylov methods for matrices: the computational burden for orthogonalizing the vectors as well as memory requirements may become prohibitive. Therefore a restarting procedure should be tried. We will now describe a generalization of the matrix Krylov-Schur approach to a corresponding tensor method. Here we assume that the tensor is non-symmetric.
Let A ∈ R^{l×m×n} be a given third order tensor, for which we want to compute the best rank-(r_1, r_2, r_3) approximation. For reference we restate the maximization problem,

(16) max_{U,V,W} Φ(U, V, W) = ‖A · ((U)_1, (V)_2, (W)_3)‖,   U^T U = I_{r_1}, V^T V = I_{r_2}, W^T W = I_{r_3}.

Let k_1, k_2, and k_3 be integers such that

(17) r_1 < k_1 ≪ l,   r_2 < k_2 ≪ m,   r_3 < k_3 ≪ n,

and assume that we have computed, using a BK method, a rank-(k_1, k_2, k_3) approximation

(18) A ≈ (X, Y, Z) · C,

where X ∈ R^{l×k_1}, Y ∈ R^{m×k_2}, and Z ∈ R^{n×k_3} are matrices with orthonormal columns, and C ∈ R^{k_1×k_2×k_3} is a core tensor. With the assumption (17), we can use an algorithm for dense tensors, e.g. a Newton-Grassmann method [14,24], to solve the projected maximization problem

(19) max_{P,Q,S} ‖C · ((P)_1, (Q)_2, (S)_3)‖,   P^T P = I_{r_1}, Q^T Q = I_{r_2}, S^T S = I_{r_3}.

From the solution (Û, V̂, Ŵ) of (19) we have the best rank-(r_1, r_2, r_3) approximation of C,

(20) C ≈ (Û, V̂, Ŵ) · F,

where F ∈ R^{r_1×r_2×r_3} is the core tensor. This step is analogous to computing and truncating the Schur decomposition of the matrix H_k in the matrix case in Section 3.1. Combining (20) and (18) we can write

(21) A ≈ (X, Y, Z) · ((Û, V̂, Ŵ) · F) = (U, V, W) · F,

where U = XÛ ∈ R^{l×r_1}, V = YV̂ ∈ R^{m×r_2}, and W = ZŴ ∈ R^{n×r_3} have orthonormal columns. Thus (21) is a rank-(r_1, r_2, r_3) approximation of A. Then, starting with U_0 = U, V_0 = V, and W_0 = W, we can again expand (21), using a BK method, to a rank-(k_1, k_2, k_3) approximation (18). A sketch of the tensor Krylov-Schur method is given in Algorithm 5. The algorithm is an outer-inner iteration. In the outer iteration (16) is projected to the problem (19) using the bases (X, Y, Z).
Notice that if A is a (1, 2)-symmetric tensor, then all aforementioned relations are satisfied with U = V and X = Y. So equation (21) becomes A ≈ (U, U, W) · F.

Convergence of the tensor Krylov-Schur algorithm.
In the discussion below we will make the following assumption: (22) the G-Hessian for (16) at the stationary point is negative definite, i.e., the stationary point is a strict local maximum.
Note that (22) is equivalent to assuming that the objective function is strictly concave in a neighborhood of the stationary point. Let (U_0, V_0, W_0) be an approximate solution, and let X = [U_0 U_1], Y = [V_0 V_1], and Z = [W_0 W_1] be the expanded bases of ON-blocks. For simplicity we here have included more than one block-Krylov step in U_1, V_1, and W_1. Let X_⊥ be a matrix such that (X X_⊥) is orthogonal, and make the analogous definitions for Y_⊥ and Z_⊥. We then make the change of variables

B = A · ((X X_⊥)_1, (Y Y_⊥)_2, (Z Z_⊥)_3).

The tensor

C = A · ((X)_1, (Y)_2, (Z)_3)

is a subtensor of B, see Figure 1. In the discussion of convergence we can, without loss of generality, consider the equivalent maximization problem for B,

(23) max ‖B · (P, Q, S)‖,   P^T P = I_{r_1}, Q^T Q = I_{r_2}, S^T S = I_{r_3}.

Now the approximate solution (U_0, V_0, W_0) is represented by E_0 = (E_0^{(1)}, E_0^{(2)}, E_0^{(3)}) with E_0^{(i)} = [I_{r_i}; 0], and we have enlarged E_0 by one or more block-Krylov steps. In the inner iteration we shall now compute the best rank-(r_1, r_2, r_3) approximation for C,

(24) max ‖C · (P, Q, S)‖,   P^T P = I_{r_1}, Q^T Q = I_{r_2}, S^T S = I_{r_3},
using the Newton-Grassmann method (note that P ∈ R^{k_1×r_1}, Q ∈ R^{k_2×r_2}, and S ∈ R^{k_3×r_3}). Denote the core tensor after this computation by F̃. Due to the fact that the previous core F is a subtensor of C, it follows that

‖F̃‖ ≥ ‖F‖,

and evidently the Krylov-Schur algorithm produces a non-decreasing sequence of objective function values that is bounded above (by ‖A‖). If E_0 is the local maximum point for (23), then the G-gradient ∇_B(E_0) = 0, and, by Proposition 4, ∇_C(Ē_0) = 0, where Ē_0 corresponds to E_0. Therefore, the Newton-Grassmann method for (19) will not advance, but give F̃ = F.
On the other hand, if E_0 is not the local maximum, then ∇_B(E_0) and ∇_C(Ē_0) are nonzero. Assume that we are close to a local maximum, so that the G-Hessian for (23) is negative definite. Then the G-Hessian for (24) is also negative definite, see Appendix A, and the projected maximization problem (19) is locally strictly concave. Therefore the Newton-Grassmann method will converge to a solution that satisfies ‖F̃‖ > ‖F‖ [18, Theorem 3.1.1].
Thus, we have the following result.
Theorem 5. Assume that assumption (22) holds, and that (U_0, V_0, W_0) is close enough to a local maximum of (16). Then Algorithm 5 will converge to that local maximum.
The G-gradient is zero at the local maximum; thus the Krylov-Schur-like algorithm converges to a stationary point for the best rank-(r 1 , r 2 , r 3 ) approximation problem.
6. Numerical tests. In this section we will investigate the performance of the different block-Krylov-Schur methods by applying them to some test problems. As a comparison we will use the HOOI method. We here give a brief description, for details, see [9].
6.1. Higher Order Orthogonal Iteration. Consider first the nonsymmetric case (5). HOOI is an alternating iterative method [9], where in each iteration three maximization problems are solved. In the first maximization we assume that V and W are given satisfying the constraints, and maximize

(26)   max ‖A · (U, V, W)‖,   U^T U = I_r1.

The solution of this problem is given by the first r1 left singular vectors of the mode-1 unfolding C(1) of C1 = A · (V, W)_{2,3}, and that is taken as the new approximation U. Then in the second maximization U and W are considered as given and V is determined, etc. The cost for computing the thin SVD is O(l(r2 r3)^2) (under assumption (17)). As this computation is highly optimized, it is safe to assume that for large and sparse tensors the computational cost in HOOI is dominated by the tensor-matrix multiplications A · (V, W)_{2,3} (and the corresponding products in the other modes), and by the reshaping of the tensor necessary for the multiplication. In addition, the computation of the G-gradient involves extra tensor-matrix multiplications. Here it is more efficient to use global coordinates, cf. Section 2.6. In our experiments we computed the G-gradient only every ten iterations.
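As an illustration, one HOOI sweep for a general third-order tensor can be sketched in Python/NumPy (a dense sketch for small tensors only; the function names and the einsum-based tensor-matrix products are our illustration, not the paper's Matlab implementation):

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: mode-n fibers become the rows' columns."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hooi_sweep(A, U, V, W):
    """One HOOI sweep: each factor is updated to the leading left
    singular vectors of the unfolded projected tensor, so the
    objective ||A . (U, V, W)|| cannot decrease."""
    r1, r2, r3 = U.shape[1], V.shape[1], W.shape[1]
    C1 = np.einsum('ijk,jb,kc->ibc', A, V, W)   # A . (V, W)_{2,3}
    U = np.linalg.svd(unfold(C1, 0), full_matrices=False)[0][:, :r1]
    C2 = np.einsum('ijk,ia,kc->ajc', A, U, W)   # A . (U, W)_{1,3}
    V = np.linalg.svd(unfold(C2, 1), full_matrices=False)[0][:, :r2]
    C3 = np.einsum('ijk,ia,jb->abk', A, U, V)   # A . (U, V)_{1,2}
    W = np.linalg.svd(unfold(C3, 2), full_matrices=False)[0][:, :r3]
    return U, V, W
```

For a large sparse tensor the einsum calls would be replaced by sparse tensor-matrix multiplications, which, as noted above, dominate the cost.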
For a (1,2)-symmetric tensor we use a variant of the HOOI method with two maximizations in each step: one for U, with the previous value of U in C1, and one for W, with C3 = A · (U, U)_{1,2}.
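One step of this (1,2)-symmetric variant can be sketched as follows (dense NumPy for illustration; the function name is ours). The U-update uses the previous U in C1, and the W-update uses C3 = A · (U, U)_{1,2}:

```python
import numpy as np

def hooi_sweep_sym(A, U, W):
    """One step of a (1,2)-symmetric HOOI variant: update U using the
    previous U in C1 = A . (U, W)_{2,3}, then W using C3 = A . (U, U)_{1,2}."""
    r1, r3 = U.shape[1], W.shape[1]
    C1 = np.einsum('ijk,jb,kc->ibc', A, U, W)
    U = np.linalg.svd(C1.reshape(A.shape[0], -1),
                      full_matrices=False)[0][:, :r1]
    C3 = np.einsum('ijk,ia,jb->abk', A, U, U)
    W = np.linalg.svd(np.moveaxis(C3, 2, 0).reshape(A.shape[2], -1),
                      full_matrices=False)[0][:, :r3]
    return U, W
```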
The HOOI iteration has linear convergence; its convergence properties are studied in [49]. In general, alternating methods are not guaranteed to converge to a local maximizer [37,38]. For some tensors, HOOI needs quite a few iterations before the convergence stabilizes at a constant linear rate. On the other hand, HOOI converges relatively fast for several well-conditioned problems. In our tests the HOOI method is initialized with random matrices with orthonormal columns.
For large tensors we use HOOI as a starting procedure for the inner Newton-Grassmann iterations to improve robustness. Note that here the tensor C is much smaller, of dimension (k1, k1, k3).

6.2. Experiments.
To investigate the performance of Krylov-Schur methods we present in the following the results of experiments on a few (1,2)-symmetric tensors. In the outer iteration, the stopping criterion is that the relative norm of the G-gradient satisfies ‖∇‖/‖F‖ ≤ 10^-13.
In the inner iteration (Step (i) of Algorithm 5) the best rank-(r1, r1, r3) approximation of the tensor C is computed using the Newton-Grassmann algorithm [14], initialized by the HOSVD of C truncated to rank (r1, r1, r3), followed by 5 HOOI iterations to improve robustness (recall that this problem is of dimension (k1, k1, k3)). The same stopping criterion as in the outer iteration was used. The typical iteration count for the inner Newton-Grassmann iteration was 4-5.
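The truncated-HOSVD initialization of the inner problem can be sketched as follows (dense NumPy, our function names; the Newton-Grassmann iteration itself is not shown):

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding of a third-order tensor."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def truncated_hosvd(C, ranks):
    """Truncated HOSVD: factor n consists of the leading left singular
    vectors of the mode-n unfolding of C; used here only as a starting
    point for the inner rank-(r1, r1, r3) computation."""
    return [np.linalg.svd(unfold(C, n), full_matrices=False)[0][:, :r]
            for n, r in enumerate(ranks)]
```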
In the figures the convergence history of the following four methods is illustrated, where the last three are named according to the block Krylov-type method used in the outer iterations: 1. HOOI, 2. max-BKS(s; k1,k3), 3. BKS(s,p; k1,k3), 4. min-BKS(s,p; k1,k3). Here s denotes the stage, k1 and k3 the number of basis vectors in the X and Z bases, respectively, and p indicates that the first p vectors of each ON-block are used in BKS and min-BKS.
The convergence results are presented in figures, where the y-axis and x-axis represent the relative norm of G-gradient and the number of outer iterations, respectively. For the larger examples we also plot the gradient against the execution time.
In the examples we used rather small values of (r1, r1, r3), as in some real-world applications [34,36]. In two forthcoming papers [17,16] on tensor partitioning and data science applications we compute rank-(2,2,2) approximations.
The experiments were performed using Matlab and the tensor toolbox [4] on a standard desktop computer with 8 GBytes of RAM. In all test cases the memory was sufficient.
6.2.1. Example 1. Synthetic signal-plus-noise tensor. For the first test we generated synthetic (1,2)-symmetric tensors with specified low rank. Let A = A_signal + ρ A_noise, where A_signal is a signal tensor with low multilinear rank, A_noise is a noise tensor, and ρ denotes the noise level. A_signal was constructed as a tensor of dimension (r1, r1, r3) with normally distributed N(0,1) elements; this was placed at the top left corner of a zero tensor of size m × m × n. The elements of the noise tensor were chosen normally distributed N(0,1), and then the tensor was made (1,2)-symmetric. For testing purposes we performed a random permutation such that the tensor remained (1,2)-symmetric. This tensor is dense. The purpose of this example is to demonstrate that the rate of convergence of all methods depends heavily on the conditioning of the approximation problem (cf. [15, Corollary 4.5] and the short statement in Section 5.1). Figure 2 illustrates the convergence results for a 200 × 200 × 200 tensor approximated by one of rank-(2,2,2), which is the correct rank of the signal tensor. The problem with ρ = 10^-2 is more difficult than the other one, because the signal-to-noise ratio is smaller, making the approximation problem more ill-conditioned. This is reflected in the iteration counts of all methods. See also Table 4, where the S-values are given. HOOI was considerably more sensitive to the starting approximation than the other methods; for a few starting values it converged much more slowly than in the figure. For this small problem it is possible to compute the solution directly using HOOI and the Newton-Grassmann method, without the Krylov-Schur approach. We compared that solution with the one obtained using the max-BK method, and they agreed to within less than the magnitude of the stopping criterion (the angles between the subspaces spanned by the solution matrices were computed).
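The construction of the synthetic test tensors can be sketched as follows (a dense NumPy sketch of our reading of the description above; the function name and the exact symmetrization are ours):

```python
import numpy as np

def signal_plus_noise(m, n, r1, r3, rho, rng):
    """Synthetic (1,2)-symmetric test tensor A = A_signal + rho * A_noise."""
    # low-multilinear-rank signal block, made symmetric in modes 1 and 2
    S = rng.standard_normal((r1, r1, r3))
    S = 0.5 * (S + S.transpose(1, 0, 2))
    A = np.zeros((m, m, n))
    A[:r1, :r1, :r3] = S
    # dense N(0,1) noise, symmetrized in modes 1 and 2
    N = rng.standard_normal((m, m, n))
    A += rho * 0.5 * (N + N.transpose(1, 0, 2))
    # random relabeling that preserves (1,2)-symmetry: the same
    # permutation is applied to modes 1 and 2
    p, q = rng.permutation(m), rng.permutation(n)
    return A[np.ix_(p, p, q)]
```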
For the small problems in this example it is not meaningful to compare the computational efficiency of the BKS methods to that of HOOI: due to the simplicity of HOOI, it comes out as a winner in the cases when it converges.

6.2.2. Example 2. The Princeton tensor [46]. We constructed this tensor from a social network, using a student/faculty flag as third mode: the tensor element A(λ, µ, ν) = 1 if students λ and µ are friends and one of them has flag ν. After zero slices in the third mode have been removed, this is a 6593 × 6593 × 29 tensor with 1.2 · 10^6 nonzeros. Figure 3 shows the convergence for the Princeton tensor approximated with a rank-(2,2,2) and a rank-(4,4,4) tensor. In both cases the inner iterations were initialized with truncated HOSVD followed by 5 HOOI iterations. The time measurements in Figure 3 are based on the Matlab functions tic and toc. A large proportion of the time in HOOI is spent computing the G-gradient (which was done every ten iterations). In Table 5 we give the S-values. The mode-3 entries are a strong indication that the mode-3 multilinear rank of the tensor is equal to 3. Any attempt to compute e.g. a rank-(4,4,4) approximation will suffer from the mode-3 ill-conditioning and is likely to give incorrect results. However, a rank-(4,4,3) approximation can be computed easily using BKS. Here the convergence of HOOI was very slow.
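The construction of such a social-network tensor can be sketched as follows (a small dense illustration; the function name and input format are hypothetical, and the real tensor is of course stored in sparse format):

```python
import numpy as np

def friendship_tensor(n_people, n_flags, edges, flags):
    """A[l, m, v] = 1 if l and m are friends and one of them has flag v;
    symmetric in the first two modes by construction."""
    A = np.zeros((n_people, n_people, n_flags))
    for l, m in edges:
        for v in (flags[l], flags[m]):
            A[l, m, v] = A[m, l, v] = 1.0
    return A
```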
The number of iterations in the BKS method was rather insensitive to the choice of stage and block size parameters s and p. Thus it did not pay off to use a larger inner subproblem. Similarly, the use of max-BKS(2;111,29) gave relatively fast convergence in terms of the number of iterations, but the extra information gained by using a large value of k 1 was not so substantial that it made up for the heavier computations.
HOOI was sensitive to the choice of starting approximation. Often the convergence rate was considerably slower than in Figure 3.
6.2.3. Example 3. The Reuters tensor. The Reuters tensor is a sparse tensor of dimensions 13332 × 13332 × 66 with 486,894 nonzero elements. It is constructed from all stories released during 66 consecutive days by the news agency Reuters after the September 11, 2001, attack [5]. The vertices of the network are words. There is an edge between two words if they appear in the same text unit (sentence). The weight of an edge is its frequency. Figure 4 shows the convergence results for the Reuters tensor, approximated with a rank-(2,2,2) and a rank-(6,6,6) tensor. In the second case, the inflexibility of the choice of k1 and k3 in max-BKS forced us to use stage 1, which led to slower convergence than with BKS and min-BKS. The S-values for (r1, r2, r3) = (2,2,2) and (6,6,6) are given in Table 6. It is seen that none of the problems is particularly ill-conditioned.

It is argued in [17] that in cases when the 3-slices of a (1,2)-symmetric tensor are adjacency matrices of graphs, one should normalize the slices so that the largest eigenvalue of each slice becomes equal to 1. In that context a rank-(2,2,2) approximation is computed. We ran a test with the normalized tensor and the same parameter values as in Figure 4. The results are shown in Figure 5, and the S-values are given in Table 7. They indicate that this problem is slightly more ill-conditioned than the unscaled one.

6.2.4. Example 4. The 1998 DARPA tensor. The following description is taken from [16]. In [25] network traffic logs are analyzed in order to identify malicious attackers. The data are called the 1998 DARPA Intrusion Detection Evaluation Dataset and were first published by the Lincoln Laboratory at MIT. We downloaded the data set from https://datalab.snu.ac.kr/haten2/ in October 2018.
The records consist of (source IP, destination IP, port number, timestamp). In the data file there are about 22000 different IP addresses. We chose the subset of 8991 addresses that both sent and received messages. The time span for the data is from June 1, 1998 to July 18, and the number of observations is about 23 million. We merged the data in time by collecting every 63999 consecutive observations into one bin. Finally we symmetrized the tensor A ∈ R^(m×m×n), where m = 8891 and n = 371, so that a_ijk = 1 if i communicated with j in time slot k, and a_ijk = 0 otherwise.
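The binning and symmetrization step can be sketched as follows (dense NumPy toy version; the function name and input format are hypothetical, and the real tensor is stored in sparse format). Records are (source, destination) pairs in time order, and every bin_size consecutive observations form one time slot:

```python
import numpy as np

def traffic_tensor(records, n_ip, bin_size):
    """a_ijk = 1 if i communicated with j in time slot k (symmetrized),
    and 0 otherwise; slot k holds observations bin_size*k .. bin_size*(k+1)-1."""
    n_slots = (len(records) + bin_size - 1) // bin_size
    A = np.zeros((n_ip, n_ip, n_slots))
    for idx, (i, j) in enumerate(records):
        k = idx // bin_size
        A[i, j, k] = A[j, i, k] = 1.0
    return A
```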
In this example we did not normalize the slices of the tensor: The 3-slices are extremely sparse, and normalization makes the rank-(2,2,2) problem so ill-conditioned that none of the algorithms converged. Instead we scaled the slices to have Frobenius norm equal to 1. The convergence history is shown in Figure 6. The HOOI method was sensitive to the (random) starting approximations. It did happen that the method converged rapidly, but in many cases convergence was extremely slow.
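The two slice scalings discussed here (the spectral normalization of [17] and the Frobenius scaling used in this example) can be sketched as follows (hypothetical helper names; dense NumPy for illustration, with symmetric slices assumed for the eigenvalue computation):

```python
import numpy as np

def normalize_slices_spectral(A):
    """Scale each symmetric 3-slice A[:, :, k] so that its largest
    eigenvalue in magnitude becomes 1 (zero slices are left untouched)."""
    B = A.copy()
    for k in range(B.shape[2]):
        lam = np.max(np.abs(np.linalg.eigvalsh(B[:, :, k])))
        if lam > 0:
            B[:, :, k] /= lam
    return B

def scale_slices_frobenius(A):
    """Scale each 3-slice A[:, :, k] to unit Frobenius norm
    (zero slices are left untouched)."""
    B = A.copy()
    for k in range(B.shape[2]):
        nrm = np.linalg.norm(B[:, :, k])
        if nrm > 0:
            B[:, :, k] /= nrm
    return B
```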
The S-values are given in Table 8. The problem is well-conditioned.