A μ-mode BLAS approach for multidimensional tensor-structured problems

In this manuscript, we present a common tensor framework which can be used to generalize one-dimensional numerical tasks to arbitrary dimension d by means of tensor product formulas. This is useful, for example, in the context of multivariate interpolation, multidimensional function approximation using pseudospectral expansions, and the solution of stiff differential equations on tensor product domains. The key to an efficient BLAS-based implementation is the suitable use of the μ-mode product (also known as the tensor-matrix product or mode-n product) and related operations, such as the Tucker operator. Their MathWorks MATLAB®/GNU Octave implementations are discussed in the paper and collected in the package KronPACK. We present numerical results on experiments up to dimension six from different fields of numerical analysis, which show the effectiveness of the approach.


Introduction
Many one-dimensional tasks in numerical analysis can be generalized to a two-dimensional formulation by means of tensor product formulas. This is the case, for example, in the context of spectral decomposition or interpolation of multivariate functions. Indeed, the one-dimensional formula in which the values t_j are linearly combined to obtain the values s_i (i.e. s = Lt, with s = (s_i) ∈ C^n, t = (t_j) ∈ C^m, and L = (ℓ_{ij}) ∈ C^{n×m}) can be easily extended to the two-dimensional case as

    s_{i1 i2} = Σ_{j1=1}^{m1} Σ_{j2=1}^{m2} ℓ^1_{i1 j1} ℓ^2_{i2 j2} t_{j1 j2},   1 ≤ iµ ≤ nµ.   (1)

The meaning of the involved scalar quantities depends on the specific example under consideration. In any case, a straightforward implementation of formula (1) requires four nested for-loops, with a resulting computational cost of O(n^4) (if, for simplicity, we take m1 = m2 = n1 = n2 = n). On the other hand, formula (1) can be written equivalently in matrix formulation as

    S = L1 T L2^T,   (2)

where L1 = (ℓ^1_{i1 j1}) ∈ C^{n1×m1}, L2 = (ℓ^2_{i2 j2}) ∈ C^{n2×m2}, T = (t_{j1 j2}) ∈ C^{m1×m2} and S = (s_{i1 i2}) ∈ C^{n1×n2}. Formula (2) requires two separate matrix-matrix products, each of which can be implemented with three nested for-loops: this approach reduces the cost of computing the elements of S to O(n^3). A still more efficient way to realize formula (2) is to exploit optimized Basic Linear Algebra Subprograms (BLAS) [1-4], a set of numerical linear algebra routines that perform such matrix operations with a level of efficiency close to the theoretical hardware limit. A performance comparison of the three approaches to compute the values s_{i1 i2} in matlab language, for increasing size of the task, is given in Table 1. As expected, for all the values of n under study, the most efficient way to compute the elements of S is to realize formula (2) through BLAS. Remark that the considerations on computational cost and BLAS efficiency are essentially language-independent, and apply to other interpreted or compiled languages as well, such as Python, Julia, R, Fortran, and C++. For clarity of exposition and simplicity of presentation of the codes, from now on we will use the matlab programming language.

                                        n = 50    n = 100   n = 200   n = 400
  Nested for-loops                      1.8e-2    2.8e-1    4.8e0     8.0e1
  Matrix-matrix products (for-loops)    7.8e-4    5.5e-3    4.9e-2    3.9e-1
  Matrix-matrix products (BLAS)         2.1e-5    5.6e-5    1.7e-4    1.2e-3

Table 1  Wall-clock time (in seconds) for the computation of the values s_{i1 i2} in formula (1) with increasing size m1 = m2 = n1 = n2 = n and different approaches, using MathWorks MATLAB® R2019a. The input values are standard normal distributed random numbers.
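The equivalence of the two formulations can be checked in a few lines. The following NumPy sketch (an illustrative analogue of the matlab experiment, with a much smaller n so that the quadruple loop finishes quickly) verifies that the element-wise formula (1) and the BLAS-backed matrix products of formula (2) produce the same matrix S.

```python
import numpy as np

# Formula (1) via four nested loops versus formula (2), S = L1 * T * L2^T,
# dispatched to BLAS. Small n: the loop version costs O(n^4) operations.
n = 20
rng = np.random.default_rng(0)
L1 = rng.standard_normal((n, n))
L2 = rng.standard_normal((n, n))
T = rng.standard_normal((n, n))

# Element-wise formulation (1): four nested for-loops.
S_loops = np.zeros((n, n))
for i1 in range(n):
    for i2 in range(n):
        for j1 in range(n):
            for j2 in range(n):
                S_loops[i1, i2] += L1[i1, j1] * L2[i2, j2] * T[j1, j2]

# Matrix formulation (2): two matrix-matrix products, O(n^3) via BLAS.
S_blas = L1 @ T @ L2.T

assert np.allclose(S_loops, S_blas)
```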
In other contexts, such as the numerical solution of (stiff) differential equations on two-dimensional tensor product domains by means of exponential integrators or preconditioned iterative methods, it is required to compute quantities like

    vec(S) = (L2 ⊗ L1) vec(T),   (3)

where again L1, L2, T and S are matrices of suitable size whose meaning depends on the specific example under consideration. Here ⊗ denotes the standard Kronecker product of two matrices, while vec represents the vectorization operator; see Appendix A for their formal definitions. A straightforward implementation of formula (3) would need to assemble the large-sized matrix L2 ⊗ L1. If, for simplicity, we consider again m1 = m2 = n1 = n2 = n, this approach requires a storage and a computational cost of O(n^4), which is impractical. However, owing to the properties of the Kronecker product (see Appendix A), formula (3) is equivalent to formula (2). Therefore, all the considerations made for the previous example on the employment of optimized BLAS apply in this case as well.

The aim of this work is to provide a common framework for generalizing formula (2) to arbitrary dimension d, which results in an efficient BLAS realization of the underlying task. This is very useful in the context of solving tensor-structured problems arising from different scientific and engineering fields. The pursued approach is illustrated in detail in Section 2, in which we present the µ-mode product and some associated operations (the Tucker operator, in particular), both from a theoretical and a practical point of view. These operations are widely known in the tensor algebra community, but their usage is mostly restricted to the context of tensor decompositions (see [5,6]). Then, in Section 3 we describe more precisely the one- and two-dimensional formulations of the problems mentioned in this section, as well as their generalization to the d-dimensional case in terms of µ-mode products. We collect in Section 4 the related numerical experiments, and we finally draw the conclusions in Section 5.
All the functions and the scripts needed to perform the relevant tensor operations and to reproduce the numerical examples of this manuscript are contained in our matlab package KronPACK2 .
The µ-mode product and its applications
In order to generalize formula (2) to the d-dimensional case, we rely on some concepts from tensor algebra (see [5,6] for more details). Throughout this section, we assume that T ∈ C^{m1×···×md} is an order-d tensor whose elements are denoted either by t_{j1...jd} or by T(j1, ..., jd).
Definition 2.1. A µ-fiber of T is a vector in C^{mµ} obtained by fixing every index of the tensor but the µth.
A µ-fiber is nothing but a generalization of the concept of rows and columns of a matrix. Indeed, for an order-2 tensor (i.e. a matrix), 1-fibers are the columns, while 2-fibers are the rows. On the other hand, for an order-3 tensor, 1-fibers are the column vectors, 2-fibers are the row vectors, and 3-fibers are the so-called "page" or "tube" vectors, that is, vectors along the third dimension.
Definition 2.2. The µ-matricization of T is the matrix T^(µ) ∈ C^{mµ×(m1···md)/mµ} whose columns are the µ-fibers of T.
Remark that for an order-2 tensor the 1- and 2-matricizations simply correspond to the matrix itself and its transpose, respectively. In dimensions higher than two, the µ-matricization requires the concept of generalized transpose of a tensor and its unfolding into a matrix. The first operation is realized in matlab by the permute function, which we use to interchange µ-fibers with 1-fibers of the tensor T. The second operation is performed by the reshape function, which we use to unfold the "transposed" tensor into the matrix T^(µ). In matlab, given m = size(T); d = length(m);, the anonymous function which performs the µ-matricization of a tensor T can be written as

mumat = @(T,mu) reshape(permute(T,[mu,1:mu-1,mu+1:d]),m(mu),[]);

By means of µ-fibers, it is possible to define the following operation.

Definition 2.3. Let L ∈ C^{n×mµ}. The µ-mode product of T with L is the tensor S = T ×µ L ∈ C^{m1×···×mµ−1×n×mµ+1×···×md} with elements

    S(j1, ..., jµ−1, i, jµ+1, ..., jd) = Σ_{jµ=1}^{mµ} ℓ_{i jµ} T(j1, ..., jd),   1 ≤ i ≤ n.

From this definition, it appears clear that the µ-fiber S(j1, ..., jµ−1, •, jµ+1, ..., jd) of S can be computed as the matrix-vector product of L with the µ-fiber T(j1, ..., jµ−1, •, jµ+1, ..., jd). Therefore, the µ-mode product T ×µ L might be performed by m1···mµ−1 mµ+1···md calls of level 2 BLAS. However, owing to the concept of matricization of a tensor introduced in Definition 2.2, it is possible to perform the same task more efficiently by using a single level 3 BLAS call. Indeed, the µ-mode product of T with L is just the tensor S such that

    S^(µ) = L T^(µ).   (4)

In particular, in the two-dimensional setting, the 1-mode product corresponds to the multiplication LT, while the 2-mode product corresponds to (LT^T)^T = TL^T. In general, we can compute the matrix S^(µ) appearing in formula (4) as L*mumat(T,mu), and in order to recover the tensor S from S^(µ) we need to invert the operations of unfolding and "transposing". This can be done easily with the aid of the matlab functions reshape and ipermute, respectively. All in all, given n = size(L,1), the anonymous function that computes the µ-mode product of an order-d tensor T with L by a single matrix-matrix product can be written as

mump = @(T,L,mu) ipermute(reshape(L*mumat(T,mu),...
       [n,m([1:mu-1,mu+1:d])]),[mu,1:mu-1,mu+1:d]);
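For readers outside matlab, the following NumPy sketch mimics mumat and mump (the names mirror KronPACK's, but the code is an illustrative analogue of ours, with 0-based mu and Fortran-order reshapes to match matlab's column-major layout).

```python
import numpy as np

def mumat(T, mu):
    """Matrix whose columns are the mu-fibers of T (mu is 0-based here)."""
    m = T.shape
    return np.moveaxis(T, mu, 0).reshape(m[mu], -1, order='F')

def mump(T, L, mu):
    """mu-mode product S = T x_mu L as a single matrix product L * T_(mu),
    then folded back into a tensor (formula (4))."""
    m = list(T.shape)
    S = (L @ mumat(T, mu)).reshape([L.shape[0]] + m[:mu] + m[mu+1:], order='F')
    return np.moveaxis(S, 0, mu)

rng = np.random.default_rng(1)
T = rng.standard_normal((3, 4, 5))
L = rng.standard_normal((6, 4))
S = mump(T, L, 1)
assert S.shape == (3, 6, 5)
# Each mu-fiber of S is L times the corresponding mu-fiber of T.
assert np.allclose(S[2, :, 3], L @ T[2, :, 3])
```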
Notice that, from formula (4), the computational cost of the µ-mode product, in terms of floating point operations, is O(nµ m1···md). One of the main applications of the µ-mode product is the so-called Tucker operator, which is implemented in KronPACK in the function tucker.

Definition 2.4. Let Lµ ∈ C^{nµ×mµ}, with µ = 1, ..., d, be matrices. The Tucker operator of T with L1, ..., Ld is the tensor S ∈ C^{n1×···×nd} obtained by concatenating d consecutive µ-mode products with the matrices Lµ, that is

    S = T ×1 L1 ×2 L2 ··· ×d Ld.   (5)

We notice that the single element s_{i1...id} of S in formula (5) turns out to be

    s_{i1...id} = Σ_{j1=1}^{m1} ··· Σ_{jd=1}^{md} ℓ^1_{i1 j1} ··· ℓ^d_{id jd} t_{j1...jd},   (6)

provided that ℓ^µ_{iµ jµ} are the elements of Lµ. Hence, as formula (6) is clearly the generalization of formula (1) to the d-dimensional setting, formula (5) is the sought d-dimensional generalization of formula (2). We also notice that the Tucker operator (5) is invariant with respect to the ordering of the µ-mode products, and that the implicit ordering given by Definition 2.4 is equivalent to performing the sums in formula (6) starting from the innermost.
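The Tucker operator and its invariance with respect to the ordering of the mode products can be checked numerically; a small NumPy sketch (our illustrative analogue, not KronPACK code, using 0-based mode indices):

```python
import numpy as np

def mump(T, L, mu):
    # mu-mode product: contract L's second axis with T's axis mu.
    return np.moveaxis(np.tensordot(L, T, axes=(1, mu)), 0, mu)

def tucker(T, Ls):
    # Tucker operator (5): d consecutive mu-mode products.
    S = T
    for mu, L in enumerate(Ls):
        S = mump(S, L, mu)
    return S

rng = np.random.default_rng(2)
T = rng.standard_normal((3, 4, 5))
Ls = [rng.standard_normal((6, 3)),
      rng.standard_normal((7, 4)),
      rng.standard_normal((2, 5))]

S = tucker(T, Ls)
assert S.shape == (6, 7, 2)

# The ordering of the mode products does not matter.
S2 = mump(mump(mump(T, Ls[2], 2), Ls[0], 0), Ls[1], 1)
assert np.allclose(S, S2)
```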
The Tucker operator is strictly connected with the Kronecker product of matrices applied to a vector.

Lemma 2.1. Let Lµ ∈ C^{nµ×mµ}, with µ = 1, ..., d, be matrices. Then, the elements of S in formula (5) are equivalently given by

    vec(S) = (Ld ⊗ ··· ⊗ L1) vec(T).   (7)

Proof. The µ-mode product satisfies the property

    (T ×1 L1 ×2 L2 ··· ×d Ld)^(µ) = Lµ T^(µ) (Ld ⊗ ··· ⊗ Lµ+1 ⊗ Lµ−1 ⊗ ··· ⊗ L1)^T,

see [6]. Then, with µ = 1 we obtain

    S^(1) = L1 T^(1) (Ld ⊗ ··· ⊗ L2)^T.

By means of the properties of the Kronecker product (see Appendix A) we then have

    vec(S^(1)) = ((Ld ⊗ ··· ⊗ L2) ⊗ L1) vec(T^(1)),

and finally, by definition of the vec operator, vec(S^(1)) = vec(S) and vec(T^(1)) = vec(T), which yields formula (7). □

Notice that formula (7) is precisely the d-dimensional generalization of formula (3). Hence, tasks written as in formula (7) can be equivalently stated and computed more efficiently by formula (5), without assembling the large-sized matrix Ld ⊗ ··· ⊗ L1. We can then summarize as follows: the element-wise formulation (6), the tensor formulation (5) and the vector formulation (7) can all be used to compute the entries of the tensor S. However, in light of the considerations for the µ-mode product, only the tensor formulation can be efficiently computed by d calls of level 3 BLAS, with an overall computational cost of O(n^{d+1}) for the case mµ = nµ = n. This is the reason why the relevant functions of our package KronPACK are based on formulation (5).
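The identity of Lemma 2.1 can be verified numerically; the sketch below (ours, for illustration) compares the Tucker operator against the explicit Kronecker matrix L_3 ⊗ L_2 ⊗ L_1, with vec taken column-major (Fortran order) as in the manuscript.

```python
import numpy as np
from functools import reduce

def mump(T, L, mu):
    return np.moveaxis(np.tensordot(L, T, axes=(1, mu)), 0, mu)

rng = np.random.default_rng(3)
T = rng.standard_normal((2, 3, 4))
Ls = [rng.standard_normal((5, 2)),
      rng.standard_normal((6, 3)),
      rng.standard_normal((2, 4))]

# Tucker operator (5): three successive mode products.
S = T
for mu, L in enumerate(Ls):
    S = mump(S, L, mu)

# Vector formulation (7): large Kronecker matrix applied to vec(T).
K = reduce(np.kron, Ls[::-1])          # L3 kron L2 kron L1
vec = lambda X: X.reshape(-1, order='F')
assert np.allclose(K @ vec(T), vec(S))
```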
Remark 1. The implementation of a single µ-mode product in the function mump of KronPACK involves two explicit permutations of the tensor (except for the 1-mode and the d-mode products, which are realized without explicit permutation, thanks to the design of the function reshape in matlab). On the other hand, the function tucker, which realizes the Tucker operator (5), combines each pair of consecutive permutations into a single one, thus reducing their overall number. This is important when dealing with large-sized tensors, because the cost of permuting is not negligible due to the underlying alteration of the memory layout. For this reason, several algorithms which further reduce or completely avoid permutations in an efficient way have been developed (see, for instance, [7-10]). In this context, for instance, it is possible to use the function pagemtimes to efficiently realize a "Loops-over-GEMMs" strategy. However, as this function has been introduced only recently in MathWorks MATLAB® R2020b and is still not available in the latest stable GNU Octave release 7.1.0, for compatibility reasons we do not follow this approach.
Notice that the definition of the µ-mode product and its realization through the function mump can be easily extended to the case in which, instead of a matrix L, we have a matrix-free operator L. In matlab, if the operator L is represented by the function Lfun, which operates on columns, we can implement the µ-mode action by

mumpfun = @(T,Lfun,mu) ipermute(reshape(Lfun(mumat(T,mu)),...
          [n,m([1:mu-1,mu+1:d])]),[mu,1:mu-1,mu+1:d]);

The corresponding generalization of the Tucker operator, denoted again by

    S = T ×1 L1 ×2 L2 ··· ×d Ld   (8)

and implemented in KronPACK in the function tuckerfun, follows straightforwardly. Clearly, in this case, some properties of the Tucker operator (5), such as the aforementioned invariance with respect to the ordering of the µ-mode product operations, may not hold anymore for generic operators Lµ. Generalization (8) is useful in some instances; see Remark 3 and Section 4.2 for an example. We remark that such an extension is not available in some other popular tensor algebra toolboxes, such as Tensor Toolbox for MATLAB [11] (which lacks GNU Octave support as well) and Tensorlab [12], both of which are more devoted to tensor decomposition and related topics.

The µ-mode product is also useful for computing the action of the Kronecker sum (see Appendix A for its definition) of the Lµ matrices on a vector v, that is

    (Ld ⊕ ··· ⊕ L1) v = vec(V ×1 L1 + ··· + V ×d Ld),   (9)

where v = vec(V). In fact, as can be noticed from formula (4), the identity matrix is the identity element of the µ-mode product. Combining this observation with Lemma 2.1, we easily obtain formula (9). In our package KronPACK, the matrix resulting from the Kronecker sum on the left hand side of equality (9) can be computed as kronsum(L), where L is the cell array containing Lµ in L{mu}. On the other hand, its action on v can be computed equivalently in tensor formulation, without forming the matrix itself, by kronsumv(V,L).
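The Kronecker sum identity (9) can also be checked directly; the following NumPy sketch (an illustrative analogue of kronsumv for d = 2, not KronPACK code) compares the explicit Kronecker sum with the sum of the two mode products.

```python
import numpy as np

def mump(T, L, mu):
    return np.moveaxis(np.tensordot(L, T, axes=(1, mu)), 0, mu)

rng = np.random.default_rng(4)
n1, n2 = 3, 5
A1 = rng.standard_normal((n1, n1))
A2 = rng.standard_normal((n2, n2))
V = rng.standard_normal((n1, n2))
vec = lambda X: X.reshape(-1, order='F')

# Explicit Kronecker sum A2 (+) A1 = A2 kron I1 + I2 kron A1.
kronsum = np.kron(A2, np.eye(n1)) + np.kron(np.eye(n2), A1)

# Tensor formulation (9): no large matrix is ever formed.
W = mump(V, A1, 0) + mump(V, A2, 1)
assert np.allclose(kronsum @ vec(V), vec(W))
```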

Problems formulation in d dimensions
In this section we discuss in more detail the problems that were briefly introduced in Section 1. Their generalization to arbitrary dimension d is addressed thanks to the common framework presented in Section 2.

Pseudospectral decomposition
Suppose that a function f : R → C, with R ⊆ ℝ, can be expanded into the series

    f(x) = Σ_{i=1}^∞ f̂_i φ_i(x),

where the f̂_i are complex scalar coefficients and the φ_i(x) are complex functions orthonormal with respect to the standard L²(R) inner product, i.e.

    ∫_R φ_i(x) φ̄_j(x) dx = δ_{ij},

where the bar denotes complex conjugation. Then, the spectral coefficients f̂_i are defined by

    f̂_i = ∫_R f(x) φ̄_i(x) dx

and can be approximated by a quadrature formula. Usually, in this context, specific Gaussian quadrature formulas are employed, whose nodes and weights vary depending on the chosen family of basis functions. If we consider q quadrature nodes ξ^k and weights w^k, we can compute the first m pseudospectral coefficients by

    f̂_i = Σ_{k=1}^q φ̄_i(ξ^k) f(ξ^k) w^k,   1 ≤ i ≤ m.

By collecting the values φ̄_i(ξ^k) in position (i, k) of the matrix Ψ ∈ C^{m×q} and the values f(ξ^k) w^k in the vector f_w, we can compute the pseudospectral coefficients by means of the single matrix-vector product f̂ = Ψ f_w.
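As a minimal one-dimensional sketch of the product f̂ = Ψ f_w, the code below uses normalized Legendre polynomials on [−1, 1] with Gauss-Legendre quadrature (this particular basis and rule are our illustrative choice, not one from the manuscript) and checks that a function lying in the span of the basis is recovered exactly.

```python
import numpy as np

m, q = 5, 8
# Gauss-Legendre nodes and weights (exact for polynomials up to degree 2q-1).
xi, w = np.polynomial.legendre.leggauss(q)

# Psi[i, k] = phi_i(xi_k), with phi_i = sqrt((2i+1)/2) * P_i orthonormal
# in L^2(-1, 1); the basis is real, so no conjugation is needed.
Psi = np.stack([np.sqrt((2*i + 1) / 2)
                * np.polynomial.legendre.Legendre.basis(i)(xi)
                for i in range(m)])

f = lambda x: np.sqrt(3 / 2) * x        # f = phi_1 exactly
fhat = Psi @ (f(xi) * w)                # pseudospectral coefficients

# The coefficient vector should be the canonical basis vector e_1.
assert np.allclose(fhat, np.eye(m)[1])
```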
In the two-dimensional case, the coefficients of a pseudospectral expansion in a tensor product basis (see, for instance, [13, Ch. 6.10]) are given by

    f̂_{i1 i2} = Σ_{k1=1}^{q1} Σ_{k2=1}^{q2} φ̄^1_{i1}(ξ^{k1}_1) φ̄^2_{i2}(ξ^{k2}_2) f(ξ^{k1}_1, ξ^{k2}_2) w^{k1}_1 w^{k2}_2,

which can be efficiently computed in matrix formulation as

    F̂ = Ψ1 F_W Ψ2^T,

where Ψµ ∈ C^{mµ×qµ} has element φ̄^µ_{iµ}(ξ^{kµ}_µ) in position (iµ, kµ), with µ = 1, 2, and F_W is the matrix with element f(ξ^{k1}_1, ξ^{k2}_2) w^{k1}_1 w^{k2}_2 in position (k1, k2). In general, the coefficients of a d-dimensional pseudospectral expansion in a tensor product basis are given by

    f̂_{i1...id} = Σ_{k1=1}^{q1} ··· Σ_{kd=1}^{qd} φ̄^1_{i1}(ξ^{k1}_1) ··· φ̄^d_{id}(ξ^{kd}_d) f(ξ^{k1}_1, ..., ξ^{kd}_d) w^{k1}_1 ··· w^{kd}_d.

In tensor formulation, the coefficients can be computed as (see formulas (5) and (6))

    F̂ = F_W ×1 Ψ1 ··· ×d Ψd,

where Ψµ is the transform matrix with element φ̄^µ_{iµ}(ξ^{kµ}_µ) in position (iµ, kµ), and we collect in the order-d tensors F̂ and F_W the values f̂_{i1...id} and f(ξ^{k1}_1, ..., ξ^{kd}_d) w^{k1}_1 ··· w^{kd}_d, respectively. The resulting pseudospectral approximation is

    f(x) ≈ Σ_{i1=1}^{m1} ··· Σ_{id=1}^{md} f̂_{i1...id} φ^1_{i1}(x1) ··· φ^d_{id}(xd),   (10)

where x = (x1, ..., xd). An application to Hermite-Laguerre-Fourier function decomposition is given in Section 4.2.

Function approximation
Suppose we are given an approximation of a univariate function

    f(x) ≈ Σ_{i=1}^m c_i φ_i(x),   (11)

where the c_i are scalar coefficients and the φ_i(x) are generic (basis) functions. This is the case, for example, in the context of function interpolation or pseudospectral expansions. We are interested in the evaluation of formula (11) at given points x^ℓ, with 1 ≤ ℓ ≤ n. This can be easily realized in a single matrix-vector product: indeed, if we collect the coefficients c_i in the vector c ∈ C^m and we form the matrix Φ ∈ C^{n×m} with element φ_i(x^ℓ) in position (ℓ, i), the sought evaluation is given by f = Φc, where f ∈ C^n is the vector containing the approximated function values at the given set of evaluation points.
The extension of formula (11) to the tensor product bivariate case is straightforward (see, for instance, [14, Ch. XVII]). Indeed, in this case the approximating function is given by

    f(x1, x2) ≈ Σ_{i1=1}^{m1} Σ_{i2=1}^{m2} c_{i1 i2} φ^1_{i1}(x1) φ^2_{i2}(x2),   (12)

where the c_{i1 i2} represent scalar coefficients and the φ^µ_{iµ}(xµ) the (univariate) basis functions, with 1 ≤ iµ ≤ mµ and µ = 1, 2. Then, given a Cartesian grid of points (x^{ℓ1}_1, x^{ℓ2}_2), with 1 ≤ ℓµ ≤ nµ, the evaluation of approximation (12) can be computed efficiently in matrix formulation as

    F = Φ1 C Φ2^T.

Here we collected the function evaluations f(x^{ℓ1}_1, x^{ℓ2}_2) in the matrix F, we formed the matrices Φµ ∈ C^{nµ×mµ} with element φ^µ_{iµ}(x^{ℓµ}_µ) in position (ℓµ, iµ), and we let C be the matrix with element c_{i1 i2} in position (i1, i2).
In general, the approximation of a d-variate function f with tensor product basis functions is given by

    f(x1, ..., xd) ≈ Σ_{i1=1}^{m1} ··· Σ_{id=1}^{md} c_{i1...id} φ^1_{i1}(x1) ··· φ^d_{id}(xd),   (13)

where the c_{i1...id} represent scalar coefficients and the φ^µ_{iµ}(xµ) the (univariate) basis functions, with 1 ≤ iµ ≤ mµ. Then, given a Cartesian grid of points (x^{ℓ1}_1, ..., x^{ℓd}_d), with 1 ≤ ℓµ ≤ nµ, the evaluation of approximation (13) can be expressed in tensor formulation as

    F = C ×1 Φ1 ··· ×d Φd,   (14)

see formulas (5) and (6). Here we denote by Φµ the matrix with element φ^µ_{iµ}(x^{ℓµ}_µ) in position (ℓµ, iµ), and we collect in the order-d tensors C and F the coefficients and the resulting function approximation at the evaluation points, respectively. We present an application to barycentric multivariate interpolation in Section 4.3.
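The tensor evaluation can be cross-checked against the direct sum; the following NumPy sketch does so in two dimensions with a monomial basis (an illustrative choice on our part).

```python
import numpy as np

def mump(T, L, mu):
    return np.moveaxis(np.tensordot(L, T, axes=(1, mu)), 0, mu)

m1, m2 = 3, 4
rng = np.random.default_rng(5)
C = rng.standard_normal((m1, m2))           # coefficients c_{i1 i2}
x1 = np.linspace(0, 1, 6)                   # evaluation grid, direction 1
x2 = np.linspace(0, 1, 7)                   # evaluation grid, direction 2
Phi1 = np.vander(x1, m1, increasing=True)   # Phi1[l, i] = x1[l]**i
Phi2 = np.vander(x2, m2, increasing=True)

# Tensor formulation: F = C x_1 Phi1 x_2 Phi2.
F = mump(mump(C, Phi1, 0), Phi2, 1)

# Direct evaluation of the double sum at one grid point, for comparison.
direct = sum(C[i, j] * x1[2]**i * x2[4]**j
             for i in range(m1) for j in range(m2))
assert np.isclose(F[2, 4], direct)
```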
Remark 2. Clearly, formula (14) can be employed to evaluate a pseudospectral approximation (10) at a generic Cartesian grid of points, by properly defining the involved tensor C and matrices Φµ. In the context of direct and inverse spectral transforms, for example for the effective numerical solution of differential equations (see [15]), one could be interested in the evaluation of pseudospectral decompositions at the same grid of quadrature points (ξ^{k1}_1, ..., ξ^{kd}_d) used to approximate the spectral coefficients. Under standard hypotheses, this can be done by applying formula (14) with Φµ = Ψ*µ, where the symbol * denotes the conjugate transpose. Without explicitly forming the matrices Φµ, the desired evaluation can be computed using the matrices Ψµ by means of the KronPACK function cttucker.
Remark 3. Several functions are available which, given suitable inputs, perform the whole one-dimensional procedure of approximating a function and evaluating it on a set of points. This is the case, for example, in the interpolation context, of the matlab built-in functions spline, interp1 (which performs different kinds of one-dimensional interpolation), and interpft (which resamples the input values by means of FFT techniques), or, in the approximation context, of the functions provided by the QIBSH++ library [16]. It is possible to extend the usage of this kind of functions to the d-dimensional tensor setting by means of concatenations of µ-mode actions (see Definition 2.5), yielding the generalization of the Tucker operator (8). In practice, we can perform this task with the KronPACK function tuckerfun; see the numerical example in Section 4.2.

Action of the matrix exponential
Suppose we want to solve the linear Partial Differential Equation (PDE)

    ∂_t u(t, x) = A u(t, x),   u(0, x) = u0(x),   (15)

coupled with suitable boundary conditions, where A is a linear time-independent spatial (integer or fractional) differential operator, typically stiff. The application of the method of lines to equation (15), by discretizing first the spatial variable, e.g. by finite differences or spectral differentiation, leads to the system of Ordinary Differential Equations (ODEs)

    u'(t) = Au(t),   u(0) = u0,   (16)

for the unknown vector u(t). Here, A ∈ C^{n×n} is the matrix which approximates the differential operator A on the grid points x^ℓ, with 1 ≤ ℓ ≤ n. The exact solution of system (16) is obviously u(t) = exp(tA)u0 and, if the size of A allows, it can be effectively computed by Padé or Taylor approximations (see [17,18]). If the size of A is too large, then one has to rely on algorithms that approximate the action of the matrix exponential exp(tA) on the vector u0.
Examples of such methods are [19-22]. Suppose now we want to solve instead the two-dimensional equation

    ∂_t u(t, x1, x2) = A u(t, x1, x2),   u(0, x1, x2) = u0(x1, x2),   (17)

coupled again with suitable boundary conditions. If PDE (17) admits a Kronecker sum structure, as is the case for some linear Advection-Diffusion-Absorption (ADA) equations on tensor product domains or linear Schrödinger equations with a potential in Kronecker form (see [15] for more details and examples), then the method of lines yields the system of ODEs

    u'(t) = (I2 ⊗ A1 + A2 ⊗ I1) u(t),   u(0) = u0.   (18)

Here Aµ, with µ = 1, 2, represent the one-dimensional stencil matrices corresponding to the discretization of the one-dimensional differential operators that constitute A on the grid points x^{ℓµ}_µ, with 1 ≤ ℓµ ≤ nµ. Moreover, the notation Iµ stands for the identity matrix of size nµ, and the vector u(t) collects the unknowns at the grid points in an ordering consistent with the linearization of the indexes of the vec operator defined in Appendix A.
Clearly, the solution of system (18) is given by

    u(t) = exp(t(I2 ⊗ A1 + A2 ⊗ I1)) u0,

which again could be computed by any method for the action of the matrix exponential on a vector. Remark that, since the matrices I2 ⊗ A1 and A2 ⊗ I1 commute, by the properties of the Kronecker product (see Appendix A) one could write everything in terms of the exponentials of the small-sized matrices Aµ. Indeed, we have

    exp(t(I2 ⊗ A1 + A2 ⊗ I1)) = exp(tA2) ⊗ exp(tA1).

However, as the matrices exp(tAµ) are in general full, their Kronecker product results in a large and full matrix to be multiplied into u0, which is an extremely inefficient approach. Nevertheless, if we fully exploit the tensor structure of the problem, we can still compute the solution of the system efficiently just in terms of the exponentials exp(tAµ). Indeed, let U(t) be the n1 × n2 matrix whose stacked columns form the vector u(t), that is vec(U(t)) = u(t).
Then, using this matrix notation and the properties of the Kronecker product, problem (18) takes the form

    U'(t) = A1 U(t) + U(t) A2^T,   U(0) = U0,

and it is well-known (see [23]) that its solution can be computed in matrix formulation as

    U(t) = exp(tA1) U0 exp(tA2)^T.   (19)

In general, the d-dimensional version of solution (19) is

    u(t) = exp(t(I ⊗ ··· ⊗ I ⊗ A1 + ··· + Ad ⊗ I ⊗ ··· ⊗ I)) u0,

which can be written in more compact notation as

    u(t) = exp(t(Ad ⊕ ··· ⊕ A1)) u0.   (20)

Here, the Aµ are square matrices of size nµ, and u0 is a vector of length n1···nd. Then, similarly to the two-dimensional case, we have

    exp(t(Ad ⊕ ··· ⊕ A1)) = exp(tAd) ⊗ ··· ⊗ exp(tA1).

Finally, using Lemma 2.1, we obtain

    U(t) = U0 ×1 exp(tA1) ··· ×d exp(tAd),   (21)

where U(t) and U0 are d-dimensional tensors such that u(t) = vec(U(t)) and u0 = vec(U0). Hence, the action of the large-sized matrix exponential appearing in formula (20) can be computed by the Tucker operator (21), which involves only the small-sized matrix exponentials exp(tAµ). For an application in the context of the solution of an ADA linear evolutionary equation with spatially variable coefficients, see Section 4.4.
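This equivalence can be verified numerically; the NumPy/SciPy sketch below (our illustration of formula (19) for d = 2) compares the exponential of the full Kronecker sum against the mode-wise application of the two small exponentials.

```python
import numpy as np
from scipy.linalg import expm

def mump(T, L, mu):
    return np.moveaxis(np.tensordot(L, T, axes=(1, mu)), 0, mu)

rng = np.random.default_rng(6)
n1, n2, t = 4, 5, 0.1
A1 = rng.standard_normal((n1, n1))
A2 = rng.standard_normal((n2, n2))
U0 = rng.standard_normal((n1, n2))
vec = lambda X: X.reshape(-1, order='F')

# Reference: exponential of the full Kronecker sum applied to u0.
Ksum = np.kron(A2, np.eye(n1)) + np.kron(np.eye(n2), A1)
u_ref = expm(t * Ksum) @ vec(U0)

# Tensor formulation: only the small exponentials exp(t*A1), exp(t*A2).
U = mump(mump(U0, expm(t * A1), 0), expm(t * A2), 1)
assert np.allclose(u_ref, vec(U))
```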

Preconditioning of linear systems
Suppose we want to solve the semilinear PDE

    ∂_t u(t, x) = A u(t, x) + f(t, u(t, x)),   u(0, x) = u0(x),   (22)

coupled with suitable boundary conditions, where A is a linear time-independent spatial differential operator and f is a nonlinear function. Using the method of lines, similarly to what led to system (16), we obtain

    u'(t) = Au(t) + f(t, u(t)),   u(0) = u0.   (23)

A common approach to integrating system (23) in time involves the use of IMplicit-EXplicit (IMEX) schemes. For instance, the application of the well-known backward-forward Euler method with constant time step τ leads, at every time step, to the solution of the linear system

    M u^{n+1} = u^n + τ f(t^n, u^n),

where M = (I − τA) ∈ C^{n×n} and I is the identity matrix of suitable size. If the space discretization allows (second order centered finite differences, for example), the system can then be solved by means of the very efficient Thomas algorithm. If, on the other hand, this is not the case, a suitable direct or (preconditioned) iterative method can be employed. Let us now consider the two-dimensional version of the semilinear PDE (22), i.e.

    ∂_t u(t, x1, x2) = A u(t, x1, x2) + f(t, u(t, x1, x2)),   u(0, x1, x2) = u0(x1, x2),   (24)
again with suitable boundary conditions, A a linear time-independent spatial differential operator and f a nonlinear function. As for equation (17), if the PDE admits a Kronecker sum structure, the application of the method of lines leads to

    u'(t) = (I2 ⊗ A1 + A2 ⊗ I1) u(t) + f(t, u(t)),   u(0) = u0,   (25)

which can again be integrated in time by means of the backward-forward Euler method. The matrix of the resulting linear system to be solved at every time step is now

    M = I − τ(I2 ⊗ A1 + A2 ⊗ I1).

If we use an iterative method, we can obtain the action of the matrix M on a vector v = vec(V) as

    Mv = vec(V − τ(V ×1 A1 + V ×2 A2)),

see equivalence (9), without assembling M. Moreover, examples of effective preconditioners for this kind of linear systems are those of Alternating Direction Implicit (ADI) type (see [24]). In this case, we can use the product of the matrices arising from the discretization of equation (24) after neglecting all the spatial variables but one in the operator A. We then obtain the preconditioner

    P = P2 ⊗ P1,   Pµ = Iµ − τAµ,   (26)

which is expected to be effective since P = M + O(τ²). In addition, the action of P⁻¹ on a vector v can be efficiently obtained as

    P⁻¹v = vec(V ×1 P1⁻¹ ×2 P2⁻¹),

by noticing that P⁻¹ = P2⁻¹ ⊗ P1⁻¹.

Remark 4. Another approach to the solution of equation (25) would be to write the equivalent matrix formulation of the problem, i.e.
and then apply appropriate algorithms to integrate it numerically, mainly based on the solution of Sylvester equations.This is the approach pursued, for example, in [25].
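The mode-wise inversion of the ADI-type preconditioner P = (I2 − τA2) ⊗ (I1 − τA1) can be checked as follows; this NumPy sketch (an illustrative analogue, cf. KronPACK's itucker; a production implementation would use factorized solves rather than explicit inverses) compares one large solve with two small ones.

```python
import numpy as np

def mump(T, L, mu):
    return np.moveaxis(np.tensordot(L, T, axes=(1, mu)), 0, mu)

rng = np.random.default_rng(7)
n1, n2, tau = 4, 6, 0.01
A1 = rng.standard_normal((n1, n1))
A2 = rng.standard_normal((n2, n2))
P1 = np.eye(n1) - tau * A1
P2 = np.eye(n2) - tau * A2
V = rng.standard_normal((n1, n2))
vec = lambda X: X.reshape(-1, order='F')

# Reference: solve with the full Kronecker product matrix P = P2 kron P1.
x_ref = np.linalg.solve(np.kron(P2, P1), vec(V))

# Tensor formulation: apply P1^{-1} and P2^{-1} mode-wise.
X = mump(mump(V, np.linalg.inv(P1), 0), np.linalg.inv(P2), 1)
assert np.allclose(x_ref, vec(X))
```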
In general, for a d-dimensional semilinear problem with a Kronecker sum structure, the matrix of the linear system to be solved at every time step is

    M = I − τ(Ad ⊕ ··· ⊕ A1).

Again, the action of the matrix M on a vector v can be computed without assembling the matrix (see equivalence (9)). Finally, an effective preconditioner for the linear system is given by the straightforward generalization of formula (26), i.e.

    P = Pd ⊗ ··· ⊗ P1,   Pµ = Iµ − τAµ.
Similarly to the two-dimensional case, its inverse action on a vector v = vec(V) can be computed efficiently as

    P⁻¹v = vec(V ×1 P1⁻¹ ··· ×d Pd⁻¹),   (27)

see Lemma 2.1. In our package KronPACK, formula (27) can be realized without explicitly inverting the matrices Pµ by using the function itucker. We notice that this is another feature not available in the tensor algebra toolboxes mentioned in Section 2. For an example of application of these techniques, in the context of the solution of evolutionary diffusion-reaction equations, see Section 4.5. We finally notice that there also exist specific techniques to solve linear systems in Kronecker form, usually arising in the discretization of time-independent differential equations; see for instance [26,27].

Numerical experiments
We present in this section some numerical experiments of the proposed µ-mode approach for tensor-structured problems, which make extensive use of the functions contained in our package KronPACK. We remark that, when we employ Cartesian grids of points, they have been produced by the matlab command ndgrid. If one prefers the ordering induced by the meshgrid command (which, however, works only up to dimension three), it is enough to interchange the first and the second matrix in the Tucker operator (5). The resulting tensor is then the (2, 1, 3)-permutation of S in Definition 2.4.
All the numerical experiments have been performed with MathWorks MATLAB ® R2019a on an Intel ® Core ™ i7-8750H CPU with 16 GB of RAM.The degrees of freedom of the problems have been kept at a moderate size, in order to be reproducible with the package KronPACK in a few seconds on a personal laptop.

Code validation
In this section we validate the tucker function of KronPACK by comparing it with the corresponding functions of the toolboxes mentioned in Section 2, i.e. ttm of Tensor Toolbox for MATLAB and tmprod of Tensorlab. We performed several tests on tensors of different orders and sizes, and the three functions always produced the same output (up to round-off unit) in comparable computational times. For simplicity of exposition, we report in Figure 1 just the wall-clock times of the experiments with tensors of order d = 3 and d = 6. For each selected value of d, we take tensor and matrix sizes mµ = nµ = n, µ = 1, ..., d, for different values of n, in such a way that the number of degrees of freedom n^d ranges from Nmin = 12^6 to Nmax = 18^6. The input tensors and matrices have standard normal distributed random values, and the complete code can be found in the script code_validation.m.

Hermite-Laguerre-Fourier function decomposition
We are interested in the approximation, by means of a pseudospectral decomposition, of a trivariate function f. The decays in the first and second directions and the periodicity in the third direction suggest the use of a Hermite-Laguerre-Fourier (HLF) expansion. This mixed transform is useful, for instance, for the solution of differential equations in cylindrical coordinates by spectral methods, see [28]. We then introduce the normalized and scaled Hermite functions H^{β1}_{i1} (orthonormal in L²(ℝ)), where H_{i1} is the (physicist's) Hermite polynomial of degree i1 − 1. We consider the m1 scaled Gauss-Hermite quadrature points {ξ^{k1}_1}_{k1} and define Ψ1 ∈ R^{m1×m1} to be the corresponding transform matrix with element H^{β1}_{i1}(ξ^{k1}_1) in position (i1, k1). The parameter β1 is chosen so that the quadrature points are contained in [−b1, b1] (see [29]); this is possible by estimating the largest quadrature point for the unscaled functions by sqrt(2m1 + 1) (see [30, Ch. 6]) and setting β1 accordingly. Moreover, we consider the normalized and scaled generalized Laguerre functions L^{α,β2}_{i2} (orthonormal in L²(ℝ+)), where L^α_{i2} is the generalized Laguerre polynomial of degree i2 − 1. We define Ψ2 to be the corresponding transform matrix with element L^{α,β2}_{i2}(ξ^{k2}_2) in position (i2, k2), where {ξ^{k2}_2}_{k2} are the m2 scaled generalized Gauss-Laguerre quadrature points. The parameter β2 is chosen similarly to the Hermite case; see [30, Ch. 6] for the asymptotic estimate of the largest quadrature point, which holds for |α| ≥ 1/4 and α > −1. Finally, for the Fourier decomposition, we obviously do not construct the transform matrix, but rely on a Fast Fourier Transform (FFT) implementation provided by the matlab function interpft, which resamples the given input values by means of FFT techniques.

We measure the approximation error, for varying values of nµ, µ = 1, 2, 3, by evaluating the pseudospectral decomposition at a Cartesian grid of points (x^{ℓ1}_1, x^{ℓ2}_2, x^{ℓ3}_3), with 1 ≤ ℓµ ≤ nµ. In order to do that, we construct the matrices Φ1 and Φ2 containing the values of the Hermite and generalized Laguerre functions at the points {x^{ℓ1}_1}_{ℓ1} and {x^{ℓ2}_2}_{ℓ2}, respectively. The approximation of f and its evaluation are realized with the KronPACK function tuckerfun, applied to the three-dimensional array FW containing the values f(ξ^{k1}_1, ξ^{k2}_2, ξ^{k3}_3) w^{k1}_1 w^{k2}_2, where {ξ^{k3}_3}_{k3} are the m3 equispaced Fourier quadrature points in [a3, b3) and {w^{kµ}_µ}_{kµ}, with µ = 1, 2, are the scaled weights of the Gauss-Hermite and generalized Gauss-Laguerre quadrature rules, respectively. The values {ξ^{kµ}_µ}_{kµ} and {w^{kµ}_µ}_{kµ}, for µ = 1, 2, have been computed by the relevant Chebfun functions [31]. The complete example can be found in the script example_spectral.m. Given a prescribed accuracy, we look for the smallest number of basis functions (m1, m2, m3) that achieves it, and we measure the computational time needed to perform the approximation of f and its evaluation with the HLF method. As a term of comparison, we consider the same experiment with a three-dimensional Fourier spectral approximation (FFF method): in fact, for the size of the computational domain and the exponential decays along the first and second directions of the function f under consideration, it appears reasonable to approximate f by a periodic function in Ω and take advantage of the efficiency of a three-dimensional FFT.
The results with α = 4, b1 = 4, b2 = 11, b3 = −a3 = 1, and n1 = n2 = n3 = 301 evaluation points uniformly distributed in Ω are displayed in Figure 2. As we can observe, the total number of degrees of freedom needed by the HLF approach is always smaller than that of the FFF one. In particular, despite the exponential decay along the second direction, the FFF method requires a very large number of Fourier coefficients in that direction in order to reach the most stringent accuracies. In these situations, the HLF method implemented with the µ-mode approach is preferable, in terms of computational time, to the well-established FFT-based implementation of the FFF method.

Multivariate interpolation
Let us consider the approximation of a function f(x) through a five-variate interpolating polynomial in Lagrange form. Here L_{iµ}(x_µ) is the Lagrange polynomial of degree mµ − 1 on a set {ξ_µ^{kµ}}_{kµ} of mµ interpolation points, written in the second barycentric form, with µ = 1, . . ., 5. This is the five-dimensional version of one of the examples presented in [32, Sec. 6]. We evaluate the polynomial at a uniformly spaced Cartesian grid of points (x_1^{ℓ1}, . . ., x_5^{ℓ5}), with 1 ≤ ℓµ ≤ nµ. Then, approximation (28) at the just mentioned grid can be computed as in formula (29), where we collected the function evaluations at the interpolation points in the tensor F, and Lµ contains the element L_{iµ}(x_µ^{ℓµ}) in position (ℓµ, iµ). If we store the matrices Lµ in a cell L, the corresponding matlab command for computing the desired approximation is
P = tucker(F,L);
The results, for a number of evaluation points fixed to nµ = n = 35 and a varying number of interpolation points mµ = m, are reported in Figure 3, and the complete code can be found in the script example_interpolation.m.
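The one-dimensional building block of this construction, the evaluation matrix Lµ in the second barycentric form, can be sketched as follows (a NumPy sketch with a hypothetical function name, not the KronPACK code):

```python
import numpy as np

def barycentric_matrix(xi, x):
    """Return L with L[l, i] = L_i(x[l]): the Lagrange basis on the
    interpolation points xi evaluated at the points x, computed via
    the second barycentric form."""
    xi = np.asarray(xi, dtype=float)
    x = np.asarray(x, dtype=float)
    # Barycentric weights w_i = 1 / prod_{j != i} (xi_i - xi_j).
    D = xi[:, None] - xi[None, :]
    np.fill_diagonal(D, 1.0)
    w = 1.0 / D.prod(axis=1)
    C = x[:, None] - xi[None, :]
    hit = C == 0.0                 # evaluation points coinciding with nodes
    C[hit] = 1.0                   # avoid division by zero; fixed below
    M = w / C
    L = M / M.sum(axis=1, keepdims=True)
    rows = hit.any(axis=1)
    L[rows] = hit[rows].astype(float)   # exact hits: indicator rows
    return L
```

In the five-variate setting, one such matrix per direction would be built and then combined through the Tucker operator, as in the matlab command above.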
As expected, the error decreases according to the theoretical estimate; see [32,33].

Fig. 3 Results for approximation (29) with an increasing number mµ = m of interpolation points. The relative error (blue circles) is computed in maximum norm at the evaluation points. For reference, a dashed line representing the theoretical decay estimate is added.

Linear evolutionary equation
Let us consider the following three-dimensional advection-diffusion-absorption evolutionary equation, written in conservative form, for a concentration u(t, x) (see [34]), where βµ, µ = 1, 2, 3, and α > 0 are advection and diffusion coefficients and γ ≥ 0 is a coefficient governing the decay of u(t, x). After a space discretization by second order centered finite differences on a Cartesian grid, we end up with the system of ODEs (31), where Aµ ∈ R^{nµ×nµ} is the one-dimensional discretization of the corresponding differential operator. If we denote by U0 and U(t) the tensors associated to the vectors u0 and u(t), respectively (i.e., vec(U0) = u0 and vec(U(t)) = u(t)), then the solution can be written in the tensor form (32). We consider equation (30) for x ∈ [0, 2]^3, coupled with homogeneous Dirichlet-Neumann conditions (u(t, x) = 0 at xµ = 0 and ∂xµ u(t, x) = 0 at xµ = 2, µ = 1, 2, 3), and fixed values of the coefficients. Then, if we compute the needed matrix exponentials by the function expm in matlab and define
E{mu} = expm(tstar*A{mu});
the solution U(t*) at the final time t* = 0.5 can be computed as
U = tucker(U0,E);
since the matrix exponential is the exact solution and thus no substepping strategy is needed. The complete example is reported in the script example_exponential.m.
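In NumPy terms, the tensor formulation (32) can be sketched as follows: a minimal translation assuming small one-dimensional matrices Aµ, with scipy.linalg.expm playing the role of matlab's expm (function names are hypothetical).

```python
import numpy as np
from scipy.linalg import expm

def expm_kronsum_apply(As, U0, t):
    """Apply exp(t * (A_d (+) ... (+) A_1)) to vec(U0) in tensor form,
    i.e. compute U = U0 x_1 expm(t*A_1) x_2 ... x_d expm(t*A_d),
    without ever assembling the full Kronecker sum."""
    U = U0
    for mu, A in enumerate(As):
        E = expm(t * A)                     # small n_mu x n_mu exponential
        Um = np.moveaxis(U, mu, 0)          # bring mode mu to the front
        U = np.moveaxis(np.tensordot(E, Um, axes=(1, 0)), 0, mu)
    return U
```

This works because the Kronecker-sum terms commute, so the full matrix exponential factorizes into the Kronecker product of the small one-dimensional exponentials; only d small expm calls and d mode products are needed.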
In Table 2 we show the results with a discretization in space of n = (50, 55, 60) grid points. Since the problem is moderately stiff, we consider for comparison the solution of system (31) by the ode23 matlab function (which implements an explicit adaptive Runge-Kutta method of order 2(3)) and by a standard implementation of the explicit Runge-Kutta method of order four (RK4). For the Runge-Kutta methods, we consider both the tensor and the vector implementations, using the functions kronsumv and kronsum, respectively (see equivalence (9)). The number of uniform time steps for RK4 has been chosen in order to obtain an error comparable to that of the variable time step solver ode23. As we can see, the tensor formulation (32) implemented using the function tucker is much faster than any other considered approach. Indeed, this is due to the fact that formula (32) requires a single time step and calls a level 3 BLAS routine only three times. For other experiments involving the approximation of the action of the matrix exponential in tensor-structured problems, we invite the reader to check [15].

Table 2 Summary of the results for solving the ODEs system (31) with the three described approaches. We report the number of time steps, the wall-clock times in seconds for both the tensor and the vector formulations (when feasible) and the relative error in infinity norm of the final solution with respect to the solution given by the tucker approach.

Semilinear evolutionary equation
We consider the following three-dimensional semilinear evolutionary equation for x ∈ [0, 1]^3, where the function Φ(t, x) is chosen so that the exact solution is u(t, x) = e^t u0(x). We complete the equation with homogeneous Dirichlet boundary conditions in all the directions. This is the three-dimensional generalization of the example presented in [35].
We discretize the problem in space by means of second order centered finite differences on a Cartesian grid, with nµ grid points for the spatial variable xµ, µ = 1, 2, 3. Then, the application of the backward-forward Euler method leads to the marching scheme (34), where u^k ≈ u(t_k, x), τ is the time step size and t_k is the current time. The matrix of the linear system (34) is M = I − τ(A3 ⊕ A2 ⊕ A1), where Aµ is the discretization of the partial differential operator ∂²xµ and Iµ is the identity matrix of size nµ. One could solve the linear system (34) using a direct method, in particular by computing the Cholesky factors of the matrix M once and for all (if the step size τ is constant). Another approach is to use the Conjugate Gradient (CG) method for the single marching step (34). In matlab, the latter can be performed by the function pcg, where the system matrix M is either assembled using kronsum (vector approach) or applied matrix-free through a function Mfun implemented by means of kronsumv (tensor approach). As described in Section 3.4, an effective preconditioner for system (34) is the ADI-type one P3 ⊗ P2 ⊗ P1, with Pµ = Iµ − τAµ.
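The matrix-free action of the Kronecker sum, i.e. the role played here by kronsumv, can be sketched as follows (a NumPy sketch with a hypothetical name; the actual KronPACK mode ordering may differ):

```python
import numpy as np

def kronsum_apply(As, U):
    """Matrix-free action of the Kronecker sum A_d (+) ... (+) A_1
    on the tensor U: the sum over mu of the mu-mode products U x_mu A_mu,
    never forming the large Kronecker-sum matrix."""
    S = np.zeros_like(U)
    for mu, A in enumerate(As):
        Um = np.moveaxis(U, mu, 0)              # mode mu to the front
        S += np.moveaxis(np.tensordot(A, Um, axes=(1, 0)), 0, mu)
    return S
```

Each term costs one small matrix-tensor product, so the memory footprint stays at the size of U instead of the size of the assembled system matrix.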
The action of the inverse of this preconditioner on a vector v can be easily performed in tensor formulation, see formula (27), and the resulting Preconditioned Conjugate Gradient (PCG) method is
pcg(Mfun,uk+tau*f(tk,uk),tol,maxit,Pfun,[],uk);
where Pfun is implemented through the KronPACK function itucker. The complete example is reported in the file example_imex.m.
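The mode-wise application of the inverse preconditioner, i.e. the role played here by itucker, can be sketched as follows (a NumPy sketch with a hypothetical name, solving one small system per mode rather than one large Kronecker-product system):

```python
import numpy as np

def adi_precond_solve(As, tau, V):
    """Apply the inverse of the ADI-type preconditioner
    P = (I - tau*A_d) (x) ... (x) (I - tau*A_1) to the tensor V,
    by solving a sequence of small one-dimensional systems along
    each mode of V."""
    for mu, A in enumerate(As):
        P = np.eye(A.shape[0]) - tau * A        # small n_mu x n_mu factor
        Vm = np.moveaxis(V, mu, 0)              # mode mu to the front
        Vm = np.linalg.solve(P, Vm.reshape(P.shape[0], -1)).reshape(Vm.shape)
        V = np.moveaxis(Vm, 0, mu)
    return V
```

In practice one would factorize each Pµ once per (constant) step size τ; the sketch refactorizes at every call for brevity.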
In Table 3 we report the results obtained for a space discretization of n = (40, 44, 48) grid points. The time step size of the marching scheme (34) is τ = 0.01 and the final time of integration is t* = 1. For all the methods, the final relative error with respect to the exact solution, measured in infinity norm, is 9.7 · 10^−3. As clearly shown, the ADI-type preconditioner is highly effective in reducing the number of iterations of the CG method. Moreover, the resulting method is the fastest among all the considered approaches.

Method       Avg. iterations   Elapsed time per time step
Direct       -                 6.7
CG vector    30                3.3
CG tensor    30                2.2
PCG tensor   2                 0.5

Table 3 Summary of the results for solving the semilinear equation (33) by the method of lines and the backward-forward Euler method. The elapsed time is the wall-clock time measured in seconds.

Conclusions
In this work, we presented how d-dimensional tensor-structured problems can be stated by means of compositions of one-dimensional rules, in such a way that the resulting µ-mode BLAS formulation can be efficiently implemented on modern computer hardware. The common thread consists in the suitable employment of tensor-product operations, with special emphasis on the Tucker operator and its variants. After validating our package KronPACK against other commonly used tensor operation toolboxes, the effectiveness of the µ-mode approach compared to other well-established techniques is shown on several examples from different fields of numerical analysis. More in detail, we employed this approach for a pseudospectral Hermite-Laguerre-Fourier trivariate function decomposition, for the barycentric Lagrange interpolation of a five-variate function, and for the numerical solution of three-dimensional stiff linear and semilinear evolutionary differential equations by means of exponential techniques and a (preconditioned) IMEX method, respectively.

Definition 2.5.
Let L: C^{mµ} → C^n be an operator. Then the µ-mode action of T with L, still denoted S = T ×µ L, is the tensor S ∈ C^{m1×···×mµ−1×n×mµ+1×···×m_d} obtained by the action of the operator L on the µ-fibers of T.
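In NumPy terms, the µ-mode action can be sketched as follows: a minimal illustration (with a hypothetical name and 0-based mode index) in which a generic operator acts fiber-by-fiber.

```python
import numpy as np

def mu_mode_action(T, op, mu):
    """mu-mode action of the operator op on T: apply op to every
    mu-fiber of T, where op maps vectors of length m_mu to length n."""
    Tm = np.moveaxis(T, mu, -1)          # fibers of mode mu along the last axis
    S = np.apply_along_axis(op, -1, Tm)  # act on each fiber independently
    return np.moveaxis(S, -1, mu)
```

When op is v -> L @ v for a matrix L, this reduces to the µ-mode product; the point of the definition is that op may equally be, e.g., an FFT-based transform for which no matrix is ever formed.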

Fig. 1 Wall-clock times for different realizations of the Tucker operator (5) with the functions ttm, tmprod, and tucker. The left plot refers to the case d = 3, the right plot to the case d = 6. Each test has been repeated several times in order to avoid fluctuations.

Fig. 2
Fig.2Achieved accuracies versus wall-clock times (in seconds) for the Hermite-Laguerre-Fourier (HLF) and the Fourier-Fourier-Fourier (FFF) approaches.The label of the marks in the plot indicates the number of basis functions used in each direction.