# Learning with tensors: a framework based on convex optimization and spectral regularization

## Abstract

We present a framework based on convex optimization and spectral regularization to perform learning when feature observations are multidimensional arrays (tensors). We give a mathematical characterization of spectral penalties for tensors and analyze a unifying class of convex optimization problems for which we present a provably convergent and scalable template algorithm. We then specialize this class of problems to perform learning both in a transductive and in an inductive setting. In the transductive case one has an input data tensor with missing features and, possibly, a partially observed matrix of labels. The goal is both to infer the missing input features and to predict the missing labels. For induction, the goal is to determine a model for each learning task to be used for out-of-sample prediction. Each training pair consists of a multidimensional array and a set of labels, each of which corresponds to related but distinct tasks. In either case the proposed technique exploits precise low multilinear rank assumptions over unknown multidimensional arrays; regularization is based on composite spectral penalties and connects to the concept of Multilinear Singular Value Decomposition. As a by-product of using a tensor-based formalism, our approach allows one to tackle the multi-task case in a natural way. Empirical studies demonstrate the merits of the proposed methods.

## Keywords

Spectral regularization · Matrix and tensor completion · Tucker decomposition · Multilinear rank · Transductive and inductive learning · Multi-task learning

## 1 Introduction

Tensors are the higher order generalization of vectors and matrices. They find applications whenever the data of interest have intrinsically many dimensions. This is the case for an increasing number of areas such as econometrics, chemometrics, psychometrics, (biomedical) signal processing and image processing. Regardless of the specific domain, a common task in the data analysis workflow amounts to finding some low dimensional representation of the process under study. Existing tensor-based techniques (Kolda and Bader 2009; Smilde et al. 2004; Coppi and Bolasco 1989; Kroonenberg 2008) mostly consist of decompositions that give a concise representation of the underlying structure of data; this is useful for exploratory data analysis since it often reveals representative low-dimensional subspaces (for Tucker-type decompositions) or sums of rank-1 factors (for the Canonical Polyadic Decomposition (CPD) and related techniques). In this work we take a broader perspective and consider a wider set of learning tasks. Our main goal is to extend spectral regularization (Abernethy et al. 2009; Tomioka and Aihara 2007; Argyriou et al. 2007b, 2010; Srebro 2004) to the case where data have intrinsically many dimensions and are therefore represented as higher order tensors.

### 1.1 Related literature

So far spectral regularization has been advocated mainly for matrices (Tomioka and Aihara 2007; Argyriou et al. 2010, 2008, 2007b; Abernethy et al. 2009). In the important low-rank matrix recovery problem, using a convex relaxation technique proved to be a valuable methodology (Cai et al. 2010; Candès and Recht 2009; Candès and Plan 2010). Recently this approach has been extended and tensor completion has been formulated (Liu et al. 2009; Signoretto et al. 2011b). Gandy et al. (2011) considered tensor completion and low multilinear rank tensor pursuit. Whereas the former assumes knowledge of some entries, the latter assumes knowledge of measurements obtained by sensing the unknown tensor via a known linear transformation (with the sampling operator being a special case). They provide algorithms for solving constrained as well as penalized versions of this problem. They also discuss formulations suitable for dealing with noisy measurements, in which a quadratic loss is employed to penalize deviation from the observed data.

### 1.2 Contributions

We present a framework based on convex optimization and spectral regularization to perform learning when data observations are represented as tensors. This includes in particular the cases where observations are vectors or matrices. In addition, it allows one to deal appropriately with data that have a natural representation as higher order arrays. We begin by presenting a unifying class of convex optimization problems for which we present a scalable template algorithm based on an operator splitting technique (Lions and Mercier 1979). We then specialize this class of problems to perform single as well as multi-task learning both in a transductive as well as in an inductive setting. To this end we develop tools extending to higher order tensors the concept of spectral regularization for matrices (Argyriou et al. 2007a). We consider smooth penalties (including the quadratic loss as a special case) and exploit a low multilinear rank assumption over one or more tensor unknowns through spectral regularizers. We show how this connects to the concept of Tucker decomposition (Tucker 1964, 1966) (a particular instance of which is also known as Multilinear Singular Value decomposition (De Lathauwer et al. 2000)). Additionally, as a by-product of using a tensor-based formalism, our framework allows one to tackle the multi-task case (Argyriou et al. 2008) in a natural way. In this way one exploits interdependence both at the level of the data representations as well as across tasks.

Our main contribution is twofold. A first contribution is to apply the framework to supervised transductive and inductive learning problems where the input data can be expressed as tensors. Important special cases of the framework include extensions of multitask learning with higher order observation data. A second main contribution lies within the Inexact Splitting Method that we propose as the template algorithm; we study an adaptive stopping criterion for the solution of a key sub-problem and give guarantees about the convergence of the overall algorithm.

### 1.3 Outline

In the next section we introduce preliminaries and present our notation. In Sect. 3 we discuss the general problem setting that we are concerned with. We present in Sect. 4 a template algorithm to solve this general class of problems and show its convergence. In Sect. 5 we extend to the tensor setting the existing definition of spectral penalty and develop the analytical tools we need. Section 6 deals with tensor-based transductive learning. Inductive learning is discussed in Sect. 7. We demonstrate the proposed methodologies in Sect. 8 and end the paper with Sect. 9 by drawing our concluding remarks.

## 2 Notation and preliminaries

We denote both scalars and vectors as lower case letters (\(a, b, c, \ldots\)) and matrices as bold-face capitals (\(\boldsymbol{A}, \boldsymbol{B}, \boldsymbol{C}, \ldots\)). We write \(1_{N}\) to denote \([1,1,\ldots,1]^{\top}\in\mathbb{R}^{N}\) and \(\boldsymbol{I}_{N}\) to indicate the \(N\times N\) identity matrix. We also use subscript lower-case letters \(i, j\) in the meaning of indices and we will use \(I, J\) to denote the index upper bounds. Additionally we write \(\mathbb{N}_{I}\) to denote the set \(\{1,\ldots,I\}\). We recall that \(N\)-th order tensors, which we denote by calligraphic letters (\(\mathcal{A}\), \(\mathcal{B}\), \(\mathcal{C}\), …), are higher order generalizations of vectors (first order tensors) and matrices (second order tensors). More generally, the order \(N\) of a tensor is the number of dimensions, also known as ways or modes. We write \(a_{i_{1},\ldots,i_{N}}\) to denote the entry \((\mathcal{A})_{i_{1},\ldots,i_{N}}\). Likewise we write \(a_{i}\) to mean \((a)_{i}\) and \(a_{ij}\) to mean \((\boldsymbol{A})_{ij}\).

Next we present basic facts about tensors and introduce the mathematical machinery that we need to proceed further. The level of abstraction that we consider allows one to deal in a unified fashion with different problems and provides a useful toolset for very practical purposes. For instance, a proper characterization of operators and corresponding adjoints allows one to use the chain rule for subdifferentials (see, e.g., Ekeland and Temam 1976) that we extensively use in Sect. 5. Note that this is very useful also from an implementation view point. In fact, it is being used for the automatic derivation of differentials and sub-differentials of composite functions in modern optimization toolboxes (such as Becker et al. 2010).

### 2.1 Basic facts about tensors

An \(N\)-th order tensor \(\mathcal{A}\) is rank-1 if it consists of the outer product of \(N\) nonzero vectors \(u^{(1)}\in\mathbb {R}^{I_{1}},~u^{(2)}\in\mathbb{R}^{I_{2}},\ldots,~u^{(N)}\in\mathbb {R}^{I_{N}}\), that is, if \(a_{i_{1}i_{2}\ldots i_{N}} =u^{(1)}_{i_{1}}u^{(2)}_{i_{2}}\cdots u^{(N)}_{i_{N}}\) for all values of the indices. In this case we write \(\mathcal{A}=u^{(1)}\otimes u^{(2)}\otimes\cdots\otimes u^{(N)}\). The linear span of such elements forms a vector space, which once endowed with the inner product

\[
\langle\mathcal{A},\mathcal{B}\rangle:=\sum_{i_{1}}\sum_{i_{2}}\cdots\sum_{i_{N}}a_{i_{1}i_{2}\ldots i_{N}}\,b_{i_{1}i_{2}\ldots i_{N}}
\]

is denoted by^{1} \(\mathbb {R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\). The corresponding Hilbert-Frobenius norm is \(\Vert\mathcal{A}\Vert:= \sqrt{\langle\mathcal{A},\mathcal{A}\rangle}\). We use \(\langle\cdot,\cdot\rangle\) and \(\Vert\cdot\Vert\) for any \(N\geq1\), regardless of the specific tuple \((I_{1},I_{2},\ldots,I_{N})\). An *n-mode vector* of \(\mathcal{A}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\) is an element of \(\mathbb{R}^{I_{n}}\) obtained from \(\mathcal{A}\) by varying the index \(i_{n}\) and keeping the other indices fixed. The \(n\)-rank of \(\mathcal{A}\), indicated by \(\operatorname{rank}_{n}(\mathcal{A})\), is the dimension of the space spanned by the \(n\)-mode vectors. A tensor for which \(r_{n}=\operatorname{rank}_{n}(\mathcal{A})\) for \(n\in\mathbb{N}_{N}\) is called a rank-\((r_{1},r_{2},\ldots,r_{N})\) tensor; the \(N\)-tuple \((r_{1},r_{2},\ldots,r_{N})\) is called the *multilinear rank* of \(\mathcal{A}\). For the higher order case an alternative notion of rank exists. This is:

\[
\operatorname{rank}(\mathcal{A}):=\min\Bigl\{R\in\mathbb{N}~:~\mathcal{A}=\sum_{r\in\mathbb{N}_{R}}u^{(1)}_{r}\otimes u^{(2)}_{r}\otimes\cdots\otimes u^{(N)}_{r}\Bigr\}. \tag{2}
\]

For second order tensors \(\operatorname{rank}_{1}(\mathcal{A})=\operatorname{rank}_{2}(\mathcal{A})=\operatorname{rank}(\mathcal{A})\), whereas in general one can only establish that \(\operatorname{rank}_{n}(\mathcal{A})\leq\operatorname{rank}(\mathcal{A})\) for any \(n\in\mathbb{N}_{N}\). Additionally, the \(n\)-ranks differ from each other in the general \(N\)-th order case.

Let \(J=\prod_{j\in\mathbb{N}_{N}\setminus\{n\}}I_{j}\). The *n-mode unfolding* (also called matricization or flattening) of \(\mathcal{A}\) is the matrix \(\mathcal {A}_{\langle n\rangle}\in\mathbb{R}^{I_{n}\times J}\) whose columns are the \(n\)-mode vectors. The ordering according to which the vectors are arranged to form \(\mathcal{A}_{\langle n\rangle}\) will not matter for our purposes; what matters is that one sticks to a chosen ordering rule.

### Remark 1

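Assume a second order tensor \(\boldsymbol{A}\in\mathbb{R}^{I_{1}\times I_{2}}\). If the columns of the 2-mode unfolding are arranged according to the natural ordering of the first index, then \(\boldsymbol{A}_{\langle 2\rangle}=\boldsymbol{A}^{\top}\), where \(\cdot^{\top}\) denotes matrix transposition.^{2}

The \(n\)-mode unfolding as introduced above defines the linear operator \(\cdot_{\langle n\rangle}:\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\rightarrow\mathbb{R}^{I_{n}\times J}\). The *refolding* or tensorization, denoted as \(\cdot^{\langle n\rangle}\), is defined as its adjoint \(\cdot^{\langle n\rangle}:\mathbb {R}^{I_{n}\times J}\rightarrow\mathbb{R}^{I_{1}\times I_{2}\times\cdots \times I_{N}}\) satisfying

\[
\langle\mathcal{A},\boldsymbol{B}^{\langle n\rangle}\rangle=\langle\mathcal{A}_{\langle n\rangle},\boldsymbol{B}\rangle .
\]

Finally, the *n-mode product* of a tensor \(\mathcal {A}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\) by a matrix \(\boldsymbol{U}\in\mathbb{R}^{J_{n}\times I_{n}}\), denoted by \(\mathcal{A}\times_{n} \boldsymbol{U}\), is defined by

\[
(\mathcal{A}\times_{n}\boldsymbol{U})_{i_{1}\ldots i_{n-1}j_{n}i_{n+1}\ldots i_{N}}:=\sum_{i_{n}\in\mathbb{N}_{I_{n}}}a_{i_{1}\ldots i_{n}\ldots i_{N}}\,u_{j_{n}i_{n}},
\]

so that \(\mathcal{A}\times_{n}\boldsymbol{U}\in\mathbb{R}^{I_{1}\times\cdots\times I_{n-1}\times J_{n}\times I_{n+1}\times\cdots\times I_{N}}\) and \((\mathcal{A}\times_{n}\boldsymbol{U})_{\langle n\rangle}=\boldsymbol{U}\mathcal{A}_{\langle n\rangle}\).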
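The operations of this subsection can be tried out numerically. The following is a minimal NumPy sketch, not the authors' implementation: the helper names `unfold`, `refold`, `mode_product` and `multilinear_rank` are hypothetical, modes are counted from 0 as is customary in Python, and the column ordering of the unfolding is the one induced by `reshape` (any fixed ordering would do).

```python
import numpy as np

def unfold(tensor, n):
    """n-mode unfolding: an I_n x J matrix whose columns are the n-mode vectors."""
    return np.moveaxis(tensor, n, 0).reshape(tensor.shape[n], -1)

def refold(matrix, n, shape):
    """Refolding: rebuild a tensor of the given shape from its n-mode unfolding."""
    rest = [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(matrix.reshape([shape[n]] + rest), 0, n)

def mode_product(tensor, U, n):
    """n-mode product: every n-mode vector is multiplied by the matrix U."""
    shape = list(tensor.shape)
    shape[n] = U.shape[0]
    return refold(U @ unfold(tensor, n), n, shape)

def multilinear_rank(tensor):
    """Tuple of the n-ranks, i.e. the ranks of the n-mode unfoldings."""
    return tuple(int(np.linalg.matrix_rank(unfold(tensor, n)))
                 for n in range(tensor.ndim))

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 4))
B = rng.standard_normal((3, 8))      # same size as the unfolding of A along axis 1
U = rng.standard_normal((5, 3))
rank_one = np.einsum("i,j,k->ijk", rng.standard_normal(2),
                     rng.standard_normal(3), rng.standard_normal(4))

# Adjoint relation between unfolding and refolding: both inner products agree.
lhs = np.sum(unfold(A, 1) * B)
rhs = np.sum(A * refold(B, 1, A.shape))
```

Because the unfolding only rearranges entries, its inverse coincides with its adjoint, which is why a single `refold` helper serves both purposes here.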

### 2.2 Sampling operator and its adjoint

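Assume a set \(\varOmega=\{s^{p}:=(i^{p}_{1},i^{p}_{2},\ldots,i^{p}_{N})\in\mathbb{N}_{I_{1}}\times\mathbb{N}_{I_{2}}\times\cdots\times\mathbb{N}_{I_{N}}~:~p\in\mathbb{N}_{P}\}\) of multi-indices identifying \(P\) entries of the \(N\)-th order tensor \(\mathcal{A}\). In the following we denote by \(S_{\varOmega}:\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\rightarrow\mathbb{R}^{P}\) the *sampling operator* defined by

\[
(S_{\varOmega}\mathcal{A})_{p}:=a_{s^{p}}\quad\text{for any }p\in\mathbb{N}_{P}.
\]

Note that \(S_{\varOmega}\) is linear and it can be equivalently restated as \((S_{\varOmega}\mathcal{A})_{p}=\langle\mathcal{A},\mathcal{E}_{s^{p}}\rangle\), where \(\mathcal{E}_{s^{p}}\) is that element of the canonical basis of \(\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\) defined as

\[
(\mathcal{E}_{s^{p}})_{i_{1}i_{2}\ldots i_{N}}:=
\begin{cases}
1, & \text{if } (i_{1},i_{2},\ldots,i_{N})=s^{p}\\
0, & \text{otherwise.}
\end{cases}
\]

The adjoint of \(S_{\varOmega}\) is the operator \(S_{\varOmega}^{*}:b\mapsto\sum_{p\in\mathbb{N}_{P}}b_{p}\mathcal{E}_{s^{p}}\). Since \(S_{\varOmega}S_{\varOmega}^{*}b=b\) for any \(b\in\mathbb{R}^{P}\), \(S_{\varOmega}\) is a *co-isometry* in the sense of Halmos (1982, Sect. 127, page 69).

### Remark 2

From this fact it follows that any solution of \(S_{\varOmega}\mathcal{A}=b\) can be written as \(\mathcal{A}=S_{\varOmega}^{*}b+\mathcal{Z}\) where \(\mathcal{Z}\in\ker(S_{\varOmega})\).

### Remark 3

The sampling operator is related to the notion of *reproducing kernel Hilbert space* of functions \(f:\mathcal{X}\rightarrow\mathbb{R}\) (Aronszajn 1950), which is based on a set of *evaluation functionals* of the type

\[
E_{x}:f\mapsto f(x),\quad x\in\mathcal{X}. \tag{5}
\]

In fact, an \(N\)-th order array \(\mathcal{A}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\) can be regarded as a function from \(\mathbb{N}_{I_{1}}\times\mathbb{N}_{I_{2}}\times\cdots\times \mathbb{N}_{I_{N}}\) to \(\mathbb{R}\). Correspondingly, \(S_{\varOmega}\) can be restated in terms of evaluation functionals of the same type as (5), namely \(E_{s^{p}}:\mathcal{A}\mapsto a_{s^{p}}\) for \(p\in\mathbb{N}_{P}\).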
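To make the sampling operator and its adjoint concrete, here is a small NumPy sketch under stated assumptions: the function names `sample` and `sample_adjoint` are hypothetical, indices are 0-based, and the sampled positions are assumed to be distinct (otherwise the co-isometry property fails).

```python
import numpy as np

def sample(tensor, idx):
    """Sampling operator: collect the entries at the P multi-indices in idx (P x N)."""
    return tensor[tuple(idx.T)]

def sample_adjoint(values, idx, shape):
    """Adjoint: place the P values at their positions in an all-zero tensor."""
    out = np.zeros(shape)
    out[tuple(idx.T)] = values
    return out

shape = (3, 4, 2)
idx = np.array([[0, 1, 0], [2, 3, 1], [1, 0, 0], [2, 2, 1]])  # P = 4 distinct entries
rng = np.random.default_rng(1)
A = rng.standard_normal(shape)
b = rng.standard_normal(len(idx))

# Co-isometry: sampling after applying the adjoint returns the original vector.
round_trip = sample(sample_adjoint(b, idx, shape), idx)

# Any solution of "sample(X) = b" is the adjoint of b plus a tensor that is
# zero on the sampled positions (here built from the random tensor A).
Z = A - sample_adjoint(sample(A, idx), idx, shape)
solution = sample_adjoint(b, idx, shape) + Z
```

The last two lines mirror the decomposition of solutions into the adjoint image of the data plus an element of the null space of the sampling operator.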

### 2.3 Abstract vector spaces

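We consider abstract vector spaces obtained as the Cartesian product of \(Q\) module spaces of tensors of different orders:

\[
\bigl(\mathbb{R}^{I^{1}_{1}\times I^{1}_{2}\times\cdots\times I^{1}_{N_{1}}}\bigr)\times\bigl(\mathbb{R}^{I^{2}_{1}\times I^{2}_{2}\times\cdots\times I^{2}_{N_{2}}}\bigr)\times\cdots\times\bigl(\mathbb{R}^{I^{Q}_{1}\times I^{Q}_{2}\times\cdots\times I^{Q}_{N_{Q}}}\bigr), \tag{6}
\]

endowed with the *canonical inner product* formed upon the uniform sum of the module spaces' inner products:

\[
\bigl\langle(\mathcal{W}_{1},\mathcal{W}_{2},\ldots,\mathcal{W}_{Q}),(\mathcal{V}_{1},\mathcal{V}_{2},\ldots,\mathcal{V}_{Q})\bigr\rangle:=\sum_{q\in\mathbb{N}_{Q}}\langle\mathcal{W}_{q},\mathcal{V}_{q}\rangle. \tag{7}
\]

Note that we denoted the \(q\)-th component using the notation reserved for higher order tensors. When \(N_{q}=2\) (second order case) we stick with the notation for matrices introduced above and finally we denote it as a vector if \(N_{q}=1\). We denote \((\mathcal{W}_{1},\mathcal {W}_{2},\ldots,\mathcal{W}_{Q})\) by \(\mathcal{W}\). The norm associated to (7) is \(\Vert\mathcal{W}\Vert=\sqrt{\langle\mathcal{W},\mathcal{W}\rangle}\).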

### Remark 4

As an example, assume the abstract vector space is formed upon the module spaces \(\mathbb{R}^{2\times3\times3}\), \(\mathbb{R}^{4\times4}\) and \(\mathbb {R}^{5}\). A generic element of this space will then be denoted by \((\mathcal{A}, \boldsymbol{B},c)\), where we use different letters to emphasize the different role played by the corresponding elements.

Alternatively we will denote elements of an abstract vector space, i.e., abstract vectors, by lower case letters (*w*,*v*,…) like ordinary vectors, i.e., elements of \(\mathbb{R}^{P}\). We use this convention whenever we do not want to specify the structure of the space. We note that this is consistent with the fact that abstract vectors can always be considered as “long vectors”, which avoids involved notation. Additionally we denote by capital letters (*A*,*B*,*C*,…) general operators between abstract vector spaces and use lower case letters (*f*,*g*,…) to denote functionals on an abstract vector space, namely operators that take values in \(\mathbb{R}\). Next we introduce the general family of optimization problems of interest.
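As an illustration of the "long vector" viewpoint, the sketch below represents an abstract vector as a tuple of NumPy arrays, using the module spaces of Remark 4. The helper names `inner`, `norm` and `to_long_vector` are hypothetical and only serve this example.

```python
import numpy as np

def inner(w, v):
    """Canonical inner product: sum of the module-space inner products."""
    return sum(float(np.sum(wq * vq)) for wq, vq in zip(w, v))

def norm(w):
    """Norm induced by the canonical inner product."""
    return inner(w, w) ** 0.5

def to_long_vector(w):
    """Stack all components into one ordinary vector."""
    return np.concatenate([wq.ravel() for wq in w])

rng = np.random.default_rng(2)
shapes = [(2, 3, 3), (4, 4), (5,)]   # a third order tensor, a matrix and a vector
w = tuple(rng.standard_normal(s) for s in shapes)
v = tuple(rng.standard_normal(s) for s in shapes)

long_w, long_v = to_long_vector(w), to_long_vector(v)
```

The canonical inner product of `w` and `v` coincides with the ordinary dot product of `long_w` and `long_v`, which is what makes the tuple and the long vector interchangeable in an implementation.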

## 3 General problem setting

### 3.1 Main optimization problem

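The learning tasks that we consider can be tackled via special instances of the following convex optimization problem on an abstract vector space:

\[
\begin{array}{ll}
\text{minimize} & \bar{f}(w)+\bar{g}(w)\\
\text{subject to} & w\in\bar{\mathcal{C}}
\end{array} \tag{8}
\]

where \(\bar{f}\) is a convex and differentiable functional with \(L_{\bar{f}}\)-Lipschitz continuous gradient,

\[
\Vert\nabla\bar{f}(w)-\nabla\bar{f}(v)\Vert\leq L_{\bar{f}}\Vert w-v\Vert\quad\text{for any } w, v, \tag{9}
\]

\(\bar{g}\) is a convex but possibly non-differentiable functional and \(\bar{\mathcal{C}}\) is a non-empty, closed and convex set. We will give the abstract vector \(w\) a specific structure, which will depend on the specific instance of the learning task of interest.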

### 3.2 Some illustrative examples

Problem (8) is very general and covers a wide range of machine learning formulations where one faces single as well as *composite penalties*, i.e., functions corresponding to the sum of multiple atomic (stand-alone) penalties. To show this and illustrate the formalism introduced above, we begin with the simplest problems that can be cast as (8). Subsequently, we will move on to the cases of interest, namely tensor-based problems. In the simplest cases, such as in *Ridge Regression* (Hoerl and Kennard 1970), \(\bar{f}\) can be set equal to the error functional of interest *f*. In other cases, such as those that we will deal with in the remainder of the paper, it is more convenient to duplicate optimization variables; in these cases \(\bar{f}\) is related to *f* in a way that we will clarify later.

In the optimization literature the idea of solving optimization problems by duplicating variables has roots in the 1950s and was developed in the 1970s, mainly in connection to control problems, see, e.g., Bertsekas and Tsitsiklis (1989). This general approach underlies the *alternating direction method of multipliers* and the related Douglas-Rachford technique that we discuss later in more detail. As we will see, duplicating variables allows one to decompose the original problem into simpler sub-problems that can be solved efficiently and can be distributed across multiple processing units.

#### 3.2.1 Ridge regression

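For Ridge Regression with an unpenalized bias term the abstract vector space is \(\mathbb{R}^{D}\times\mathbb{R}\), with elements \((w,b)\), and one can take

\[
\bar{f}(w,b)=f(w,b):=\frac{1}{2}\bigl\Vert y-\boldsymbol{X}^{\top}w-b1_{N}\bigr\Vert^{2},\qquad \bar{g}(w,b):=\frac{\lambda}{2}\Vert w\Vert^{2},
\]

where \(\lambda>0\) is a user-defined parameter. The problem of interest, namely

\[
\min_{(w,b)\in\mathbb{R}^{D}\times\mathbb{R}}\ \bar{f}(w,b)+\bar{g}(w,b),
\]

is then an unconstrained instance of (8) for a given *design matrix* \(\boldsymbol{X}=[x_{1},x_{2},\ldots,x_{N}]\in\mathbb{R}^{D\times N}\) and a vector of measured responses \(y\in\mathbb{R}^{N}\).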
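For reference, Ridge Regression with an unpenalized bias admits a closed-form solution, which the following NumPy sketch computes. It is only an illustration: the scaling of the loss and of the penalty (a factor 1/2 on both) and the function names `ridge_objective` and `ridge_fit` are assumptions of this example, and the design matrix stores one observation per column as in the text.

```python
import numpy as np

def ridge_objective(w, b, X, y, lam):
    """Squared error plus squared l2 penalty on w; the bias b is not penalized."""
    residual = y - X.T @ w - b
    return 0.5 * residual @ residual + 0.5 * lam * w @ w

def ridge_fit(X, y, lam):
    """Closed-form minimizer; X is D x N with one observation per column."""
    x_mean = X.mean(axis=1, keepdims=True)
    Xc = X - x_mean                  # centering eliminates the bias from the normal equations
    yc = y - y.mean()
    w = np.linalg.solve(Xc @ Xc.T + lam * np.eye(X.shape[0]), Xc @ yc)
    b = y.mean() - x_mean[:, 0] @ w
    return w, b

rng = np.random.default_rng(3)
X = rng.standard_normal((3, 20))
y = X.T @ np.array([1.0, -2.0, 0.5]) + 0.7 + 0.1 * rng.standard_normal(20)
lam = 0.5
w_hat, b_hat = ridge_fit(X, y, lam)

# Gradient of the objective at the computed solution (should vanish).
residual = y - X.T @ w_hat - b_hat
grad_w = -X @ residual + lam * w_hat
grad_b = -np.sum(residual)
```

Since the objective is smooth and strongly convex in `w`, a vanishing gradient certifies optimality; no splitting is needed in this simplest case.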

#### 3.2.2 Group lasso

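In the group lasso the \(l_{2}\) penalty used in Ridge Regression is replaced by the group lasso penalty with (possibly overlapping) groups, see Zhao et al. (2009), Jacob et al. (2009). Let \(2^{\mathbb{N}_{D}}\) denote the *power set*^{3} of \(\mathbb{N}_{D}\) and consider some collection of \(M\) ordered sets \(\{S_{1},S_{2},\ldots,S_{M}\}\subseteq2^{\mathbb{N}_{D}}\). For any \(w\in\mathbb{R}^{D}\) and any ordered set \(S\) in the collection let \(w|_{S}\in\mathbb{R}^{|S|}\) be defined entry-wise by

\[
(w|_{S})_{j}:=w_{s_{j}},\quad j\in\mathbb{N}_{|S|},
\]

where \(s_{j}\) denotes the \(j\)-th element of \(S\). The group lasso problem with overlapping groups and an unpenalized bias term can be expressed as

\[
\min_{(w,b)\in\mathbb{R}^{D}\times\mathbb{R}}\ f(w,b)+g(w) \tag{14}
\]

where \(f\) is the error functional and, for \(\lambda>0\):

\[
g(w):=\lambda\sum_{m\in\mathbb{N}_{M}}\bigl\Vert w|_{S_{m}}\bigr\Vert.
\]

The latter is a first example of *composite penalty*. In this case, grouped selection occurs for non-overlapping groups; hierarchical variable selection is reached by defining groups with particular overlapping patterns (Zhao et al. 2009). Consider now the abstract vector space

\[
\underbrace{\mathbb{R}^{D}\times\mathbb{R}^{D}\times\cdots\times\mathbb{R}^{D}}_{M\text{ times}}\times\mathbb{R}
\]

endowed with the canonical inner product

\[
\bigl\langle(w_{[1]},\ldots,w_{[M]},b),(v_{[1]},\ldots,v_{[M]},c)\bigr\rangle=\sum_{m\in\mathbb{N}_{M}}w_{[m]}^{\top}v_{[m]}+bc.
\]

Note that the original variable \(w\) is duplicated into \(M\) copies, namely, \(w_{[1]},w_{[2]},\ldots,w_{[M]}\). Once defined the set

\[
\bar{\mathcal{C}}:=\bigl\{(w_{[1]},w_{[2]},\ldots,w_{[M]},b)~:~w_{[1]}=w_{[2]}=\cdots=w_{[M]}\bigr\}
\]

we can solve (14) by means of the problem

\[
\begin{array}{ll}
\text{minimize} & \bar{f}(w_{[1]},\ldots,w_{[M]},b)+\bar{g}(w_{[1]},\ldots,w_{[M]},b)\\
\text{subject to} & (w_{[1]},\ldots,w_{[M]},b)\in\bar{\mathcal{C}}
\end{array}
\]

where

\[
\bar{f}(w_{[1]},\ldots,w_{[M]},b):=f\Bigl(\frac{1}{M}\sum_{m\in\mathbb{N}_{M}}w_{[m]},b\Bigr),\qquad \bar{g}(w_{[1]},\ldots,w_{[M]},b):=\lambda\sum_{m\in\mathbb{N}_{M}}\bigl\Vert w_{[m]}|_{S_{m}}\bigr\Vert.
\]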
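The effect of duplicating variables can be illustrated in a few lines of NumPy. In this sketch the function names are hypothetical, groups are given as tuples of 0-based indices, and the penalty is taken to be the unweighted sum of group-wise l2 norms; other weightings are possible.

```python
import numpy as np

def group_lasso_penalty(w, groups, lam):
    """lam times the sum over groups of the l2 norm of w restricted to the group."""
    return lam * sum(np.linalg.norm(w[list(g)]) for g in groups)

def duplicated_penalty(copies, groups, lam):
    """Separable version: the m-th group only looks at the m-th copy of w."""
    return lam * sum(np.linalg.norm(wm[list(g)]) for wm, g in zip(copies, groups))

def project_consensus(copies):
    """Projection onto the set where all copies coincide: replace each by the average."""
    mean = np.mean(copies, axis=0)
    return [mean.copy() for _ in copies]

groups = [(0, 1, 2), (2, 3), (3, 4)]          # overlapping groups on D = 5 variables
w = np.array([3.0, 0.0, 4.0, 0.0, 0.0])
copies = [w.copy() for _ in groups]

original = group_lasso_penalty(w, groups, lam=1.0)       # 5 + 4 + 0
separable = duplicated_penalty(copies, groups, lam=1.0)

# Copies that disagree are mapped back to the consensus set by averaging.
perturbed = [w + 1.0, w - 1.0, w]
projected = project_consensus(perturbed)
```

On the consensus set the duplicated penalty equals the original one, while being a sum of terms that each depend on a single copy; this separability is what the splitting approach exploits.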

### 3.3 Learning with tensors

In the next sections we will deal with both inductive and transductive tensor-based learning problems. Regularization will be based upon *composite spectral penalties* that we introduce in Sect. 5. Multiple module spaces will be used to account for tensor unknowns of different orders. We will tackle multiple tasks simultaneously and assume input features are collected within higher order tensors. A strategy similar to the one considered above for the group lasso will be used to conveniently recast our learning problems in terms of (8).

#### 3.3.1 Transductive Learning

In the transductive case one has an input data tensor with missing features and, possibly, a partially observed matrix of labels. The goal is both to infer the missing entries in the data tensor and to predict the missing labels. Notably, the special case in which there is no labeling information corresponds to tensor completion, which was considered for the first time in Liu et al. (2009) and can be regarded as a single learning task. For the case where input patterns are represented as vectors our approach boils down to the formulation in Goldberg et al. (2010). In this sense the transductive formulation that we propose can be regarded as a generalization to the case when input data admit a higher order representation. In this case the essential idea consists of regularizing the collection of input features and labels directly without learning a model.

#### 3.3.2 Inductive learning

Each training pair consists of an input tensor data observation and a vector of labels that corresponds to related but distinct tasks. This setting extends the standard penalized empirical risk minimization problem to allow for both multiple tasks and higher order observational data.

#### 3.3.3 Common algorithmic framework

The learning tasks that we deal with via the optimization problem in (8)

## 4 Unifying algorithmical approach

For certain closed forms of \(\bar{f}\) and \(\bar{g}\), (8) can be restated as a semi-definite programming (SDP) problem (Vandenberghe and Boyd 1996) and solved via SDP solvers such as SeDuMi (Sturm 1999), or SDPT3 (Tütüncü et al. 2003). However there is an increasing interest in the case where the abstract vector space is high dimensional, in which case this approach is not satisfactory. Alternative scalable techniques that can be adapted to the solution of (8) consist of proximal point algorithms designed to find a zero of the sum of maximal monotone operators. Classical references include Rockafellar (1976), Lions and Mercier (1979) and Spingarn (1983). A modern and comprehensive review with application to signal processing is found in Combettes and Pesquet (2009). These algorithms include as special cases the Alternating Direction Method of Multipliers (ADMMs), see Boyd et al. (2011) for a recent review. Here we propose an algorithm in the family of the Douglas-Rachford splitting methods. Notably, ADMMs can be seen as a special instance of the Douglas-Rachford splitting method, see Eckstein and Bertsekas (1992) and references therein. Our general approach can be regarded as a variant of the proximal decomposition method proposed in Combettes and Pesquet (2008) and Combettes (2009), by which it was inspired. As its main advantage, the approach does not solve the original problem directly; rather, it duplicates some of the optimization variables and solves simpler problems (proximal problems) in a distributed fashion. As we will show later, the simplicity of proximal problems lies in the fact that they can be solved exactly in terms of the SVD. Notably, as Sect. 3.2 shows, the algorithm we develop is not relevant for our tensor-based framework only.

### 4.1 Proximal point algorithms and operator splitting techniques

#### 4.1.1 Problem restatement

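Problem (8) can be equivalently restated as the unconstrained minimization of \(\bar{h}:=\bar{f}+\bar{g}+\delta_{\bar{\mathcal{C}}}\), where \(\delta_{\bar{\mathcal{C}}}\) is the indicator function of \(\bar{\mathcal{C}}\) (equal to zero on \(\bar{\mathcal{C}}\) and to \(+\infty\) elsewhere). Correspondingly, \(w\) is a solution if and only if

\[
0\in\nabla\bar{f}(w)+\partial\bar{g}(w)+N_{\bar{\mathcal{C}}}(w) \tag{22}
\]

where \(N_{\bar{\mathcal{C}}}\) is the *normal cone* (Bauschke and Combettes 2011) of \(\bar{\mathcal{C}}\):

\[
N_{\bar{\mathcal{C}}}(x):=
\begin{cases}
\{s~:~\langle s,y-x\rangle\leq0~~\forall y\in\bar{\mathcal{C}}\}, & \text{if } x\in\bar{\mathcal{C}}\\
\emptyset, & \text{otherwise.}
\end{cases}
\]

Letting now

\[
A:=\nabla\bar{f}+N_{\bar{\mathcal{C}}},\qquad B:=\partial\bar{g},\qquad T:=A+B, \tag{23}
\]

Eq. (22) can be restated as

\[
0\in T(w)=A(w)+B(w). \tag{24}
\]

Note that \(A\) and \(B\), as well as their sum \(T=A+B\), are set-valued operators (for each \(w\) their image is a subset of the abstract vector space) and they all qualify as *maximal monotone*. Maximal monotone operators, of which subdifferentials are a special instance, have been extensively studied in the literature, see e.g. Minty (1962), Rockafellar (1970b), Brézis (1973) and Phelps (1993). A recent account of the subject can be found in Bauschke and Combettes (2011).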

#### 4.1.2 Resolvent and proximal point algorithms

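For any \(\tau>0\) and a given maximal monotone operator \(T\) on the abstract vector space, \(\hat{x}\in T^{-1}(0)\) if and only if \(\hat{x}\) satisfies \(\hat{x}\in R_{\tau T}\hat{x} \), i.e., if \(\hat{x}\) is a fixed point of the single-valued *resolvent* of \(\tau T\), defined as

\[
R_{\tau T}:=(I+\tau T)^{-1}.
\]

Proximal point algorithms are based on this fact and consist of variations of the basic proximal iteration

\[
x^{(t+1)}=R_{\tau T}\,x^{(t)}. \tag{26}
\]

In our context \(T\) is a special monotone operator; indeed it corresponds to the subdifferential of the convex function \(\bar{h}\), i.e., the sum of \(\bar{f}\), \(\bar{g}\) and the indicator function of \(\bar{\mathcal{C}}\). In case of a subdifferential, (26) can be restated as \(x^{(t)}\in x^{(t+1)}+\tau\partial\bar{h}(x^{(t+1)})\). This, in turn, is equivalent to:

\[
0\in\partial\bar{h}(x^{(t+1)})+\frac{1}{\tau}\bigl(x^{(t+1)}-x^{(t)}\bigr).
\]

#### 4.1.3 Proximity operator

The latter is the optimality condition of the optimization problem

\[
\min_{x}\ \bar{h}(x)+\frac{1}{2\tau}\bigl\Vert x-x^{(t)}\bigr\Vert^{2}, \tag{28}
\]

so that the proximal iteration can be written as

\[
x^{(t+1)}=\operatorname{prox}_{\tau\bar{h}}\bigl(x^{(t)}\bigr), \tag{29}
\]

where the *proximity operator* of a convex function \(\bar{h}\) is defined as

\[
\operatorname{prox}_{\bar{h}}(x):=\arg\min_{w}\ \bar{h}(w)+\frac{1}{2}\Vert x-w\Vert^{2}. \tag{30}
\]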

#### 4.1.4 Operator splitting approaches

The proximal iteration (29) is numerically viable only in those cases in which it is easy to solve the optimization problem (28). When \(\bar{h}\) is a quadratic function, for instance, (28) corresponds to the solution of a system of linear equations that can be approached by reliable and well studied routines. In general, however, it is non-trivial to tackle problem (28) directly. A viable alternative to the proximal iteration (29) relies on an *operator splitting* approach, see Bauschke and Combettes (2011) for a modern review. In the present context, the use of a splitting technique arises quite naturally from separating the objective function \(\bar{h}\) into (1) the sum of \(\bar{f}\) and the indicator function of \(\bar{\mathcal{C}}\) (corresponding to the operator *A*) and (2) the (generally) non-smooth term \(\bar{g}\) (corresponding to the operator *B*). As we will see, this decomposition leads to a tractable algorithm, in which the operators *A* and *B* are employed in separate subproblems that are easy to solve. In particular, a classical method to solve (24) is the Douglas-Rachford splitting technique that was initially developed in Lions and Mercier (1979) based upon an idea found in Douglas and Rachford (1956).
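The quadratic case mentioned above can be sketched as follows. The example assumes a function of the form h(x) = 0.5 x'Qx - c'x with Q symmetric positive definite (an assumption of this illustration, as are the names `prox_quadratic` and `proximal_point`); the proximal step then amounts to one linear solve, and repeating it drives the iterate to the minimizer of h.

```python
import numpy as np

def prox_quadratic(v, Q, c, tau):
    """Proximal step for h(x) = 0.5 x'Qx - c'x: solve (tau Q + I) x = tau c + v."""
    return np.linalg.solve(tau * Q + np.eye(len(v)), tau * c + v)

def proximal_point(Q, c, tau=1.0, iters=200):
    """Basic proximal iteration started from the origin."""
    x = np.zeros(len(c))
    for _ in range(iters):
        x = prox_quadratic(x, Q, c, tau)
    return x

Q = np.array([[2.0, 0.5], [0.5, 1.0]])
c = np.array([1.0, -1.0])
x_hat = proximal_point(Q, c)
x_star = np.linalg.solve(Q, c)       # minimizer of h, where the gradient Qx - c vanishes
```

The minimizer is a fixed point of the proximal step, in line with the fixed-point characterization of the resolvent.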

### 4.2 Douglas-Rachford splitting technique

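The Douglas-Rachford splitting technique allows one to solve the inclusion problem (24), where \(A\) and \(B\) are maximal monotone operators. The main iteration \(G_{DR}\) consists of the following steps:

\[
\begin{aligned}
y^{(k)} &= R_{\tau A}\bigl(w^{(k)}\bigr) &\qquad\text{(31a)}\\
r^{(k)} &= R_{\tau B}\bigl(2y^{(k)}-w^{(k)}\bigr) &\qquad\text{(31b)}\\
w^{(k+1)} &= w^{(k)}+\gamma^{(k)}\bigl(r^{(k)}-y^{(k)}\bigr). &\qquad\text{(31c)}
\end{aligned}
\]

In the latter \(\tau\) is a positive proximity parameter and \((\gamma^{(k)})_{k}\) is a sequence of parameters that, once chosen appropriately, ensures convergence. With reference to (23), Eq. (31a) reads in our context

\[
y^{(k)}=\arg\min_{x\in\bar{\mathcal{C}}}\ \bar{f}(x)+\frac{1}{2\tau}\bigl\Vert x-w^{(k)}\bigr\Vert^{2} \tag{32}
\]

whereas (31b) reads

\[
r^{(k)}=\operatorname{prox}_{\tau\bar{g}}\bigl(2y^{(k)}-w^{(k)}\bigr). \tag{33}
\]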
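A toy instance may help to see the Douglas-Rachford iteration at work. The sketch below is not the paper's algorithm; it assumes a least squares term for the smooth part, an l1 penalty for the non-smooth part, no constraint set and a constant relaxation parameter `gamma`. With these choices both resolvents are available in closed form (a linear solve and a soft-thresholding).

```python
import numpy as np

def soft_threshold(v, t):
    """Proximity operator of t times the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def douglas_rachford(X, y_obs, lam, tau=0.05, gamma=1.0, iters=500):
    """Minimize 0.5 ||X w - y_obs||^2 + lam ||w||_1 by Douglas-Rachford splitting."""
    D = X.shape[1]
    M = tau * X.T @ X + np.eye(D)
    w = np.zeros(D)
    for _ in range(iters):
        y = np.linalg.solve(M, tau * X.T @ y_obs + w)   # resolvent of the smooth part
        r = soft_threshold(2 * y - w, tau * lam)        # resolvent of the l1 part
        w = w + gamma * (r - y)
    return np.linalg.solve(M, tau * X.T @ y_obs + w)    # the y-sequence carries the solution

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 5))
y_obs = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.0]) + 0.01 * rng.standard_normal(20)
lam = 1.0
w_hat = douglas_rachford(X, y_obs, lam)

# Optimality check: a minimizer is a fixed point of the proximal gradient map.
t = 0.01
grad = X.T @ (X @ w_hat - y_obs)
fixed_point_gap = np.linalg.norm(w_hat - soft_threshold(w_hat - t * grad, t * lam))
```

Note that it is the sequence of `y` values, not the auxiliary variable `w`, that approaches a minimizer, which is why the function returns one more resolvent evaluation.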

### 4.3 Modelling workflow within the Douglas-Rachford algorithmic framework

The use of a splitting technique arises quite naturally in our context from separating the objective function (with constraints embedded via the indicator function) into (1) a part that can be approached by gradient projection and (2) a non-smooth term that can be conveniently tackled via a proximal problem. On the other hand, the Douglas-Rachford algorithmic framework, together with the abstract vector space machinery introduced above, naturally results in the following mathematical engineering workflow.

### Optimization modelling

Specification of the target problem: definition of the cost *f* and of the composite penalty *g* depending on the learning task of interest.

### Problem casting

Specification of the auxiliary problem: definition of the abstract vector space; \(\bar{f}\), \(\bar{g}\) and \(\bar{\mathcal{C}}\) are specified so that a solution of the auxiliary problem can be mapped into a solution of the target problem.

Sect. 3.2 already provided an illustration of these steps in connection to learning problems involving a parameter vector and a bias term. In general, a key ingredient in doing the problem casting is to ensure that \(\bar{g}\) is an *additive separable function*. In this case, in fact, computing \(\mathrm {prox}_{\tau\bar{g}}\) in (33) involves subproblems on each module space that can be distributed. We formally state this result in the following proposition. The simple proof can be found in the literature, see e.g. Bauschke and Combettes (2011, Proposition 23.30).

### Proposition 1

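*For* \(i\in\mathbb{N}_{I}\) *let* \(\mathcal{H}_{i}\) *be some vector space with inner product* \(\langle\cdot,\cdot\rangle_{i}\). *Let* \(\mathcal{H}\) *be the space obtained endowing the Cartesian product* \(\mathcal{H}_{1}\times\mathcal{H}_{2}\times\cdots\times\mathcal{H}_{I}\) *with the inner product* \(\langle x,y\rangle=\sum_{i\in\mathbb{N}_{I}}\langle x_{i},y_{i}\rangle_{i}\). *Assume a function* \(\bar{g}:\mathcal{H}\rightarrow\mathbb{R}\) *defined by*

\[
\bar{g}:(x_{1},x_{2},\ldots,x_{I})\mapsto\sum_{i\in\mathbb{N}_{I}}g_{i}(x_{i})
\]

*where for any* \(i\in\mathbb{N}_{I}\), \(g_{i}:\mathcal{H}_{i}\rightarrow\mathbb{R}\) *is convex*. *Then we have*:

\[
\operatorname{prox}_{\bar{g}}(x)=\bigl(\operatorname{prox}_{g_{1}}(x_{1}),\operatorname{prox}_{g_{2}}(x_{2}),\ldots,\operatorname{prox}_{g_{I}}(x_{I})\bigr).
\]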
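The separability result can be checked on a small example. The sketch below assumes two blocks, penalized by the l1 norm and by the (unsquared) l2 norm respectively; these penalties and all function names are choices made for this illustration only.

```python
import numpy as np

def prox_l1(v, t):
    """Proximity operator of t times the l1 norm (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_l2(v, t):
    """Proximity operator of t times the l2 norm (block soft-thresholding)."""
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= t else (1.0 - t / nrm) * v

def prox_separable(x, proxes, tau):
    """Prox of a sum of functions of separate blocks: one prox per block."""
    return tuple(p(xi, tau) for p, xi in zip(proxes, x))

def prox_objective(z, x, tau):
    """Objective of the joint proximal problem, used only to verify the result."""
    g = np.sum(np.abs(z[0])) + np.linalg.norm(z[1])
    dist = sum(np.sum((zi - xi) ** 2) for zi, xi in zip(z, x))
    return tau * g + 0.5 * dist

x = (np.array([2.0, -0.3, 0.7]), np.array([3.0, 4.0]))
tau = 0.5
p = prox_separable(x, (prox_l1, prox_l2), tau)
```

Each block is processed independently of the others, so in an actual implementation the per-block computations could run in parallel.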

### 4.4 Limits of two-level strategies

Next we present our algorithm based on an *inexact* variant of the Douglas-Rachford iteration. Our interest is in those situations where (33) can be computed exactly whereas the inner problem (32) requires an iterative procedure. As it turns out, in fact, in many situations one can cast the learning problem of interest in such a way that (33) can be computed easily and with high precision. Nonetheless, for general \(\bar{f}\) in the inner problem (32), using the Douglas-Rachford iteration to solve (8) requires a procedure consisting of two nested iterative schemes. In general, the convergence of such a two-level strategy is ensured only upon exact solution of the inner problem. On the other hand, practical implementations require one to specify a termination criterion and a corresponding accuracy. Notably, Gandy et al. (2011) propose different algorithms for an instance of the general problem in (8) similar to the formulations we will consider in Sect. 6. In particular, in Sect. 5.4 they also devise an inexact algorithm but they do not provide any convergence guarantee. Motivated by this we propose an adaptive termination criterion for the inner problem and prove the convergence of the outer scheme to a solution of (8).

### 4.5 Template based on inexact splitting technique

The algorithm that we propose, termed *Inexact Splitting Method* (ISM), is presented in Algorithm 1, in which we denoted by \(P_{\bar{\mathcal{C}}}\) the projection onto \(\bar{\mathcal{C}}\):

\[
P_{\bar{\mathcal{C}}}(x):=\arg\min_{w\in\bar{\mathcal{C}}}\Vert x-w\Vert^{2}.
\]

The idea is sketched as follows.

1. We apply an inexact version of \(G_{DR}\) to solve problem (8), where we only require to compute \(y^{(k)}\) in (32) up to a given precision \(\epsilon(k)\). Since, in our setting, (31b) can be computed in a closed form, we do not require any inexactness at this step.
2. Problem (32) is strongly convex for any \(\tau>0\) and convex and differentiable function \(\bar{f}\). One can apply a gradient method that converges in this situation at a linear rate (Nesterov 2003, Theorem 2.2.8, p. 88).

The inner problem (32) is therefore solved only up to a precision \(\epsilon(k)\) that depends upon the iteration index \(k\). In practice this is achieved via the Goldstein-Levitin-Polyak gradient projection method, see Bertsekas (1976, 1995). In the first main iterations a solution for (32) is found with low accuracy (hence the term *inexact*); as the estimate is refined along the iterations of Main the precision within the inner problem is increased; this ensures that the sequence \((y^{(k)})_{k}\) produced by Algorithm 1 converges to a solution of problem (8), as the following result shows.
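The two-level idea can be sketched in NumPy as follows. This is a simplified illustration rather than a transcription of Algorithm 1: the precision schedule `eps0 / k**sigma`, the inner stopping rule based on the size of the gradient projection step, the unit relaxation parameter and the toy problem (least squares plus an l1 penalty, with no constraint so that the projection is the identity) are all assumptions of this example.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def inexact_prox(grad_f, L_f, w, x0, tau, eps, project=lambda x: x, max_iter=10_000):
    """Approximately minimize f(x) + ||x - w||^2 / (2 tau) over the constraint set
    by gradient projection with constant step 1 / L_q, warm-started at x0."""
    L_q = L_f + 1.0 / tau
    x = x0
    for _ in range(max_iter):
        x_new = project(x - (grad_f(x) + (x - w) / tau) / L_q)
        if L_q * np.linalg.norm(x_new - x) <= eps:   # inner accuracy reached
            return x_new
        x = x_new
    return x

def inexact_splitting(grad_f, L_f, prox_g, dim, tau=0.05, eps0=1.0, sigma=2.0, iters=300):
    """Douglas-Rachford iteration whose first step is solved with growing precision."""
    w = np.zeros(dim)
    y = np.zeros(dim)
    for k in range(1, iters + 1):
        eps_k = eps0 / k ** sigma                    # tolerance shrinks along the outer loop
        y = inexact_prox(grad_f, L_f, w, y, tau, eps_k)
        r = prox_g(2 * y - w, tau)                   # exact, closed-form step
        w = w + (r - y)
    return y

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 5))
y_obs = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.01 * rng.standard_normal(30)
lam = 1.0
L_f = np.linalg.norm(X, 2) ** 2                      # Lipschitz constant of the gradient
grad_f = lambda w: X.T @ (X @ w - y_obs)
prox_g = lambda v, tau: soft_threshold(v, tau * lam)
y_ism = inexact_splitting(grad_f, L_f, prox_g, dim=5)

# Optimality check via the fixed-point residual of the proximal gradient map.
t = 1.0 / L_f
residual = np.linalg.norm(y_ism - soft_threshold(y_ism - t * grad_f(y_ism), t * lam))
```

Early outer iterations are cheap because the inner loop stops after very few gradient steps; the work per outer iteration grows only as the tolerance tightens.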

### Theorem 1

*Assume the solution set of problem* (8) *is non*-*empty*. *In Algorithm* 1 *let* \(\epsilon_{0}>0\), \(\sigma>1\) *and* \(\tau>0\) *be arbitrarily fixed parameters*. *Then* \(\{y^{(k)}\}_{k}\) *converges to an element of the solution set*.

### Proof

See Appendix B. □

### Remark 5

(Unknown Lipschitz constant)

Notice that in the procedure that computes the proximity operator with adaptive precision we assumed that \(L_{\bar{f}}\), as defined in (9), is known; based upon the latter, \(L_{\bar{q}}\) is immediately computed since \(L_{\bar{q}}=L_{\bar{f}}+1/\tau\), see Lemma 2 in Appendix B. In practical applications, however, \(L_{\bar{f}}\) is often unknown or hard to compute. In this situation an upper bound for \(L_{\bar{q}}\) can be found according to a backtracking strategy, see Beck and Teboulle (2009), Nesterov (2007) for details. The constant step-size \(1/L_{\bar{q}}\) in step 3 of InexactProxy is then replaced by an adaptive step-size \(h \in(0, \frac{1}{L_{\bar{q}}}]\), appropriately chosen by the backtracking procedure.
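A backtracking rule of the kind referred to in this remark can be sketched as follows. The sufficient-decrease test used here is one common choice and is an assumption of this example, as are the function name `backtracking_step` and the halving factor; the accepted step is guaranteed to satisfy the test but is not forced to lie below the reciprocal of the true Lipschitz constant.

```python
import numpy as np

def backtracking_step(q, grad_q, x, h=1.0, shrink=0.5):
    """Shrink the step size h until a gradient step yields sufficient decrease."""
    g = grad_q(x)
    while q(x - h * g) > q(x) - 0.5 * h * (g @ g):
        h *= shrink
    return x - h * g, h

# Quadratic test function whose gradient has Lipschitz constant 10.
Q = np.diag([1.0, 10.0])
q = lambda x: 0.5 * x @ Q @ x
grad_q = lambda x: Q @ x

x0 = np.array([1.0, 1.0])
x1, h = backtracking_step(q, grad_q, x0)
```

For this quadratic the test holds for every step size not exceeding 1/10, so the loop is guaranteed to stop; starting from 1 and halving, it accepts the first value that passes the test.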

### Remark 6

(Termination of the outer loop)

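Since \(\{y^{(k)}\}_{k}\) converges to the solution of problem (8), one can use the condition

\[
\frac{\Vert y^{(k+1)}-y^{(k)}\Vert}{\Vert y^{(k)}\Vert}\leq\eta
\]

to terminate the loop in the procedure Main, where \(\eta>0\) is a desired accuracy. However, for the specific form of the learning problems considered in this paper, we prefer to use the objective value. Typically, we terminate the outer loop if^{4}

\[
\frac{\bigl|\bar{h}(y^{(k+1)})-\bar{h}(y^{(k)})\bigr|}{\bigl|\bar{h}(y^{(k)})\bigr|}\leq\eta. \tag{36}
\]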

The reason for this choice is as follows: generally the termination condition (36) finds solutions that are close to optimal (with respect to the optimization problem). When it does not, the algorithm is normally stuck in a plateau, which means that the optimization is typically going to require a lot of time with no significant improvement in the estimate. In this setting the termination condition achieves a shorter computational time by accepting the estimate obtained so far and exiting the loop.

## 5 Spectral regularization and multilinear ranks

So far we have elaborated on the general formulation in (8); in this section we specify the nature of the penalty functions that we are concerned with in our tensor-based framework. We begin by focusing on the case where the abstract vector space corresponds to \(\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\); we then consider multiple module spaces in line with (6) and (7).

### 5.1 Spectral penalties for higher order tensors

We recall that a *symmetric gauge function* \(h:\mathbb {R}^{P}\rightarrow\mathbb{R}\) is a norm which is both absolute and invariant under permutations^{5} (von Neumann 1937), see also Horn and Johnson (1994, Definition 3.5.17). Symmetric gauge functions are for instance all the *l* _{ p } norms. The following definition generalizes to higher order tensors the concept of spectral regularizer studied in Abernethy et al. (2009) and Argyriou et al. (2010).

### Definition 1

(*n*-mode spectral penalty for higher order tensors)

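A function \(\varOmega:\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\rightarrow\mathbb{R}\) is called an \(n\)-mode spectral penalty if it can be written as:

\[
\varOmega(\mathcal{W})=h\bigl(\sigma(\mathcal{W}_{\langle n\rangle})\bigr)
\]

where \(h\) is some symmetric gauge function and \(\sigma(\mathcal{W}_{\langle n\rangle})\) is the vector of singular values of the \(n\)-mode unfolding \(\mathcal{W}_{\langle n\rangle}\).

We are especially interested in composite spectral regularizers, which correspond to the (weighted) sum of different \(n\)-mode spectral penalties. The earliest example of such a situation is found in Liu et al. (2009). Denoting by \(\Vert\cdot\Vert_{*}\) the nuclear norm for matrices, Liu et al. (2009) consider the penalty

\[
g(\mathcal{W})=\frac{1}{N}\sum_{n\in\mathbb{N}_{N}}\Vert\mathcal{W}_{\langle n\rangle}\Vert_{*}. \tag{37}
\]

Since \(\Vert\mathcal{W}_{\langle n\rangle}\Vert_{*}=\Vert\sigma(\mathcal{W}_{\langle n\rangle})\Vert_{1}\) and \(\Vert\cdot\Vert_{1}\) is a symmetric gauge function, (37) qualifies as a composite spectral regularizer.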

The nuclear norm has been used to devise convex relaxations for rank constrained matrix problems (Recht et al. 2007; Candès and Recht 2009; Candès et al. 2011); this parallels the use of the \(l_{1}\)-norm in sparse approximation and cardinality minimization (Tibshirani 1996; Chen et al. 2001; Donoho 2006). Likewise, minimizing (37) can be seen as a convex proxy for the minimization of the multilinear ranks.
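The composite penalty discussed here is straightforward to evaluate numerically. The following NumPy sketch assumes the uniform 1/N weighting of the mode-wise nuclear norms; the helper names are hypothetical and modes are 0-based.

```python
import numpy as np

def unfold(tensor, n):
    """n-mode unfolding (columns are the n-mode vectors)."""
    return np.moveaxis(tensor, n, 0).reshape(tensor.shape[n], -1)

def nuclear_norm(M):
    """Sum of the singular values of a matrix."""
    return np.linalg.svd(M, compute_uv=False).sum()

def sum_of_nuclear_norms(W):
    """Average over the modes of the nuclear norm of the n-mode unfolding."""
    return sum(nuclear_norm(unfold(W, n)) for n in range(W.ndim)) / W.ndim

# Rank-1 tensor: every unfolding has a single nonzero singular value,
# equal to the product of the norms of the generating vectors (5 * 1 * 2).
u, v, w = np.array([3.0, 4.0]), np.array([1.0, 0.0, 0.0]), np.array([0.0, 2.0])
rank_one = np.einsum("i,j,k->ijk", u, v, w)
penalty_rank_one = sum_of_nuclear_norms(rank_one)

# Second order case: the penalty reduces to the matrix nuclear norm.
M = np.random.default_rng(6).standard_normal((4, 3))
penalty_matrix = sum_of_nuclear_norms(M)
```

The matrix case works out because a matrix and its transpose share the same singular values, so averaging over the two modes changes nothing.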

### 5.2 Relation with multilinear rank

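The multilinear singular value decomposition (MLSVD) expresses a tensor \(\mathcal{W}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\) as

\[\mathcal{W}=\mathcal{S}\times_{1}\boldsymbol{U}^{(1)}\times_{2}\boldsymbol{U}^{(2)}\times_{3}\cdots\times_{N}\boldsymbol{U}^{(N)}\]

where \(\mathcal{S}\) is called the *core tensor* and for any \(n\in\mathbb{N}_{N}\), \(\boldsymbol{U}^{(n)}\in\mathbb{R}^{I_{n}\times I_{n}}\) is a matrix of *n*-mode singular vectors, i.e., the left singular vectors of the *n*-mode unfolding \(\mathcal{W}_{\langle n\rangle}\) with SVD

\[\mathcal{W}_{\langle n\rangle}=\boldsymbol{U}^{(n)}\operatorname{diag}\bigl(\sigma(\mathcal{W}_{\langle n\rangle})\bigr)\boldsymbol{V}^{(n)\top}.\]^{6}

A tensor of lower multilinear rank is obtained by discarding the *n*-mode singular vectors corresponding to the smallest singular values \(\sigma(\mathcal {W}_{\langle n\rangle})\). See Fig. 2 for an illustration. Since penalizing the nuclear norm of \(\mathcal{W}_{\langle n\rangle}\) enforces the sparsity of \(\sigma(\mathcal{W}_{\langle n\rangle})\), (37) favors low multilinear rank tensors. Notably for \(N=2\) (second order case) it is easy to see that (37) is consistent with the definition of nuclear norm for matrices.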
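The relation between the *n*-mode singular vectors, the core tensor and the multilinear rank can be illustrated numerically. The sketch below is a minimal NumPy illustration, not the authors' implementation; `unfold`, `mode_product`, `mlsvd` and `multilinear_rank` are hypothetical helper names, and the relative tolerance used to count nonzero singular values is an assumption.

```python
import numpy as np

def unfold(W, n):
    """n-mode unfolding (mode n as rows)."""
    return np.moveaxis(W, n, 0).reshape(W.shape[n], -1)

def mode_product(W, A, n):
    """n-mode product: multiply every mode-n fibre of W by the matrix A."""
    return np.moveaxis(np.tensordot(A, W, axes=(1, n)), 0, n)

def mlsvd(W):
    """Core tensor and square matrices of n-mode singular vectors."""
    factors = [np.linalg.svd(unfold(W, n), full_matrices=True)[0]
               for n in range(W.ndim)]
    core = W
    for n, U in enumerate(factors):
        core = mode_product(core, U.T, n)  # project mode n onto its singular vectors
    return core, factors

def multilinear_rank(W, rel_tol=1e-10):
    """Number of non-negligible singular values of each unfolding."""
    ranks = []
    for n in range(W.ndim):
        s = np.linalg.svd(unfold(W, n), compute_uv=False)
        ranks.append(int(np.sum(s > rel_tol * s[0])))
    return tuple(ranks)

# Build a (5 x 6 x 4) tensor with multilinear rank (2, 3, 2).
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3, 2))
for n, rows in enumerate((5, 6, 4)):
    W = mode_product(W, rng.standard_normal((rows, W.shape[n])), n)

core, factors = mlsvd(W)

# Multiplying the core back by the factor matrices recovers the tensor.
W_rec = core
for n, U in enumerate(factors):
    W_rec = mode_product(W_rec, U, n)
```

In this example the unfoldings have only 2, 3 and 2 non-negligible singular values respectively, which is exactly the sparsity pattern that a nuclear norm penalty on each unfolding promotes.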

The nuclear norm is the convex envelope of the rank function on the spectral-norm unit ball (Fazel 2002); as such it represents the best convex approximation for a number of non-convex matrix problems involving the rank function. Additionally it has been established that under certain probabilistic assumptions it allows one to recover with high probability a low rank matrix from a random subset of its entries (Candès and Recht 2009; Koltchinskii et al. 2010). Similar results do not exist for (37) when *N*>2, no matter what definition of tensorial rank one considers (see Sect. 2.1). It is therefore arguable whether or not it is appropriate to call it *nuclear norm for tensors*, as done in Liu et al. (2009). Nonetheless this penalty provides a viable way to compute low complexity estimates in the spirit of the Tucker decomposition. By contrast problems stated in terms of the tensorial rank (2) are notoriously intractable (Hillar and Lim 2010; Hastad 1990). To the best of our knowledge it remains an open problem to devise an appropriate convexification for this type of rank function.

### 5.3 Proximity operators

The numerical feasibility of proximal point algorithms largely depends upon the simplicity of computing the proximal operator introduced in (30). For the class of *n*-mode spectral penalties we can establish the following.

### Proposition 2

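(Proximity operator of an *n*-mode spectral penalty)

*Assume* \(\mathcal{W}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\) *and let*

\[\mathcal{W}_{\langle n\rangle}=\boldsymbol{U}\operatorname{diag}\bigl(\sigma(\mathcal{W}_{\langle n\rangle})\bigr)\boldsymbol{V}^{\top}\qquad(39)\]

*be the SVD of its n-mode unfolding* \(\mathcal{W}_{\langle n \rangle}\). *Then the evaluation at* \(\mathcal{W}\) *of the proximity operator of* \(\varOmega(\mathcal{W})=h(\sigma(\mathcal{W}_{\langle n\rangle}))\) *is*

\[\operatorname{prox}_{\tau\varOmega}(\mathcal{W})=\Bigl(\boldsymbol{U}\operatorname{diag}\bigl(\operatorname{prox}_{\tau h}(\sigma(\mathcal{W}_{\langle n\rangle}))\bigr)\boldsymbol{V}^{\top}\Bigr)^{\langle n\rangle}.\]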

### Proof

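For a matrix \(\boldsymbol{A}\) with SVD \(\boldsymbol{A}=\boldsymbol{U}\operatorname{diag}(\sigma(\boldsymbol{A})) \boldsymbol{V}^{\top}\), Argyriou et al. (2011, Proposition 3.1) established that

\[\operatorname{prox}_{\tau(h\circ\sigma)}(\boldsymbol{A})=\boldsymbol{U}\operatorname{diag}\bigl(\operatorname{prox}_{\tau h}(\sigma(\boldsymbol{A}))\bigr)\boldsymbol{V}^{\top}.\]

Note that the unfolding \(\mathcal{W}\mapsto\mathcal{W}_{\langle n\rangle}\) is a linear one-to-one (invertible) operator and that \((\mathcal{W}_{\langle n\rangle})^{\langle n\rangle}=\mathcal{W}\), namely, the composition between the folding operator and its adjoint yields the identity^{7} on \(\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\). Additionally by the chain rule for the subdifferential (see e.g. Nesterov 2003, Lemma 3.18) and by definition of \(\varOmega\) one has \(\partial\varOmega(\mathcal{V})=(\partial(h\circ\sigma)(\mathcal {V}_{\langle n\rangle}))^{\langle n\rangle}\). We now have:

\[\mathcal{V}=\operatorname{prox}_{\tau\varOmega}(\mathcal{W})\;\Leftrightarrow\;\mathcal{W}-\mathcal{V}\in\tau\,\partial\varOmega(\mathcal{V})\;\Leftrightarrow\;\mathcal{W}_{\langle n\rangle}-\mathcal{V}_{\langle n\rangle}\in\tau\,\partial(h\circ\sigma)(\mathcal{V}_{\langle n\rangle})\;\Leftrightarrow\;\mathcal{V}_{\langle n\rangle}=\operatorname{prox}_{\tau(h\circ\sigma)}(\mathcal{W}_{\langle n\rangle}),\]

and folding both sides gives the claim. In particular, for \(h=\|\cdot\|_{1}\) the operator \(\operatorname{prox}_{\tau h}\) soft-thresholds each singular value, \(\sigma_{i}\mapsto\max(\sigma_{i}-\tau,0)\), and the resulting matrix map is the *matrix shrinkage operator* as introduced in Cai et al. (2010).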
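For the nuclear norm case (\(h=\|\cdot\|_{1}\)), the proximity operator above amounts to unfolding, soft-thresholding the singular values and folding back. The sketch below illustrates this under the assumption that the proximity operator is defined as \(\operatorname{prox}_{\tau g}(x)=\arg\min_{w}\tau g(w)+\tfrac{1}{2}\|w-x\|^{2}\); the function names are hypothetical.

```python
import numpy as np

def unfold(W, n):
    """n-mode unfolding (mode n as rows)."""
    return np.moveaxis(W, n, 0).reshape(W.shape[n], -1)

def fold(M, n, shape):
    """Inverse of unfold: rebuild a tensor of the given shape from its n-mode unfolding."""
    full = (shape[n],) + tuple(s for i, s in enumerate(shape) if i != n)
    return np.moveaxis(M.reshape(full), 0, n)

def prox_nmode_nuclear(W, n, tau):
    """Proximity operator of tau * (nuclear norm of the n-mode unfolding)."""
    U, s, Vt = np.linalg.svd(unfold(W, n), full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)      # soft-threshold the singular values
    return fold((U * s_shrunk) @ Vt, n, W.shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4, 5))
tau = 0.5
P = prox_nmode_nuclear(W, 1, tau)

# Singular values of the mode-1 unfolding before and after the shrinkage.
s_before = np.linalg.svd(unfold(W, 1), compute_uv=False)
s_after = np.linalg.svd(unfold(P, 1), compute_uv=False)
```

Only one SVD of the selected unfolding is needed per evaluation, which is why the cost of the overall scheme scales with the number of modes that are penalized.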

### 5.4 Multiple module spaces

So far we considered the case where the space of unknowns consisted solely of the module space \(\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\). Next we focus on the case where it is given by 2 modules, see Sect. 2.3. The following definition will prove useful in the next section, where we deal with two distinct types of unknowns that are jointly regularized.

### Definition 2

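(\((n_{1},n_{2})\)-mode spectral penalty)

Consider two module spaces \(\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N_{1}}}\) and \(\mathbb{R}^{J_{1}\times J_{2}\times\cdots\times J_{N_{2}}}\) with \(I_{n_{1}}=J_{n_{2}}=K\), and let \(S_{1}=\prod_{j\neq n_{1}}I_{j}\) and \(S_{2}=\prod_{j\neq n_{2}}J_{j}\). A function \(\varOmega\) of the pair \(\mathcal{W}=(\mathcal{W}_{1},\mathcal{W}_{2})\) is called an \((n_{1},n_{2})\)-mode spectral penalty if it can be written as:

\[\varOmega(\mathcal{W})=h\bigl(\sigma([\mathcal{W}_{1\langle n_{1}\rangle},\mathcal{W}_{2\langle n_{2}\rangle}])\bigr)\]

where, for \(R=\min\{K,S_{1}+S_{2}\}\), \(h:\mathbb{R}^{R}\rightarrow\mathbb{R}\) is some symmetric gauge function and \(\sigma( [\mathcal{W}_{1\langle n_{1}\rangle}, \mathcal{W}_{2\langle n_{2}\rangle} ])\in[0,\infty)^{R}\) is the vector of singular values of the matrix \([\mathcal{W}_{1\langle n_{1}\rangle},\mathcal{W}_{2\langle n_{2}\rangle} ] \) in non-increasing order.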

Note that we required that \(I_{n_{1}}=J_{n_{2}}=K\) since otherwise \(\mathcal {W}_{1\langle n_{1}\rangle}\) and \(\mathcal{W}_{2\langle n_{2}\rangle}\) cannot be concatenated.

### Proposition 3

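(Proximity operator of an \((n_{1},n_{2})\)-mode spectral penalty)

*Let the space of unknowns,* \(\varOmega\), \(S_{1}\) *and* \(S_{2}\) *be defined as in Definition 2 and assume the SVD*:

\[[\mathcal{W}_{1\langle n_{1}\rangle},\mathcal{W}_{2\langle n_{2}\rangle}]=\boldsymbol{U}\operatorname{diag}\bigl(\sigma([\mathcal{W}_{1\langle n_{1}\rangle},\mathcal{W}_{2\langle n_{2}\rangle}])\bigr)\boldsymbol{V}^{\top}.\]

*Then we have*

\[\operatorname{prox}_{\tau\varOmega}(\mathcal{W})=\bigl(\boldsymbol{Z}_{1}^{\langle n_{1}\rangle},\boldsymbol{Z}_{2}^{\langle n_{2}\rangle}\bigr)\]

*where*

\[\boldsymbol{Z}=\boldsymbol{U}\operatorname{diag}\bigl(\operatorname{prox}_{\tau h}(\sigma([\mathcal{W}_{1\langle n_{1}\rangle},\mathcal{W}_{2\langle n_{2}\rangle}]))\bigr)\boldsymbol{V}^{\top}\]

*is partitioned into* \([\boldsymbol{Z}_{1},\boldsymbol{Z}_{2}]\) *where* \(\boldsymbol{Z}_{1}\) *is a* \((K\times S_{1})\)-*matrix and* \(\boldsymbol{Z}_{2}\) *is a* \((K\times S_{2})\)-*matrix*.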

### Proof

We note that Definition 2 and the result above can be easily generalized to more than two module spaces at the price of a more involved notation.
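The two-module case can be sketched in the same way as the single-module one: the two unfoldings are concatenated column-wise, the singular values of the concatenation are shrunk, and the result is split and folded back. The example below is an illustration for the nuclear norm (\(h=\|\cdot\|_{1}\)) under the same assumed definition of the proximity operator, \(\arg\min_{w}\tau g(w)+\tfrac{1}{2}\|w-x\|^{2}\); names and shapes are chosen for illustration only.

```python
import numpy as np

def unfold(W, n):
    """n-mode unfolding (mode n as rows)."""
    return np.moveaxis(W, n, 0).reshape(W.shape[n], -1)

def fold(M, n, shape):
    """Inverse of unfold."""
    full = (shape[n],) + tuple(s for i, s in enumerate(shape) if i != n)
    return np.moveaxis(M.reshape(full), 0, n)

def prox_two_module_nuclear(W1, n1, W2, n2, tau):
    """Prox of tau * nuclear norm of the concatenation [W1_<n1>, W2_<n2>]."""
    A1, A2 = unfold(W1, n1), unfold(W2, n2)
    if A1.shape[0] != A2.shape[0]:
        raise ValueError("the selected modes must have the same dimension K")
    S1 = A1.shape[1]
    U, s, Vt = np.linalg.svd(np.hstack([A1, A2]), full_matrices=False)
    Z = (U * np.maximum(s - tau, 0.0)) @ Vt      # shrink the joint singular values
    Z1, Z2 = Z[:, :S1], Z[:, S1:]                # partition into K x S1 and K x S2 blocks
    return fold(Z1, n1, W1.shape), fold(Z2, n2, W2.shape)

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 4, 3))   # K = 5 along mode 0
W2 = rng.standard_normal((2, 5))      # K = 5 along mode 1
tau = 0.3
P1, P2 = prox_two_module_nuclear(W1, 0, W2, 1, tau)

s_before = np.linalg.svd(np.hstack([unfold(W1, 0), unfold(W2, 1)]), compute_uv=False)
s_after = np.linalg.svd(np.hstack([unfold(P1, 0), unfold(P2, 1)]), compute_uv=False)
```

The shapes in the example mimic the joint regularization of an input tensor and a label matrix that share the mode indexing the items.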

## 6 Transductive learning with higher order data

^{8}It is assumed that one has a set of \(N\) items with higher order representation \(\mathcal {X}^{(n)}\in\mathbb{R}^{D_{1}\times D_{2}\times\cdots\times D_{M}},~n\in \mathbb{N}_{N}\). These items are gathered in the input dataset \(\mathcal {X}\in\mathbb{R}^{D_{1}\times D_{2}\times\cdots\times D_{M}\times N}\) defined entry-wise by

\[x_{d_{1}d_{2}\cdots d_{M}n}:=x^{(n)}_{d_{1}d_{2}\cdots d_{M}}.\]

Associated with the *n*-th item there is a target vector \(y^{(n)}\) of \(T\) labels. In particular we shall focus on the case of binary labels, \(y^{(n)}\in\{-1,1\}^{T}\), so that \(\boldsymbol{Y}=[y^{(1)},y^{(2)},\ldots,y^{(N)}]\) is a \((T\times N)\)-matrix of binary labels. Entries of \(\mathcal{X}\) and \(\boldsymbol{Y}\) can be missing, with the sampling sets (48) and (49) being the index sets of the observed entries in, respectively, \(\mathcal {X}\) and \(\boldsymbol{Y}\). The goal is to infer the missing entries in \(\mathcal{X}\) and \(\boldsymbol{Y}\) simultaneously, see Fig. 3. We refer to this task as *heterogeneous* data completion to emphasize that the nature of \(\mathcal{X}\) and \(\boldsymbol{Y}\) is different. Note that this reduces to standard transductive learning as soon as \(T=1\), \(M=1\) and the input dataset is fully observed (no missing entries in the input dataset). Goldberg et al. (2010) considers the more general situation where \(T\geq1\) and entries of the input dataset can be missing. Here we further generalize this to the case where \(M\geq1\), that is, items admit a higher order representation. We also point out that the special case where \(T=1\) and there is no labeling task defined (in particular, no label is observed) corresponds to tensor completion as considered for the first time in Liu et al. (2009). Next we clarify our modelling assumptions.

### 6.1 Modelling assumptions

The heterogeneous data completion task is ill-posed in the sense that there are infinitely many ways to fully specify the entries of \(\mathcal {X}\) and \(\boldsymbol{Y}\).^{9} Making the inference process feasible requires formulating assumptions for both the input dataset and the matrix of labels.

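We assume that the input dataset decomposes as

\[\mathcal{X}=\tilde{\mathcal{X}}+\mathcal{E}\qquad(50)\]

where \(\tilde{\mathcal{X}}\) is a rank-\((r_{1},r_{2},\ldots,r_{M},r_{M+1})\) tensor and \(\mathcal{E}\) is a remainder. In our setting the assumption considered in Goldberg et al. (2010) is solely that \(\tilde{\mathcal{X}}_{\langle M+1\rangle}\) has low rank.

The label \(y^{(n)}_{t}\), of the *n*-th pattern for the *t*-th task, is linked to \(\tilde{\mathcal{X}}^{(n)}\) via a latent variable model. More specifically, we let

\[\tilde{y}^{(n)}_{t}=\langle\tilde{\mathcal{X}}^{(n)},\mathcal{W}^{(t)}\rangle+b_{t}\qquad(53)\]

where \(\mathcal{W}^{(t)}\) is the parameter tensor of the *t*-th task; we assume that \({y}^{(n)}_{t}\) is produced by assigning at random each binary entry with alphabet {−1,1} following the probability model

\[p(y^{(n)}_{t}\,|\,\tilde{y}^{(n)}_{t})=\frac{1}{1+\exp(-y^{(n)}_{t}\tilde{y}^{(n)}_{t})}.\qquad(54)\]

Note that each task comes with its own bias term \(b_{t}\). Let \(\mathcal{W}\) be that element of \(\mathbb{R}^{D_{1}\times D_{2}\times \cdots\times D_{M}\times T}\) defined as \(w_{d_{1}d_{2}\cdots d_{M}t}:=w^{(t)}_{d_{1}d_{2}\cdots d_{M}}\). Note that \(\mathcal{W}\) gathers the representers of the linear functionals associated to the \(T\) tasks. We now have that

\[\tilde{\boldsymbol{Y}}=\mathcal{W}_{\langle M+1\rangle}\tilde{\mathcal{X}}_{\langle M+1\rangle}^{\top}+b\,\boldsymbol{1}_{N}^{\top}.\qquad(55)\]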

### Remark 7

Notice that we have deliberately refrained from specifying the nature of \(\mathcal{E}\) in (50). Central to our approach is the way input features and target labels are linked together; this is specified by the functional relation (53) and by (54). One could interpret \(\mathcal{E}\) as noise in which case \(\tilde{\mathcal{X}}\) can be regarded as the underlying true representation of the input observation. This is in line with error-in-variables models (Golub and Van Loan 1980; Van Huffel and Vandewalle 1991). Alternatively one might regard \(\mathcal{X}\) as the true representation and assume that the target variable depends only upon the latent tensor \(\tilde{\mathcal{X}}\), having low multilinear rank (*r* _{1},*r* _{2},…,*r* _{ M },*r* _{ M+1}).

### 6.2 Multi-task learning via soft-completion of heterogeneous data

\(\lambda_{m}>0\) is a user-defined parameter and \(\sum_{m\in\mathbb{N}_{M+1}}\lambda_{m}=1\). Problem (60) is convex since its objective is the sum of convex functions. It is a form of penalized empirical risk minimization with a composite penalty. The first \(M\) penalty terms

\((M+1)\)-th penalty

### 6.3 Algorithm for soft-completion

^{10}

\(\boldsymbol{U}\) and \(\boldsymbol{V}\) are respectively left and right singular vectors of the matrix \([\tilde{\mathcal{X}}_{\langle M+1\rangle}, \tilde{ \boldsymbol{Y}}^{\top}]\). Note that (34) reads here:

For completeness we reported in Appendix C the closed form of \(\nabla\bar{f}\). We summarize in Algorithm 2 the steps required to compute a solution. We stress that these steps are obtained by adapting the steps of our template procedure given in Algorithm 1.

### 6.4 Hard-completion without target labels

The problem of missing or unknown values in multi-way arrays is frequently encountered in practice. Missing values due to data acquisition, transmission, or storage problems are for instance encountered in face image modelling by multilinear subspace analysis (Geng et al. 2011). Generally speaking, missing data due to faulty sensors are widespread in biosignal processing; Acar et al. (2011), in particular, considers an EEG (electroencephalogram) application where data are missing due to disconnections of electrodes. Another problem in Acar et al. (2011) arises from modelling time-evolving computer network traffic where cost-sensitivity imposes that only a subset of edges in the network are sampled.

\(f^{x}\) penalizes the misfit of \(\tilde{\mathcal{X}}\) to the partially observed input data tensor; the composite penalty term favors solutions with small multilinear rank. The solution strategy illustrated in the previous section can be easily adjusted to deal with this situation. For certain practical problems, however, it is more desirable to complete the missing entries while requiring exact adherence to the data. Strict adherence to observables in \(\mathbb{R}^{D_{1}\times D_{2}\times\cdots\times D_{M}\times N}\) can be accomplished by means of a constrained formulation of *tensor completion* (Gandy et al. 2011; Tomioka et al. 2011; Liu et al. 2009; Signoretto et al. 2011b), in which the entries indexed by the sampling set (48) are constrained to match the observations.

### 6.5 Algorithm for hard-completion

### Proposition 4

(Projection onto the feasible set)

*Let the sampling operator and the vector of observed entries be defined as above. Then for any tensor in the ambient space, its projection onto the feasible set is obtained by adding to it the adjoint of the sampling operator applied to the residual between the observed entries and the sampled entries of the tensor.*

### Proof

See Appendix D. □
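In coordinates, projecting onto the set of tensors that agree with the data on the sampling set simply overwrites the observed entries and leaves the others untouched. The sketch below illustrates this with a boolean mask standing in for the sampling set; it relies on the standard form of the Euclidean projection onto an affine set that fixes a subset of coordinates, and the names are illustrative rather than taken from the paper.

```python
import numpy as np

def project_onto_observations(X, mask, observed):
    """Euclidean projection of X onto {Z : Z[mask] == observed[mask]}.

    mask     -- boolean array, True where an entry was observed
    observed -- array holding the observed values (ignored where mask is False)
    """
    # Equivalent to X + S*(z - S X) for a sampling operator S that picks the
    # entries flagged by `mask` and its adjoint S* that scatters them back.
    return np.where(mask, observed, X)

rng = np.random.default_rng(0)
data = rng.standard_normal((4, 3, 5))        # ground truth, partially observed
mask = rng.random(data.shape) < 0.4          # about 40 % of the entries observed
X = rng.standard_normal(data.shape)          # current iterate
P = project_onto_observations(X, mask, data)
```

The operation costs one pass over the entries, so in a hard-completion scheme the per-iteration cost is dominated by the proximity operators rather than by the projection.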

### Remark 8

Algorithm 3 is explicitly designed for the higher order case (*M*≥2). However it can be easily simplified to perform hard completion of matrices (*M*=1). In this case it is not difficult to see that one needs to evaluate only one proximity operator; consequently, duplication of the matrix unknown can be avoided.

## 7 Inductive learning with tensor data

In the inductive setting one is given \(N\) input-target training pairs. Each item is represented by an \(M\)-th order tensor and is associated with a vector of \(T\) labels. As before, we focus on the case of binary labels. For ease of notation we assumed that we have the same input data across the tasks; in general, however, this need not be the case.

To understand the rationale behind the regularization approach we are about to propose, consider the following generative mechanism.

### 7.1 Modelling assumptions

We assume that each input item admits the decomposition

\[\mathcal{X}=\tilde{\mathcal{X}}+\mathcal{E},\qquad\tilde{\mathcal{X}}=\mathcal{S}_{\tilde{\mathcal{X}}}\times_{1}\boldsymbol{U}_{1}\times_{2}\boldsymbol{U}_{2}\times_{3}\cdots\times_{M}\boldsymbol{U}_{M}\]

where, for any \(m\in\mathbb{N}_{M}\) and \(R_{m}<D_{m}\), \(\boldsymbol{U}_{m}\in\mathbb {R}^{D_{m} \times R_{m}}\) is a matrix with orthogonal columns. Note that the core tensor \(\mathcal{S}_{\tilde{\mathcal{X}}}\in\mathbb {R}^{R_{1}\times R_{2}\times\cdots\times R_{M}}\) and \(\mathcal{E}\in\mathbb {R}^{D_{1}\times D_{2}\times\cdots\times D_{M}}\) are item-specific; on the other hand for any \(m\in\mathbb{N}_{M}\), the full rank matrix \(\boldsymbol{U}_{m}\) spans a latent space relevant to the tasks at hand and common to all the input data. To be precise we assume the target labels \(y_{t}\) were generated according to the probability model \(p(y_{t}|\tilde{y}_{t})=1/(1+ \exp(-y_{t}\tilde {y}_{t}))\), where \(\tilde{y}_{t}\) depends upon the core tensor \(\mathcal {S}_{\tilde{\mathcal{X}}}\):

\[\tilde{y}_{t}=\langle\mathcal{S}_{\tilde{\mathcal{X}}},\mathcal{S}_{\mathcal{W}^{(t)}}\rangle+b_{t}\qquad(82)\]

where \(\mathcal{S}_{\mathcal{W}^{(t)}}\) and \(b_{t}\) are task-specific unknowns. It is important to remark that, in this scenario, \(\mathcal{S}_{\mathcal {W}^{(t)}}\) comprises \(R_{1}R_{2}\cdots R_{M}\ll D_{1}D_{2}\cdots D_{M}\) parameters. In practice the common latent spaces as well as the core tensor \(\mathcal{S}_{\tilde{\mathcal{X}}}\) are both unknowns so that \(\mathcal{S}_{{\mathcal{W}}^{(t)}}\) cannot be estimated directly. However, if we further assume that condition (83), which is stated in terms of the range of a matrix, holds for at least one \(m\in\mathbb{N}_{M}\), one has the following.

### Proposition 5

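*Assume* (83) *holds for* \(m_{1}\in \mathbb{N}_{M}\). *Then*

\[\tilde{y}_{t}=\langle\mathcal{X},\mathcal{W}^{(t)}\rangle+b_{t}\qquad(84)\]

*where* \({\mathcal{W}}^{(t)}\in\mathbb{R}^{D_{1}\times D_{2}\times\cdots \times D_{M}}\) *is the low multilinear rank tensor*:

\[\mathcal{W}^{(t)}=\mathcal{S}_{\mathcal{W}^{(t)}}\times_{1}\boldsymbol{U}_{1}\times_{2}\boldsymbol{U}_{2}\times_{3}\cdots\times_{M}\boldsymbol{U}_{M}.\]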

### Proof

See Appendix E. □

Note that the right-hand side of (84) is an affine function of the given higher order representation \(\mathcal{X}\) of the item, rather than an affine function of the unobserved core tensor \(\mathcal {S}_{\tilde{\mathcal{X}}}\), as in (82).

### Remark 9

Equation (83) requires that \(\mathcal{E}\) does not “overlap” with the discriminative features; in practice this will not hold. One can only hope that \(\mathcal{E}\) does not “overlap too much” so that \(\tilde{y}_{t}\approx\langle\mathcal{X} , {\mathcal{W}}^{(t)} \rangle +b_{t}\).

\(T\) tasks:

\(R_{M+1}\ll D_{M+1}\) one has \(\boldsymbol{U}_{M+1}\in\mathbb {R}^{D_{M+1}\times R_{M+1}}\), \(\boldsymbol{S}_{M+1}\in\mathbb{R}^{R_{M+1}\times R_{M+1}}\) and finally \(\boldsymbol{V}_{M+1}\in\mathbb{R}^{R_{M+1}\times D_{1}D_{2}\cdots D_{M}}\). Note that \({\mathcal{W}}\) can now be equivalently restated as the low multilinear rank tensor:

\(\boldsymbol{U}_{m}\), \(m\in\mathbb{N}_{M+1}\) that define subspaces that concentrate the discriminative relationship.

We conclude by pointing out that a supervised learning problem where data observations are represented as matrices, namely second order tensors, is a special case of our setting. Single classification tasks in this situation were studied in Tomioka and Aihara (2007). Similarly to the present setting, the latter proposes a spectral regularization as a principled way to perform complexity control over the space of matrix-shaped models. Before discussing a solution strategy we point out that the method can be easily generalized to regression problems by changing the loss function.

### 7.2 Model estimation

### 7.3 Algorithm for inductive learning

## 8 Experiments

### 8.1 Transductive learning

We begin by presenting experiments on transductive learning with multiple tasks, see Sect. 6.

#### 8.1.1 Evaluation criterion and choice of parameters

Reconstruction of the input features is assessed by the *normalized root mean square error* (NRMSE) on the complementary set of the sampling set, where we denote by \(J^{c}\) the cardinality of this complementary set and \(\tilde{\mathcal{X}}\) is as in (50). For both *tensor soft-completion* (tensor-sc) and *matrix soft-completion* (matrix-sc) we solve the optimization problem in (60) via the approach presented in Sect. 6.3. The parameter \(\lambda_{0}\) is chosen in the set \(\{10^{-5},10^{-4},10^{-3},10^{-2},10^{-1},1\}\). For tensor-sc the parameters in the composite spectral penalty are set in terms of \(M+1\), the order of the input data tensor, and of a varying parameter \(\bar{\lambda}\); a different setting of these parameters is used for matrix-sc.

For each value of \(\lambda_{0}\) we compute the regularization path with respect to \(\bar{\lambda}\) beginning with a large value \(\bar{\lambda}^{(0)}\) and solving a sequence of problems with \(\bar{\lambda}^{(t)}=\eta_{\bar{\lambda}} \bar{\lambda}^{(t-1)}\) where, as in Goldberg et al. (2010), we consider as decay parameter \(\eta_{\bar{\lambda}}=0.25\). At each step \(t\) we perform warm-starting, that is, we take as initial point the solution obtained at step \(t-1\). We stop when \(\bar{\lambda}\leq10^{-6}\). For both tensor and matrix soft-completion we choose the values of parameters corresponding to the minimum fraction of mis-predicted labels on a hold-out validation set.
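The regularization path with warm-starting can be sketched as a simple loop over a geometric sequence of penalty values. In the sketch below `solve` is a placeholder for any solver taking a penalty value and an initial point; the toy solver is a scalar soft-thresholding problem used only to make the example runnable, and whether the last value (the first one at or below the threshold) is also solved is an assumption.

```python
def lambda_path(lam0, decay=0.25, lam_min=1e-6):
    """Geometric sequence lam0, decay*lam0, ... ending with the first value <= lam_min."""
    lams = [lam0]
    while lams[-1] > lam_min:
        lams.append(decay * lams[-1])
    return lams

def solve_path(solve, lam0, x0, decay=0.25, lam_min=1e-6):
    """Solve a sequence of problems, warm-starting each one at the previous solution."""
    x, results = x0, []
    for lam in lambda_path(lam0, decay, lam_min):
        x = solve(lam, x)            # warm start: previous solution is the initial point
        results.append((lam, x))
    return results

def toy_solve(lam, x_init):
    """Closed-form minimiser of 0.5*(x - 3)**2 + lam*abs(x); x_init is unused here."""
    return max(3.0 - lam, 0.0)

path = solve_path(toy_solve, lam0=1.0, x0=0.0)
```

With an iterative solver, the warm start typically reduces the number of iterations needed at each step, because consecutive penalty values lead to nearby solutions.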

#### 8.1.2 Implementation of the optimization algorithm

The termination criterion (36) is used with \(\eta=10^{-4}\). With reference to Algorithms 2 and 3 we let \(\epsilon_{0}=10^{-2}\) and set \(\sigma=1.1\). We use a backtracking procedure to find an upper bound for the Lipschitz constant \(L_{\bar{q}}\) (see Remark 5 and references therein). Finally we let \(\tau=0.02/\tilde{L}_{\bar{f}}\) where \(\tilde{L}_{\bar{f}}\) is an upper bound for the Lipschitz constant \(L_{\bar{f}}\), also found via backtracking. As explained above we compute the entire path with respect to the penalty parameter and use warm-starting at each step. At step \(t=0\) the initialization of the algorithm is performed as follows. For both matrix-sc and tensor-sc we set \(b^{(0)}\) to be a vector of zeros. As for \(\mathcal {X}^{(0)}\) and \(\tilde{\boldsymbol{Y}}^{(0)}\), let \(\mathcal {X}^{*}\) and \(\boldsymbol{Y}^{*}\) be obtained by setting to zero the unobserved entries of \(\mathcal{X}\) and \(\boldsymbol{Y}\) respectively. Consider a partitioning \([\boldsymbol{Z}_{M+1,1},\boldsymbol{Z}_{M+1,2}]\) of the rank-1 approximation of the matrix \([\mathcal{X}^{*}_{\langle M+1\rangle},\boldsymbol{Y}^{*\top}]\) consistent with the dimensions of \(\mathcal{X}^{*}_{\langle M+1\rangle}\) and \(\boldsymbol{Y}^{*\top}\). Both matrix-sc and tensor-sc are then initialized based on this partitioning.

#### 8.1.3 Alternative approach

We also report results obtained using a linear kernel within LS-SVM classifiers applied to vectorized input data, see Suykens and Vandewalle (1999). We find models via the LS-SVMlab toolbox (De Brabanter et al. 2010). A classifier is built for each task independently since these models do not handle vector-valued labels simultaneously. Although the presence of missing values in the context of LS-SVM has been studied (Pelckmans et al. 2005), the toolbox does not implement any strategy to handle this situation. For this reason we considered as input data the vectorized version of an imputed input tensor obtained from \(\mathcal{X}^{*}\) and \(\boldsymbol{Z}^{\langle M+1\rangle}_{M+1,1}\), which are as in the previous paragraph. We denote this approach as imp+ls-svm where imp is a shorthand for imputation.

#### 8.1.4 Soft completion: toy problems on multi-labeled data

For each experiment we generated, for \(i\in\{1,2\}\), a matrix \(\boldsymbol{U}_{i}\in\mathbb{R}^{D\times r_{i}}\) with entries i.i.d. from a normal distribution. Finally \(\boldsymbol{U}_{3}\in\mathbb {R}^{N\times r_{3}}\) was generated according to the same distribution. The input data tensor in \(\mathbb{R}^{D\times D\times N}\) was built from these factors. The task-specific unknowns, including the bias terms \(b_{t}\), were again drawn with independent and identically normally distributed entries; successively, we produced \(\tilde{\boldsymbol{Y}}\) and \(\boldsymbol{Y}\) according to (55) and the probability model (54).^{11} Finally, the sampling sets in (48) and (49) were created by picking uniformly at random a fraction \(\omega\) of entries of the data tensor and the target matrix respectively. For matrix and tensor soft-completion we performed model selection by using 70 % of these entries for training; we measure the performance corresponding to each parameter pair on the hold-out validation set constituted by the remaining entries. We finally use the optimal values of parameters and train with the whole set of labeled data; we then measure performance on the hold-out test set. Model selection for the linear LS-SVM classifiers was based on 10-fold cross-validation.

The experiments use \(D=30\), \(T=10\), \(\sigma=0.1\) and different values of the multilinear rank \((r_{1},r_{2},r_{3})\), \(N\) and \(\omega\). Table 2 concerns the fraction of unobserved labels (that is, test data) predicted incorrectly by the different procedures as well as NRMSEs. Note that the latter is not reported for the linear LS-SVM models as these approaches do not have an embedded imputation strategy. We report the mean (and standard deviation) over 10 independent trials where each trial deals with independently generated data and sample sets.

**Table 2** Fractions of unobserved labels (test data) predicted incorrectly and NRMSEs. tensor-sc exploits low rank assumptions along all the modes; in contrast, matrix-sc works only with the third mode unfolding (thereby ignoring the two-way nature of each data observation). tensor-sc generally performs comparably to or better than matrix-sc and imp+ls-svm in terms of misclassification errors; tensor-sc generally outperforms matrix-sc on the reconstruction of the underlying input data tensor \(\tilde{\mathcal {X}}\), see (50)

| ml-rank | *N* | *ω* | tensor-sc label error | tensor-sc NRMSE (×10 | matrix-sc label error | matrix-sc NRMSE (×10 | imp+ls-svm label error |
|---|---|---|---|---|---|---|---|
| (3,3,3) | 30 | 0.2 | | | 0.31(0.08) | 7.63(1.42) | 0.32(0.03) |
| | | 0.3 | | | 0.21(0.07) | 5.98(2.24) | |
| | | 0.4 | | | | 3.87(1.12) | 0.14(0.06) |
| | 90 | 0.2 | | | | 3.74(1.26) | 0.16(0.03) |
| | | 0.3 | | | 0.09(0.02) | 1.68(0.67) | 0.10(0.02) |
| | | 0.4 | | | 0.06(0.01) | 2.55(1.67) | 0.07(0.02) |
| (3,3,9) | 30 | 0.2 | 0.44(0.09) | | 0.42(0.09) | 11.57(1.96) | |
| | | 0.3 | | | 0.33(0.06) | 10.19(2.51) | 0.33(0.03) |
| | | 0.4 | | | 0.27(0.04) | 8.82(2.88) | 0.27(0.04) |
| | 90 | 0.2 | | | 0.18(0.02) | 6.63(1.16) | 0.27(0.02) |
| | | 0.3 | | | 0.14(0.02) | 4.07(1.62) | 0.17(0.03) |
| | | 0.4 | | | 0.11(0.02) | 3.44(2.10) | 0.14(0.02) |

### Remark 10

According to Table 2, tensor-sc generally performs comparably to or better than matrix-sc in terms of misclassification errors; moreover, the experiments show that the former leads to more favorable results for the reconstruction of the underlying input data tensor \(\tilde{\mathcal{X}}\), see (50).

#### 8.1.5 Multi-class categorization via soft-completion: Olivetti faces

^{12} For each person, ten different 56×46 grayscale pictures^{13} are available; the input dataset consists therefore of a (56×46×50)-tensor of which 65 % of entries were artificially removed. For each input image a vector-valued target label was created with one-vs-one encoding. That is, the label vector \(y^{(i)}\in\{-1,1\}^{5}\) of the *i*-th image is determined by its class (person) indicator \(c_{i}\in\{1,2,\ldots,5\}\).
**Table 3** Cumulative confusion matrices for the different procedures. tensor-sc leads to better classification accuracy in comparison to the alternative techniques

(a) tensor-sc

| \(\hat{y}\) | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 19 | 3 | 0 | 0 | 3 |
| 2 | 0 | 29 | 0 | 0 | 0 |
| 3 | 0 | 0 | 20 | 0 | 0 |
| 4 | 0 | 0 | 1 | 23 | 0 |
| 5 | 0 | 0 | 0 | 0 | 27 |

(b) matrix-sc

| \(\hat{y}\) | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 17 | 3 | 1 | 0 | 4 |
| 2 | 2 | 27 | 0 | 0 | 0 |
| 3 | 0 | 0 | 20 | 0 | 0 |
| 4 | 0 | 0 | 3 | 21 | 0 |
| 5 | 1 | 0 | 2 | 0 | 24 |

(c) imp+ls-svm

| \(\hat{y}\) | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 14 | 5 | 0 | 1 | 5 |
| 2 | 0 | 29 | 0 | 0 | 0 |
| 3 | 0 | 0 | 19 | 0 | 1 |
| 4 | 0 | 0 | 2 | 22 | 0 |
| 5 | 0 | 0 | 0 | 0 | 27 |

**Table 4** Mean and standard deviation of misclassification error rates and NRMSE of features imputation for the Olivetti dataset

| tensor-sc label err | tensor-sc NRMSE (×10 | matrix-sc label err | matrix-sc NRMSE (×10 | imp+ls-svm label err |
|---|---|---|---|---|
| | 19.90(9.75) | 0.11(0.14) | 20.90(10.24) | 0.09(0.13) |

### Remark 11

Note that, since the choice of parameters was driven by misclassification errors, the objective of the approach is the correct completion of the labeling. Therefore the estimated input features \(\hat {\mathcal{X}}\) in (60) should be interpreted as carriers of latent discriminative information rather than a reconstruction of the underlying images. With reference to Remark 7 we interpret here \(\mathcal{X}^{(i)}\) as the true representation of the *i*-th image, \(\tilde{\mathcal{X}}^{(i)}\) as latent discriminative features and \(\hat{\mathcal{X}}^{(i)}\) as their estimates.

### Remark 12

Unlike in the toy problems above, for which \(\tilde{\mathcal{X}}\) was available, the NRMSEs in Table 4 are computed upon the actual set of images \(\mathcal{X}\).

Figure 4 illustrates the retrieval of latent features for some unlabeled (test) pictures. Notably the latent features obtained by tensor-sc look like over-smoothed images whereas those obtained by matrix-sc generally look noisier. In particular, the cases for which matrix-sc incorrectly assigns labels often correspond to situations where latent features do not capture person-specific traits (first and second rows). Wrongly assigned labels also correspond to cases where latent features are close to those corresponding to a different person (last two rows).

#### 8.1.6 Hard completion: toy problems

\(\lambda_{m}\), \(m\in\{1,2,3\}\) to zero. In either case we compute solutions via Algorithm 3, taking into account the simplifications that occur in the second order case. For each experiment we generated a core tensor \(\mathcal{S}\) in \(\mathbb {R}^{r_{1}\times r_{2}\times r_{3}}\) with entries i.i.d. according to the uniform distribution on the interval [−0.5,0.5], denoted as \(U([-0.5,0.5])\); for \(i\in\{1,2,3\}\) a matrix \(\boldsymbol{U}_{i}\in \mathbb{R}^{D\times r_{i}}\) was generated also with entries i.i.d. from \(U([-0.5,0.5])\). The input data tensor in \(\mathbb{R}^{D\times D\times D}\) was taken to be \(\mathcal{X}=\tilde{\mathcal{X}}+ \sigma\,\mathcal{E}\) where \(\tilde{\mathcal{X}}\) is obtained from the core tensor and the factor matrices and \(\mathcal{E}\) is a \((D\times D\times D)\)-tensor with independent normally distributed entries. Finally the sampling sets were created by picking uniformly at random a fraction \(\omega\) of entries of the data tensor. We took \(D=50\) and considered different values of \(\sigma\), the multilinear rank \((r_{1},r_{2},r_{3})\) and \(\omega\). In Table 5 we report the mean (and standard deviation) of NRMSEs and execution times over 10 independent trials where each trial deals with independently generated data and sample sets. Note that keeping \(\sigma\) fixed across different values of multilinear rank gives a different noise level. We therefore report in the table the approximate *signal-to-noise ratio* (SNR) obtained on the different trials. The latter is defined as \(\mathit{SNR}:=\operatorname{var}(\tilde{x})/(\sigma ^{2})\) where \(\tilde{x}\) denotes the generic entry of \(\tilde{\mathcal {X}}\) and \(\operatorname{var}\) denotes the empirical variance.
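The data generation for these toy problems can be sketched as follows. The sketch assumes that \(\tilde{\mathcal{X}}\) is the Tucker-type product of the core tensor with the three factor matrices and that the noise entries are standard normal; both are assumptions made for illustration, as are the function names and the small value of \(D\) used in the example.

```python
import numpy as np

def mode_product(W, A, n):
    """n-mode product: multiply every mode-n fibre of W by the matrix A."""
    return np.moveaxis(np.tensordot(A, W, axes=(1, n)), 0, n)

def make_toy_tensor(D, ranks, sigma, rng):
    """Noisy low multilinear rank tensor and its empirical signal-to-noise ratio."""
    X_tilde = rng.uniform(-0.5, 0.5, size=ranks)           # core tensor S
    for n, r in enumerate(ranks):
        U = rng.uniform(-0.5, 0.5, size=(D, r))            # factor matrix U_n
        X_tilde = mode_product(X_tilde, U, n)
    E = rng.standard_normal((D,) * len(ranks))             # noise (standard normal assumed)
    X = X_tilde + sigma * E
    snr = X_tilde.var() / sigma**2                         # SNR := var(x_tilde) / sigma^2
    return X, X_tilde, snr

def sample_mask(shape, omega, rng):
    """Boolean mask selecting a fraction omega of the entries uniformly at random."""
    size = int(np.prod(shape))
    mask = np.zeros(size, dtype=bool)
    mask[rng.choice(size, size=int(round(omega * size)), replace=False)] = True
    return mask.reshape(shape)

rng = np.random.default_rng(0)
X, X_tilde, snr = make_toy_tensor(D=10, ranks=(3, 3, 3), sigma=0.02, rng=rng)
mask = sample_mask(X.shape, omega=0.3, rng=rng)
```

Because the variance of \(\tilde{\mathcal{X}}\) depends on the multilinear rank, the same \(\sigma\) yields different SNR values for different ranks, which is why the SNR is reported alongside \(\sigma\).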

**Table 5** NRMSEs and execution times for tensor-hc and matrix-hc; the latter completes the given tensor based upon a low rank assumption on the third mode only. Note that the computational complexity of tensor-hc is roughly 3 times that of matrix-hc, in line with Remark 14

| *σ* | Multilinear rank | SNR | *ω* | tensor-hc NRMSE (×10 | tensor-hc time (s) | matrix-hc NRMSE (×10 | matrix-hc time (s) |
|---|---|---|---|---|---|---|---|
| 0.02 | (3,3,3) | ≈3 | 0.3 | | 42.04(2.15) | 5.51(0.65) | 13.93(1.01) |
| | | | 0.6 | | 23.24(1.14) | 3.86(0.55) | 8.05(1.09) |
| | | | 0.9 | | 16.13(1.26) | 3.71(0.73) | 5.24(0.58) |
| | (3,3,9) | ≈9.5 | 0.3 | | 54.65(2.03) | 6.15(0.63) | 21.09(0.89) |
| | | | 0.6 | | 29.06(1.10) | 4.04(0.44) | 11.98(1.04) |
| | | | 0.9 | | 18.27(0.96) | 3.44(0.36) | 5.99(0.58) |
| | (9,9,3) | ≈30 | 0.3 | 3.32(0.45) | 104.21(7.00) | | 28.10(1.59) |
| | | | 0.6 | 1.96(0.31) | 41.01(2.25) | | 13.12(0.67) |
| | | | 0.9 | 1.77(0.22) | 21.59(1.05) | | 6.64(0.44) |
| 0.04 | (3,3,3) | ≈0.8 | 0.3 | | 50.34(3.92) | 8.55(0.99) | 17.53(1.64) |
| | | | 0.6 | | 35.03(1.51) | 6.68(1.00) | 11.65(0.94) |
| | | | 0.9 | | 25.67(1.20) | 6.68(1.13) | 8.17(0.58) |
| | (9,9,3) | ≈2.36 | 0.3 | | 77.31(3.54) | 8.53(0.86) | 26.74(1.88) |
| | | | 0.6 | | 46.08(1.58) | 6.54(0.70) | 15.60(1.02) |
| | | | 0.9 | | 29.97(1.10) | 5.90(0.59) | 9.71(0.55) |
| | (9,9,3) | ≈7.5 | 0.3 | 5.95(0.78) | 118.89(6.74) | | 35.62(2.09) |
| | | | 0.6 | 3.72(0.59) | 59.66(2.54) | | 19.39(0.88) |
| | | | 0.9 | 3.40(0.43) | 36.20(1.80) | | 10.61(0.72) |

### Remark 13

The experimental evidence suggests that tensor completion performs better than matrix completion (performed along the third mode) when either no *n*-rank dominates the others or when the 3-rank is higher. This observation holds across different noise levels and fractions of entries used in the reconstruction.

### Remark 14

As Table 5 shows, for the third order case, the computational complexity of our implementation of tensor completion is roughly three times that of matrix completion. This is expected since the computational load is determined by Steps 5 and 6 which, in turn, involve a number of iterations equal to the order of the tensor (\(M+1\)). This is no longer needed for \(M=1\) (second order case), see Remark 8.

#### 8.1.7 Inpainting of colored images via hard completion

Here we apply hard-completion to inpainting of 8-bit RGB colored images, each of which is represented as a third order tensor. The first two modes span the pixel space; the third mode contains information from the three channels. For each image we remove entries in all the channels simultaneously (first three rows of Fig. 5), or consider the case where entries are missing at random (last two rows of Fig. 5). We then solve the problem in Eq. (75) with the sampling set indexing non-missing pixels. A solution is found via Algorithm 3. As termination criterion we use the relative increment (36) where we set \(\eta=10^{-7}\). With reference to Algorithm 2 we let \(\tau=10^{4}\) and \(\lambda_{m}=1\) for \(m\in\{1,2,3\}\). Figure 5 reports the original pictures, the input data tensor and the outcome of our algorithm.

### 8.2 Inductive learning

^{14} with models obtained solving (90). With reference to the latter we set the parameters in the composite spectral penalty as follows. In one case, referred to as log mlrank,^{15} the parameters are set in terms of \(M\), the order of the input data, and of a varying parameter \(\bar{\lambda}\). An alternative setting of these parameters, referred to as log rank in what follows, is also considered.

#### 8.2.1 Multi-labeled data toy problems

We generated \(T\) models represented by a \((D\times D\times T)\)-tensor \({\mathcal{W}}\) with multilinear rank \((r,r,r)\) and a vector of bias terms \(b\in\mathbb{R}^{T}\) with normal entries; \({\mathcal{W}}\) was obtained by generating a core tensor \(\mathcal{S}_{{\mathcal{W}}}\) in \(\mathbb{R}^{r\times r\times r}\) with entries i.i.d. from a normal distribution; for \(i\in\{1,2\}\) a matrix \(\boldsymbol{U}_{i}\in\mathbb {R}^{D\times r}\) and \(\boldsymbol{U}_{3}\in\mathbb{R}^{T\times r}\) were generated also with entries i.i.d. from a normal distribution. Finally we set

\[\mathcal{W}=\mathcal{S}_{\mathcal{W}}\times_{1}\boldsymbol{U}_{1}\times_{2}\boldsymbol{U}_{2}\times_{3}\boldsymbol{U}_{3}.\]

We then generated \(N\) input-output pairs as follows. For \(n\in\mathbb{N}_{N}\) we let \(\mathcal{X}^{(n)}\) be a \((D\times D)\)-matrix with normal entries; for any \(t\in\mathbb{N}_{T}\) we let the corresponding label \(y^{(n)}_{t}\) be a Bernoulli random variable with alphabet {1,−1} and success probability \(1/(1+\exp (-\tilde{y}^{(n)}_{t}))\); the latent variable \(\tilde{y}^{(n)}_{t}\) was taken to be

\[\tilde{y}^{(n)}_{t}=\langle\mathcal{X}^{(n)},\mathcal{W}^{(t)}\rangle+b_{t}\]

where \(\mathcal{W}^{(t)}\) denotes the slice of \(\mathcal{W}\) associated to the *t*-th task. We took \(r=2\) and considered the procedure above for different values of \(T\) and \(D\). For a fixed value of \(T\) we use \(N=100T\) pairs for training. Note that \(N\) refers to the whole set of tasks; in turn, the \(N\) pairs were distributed uniformly at random across different tasks. As such, there are on average 100 input-output pairs per task; in this way, the amount of training information per task is kept constant as \(T\) varies. For each setting we perform 10 trials. For log mlrank and log rank we chose \(\bar{\lambda}\) based upon a validation set of 30 % of the pairs selected at random within the \(N\) observations. Results in terms of misclassification rate on a test set are reported in Table 6.

Table 6: Fractions of unobserved labels (test data) predicted incorrectly

| *D* | *T* | *N* | log mlrank | log rank | lin ls-svm | naive Bayes |
|---|---|---|---|---|---|---|
| 5 | 2 | 200 | | 0.11(0.03) | 0.13(0.02) | 0.23(0.02) |
| 5 | 4 | 400 | | 0.08(0.01) | 0.13(0.01) | 0.22(0.01) |
| 5 | 8 | 800 | | 0.07(0.01) | 0.14(0.01) | 0.22(0.01) |
| 15 | 2 | 200 | | 0.31(0.02) | 0.34(0.01) | 0.39(0.01) |
| 15 | 4 | 400 | | 0.28(0.03) | 0.35(0.01) | 0.39(0.01) |
| 15 | 8 | 800 | | 0.22(0.01) | 0.35(0.01) | 0.39(0.01) |
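The data-generating procedure of this subsection can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' code. Two points are assumptions: that \({\mathcal{W}}\) is assembled as the Tucker product of the core tensor with the three factor matrices, and that the latent variable is the inner product between the input matrix and the corresponding task slice of \({\mathcal{W}}\) plus the bias term. The variable names and the seed are likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T, r = 5, 4, 2
N = 100 * T  # total number of training pairs over all tasks

# Core tensor and factor matrices with i.i.d. standard normal entries.
S = rng.standard_normal((r, r, r))
U1 = rng.standard_normal((D, r))
U2 = rng.standard_normal((D, r))
U3 = rng.standard_normal((T, r))

# Assumed Tucker form: W = S x_1 U1 x_2 U2 x_3 U3, a (D x D x T)-tensor.
W = np.einsum("abc,ia,jb,tc->ijt", S, U1, U2, U3)
b = rng.standard_normal(T)

# Each pair consists of a D x D input matrix assigned to one task chosen uniformly at random.
X = rng.standard_normal((N, D, D))
task = rng.integers(0, T, size=N)

# Assumed latent variable: <X^(n), W[:, :, t]> + b_t for the task t of pair n.
latent = np.einsum("nij,ijn->n", X, W[:, :, task]) + b[task]

# Bernoulli labels in {1, -1} with logistic success probability.
p = 1.0 / (1.0 + np.exp(-latent))
y = np.where(rng.random(N) < p, 1, -1)
```

By construction each mode unfolding of `W` has rank at most `r`, which is the low multilinear rank structure that the log mlrank penalty is meant to exploit.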

### Remark 15

The experiments show that leveraging relations between tasks (log mlrank and log rank) significantly improves results; the performance of ls-svm models, which are trained independently, remains essentially unchanged as *T* increases. This is to be expected since the amount of training data per task is kept approximately the same across different values of *T*.

### Remark 16

A comparison between log mlrank and log rank reveals that exploiting structural assumptions over the input features is beneficial; this is especially the case when the number of tasks is small or the feature dimensionality is higher.

#### 8.2.2 Multiclass classification of Binary Alphadigits

^{16} made up of digits from “0” through “9” followed by capital letters from “A” through “Z” (English alphabet). Each digit is represented by 39 examples, each of which consists of a binary 20×16 matrix. In log mlrank the matrix shape of each digit is retained; log rank and lin ls-svm treat each input pattern as a vector of dimension 320. We consider problems with different numbers of classes. As before we used one-vs-one encoding (see Sect. 8.1.5); correspondingly, the number of tasks *T* is equal to the number of classes. In each case we train models upon *N* training examples uniformly distributed across the considered classes; we chose *N* so that approximately 10 examples per class are used for training (and model selection) whereas the remaining examples are retained for testing. For each setting we average results over 20 trials, each of which is obtained from a random splitting of training and test data. Due to the scarcity of training patterns an error occurs when running NaiveBayes; therefore we could not obtain results for this approach. Tables 7 and 8 report results for different values of *T*. For *T*=2 we considered a subset of arbitrary binary problems; for *T*>2 we considered classes of digits in their given order. In general, for *T*≤4 log mlrank seems to perform slightly better than log rank (Tables 7 and 8). As for the multi-label example above, there is strong evidence that enforcing task relationships via the regularization mechanism in log mlrank and log rank improves over the case where tasks are considered independently (Table 8).

Table 7: Fractions of misclassified test digits, multiple problems, *T*=2

| Problem | log mlrank | log rank | lin ls-svm |
|---|---|---|---|
| (a) “2” vs “7” | | 0.07(0.04) | 0.06(0.04) |
| (b) “I” vs “J” | | 0.13(0.05) | 0.14(0.04) |
| (c) “4” vs “L” | | 0.11(0.04) | |
| (d) “R” vs “S” | | 0.08(0.05) | |
| (e) “8” vs “9” | | 0.07(0.05) | 0.07(0.05) |
| (f) “M” vs “N” | 0.15(0.05) | 0.16(0.06) | |

Table 8: Fractions of misclassified test digits, *T*>2

| *T* | *N* | log mlrank | log rank | lin ls-svm |
|---|---|---|---|---|
| 4 | 40 | | 0.08(0.03) | 0.08(0.03) |
| 8 | 80 | | | 0.22(0.04) |
| 12 | 120 | | | 0.38(0.04) |
| 16 | 160 | | | 0.51(0.02) |
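The difference between the input representations, and the split into roughly 10 training examples per class, can be illustrated with the sketch below. It uses random binary matrices as a stand-in for the Binary Alphadigits data and, as a simplifying assumption, draws exactly 10 training examples per class; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, per_class, train_per_class = 4, 39, 10

# Stand-in for the real data: random binary 20 x 16 matrices, 39 per class.
images = rng.integers(0, 2, size=(n_classes * per_class, 20, 16))
labels = np.repeat(np.arange(n_classes), per_class)

# Draw the training examples class by class, without replacement.
train_idx = []
for c in range(n_classes):
    idx = np.flatnonzero(labels == c)
    train_idx.extend(rng.choice(idx, size=train_per_class, replace=False))
train_idx = np.sort(np.array(train_idx))
test_idx = np.setdiff1d(np.arange(labels.size), train_idx)

# Matrix-shaped inputs (shape retained) versus vectorized inputs of dimension 320.
X_train_matrix = images[train_idx]
X_train_vector = X_train_matrix.reshape(len(train_idx), -1)
```

With four classes this gives *N*=40 training patterns, of shape 20×16 in the first representation and of dimension 320 in the second; the remaining 116 patterns are kept for testing.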

## 9 Concluding remarks

In this paper we have established a mathematical framework for learning with higher order tensors. The transductive approach we considered is especially useful in the presence of missing input features. The inductive formulation, on the other hand, allows one to predict labels associated with input items unavailable during training. Both approaches work by simultaneously identifying subspaces of highly predictive features without the need for a preliminary feature extraction step. This is accomplished by leveraging relationships both across tasks and within the higher order representation of the (possibly very high dimensional) input data.

A drawback of our methods is their restriction to linear models. An interesting line of future research concerns the extension to a broader class of models. For certain problems of interest one could perhaps extend results for matrix representer theorems (Argyriou et al. 2009), used within multi-task learning (Argyriou et al. 2008). In the setting of Argyriou et al. (2008), different but related learning problems are associated with task vectors belonging to a feature space associated with a user-defined kernel function. As a special case, i.e. when the feature mapping is the identity, the feature space corresponds to the input space where data are originally represented. In this case one obtains models that are linear in the data, as in this paper. In general, one can show that these (possibly infinite dimensional) task vectors lie within the span of the mapped data associated with all the tasks (Argyriou et al. 2009). In the context of this paper, we assumed low multilinear ranks, thereby leveraging the algebraic structure of data *in the input space*. This is crucial, especially when the higher order data contain missing observations and part of the learning problem consists of completing the data. In contrast, Argyriou et al. (2008) exploit the geometry of the feature space rather than that of the original input space. Therefore, although nonlinear extensions are certainly possible (and desirable), they will need working assumptions attached to the geometry of the feature space rather than that of the input space. For some cases, a viable alternative is to conceive a mapping that preserves properties of the higher order data in the input space. This is the spirit of the Grassmannian kernels proposed in Signoretto et al. (2011a).

It is also important to note that in our experiments we either considered a single spectral penalty (as in (97) and (99)) or a composite regularizer where all the penalties were equally weighted ((96) and (98)). Although uniform weights are shown to work in practice, this all-or-nothing setting is clearly restrictive: ideally one would perform model selection to search for the optimal combination of parameters. Unfortunately this comes at the price of substantially increasing the computational burden.
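The distinction between a single spectral penalty and a uniformly weighted composite regularizer can be made concrete with a small sketch. It assumes, for illustration only, that each spectral penalty is the nuclear norm of a mode unfolding. The helper names `unfold` and `composite_nuclear_penalty` and the weight values are hypothetical and are not taken from (96)–(99).

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-`mode` matricization: rows are indexed by the chosen mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def composite_nuclear_penalty(tensor, weights):
    """Weighted sum of the nuclear norms of the mode unfoldings."""
    return sum(
        w * np.linalg.norm(unfold(tensor, m), ord="nuc")
        for m, w in enumerate(weights)
        if w > 0  # modes with zero weight are not penalized
    )


rng = np.random.default_rng(0)
W = rng.standard_normal((5, 5, 4))

# Composite regularizer with uniform weights over the three modes.
uniform = composite_nuclear_penalty(W, [1 / 3, 1 / 3, 1 / 3])
# Single spectral penalty acting on the third-mode unfolding only.
single = composite_nuclear_penalty(W, [0.0, 0.0, 1.0])
```

Searching over non-uniform weights would require evaluating a grid over one parameter per mode, which is the source of the additional computational burden mentioned above. Since the nuclear norm depends only on the singular values, the column ordering used in `unfold` does not affect the result.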

## Footnotes

1. In the multilinear algebra literature such a space is often denoted by \(\mathbb{R}^{I_{1}}\otimes\mathbb{R}^{I_{2}}\otimes\cdots\otimes\mathbb{R}^{I_{N}}\) to emphasize its nature as linear span of rank-1 objects. Here we use \(\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\) for compactness.

2. Note that the right hand side of (3) is in fact invariant with respect to permutations of the columns of \(\mathcal{A}_{\langle n\rangle}\).

3. The power set of a set is the set of all its subsets, including the empty set, denoted as ∅, and the set itself.

4. Note that in (36) and before in (35) we implicitly assumed that the denominator in the left hand-side is never exactly zero. In all the problems we will consider later, in particular, one can verify that this is always the case unless […] is a solution, which never occurs in practical applications. For those specifications of \(\bar{h}\) where this condition might arise one can replace \(\vert\bar{h}(y^{(k)})\vert\) by \(\vert\bar{h}(y^{(k)})\vert+ 1\).

5. The reason for restricting to the class of symmetric gauge functions will become apparent in Proposition 2, in which their properties are used for the derivation of proximity operators.

6. Assume the unfolding is performed according to the ordering rule in De Lathauwer et al. (2000). Then one has \(\boldsymbol{V}^{(n)}=(\boldsymbol{U}^{(n+1)}\otimes\cdots\otimes\boldsymbol{U}^{(N-1)}\otimes\boldsymbol{U}^{(N)}\otimes\boldsymbol{U}^{(1)}\otimes\cdots\otimes\boldsymbol{U}^{(n-1)})^{\top}\), where ⊗ denotes here the matrix Kronecker product.

7. Equivalently, \(\cdot_{\langle n\rangle}\) is unitary.

8. The code of some routines can be found at https://securehomes.esat.kuleuven.be/~msignore/.

9. This is the case since entries of \(\mathcal{X}\) are in \(\mathbb{R}\).

10. We adopt a short-hand to indicate the iterated Cartesian product.

11. Note that each input observation possesses multiple (binary) labels. This situation differs from the multi-class paradigm that we consider later on. Here any possible binary vector is admissible; by contrast, in a multi-class setting only those vectors belonging to the codebook are admissible.

12. Publicly available at http://cs.nyu.edu/~roweis/data.html.

13. Sizes refer to the images obtained after removing borders.

14. Specifically we considered the routine NaiveBayes contained in the Statistics Toolbox of Matlab.

15. We use log as a short-hand for *logistic* since (90) is based on the logistic loss; we use mlrank as a short-hand for *multilinear rank* since in this case we penalize high multilinear rank of the tensor corresponding to the ensemble of models.

16. Publicly available at http://cs.nyu.edu/~roweis/data.html.

17. Note that *m*_{2} is a generic index; contrary to *m*_{1} it is unrelated to (83).

## Notes

### Acknowledgements

Research supported by: ERC AdG A-DATADRIVE-B, Research Council KUL: GOA/10/09 MaNet, CoE EF/05/006 Optimization in Engineering (OPTEC) and PFV/10/002 (OPTEC), CIF1 and STRT1/08/023, IOF-SCORES4CHEM, several PhD/postdoc and fellow grants; Flemish Government: FWO: PhD/postdoc grants, projects: G0226.06, G.0302.07, G.0320.08, G.0427.10N, G.0558.08, G.0557.08, G.0588.09; IWT: PhD Grants, Eureka-Flite+, SBO LeCoPro, SBO Climaqs, SBO POM, O&O-Dsquare. Belgian Federal Science Policy Office: IUAP P6/04 (DYSCO, Dynamical systems, control and optimization, 2007–2011); IBBT; EU: ERNSI; FP7-HD-MPC (INFSO-ICT-223854), COST intelliCIS, FP7-EMBOCON (ICT-248940), FP7-SADCO (MC ITN-264735).

## References

- Abernethy, J., Bach, F., Evgeniou, T., & Vert, J. (2009). A new approach to collaborative filtering: operator estimation with spectral regularization. *Journal of Machine Learning Research*, *10*, 803–826.
- Acar, E., Dunlavy, D., Kolda, T., & Mørup, M. (2011). Scalable tensor factorizations for incomplete data. *Chemometrics and Intelligent Laboratory Systems*, *106*(1), 41–56.
- Argyriou, A., Micchelli, C., Pontil, M., & Ying, Y. (2007a). A spectral regularization framework for multi-task structure learning. In *Advances in neural information processing systems*.
- Argyriou, A., Micchelli, C. A., Pontil, M., & Ying, Y. (2007b). A spectral regularization framework for multi-task structure learning. In J. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), *Advances in neural information processing systems* (Vol. 20, pp. 25–32). Cambridge: MIT Press.
- Argyriou, A., Evgeniou, T., & Pontil, M. (2007c). Multi-task feature learning. In B. Schölkopf, J. Platt, & T. Hoffman (Eds.), *Advances in neural information processing systems* (Vol. 19, pp. 41–48). Cambridge: MIT Press.
- Argyriou, A., Evgeniou, T., & Pontil, M. (2008). Convex multi-task feature learning. *Machine Learning*, *73*(3), 243–272.
- Argyriou, A., Micchelli, C., & Pontil, M. (2009). When is there a representer theorem? Vector versus matrix regularizers. *Journal of Machine Learning Research*, *10*, 2507–2529.
- Argyriou, A., Micchelli, C., & Pontil, M. (2010). On spectral learning. *Journal of Machine Learning Research*, *11*, 935–953.
- Argyriou, A., Micchelli, C., Pontil, M., Shen, L., & Xu, Y. (2011). Efficient first order methods for linear composite regularizers. Arxiv preprint arXiv:1104.1436.
- Aronszajn, N. (1950). Theory of reproducing kernels. *Transactions of the American Mathematical Society*, *68*, 337–404.
- Bauschke, H., & Combettes, P. (2011). *Convex analysis and monotone operator theory in Hilbert spaces*. Berlin: Springer.
- Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. *SIAM Journal on Imaging Sciences*, *2*(1), 183–202.
- Becker, S., Candès, E. J., & Grant, M. (2010). *Templates for convex cone problems with applications to sparse signal recovery*. Tech. rep, Stanford University.
- Berlinet, A., & Thomas-Agnan, C. (2004). *Reproducing kernel Hilbert spaces in probability and statistics*. Amsterdam: Kluwer Academic.
- Bertsekas, D. (1976). On the Goldstein-Levitin-Polyak gradient projection method. *IEEE Transactions on Automatic Control*, *21*(2), 174–184.
- Bertsekas, D. P. (1995). *Nonlinear programming*. Belmont: Athena Scientific.
- Bertsekas, D. P., & Tsitsiklis, J. N. (1989). *Parallel and distributed computation*. Englewood Cliffs: Prentice-Hall.
- Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. *Foundations and Trends in Machine Learning*, *3*(1), 1–122.
- Brézis, H. (1973). *Opérateurs maximaux monotones*. Amsterdam: Elsevier.
- Cai, J., Candès, E., & Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. *SIAM Journal on Optimization*, *20*(4), 1956–1982.
- Candès, E., Li, X., Ma, Y., & Wright, J. (2011). Robust principal component analysis? *Journal of the ACM*, *58*(3), 11, p. 37.
- Candès, E., & Plan, Y. (2010). Matrix completion with noise. *Proceedings of the IEEE*, *98*(6), 925–936.
- Candès, E., & Recht, B. (2009). Exact matrix completion via convex optimization. *Foundations of Computational Mathematics*, *9*(6), 717–772.
- Chen, S., Donoho, D., & Saunders, M. (2001). Atomic decomposition by basis pursuit. *SIAM Review*, *43*, 129–159.
- Combettes, P. (2009). Iterative construction of the resolvent of a sum of maximal monotone operators. *Journal of Convex Analysis*, *16*(4), 727–748.
- Combettes, P., & Pesquet, J. (2008). A proximal decomposition method for solving convex variational inverse problems. *Inverse Problems*, *24*, 065014.
- Combettes, P., & Pesquet, J. (2009). *Proximal splitting methods in signal processing*. Arxiv preprint arXiv:0912.3522.
- Coppi, R., & Bolasco, S. (1989). *Multiway data analysis*. Amsterdam: North-Holland.
- De Brabanter, K., Karsmakers, P., Ojeda, F., Alzate, C., De Brabanter, J., Pelckmans, K., De Moor, B., Vandewalle, J., & Suykens, J. A. K. (2010). *LS-SVMlab toolbox user’s guide version 1.8*. Internal Report 10-146, ESAT-SISTA, K.U.Leuven (Leuven, Belgium).
- De Lathauwer, L., De Moor, B., & Vandewalle, J. (2000). A multilinear singular value decomposition. *SIAM Journal on Matrix Analysis and Applications*, *21*(4), 1253–1278.
- Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. *Machine Learning*, *29*(2), 103–130.
- Donoho, D. L. (2006). Compressed sensing. *IEEE Transactions on Information Theory*, *52*(4), 1289–1306.
- Douglas, J., & Rachford, H. H. (1956). On the numerical solution of heat conduction problems in two and three space variables. *Transactions of the American Mathematical Society*, *82*(2), 421–439.
- Eckstein, J., & Bertsekas, D. (1992). On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. *Mathematical Programming*, *55*(1), 293–318.
- Ekeland, I., & Temam, R. (1976). *Convex analysis and variational problems*. Amsterdam: North-Holland.
- Fazel, M. (2002). *Matrix rank minimization with applications*. Ph.D. thesis, Elec. Eng. Dept., Stanford University.
- Gandy, S., Recht, B., & Yamada, I. (2011). Tensor completion and low-n-rank tensor recovery via convex optimization. *Inverse Problems*, *27*(2), 025010.
- Geng, X., Smith-Miles, K., Zhou, Z., & Wang, L. (2011). Face image modeling by multilinear subspace analysis with missing values. *IEEE Transactions on Systems, Man and Cybernetics. Part B. Cybernetics*, *41*(3), 881–892.
- Goldberg, A., Xiaojin, Z., Recht, B., Xu, J., & Nowak, R. (2010). Transduction with matrix completion: three birds with one stone. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, & A. Culotta (Eds.), *Advances in neural information processing systems* (Vol. 23, pp. 757–765).
- Golub, G., & Van Loan, C. (1980). An analysis of the total least squares problem. *SIAM Journal on Numerical Analysis*, *17*(6), 883–893.
- Golub, G. H., & Van Loan, C. F. (1996). *Matrix Computations* (3rd ed.). Baltimore: Johns Hopkins University Press.
- Hastad, J. (1990). Tensor rank is NP-complete. *Journal of Algorithms*, *11*(4), 644–654.
- Hillar, C., & Lim, L. (2010). *Most tensor problems are NP hard*. Arxiv preprint arXiv:0911.1393.
- Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. *Technometrics*, *12*(1), 55–67.
- Horn, R., & Johnson, C. (1994). *Topics in matrix analysis*. Cambridge: Cambridge University Press.
- Jacob, L., Obozinski, G., & Vert, J. (2009). Group lasso with overlap and graph lasso. In *Proceedings of the 26th annual international conference on machine learning*. New York: ACM.
- Kolda, T., & Bader, B. (2009). Tensor decompositions and applications. *SIAM Review*, *51*(3), 455–500.
- Koltchinskii, V., Tsybakov, A., & Lounici, K. (2010). *Nuclear norm penalization and optimal rates for noisy low rank matrix completion*. Arxiv preprint arXiv:1011.6256.
- Kroonenberg, P. (2008). *Applied multiway data analysis*. New York: Wiley-Interscience.
- Lions, P., & Mercier, B. (1979). Splitting algorithms for the sum of two nonlinear operators. *SIAM Journal on Numerical Analysis*, *16*(6), 964–979.
- Liu, J., Musialski, P., Wonka, P., & Ye, J. (2009). Tensor completion for estimating missing values in visual data. In *IEEE international conference on computer vision (ICCV)*, Kyoto, Japan (pp. 2114–2121).
- Ma, S., Goldfarb, D., & Chen, L. (2011). Fixed point and Bregman iterative methods for matrix rank minimization. *Mathematical Programming*, *128*(1), 321–353.
- Minty, G. (1962). Monotone (nonlinear) operators in Hilbert space. *Duke Mathematical Journal*, *29*(3), 341–346.
- Moreau, J. (1962). Fonctions convexes duales et points proximaux dans un espace hilbertien. *Comptes Rendus Mathematique. Academie Des Sciences Paris, Sér. A Math*, *255*, 2897–2899.
- Nesterov, Y. (2003). *Introductory lectures on convex optimization: a basic course*. Norwell: Kluwer Academic.
- Nesterov, Y. (2007). *Gradient methods for minimizing composite objective function*. Center for Operations Research and Econometrics (CORE), Université catholique de Louvain, Tech. Rep.
- von Neumann, J. (1937). Some matrix inequalities and metrization of matric-space. *Tomsk University Review*, *1*, 286–300.
- Pelckmans, K., De Brabanter, J., Suykens, J. A. K., & De Moor, B. (2005). Handling missing values in support vector machine classifiers. *Neural Networks*, *18*, 684–692.
- Phelps, R. (1993). *Convex functions, monotone operators, and differentiability*. Berlin: Springer.
- Recht, B., Fazel, M., & Parrilo, P. (2007). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. *SIAM Review*, *52*, 471–501.
- Rockafellar, R. (1970a). *Convex analysis*. Princeton: Princeton University Press.
- Rockafellar, R. (1970b). On the maximal monotonicity of subdifferential mappings. *Pacific Journal of Mathematics*, *33*(1), 209–216.
- Rockafellar, R. (1976). Monotone operators and the proximal point algorithm. *SIAM Journal on Control and Optimization*, *14*, 877.
- Signoretto, M., De Lathauwer, L., & Suykens, J. A. K. (2011a). A kernel-based framework to tensorial data analysis. *Neural Networks*, *24*(8), 861–874.
- Signoretto, M., Van de Plas, R., De Moor, B., & Suykens, J. A. K. (2011b). Tensor versus matrix completion: a comparison with application to spectral data. *IEEE Signal Processing Letters*, *18*(7), 403–406.
- Smale, S., & Zhou, D. (2005). Shannon sampling II: connections to learning theory. *Applied and Computational Harmonic Analysis*, *19*(3), 285–302.
- Smilde, A., Bro, R., & Geladi, P. (2004). *Multi-way analysis with applications in the chemical sciences*. New York: Wiley.
- Spingarn, J. (1983). Partial inverse of a Monotone Operator. *Applied Mathematics & Optimization*, *10*(1), 247–265.
- Srebro, N. (2004). *Learning with matrix factorizations*. Ph.D. thesis, Massachusetts Institute of Technology.
- Sturm, J. (1999). Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. *Optimization Methods & Software*, *11*(1), 625–653.
- Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. *Neural Processing Letters*, *9*(3), 293–300.
- Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. *Journal of the Royal Statistical Society. Series B (Methodological)*, *58*(1), 267–288.
- Tomioka, R., & Aihara, K. (2007). Classifying matrices with a spectral regularization. In *Proceedings of the 24th international conference on machine learning* (pp. 895–902). New York: ACM.
- Tomioka, R., Hayashi, K., & Kashima, H. (2011). *Estimation of low-rank tensors via convex optimization*. Arxiv preprint arXiv:1010.0789.
- Tucker, L. R. (1964). The extension of factor analysis to three-dimensional matrices. In *Contributions to mathematical psychology* (pp. 109–127). New York: Holt, Rinehart & Winston.
- Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. *Psychometrika*, *31*(3), 279–311.
- Tütüncü, R., Toh, K., & Todd, M. (2003). Solving semidefinite-quadratic-linear programs using SDPT3. *Mathematical Programming*, *95*(2), 189–217.
- Van Huffel, S., & Vandewalle, J. (1991). *The total least squares problem: computational aspects and analysis* (Vol. *9*). Philadelphia: Society for Industrial Mathematics.
- Vandenberghe, L., & Boyd, S. (1996). Semidefinite programming. *SIAM Review*, *38*(1), 49–95.
- Zhao, P., Rocha, G., & Yu, B. (2009). The composite absolute penalties family for grouped and hierarchical variable selection. *The Annals of Statistics*, *37*, 3468–3497.