1 Introduction

Tensors are the higher order generalization of vectors and matrices. They find applications whenever the data of interest have intrinsically many dimensions. This is the case in a growing number of areas such as econometrics, chemometrics, psychometrics, (biomedical) signal processing and image processing. Regardless of the specific domain, a common task in the data analysis workflow amounts to finding some low dimensional representation of the process under study. Existing tensor-based techniques (Kolda and Bader 2009; Smilde et al. 2004; Coppi and Bolasco 1989; Kroonenberg 2008) mostly consist of decompositions that give a concise representation of the underlying structure of data; this is useful for exploratory data analysis since it often reveals representative low-dimensional subspaces (for Tucker-type decompositions) or sums of rank-1 factors (for the Canonical Polyadic Decomposition (CPD) and related techniques). In this work we take a broader perspective and consider a wider set of learning tasks. Our main goal is to extend spectral regularization (Abernethy et al. 2009; Tomioka and Aihara 2007; Argyriou et al. 2007b, 2010; Srebro 2004) to the case where data have intrinsically many dimensions and are therefore represented as higher order tensors.

1.1 Related literature

So far spectral regularization has been advocated mainly for matrices (Tomioka and Aihara 2007; Argyriou et al. 2010, 2008, 2007b; Abernethy et al. 2009). In the important low-rank matrix recovery problem, using a convex relaxation technique has proved to be a valuable methodology (Cai et al. 2010; Candès and Recht 2009; Candès and Plan 2010). Recently this approach has been extended and tensor completion has been formulated (Liu et al. 2009; Signoretto et al. 2011b). The authors of Gandy et al. (2011) considered tensor completion and low multilinear rank tensor pursuit. Whereas the former assumes knowledge of some entries, the latter assumes knowledge of measurements obtained by sensing the unknown tensor via a known linear transformation (with the sampling operator being a special case). They provide algorithms for solving constrained as well as penalized versions of this problem. They also discuss formulations suitable for dealing with noisy measurements, in which a quadratic loss is employed to penalize deviation from the observed data.

1.2 Contributions

We present a framework based on convex optimization and spectral regularization to perform learning when data observations are represented as tensors. This includes in particular the cases where observations are vectors or matrices. In addition, it allows one to deal appropriately with data that have a natural representation as higher order arrays. We begin by introducing a unifying class of convex optimization problems for which we present a scalable template algorithm based on an operator splitting technique (Lions and Mercier 1979). We then specialize this class of problems to perform single as well as multi-task learning, both in a transductive and in an inductive setting. To this end we develop tools that extend to higher order tensors the concept of spectral regularization for matrices (Argyriou et al. 2007a). We consider smooth penalties (including the quadratic loss as a special case) and exploit a low multilinear rank assumption over one or more tensor unknowns through spectral regularizers. We show how this connects to the concept of Tucker decomposition (Tucker 1964, 1966) (a particular instance of which is also known as the Multilinear Singular Value Decomposition (De Lathauwer et al. 2000)). Additionally, as a by-product of using a tensor-based formalism, our framework allows one to tackle the multi-task case (Argyriou et al. 2008) in a natural way. In this way one exploits interdependence both at the level of the data representations and across tasks.

Our main contribution is twofold. A first contribution is to apply the framework to supervised transductive and inductive learning problems where the input data can be expressed as tensors. Important special cases of the framework include extensions of multitask learning with higher order observation data. A second main contribution lies within the Inexact Splitting Method that we propose as the template algorithm; we study an adaptive stopping criterion for the solution of a key sub-problem and give guarantees about the convergence of the overall algorithm.

1.3 Outline

In the next section we introduce preliminaries and present our notation. In Sect. 3 we discuss the general problem setting that we are concerned with. We present in Sect. 4 a template algorithm to solve this general class of problems and show its convergence. In Sect. 5 we extend to the tensor setting the existing definition of spectral penalty and develop the analytical tools we need. Section 6 deals with tensor-based transductive learning. Inductive learning is discussed in Sect. 7. We demonstrate the proposed methodologies in Sect. 8 and end the paper with Sect. 9 by drawing our concluding remarks.

2 Notation and preliminaries

We denote both scalars and vectors by lower case letters (a,b,c,…) and matrices by bold-face capitals (A,B,C,…). We write \(1_{N}\) to denote \([1,1,\ldots,1]^{\top}\in\mathbb{R}^{N}\) and \(\boldsymbol{I}_{N}\) to indicate the N×N identity matrix. We also use lower-case letters i,j as indices and I,J to denote the index upper bounds. Additionally we write \(\mathbb{N}_{I}\) to denote the set {1,…,I}. We recall that N-th order tensors, which we denote by calligraphic letters (\(\mathcal{A}\), \(\mathcal{B}\), \(\mathcal{C}\), …), are higher order generalizations of vectors (first order tensors) and matrices (second order tensors). More generally, the order N of a tensor is the number of dimensions, also known as ways or modes. We write \(a_{i_{1},\ldots,i_{N}}\) to denote the entry \((\mathcal{A})_{i_{1},\ldots,i_{N}}\). Likewise we write \(a_{i}\) to mean \((a)_{i}\) and \(a_{ij}\) to mean \((\boldsymbol{A})_{ij}\).

Next we present basic facts about tensors and introduce the mathematical machinery that we need to proceed further. The level of abstraction that we consider allows one to deal in a unified fashion with different problems and provides a useful toolset for very practical purposes. For instance, a proper characterization of operators and corresponding adjoints allows one to use the chain rule for subdifferentials (see, e.g., Ekeland and Temam 1976), which we use extensively in Sect. 5. Note that this is very useful also from an implementation viewpoint. In fact, it is used for the automatic derivation of differentials and sub-differentials of composite functions in modern optimization toolboxes (such as Becker et al. 2010).

Fig. 1: An illustration of the mode unfoldings for a third order tensor

2.1 Basic facts about tensors

An N-th order tensor \(\mathcal{A}\) is rank-1 if it consists of the outer product of N nonzero vectors \(u^{(1)}\in\mathbb {R}^{I_{1}},~u^{(2)}\in\mathbb{R}^{I_{2}},\ldots,~u^{(N)}\in\mathbb {R}^{I_{N}}\), that is, if \(a_{i_{1}i_{2}\ldots i_{N}} =u^{(1)}_{i_{1}}u^{(2)}_{i_{2}}\cdots u^{(N)}_{i_{N}}\) for all values of the indices. In this case we write \(\mathcal{A}=u^{(1)}\otimes u^{(2)}\otimes\cdots\otimes u^{(N)}\). The linear span of such elements forms a vector space, which once endowed with the inner product

$$ \langle\mathcal{A},\mathcal{B}\rangle:=\sum_{i_1} \sum_{i_2}\cdots\sum_{i_N} a_{i_1i_2\cdots i_N}b_{i_1i_2\cdots i_N}, $$
(1)

is denoted byFootnote 1 \(\mathbb {R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\). The corresponding Hilbert-Frobenius norm is \(\Vert\mathcal{A}\Vert:= \sqrt{\langle\mathcal{A},\mathcal{A}\rangle}\). We use 〈⋅,⋅〉 and ∥⋅∥ for any N≥1, regardless of the specific tuple \((I_{1},I_{2},\ldots,I_{N})\). An n-mode vector of \(\mathcal{A}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\) is an element of \(\mathbb{R}^{I_{n}}\) obtained from \(\mathcal{A}\) by varying the index \(i_{n}\) and keeping the other indices fixed. The n-rank of \(\mathcal{A}\), indicated by \(\operatorname{rank}_{n}(\mathcal{A})\), is the dimension of the space spanned by the n-mode vectors. A tensor for which \(r_{n}=\operatorname{rank}_{n}(\mathcal{A})\) for \(n\in\mathbb{N}_{N}\) is called a rank-\((r_{1},r_{2},\ldots,r_{N})\) tensor; the N-tuple \((r_{1},r_{2},\ldots,r_{N})\) is called the multilinear rank of \(\mathcal{A}\). For the higher order case an alternative notion of rank exists. This is:

$$ \operatorname{rank}(\mathcal{A}):=\arg\min \biggl\{R\in\mathbb{N}:\mathcal{A}= \sum_{r\in\mathbb{N}_R} u_r^{(1)}\otimes u_r^{(2)}\otimes\cdots\otimes u_r^{(N)}:u_r^{(n)} \in \mathbb{R}^{I_n}~\forall r\in\mathbb{N}_R, n\in \mathbb{N}_N \biggr\}. $$
(2)

Whereas for second order tensors \(\operatorname{rank}_{1}(\mathcal{A})=\operatorname{rank}_{2}(\mathcal {A})=\operatorname{rank}(\mathcal{A})\), in the general case we can only establish that \(\operatorname{rank}_{n}(\mathcal{A})\leq \operatorname{rank}(\mathcal{A})\) for any \(n\in\mathbb{N}_{N}\). Additionally, in the general N-th order case the n-ranks can differ from each other.

Let \(\mathcal{A}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\) and set

$$J:=\prod_{j\in\mathbb{N}_N\setminus\{n\}}I_j. $$

The n-mode unfolding (also called matricization or flattening) of \(\mathcal{A}\) is the matrix \(\mathcal {A}_{\langle n\rangle}\in\mathbb{R}^{I_{n}\times J}\) whose columns are the n-mode vectors. The ordering according to which the vectors are arranged to form \(\mathcal{A}_{\langle n\rangle}\) will not matter for our purposes; what matters is that one sticks to a chosen ordering rule.

Remark 1

Assume a second order tensor \(\boldsymbol{A}\in\mathbb{R}^{I_{1}\times I_{2}}\). Then if the 2-mode unfolding is defined upon the lexicographic ordering, we have

$$\boldsymbol{A}_{\langle2\rangle}=\boldsymbol{A}^{\top} $$

where \(\cdot^{\top}\) denotes matrix transposition.

In our setting the use of unfoldings is motivated by the elementary fact thatFootnote 2

$$ \operatorname{rank}_n(\mathcal{A})=\operatorname{rank}(\mathcal{A}_{\langle n\rangle}). $$
(3)

Note that the n-mode unfolding as introduced above defines the linear operator

$$\cdot_{\langle n\rangle}:\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}\rightarrow\mathbb{R}^{I_n\times J}. $$

The refolding or tensorization, denoted as \(\cdot^{\langle n\rangle}\), is defined as its adjoint \(\cdot^{\langle n\rangle}:\mathbb {R}^{I_{n}\times J}\rightarrow\mathbb{R}^{I_{1}\times I_{2}\times\cdots \times I_{N}}\) satisfying

$$\bigl\langle\boldsymbol{A}^{\langle n\rangle},\mathcal{B} \bigr\rangle= \langle \boldsymbol{A}, \mathcal{B}_{\langle n\rangle} \rangle. $$

Finally we recall that the n-mode product of a tensor \(\mathcal {A}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\) by a matrix \(\boldsymbol{U}\in\mathbb{R}^{J_{n}\times I_{n}}\), denoted by \(\mathcal{A}\times_{n} \boldsymbol{U}\), is defined by

$$ \mathcal{A}\times_n \boldsymbol{U}:=(\boldsymbol{U} \mathcal{A}_{\langle n \rangle })^{\langle n \rangle} \in \mathbb{R}^{I_1\times I_2\times\cdots\times I_{n-1}\times J_{n}\times I_{n+1}\times\cdots\times I_N} . $$
(4)
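To make these operations concrete, the following minimal numpy sketch (the helper names are ours and modes are 0-indexed) implements one admissible ordering rule for the unfolding, the corresponding refolding, and the n-mode product (4); the last lines also check (3) numerically.

```python
import numpy as np

def unfold(A, n):
    """n-mode unfolding A_<n>: the n-mode vectors become the columns;
    one fixed ordering rule (move axis n to the front, then reshape)."""
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def refold(M, n, shape):
    """Refolding M^<n>: inverse of `unfold` under the same ordering rule."""
    rest = [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape([shape[n]] + rest), 0, n)

def mode_product(A, U, n):
    """n-mode product A x_n U, cf. (4): unfold, multiply, refold."""
    shape = list(A.shape)
    shape[n] = U.shape[0]
    return refold(U @ unfold(A, n), n, tuple(shape))

A = np.random.randn(2, 3, 4)
print(unfold(A, 1).shape)                   # (3, 8)
print(np.linalg.matrix_rank(unfold(A, 1)))  # rank_n(A) via (3); here 3
print(mode_product(A, np.random.randn(5, 3), 1).shape)  # (2, 5, 4)
```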

2.2 Sampling operator and its adjoint

Assume \(\mathcal{A}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\) and consider the ordered set

$$\mathcal{S}:= \bigl\{s^{p}= \bigl(i_1^p,i_2^p,\ldots,i_N^p \bigr)\in\mathbb{N}_{I_1}\times\mathbb{N}_{I_2}\times\cdots\times\mathbb{N}_{I_N}:~p\in\mathbb{N}_P \bigr\} $$

identifying P entries of the N-th order tensor \(\mathcal{A}\). In the following we denote by \(\varOmega_{\mathcal{S}}\) the sampling operator defined by

$$\varOmega_{\mathcal{S}}:\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}\rightarrow\mathbb{R}^{P}:~(\varOmega_{\mathcal{S}}\mathcal{A})_p:=a_{s^p}. $$

Note that \(\varOmega_{\mathcal{S}}\) is linear and it can be equivalently restated as \((\varOmega_{\mathcal{S}}\mathcal{A})_p=\langle\mathcal{E}_{s^{p}},\mathcal{A}\rangle\) where \(\mathcal{E}_{s^{p}}\) is that element of the canonical basis of \(\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\) defined as

$$(\mathcal{E}_{s^{p}})_{i_1 i_2 \cdots i_N}:=\left \{ \begin{array}{@{}l@{\quad}l@{}} 1,&\text{if~} (i_1,i_2,\ldots,i_N)=s^{p}\\ 0,& \text{otherwise.} \end{array} \right . $$

Based upon this fact one can show that the adjoint of \(\varOmega_{\mathcal{S}}\), namely that unique operator \(\varOmega_{\mathcal{S}}^{*}:\mathbb{R}^{P}\rightarrow\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\) satisfying \(\langle\varOmega_{\mathcal{S}}^{*}b,\mathcal{A}\rangle=\langle b,\varOmega_{\mathcal{S}}\mathcal{A}\rangle\), is:

$$\varOmega_{\mathcal{S}}^{*}:b\mapsto\sum_{p\in\mathbb{N}_P}b_p\,\mathcal{E}_{s^p}. $$

It is immediate to check that \(\varOmega_{\mathcal{S}}\varOmega_{\mathcal{S}}^{*}=I\) and hence that \(\varOmega_{\mathcal{S}}\) is a co-isometry in the sense of Halmos (1982, Sect. 127, page 69).
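A small numpy sketch of the sampling operator and its adjoint (helper names are ours; the tuples in S are assumed distinct, as for an ordered set of entries); the last line checks the co-isometry property \(\varOmega_{\mathcal{S}}\varOmega_{\mathcal{S}}^{*}=I\).

```python
import numpy as np

def sample(A, S):
    """Sampling operator: stack the entries of A indexed by the tuples in S."""
    return np.array([A[s] for s in S])

def sample_adjoint(b, S, shape):
    """Adjoint: spread the measurements over a zero tensor (sum of b_p * E_{s^p})."""
    A = np.zeros(shape)
    for b_p, s in zip(b, S):
        A[s] += b_p
    return A

A = np.arange(24.0).reshape(2, 3, 4)
S = [(0, 1, 2), (1, 0, 3)]
b = np.array([1.0, -2.0])
print(sample(sample_adjoint(b, S, A.shape), S))  # [ 1. -2.]
```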

Remark 2

From this fact it follows that any solution of \(\varOmega_{\mathcal{S}}\mathcal{X}=z\) can be written as \(\mathcal{X}=\varOmega_{\mathcal{S}}^{*}z+\mathcal{X}_{0}\) where \(\mathcal{X}_{0}\) satisfies \(\varOmega_{\mathcal{S}}\mathcal{X}_{0}=0\).

Remark 3

Sampling operators in line with \(\varOmega_{\mathcal{S}}\) abound in learning theory and algorithms. For instance, Smale and Zhou (2005) consider a sampling operator on a reproducing kernel Hilbert space of functions (Aronszajn 1950) based on a set of evaluation functionals of the type

$$ E_x:f\mapsto f(x) $$
(5)

where f is a function on a certain domain and x is a point of that domain. It is worth remarking that an N-th order array \(\mathcal{A}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\) can be regarded as a function from \(\mathbb{N}_{I_{1}}\times\mathbb{N}_{I_{2}}\times\cdots\times \mathbb{N}_{I_{N}}\) to \(\mathbb{R}\). Correspondingly, \(\varOmega_{\mathcal{S}}\) can be restated in terms of evaluation functionals of the same type as (5), namely

$$E_{i_1^p i_2^p \cdots i_N^p}:\mathcal{A}\mapsto a_{i_1^p i_2^p \cdots i_N^p}. $$

This is no surprise as any finite dimensional space (such as \(\mathbb {R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\)) is isomorphic to a reproducing kernel Hilbert space of functions, see e.g. Berlinet and Thomas-Agnan (2004, Chap. 1).

2.3 Abstract vector spaces

In this paper we consider optimization problems on abstract finite dimensional inner product spaces that represent a generalization of \(\mathbb{R}^{P}\). We are especially interested in the case where such an abstract space, denoted by \(\mathbb{V}\), is obtained by endowing the Cartesian product of Q module spaces of tensors of different orders:

$$ \bigl(\mathbb{R}^{I_1\times I_2\times\cdots\times I_{N_1}} \bigr)\times \bigl( \mathbb{R}^{J_1\times J_2\times\cdots\times J_{N_2}} \bigr)\times \cdots \times \bigl(\mathbb{R}^{K_1\times K_2\times\cdots\times K_{N_Q}} \bigr) $$
(6)

with the canonical inner product formed upon the uniform sum of the module spaces’ inner products:

$$ \bigl\langle(\mathcal{V}_{1},\ldots,\mathcal{V}_{Q}),(\mathcal{W}_{1},\ldots,\mathcal{W}_{Q})\bigr\rangle:=\sum_{q\in\mathbb{N}_{Q}}\langle\mathcal{V}_{q},\mathcal{W}_{q}\rangle. $$
(7)

Note that we denoted the q-th component using the notation reserved for higher order tensors. When \(N_{q}=2\) (second order case) we stick with the notation for matrices introduced above and finally we denote it as a vector if \(N_{q}=1\). We denote \((\mathcal{W}_{1},\mathcal {W}_{2},\ldots,\mathcal{W}_{Q})\) by \(\mathcal{W}\). The norm associated to (7) is \(\Vert\mathcal{W}\Vert:=\sqrt{\langle\mathcal{W},\mathcal{W}\rangle}\).

Remark 4

As an example, assume \(\mathbb{V}\) is formed upon the module spaces \(\mathbb{R}^{2\times3\times3}\), \(\mathbb{R}^{4\times4}\) and \(\mathbb {R}^{5}\). A generic element of \(\mathbb{V}\) will then be denoted by \((\mathcal{A}, \boldsymbol{B},c)\), where we use different letters to emphasize the different roles played by the corresponding elements.

Alternatively we will denote elements of \(\mathbb{V}\), i.e., abstract vectors, by lower case letters (w,v,…) like ordinary vectors, i.e., elements of \(\mathbb{R}^{P}\). We use this convention whenever we do not want to specify the structure of \(\mathbb{V}\). We note that this is consistent with the fact that elements of \(\mathbb{V}\) can always be regarded as “long vectors”, avoiding involved notation. Additionally we denote by capital letters (A,B,C,…) general operators between abstract spaces, such as \(A:\mathbb{V}\rightarrow\mathbb{V}\), and use lower case letters (f,g,…) to denote functionals on \(\mathbb{V}\), namely operators of the type \(f:\mathbb{V}\rightarrow\mathbb{R}\). Next we introduce the general family of optimization problems of interest.

3 General problem setting

3.1 Main optimization problem

The learning tasks that we formulate in this paper can be tackled via special instances of the following convex optimization problem on an abstract vector space \(\mathbb{V}\):

$$ \min_{w\in\bar{\mathcal{C}}}~\bar{f}(w)+\bar{g}(w). $$
(8)

In this problem \(\bar{f}\) is a convex and differentiable functional. As we will illustrate by preliminary examples in Sect. 3.2, it plays the role of a (possibly averaged) cost; it is assumed that \(\nabla \bar{f}\) is \(L_{\bar{f}}\)-Lipschitz, namely that:

$$ \bigl\Vert\nabla\bar{f}(w)-\nabla\bar{f}(v)\bigr\Vert\leq L_{\bar{f}}\Vert w-v\Vert\quad\text{for any } w,v\in\mathbb{V}. $$
(9)

\(\bar{g}\) is a convex but possibly non-differentiable functional playing the role of a penalty. Finally, \(\bar{\mathcal{C}}\subseteq\mathbb{V}\) is a non-empty, closed and convex set; it is used to impose on w a specific structure, which depends on the specific instance of the learning task of interest.

3.2 Some illustrative examples

Problem (8) is very general and covers a wide range of machine learning formulations where one faces single as well as composite penalties, i.e., functions corresponding to the sum of multiple atomic (stand-alone) penalties. To show this, and to illustrate the formalism introduced above, we begin with the simplest problems that can be cast as (8). Subsequently, we will move on to the cases of interest, namely tensor-based problems. In the simplest cases, such as Ridge Regression (Hoerl and Kennard 1970), \(\bar{f}\) can be set equal to the error functional of interest f. In other cases, such as those that we deal with in the remainder of the paper, it is more convenient to duplicate optimization variables; in these cases \(\bar{f}\) is related to f in a way that we will clarify later.

In the optimization literature the idea of solving optimization problems by duplicating variables has roots in the 1950s and was developed in the 1970s, mainly in connection with control problems, see, e.g., Bertsekas and Tsitsiklis (1989). This general approach underlies the alternating direction method of multipliers and the related Douglas-Rachford technique that we discuss later in more detail. As we will see, duplicating variables allows one to decompose the original problem into simpler sub-problems that can be solved efficiently and can be distributed across multiple processing units.

3.2.1 Ridge regression

Unlike the original proposal (Hoerl and Kennard 1970), we consider an additional bias term in the model, as is common in machine learning. In this case the ambient space \(\mathbb{V}\) is defined upon two module spaces; Eqs. (6) and (7) read:

$$ \mathbb{V}=\mathbb{R}^{D}\times\mathbb{R},\qquad \bigl\langle(w,b),(v,d)\bigr\rangle:=\langle w,v\rangle+bd. $$
(10)

The error functional and the penalty term are, respectively,

$$ f(w,b)=\frac{1}{2N}\sum_{n\in\mathbb {N}_N} \biggl(y_n-\sum_{d\in\mathbb{N}_D}w_dx_{dn}-b \biggr)^2\quad \text{and}\quad g(w)=\frac {\lambda}{2}\sum _{d\in\mathbb{N}_D} w_d^2 $$
(11)

where λ>0 is a user-defined parameter. The problem of interest, namely

$$ \min_{(w,b)\in\mathbb{R}^{D}\times\mathbb{R}} f(w,b)+g(w) $$
(12)

can be solved via problem (8) by letting \(\bar{f}:=f\), \(\bar {g}:=g\) and, finally, \(\bar{\mathcal{C}}:=\mathbb{R}^{D}\times\mathbb{R}\) (no constraints are enforced). The affine model

$$ \hat{m}(x)=\langle\hat{w},x\rangle+\hat{b}~, $$
(13)

corresponding to the unique solution \((\hat{w},\hat{b})\), is estimated based upon input data collected in the design matrix \(\boldsymbol{X}=[x_{1},x_{2},\ldots,x_{N}]\in\mathbb{R}^{D\times N}\) and a vector of measured responses \(y\in\mathbb{R}^{N}\).
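For illustration, a short numpy sketch (with synthetic data of our own making) solves (12) through its normal equations, leaving the bias unpenalized as in (11):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, lam = 5, 50, 0.1
X = rng.standard_normal((D, N))     # design matrix [x_1, ..., x_N]
y = X.T @ rng.standard_normal(D) + 0.3 + 0.01 * rng.standard_normal(N)

# Stationarity of f + g in (11) w.r.t. (w, b): append a row of ones for
# the bias and leave its diagonal penalty entry at zero (unpenalized).
Xa = np.vstack([X, np.ones((1, N))])
P = Xa @ Xa.T / N + lam * np.diag([1.0] * D + [0.0])
w_hat, b_hat = np.split(np.linalg.solve(P, Xa @ y / N), [D])
m_hat = lambda x: w_hat @ x + b_hat[0]   # the affine model (13)
```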

3.2.2 Group lasso

As a second example, consider the more involved situation where the \(l_{2}\) penalty used in Ridge Regression is replaced by the group lasso penalty with (possibly overlapping) groups, see Zhao et al. (2009), Jacob et al. (2009). Let \(2^{\mathbb{N}_{D}}\) denote the power setFootnote 3 of \(\mathbb{N}_{D}\) and consider some collection of M ordered sets \(\mathcal{G}_{m}\in2^{\mathbb{N}_{D}}\), \(m\in\mathbb{N}_{M}\). For any \(w\in\mathbb{R}^{D}\) let \(w|_{\mathcal{G}_{m}}\in\mathbb{R}^{\vert\mathcal{G}_{m}\vert}\) be defined entry-wise by

$$ (w|_{\mathcal{G}_m})_{i}:=w_{(\mathcal{G}_m)_i}\quad\text{for } i\in\mathbb{N}_{\vert\mathcal{G}_m\vert}. $$
The group lasso problem with overlapping groups and an unpenalized bias term can be expressed as

$$ \min_{(w,b)\in\mathbb{R}^{D}\times\mathbb{R}} f(w,b)+g(w) $$
(14)

in which, for λ>0, we have:

$$ f(w,b)=\frac{1}{2N}\sum_{n\in\mathbb{N}_N} \biggl(y_n-\sum_{d\in\mathbb{N}_D}w_dx_{dn}-b \biggr)^{2}\quad\text{and}\quad g(w)=\lambda\sum_{m\in\mathbb{N}_M}\bigl\Vert w|_{\mathcal{G}_m}\bigr\Vert. $$
(15)

The latter is a first example of a composite penalty. In this case, grouped selection occurs for non-overlapping groups; hierarchical variable selection is achieved by defining groups with particular overlapping patterns (Zhao et al. 2009). Consider now the abstract vector space

$$ \underbrace{\bigl(\mathbb{R}^{D}\bigr)\times\bigl(\mathbb{R}^{D}\bigr)\times\cdots\times\bigl(\mathbb{R}^{D}\bigr)}_{M\ \text{times}}\times\mathbb{R} $$
(16)

endowed with the canonical inner product

$$ \bigl\langle(w_{[1]},\ldots,w_{[M]},b),(v_{[1]},\ldots,v_{[M]},d)\bigr\rangle:=\sum_{m\in\mathbb{N}_M}\langle w_{[m]},v_{[m]}\rangle+bd. $$
(17)

Note that the original variable w is duplicated into M copies, namely, \(w_{[1]},w_{[2]},\ldots,w_{[M]}\). Once defined the set

$$ \bar{\mathcal{C}}:= \bigl\{(w_{[1]},w_{[2]},\ldots,w_{[M]},b):~w_{[1]}=w_{[2]}=\cdots=w_{[M]} \bigr\}, $$
(18)

we can solve (14) by means of the problem

$$ \min_{(w_{[1]},\ldots,w_{[M]},b)\in\bar{\mathcal{C}}}~\bar{f}(w_{[1]},w_{[2]},\ldots,w_{[M]},b)+\bar{g}(w_{[1]},w_{[2]},\ldots,w_{[M]}) $$
(19)

where

$$\bar{f}(w_{[1]},w_{[2]},\ldots,w_{[M]},b)= \frac{1}{M}\sum_{m\in\mathbb {N}_M}f(w_{[m]},b) $$

and

$$\bar{g}(w_{[1]},w_{[2]},\ldots,w_{[M]})=\sum _{m\in\mathbb{N}_M}g_m(w_{[m]}). $$

Indeed it is clear that if \((\hat{w}_{[1]},\hat{w}_{[2]},\ldots,\hat {w}_{[M]},\hat{b})\) is a solution of (19) then, for any \(m\in\mathbb{N}_{M}\), \((\hat{w}_{[m]},\hat{b})\) is a solution of (14).
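Assuming, as the equivalence just stated suggests, that (18) is the consensus set equating all the copies, its Euclidean projection (which the template algorithm of Sect. 4 requires) simply averages them; a minimal numpy sketch:

```python
import numpy as np

def project_consensus(W_copies, b):
    """Projection onto {(w_[1], ..., w_[M], b): w_[1] = ... = w_[M]}:
    replace every copy by the average; the bias block is unconstrained."""
    mean = W_copies.mean(axis=0)
    return np.broadcast_to(mean, W_copies.shape).copy(), b

W = np.array([[1.0, 2.0], [3.0, 0.0]])   # M = 2 copies of w, stacked row-wise
print(project_consensus(W, 0.5)[0])      # both rows become [2., 1.]
```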

3.3 Learning with tensors

In the next sections we will deal with both inductive and transductive tensor-based learning problems. Regularization will be based upon the composite spectral penalties that we introduce in Sect. 5. Multiple module spaces will be used to account for tensor unknowns of different orders. We will tackle multiple tasks simultaneously and assume that input features are collected within higher order tensors. A strategy similar to the one considered above for the group lasso will be used to conveniently recast our learning problems in terms of (8).

3.3.1 Transductive Learning

In the transductive case one has an input data tensor with missing features and, possibly, a partially observed matrix of labels. The goal is to both infer the missing entries in the data tensor and predict the missing labels. Notably, the special case where there is no labeling information corresponds to tensor completion, which was considered for the first time in Liu et al. (2009) and can be regarded as a single learning task. For the case where input patterns are represented as vectors our approach boils down to the formulation in Goldberg et al. (2010). In this sense the transductive formulation that we propose can be regarded as a generalization to the case where input data admit a higher order representation. In this case the essential idea consists of regularizing the collection of input features and labels directly, without learning a model.

3.3.2 Inductive learning

In the second family of problems we consider, set within inductive learning, the goal is to determine a model for each learning task to be used for out-of-sample prediction. For the inductive case the model corresponding to a single task will be

$$ \hat{m}(\mathcal{X})=\langle\hat{\mathcal{W}},\mathcal{X} \rangle+\hat{b}, $$
(20)

where \(\mathcal{X}\in\mathbb{R}^{D_{1}\times D_{2}\times\cdots\times D_{M}}\) represents here a generic data-tensor, and \((\hat{\mathcal{W}},\hat{b})\) are the estimated parameters, see (13) for comparison.

Each training pair consists of an input tensor data observation and a vector of labels that corresponds to related but distinct tasks. This setting extends the standard penalized empirical risk minimization problem to allow for both multiple tasks and higher order observational data.

3.3.3 Common algorithmic framework

The full taxonomy of learning formulations we deal with is illustrated in Table 1. The fact that these distinct classes of problems can be seen as special instances of (8) allows us to develop a unified algorithmic strategy to find their solutions. In particular, a central tool is given by the fact that \(\mathbb{V}\) is a metric space (with the metric induced by an inner product as in (10) and (17)). Next we describe a provably convergent algorithm that is suitable for the situation where \(\mathbb{V}\) is high dimensional. In the next sections we will show how this general approach can be adapted to our different purposes.

Table 1 The learning tasks that we deal with via the optimization problem in (8)

4 Unifying algorithmic approach

For certain closed forms of \(\bar{f}\) and \(\bar{g}\), (8) can be restated as a semi-definite programming (SDP) problem (Vandenberghe and Boyd 1996) and solved via SDP solvers such as SeDuMi (Sturm 1999) or SDPT3 (Tütüncü et al. 2003). However there is an increasing interest in the case where \(\mathbb{V}\) is high dimensional, in which case this approach is not satisfactory. Alternative scalable techniques that can be adapted to the solution of (8) consist of proximal point algorithms designed to find a zero of the sum of maximal monotone operators. Classical references include Rockafellar (1976), Lions and Mercier (1979) and Spingarn (1983). A modern and comprehensive review with application to signal processing is found in Combettes and Pesquet (2009). These algorithms include as special cases the Alternating Direction Method of Multipliers (ADMM), see Boyd et al. (2011) for a recent review. Here we propose an algorithm in the family of the Douglas-Rachford splitting methods. Notably, ADMM can be seen as a special instance of the Douglas-Rachford splitting method, see Eckstein and Bertsekas (1992) and references therein. Our general approach can be regarded as a variant of the proximal decomposition method proposed in Combettes and Pesquet (2008) and Combettes (2009), by which it was inspired. As the main advantage, the approach does not solve the original problem directly; rather, it duplicates some of the optimization variables and solves simpler problems (proximal problems) in a distributed fashion. As we will show later, the simplicity of the proximal problems lies in the fact that they can be solved exactly in terms of the SVD. Notably, as Sect. 3.2 shows, the algorithm we develop is relevant beyond our tensor-based framework.

4.1 Proximal point algorithms and operator splitting techniques

4.1.1 Problem restatement

In order to illustrate our scalable solution strategy we begin by equivalently restating (8) as the unconstrained problem:

$$ \min_{w\in\mathbb{V}}~ \bigl\{\bar{h}(w):=\bar{f}(w)+\bar{g}(w)+\delta_{\bar{\mathcal{C}}}(w) \bigr\} $$
(21)

where the indicator function \(\delta_{\bar{\mathcal{C}}}\) is defined as:

$$ \delta_{\bar{\mathcal{C}}}(w):=\left \{ \begin{array}{@{}l@{\quad}l@{}} 0,&\text{if~} w\in\bar{\mathcal{C}}\\ +\infty,&\text{otherwise.} \end{array} \right . $$
Note that \(\hat{w}\) is a solution to (21) if and only if (Rockafellar 1970a)

$$ 0\in\nabla\bar{f}(\hat{w})+\partial\bar{g}(\hat{w})+N_{\bar{\mathcal{C}}}(\hat{w}) $$
(22)

where \(\nabla\bar{f}\) denotes the gradient of \(\bar{f}\), \(\partial\bar {g}\) is the subdifferential of \(\bar{g}\) and \(N_{\bar{\mathcal{C}}}\) is the subdifferential of \(\delta_{\bar{\mathcal{C}}}\), i.e., the normal cone (Bauschke and Combettes 2011) of \(\bar{\mathcal{C}}\):

$$ N_{\bar{\mathcal{C}}}(w):=\left \{ \begin{array}{@{}l@{\quad}l@{}} \{v\in\mathbb{V}:~\langle v,u-w\rangle\leq0~\forall u\in\bar{\mathcal{C}}\},&\text{if~} w\in\bar{\mathcal{C}}\\ \emptyset,&\text{otherwise,} \end{array} \right . $$
Letting now

$$ A:=\nabla\bar{f}+N_{\bar{\mathcal{C}}}\quad\text{and}\quad B:=\partial\bar{g}, $$
(23)

Eq. (22) can be restated as

$$ 0\in T(\hat{w})=A(\hat{w})+B(\hat{w}) $$
(24)

where A and B, as well as their sum T=A+B, are set-valued operators (for each w their image is a subset of \(\mathbb{V}\)) and they all qualify as maximal monotone. Maximal monotone operators, of which subdifferentials are a special instance, have been extensively studied in the literature, see e.g. Minty (1962), Rockafellar (1970b), Brézis (1973) and Phelps (1993). A recent account of the subject can be found in Bauschke and Combettes (2011).

4.1.2 Resolvent and proximal point algorithms

It is well known that, for any τ>0 and a given maximal monotone operator T on , \(\hat{x}\in T^{-1}(0)\) if and only if \(\hat{x}\) satisfies \(\hat{x}\in R_{\tau T}\hat{x} \), i.e., if \(\hat{x}\) is a fixed point of the single-valued resolvent of τT, defined as

$$ R_{\tau T}:=(I+\tau T)^{-1} $$
(25)

see e.g. Bauschke and Combettes (2011). Proximal point algorithms are based on this fundamental fact and consist of variations of the basic proximal iteration:

$$ x^{(t+1)}=(I+\tau T)^{-1}x^{(t)}. $$
(26)

In the problem of interest T is a special monotone operator; indeed it corresponds to the subdifferential of the convex function \(\bar{h}\) defined in (21). In the case of a subdifferential, (26) can be restated as \(x^{(t)}\in x^{(t+1)}+\tau\partial\bar{h}(x^{(t+1)})\). This, in turn, is equivalent to:

$$ 0\in\tau\,\partial\bar{h}\bigl(x^{(t+1)}\bigr)+x^{(t+1)}-x^{(t)}. $$
(27)

4.1.3 Proximity operator

Equation (27) represents the optimality condition for the optimization problem:

$$ \min_{x\in\mathbb{V}}~\tau\bar{h}(x)+\frac{1}{2}\bigl\Vert x-x^{(t)}\bigr\Vert^{2}. $$
(28)

In light of this, one can restate the proximal iteration (26) as:

$$ x^{(t+1)}=\mathrm {prox}_{\tau\bar{h}} \bigl(x^{(t)} \bigr), $$
(29)

where \(\mathrm{prox}_{g}\) is the proximity operator (Moreau 1962) of g:

$$ \mathrm{prox}_{g}(x):=\arg\min_{y} \biggl\{g(y)+\frac{1}{2}\Vert y-x\Vert^{2} \biggr\}. $$
(30)
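As a concrete closed-form instance, the proximity operator of \(\lambda\Vert\cdot\Vert_{1}\) is elementwise soft-thresholding, a fact we rely on repeatedly below; a two-line numpy check:

```python
import numpy as np

def prox_l1(x, lam):
    """prox of lam*||.||_1 in the sense of (30): soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(prox_l1(np.array([3.0, -0.5, 1.2]), 1.0))  # [ 2. -0.   0.2]
```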

4.1.4 Operator splitting approaches

The proximal iteration (29) is numerically viable only in those cases in which it is easy to solve the optimization problem (28). When \(\bar{h}\) is a quadratic function, for instance, (28) corresponds to the solution of a system of linear equations that can be approached by reliable and well studied routines. In general, however, it is non-trivial to tackle problem (28) directly. A viable alternative to the proximal iteration (29) relies on an operator splitting approach, see Bauschke and Combettes (2011) for a modern review. In the present context, the use of a splitting technique arises quite naturally from separating the objective function into (1) the smooth term \(\bar{f}\) together with the constraint (corresponding to the operator A) and (2) the (generally) non-smooth term \(\bar{g}\) (corresponding to the operator B). As we will see, this decomposition leads to a tractable algorithm, in which the operators A and B are employed in separate subproblems that are easy to solve. In particular, a classical method to solve (24) is the Douglas-Rachford splitting technique that was initially developed in Lions and Mercier (1979) based upon an idea found in Douglas and Rachford (1956).

4.2 Douglas-Rachford splitting technique

The Douglas-Rachford splitting technique allows one to solve the inclusion problem (24) when A and B are maximal monotone operators. The main iteration \(G_{\mathrm{DR}}\) consists of the following steps:

$$ y^{(k)}=R_{\tau A}\bigl(w^{(k)}\bigr) $$
(31a)
$$ r^{(k)}=R_{\tau B}\bigl(2y^{(k)}-w^{(k)}\bigr) $$
(31b)
$$ w^{(k+1)}=w^{(k)}+\gamma^{(k)}\bigl(r^{(k)}-y^{(k)}\bigr) $$
(31c)

In the latter τ is a positive proximity parameter and \((\gamma^{(k)})_{k}\) is a sequence of parameters that, once chosen appropriately, ensures convergence. With reference to (23), Eq. (31a) reads in our context

$$ y^{(k)}=\arg\min_{y\in\bar{\mathcal{C}}} \biggl\{\bar{f}(y)+\frac{1}{2\tau}\bigl\Vert y-w^{(k)}\bigr\Vert^{2} \biggr\} $$
(32)

whereas (31b) reads

$$ r^{(k)}=\mathrm {prox}_{\tau\bar{g}}\bigl(2y^{(k)}-w^{(k)} \bigr). $$
(33)
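As a toy illustration (not the paper's learning problems), the following numpy run applies (31a)-(31c) under stated assumptions: \(\bar{f}(x)=\frac{1}{2}\Vert x-a\Vert^{2}\) with no constraint set, \(\bar{g}=\lambda\Vert\cdot\Vert_{1}\), and the illustrative choices \(\tau=\gamma^{(k)}=1\), so that both resolvents are available in closed form.

```python
import numpy as np

a = np.array([3.0, -0.5, 1.2])
lam = tau = gamma = 1.0

prox_f = lambda w: (w + tau * a) / (1.0 + tau)  # resolvent of tau*A, A = grad f
prox_g = lambda w: np.sign(w) * np.maximum(np.abs(w) - tau * lam, 0.0)

w = np.zeros_like(a)
for _ in range(100):
    y = prox_f(w)             # step (31a)
    r = prox_g(2 * y - w)     # step (31b)
    w += gamma * (r - y)      # step (31c)
print(y)  # ~[2., 0., 0.2], the minimizer of 0.5*||x - a||^2 + ||x||_1
```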

4.3 Modelling workflow within the Douglas-Rachford algorithmic framework

The use of a splitting technique arises quite naturally in our context from separating the objective function (with constraints embedded via the indicator function) into (1) a part that can be approached by gradient projection and (2) a non-smooth term that can be conveniently tackled via a proximal problem. On the other hand, the Douglas-Rachford algorithmic framework, together with the abstract vector space machinery introduced above, naturally results in the following mathematical engineering workflow.

Optimization modelling

Specification of the target problem: definition of the cost f and of the composite penalty g depending on the learning task of interest.

Problem casting

Specification of the auxiliary problem: definition of the abstract vector space \(\mathbb{V}\); \(\bar{f}\), \(\bar{g}\) and \(\bar{\mathcal{C}}\) are specified so that a solution of the auxiliary problem can be mapped into a solution of the target problem.

Sect. 3.2 already provided an illustration of these steps in connection to learning problems involving a parameter vector and a bias term. In general, a key ingredient in the problem casting is to ensure that \(\bar{g}\) is an additively separable function. In this case, in fact, computing \(\mathrm {prox}_{\tau\bar{g}}\) in (33) involves subproblems on each module space that can be distributed. We formally state this result in the following proposition. The simple proof can be found in the literature, see e.g. Bauschke and Combettes (2011, Proposition 23.30).

Proposition 1

For \(i\in\mathbb{N}_{I}\) let \(\mathbb{V}_{i}\) be some vector space with inner product \(\langle\cdot,\cdot\rangle_{i}\). Let \(\mathbb{V}\) be the space obtained endowing the Cartesian product \(\mathbb{V}_{1}\times\mathbb{V}_{2}\times\cdots\times\mathbb{V}_{I}\) with the inner product \(\langle x,y\rangle=\sum_{i\in\mathbb{N}_{I}}\langle x_{i},y_{i}\rangle_{i}\). Assume a function \(\bar{g}:\mathbb{V}\rightarrow\mathbb{R}\) defined by

$$\bar{g}:(x_1,x_2,\ldots,x_I)\mapsto\sum _{i\in\mathbb{N}_I}g_i(x_i) $$

where for any \(i\in\mathbb{N}_{I}\), \(g_{i}:\mathbb{V}_{i}\rightarrow\mathbb{R}\) is convex. Then we have:

$$\mathrm {prox}_{\bar{g}}(x)= \bigl(\mathrm {prox}_{g_1}(x_1), \mathrm {prox}_{g_2}(x_2),\ldots,\mathrm {prox}_{g_I}(x_I) \bigr). $$
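A toy two-module illustration of Proposition 1 with hypothetical choices \(g_{1}=\Vert\cdot\Vert_{1}\) and \(g_{2}=\frac{1}{2}\Vert\cdot\Vert^{2}\): the prox of the sum is computed blockwise, which is what allows (33) to be distributed across module spaces.

```python
import numpy as np

prox_g1 = lambda x: np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)  # prox of ||.||_1
prox_g2 = lambda X: X / 2.0                                        # prox of 0.5*||X||^2

x1, X2 = np.array([2.0, -0.3]), np.eye(2)
print(prox_g1(x1))   # block 1: [ 1. -0.]
print(prox_g2(X2))   # block 2: 0.5*I; the two blocks never interact
```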

4.4 Limits of two-level strategies

Next we present our algorithm based on an inexact variant of the Douglas-Rachford iteration. Our interest is in those situations where (33) can be computed exactly whereas the inner problem (32) requires an iterative procedure. In many situations, in fact, one can cast the learning problem of interest in such a way that (33) can be computed easily and with high precision. Nonetheless, for general \(\bar{f}\) in the inner problem (32), using the Douglas-Rachford iteration to solve (8) requires a procedure consisting of two nested iterative schemes. In general, the convergence of such a two-level strategy is ensured only upon exact solution of the inner problem. On the other hand, practical implementations require one to specify a termination criterion and a corresponding accuracy. Notably, Gandy et al. (2011) propose different algorithms for an instance of the general problem in (8) similar to the formulations we will consider in Sect. 6. In particular, in their Sect. 5.4 they also devise an inexact algorithm but do not provide any convergence guarantee. Motivated by this, we propose an adaptive termination criterion for the inner problem and prove the convergence of the outer scheme to a solution of (8).

4.5 Template based on inexact splitting technique

The approach that we propose here for solving (8), termed Inexact Splitting Method (ISM), is presented in Algorithm 1, in which we denote by \(P_{\bar{\mathcal{C}}}\) the projection onto \(\bar{\mathcal{C}}\):

$$ P_{\bar{\mathcal{C}}}(w):=\arg\min_{v\in\bar{\mathcal{C}}}\Vert v-w\Vert. $$
(34)

The idea is sketched as follows.

  1. We apply an inexact version of \(G_{\mathrm{DR}}\) to solve problem (8), where we only require \(y^{(k)}\) in (32) to be computed up to a given precision \(\epsilon^{(k)}\). Since, in our setting, (31b) can be computed in closed form, we do not require any inexactness at this step.

  2. Problem (32) is strongly convex for any τ>0 and any convex and differentiable function \(\bar{f}\). One can apply a gradient method, which in this situation converges at a linear rate (Nesterov 2003, Theorem 2.2.8, p. 88).

Notice that step 2 in the Main procedure consists of solving the optimization subproblem (32) with a precision \(\epsilon^{(k)}\) that depends upon the iteration index k. In practice this is achieved via the Goldstein-Levitin-Polyak gradient projection method, see Bertsekas (1976, 1995). In the first main iterations a solution to (32) is found with low accuracy (whence the term inexact); as the estimate is refined along the iterations of Main, the precision within the inner problem is increased; this ensures that the sequence \((y^{(k)})_{k}\) produced by Algorithm 1 converges to a solution of problem (8), as the following result shows.
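The following is a hypothetical sketch of such an inner solver (names and the residual-based stopping rule are illustrative, not the exact adaptive criterion studied in Appendix B): projected gradient with constant step \(1/L_{\bar{q}}\) on the strongly convex objective of (32), see also Remark 5 below.

```python
import numpy as np

def inexact_prox(grad_f, project, w, tau, L_f, eps, max_iter=10000):
    """Approximately solve (32): min over C of f(y) + ||y - w||^2/(2*tau),
    with L_q = L_f + 1/tau (the inner objective's Lipschitz constant)."""
    L_q = L_f + 1.0 / tau
    y = project(w)
    for _ in range(max_iter):
        g = grad_f(y) + (y - w) / tau     # gradient of the inner objective
        y_new = project(y - g / L_q)      # Goldstein-Levitin-Polyak step
        if L_q * np.linalg.norm(y_new - y) <= eps:  # illustrative residual test
            return y_new
        y = y_new
    return y

# Toy usage: f(y) = 0.5*||y||^2 constrained to the nonnegative orthant.
y = inexact_prox(lambda y: y, lambda y: np.maximum(y, 0.0),
                 np.array([1.0, -2.0]), tau=1.0, L_f=1.0, eps=1e-8)
print(y)  # ~[0.5, 0.]
```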

Theorem 1

Assume the solution set of problem (8) is non-empty; in Algorithm 1 let \(\epsilon_{0}>0\), σ>1 and τ>0 be arbitrarily fixed parameters. Then \(\{y^{(k)}\}_{k}\) converges to a solution of problem (8).

Proof

See Appendix B. □

Remark 5

(Unknown Lipschitz constant)

Notice that in the procedure that computes the proximity operator with adaptive precision we assumed \(L_{\bar{f}}\), as defined in (9), to be known; based upon the latter, \(L_{\bar{q}}\) is immediately computed since \(L_{\bar{q}}=L_{\bar{f}}+1/\tau\), see Lemma 2 in Appendix B. In practical applications, however, \(L_{\bar{f}}\) is often unknown or hard to compute. In this situation an upper bound for \(L_{\bar{q}}\) can be found according to a backtracking strategy, see Beck and Teboulle (2009), Nesterov (2007) for details. The constant step-size \(1/L_{\bar{q}}\) in step 3 of InexactProxy is then replaced by an adaptive step-size \(h \in(0, \frac{1}{L_{\bar{q}}}]\) appropriately chosen by the backtracking procedure.

Remark 6

(Termination of the outer loop)

Since, as we proved, the sequence \(\{y^{(k)}\}_{k}\) converges to the solution of problem (8), one can use the condition

$$ \bigl\Vert y^{(k+1)}-y^{(k)}\bigr\Vert\leq\eta $$
(35)

to terminate the loop in the procedure Main, where η>0 is a desired accuracy. However, for the specific form of the learning problems considered in this paper, we prefer to use the objective value. Typically, we terminate the outer loop ifFootnote 4

$$ \frac{\vert \bar{h}(y^{(k+1)}) - \bar{h}(y^{(k)})\vert }{ \vert \bar{h}(y^{(k)})\vert } \leq\eta. $$
(36)

The reason for this choice is as follows: generally the termination condition (36) finds a solution close to optimal (with respect to the optimization problem). When it does not, the algorithm is normally stuck in a plateau, which means that the optimization would typically require a long time with no significant improvement in the estimate. In this setting the termination condition achieves a shorter computational time by accepting the current estimate and exiting the loop.

5 Spectral regularization and multilinear ranks

So far we have elaborated on the general formulation in (8); in this section we specify the nature of the penalty functions that we are concerned with in our tensor-based framework. We begin by focusing on the case where \(\mathbb{V}\) corresponds to \(\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\); we then consider multiple module spaces in line with (6) and (7).

5.1 Spectral penalties for higher order tensors

We recall that a symmetric gauge function \(h:\mathbb {R}^{P}\rightarrow\mathbb{R}\) is a norm which is both absolute and invariant under permutationsFootnote 5 (von Neumann 1937), see also Horn and Johnson (1994, Definition 3.5.17). Symmetric gauge functions include, for instance, all the \(l_{p}\) norms. The following definition generalizes to higher order tensors the concept of spectral regularizer studied in Abernethy et al. (2009) and Argyriou et al. (2010).

Definition 1

(n-mode spectral penalty for higher order tensors)

For \(n\in\mathbb{N}_{N}\) a function \(\varOmega:\mathbb {R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\rightarrow\mathbb{R}\) is called an n-mode spectral penalty if it can be written as:

$$\varOmega(\mathcal{W})=h\bigl(\sigma(\mathcal{W}_{\langle n\rangle})\bigr) $$

where, for \(R=\min \{I_{n},\prod_{j\in\mathbb{N}_{N}\setminus\{n\} }I_{j} \}\), \(h:\mathbb{R}^{R}\rightarrow\mathbb{R}\) is some symmetric gauge function and \(\sigma(\mathcal{W}_{\langle n\rangle})\in[0,\infty )^{R}\) is the vector of singular values of the matrix \(\mathcal {W}_{\langle n\rangle} \) in non-increasing order.

We are especially interested in composite spectral regularizers corresponding to the (weighted) sum of different n-mode spectral penalties. The earliest example of such a situation is found in Liu et al. (2009). Denoting by \(\Vert\cdot\Vert_{*}\) the nuclear norm for matrices, Liu et al. (2009) considers the penalty

$$ g(\mathcal{W})=\sum_{n\in\mathbb{N}_N} \frac{1}{N}\Vert\mathcal {W}_{\langle n\rangle}\Vert_* $$
(37)

with the purpose of performing completion of a partially observed tensor. It is clear that, since \(\Vert\mathcal{W}_{\langle n\rangle }\Vert_{*}=\Vert\sigma(\mathcal{W}_{\langle n\rangle})\Vert_{1}\) and \(\Vert\cdot\Vert_{1}\) is a symmetric gauge function, (37) qualifies as a composite spectral regularizer.

The nuclear norm has been used to devise convex relaxations for rank constrained matrix problems (Recht et al. 2007; Candès and Recht 2009; Candes et al. 2011); this parallels the use of the \(l_{1}\)-norm in sparse approximation and cardinality minimization (Tibshirani 1996; Chen et al. 2001; Donoho 2006). Likewise, minimizing (37) can be seen as a convex proxy for the minimization of the multilinear ranks.

5.2 Relation with multilinear rank

A tensor \(\mathcal{W}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\) can be written as (Tucker 1964)

$$ \mathcal{W}=\mathcal{S}\times_1\boldsymbol{U}^{(1)} \times_2\boldsymbol{U}^{(2)}\times _3\cdots \times_N\boldsymbol{U}^{(N)} $$
(38)

where \(\mathcal{S}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\) is called the core tensor and for any \(n\in\mathbb{N}_{N}\), \(\boldsymbol{U}^{(n)}\in\mathbb{R}^{I_{n}\times I_{n}}\) is a matrix of n-mode singular vectors, i.e., the left singular vectors of the n-mode unfolding \(\mathcal{W}_{\langle n\rangle}\) with SVDFootnote 6

$$ \mathcal{W}_{\langle n\rangle}=\boldsymbol{U}^{(n)}\operatorname{diag}\bigl(\sigma(\mathcal{W}_{\langle n\rangle})\bigr) {\boldsymbol{V}^{(n)}}^\top. $$
(39)

Equation (38) is also known as the Multilinear Singular Value Decomposition (MLSVD). It has some striking similarities with the matrix SVD, see De Lathauwer et al. (2000). In particular, a good approximation of \(\mathcal{W}\) can often be achieved by disregarding the n-mode singular vectors corresponding to the smallest singular values \(\sigma(\mathcal {W}_{\langle n\rangle})\). See Fig. 2 for an illustration. Since penalizing the nuclear norm of \(\mathcal{W}_{\langle n\rangle}\) enforces the sparsity of \(\sigma(\mathcal{W}_{\langle n\rangle})\), (37) favors low multilinear rank tensors. Notably, for N=2 (second order case) it is easy to see that (37) is consistent with the definition of the nuclear norm for matrices.
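A compact numpy sketch of (38)-(39) for a third order tensor: each factor collects the left singular vectors of the corresponding unfolding; since the factors are orthogonal here, the core reproduces \(\mathcal{W}\) exactly, and truncating the factors yields a low multilinear rank approximation.

```python
import numpy as np

W = np.random.randn(3, 4, 5)
unfold = lambda A, n: np.moveaxis(A, n, 0).reshape(A.shape[n], -1)
# n-mode singular vectors, cf. (39)
U0, U1, U2 = (np.linalg.svd(unfold(W, n), full_matrices=False)[0] for n in range(3))
core = np.einsum('ai,bj,ck,abc->ijk', U0, U1, U2, W)      # S = W x_n Un^T, all n
recon = np.einsum('ai,bj,ck,ijk->abc', U0, U1, U2, core)  # Eq. (38)
print(np.allclose(recon, W))  # True
# rank-(2,2,2) truncation: keep the leading singular vectors only
approx = np.einsum('ai,bj,ck,ijk->abc', U0[:, :2], U1[:, :2], U2[:, :2],
                   core[:2, :2, :2])
```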

The nuclear norm is the convex envelope of the rank function on the spectral-norm unit ball (Fazel 2002); as such it represents the best convex approximation for a number of non-convex matrix problems involving the rank function. Additionally it has been established that under certain probabilistic assumptions it allows one to recover with high probability a low rank matrix from a random subset of its entries (Candès and Recht 2009; Koltchinskii et al. 2010). Similar results do not exist for (37) when N>2, no matter what definition of tensorial rank one considers (see Sect. 2.1). It is therefore arguable whether or not it is appropriate to call it nuclear norm for tensors, as done in Liu et al. (2009). Nonetheless this penalty provides a viable way to compute low complexity estimates in the spirit of the Tucker decomposition. By contrast problems stated in terms of the tensorial rank (2) are notoriously intractable (Hillar and Lim 2010; Hastad 1990). To the best of our knowledge it remains an open problem to devise an appropriate convexification for this type of rank function.

5.3 Proximity operators

The numerical feasibility of proximal point algorithms largely depends upon the simplicity of computing the proximity operator introduced in (30). For the class of n-mode spectral penalties we can establish the following.

Proposition 2

(Proximity operator of an n-mode spectral penalty)

Assume \(\mathcal{W}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\) and let (39) be the SVD of its n-mode unfolding \(\mathcal{W}_{\langle n \rangle}\). Then the evaluation at \(\mathcal{W}\) of the proximity operator of \(\varOmega(\mathcal{W})=h(\sigma(\mathcal{W}_{\langle n\rangle}))\) is

$$ \mathrm {prox}_{\varOmega}(\mathcal{W})= \bigl(\boldsymbol{U}^{(n)} \operatorname{diag}\bigl( \mathrm {prox}_h\bigl(\sigma(\mathcal{W}_{\langle n\rangle})\bigr)\bigr) { \boldsymbol{V}^{(n)}}^\top \bigr)^{\langle n\rangle}. $$
(40)

Proof

For a matrix A with SVD \(\boldsymbol{A}=\boldsymbol{U}\operatorname{diag}(\sigma(\boldsymbol{A})) \boldsymbol{V}^{\top}\), Argyriou et al. (2011, Proposition 3.1) established that

$$\mathrm {prox}_{h\circ\sigma}(\boldsymbol{A})=\boldsymbol{U}\operatorname{diag}\bigl(\mathrm {prox}_{h}\bigl( \sigma(\boldsymbol{A})\bigr)\bigr) \boldsymbol{V}^\top . $$

It remains to show that \(\mathrm {prox}_{\varOmega}(\mathcal{W})= (\mathrm {prox}_{h\circ \sigma}(\mathcal{W}_{\langle n\rangle}) )^{\langle n\rangle}\). Note that \(\cdot^{\langle n\rangle}\) is a linear one-to-one (invertible) operator and that \((\mathcal{W}_{\langle n\rangle})^{\langle n\rangle}=\mathcal{W}\), namely the composition between the unfolding operator and its adjoint yields the identityFootnote 7 on \(\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\). Additionally, by the chain rule for the subdifferential (see e.g. Nesterov 2003, Lemma 3.18) and by definition of Ω one has \(\partial\varOmega(\mathcal{V})=(\partial(h\circ\sigma)(\mathcal {V}_{\langle n\rangle}))^{\langle n\rangle}\). We now have:

$$ \begin{array} {r@{\quad}c@{\quad}l} \mathcal{V}= \mathrm {prox}_{\varOmega}(\mathcal{W})&\Leftrightarrow& \mathcal {V}-\mathcal{W}\in \partial\varOmega(\mathcal{V})=\bigl(\partial(h\circ\sigma ) (\mathcal{V}_{\langle n\rangle}) \bigr)^{\langle n\rangle} \\[4pt] &\Leftrightarrow&(\mathcal{V}-\mathcal{W})_{\langle n\rangle}\in \bigl(\bigl( \partial(h\circ\sigma) (\mathcal{V}_{\langle n\rangle})\bigr)^{\langle n\rangle} \bigr)_{\langle n\rangle} \\[4pt] &\Leftrightarrow&\mathcal{V}_{\langle n\rangle}= \mathrm {prox}_{h\circ\sigma }(\mathcal{W}_{\langle n\rangle}) \\[4pt] &\Leftrightarrow&\mathcal{V}= \bigl( \mathrm {prox}_{h\circ\sigma}(\mathcal {W}_{\langle n\rangle}) \bigr)^{\langle n\rangle}. \end{array} $$
(41)

 □

In particular for the case where \(\varOmega(\mathcal{W})=\lambda \Vert \sigma(\mathcal{W}_{\langle n\rangle})\Vert _{1}\) one has

$$ \textstyle \mathrm {prox}_{\lambda\Vert\sigma(\cdot_{\langle n\rangle})\Vert_1}(\mathcal{W})= \bigl( \boldsymbol{U}^{(n)} \operatorname{diag}(d_\lambda) {\boldsymbol{V}^{(n)}}^\top \bigr)^{\langle n\rangle} $$
(42)

where \((d_{\lambda})_{i}:=\max(\sigma_{i}(\mathcal{W}_{\langle n\rangle })-\lambda,0)\). Note that (42) corresponds to refolding the matrix obtained by applying to \(\mathcal{W}_{\langle n\rangle}\) the matrix shrinkage operator introduced in Cai et al. (2010).
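A numpy sketch of (42) with our own helper names: unfold along mode n, apply the matrix shrinkage operator to the singular values, and refold.

```python
import numpy as np

def unfold(A, n):
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def refold(M, n, shape):
    rest = [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape([shape[n]] + rest), 0, n)

def prox_mode_nuclear(W, n, lam):
    """prox of lam*||sigma(W_<n>)||_1, cf. (42): singular value shrinkage."""
    U, s, Vt = np.linalg.svd(unfold(W, n), full_matrices=False)
    return refold((U * np.maximum(s - lam, 0.0)) @ Vt, n, W.shape)

W = np.random.randn(2, 3, 4)
print(prox_mode_nuclear(W, 0, 0.5).shape)  # (2, 3, 4)
```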

5.4 Multiple module spaces

So far we considered the case where \(\mathbb{V}\) consisted solely of the module space \(\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N}}\). Next we focus on the case where \(\mathbb{V}\) is given by two modules, see Sect. 2.3. The following definition will turn out useful in the next section, where we deal with two distinct types of unknowns that are jointly regularized.

Definition 2

((n 1,n 2)-mode spectral penalty)

Assume a vector space \(\mathbb{V}\) obtained endowing \((\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{N_{1}}} )\times (\mathbb{R}^{J_{1}\times J_{2}\times\cdots\times J_{N_{2}}} ) \) with the canonical inner product

$$ \bigl\langle(\mathcal{V}_{1},\mathcal{V}_{2}),(\mathcal{W}_{1},\mathcal{W}_{2})\bigr\rangle:=\langle\mathcal{V}_{1},\mathcal{W}_{1}\rangle+\langle\mathcal{V}_{2},\mathcal{W}_{2}\rangle $$
(43)

and norm \(\Vert\cdot\Vert:=\sqrt{\langle\cdot,\cdot\rangle}\). Suppose that, for some \(n_{1}\in\mathbb{N}_{N_{1}}\) and \(n_{2}\in\mathbb{N}_{N_{2}}\), \(I_{n_{1}}=J_{n_{2}}=K\) and let \(S_{1}:=\prod_{p\in\mathbb{N}_{N_{1}}\setminus\{ n_{1}\}}I_{p},~S_{2}:= \prod_{p\in\mathbb{N}_{N_{2}}\setminus\{n_{2}\}}J_{p}\). A function \(\varOmega:\mathbb{V}\rightarrow\mathbb{R}\) is called an \((n_{1},n_{2})\)-mode spectral penalty if it can be written as:

$$ \varOmega(\mathcal{W})=h \bigl(\sigma \bigl( [\mathcal{W}_{1\langle n_1\rangle}, \mathcal{W}_{2\langle n_2\rangle} ] \bigr) \bigr) $$
(44)

where, for \(R=\min\{K,S_{1}+S_{2}\}\), \(h:\mathbb{R}^{R}\rightarrow\mathbb{R}\) is some symmetric gauge function and \(\sigma( [\mathcal{W}_{1\langle n_{1}\rangle}, \mathcal{W}_{2\langle n_{2}\rangle} ])\in[0,\infty)^{R}\) is the vector of singular values of the matrix \([\mathcal{W}_{1\langle n_{1}\rangle},\mathcal{W}_{2\langle n_{2}\rangle} ] \) in non-increasing order.

Note that we required that \(I_{n_{1}}=J_{n_{2}}=K\) since otherwise \(\mathcal {W}_{1\langle n_{1}\rangle}\) and \(\mathcal{W}_{2\langle n_{2}\rangle}\) cannot be concatenated.

Proposition 3

(Proximity operator of an (n 1,n 2)-mode spectral penalty)

Let \(\mathbb{V}\), Ω, \(S_{1}\) and \(S_{2}\) be defined as in Definition 2 and assume the SVD:

$$[\mathcal{W}_{1\langle n_1\rangle}, \mathcal{W}_{2\langle n_2\rangle} ]=\boldsymbol{U} \operatorname{diag}\bigl(\sigma \bigl( [\mathcal{W}_{1\langle n_1\rangle}, \mathcal {W}_{2\langle n_2\rangle} ] \bigr)\bigr) \boldsymbol{V}^\top. $$

Then we have

$$ \mathrm {prox}_{\varOmega}(\mathcal{W})= \bigl(\boldsymbol{Z}_1^{\langle n_1\rangle}, \boldsymbol{Z}_2^{\langle n_2\rangle} \bigr) $$
(45)

where

$$ \boldsymbol{Z}= \boldsymbol{U} \operatorname{diag}\bigl(\mathrm {prox}_h \bigl(\sigma \bigl( [ \mathcal{W}_{1\langle n_1\rangle}, \mathcal{W}_{2\langle n_2\rangle} ] \bigr) \bigr) \bigr) \boldsymbol{V}^\top $$
(46)

is partitioned into \([\boldsymbol{Z}_{1},\boldsymbol{Z}_{2}]\) where \(\boldsymbol{Z}_{1}\) is a \((K\times S_{1})\)-matrix and \(\boldsymbol{Z}_{2}\) is a \((K\times S_{2})\)-matrix.

Proof

Consider the unfolding operator on \(\mathbb{V}\), \(\cdot_{\langle n_{1} n_{2}\rangle}:(\mathcal{W}_{1},\mathcal{W}_{2})\mapsto [\mathcal{W}_{1\langle n_{1}\rangle},\mathcal{W}_{2\langle n_{2}\rangle}] \). Based on (43) it is not difficult to see that its adjoint corresponds to the operator given by

$$\cdot^{\langle n_1 n_2\rangle}:[\boldsymbol{W}_1,\boldsymbol{W}_2]\mapsto \bigl( \boldsymbol{W}^{\langle n_1\rangle}_1,\boldsymbol{W}^{\langle n_2 \rangle}_2 \bigr). $$

The chain rule for the subdifferential reads now \(\partial\varOmega(\mathcal{V})= (\partial(h\circ\sigma) (\mathcal{V}_{\langle n_{1}n_{2} \rangle} ) )^{\langle n_{1}n_{2} \rangle}\). In the same fashion as in (41) we now have:

$$ \begin{array} {r@{\quad}c@{\quad}l} \mathcal{V}=\mathrm {prox}_{\varOmega}(\mathcal{W})& \Leftrightarrow& \mathcal {V}-\mathcal{W}\in\partial\varOmega(\mathcal{V})=\bigl( \partial(h\circ\sigma ) (\mathcal{V}_{\langle n_1n_2\rangle})\bigr)^{\langle n_1n_2\rangle} \\[4pt] &\Leftrightarrow&(\mathcal{V}-\mathcal{W})_{\langle n_1n_2\rangle}\in \bigl(\bigl( \partial(h\circ\sigma) (\mathcal{V}_{\langle n_1n_2\rangle })\bigr)^{\langle n_1n_2\rangle} \bigr)_{\langle n_1n_2\rangle} \\[4pt] &\Leftrightarrow&\mathcal{V}_{\langle n_1n_2\rangle}= \mathrm {prox}_{h\circ \sigma}(\mathcal{W}_{\langle n_1n_2\rangle}) \\ [4pt] &\Leftrightarrow&\mathcal{V}= \bigl( \mathrm {prox}_{h\circ\sigma}(\mathcal {W}_{\langle n_1n_2\rangle}) \bigr)^{\langle n_1n_2\rangle}. \end{array} $$
(47)

 □

We note that Definition 2 and the result above can be easily generalized to more than two module spaces at the price of a more involved notation.

6 Transductive learning with higher order data

In this section we specialize problem (8) in order to perform transductive learning with partially observed higher order data.Footnote 8 It is assumed one has a set of N items with higher order representation \(\mathcal {X}^{(n)}\in\mathbb{R}^{D_{1}\times D_{2}\times\cdots\times D_{M}},~n\in \mathbb{N}_{N}\). These items are gathered in the input dataset \(\mathcal {X}\in\mathbb{R}^{D_{1}\times D_{2}\times\cdots\times D_{M}\times N}\) defined entry-wise by

$$x_{d_1d_2\cdots d_Mn}=x^{(n)}_{d_1d_2\cdots d_M}. $$

Associated to the n-th item there is a target vector \(y^{(n)}\). In particular we shall focus on the case where \(y^{(n)}\in\{-1,1\}^{T}\), so that \(\boldsymbol{Y}=[y^{(1)},y^{(2)},\ldots,y^{(N)}]\) is a (T×N)-matrix of binary labels. Entries of \(\mathcal{X}\) and \(\boldsymbol{Y}\) can be missing with

$$ \mathcal{S}_{\mathcal{X}}:= \bigl\{s^{p}=\bigl(d_1^p,d_2^p,\ldots,d_M^p,n^p\bigr):~p\in\mathbb{N}_P \bigr\} $$
(48)
$$ \mathcal{S}_{\boldsymbol{Y}}:= \bigl\{s^{q}=\bigl(t^q,n^q\bigr):~q\in\mathbb{N}_Q \bigr\} $$
(49)

being the index sets of the observed entries in, respectively, \(\mathcal {X}\) and \(\boldsymbol{Y}\). The goal is to infer the missing entries in \(\mathcal{X}\) and \(\boldsymbol{Y}\) simultaneously, see Fig. 3. We refer to this task as heterogeneous data completion to emphasize that the nature of \(\mathcal{X}\) and \(\boldsymbol{Y}\) is different. Note that this reduces to standard transductive learning as soon as T=1, M=1 and, finally, no entries in the input dataset are missing. Goldberg et al. (2010) considers the more general situation where T≥1 and entries of the input dataset can be missing. Here we further generalize this to the case where M≥1, that is, items admit a higher order representation. We also point out that the special case where T=1 and there is no labeling task defined (in particular, \(\mathcal{S}_{\boldsymbol{Y}}=\emptyset\)) corresponds to tensor completion as considered for the first time in Liu et al. (2009). Next we clarify our modelling assumptions.

6.1 Modelling assumptions

The heterogeneous data completion task is ill-posed in the sense that there are infinitely many ways to fully specify the entries of \(\mathcal {X}\) and \(\boldsymbol{Y}\).Footnote 9 Making the inference process feasible requires formulating assumptions for both the input dataset and the matrix of labels.

In this section we consider the following generative model. It is assumed that the input dataset \(\mathcal{X}\in\mathbb{R}^{D_{1}\times D_{2}\times\cdots\times D_{M}\times N}\) can be decomposed into

$$ \mathcal{X}=\tilde{\mathcal{X}}+\mathcal{E} $$
(50)

where \(\tilde{\mathcal{X}}\) is a rank-\((r_{1},r_{2},\ldots,r_{M},r_{M+1})\) tensor and \(\mathcal{E}\) is a remainder. In our setting the assumption considered in Goldberg et al. (2010) is solely that

$$ r_{M+1}\ll\min(N,J) $$
(51)

where

$$J=\prod_{j\in\mathbb{N}_{M}}D_j. $$

This amounts to regarding items as elements of \(\mathbb{R}^{J}\), hereby neglecting their multimodal structure. By contrast we further assume that

$$ r_{m}\ll\min(D_m,NJ/D_m ) \quad\text{for some }m\in\mathbb{N}_M. $$
(52)

This type of assumption is generally fulfilled in a number of cases where multimodal dependence arises; this occurs for instance when dealing with spectral images (Signoretto et al. 2011b). Additionally we suppose that \(y^{(n)}_{t}\), the label of the n-th pattern for the t-th task, is linked to \(\tilde{\mathcal{X}}^{(n)}\) via a latent variable model. More specifically, we let

$$ \tilde{y}^{(n)}_t=\bigl\langle\tilde{ \mathcal{X}}^{(n)},\mathcal {W}^{(t)}\bigr\rangle $$
(53)

where \(\mathcal{W}^{(t)}\) is the parameter tensor corresponding to the t-th task; we assume that \({y}^{(n)}_{t}\) is produced by assigning each binary entry, with alphabet {−1,1}, at random following the probability model

$$ p(y_{tn}|\tilde{y}_{tn},b_t)=1/ \bigl(1+\exp\bigl(-y_{tn}(\tilde{y}_{tn}+b_t) \bigr)\bigr). $$
(54)

Note that, in the latter, we considered explicitly a bias term b t . Let \(\mathcal{W}\) be that element of \(\mathbb{R}^{D_{1}\times D_{2}\times \cdots\times D_{M}\times T}\) defined as \(w_{d_{1}d_{2}\cdots d_{M}t}:=w^{(t)}_{d_{1}d_{2}\cdots d_{M}}\). Note that \(\mathcal{W}\) gathers the representers of the linear functionals associated to the T tasks. We now have that

$$ \tilde{\boldsymbol{Y}}=\mathcal{W}_{\langle M+1\rangle}\tilde{\mathcal {X}}_{\langle M+1\rangle}^\top $$
(55)

and it follows from (51) that

$$ \operatorname{rank}\bigl( \bigl[\tilde{\mathcal{X}}_{\langle M+1\rangle},\tilde{ \boldsymbol{Y}}^\top \bigr] \bigr)\leq r_{M+1}\ll\min(N,J+T ). $$
(56)

Remark 7

Notice that we have deliberately refrained from specifying the nature of \(\mathcal{E}\) in (50). Central to our approach is the way input features and target labels are linked together; this is specified by the functional relation (53) and by (54). One could interpret \(\mathcal{E}\) as noise, in which case \(\tilde{\mathcal{X}}\) can be regarded as the underlying true representation of the input observations. This is in line with errors-in-variables models (Golub and Van Loan 1980; Van Huffel and Vandewalle 1991). Alternatively one might regard \(\mathcal{X}\) as the true representation and assume that the target variable depends only upon the latent tensor \(\tilde{\mathcal{X}}\), having low multilinear rank \((r_{1},r_{2},\ldots,r_{M},r_{M+1})\).

6.2 Multi-task learning via soft-completion of heterogeneous data

We denote by \(\varOmega_{\mathcal{S}_{\mathcal{X}}}\) and \(\varOmega_{\mathcal{S}_{\boldsymbol{Y}}}\) the sampling operators (Sect. 2.2) defined, respectively, upon (48) and upon (49), and let \(z^{x}\in \mathbb{R}^{P}\) and \(z^{y}\in\mathbb{R}^{Q}\) be the corresponding measurement vectors. Let \(l_{x},l_{y}:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb {R}^{+}\) be some predefined convex loss functions, respectively, for the input data and the target labels. The empirical error functional we consider, namely

$$ f_{\lambda_0}(\tilde{\mathcal{X}},\tilde{\boldsymbol{Y}},b):=f^x( \tilde {\mathcal{X}})+\lambda_0 f^{y}(\tilde{\boldsymbol{Y}},b) $$
(57)

is composed by an error for the inputs,

$$ f^x: \tilde{\mathcal{X}} \mapsto \sum _{p\in\mathbb{N}_P}l_x\bigl((\varOmega_{\mathcal{S}_\mathcal{X}} \tilde {\mathcal{X}})_p,z^x_p\bigr) $$
(58)

and one for the latent variables and bias terms,

$$ f^{y}:(\tilde{\boldsymbol{Y}},b)\mapsto \sum_{q\in\mathbb{N}_Q}l_y \bigl(\bigl(\varOmega_{\mathcal{S}_{\boldsymbol{Y}}}(\tilde{\boldsymbol{Y}}+b\otimes1_N) \bigr)_q,z^y_q\bigr).$$
(59)

The heterogeneous completion task is then solved by means of the optimization problem

$$ \min_{(\tilde{\mathcal{X}},\tilde{\boldsymbol{Y}},b)\in\mathbb{V}}~f_{\lambda_0}(\tilde{\mathcal{X}},\tilde{\boldsymbol{Y}},b)+\sum_{m\in\mathbb{N}_{M}}\varOmega_m(\tilde{\mathcal{X}})+\varGamma(\tilde{\mathcal{X}},\tilde{\boldsymbol{Y}}) $$
(60)

where \(\mathbb{V}\) is obtained endowing the Cartesian product:

$$ \bigl(\mathbb{R}^{D_1\times D_2\times\cdots\times D_{M}\times N} \bigr)\times \bigl( \mathbb{R}^{T\times N} \bigr)\times\mathbb{R}^{T} $$
(61)

with the inner product

$$ \bigl\langle(\tilde{\mathcal{X}},\tilde{\boldsymbol{Y}},b),(\tilde{\mathcal{X}}',\tilde{\boldsymbol{Y}}',b')\bigr\rangle:=\langle\tilde{\mathcal{X}},\tilde{\mathcal{X}}'\rangle+\langle\tilde{\boldsymbol{Y}},\tilde{\boldsymbol{Y}}'\rangle+\langle b,b'\rangle. $$
(62)

In (60), for any \(m\in \{0 \}\cup\mathbb{N}_{M+1}\), \(\lambda_{m}>0\) is a user-defined parameter and \(\sum_{m\in\mathbb{N}_{M+1}}\lambda_{m}=1\). Problem (60) is convex since its objective is the sum of convex functions. It is a form of penalized empirical risk minimization with a composite penalty. The first M penalty terms

$$ \varOmega_m:\tilde{\mathcal{X}}\mapsto\lambda_m\Vert \tilde {\mathcal{X}}_{\langle m\rangle}\Vert_*,~m\in\mathbb{N}_{M} $$
(63)

reflect the modelling assumption (52). The (M+1)-th penalty

$$ \varGamma:(\tilde{\mathcal{X}},\tilde{\boldsymbol{Y}})\mapsto \lambda_{M+1}\bigl \Vert \bigl[ \tilde{\mathcal{X}}_{\langle M+1\rangle},\tilde{\boldsymbol{Y}}^\top \bigr] \bigr \Vert _* $$

ensures that the recovered matrix \([\hat{\tilde{\mathcal {X}}}_{\langle M+1\rangle},\hat{\tilde{\boldsymbol{Y}}}^{\top}]\) is approximately low rank, in line with Eq. (56). The contribution of the different terms is trimmed by the associated regularization parameters, which are either preselected or chosen according to some model selection criterion. In principle any meaningful pair of convex losses can be used to define the error functional. Here we follow Goldberg et al. (2010) and consider in (58) and (59)

$$ l_x(u,z)=\frac{1}{2}(u-z)^{2}, $$
(64a)
$$ l_y(u,z)=\log\bigl(1+\exp(-zu)\bigr). $$
(64b)

Note that (64a) is fully justified by assuming that \(\mathcal{E}\) in (50) has Gaussian entries. These losses ensure that the overall error functional (57) is smooth. This fact, along with the tools developed in Sect. 5, allows us to use Algorithm 1 as a template to devise a solution strategy.

6.3 Algorithm for soft-completion

In order to rely on Algorithm 1, we need to suitably design \(\mathbb{V}\), \(\bar{f}\), \(\bar{g}\) as well as \(\bar{\mathcal{C}}\). That is, we have to cast (60) into the prototypical formulation in (8). Consider the abstract vector space \(\mathbb{V}\) obtained endowingFootnote 10

$$ \bigl(\times_{m\in\mathbb{N}_{M+1}} \bigl\{ \mathbb{R}^{D_1\times D_2\times\cdots\times D_{M}\times N} \bigr\} \bigr)\times \bigl(\mathbb{R}^{T\times N} \bigr)\times \mathbb{R}^{T} $$
(65)

with the canonical inner product

$$ \bigl\langle(\tilde{\mathcal{X}}_{[1]},\ldots,\tilde{\mathcal{X}}_{[M+1]},\tilde{\boldsymbol{Y}},b),(\tilde{\mathcal{X}}'_{[1]},\ldots,\tilde{\mathcal{X}}'_{[M+1]},\tilde{\boldsymbol{Y}}',b')\bigr\rangle:=\sum_{m\in\mathbb{N}_{M+1}}\langle\tilde{\mathcal{X}}_{[m]},\tilde{\mathcal{X}}'_{[m]}\rangle+\langle\tilde{\boldsymbol{Y}},\tilde{\boldsymbol{Y}}'\rangle+\langle b,b'\rangle. $$
(66)

Once defined the set

$$ \bar{\mathcal{C}}:= \bigl\{(\tilde{\mathcal{X}}_{[1]},\ldots,\tilde{\mathcal{X}}_{[M+1]},\tilde{\boldsymbol{Y}},b):~\tilde{\mathcal{X}}_{[1]}=\tilde{\mathcal{X}}_{[2]}=\cdots=\tilde{\mathcal{X}}_{[M+1]} \bigr\}, $$
(67)

we can solve (60) by means of the problem

$$ \min_{(\tilde{\mathcal{X}}_{[1]},\ldots,\tilde{\mathcal{X}}_{[M+1]},\tilde{\boldsymbol{Y}},b)\in\bar{\mathcal{C}}}~\bar{f}(\tilde{\mathcal{X}}_{[1]},\ldots,\tilde{\mathcal{X}}_{[M+1]},\tilde{\boldsymbol{Y}},b)+\bar{g}(\tilde{\mathcal{X}}_{[1]},\ldots,\tilde{\mathcal{X}}_{[M+1]},\tilde{\boldsymbol{Y}}) $$
(68)

where

$$ \bar{f}(\tilde{\mathcal{X}}_{[1]},\ldots,\tilde{\mathcal{X}}_{[M+1]},\tilde{\boldsymbol{Y}},b):=\frac{1}{M+1}\sum_{m\in\mathbb{N}_{M+1}}f_{\lambda_0}(\tilde{\mathcal{X}}_{[m]},\tilde{\boldsymbol{Y}},b) $$
(69)
$$ \bar{g}(\tilde{\mathcal{X}}_{[1]},\ldots,\tilde{\mathcal{X}}_{[M+1]},\tilde{\boldsymbol{Y}}):=\sum_{m\in\mathbb{N}_{M}}\varOmega_m(\tilde{\mathcal{X}}_{[m]})+\varGamma(\tilde{\mathcal{X}}_{[M+1]},\tilde{\boldsymbol{Y}}). $$
(70)

Application of Propositions 2, 3 and 1 shows now that \(\mathrm {prox}_{\tau\bar{g}}\) in Step 3 of Main (Algorithm 1) reads:

$$ \mathrm{prox}_{\tau\bar{g}}(\tilde{\mathcal{X}}_{[1]},\ldots,\tilde{\mathcal{X}}_{[M+1]},\tilde{\boldsymbol{Y}},b)= \bigl(\mathrm{prox}_{\tau\varOmega_1}(\tilde{\mathcal{X}}_{[1]}),\ldots,\mathrm{prox}_{\tau\varOmega_M}(\tilde{\mathcal{X}}_{[M]}),\boldsymbol{Z}_1(\tilde{\mathcal{X}}_{[M+1]},\tilde{\boldsymbol{Y}})^{\langle M+1\rangle},\boldsymbol{Z}_2(\tilde{\mathcal{X}}_{[M+1]},\tilde{\boldsymbol{Y}})^{\top},b \bigr) $$
(71)

where \([\boldsymbol{Z}_{1}(\tilde{\mathcal{X}},\tilde{\boldsymbol{Y}}),\boldsymbol{Z}_{2}(\tilde{\mathcal{X}},\tilde{\boldsymbol{Y}})]\) is a partitioning of

$$ \boldsymbol{Z}(\tilde{\mathcal{X}},\tilde{\boldsymbol{Y}})= \boldsymbol{U} \operatorname{diag}\bigl( \mathrm {prox}_{\tau\lambda_{M+1}\Vert\sigma(\cdot)\Vert_1} \bigl( \bigl[\tilde{\mathcal{X}}_{\langle M+1\rangle}, \tilde{ \boldsymbol{Y}}^\top \bigr] \bigr) \bigr) \boldsymbol{V}^\top $$
(72)

consistent with the dimensions of \(\tilde{\mathcal{X}}_{\langle M+1\rangle}\) and \(\tilde{\boldsymbol{Y}}^{\top}\); the operator \(\mathrm {prox}_{\lambda \Vert\sigma(\cdot)\Vert_{1}}\) is defined as in (42), and finally U and V are, respectively, the left and right singular vectors of the matrix \([\tilde{\mathcal{X}}_{\langle M+1\rangle}, \tilde{ \boldsymbol{Y}}^{\top}]\). Note that (34) reads here:

(73)

For completeness we report in Appendix C the closed form of \(\nabla\bar{f}\). We summarize in Algorithm 2 the steps required to compute a solution. We stress that these steps are obtained by adapting the steps of our template procedure given in Algorithm 1.
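The computational primitive behind (72) is singular value soft-thresholding applied to a mode unfolding. A minimal sketch follows, with helper names of our own choosing; the column ordering of the unfolding is a convention that does not affect the singular values.

```python
import numpy as np

def unfold(T, mode):
    """Mode-m unfolding T_<m>: bring mode m to the front, flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def svt(A, thr):
    """Proximity operator of thr * (nuclear norm): soft-threshold the
    singular values of A and recompose the matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - thr, 0.0)) @ Vt
```

With these helpers, the matrix in (72) would be obtained, up to partitioning, as `svt(np.hstack([unfold(X, M), Y.T]), tau * lam)`, where mode M in 0-based indexing corresponds to the (M+1)-th mode.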

6.4 Hard-completion without target labels

The problem of missing or unknown values in multi-way arrays is frequently encountered in practice. Missing values due to data acquisition, transmission, or storage problems are for instance encountered in face image modelling by multilinear subspace analysis (Geng et al. 2011). Generally speaking, missing data due to faulty sensors are widespread in biosignal processing; Acar et al. (2011), in particular, considers an EEG (electroencephalogram) application where data are missing due to disconnections of electrodes. Another problem in Acar et al. (2011) arises from modelling time-evolving computer network traffic where cost-sensitivity imposes that only a subset of edges in the network are sampled.

Problem (60) assumes that data consist of both input and target measurements. In turn, the situation where we do not consider target measurements can be dealt with by the following special instance of (60):

$$ \hat{\tilde{\mathcal{X}}}=\arg\min_{\tilde{\mathcal{X}}\in \mathbb{R}^{D_1\times D_2\times\cdots\times D_{M}\times N}} f^x(\tilde{\mathcal{X}}) + \sum_{m\in\mathbb{N}_{M+1}} \varOmega_m(\tilde{\mathcal{X}}). $$
(74)

In the latter, f x penalizes the misfit of \(\tilde{\mathcal{X}}\) to the partially observed input data tensor; the composite penalty term favors solutions with small multilinear rank. The solution strategy illustrated in the previous section can be easily adjusted to deal with this situation. For certain practical problems, however, it is more desirable to complete the missing entries while requiring exact adherence to the data. In the following we use a shorthand notation for \(\mathbb{R}^{D_{1}\times D_{2}\times\cdots\times D_{M}\times N}\). Strict adherence to the observables can be accomplished by means of the following constrained formulation of tensor completion (Gandy et al. 2011; Tomioka et al. 2011; Liu et al. 2009; Signoretto et al. 2011b):

$$ \hat{\tilde{\mathcal{X}}}=\arg\min_{\tilde{\mathcal{X}}\in\mathbb{R}^{D_1\times D_2\times\cdots\times D_{M}\times N}}~\sum_{m\in\mathbb{N}_{M+1}}\varOmega_m(\tilde{\mathcal{X}})\quad\text{subject to}\quad\varOmega_{\mathcal{S}_\mathcal{X}}\tilde{\mathcal{X}}=z^x $$
(75)

where \(\mathcal{S}_\mathcal{X}\) is the sampling set (48).

6.5 Algorithm for hard-completion

As before, in order to devise a solution strategy for problem (75), we adapt Algorithm 1. Consider the abstract space obtained by endowing the product of M+1 copies of \(\mathbb{R}^{D_1\times D_2\times\cdots\times D_{M}\times N}\) with the canonical inner product. Let us introduce the constraint set:

(76)

It is clear that a solution of (75) is readily obtained from a solution of the following problem:

(77)

Note that, with respect to the prototypical problem (8), we now have that \(\bar{f}\) is identically zero and

$$ \bar{g}: (\tilde{\mathcal{X}}_{[1]},\tilde{ \mathcal{X}}_{[2]},\ldots,\tilde {\mathcal{X}}_{[M+1]})\mapsto \sum_{m\in\mathbb{N}_{M+1}} \varOmega_m(\tilde{ \mathcal{X}}_{[m]}). $$
(78)

Additionally, the projection onto the constraint set can be computed in closed form. To see this, let \(\tilde {\mathcal{X}}\vert_{\mathcal{B}}\) denote the tensor obtained from \(\tilde{\mathcal{X}}\in\mathbb{R}^{D_{1}\times\cdots\times D_{M}\times N}\) by setting to zero those entries that are not indexed by \(\mathcal {B}\subset\mathbb{N}_{D_{1}}\times\mathbb{N}_{D_{2}}\times\cdots\times \mathbb{N}_{D_{M}}\times\mathbb{N}_{N}\):

$$({\tilde{\mathcal{X}}\vert_{\mathcal{B}}})_{b_1b_2\cdots b_M c}:=\left \{ \begin{array}{l@{\quad}l} 0&\text{if~}(b_1,b_2,\ldots, b_M, c)\notin\mathcal{B}\\ [2pt] \tilde{x}_{b_1b_2\cdots b_M c}&\text{otherwise.} \end{array} \right . $$
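In code the restriction operator is a masked copy; encoding \(\mathcal{B}\) as a boolean array of the same shape is our own choice.

```python
import numpy as np

def restrict(X, mask):
    """The restriction X|_B: entries outside the sampling set B are zeroed."""
    return np.where(mask, X, 0.0)
```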

We have the following result, in which we consider the complement of the sampling set.

Proposition 4

(Projection onto the constraint set)

Let the abstract space and the constraint set be defined as above. Then, for any point of the space, its projection onto the constraint set can be written in closed form in terms of the restriction to the complement of the sampling set and of the adjoint of the sampling operator.

Proof

See Appendix D. □

Finally by Propositions 2 and 1 it follows that \(\mathrm {prox}_{\tau\bar{g}}\) in Step 3 of Main (Algorithm 1) reads:

(79)

The steps needed to compute a solution are reported in Algorithm 3, which is obtained by adapting Algorithm 1 to the present setting. With reference to the latter, note that InexactProxy is no longer needed. Indeed, since \(\bar{f}\) is identically zero, (32) boils down to computing the projection onto the constraint set. By Proposition 4, this can be done in closed form.
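Since the closed form of Proposition 4 is not reproduced above, the following sketch reflects one natural reading of it: the duplicated unknowns are averaged on the complement of the sampling set, and the measurements are re-imposed on the sampling set via the adjoint of the sampling operator. It should be treated as an assumption-laden illustration, not as the paper's verbatim formula.

```python
import numpy as np

def project_hard(copies, z, mask):
    """Sketch of the projection used in hard completion. `copies` is a
    list of M+1 equally shaped arrays, `mask` is boolean with
    mask.sum() == z.size, and z is assumed ordered as X[mask]."""
    W = sum(copies) / len(copies)   # consensus: average of the copies
    W[mask] = z                     # re-impose the measurements (adjoint op)
    return [W.copy() for _ in copies]
```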

Algorithm 1 ISM()

Fig. 2 An illustration of the (truncated) MLSVD

Fig. 3 An illustration of transductive learning with higher order data and multiple tasks. In this case each observation consists of a D 1×D 2 matrix and a T-dimensional target vector. Input observations are stacked along the third mode of the input dataset \(\mathcal{X}\) (top), whereas target observations are gathered in the matrix Y (bottom). Missing entries of \(\mathcal{X}\) and Y are indicated with purple tones (Color figure online)

Remark 8

Algorithm 3 is explicitly designed for the higher order case (M≥2). However it can be easily simplified to perform hard completion of matrices (M=1). In this case it is not difficult to see that one needs to evaluate only one proximity operator; consequently, duplication of the matrix unknown can be avoided.

7 Inductive learning with tensor data

For the inductive case the goal is to learn a predictive model based upon a dataset of N input-target training pairs

$$ \bigl\{\bigl(\mathcal{X}^{(n)},y^{(n)}\bigr)\in\mathbb{R}^{D_1\times D_2\times\cdots\times D_M}\times\{-1,1\}^{T}:~n\in\mathbb{N}_N\bigr\}. $$
(80)

Each item is represented by an M-th order tensor \(\mathcal{X}^{(n)}\) and is associated with a vector \(y^{(n)}\) of T labels; as before, we focus on the case where the labels take values in {−1,1}. For ease of notation we assume that we have the same input data across the tasks; in general, however, this need not be the case.

To understand the rationale behind the regularization approach we are about to propose, consider the following generative mechanism.

7.1 Modelling assumptions

For a generic item, represented by the tensor \(\mathcal{X}\), assume the decomposition \(\mathcal{X}=\tilde{\mathcal{X}}+ \mathcal{E}\) where

$$ \tilde{ \mathcal{X}}=\mathcal{S}_{\tilde{\mathcal{X}}} \times_1 \boldsymbol{U}_1 \times_2 \boldsymbol{U}_2 \times\cdots \times_M \boldsymbol{U}_M $$
(81)

where for any \(m\in \mathbb{N}_{M}\) and R m <D m , \(\boldsymbol{U}_{m}\in\mathbb {R}^{D_{m} \times R_{m}}\) is a matrix with orthogonal columns. Note that the core tensor \(\mathcal{S}_{\tilde{\mathcal{X}}}\in\mathbb {R}^{R_{1}\times R_{2}\times\cdots\times R_{M}}\) and \(\mathcal{E}\in\mathbb {R}^{D_{1}\times D_{2}\times\cdots\times D_{M}}\) are item-specific; on the other hand, for any \(m\in\mathbb{N}_{M}\), the full rank matrix U m spans a latent space relevant to the tasks at hand and common to all the input data. To be precise, we assume the target label y t was generated according to the probability model \(p(y_{t}|\tilde{y}_{t})=1/(1+ \exp(-y_{t}\tilde {y}_{t}))\), where \(\tilde{y}_{t}\) depends upon the core tensor \(\mathcal {S}_{\tilde{\mathcal{X}}}\):

$$ \tilde{y}_t=\langle\mathcal{S}_{\tilde{\mathcal{X}}}, \mathcal{S}_{{\mathcal{W}}^{(t)}} \rangle +b_t $$
(82)

where \(\mathcal{S}_{{\mathcal{W}}^{(t)}}\in\mathbb{R}^{R_{1}\times R_{2}\times\cdots\times R_{M}}\) and b t are task-specific unknowns. It is important to remark that, in this scenario, \(\mathcal{S}_{\mathcal {W}^{(t)}}\) comprises \(R_{1}R_{2}\cdots R_{M}\ll D_{1}D_{2}\cdots D_{M}\) parameters. In practice the common latent spaces as well as the core tensor \(\mathcal{S}_{\tilde{\mathcal{X}}}\) are both unknown, so that \(\mathcal{S}_{{\mathcal{W}}^{(t)}}\) cannot be estimated directly. However, if we further assume that

$$ \mathcal{R} \bigl(\mathcal{E}_{\langle m\rangle} \bigr)\perp\mathcal{R} (\boldsymbol{U}_{m} ) $$
(83)

for at least one \(m\in\mathbb{N}_{M}\), where we denote by \(\mathcal{R}(\boldsymbol{A})\) the range of a matrix A, one has the following.

Proposition 5

Assume (83) holds for \(m_{1}\in \mathbb{N}_{M}\). Then

$$ \tilde{y}_t = \bigl\langle \mathcal{X},{\mathcal{W}}^{(t)} \bigr\rangle +b_t $$
(84)

where \({\mathcal{W}}^{(t)}\in\mathbb{R}^{D_{1}\times D_{2}\times\cdots \times D_{M}}\) is the low multilinear rank tensor:

$$ { \mathcal{W}}^{(t)}=\mathcal{S}_{{\mathcal{W}}^{(t)}} \times_1 \boldsymbol{U}_1 \times_2 \boldsymbol{U}_2 \times\cdots \times_M \boldsymbol{U}_M. $$
(85)

Proof

See Appendix E. □

Note that the right-hand side of (84) is an affine function of the given higher order representation \(\mathcal{X}\) of the item, rather than an affine function of the unobserved core tensor \(\mathcal {S}_{\tilde{\mathcal{X}}}\), as in (82).
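Proposition 5 is easy to check numerically in the noiseless case \(\mathcal{E}=0\): with orthonormal factor matrices, the inner product of the core tensors equals that of the full tensors. The mode_prod helper and the chosen dimensions below are illustrative.

```python
import numpy as np

def mode_prod(T, U, mode):
    """Mode-m product T x_m U for U of shape (D_m, R_m)."""
    return np.moveaxis(np.tensordot(U, np.moveaxis(T, mode, 0), axes=(1, 0)),
                       0, mode)

rng = np.random.default_rng(0)
R, D = (2, 3, 2), (5, 6, 4)
S_X, S_W = rng.standard_normal(R), rng.standard_normal(R)
# factor matrices with orthonormal columns via reduced QR
Us = [np.linalg.qr(rng.standard_normal((d, r)))[0] for d, r in zip(D, R)]
X, W = S_X, S_W
for m, U in enumerate(Us):
    X, W = mode_prod(X, U, m), mode_prod(W, U, m)
# with E = 0, <X, W> equals <S_X, S_W>, in line with (84)
assert np.isclose(np.sum(X * W), np.sum(S_X * S_W))
```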

Remark 9

Equation (83) requires that \(\mathcal{E}\) does not “overlap” with the discriminative features; in practice this will not hold. One can only hope that \(\mathcal{E}\) does not “overlap too much” so that \(\tilde{y}_{t}\approx\langle\mathcal{X} , {\mathcal{W}}^{(t)} \rangle +b_{t}\).

Let now \({\mathcal{W}}\in\mathbb{R}^{D_{1}\times D_{2}\times\cdots\times D_{M}\times T}\) be the tensor that gathers all the T tasks:

$$ w_{d_1d_2\cdots d_Mt}:=w^{(t)}_{d_1d_2\cdots d_M} \quad\text{for~any~}t\in\mathbb{N}_T. $$
(86)

Additionally we consider the case where the tasks are related, as is common in the literature on multi-task learning, see e.g. Argyriou et al. (2007c). In our context this is accomplished by assuming that \({\mathcal {W}}_{\langle M+1\rangle}\) can be explained by a limited number of factors, namely that \({\mathcal{W}}_{\langle M+1\rangle}\) admits the thin SVD (Golub and Van Loan 1996):

$${\mathcal{W}}_{\langle M+1\rangle}=\boldsymbol{U}_{M+1}\boldsymbol{S}_{M+1}\boldsymbol{V}^\top_{M+1} $$

where for \(R_{M+1}\leq D_{M+1}\) one has \(\boldsymbol{U}_{M+1}\in\mathbb {R}^{D_{M+1}\times R_{M+1}}\), \(\boldsymbol{S}_{M+1}\in\mathbb{R}^{R_{M+1}\times R_{M+1}}\) and finally \(\boldsymbol{V}_{M+1}\in\mathbb{R}^{D_{1}D_{2}\cdots D_{M}\times R_{M+1}}\). Note that \({\mathcal{W}}\) can now be equivalently restated as the low multilinear rank tensor:

$$ {\mathcal{W}}=\mathcal{S}_{{\mathcal{W}}}\times_1 \boldsymbol{U}_1 \times_2\boldsymbol{U}_2 \times_3\cdots\times_{M+1}\boldsymbol{U}_{M+1} $$
(87)

for some core tensor \(\mathcal{S}_{{\mathcal{W}}}\in\mathbb {R}^{R_{1}\times R_{2}\times\cdots\times R_{M}\times R_{M+1}}\) and latent matrices U m , \(m\in\mathbb{N}_{M+1}\), defining subspaces that concentrate the discriminative relationship.

We conclude by pointing out that a supervised learning problem where data observations are represented as matrices, namely second order tensors, is a special case of our setting. Single classification tasks in this situation were studied in Tomioka and Aihara (2007). Similarly to the present setting, the latter proposes spectral regularization as a principled way to perform complexity control over the space of matrix-shaped models. Before discussing a solution strategy, we point out that the method can be easily generalized to regression problems by changing the loss function.

7.2 Model estimation

As for transduction we evaluate misclassification errors via the logistic loss; we therefore measure the empirical risk associated to the different tasks on the dataset (80) via:

$$ f:(\mathcal{W},b)\mapsto\sum_{n\in\mathbb{N}_N}\sum_{t\in\mathbb{N}_T}\log \bigl(1+\exp \bigl(-y^{(n)}_t \bigl( \bigl\langle\mathcal{X}^{(n)},\mathcal{W}^{(t)} \bigr\rangle+b_t \bigr) \bigr) \bigr) $$
(88)

where

$$ w_{d_1d_2\cdots d_Mt}:=w^{(t)}_{d_1d_2\cdots d_M}\quad \text{for~any~}t\in \mathbb{N}_T. $$
(89)

The pair \((\mathcal{W},b)\) is estimated based upon the following penalized empirical risk minimization problem:

$$ (\hat{\mathcal{W}},\hat{b})=\arg\min_{(\mathcal{W},b)}~f(\mathcal{W},b)+\sum_{m\in\mathbb{N}_{M+1}}\lambda_m \Vert\mathcal{W}_{\langle m\rangle} \Vert_* $$
(90)

where the underlying space is formed upon the module spaces \(\mathbb {R}^{D_{1}\times D_{2}\times\cdots\times D_{M}\times T}\) and \(\mathbb {R}^{T}\). The composite spectral penalty in (90) is designed to match the assumption that \(\mathcal{W}\) has low multilinear rank, as discussed above. Note that, in line with (87), besides performing complexity control, the regularization allows one to determine subspaces that concentrate discriminative information, without any additional feature extraction step.
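A compact sketch of evaluating the objective of (90), under the logistic risk and composite nuclear-norm penalty written above; the array layout and the names are our own.

```python
import numpy as np

def inductive_objective(W, b, Xs, Ys, lams):
    """W: (D_1, ..., D_M, T) stacked task tensors; b: (T,) biases;
    Xs: (N, D_1, ..., D_M) inputs; Ys: (N, T) labels in {-1, +1};
    lams: the M+1 penalty weights lambda_1, ..., lambda_{M+1}."""
    N, T = Ys.shape
    # <X^(n), W^(t)> + b_t for all n and t, cf. (84) and (88)
    scores = Xs.reshape(N, -1) @ W.reshape(-1, T) + b
    risk = np.sum(np.log1p(np.exp(-Ys * scores)))
    # composite spectral penalty: nuclear norms of all M+1 unfoldings
    pen = sum(lam * np.linalg.svd(np.moveaxis(W, m, 0).reshape(W.shape[m], -1),
                                  compute_uv=False).sum()
              for m, lam in enumerate(lams))
    return risk + pen
```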

7.3 Algorithm for inductive learning

Consider the abstract vector space obtained by endowing

$$ \bigl(\times_{m\in\mathbb{N}_{M+1}} \bigl\{ \mathbb{R}^{D_1\times D_2\times\cdots\times D_M\times T} \bigr\} \bigr) \times\mathbb{R}^{T} $$
(91)

with the canonical inner product. Additionally let

(92)

We solve (90) based upon the following problem:

(93)

where

(94)

and \(\bar{g}\) is the same as in (78). Its proximity operator is found in (79). The gradient of \(\bar{f}\) is given in Appendix C.

8 Experiments

8.1 Transductive learning

We begin by presenting experiments on transductive learning with multiple tasks, see Sect. 6.

8.1.1 Evaluation criterion and choice of parameters

As performance indicators we considered: (1) the capability of each procedure to predict the correct test labels; (2) the capability to interpolate the missing entries in the input data tensor. We measure the latter based upon the normalized root mean square error (NRMSE) on the complement of the sampling set:

(95)

where we denote by J c the cardinality of this complementary set and \(\tilde{\mathcal{X}}\) is as in (50). For both tensor soft-completion (tensor-sc) and matrix soft-completion (matrix-sc) we solve the optimization problem in (60) via the approach presented in Sect. 6.3. The parameter λ 0 is chosen in the set \(\{10^{-5},10^{-4},10^{-3},10^{-2},10^{-1},1\}\). For tensor-sc we set the parameters in the composite spectral penalty as

$$ \lambda_m=\frac{1}{M+1}\bar{\lambda} \quad\text{for any~}m\in\mathbb{N}_{M+1} $$
(96)

where M+1 is the order of the input data tensor and \(\bar{\lambda}\) is a varying parameter. For matrix-sc we take

$$ \lambda_m= \left \{ \begin{array}{@{}l@{\quad}l@{}} 0&\text{if~}m\in\mathbb{N}_M\\ \bar{\lambda}&\text{if~}m= M+1. \end{array} \right . $$
(97)

Notice that by doing so we essentially recover the matrix-based approach proposed in Goldberg et al. (2010, Formulation 1). We let \(\bar{\lambda}\) in both (96) and (97) vary on a wide range. More precisely we follow Goldberg et al. (2010) and Ma et al. (2011); for each value of λ 0 we compute the regularization path with respect to \(\bar{\lambda}\) beginning with a large value \(\bar{\lambda}^{(0)}\) and solving a sequence of problems with \(\bar{\lambda}^{(t)}=\eta_{\bar{\lambda}} \bar{\lambda}^{(t-1)}\) where as in Goldberg et al. (2010) we consider as decay parameter \(\eta_{\bar{\lambda}}=0.25\). At each step t we perform warm-starting, that is, we take as initial point the solution obtained at step t−1. We stop when \(\bar{\lambda}\leq10^{-6}\). For both tensor and matrix soft-completion we choose the values of parameters corresponding to the minimum fraction of mis-predicted labels of a hold-out validation set.
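The continuation scheme just described can be sketched as follows, with solve standing for any routine that returns a minimizer for a given penalty weight from a given starting point.

```python
def regularization_path(solve, lam_bar0, eta=0.25, lam_min=1e-6, init=None):
    """Warm-started path: solve a sequence of problems with geometrically
    decaying lambda_bar, starting each solve from the previous solution."""
    lam, x, path = lam_bar0, init, []
    while lam > lam_min:
        x = solve(lam, x)   # warm start from the previous solution
        path.append((lam, x))
        lam *= eta          # decay parameter eta_lambda = 0.25
    return path
```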

8.1.2 Implementation of the optimization algorithm

As termination criterion for the algorithm that finds a solution of (60) we use the relative increment (36), where we set \(\eta=10^{-4}\). With reference to Algorithms 2 and 3 we let \(\epsilon_0=10^{-2}\) and set \(\sigma=1.1\). We use a backtracking procedure to find an upper bound for the Lipschitz constant \(L_{\bar{q}}\) (see Remark 5 and references therein). Finally we let \(\tau=0.02/\tilde{L}_{\bar{f}}\) where \(\tilde{L}_{\bar{f}}\) is an upper bound for the Lipschitz constant \(L_{\bar{f}}\), also found via backtracking. As explained above, we compute the entire path with respect to the penalty parameter and use warm-starting at each step. At step t=0 the initialization of the algorithm is performed as follows. For both matrix-sc and tensor-sc we set \(b^{(0)}\) to be a vector of zeros. As for \(\mathcal {X}^{(0)}\) and \(\tilde{\boldsymbol{Y}}^{(0)}\), we proceed as follows. Let \(\mathcal {X}^{*}\) and \(\boldsymbol{Y}^{*}\) be obtained by setting to zero the unobserved entries of \(\mathcal{X}\) and Y respectively. Consider a partitioning [Z M+1,1,Z M+1,2] of the rank-1 approximation of the matrix \([\mathcal{X}^{*}_{\langle M+1\rangle},\boldsymbol{Y}^{*\top}]\) consistent with the dimensions of \(\mathcal{X}^{*}_{\langle M+1\rangle}\) and Y ∗⊤. Both matrix-sc and tensor-sc are then initialized according to

$$\mathcal{X}^{(0)}= \boldsymbol{Z}^{\langle M+1\rangle}_{M+1,1} \quad \text{and}\quad\tilde{\boldsymbol{Y}}^{(0)}=\boldsymbol{Z}_{M+1,2}^\top.$$

This approach is adapted from the method suggested in Goldberg et al. (2010) for their matrix soft completion procedure.
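A sketch of this initialization, assuming X_unf holds the unfolding \(\mathcal{X}^{*}_{\langle M+1\rangle}\); refolding Z1 back into tensor shape is omitted.

```python
import numpy as np

def rank1_init(X_unf, Y_star):
    """Best rank-1 approximation of [X*_<M+1>, Y*^T], partitioned back."""
    A = np.hstack([X_unf, Y_star.T])             # [X*_<M+1>, Y*^T]
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A1 = s[0] * np.outer(U[:, 0], Vt[0])         # rank-1 approximation
    Z1, Z2 = A1[:, :X_unf.shape[1]], A1[:, X_unf.shape[1]:]
    return Z1, Z2.T                              # X^(0) (to refold), Y~^(0)
```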

Algorithm 2 SoftCompletion()

Algorithm 3 HardCompletion()

8.1.3 Alternative approach

We also report results obtained using a linear kernel within LS-SVM classifiers applied to vectorized input data, see Suykens and Vandewalle (1999). We find models via the LS-SVMlab toolbox (De Brabanter et al. 2010). A classifier is built for each task independently, since these models do not handle vector-valued labels simultaneously. Although the presence of missing values in the context of LS-SVM has been studied (Pelckmans et al. 2005), the toolbox does not implement any strategy to handle this situation. For this reason we considered as input data the vectorized version of \(\boldsymbol{Z}^{\langle M+1\rangle}_{M+1,1}\), where \(\mathcal{X}^{*}\) and \(\boldsymbol{Z}^{\langle M+1\rangle}_{M+1,1}\) are as in the previous paragraph. We denote this approach as imp+ls-svm, where imp is a shorthand for imputation.

8.1.4 Soft completion: toy problems on multi-labeled data

For the first set of experiments we considered a family of synthetic datasets following the generative mechanism illustrated in Sect. 6.1. For each experiment we generated a core tensor \(\mathcal{S}\) in \(\mathbb{R}^{r_{1}\times r_{2}\times r_{3}}\) with entries i.i.d. from a normal distribution; for i∈{1,2} a matrix \(\boldsymbol{U}_{i}\in\mathbb{R}^{D\times r_{i}}\) was generated, also with entries i.i.d. from a normal distribution. Finally \(\boldsymbol{U}_{3}\in\mathbb {R}^{N\times r_{3}}\) was generated according to the same distribution. The input data tensor in \(\mathbb{R}^{D\times D\times N}\) was taken to be

$$\mathcal{X}=\mathcal{S} \times_1 \boldsymbol{U}_1 \times_2 \boldsymbol{U}_2 \times_3 \boldsymbol{U}_3+\sigma\,\mathcal{E}. $$

Next, for each task \(t\in\mathbb{N}_{T}\) we created a weight tensor \(\mathcal{W}_{t}\) and a bias b t , again with independent and identically normally distributed entries; successively, we produced \(\tilde{\boldsymbol{Y}}\) and Y according to (55) and the probability model (54). Finally, the sampling sets in (48) and (49) were created by picking uniformly at random a fraction ω of entries of the data tensor and the target matrix respectively. For matrix and tensor soft-completion we performed model selection by using 70 % of these entries for training; we measure the performance corresponding to each parameter pair on the hold-out validation set constituted by the remaining entries. We finally use the optimal values of the parameters and train with the whole set of labeled data; we then measure performance on the hold-out test set. Model selection for the linear LS-SVM classifiers was based on 10-fold cross-validation.
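For reference, a numpy sketch of this generative mechanism for the input tensor and the sampling mask; the rank and fraction values are placeholders, and label generation via (55) and (54) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, sigma, omega = 30, 100, 0.1, 0.5   # illustrative values
r1, r2, r3 = 3, 3, 3                     # illustrative multilinear rank
S = rng.standard_normal((r1, r2, r3))    # core tensor
U1 = rng.standard_normal((D, r1))
U2 = rng.standard_normal((D, r2))
U3 = rng.standard_normal((N, r3))
# X = S x_1 U1 x_2 U2 x_3 U3 + sigma * E
X = np.einsum('abc,ia,jb,kc->ijk', S, U1, U2, U3)
X += sigma * rng.standard_normal((D, D, N))
mask = rng.random(X.shape) < omega       # sampled entries
```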

The procedure above was considered for D=30, T=10, σ=0.1 and different values of the multilinear rank (r 1,r 2,r 3), N and ω. Table 2 concerns the fraction of unobserved labels (that is, test data) predicted incorrectly by the different procedures as well as NRMSEs. Note that the latter is not reported for the linear LS-SVM models as these approaches do not have an embedded imputation strategy. We report the mean (and standard deviation) over 10 independent trials where each trial deals with independently generated data and sample sets.

Table 2 Fractions of unobserved labels (test data) predicted incorrectly and NRMSEs. tensor-sc exploits low rank assumptions along all the modes; in contrast, matrix-sc works only with the third mode unfolding (thereby ignoring the two-way nature of each data observation). tensor-sc generally performs comparably to or better than matrix-sc and imp+ls-svm in terms of misclassification errors; tensor-sc generally outperforms matrix-sc on the reconstruction of the underlying input data tensor \(\tilde{\mathcal {X}}\), see (50)

Remark 10

According to Table 2, tensor-sc generally performs comparably to or better than matrix-sc in terms of misclassification errors; moreover, the experiments show that the former leads to more favorable results for the reconstruction of the underlying input data tensor \(\tilde{\mathcal{X}}\), see (50).

8.1.5 Multi-class categorization via soft-completion: Olivetti faces

In this experiment we deal with the classification of pictures of faces of different persons; we compared tensor-sc with matrix-sc and imp+ls-svm as before. We considered the first five persons of the Olivetti database. For each person, ten different 56×46 grayscale pictures are available; the input dataset consists therefore of a (56×46×50)-tensor of which 65 % of the entries were artificially removed. For each input image a vector-valued target label was created with one-vs-one encoding. That is, if c i ∈{1,2,…,5} denotes the class (person) indicator for the i-th image we set y (i)∈{−1,1}5 to be

$$\bigl(y^{(i)}\bigr)_j= \left \{ \begin{array}{@{}l@{\quad}l@{}} -1,&\mathrm{if~}j\neq c_i\\ 1,&\mathrm{otherwise.} \end{array} \right . $$

The same type of encoding was also used within imp+ls-svm. For these classifiers we considered as input the vector unfoldings of the images. For tensor-sc the task is to simultaneously complete the (56×46×50)-tensor and the (5×50)-matrix of target vectors. Likewise, matrix-sc, obtained from (60) by setting all but the last regularization parameter equal to zero, treats each image as a vector by considering only the last matrix unfolding. In all cases we use 25 images for training and validation and the remaining ones for testing. As for the toy problems above, the spectral penalty parameters within matrix-sc and tensor-sc were set according to (97) and (96). We compute the regularization path corresponding to the free parameter \(\bar{\lambda}\). In all cases the selection of parameters is driven by the misclassification error. For the LS-SVM models we used ten-fold CV. For matrix-sc and tensor-sc we chose parameters according to a hold-out set. More precisely, of the 25 images we use 17 for actual training and the remaining 8 for validation; we then use all of the 25 images once the set of optimal parameters has been found. Following this procedure we performed five trials, each of which was obtained from a random splitting of training and test data and a random mask of missing input entries. For each method we report in Table 3 the cumulative confusion matrix obtained by summing up the confusion matrices associated to the test (unlabeled) data found in the different trials. Table 4 summarizes the performance of the different methods in terms of classification accuracy and feature imputation.

Table 3 Cumulative confusion matrices for the different procedures. tensor-sc leads to better classification accuracy in comparison to the alternative techniques
Table 4 Mean and standard deviation of misclassification error rates and NRMSE of features imputation for the Olivetti dataset

Remark 11

Note that, since the choice of parameters was driven by misclassification errors, the objective of the approach is the correct completion of the labeling. Therefore the estimated input features \(\hat {\mathcal{X}}\) in (60) should be interpreted as carriers of latent discriminative information rather than a reconstruction of the underlying images. With reference to Remark 7 we interpret here \(\mathcal{X}^{(i)}\) as the true representation of the i-th image, \(\tilde{\mathcal{X}}^{(i)}\) as latent discriminative features and \(\hat{\mathcal{X}}^{(i)}\) as their estimates.

Remark 12

Unlike in the toy problems above, for which \(\tilde{\mathcal{X}}\) was available, the NRMSEs in Table 4 are computed upon the actual set of images \(\mathcal{X}\).

Figure 4 illustrates the retrieval of latent features for some unlabeled (test) pictures. Notably, the latent features obtained by tensor-sc look like over-smoothed images, whereas those obtained by matrix-sc generally look noisier. In particular, the cases for which matrix-sc incorrectly assigns labels often correspond to situations where the latent features do not capture person-specific traits (first and second rows). Wrongly assigned labels also correspond to cases where latent features are close to those corresponding to a different person (last two rows).

8.1.6 Hard completion: toy problems

Here we test the capability of hard completion (Sect. 6.4), denoted by tensor-hc, to recover the missing entries of a partially specified input data tensor of order 3. We compare to the case where the higher order structure is neglected; in this case the input data tensor is flattened into its third matrix unfolding and one performs matrix completion of the arising second order tensor (matrix-hc). Note that, with reference to (75), this is equivalent to retaining the tensor structure and setting all but the last of the regularization parameters \(\lambda_m\), m∈{1,2,3}, to zero. In either case we compute solutions via Algorithm 3, taking into account the simplifications that occur in the second order case. For each experiment we generated a core tensor \(\mathcal{S}\) in \(\mathbb {R}^{r_{1}\times r_{2}\times r_{3}}\) with entries i.i.d. according to the uniform distribution on the interval [−0.5,0.5], denoted as U([−0.5,0.5]); for i∈{1,2,3} a matrix \(\boldsymbol{U}_{i}\in \mathbb{R}^{D\times r_{i}}\) was generated, also with entries i.i.d. from U([−0.5,0.5]). The input data tensor in \(\mathbb{R}^{D\times D\times D}\) was taken to be \(\mathcal{X}=\tilde{\mathcal{X}}+ \sigma\,\mathcal{E}\) where

$$\tilde{\mathcal{X}}=\mathcal{S} \times_1 \boldsymbol{U}_1 \times_2 \boldsymbol{U}_2 \times_3 \boldsymbol{U}_3 $$

and \(\mathcal{E}\) is a (D×D×D)-tensor with independent normally distributed entries. Finally, the sampling sets were created by picking uniformly at random a fraction ω of the entries of the data tensor. We took D=50 and considered different values of σ, the multilinear rank (r 1,r 2,r 3) and ω. In Table 5 we report the mean (and standard deviation) of NRMSEs and execution times over 10 independent trials, where each trial deals with independently generated data and sample sets. Note that keeping σ fixed across different values of the multilinear rank gives different noise levels. We therefore report in the table the approximate signal-to-noise ratio (SNR) obtained on the different trials. The latter is defined as \(\mathit{SNR}:=\operatorname{var}(\tilde{x})/(\sigma ^{2})\) where \(\tilde{x}\) denotes the generic entry of \(\tilde{\mathcal {X}}\) and \(\operatorname{var}\) denotes the empirical variance.
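The SNR above translates directly into code; for the NRMSE we show one common variant, since the exact normalization in (95) is not reproduced here and should be read as an assumption.

```python
import numpy as np

def snr(X_tilde, sigma):
    """SNR := var(x_tilde) / sigma^2, with var the empirical variance."""
    return np.var(X_tilde) / sigma ** 2

def nrmse(X_hat, X_tilde, mask_c):
    """Root mean square error on the complementary set, normalized by the
    range of the underlying tensor (one common convention)."""
    err = X_hat[mask_c] - X_tilde[mask_c]
    return np.sqrt(np.mean(err ** 2)) / (X_tilde.max() - X_tilde.min())
```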

Table 5 NRMSEs and execution times for tensor-hc and matrix-hc; the latter completes the given tensor based upon low rank assumption on the third mode only. Note that the computational complexity of tensor-hc is roughly 3 times that of matrix-hc, in line with Remark 14

Remark 13

The experimental evidence suggests that tensor completion performs better than matrix completion (performed along the third mode) when either no n-rank dominates the others or when the 3-rank is higher. This observation holds across different noise levels and fractions of entries used in the reconstruction.

Remark 14

As Table 5 shows, for the third order case the computational complexity of our implementation of tensor completion is roughly three times that of matrix completion. This is expected since the computational load is determined by Steps 5 and 6 which, in turn, involve a number of iterations equal to the order of the tensor (M+1). This duplication is no longer needed for M=1 (second order case), see Remark 8.

8.1.7 Inpainting of color images via hard completion

Here we apply hard-completion to the inpainting of 8-bit RGB color images, each of which is represented as a third order tensor. The first two modes span the pixel space; the third mode contains information from the three channels. For each image we remove entries in all the channels simultaneously (first three rows of Fig. 5), or consider the case where entries are missing at random (last two rows of Fig. 5). We then solve the problem in Eq. (75) with the sampling set indexing the non-missing pixels. A solution is found via Algorithm 3. As termination criterion we use the relative increment (36), where we set \(\eta=10^{-7}\). With reference to Algorithm 3 we let \(\tau=10^{4}\) and λ m =1 for m∈{1,2,3}. Figure 5 reports the original pictures, the input data tensor and the outcome of our algorithm.

8.2 Inductive learning

We report here experiments on inductive learning with multiple tasks, see Sect. 7. As performance indicator we considered the capability of each procedure to predict the correct test labels. We compared LS-SVM models with linear kernel (lin ls-svm) and naive Bayes classifiers (naive Bayes) (Domingos and Pazzani 1997) with models obtained by solving (90). With reference to the latter, we set the parameters in the composite spectral penalty as follows. In one case, referred to as log mlrank, we set

$$ \lambda_m=\frac{1}{M+1}\bar{\lambda} \quad\text{for any~}m\in\mathbb{N}_{M+1} $$
(98)

where M is the order of the input data and \(\bar{\lambda}\) is a varying parameter. Alternatively we take

$$ \lambda_m= \left \{ \begin{array}{@{}l@{\quad}l@{}} 0&\text{if~}m\in\mathbb{N}_M\\ \bar{\lambda}&\text{if~}m= M+1. \end{array} \right . $$
(99)

This latter approach, referred to as log rank, corresponds to leveraging only the interdependence between tasks; structural assumptions over the input features are not exploited. In either case we use Algorithm 4 and compute the entire regularization path with respect to \(\bar{\lambda}\). In our experiments the choice of this parameter was driven by the misclassification rate. For log mlrank and log rank we use a hold-out validation set. For LS-SVM we found models via the LS-SVMlab toolbox (De Brabanter et al. 2010) and considered ten-fold CV for model selection.

Algorithm 4 InductiveLearning()

Fig. 4 Olivetti faces: the task is to simultaneously complete the input features, a (56×46×50)-tensor (ten 56×46 images for each one of the five persons), and the (5×50)-matrix of five-dimensional target vectors, where five relates to the one-versus-one encoding used for the five classes. Here we report the reconstructed features and assigned labels for 6 unlabeled 56×46 images; see also Tables 3 and 4. Wrong labels are reported in red. Estimated input features should be interpreted as carriers of latent discriminative information rather than a reconstruction of the underlying images, see Remark 11. In particular, the cases for which matrix-sc incorrectly assigns labels often correspond to situations where latent features do not capture person-specific traits (first and second rows). Wrongly assigned labels also correspond to cases where latent features are close to those corresponding to a different person (last two rows)

Fig. 5 Hard-completion for the inpainting of 8-bit RGB images. Here we report the result of five inpainting experiments. Each image is represented as a (300×300×3)-tensor; the first two modes span the pixel space; the third mode contains information from the three channels. Original images (left column); given pixels (middle column); reconstruction by Algorithm 3 (right column)

8.2.1 Multi-labeled data toy problems

We begin with a set of artificially generated problems. For each trial we considered an ensemble of T models represented by a (D×D×T)-tensor \({\mathcal{W}}\) with multilinear rank (r,r,r) and a vector of bias terms \(b\in\mathbb{R}^{T}\) with normal entries; \({\mathcal{W}}\) was obtained by generating a core tensor \(\mathcal{S}_{{\mathcal{W}}}\) in \(\mathbb{R}^{r\times r\times r}\) with entries i.i.d. from a normal distribution; for i∈{1,2} a matrix \(\boldsymbol{U}_{i}\in\mathbb {R}^{D\times r}\) and \(\boldsymbol{U}_{3}\in\mathbb{R}^{T\times r}\) were also generated with entries i.i.d. from a normal distribution. Finally we set

$$ {\mathcal{W}}=\mathcal{S}_{{\mathcal{W}}} \times_1 \boldsymbol{U}_1 \times_2 \boldsymbol{U}_2 \times_3 \boldsymbol{U}_3. $$
(100)

We then created a dataset of N input-output pairs as follows. For \(n\in\mathbb{N}_{N}\) we let \(\mathcal{X}^{(n)}\) be a (D×D)-matrix with normal entries; for any \(t\in\mathbb{N}_{T}\) we let the corresponding label \(y^{(n)}_{t}\) be a Bernoulli random variable with alphabet {1,−1} and success probability \(1/(1+\exp (-\tilde{y}^{(n)}_{t}))\); the latent variable \(\tilde{y}^{(n)}_{t}\) was taken to be

$$\tilde{y}^{(n)}_t=\bigl\langle{\mathcal{W}}^{(t)}, \mathcal{X}^{(n)}\bigr\rangle+b_t $$

where for any \(t\in\mathbb{N}_{T}\), \(\mathcal{W}^{(t)}\) is as in (86).

We set r=2 and considered the procedure above for different values of T and D. For a fixed value of T we use N=100T pairs for training. Note that N refers to the whole set of tasks; in turn, the N pairs were distributed uniformly at random across the different tasks. As such, there are on average 100 input-output pairs per task; in this way, the amount of training information is kept constant as T varies. For each setting we perform 10 trials. For log mlrank and log rank we chose \(\bar{\lambda}\) based upon a validation set of 30 % of the pairs, selected at random within the N observations. Results in terms of misclassification rate on a test set are reported in Table 6.

Table 6 Fractions of unobserved labels (test data) predicted incorrectly

Remark 15

The experiments show that leveraging the relations between tasks (log mlrank and log rank) significantly improves results; the performance of ls-svm models, which are trained independently, remains the same as T is increased. This is to be expected since the amount of training data per task is kept approximately the same across different values of T.

Remark 16

A comparison between log mlrank and log rank reveals that exploiting structural assumptions over the input features pays off; this is seen to be the case especially when the number of tasks is small or the feature dimensionality is high.

8.2.2 Multiclass classification of Binary Alphadigits

In this experiment we considered discrimination between handwritten digits. We focused on the Binary Alphadigits dataset, made up of digits from “0” through “9” followed by capital letters from “A” through “Z” (English alphabet). Each digit is represented by 39 examples, each of which consists of a binary 20×16 matrix. In log mlrank the matrix shape of each digit is retained; log rank and lin ls-svm treat each input pattern as a vector of dimension 320. We consider problems with different numbers of classes. As before we used one-vs-one encoding (see Sect. 8.1.5); correspondingly, the number of tasks T is equal to the number of classes. In each case we train models upon N training examples uniformly distributed across the considered classes; we chose N so that approximately 10 examples per class are used for training (and model selection) whereas the remaining examples are retained for testing. For each setting we average results over 20 trials, each of which is obtained from a random splitting of training and test data. Due to the scarcity of training patterns an error occurs when running NaiveBayes; therefore we could not obtain results for this approach. Tables 7 and 8 report results for different values of T. For T=2 we considered a subset of arbitrary binary problems; for T>2 we considered classes of digits in their given order. In general, for T≤4, log mlrank seems to perform slightly better than log rank (Tables 7 and 8). As for the multi-label example above, there is strong evidence that enforcing task relationships via the regularization mechanism in log mlrank and log rank improves over the case where tasks are considered independently (Table 8).

Table 7 Fractions of misclassified test digits, multiple problems, T=2
Table 8 Fractions of misclassified test digits, T>2

9 Concluding remarks

In this paper we have established a mathematical framework for learning with higher order tensors. The transductive approach we considered is especially useful in the presence of missing input features. The inductive formulation, on the other hand, allows one to predict labels associated to input items unavailable during training. Both these approaches work by simultaneously identifying subspaces of highly predictive features without the need for a preliminary feature extraction step. This is accomplished by leveraging relationships both across tasks and within the higher order representation of the (possibly very high dimensional) input data.

A drawback of our methods is their restriction to linear models only. An interesting line of future research concerns the extension to a broader class of models. For certain problems of interest one could perhaps extend results for matrix representer theorems (Argyriou et al. 2009), used within multi-task learning (Argyriou et al. 2008). In the setting of Argyriou et al. (2008), different but related learning problems are associated to task vectors belonging to a feature space associated to a user-defined kernel function. As a special case, i.e. when the feature mapping is the identity, the feature space corresponds to the input space where data are originally represented. In this case one obtains models that are linear in the data, as in this paper. In general, one can show that these (possibly infinite dimensional) task vectors lie within the span of the mapped data associated to all the tasks (Argyriou et al. 2009). In the context of this paper, we assumed low multilinear ranks, hereby leveraging the algebraic structure of data in the input space. This is rather crucial, especially when the higher order data entail missing observations and part of the learning problem consists of completing the data. In contrast, Argyriou et al. (2008) exploits the geometry of the feature space rather than that of the original input space. Therefore, although nonlinear extensions are certainly possible (and desirable), they will need working assumptions attached to the geometry of the feature space rather than that of the input space. For some cases, a viable alternative is to conceive a mapping that preserves properties of the higher order data in the input space. This is the spirit of the Grassmannian kernels proposed in Signoretto et al. (2011a).

It is also important to note that in our experiments we either considered a single spectral penalty (as in (97) and (99)) or a composite regularizer where all the penalties were equally enforced ((96) and (98)). Although uniform weights are shown to work in practice, this black-and-white setting is clearly restrictive: ideally one would perform model selection to search for the optimal combination of parameters. Unfortunately this comes at the price of substantially increasing the computational burden.