1 Introduction

In this paper, we study theoretical and computational properties of learning a multi-output function using kernel methods. This problem has been considered by Micchelli and Pontil (2005) where the framework of vector-valued reproducing kernel Hilbert spaces was adopted and the representer theorem for Tikhonov regularization was generalized to the vector-valued setting. Our work can be seen as an extension of the work of Micchelli and Pontil (2005) aimed in particular at:

  • Investigating the application of spectral regularization schemes (Lo Gerfo et al. 2008) to multi-output learning problems.

  • Establishing consistency and finite sample bounds for Tikhonov regularization as well as for the other methods in the setting of vector-valued learning.

  • Discussing the problem of multi-category classification within the vector-valued framework as well as Bayes consistency of spectral regularization methods.

A main outcome of our study is a general finite sample bound for spectral methods that leads to consistency. Moreover, we show in theory and practice how iterative methods can be computationally much more efficient than Tikhonov regularization. As a byproduct of our analysis, we discuss theoretical and practical differences among vector-valued learning, multi-task learning and multi-category classification.

Classical supervised learning focuses on the problem of estimating functions with scalar outputs: a real number in regression and one of two possible labels in binary classification. The starting point of our investigation is the observation that in many practical problems it is convenient to model the object of interest as a function with multiple outputs. In machine learning this problem typically goes under the name of multi-task or multi-output learning and has recently attracted considerable attention. It is useful to distinguish at least two classes of problems involving multiple output functions. The first class, which we might call multi-task learning, corresponds to the situation in which we have to solve several standard scalar learning problems that we assume to be related, so that we can expect to obtain a better solution if we attempt to solve them simultaneously. A practical example is the problem of modelling the buying preferences of several people based on previous purchases (Evgeniou et al. 2005). People with similar tastes tend to buy similar items and their buying histories are probably related. The idea is then to predict the consumer preferences for all individuals simultaneously by solving a multi-output learning problem. Each consumer is modelled as a task and their previous purchases form the corresponding training set. The second class of problems corresponds to learning vector-valued functions and is better described as a supervised learning problem where the outputs are vector-valued. A practical example is the estimation of the velocity field of an incompressible fluid from scattered spatial measurements (see the experiments section).

The two problems are clearly related. Indeed, we can view tasks as components of a vector-valued function or, equivalently, view learning each component of a vector-valued function as one of many scalar tasks. Nonetheless, there are also differences that distinguish the two problems, both from a practical and a theoretical point of view. For example, in multi-task learning the input points for each task (component) can be different, they can be represented by different features and the sample size might vary from one task to the other. In particular, each task can be sampled in a different way so that, in some situations, we can essentially augment the number of effective points available for each individual task by assuming that the tasks are highly correlated. This effect does not occur when learning vector fields (see Fig. 1), where each component is sampled at the same input points. Since the sampling procedures are somewhat different, the error analyses for multi-task and vector-valued learning are also different. The latter case is closer to the scalar setting, whereas in the multi-task case the situation is more complex: one might have different cardinalities for the various tasks or be interested in evaluating performance for each task individually.

Fig. 1  Comparison of a multi-task and a vector-valued learning problem. We consider a simplified situation in which there are only two tasks/components and they are the same function. In the multi-task case, the tasks can be sampled at different input points, whereas in the vector-valued case it is natural to assume all the components to be sampled at the same input points

Several recent works have considered multi-output learning, especially multi-task learning, and proposed a variety of approaches. Starting from the work of Caruana (1997), related ideas have been developed in the context of regularization methods (Argyriou et al. 2008b; Jacob et al. 2008), Bayesian techniques such as Gaussian processes (Boyle and Frean 2005; Chai et al. 2009; Alvarez et al. 2009), collaborative filtering (Abernethy et al. 2009) and online sequential learning (Abernethy et al. 2007). The specific problem of learning a vector-valued function has received considerably less attention in machine learning. In statistics we mention the Curds & Whey method of Breiman and Friedman (1997), Reduced Rank Regression (Izenman 1975), Filtered Canonical y-variate Regression (van der Merwe and Zidek 1980) and Partial Least Squares (Wold et al. 1984). Interestingly, a literature on statistical techniques for vector field estimation exists in the context of geophysics under the name of kriging (or co-kriging) (Stein 1999), which is closely related to the kernel methods we discuss in this paper. It is worth noting that neural network algorithms (Bishop 2006) are naturally suited to vector-valued problems. In fact, they amount to solving a least squares problem in a multi-layer fashion, and the relation among the outputs is implicitly enforced when jointly learning the weights. More recently, a few attempts to extend machine learning algorithms from the scalar to the vector setting have also been made. For example, some extensions of Support Vector Machines have been proposed (Brudnak 2006; Vazquez and Walter 2003). A study of vector-valued learning with kernel methods was started by Micchelli and Pontil (2005), where regularized least squares is analyzed from the computational point of view. The error analysis of vector-valued Tikhonov regularization is given by Caponnetto and De Vito (2005, 2007). Finally, we note that the use of vector-valued kernels for multi-category classification has not been analyzed yet, though we will see that it is implicit in methods such as multi-category Support Vector Machines (Lee et al. 2004). Algorithms for multi-category classification include so-called single machine methods, as well as techniques that reduce the multi-class problem to a family of binary problems, e.g. one-versus-all and all-versus-all (see the papers by Tewari and Bartlett 2005 and Rifkin and Klautau 2004 for discussions and references). In our study we consider the results by Tewari and Bartlett (2005) and Rifkin and Klautau (2004) as starting points for theoretical and practical considerations. The former work shows that naïve extensions of binary classification algorithms to multiple classes might lead to inconsistent methods, and provides sufficient conditions for a multi-class method to be Bayes consistent (see also the work by Zhang 2004). The latter work presents a thorough experimental analysis, supporting the fact that a finely tuned one-versus-all (OVA) scheme yields performance comparable to or better than that of more complicated approaches in most practical situations.

The main contribution of this paper is a complete analysis of a class of regularized kernel methods for multi-output learning. Some of the theoretical results are specific for vector-valued learning as a natural extension of the classical scalar setting, but many of the computational ideas we discuss apply to general multi-output problems. The description and motivation of the considered algorithms differ from those of penalized empirical risk algorithms. Each algorithm has a natural definition in terms of filtering the spectrum (i.e. the collection of eigenvalues) of the kernel matrix, designed to suppress contributions corresponding to small eigenvalues. This justifies calling these methods spectral regularization. The rationale behind them is the connection between learning theory and regularization of ill-posed problems (De Vito et al. 2005) and, more generally, the results showing the relation between stability and generalization (Poggio et al. 2004; Bousquet and Elisseeff 2002). Indeed, our main result is an excess risk bound that ensures generalization properties and consistency of spectral regularization. Although we perform the theoretical analysis in a unified framework, the specific form of the filter enters the bound.

The various methods have different computational properties. As we show, both in theory and in practice, iterative algorithms, which can be seen as extensions of L2 boosting (Bühlmann and Yu 2002), can outperform Tikhonov regularization from the computational point of view while preserving its good learning performance. The complexity analysis we provide takes into account the specific form of the kernel as well as the regularization parameter choice step. The empirical performance of spectral filtering methods is tested on multi-task and vector-valued learning problems with both toy and real data.

Finally, we give a theoretical discussion of the application of vector field learning techniques in the context of multi-category classification. We show how to formulate a multi-class problem as a vector-valued learning problem and discuss the role played by the coding strategy. The difference between a one-versus-all approach and an approach where the correlation among classes is taken into account is clear within the vector-valued framework. Bayes consistency of spectral filtering methods follows easily from the aforementioned excess risk bound. Some of the material in this paper has been presented by Baldassarre et al. (2010). The conference paper contains only the discussion on vector field learning, with no proofs and limited experimental analysis.

The plan of the paper is as follows: in Sect. 2 we set our notation and recall some basic concepts; in Sect. 3 we present the class of algorithms under study and the finite sample bound on the excess risk; in Sect. 4 we discuss computational issues in relation to the choice of the kernel; in Sect. 5 we illustrate applications to multi-class classification and multi-task learning. Experimental analysis is conducted in Sect. 6 and we conclude in Sect. 7 by proposing some future work. Two appendices complete the paper: the first contains an extensive review of kernels for multi-output learning, the second is devoted to the proofs of the main results.

2 Learning vector-valued functions with kernels: basic concepts

We start by setting the notation and recalling some elementary facts. We consider vector-valued learning and present the setup of the problem, as well as the basic notions behind the theory of vector-valued reproducing kernels.

2.1 Supervised learning as function approximation

The problem of supervised learning amounts to inferring an unknown functional relation given a finite training set of input-output pairs \(\mathbf{z}=\{(x_{i},y_{i})\}_{i=1}^{n}\) that are randomly sampled and noisy. More precisely, the training points are assumed to be independently and identically distributed according to a fixed, but unknown, probability measure \(\rho(x,y)=\rho_X(x)\rho(y|x)\) on \(\mathcal{Z}=\mathcal{X}\times\mathcal{Y}\), where usually \(\mathcal {X}\subseteq\mathbb{R}^{p}\) and \(\mathcal{Y}\subseteq\mathbb{R}\). Here we are interested in vector-valued learning, where \(\mathcal {Y}\subseteq \mathbb{R}^{d}\). A learning algorithm is a map from a training set z to an estimator \(f_{\mathbf{z}}:\mathcal{X}\to\mathcal{Y}\). A good estimator should generalize to future examples and, if we choose the square loss, this translates into the requirement of having small expected risk (or error)

$$\mathcal{E}(f) = \int_{\mathcal{X}\times\mathcal{Y}}\bigl\|{y-f(x)}\bigr\|_d^2d\rho(x,y),$$

where \(\|\cdot\|_d\) denotes the Euclidean norm in \(\mathbb{R}^d\). In this framework the ideal solution is the minimizer of the expected risk, that is the regression function \(f_{\rho}(x) = \int_{\mathcal{Y}}y\,d\rho(y|x)\), but it cannot be computed directly since ρ is unknown. Further, the search for a solution is often restricted to some space of hypotheses \(\mathcal{H}\). In this case the best attainable error is \(\mathcal{E}(f_{\mathcal{H}})=\inf_{f\in \mathcal{H}}\mathcal{E}(f)\). The quality of an estimator can be assessed by considering the distribution of the excess risk, \(\mathcal{E}(f_{\mathbf{z}})-\mathcal{E}(f_{\mathcal{H}})\), and in particular, we say that an estimator is consistent if

$$\lim_{n\to\infty}\mathrm{P} \bigl[ \mathcal{E}(f_\mathbf{z})-\mathcal {E}(f_\mathcal{H})\ge\varepsilon\bigr]=0$$

for all positive ε, where P[A] is the probability of the event A. A more quantitative result is given by finite sample bounds,

$$\mathrm{P} \bigl[ \mathcal{E}(f_\mathbf{z})- \mathcal {E}(f_\mathcal {H})\le\varepsilon(\eta, n) \bigr] \ge 1-\eta, \quad0<\eta\le1.$$

We add two remarks on related problems that we discuss in the following. The first is multi-task learning.

Remark 1

In multi-task learning (Evgeniou et al. 2005; Caponnetto et al. 2008; Micchelli and Pontil 2004) the goal is to learn several correlated scalar problems simultaneously. For each task j=1,…,d we are given a training set of examples \(S_{j}=\{(x_{ij},y_{ij})\}_{i = 1}^{n_{j}}\). The examples are often assumed to belong to the same space \(\mathcal{X}\times\mathcal{Y}\) and, if this is the case, vector-valued learning corresponds to the case where the inputs are the same for all tasks.

The second problem is multi-category classification.

Remark 2

It is well-known that binary classification can be seen as a regression problem where the output values are only ±1. In the multi-class case the naïve idea of assigning a label y∈{1,2,…,d} to each class introduces an artificial ordering among the classes. A possible way to solve this issue is to assign a “code” to each class, for example class 1 can be coded as (1,0,…,0), class 2 as (0,1,…,0), etc. In this case, we can see the problem as vector-valued regression. As we discuss in Sect. 5.1, this point of view allows us to show that the spectral regularization algorithms we consider are consistent multi-class algorithms.

2.2 Vector-valued RKHS

In the following we are interested in the theoretical and computational properties of a class of vector-valued kernel methods, that is, methods where the hypothesis space is chosen to be a reproducing kernel Hilbert space (RKHS). This motivates recalling the basic theory of vector-valued RKHS.

The development of the theory in the vector case is essentially the same as in the scalar case. We refer to the papers by Schwartz (1964), Micchelli and Pontil (2005), Carmeli et al. (2006) for further details and references. We consider functions having values in some Euclidean space \(\mathcal{Y}\) with scalar product (norm) \(\langle{\cdot},{\cdot} \rangle_{\mathcal{Y}}\) (\(\|{\cdot}\|_{\mathcal{Y}}\)), for example \(\mathcal{Y}\subset{\mathbb{R}}^{d}\). An RKHS \(\mathcal{H}\) is a Hilbert space of functions \(f:\mathcal{X}\to\mathcal{Y}\), with scalar product (norm) denoted by \(\langle\cdot,\cdot\rangle_\varGamma\) (\(\|\cdot\|_\varGamma\)), such that, for all \(x \in\mathcal{X}\), the evaluation maps \(\mathit{ev}_{x}:\mathcal{H}\to{ \mathcal{Y}}\) are linear and bounded, that is

$$ \bigl\|{f(x)}\bigr\|_\mathcal{Y}=\|{\mathit{ev}_x f}\|_\mathcal{Y}\ \le C_x \|{f}\|_\varGamma, \quad C_x \in{\mathbb{R}}.$$
(1)

For all \(x, s \in\mathcal{X}\), a reproducing kernel \(\varGamma :\mathcal {X}\times\mathcal{X}\to \mathcal {B}(\mathcal{Y})\), where \(\mathcal {B}(\mathcal{Y})\) is the space of linear and bounded operators on \(\mathcal{Y}\), is defined as:

$$\varGamma(x,s):=\mathit{ev}_x \mathit{ev}_s^*$$

where \(\mathit{ev}_{x}^{*}: {\mathcal{Y}} \to\mathcal{H}\) is the adjoint (see footnote 1) of \(\mathit{ev}_{x}\). Note that for \(\mathcal{Y}\subset{\mathbb{R}}^{d}\), the space \(\mathcal {B}(\mathcal{Y})\) is simply the space of d×d matrices.

By definition, for all \(c\in \mathcal{Y}\) and \(x\in\mathcal{X}\), the kernel Γ has the following reproducing property

$$ \bigl\langle{f(x)},{c} \bigr\rangle_\mathcal{Y}=\bigl \langle{f},{\mathit{ev}_x^* c} \bigr\rangle_\varGamma= \langle{f},\varGamma_x c \rangle_\varGamma,$$
(2)

where \(\mathit{ev}_{x}^{*} c=\varGamma_{x} c=\varGamma(\cdot, x)c\). It follows that in (1) we have \(C_{x}\le\sup_{x\in\mathcal{X}} \|{\varGamma(x,x)}\|_{\mathcal {Y},\mathcal{Y}}\), where \(\|{\cdot}\|_{\mathcal{Y},\mathcal{Y}}\) is the operator norm. We assume throughout that

$$ \sup_{x\in\mathcal{X}} \bigl\|\varGamma(x,x)\bigr\|_{\mathcal {Y},\mathcal{Y}}=\kappa<\infty.$$
(3)

Similarly to the scalar case, it can be shown (Schwartz 1964), that for any given reproducing kernel Γ, a unique RKHS can be defined with Γ as its reproducing kernel.

In Sect. 4.1 and in Appendix A, we discuss several examples of kernels corresponding to vector-valued RKHS (other examples have also been proposed, Micchelli and Pontil 2005; Evgeniou et al. 2005). To avoid confusion, in the following we denote scalar kernels by K and reproducing kernels for vector-valued RKHS by Γ.

Remark 3

It is interesting to note that, when \(\mathcal{Y}={\mathbb{R}}^{d}\), any matrix-valued kernel Γ can be seen as a scalar kernel, \(Q: (\mathcal{X},\varPi)\times(\mathcal{X},\varPi) \to{\mathbb{R}}\), where Π is the index set of the output components, i.e. Π={1,…,d}. More precisely, we can write \(\varGamma(x,x')_{\ell q}=Q((x,\ell),(x',q))\). See Hein and Bousquet (2004) for more details.
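As a small illustration of this identification (not part of the original development; the Gaussian-times-matrix kernel and the function names below are our own choices), the following Python sketch checks numerically that the matrix-valued kernel and the scalar kernel on the augmented input set agree entry-wise.

```python
import numpy as np

def gamma(x, xp, A, sigma=1.0):
    """An illustrative matrix-valued kernel Gamma(x, x') = K(x, x') * A
    (a decomposable kernel; see Sect. 4.1)."""
    k = np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))
    return k * A

def q_scalar(x_ell, xp_q, A, sigma=1.0):
    """The equivalent scalar kernel Q on the augmented input space
    (X, Pi), where Pi = {0, ..., d-1} indexes the output components."""
    (x, ell), (xp, q) = x_ell, xp_q
    return gamma(x, xp, A, sigma)[ell, q]

# numerical check of the identification Gamma(x, x')_{lq} = Q((x, l), (x', q))
rng = np.random.default_rng(0)
d = 3
B = rng.standard_normal((d, d))
A = B @ B.T                      # positive semidefinite output matrix
x, xp = rng.standard_normal(2), rng.standard_normal(2)
G = gamma(x, xp, A)
for ell in range(d):
    for q in range(d):
        assert np.isclose(G[ell, q], q_scalar((x, ell), (xp, q), A))
```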

3 Learning vector-valued functions with spectral regularization

In this section, we present the class of algorithms under study. First, we briefly recall the main features of Tikhonov regularization for scalar and vector problems. On the one hand, this allows us to point out the role played by vector-valued RKHS and, on the other hand, it will help us introduce the spectral regularization methods, of which Tikhonov is a special case. Second, we discuss the general framework of spectral methods as well as several examples of algorithms. Third, we state our main theoretical result: a finite sample bound on the excess risk for all spectral methods in a unified framework.

3.1 Tikhonov regularization from the scalar to the vector case

In this section, we start from the Tikhonov regularization in the scalar setting to illustrate the extension to the general vector-valued case. In particular, we are interested in the role played by the kernel matrix.

In the scalar case, given a training set \(\mathbf{z}=\{(x_{i},y_{i})\}_{i=1}^{n}\), Tikhonov regularization in a RKHS \(\mathcal{H}\), with kernel K, corresponds to the minimization problem

$$\min_{f \in\mathcal{H}} \Biggl\{\frac{1}{n} \sum_{i=1}^{n}\bigl(y_i-f(x_i)\bigr)^2 +\lambda\|{f}\|_\mathcal{H}^2 \Biggr\},$$

where the regularization parameter λ>0 controls the trade-off between data fitting and regularization.

Its solution is given by

$$ f_\mathbf{z}^\lambda(\cdot)=\sum _{i=1}^{n} c_iK(x_i,\cdot), \quad c_i \in{\mathbb{R}} \ \forall i=1,\dots, n$$
(4)

where the coefficients \(\mathbf{c}=(c_1,\dots,c_n)\) satisfy

$$ (\mathbf{K}+\lambda nI) {\mathbf{c}}= {\mathbf{y}},$$
(5)

with \(\mathbf{K}_{ij}=K(x_i,x_j)\), \(\mathbf{y}=(y_1,\dots,y_n)\) and I the n×n identity matrix.

The final estimator \(f_{\mathbf{z}}\) is determined by a parameter choice \(\lambda_n=\lambda(n,\mathbf{z})\), so that \(f_{\mathbf{z}}=f_{\mathbf{z}}^{\lambda_n}\).
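For concreteness, a minimal sketch of the scalar case is given below. It implements (4)-(5) with a Gaussian kernel; the kernel choice, the toy data and the helper names are ours and only meant as an illustration.

```python
import numpy as np

def gaussian_kernel_matrix(X1, X2, sigma=1.0):
    """K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * sigma**2))

def tikhonov_fit(X, y, lam, sigma=1.0):
    """Solve (K + lambda n I) c = y, as in (5)."""
    n = X.shape[0]
    K = gaussian_kernel_matrix(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def tikhonov_predict(X_train, c, X_test, sigma=1.0):
    """Evaluate the estimator (4) at new points."""
    return gaussian_kernel_matrix(X_test, X_train, sigma) @ c

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(50)
c = tikhonov_fit(X, y, lam=1e-3)
y_hat = tikhonov_predict(X, c, X)
```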

In the case of vector-valued output, i.e. \({\mathcal{Y}}\subset{\mathbb{R}}^{d}\), the simplest idea is to consider a naïve extension of Tikhonov regularization, reducing the problem to learning each component independently. Namely, the solution is assumed to belong to

$$ {\mathcal{H}=\mathcal{H}^1\times\mathcal{H}^2\times\dots\times\mathcal{H}^d,}$$
(6)

where the spaces \(\mathcal{H}^{1}, \mathcal{H}^{2},\dots, \mathcal{H}^{d}\) are endowed with norms \(\|{\cdot}\|_{\mathcal{H}^{1}},\dots, \|{\cdot}\|_{\mathcal {H}^{d}}\). Then \(f=(f^1,\dots,f^d)\) and \(\|{f}\|_{\varGamma}^{2}=\sum_{j=1}^{d}\|{f^{j}}\|_{\mathcal{H}^{j}}^{2}\). Tikhonov regularization amounts to solving the following problem

$$ {\min_{f \in\mathcal{H}} \Biggl\{ \frac{1}{n}\sum_{i=1}^{n}\bigl\| {y_i-f(x_i)}\bigr\|_d^2 +\lambda\|{f}\|_\varGamma^2\Biggr\} }$$
(7)

that can be rewritten as

$$\min_{f^1\in\mathcal{H}^1, \dots, f^d \in\mathcal{H}^d} \Biggl\{\frac{1}{n} \sum _{i=1}^{n} \sum_{j=1}^d\bigl(y^j_i-f^j(x_i)\bigr)^2 +\lambda\sum_{j=1}^{d}\bigl\|{f^j}\bigr\|_{\mathcal{H}^j}^2 \Biggr\}.$$

From the above expression, it is clear that solving this problem is equivalent to solving d independent scalar problems. Within the framework of vector-valued kernels, assumption (6) corresponds to a special choice of a matrix-valued kernel, namely a kernel of the form

$$\varGamma\bigl(x,x'\bigr)=\operatorname{diag}\bigl(K_1\bigl(x,x'\bigr), \dots, K_d\bigl(x,x'\bigr)\bigr).$$

Assuming each component to be independent of the others is a strong assumption and might not reflect the real functional dependence among the data. Recently, a regularization scheme of the form (7) has been studied by Micchelli and Pontil (2005) for general matrix-valued kernels. In this case, there is no straightforward decomposition of the problem and one of the main results of Micchelli and Pontil (2005) shows that the regularized solution can be written as

$$ {f_\mathbf{z}^\lambda(\cdot)=\sum _{i=1}^{n} \varGamma(\cdot,x_i)c_i,\quad c_i \in{\mathbb{R}}^d\ \forall i=1,\dots, n. }$$
(8)

The coefficients can be concatenated into an nd-dimensional vector \(\mathbf{C}=(c_{1}^{\top},\dots, c_{n}^{\top})^{\top}\), which satisfies

$$ (\boldsymbol{\Gamma}+\lambda n I ) \mathbf{C}=\mathbf{Y},$$
(9)

where \(\mathbf{Y}= (y_{1}^{\top}, \dots, y_{n}^{\top})^{\top}\) is the nd-dimensional vector obtained by concatenating the outputs and the kernel matrix Γ is an n×n block matrix, whose (i,j) block is the d×d matrix \(\varGamma(x_i,x_j)\), so that Γ is an nd×nd matrix, while I is the nd×nd identity matrix.
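A possible implementation of this general vector-valued case is sketched below: the nd×nd kernel matrix of (9) is assembled block by block and the linear system is solved directly. The specific kernel used in the toy example (a decomposable kernel, introduced formally in Sect. 4.1) is an assumption made only for illustration.

```python
import numpy as np

def block_kernel_matrix(X, gamma):
    """Assemble the nd x nd kernel matrix of (9): block (i, j) is the
    d x d matrix Gamma(x_i, x_j), consistent with the ordering of C and Y."""
    n = X.shape[0]
    d = gamma(X[0], X[0]).shape[0]
    G = np.zeros((n * d, n * d))
    for i in range(n):
        for j in range(n):
            G[i*d:(i+1)*d, j*d:(j+1)*d] = gamma(X[i], X[j])
    return G

def vector_tikhonov(X, Y, gamma, lam):
    """Solve (G + lambda n I) C = Y and return the coefficients c_i in R^d."""
    n, d = Y.shape
    G = block_kernel_matrix(X, gamma)
    C = np.linalg.solve(G + lam * n * np.eye(n * d), Y.reshape(-1))
    return C.reshape(n, d)

# toy usage with a decomposable kernel Gamma(x, x') = K(x, x') * A
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.5], [0.5, 1.0]])
k = lambda x, xp: np.exp(-np.sum((x - xp)**2) / 2.0)
gamma = lambda x, xp: k(x, xp) * A
X = rng.standard_normal((30, 3))
Y = rng.standard_normal((30, 2))
coef = vector_tikhonov(X, Y, gamma, lam=1e-2)
```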

Remark 4

We observe that, for the diagonal kernel defined above, all the blocks of the kernel matrix in (9) are diagonal, so that, up to a permutation of the coordinates, the system decouples into d independent scalar problems. In general, the presence of off-diagonal terms reflects the dependence among the components.

3.2 Beyond Tikhonov: regularization via spectral filtering

In this section, we present the class of regularized kernel methods which we apply to multi-output learning, referring to the papers by Lo Gerfo et al. (2008), Bauer et al. (2007) for the scalar case. We call these methods spectral regularization because they achieve a stable, hence generalizing, solution by filtering the spectrum (i.e. the set of eigenvalues) of the kernel matrix, discarding or attenuating its unstable components, that is the directions corresponding to small eigenvalues. Each algorithm corresponds to a specific filter function and, in general, there is no natural interpretation in terms of penalized empirical risk minimization (see footnote 2). More precisely, the solution of (unpenalized) empirical risk minimization can be written as in (8), but the coefficients are given by

$$ {\boldsymbol{\Gamma}\mathbf{C}=\mathbf{Y}. }$$
(10)

Comparing the above expression to (9), we see that adding a penalty to the empirical risk has a stabilizing effect from a numerical point of view, since it suppresses the weights of the components corresponding to the small eigenvalues of the kernel matrix. This allows us to view Tikhonov regularization as performing a low-pass filtering of the kernel matrix, where high frequencies correspond to the small eigenvalues.

The interpretation of regularization as a way to restore stability is classical in ill-posed inverse problems, where many algorithms, besides Tikhonov regularization, are used (Engl et al. 1996). The connection between learning and regularization theory of ill-posed problems (De Vito et al. 2005) motivates considering spectral regularization techniques. In the scalar case this was done by Lo Gerfo et al. (2008), Bauer et al. (2007), Caponnetto (2006). The idea is that other regularized matrices \(g_\lambda(\boldsymbol{\Gamma})\) besides \((\boldsymbol{\Gamma}+\lambda nI)^{-1}\) can be defined. Here the matrix-valued function \(g_\lambda(\boldsymbol{\Gamma})\) is defined from a scalar function \(g_\lambda\) using spectral calculus. More precisely, if

$$\boldsymbol{\Gamma}=\mathbf{U}\mathbf{S}\mathbf{U}^*$$

is the eigendecomposition of Γ with \(\mathbf{S}=\operatorname{diag}(\sigma_{1},\dots,\sigma_{nd})\) containing its eigenvalues, then

$$g_\lambda(\mathbf{S})=\operatorname{diag}\bigl(g_\lambda(\sigma_1),\dots,g_\lambda(\sigma_{nd})\bigr)$$

and

$$g_\lambda(\boldsymbol{\Gamma})=\mathbf{U}g_\lambda(\mathbf{S})\mathbf{U}^*.$$

For example, in the case of Tikhonov regularization \(g_{\lambda}(\sigma )=\frac {1}{\sigma+n\lambda}\).

Suitable choices of filter functions g λ define estimators of the form (8) with coefficients given by

$$ \mathbf{C}=g_\lambda(\boldsymbol{\Gamma})\mathbf{Y}.$$
(11)

From the computational perspective, in the following we show that many filter functions allow us to compute the coefficients C without explicitly computing the eigendecomposition of Γ.
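Before describing such shortcuts, the following sketch shows the generic recipe (11) implemented literally through the eigendecomposition, with the Tikhonov filter as an example; it is meant only to make the spectral calculus concrete, not as an efficient implementation, and the function names are ours.

```python
import numpy as np

def apply_filter(G, Y_vec, g_lambda):
    """Compute C = g_lambda(G) Y by spectral calculus: G = U S U^T,
    g_lambda(G) = U g_lambda(S) U^T, with G the symmetric PSD kernel matrix."""
    s, U = np.linalg.eigh(G)               # eigenvalues s_k, eigenvectors U[:, k]
    filtered = g_lambda(s) * (U.T @ Y_vec)  # apply the scalar filter to the spectrum
    return U @ filtered

def tikhonov_filter(lam, n):
    """Tikhonov filter g_lambda(sigma) = 1 / (sigma + n * lambda)."""
    return lambda s: 1.0 / (s + n * lam)

# usage, given the nd x nd kernel matrix G and the concatenated outputs Y_vec:
#   C = apply_filter(G, Y_vec, tikhonov_filter(lam=1e-2, n=n))
```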

Clearly not all filter functions are admissible. Roughly speaking, an admissible filter function should be such that \(g_\lambda(\boldsymbol{\Gamma})\) approximates \(\boldsymbol{\Gamma}^{-1}\) as λ decreases, while its condition number should increase as λ decreases; a formal definition is given below. Note that, since we assume the kernel to be bounded, \(\sup_{x\in\mathcal{X}} \|\varGamma(x,x)\|_{\mathcal{Y},\mathcal{Y}}=\kappa< \infty\), the eigenvalues lie in the interval \([0,\kappa^2]\) and we can consider a filter to be a real-valued function defined on \([0,\kappa^2]\).

Definition 1

We say that a filter \(g_\lambda:[0,\kappa^2]\to\mathbb{R}\), \(0<\lambda\le\kappa^2\), is admissible if the following conditions hold

  • There exists a constant D such that

    $$ {\sup_{0 < \sigma\leq\kappa^2} \bigl|\sigma g_\lambda (\sigma)\bigr| \leq D }$$
    (12)
  • There exists a constant B such that

    $$ {\sup_{0 < \sigma\leq\kappa^2} \bigl|g_\lambda (\sigma)\bigr| \leq \frac{B}{\lambda} }$$
    (13)
  • There exists a constant γ such that

    $$ {\sup_{0 < \sigma\leq\kappa^2} \bigl|1 - g_\lambda (\sigma)\sigma \bigr| \leq \gamma}$$
    (14)
  • There exists a constant \(\overline{\nu} > 0\), called the qualification of the filter \(g_\lambda\), such that, for all \(\nu\in(0,\overline{\nu}]\)

    $$ {\sup_{0 < \sigma\leq\kappa^2} \bigl|1 - g_\lambda (\sigma)\sigma \bigr|\sigma^{\nu} \leq\gamma_\nu\lambda^\nu, }$$
    (15)

    where the constant γ ν >0 does not depend on λ.

The above conditions are well-known in the context of regularization for ill-posed problems. Briefly, the first two conditions ensure that the regularization operator induced by a filter is bounded, with condition number controlled by the regularization parameter λ. The last two conditions are more technical and govern the approximation properties of each filter. We refer the interested reader to the works of Lo Gerfo et al. (2008), Bauer et al. (2007) for the computation of the constants for the different filters and a more thorough discussion.

3.3 Examples of spectral regularization algorithms

We now describe several examples of algorithms that can be cast in the above framework.

L2 Boosting

We start by describing in some detail vector-valued L2 boosting. In the scalar setting this method has been interpreted as a way to combine weak classifiers corresponding to spline functions at the training set points (Bühlmann and Yu 2002) and is called Landweber iteration in the inverse problems literature (Engl et al. 1996). The method can also be seen as the gradient descent minimization of the empirical risk on the whole RKHS, with no further constraint. Regularization is achieved by early stopping of the iterative procedure, hence the regularization parameter is the number of iterations.

The coefficients (11) can be found by setting \(\mathbf{C}^0=0\) and considering, for i=1,…,t, the iteration

$$\mathbf{C}^i=\mathbf{C}^{i-1}+\eta\bigl(\mathbf{Y}-\boldsymbol{\Gamma}\mathbf{C}^{i-1}\bigr),$$

where the step size η can be chosen to make the iterations converge to the minimizer of the empirical risk—see (16) below. It is easy to see that this is simply gradient descent if we use (8) to write the empirical risk as

$$\| \boldsymbol{\Gamma}\mathbf{C}-\mathbf{Y}\|^2.$$

The corresponding filter function can be found by unrolling the iteration: one can prove by induction (Engl et al. 1996; Lo Gerfo et al. 2008) that the solution at the t-th iteration is given by

$$\mathbf{C}^t= \eta\sum_{i=0}^{t-1}(I-\eta\boldsymbol{\Gamma})^i \mathbf{Y}.$$

Then, the filter function is \(G_{t}(\sigma)= \eta\sum_{i=0}^{t-1}(1-\eta\sigma)^{i}\). Interestingly, this filter function has another interpretation, which can be seen by recalling that \(\sum_{i=0}^{\infty} x^{i}=(1-x)^{-1}\), for 0<x<1. A similar relation holds if we consider matrices rather than scalars, so that, if we replace x with \(I-\eta\boldsymbol{\Gamma}\), we get

$$\boldsymbol{\Gamma}^{-1}=\eta\sum_{i=0}^{\infty}(I-\eta\boldsymbol{\Gamma})^i.$$

The filter function of L2 boosting corresponds to the truncated power series expansion of Γ −1. The last reasoning also shows a possible way to choose the step-size. In fact we should choose η so that

$$ {\|I-\eta\boldsymbol{\Gamma}\|<1,}$$
(16)

where we use the operator norm.
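A minimal sketch of the resulting procedure is given below (our own illustration); note that the whole regularization path, indexed by the iteration number, is obtained as a byproduct.

```python
import numpy as np

def l2_boosting(G, Y_vec, t, eta=None):
    """Landweber / L2 boosting: C^i = C^{i-1} + eta * (Y - G C^{i-1}).
    The number of iterations t plays the role of the regularization parameter."""
    if eta is None:
        # a common conservative choice of step size; cf. the condition (16)
        eta = 1.0 / np.linalg.eigvalsh(G).max()
    C = np.zeros_like(Y_vec)
    path = []                      # the whole regularization path comes for free
    for _ in range(t):
        C = C + eta * (Y_vec - G @ C)
        path.append(C.copy())
    return C, path
```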

Next, we briefly discuss three other methods.

Accelerated L2 Boosting or ν-method

This method can be seen as an accelerated version of L2 boosting. The coefficients are found by setting \(\mathbf{C}^0=0\), \(\omega_1=(4\nu+2)/(4\nu+1)\), \(\mathbf{C}^{1}=\mathbf{C}^{0}+ \frac{\omega_{1}}{n} (\mathbf{Y}-\boldsymbol{\Gamma}\mathbf {C}^{0} )\) and considering, for i=2,…,t, an iteration that adds to the gradient step an acceleration (momentum) term whose coefficients depend on ν; the explicit update rule is given by Engl et al. (1996). The parameter ν is usually set to 1. The filter function is \(G_t(\sigma)=p_t(\sigma)\) with \(p_t\) a polynomial of degree t−1; its derivation is considerably more complicated and is also given by Engl et al. (1996). This method was proven to be faster than L2 boosting (Lo Gerfo et al. 2008), since the regularization parameter is the square root of the iteration number rather than the iteration number itself. In other words, the ν-method can find in \(\sqrt{t}\) steps the same solution found by L2 boosting after t iterations.

Iterated Tikhonov

This method is a combination of Tikhonov regularization and L2 boosting, where we set \(\mathbf{C}^0=0\) and consider, for i=1,…,t, the iteration \((\boldsymbol{\Gamma}+n\lambda I)\mathbf{C}^i=\mathbf{Y}+n\lambda\mathbf{C}^{i-1}\). The filter function is:

$$G_{\lambda}(\sigma)= \frac{(\sigma+\lambda)^t-\lambda^t}{\sigma (\sigma+\lambda)^t}.$$

This method is motivated by the desire to circumvent some of the limitations of Tikhonov regularization, namely a saturation effect that prevents exploiting the smoothness of the target function beyond a given critical value. (Engl et al. 1996; Lo Gerfo et al. 2008 provide further details.)

Truncated singular values decomposition

This method is akin to a projection onto the first principal components in a vector-valued setting. The number of components depends on the regularization parameter. The filter function is defined by \(G_\lambda(\sigma)=1/\sigma\) if \(\sigma\ge\lambda/n\) and 0 otherwise.
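For illustration, these last two filters can also be written as scalar functions of the eigenvalues and plugged into the spectral-calculus sketch shown earlier (after (11)); this is only a didactic implementation, since iterated Tikhonov is more naturally run in its iterative form, and the helper names are ours.

```python
import numpy as np

def iterated_tikhonov_filter(lam, t):
    """g(sigma) = ((sigma + lam)^t - lam^t) / (sigma (sigma + lam)^t),
    with the limit t / lam used at sigma = 0."""
    def g(s):
        s = np.asarray(s, dtype=float)
        num = (s + lam)**t - lam**t
        den = s * (s + lam)**t
        return np.where(s > 0, num / np.where(s > 0, den, 1.0), t / lam)
    return g

def tsvd_filter(threshold):
    """Truncated SVD: g(sigma) = 1/sigma if sigma >= threshold (lambda/n above),
    and 0 otherwise."""
    return lambda s: np.where(s >= threshold, 1.0 / np.maximum(s, 1e-300), 0.0)
```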

Although the spectral algorithms have a similar flavour, they present different algorithmic and theoretical properties. This can be seen, for example, comparing the computational complexities of the algorithms, especially if we consider the computational cost of tuning the regularization parameter. Some considerations along this line are given at the end of Sect. 4, whereas theoretical aspects are discussed in the next section.

3.4 Excess risk for spectral regularization

The main result of this section is a finite sample bound on the excess risk for all filters that leads to consistency. We first need to make some preliminary assumptions. The input space is assumed to be a separable metric space (not necessarily compact) and the output space is assumed to be a bounded set in ℝd, that is \(\sup_{y \in\mathcal{Y}}\|y\|_{d} = M <\infty\). For the sake of simplicity, we also assume that a minimizer of the expected risk on \(\mathcal{H}\) exists and denote it with \(f_{\mathcal{H}}\). Also, recall the definition (15) of the qualification constant \(\overline{\nu}\) for a spectral filter g λ . We are now ready to state our main theorem.

Theorem 1

Assume \(\overline{\nu} \geq\frac{1}{2}\) and \(\|{f_{\mathcal{H}}}\|_{\varGamma}\le R\). Choose the regularization parameter as

$$ \lambda_n := \lambda(n) =\frac{2\sqrt{2}\,\kappa^2}{\sqrt{n}}\log\frac{4}{\eta},$$
(17)

so that \(\lim_{n\to\infty}\lambda_n=0\).

Then, for every η∈(0,1), we have with probability at least 1−η

$$ \mathcal{E}\bigl(f_{\mathbf{z}}^{\lambda _n}\bigr)-\mathcal{E}(f_\mathcal{H})\le\frac{C\log{4/\eta}}{\sqrt{n}},$$
(18)

where \(C = 2 \sqrt{2}(\gamma+ \gamma_{\frac{1}{2}})^{2} \kappa^{2} R^{2} + 2\sqrt{2} (M+R)^{2}(B+\sqrt{BD})^{2}\).

The above result generalizes the analysis of Caponnetto and De Vito (2007) and Bauer et al. (2007); the latter also includes the computation of the constants corresponding to each algorithm. The proof is provided in the appendix and is largely based on previous work (Caponnetto and De Vito 2007; Mathé and Pereverzev 2002), to which we refer for further details. We add here three remarks. First, the above result leads to consistency, see Theorem 4 in the appendix. Indeed, it is easy to prove it by choosing \(\lambda_n\) such that \(\lim_{n\to\infty}\lambda_n=0\) and \(\lim_{n \to\infty}\lambda_{n}\sqrt{n}= +\infty\), and setting \(\eta_{n} = 4e^{-\frac{\lambda_{n} \sqrt {n}}{2\sqrt {2}\kappa^{2}}}\), so that (17) is satisfied and \(\lim_{n\to\infty}\eta_n=0\). Furthermore, even when the expected risk does not achieve a minimum in \(\mathcal{H}\), one can still show that there is a parameter choice ensuring convergence to \(\inf_{f\in\mathcal{H}} \mathcal{E}(f)\) (Caponnetto 2006). If the kernel is universal (Steinwart 2002; Caponnetto et al. 2008), then universal consistency (Devroye et al. 1996) is ensured. Second, if we strengthen the assumptions on the problem, we can obtain faster convergence rates. If \(L_{\varGamma}f(s)=\int_{\mathcal{X}}\varGamma(s,x)f(x)\,d\rho_{X}(x)\) is the integral operator with kernel Γ, we can consider the assumption (see footnote 3) \(f_{\rho}=L_{\varGamma}^{r} u\) for some \(u\in L^2(\mathcal{X},\rho_X)\) (the space of square integrable functions). In this case, by choosing \(\lambda_{n}=n^{-\frac{1}{2r+1}}\) we can replace the rate \(n^{-1/2}\) with \(n^{-\frac{2r}{2r+1}}\), which is optimal in a minimax sense (Caponnetto and De Vito 2007). Third, the latter parameter choice depends on the unknown regularity index r and the question arises whether we can achieve the same rate choosing λ without any prior information, namely adaptively. Indeed, this is the case, since we can directly apply the results of De Vito et al. (2008).

4 Kernels and computational aspects

In this section, we discuss the computational properties of the various algorithms on the basis of the parameter strategy considered and the kernel used. Towards this end, we first recall some examples of kernels defining vector-valued RKHS and their connection to the regularizer choice, while a more complete list of kernels is reported in Appendix A. Secondly, we show how for a specific class of kernels, the vector-valued problem can be reduced to a series of scalar problems, which are more efficient to solve.

4.1 Kernels for multi-output learning

We first introduce an important class of kernels that allow to decouple the role played by input and output spaces, then we discuss two kernels that will be extensively used in the experiments.

4.1.1 Decomposable kernels and regularizers

A crucial practical question is which kernel to use in a given problem. Unlike the scalar case, in the vector-valued case there are no natural off-the-shelf kernels: there are no obvious extensions of Gaussian or polynomial kernels and the choice of the kernel is considerably more difficult. In the context of scalar Tikhonov regularization, it is known that choosing an appropriate penalty function, a regularizer, corresponds to choosing a kernel function (Smola et al. 1998). This is the point of view that has been mainly considered for multi-output functions, especially in the context of multi-task learning: couplings among the different outputs are explicitly incorporated in the penalty. In the following, we review several regularizer choices from the perspective of matrix-valued kernels. This allows us to use spectral algorithms other than Tikhonov regularization and reveals a common structure among different regularizers. Clearly, a matrix-valued kernel can also be directly defined without passing through the definition of a regularizer; two examples are given in the next section.

A general class of matrix-valued kernels (Evgeniou et al. 2005), that can be related to specific regularizers, is composed of kernels of the form:

$$ {\varGamma\bigl(x,x'\bigr) = K\bigl(x,x'\bigr)A}$$
(19)

where K is a scalar kernel and A a positive semidefinite d×d matrix that encodes how the outputs are related. This class of kernels, sometimes called decomposable kernels, allows one to decouple the roles played by the input and output spaces. The choice of the kernel K depends on the desired shape of the function with respect to the input variables, while the choice of the matrix A depends on the relations among the outputs. This information can be available in the form of prior knowledge on the problem at hand or can potentially be estimated from data.

The role of A can be better understood by recalling that any vector-valued function belonging to the RKHS can be expressed as \(f(x)=\sum_i \varGamma(x,x_i)c_i=\sum_i K(x,x_i)Ac_i\) with \(c_i\in\mathbb{R}^d\), so that the ℓ-th component is

$$f^\ell(x) = \sum_i \sum _{t=1}^d K(x,x_i)A_{\ell t}c_i^t,$$

with \(c^{t}_{i} \in{\mathbb{R}}\). Each component is thus a different linear combination of the same coefficients \(\{c_{i}\}_{i=1}^{n}\) and depends on the corresponding row of the matrix A. If A is the d-dimensional identity matrix I, the linear combinations depend on the corresponding components of the coefficients \(c_i\) and therefore each component \(f^\ell\) is independent of the others. The norm of the vector-valued function can also be expressed in terms of the coefficients \(c_i\) and the matrix A,

$$\|f\|_\varGamma^2=\sum_{i,j=1}^n K(x_i,x_j)\bigl\langle{c_i},{A c_j}\bigr\rangle_d.$$

For the considered kernels, the similarity between the components can be evaluated by their pairwise scalar products:

$$ {\bigl\langle f^\ell,f^q\bigr \rangle_K = \sum_{ij} \sum _{ts} K(x_i,x_j) A_{\ell t}c_i^tA_{qs}c_j^s.}$$
(20)

Given the simple calculations above, we immediately have the following proposition (Sheldon 2008).

Proposition 1

Let Γ be a product kernel of the form in (19). Then the norm of any function in the corresponding RKHS can be written as

$$ {\|f\|_\varGamma^2=\sum _{\ell,q=1}^d A_{\ell q}^\dagger\bigl\langle f^\ell,f^q\bigr\rangle_K, }$$
(21)

where \(A^\dagger\) is the pseudoinverse of A.
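As a quick numerical check of Proposition 1 (a sketch under our own choice of a Gaussian scalar kernel and random coefficients, not anything prescribed by the paper), one can verify that the two expressions for the norm coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.standard_normal((n, 2))
B = rng.standard_normal((d, d))
A = B @ B.T                                      # output structure matrix
C = rng.standard_normal((n, d))                  # coefficients c_i

K = np.exp(-((X[:, None, :] - X[None, :, :])**2).sum(-1) / 2.0)  # scalar kernel matrix

# RKHS norm: ||f||_Gamma^2 = sum_ij K(x_i, x_j) <c_i, A c_j>
norm_sq = np.einsum('ij,ik,kl,jl->', K, C, A, C)

# right-hand side of (21): sum_{l,q} A^dagger_{lq} <f^l, f^q>_K,
# with <f^l, f^q>_K computed as in (20)
AC = C @ A.T                                     # rows are (A c_i)^T
inner = np.einsum('ij,il,jq->lq', K, AC, AC)     # matrix of <f^l, f^q>_K
rhs = np.sum(np.linalg.pinv(A) * inner)

assert np.isclose(norm_sq, rhs)
```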

The above result immediately leads to an RKHS interpretation of many regularizers. Examples of kernels belonging to this class are given in Appendix A. Next, we discuss two kernels that we use in the experiments and that are directly defined without having to introduce a regularizer.

4.1.2 Divergence free and curl free fields

The following two matrix-valued kernels apply only for vector fields whose input and output spaces have the same dimensions. Macêdo and Castro (2008) tackled the problem of reconstructing divergence-free or curl-free vector fields via the SVR method, with ad-hoc matrix-valued kernels based on matrix-valued radial basis functions (RBF) (Narcowich and Ward 1994). These kernels induce a similarity between the vector field components that depends on the input points, and therefore cannot be reduced to the form Γ(x,x′)=K(x,x′)A.

The derivation of these kernels is explained in Appendix A. The divergence-free matrix-valued kernel is defined as

$$ {\varGamma_{df}\bigl(x,x'\bigr) =\frac{1}{\sigma^2}e^{-\frac{\|x - x'\|^2}{2\sigma^2}}A_{x,x'} }$$
(22)

where

$$A_{x,x'}= \biggl( \biggl( \frac{x-x'}{\sigma} \biggr)\biggl(\frac{x-x'}{\sigma} \biggr)^T +\biggl((d-1)-\frac{\|x-x'\|^2}{\sigma^2} \biggr)\mathbf{I}\biggr).$$

The curl-free kernel is defined as

$$ \varGamma_{cf}\bigl(x,x'\bigr) =\frac{1}{\sigma^2}e^{-\frac{\|x - x'\|^2}{2\sigma^2}} \biggl (\mathbf{I}- \biggl( \frac{x-x'}{\sigma}\biggr) \biggl(\frac{x-x'}{\sigma} \biggr)^T \biggr) .$$
(23)

It is possible to consider a convex combination of these two kernels to obtain a kernel for learning any kind of vector field, while at the same time allowing one to reconstruct the divergence-free and curl-free parts separately (see the paper by Macêdo and Castro 2008 and the experiments in Sect. 6 for more details).
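A direct transcription of (22) and (23) is sketched below; the function names and the convex-combination weight are ours and only illustrate how such kernels can be evaluated and combined.

```python
import numpy as np

def div_free_kernel(x, xp, sigma=1.0):
    """Gamma_df of (22): (1/sigma^2) exp(-||x-x'||^2 / (2 sigma^2)) A_{x,x'},
    with A_{x,x'} = u u^T + ((d - 1) - ||u||^2) I and u = (x - x') / sigma."""
    u = (x - xp) / sigma
    d = x.shape[0]
    scale = np.exp(-np.dot(u, u) / 2.0) / sigma**2
    return scale * (np.outer(u, u) + ((d - 1) - np.dot(u, u)) * np.eye(d))

def curl_free_kernel(x, xp, sigma=1.0):
    """Gamma_cf of (23): (1/sigma^2) exp(-||x-x'||^2 / (2 sigma^2)) (I - u u^T)."""
    u = (x - xp) / sigma
    d = x.shape[0]
    scale = np.exp(-np.dot(u, u) / 2.0) / sigma**2
    return scale * (np.eye(d) - np.outer(u, u))

def combined_kernel(x, xp, w=0.5, sigma=1.0):
    """Convex combination used to learn a generic vector field while
    separating its divergence-free and curl-free parts."""
    return w * div_free_kernel(x, xp, sigma) + (1 - w) * curl_free_kernel(x, xp, sigma)
```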

4.2 Eigen-decomposition for matrix-valued kernels

Here we argue that the learning problem can be solved more efficiently if the kernel is decomposable, Γ(x,x′)=K(x,x′)A. Indeed, we show that, for this class of kernels, we can use the eigen-system of the matrix A to define a new coordinate system in which the vector-valued problem decomposes into a series of scalar problems.

We start by observing that, if we denote with \(u_1,\dots,u_d\) the eigenvectors of A, we can write the vector \(\mathbf{C}=(c_{1}^{\top},\dots, c_{n}^{\top})^{\top}\), with \(c_i\in\mathbb{R}^d\), as

$$\mathbf{C}= \sum_{j=1}^d\tilde{c}^j \otimes u_j,$$

where \(\tilde{c}^{j}=( \langle{c_{1}},{u_{j}} \rangle_{d}, \dots , \langle{c_{n}},{u_{j}} \rangle_{d})^{\top}\) and ⊗ is the tensor product.

Similarly

$$\mathbf{Y}= \sum_{j=1}^d\tilde{y}^j \otimes u_j,$$

with \(\tilde{y}^{j}=( \langle{y_{1}},{u_{j}} \rangle_{d}, \dots,\langle{y_{n}},{u_{j}} \rangle_{d})^{\top}\). The above transformations are simply rotations in the output space. Moreover, for the considered class of kernels, the kernel matrix Γ is given by the tensor product of the n×n scalar kernel matrix K and A, that is Γ=K⊗A.

If we denote with \(\lambda_i, v_i\) (i=1,…,n) the eigenvalues and eigenvectors of K and with \(\sigma_j\) (j=1,…,d) the eigenvalues of A, we have the following equalities

$$\mathbf{C}= g_\lambda(\boldsymbol{\Gamma})\mathbf{Y}= g_\lambda(\mathbf{K}\otimes A)\mathbf{Y}=\sum_{j=1}^d \bigl(g_\lambda(\sigma_j\mathbf{K})\tilde{y}^j\bigr) \otimes u_j.$$

Since the eigenvectors \(u_j\) are orthonormal, it follows that:

$$ {\tilde{c}^j = g_\lambda(\sigma_j\mathbf{K}) \tilde{y}^j, \quad\mbox{for all}\ j = 1, \ldots, d. }$$
(24)

The above equation shows that in the new coordinate system we have to solve d essentially independent problems. Indeed, after rotating the outputs (and the coefficients), the only coupling is the rescaling of each kernel matrix by σ j . For example, in the case of Tikhonov regularization, the j-th component is found solving

$$\tilde{c}^j = (\sigma_j \mathbf{K}+\lambda I)^{-1} \tilde{y}^j=\biggl( \mathbf{K}+\frac {\lambda }{\sigma_j} I\biggr)^{-1} \frac{\tilde{y}^j}{\sigma_j}$$

and we see that the scaling term is essentially changing the scale of the regularization parameter (and the outputs). The above calculation shows that all kernels of this form allow for a simple implementation at the price of the eigen-decomposition of the matrix A. Also, it shows that the coupling among the different tasks can be seen as a rotation and rescaling of the output points.
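The following sketch implements this reduction for the Tikhonov filter and checks it against the full nd×nd system of (9); the Gaussian scalar kernel and the random toy data are our own choices for illustration.

```python
import numpy as np

def decomposable_tikhonov(K, Y, A, lam):
    """Solve the vector-valued problem for Gamma(x, x') = K(x, x') A by reducing
    it to d scalar problems in the eigenbasis of A, as in (24) (Tikhonov filter)."""
    n, d = Y.shape
    sig, U = np.linalg.eigh(A)            # eigenvalues sigma_j, eigenvectors u_j
    Y_rot = Y @ U                         # tilde y^j = <y_i, u_j>, stored column-wise
    C_rot = np.empty_like(Y_rot)
    for j in range(d):
        # scalar problem j: (sigma_j K + lambda n I) tilde c^j = tilde y^j
        C_rot[:, j] = np.linalg.solve(sig[j] * K + lam * n * np.eye(n), Y_rot[:, j])
    return C_rot @ U.T                    # rotate the coefficients back: c_i in R^d

# sanity check against the full nd x nd system of (9), using the Kronecker structure
rng = np.random.default_rng(0)
n, d = 15, 3
X = rng.standard_normal((n, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :])**2).sum(-1) / 2.0)
B = rng.standard_normal((d, d)); A = B @ B.T + np.eye(d)
Y = rng.standard_normal((n, d)); lam = 1e-2
C_fast = decomposable_tikhonov(K, Y, A, lam)
G = np.kron(K, A)                         # kernel matrix in the example-major ordering
C_full = np.linalg.solve(G + lam * n * np.eye(n * d), Y.reshape(-1)).reshape(n, d)
assert np.allclose(C_fast, C_full)
```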

4.3 Regularization path and complexity

In this section, we discuss the time complexity of the different algorithms. In particular, we compare Tikhonov regularization with accelerated L2 boosting, since in the scalar case this algorithm was shown to be fast and reliable (Lo Gerfo et al. 2008). In practice, when considering the complexity of a learning algorithm that depends on one regularization parameter, we are interested in the complexity of computing the whole regularization path. There are only a few algorithms for which the regularization path can be computed easily, among them SVM for classification (Hastie et al. 2004) and regression (Gunter and Zhu 2006; Wang et al. 2008) and regularization with the lasso penalty (Efron et al. 2004).

Using Tikhonov regularization with the square loss, for any new value of the regularization parameter we need to solve a linear system of dimension nd. For iterative algorithms, each iteration requires only a matrix-vector multiplication and yields a solution for a value of the regularization parameter, so that at step N we have the entire regularization path up to N. Hence, in general, if we consider N parameter values we have \(O(N(nd)^3)\) complexity for Tikhonov regularization and \(O(N(nd)^2)\) for iterative methods.

In the special case of kernels of the form Γ(x,x′)=K(x,x′)A, the complexity of the problem can be drastically reduced. Given the result in the previous section, we can diagonalize the matrix A and then work in a new coordinate system where the kernel matrix is block diagonal and all the blocks are the same, up to a rescaling. In this case, the complexity of the multi-output algorithm is essentially the same as that of d scalar problems (\(O(Nn^3)\) for Tikhonov and \(O(Nn^2)\) for iterative methods), plus the cost of computing the eigen-decomposition of A, which is \(O(d^3)\).

We add two comments. First, we note that for Tikhonov regularization we can further reduce the complexity from \(O(Nn^3)\) to \(O(n^3)\) by choosing the regularization parameter with Leave-One-Out cross-validation (LOO), as described by Rifkin and Lippert (2007). Second, we observe that for iterative methods we also have to take into account the cost of choosing the step size. The latter can be chosen smaller than \(2/\sigma_{\max}\), where \(\sigma_{\max}\) is the maximum eigenvalue of the kernel matrix induced by Γ, so that we have to add the cost of computing the maximum eigenvalue.

5 Spectral filtering for multi-category classification and multi-task learning

In this section, we discuss how to apply the proposed spectral filters to multi-category classification and multi-task learning. Firstly, we show how to reformulate a multi-category classification problem as a vector-valued regression one, discussing the role of the chosen kernel to leverage relations among the classes. Secondly, we present our main result of this section, namely Bayes consistency for all spectral regularization methods. Thirdly, we analyze the differences between vector-valued and multi-task learning and we show how to apply spectral filters to the latter.

5.1 Multi-category classification as learning a vector-valued function

In this section we study the application of vector-valued learning and spectral filtering to multi-category classification. In particular, after an introductory discussion, we prove novel finite sample bounds for the misclassification error of the estimators defined by spectral filtering.

5.1.1 Introduction

Multi-class, also called multi-category, problems are ubiquitous in applications. While a number of different algorithms have been proposed over the years, a theory of multi-class learning is still in its early stages and most algorithms come with no theoretical guarantees in terms of generalization properties. Here, we show that approaches based on vector-valued learning are natural and help in understanding multi-class problems.

The algorithms previously proposed in the literature can be roughly divided into three classes. The first comprises methods based on nearest neighbour strategies (Hastie et al. 2001). These techniques are appealing for their simplicity, but are considered to be prone to overfitting, especially in the presence of high dimensional data. The second class includes approaches where the multi-class problem is reduced to a family of binary classification problems, e.g. one-versus-all or all-versus-all (also called all-pairs). Finally, the third class corresponds to the so-called single machine approaches. An extensive list of references can be found in the review by Rifkin and Klautau (2004), which gives a detailed discussion and an exhaustive experimental comparison of different methods, suggesting that one-versus-all might be a winning strategy both in terms of performance and computation (see the discussion below).

From a theoretical point of view, the analysis of methods based on (penalized or constrained) empirical risk minimization was started by Zhang (2004) and Tewari and Bartlett (2005). A main message of these works is that straightforward extensions of binary classification approaches might lead to methods that fail to be Bayes consistent. The latter property can probably be considered a minimal theoretical requirement for a good classification rule.

In this section, we argue that multi-category classification can be naturally modelled as the problem of learning a vector-valued function, obtained by associating each class to an appropriate coding vector. A basic observation supporting this approach is that when we describe the classes using a finite set of scalar labels, we are introducing an unnatural ordering among them, which is avoided by adopting a vector coding. Besides this fact, the idea of considering multi-category classification as the problem of learning a vector-valued function is appealing, since it opens the route to exploiting the relationships among the considered classes, and can be the key towards designing more efficient multi-class learning machines.

To better explain this last observation, we recall that among the proposed approaches to solving multi-class problems, one of the simplest, and seemingly effective, is the so-called one-versus-all approach, where a classifier is learned to discriminate each individual class from all the others. Each classifier returns a value that should quantify the affinity of an input to the corresponding class, so that the input can be assigned to the class with the highest affinity. Though extremely simple, in this method each class is learned independently of the others and the possible information about the correlation among the classes is not exploited. Indeed, in several practical problems the classes are organized in homogeneous groups or hierarchies. The intuition is that exploiting such information might lead to better performance.

5.1.2 Multi-category classification as vector-valued learning

Here we illustrate how the framework of vector-valued learning can be used to exploit class relationships in multi-category classification.

To this end, we need to fix some basic concepts and notation. In multi-category classification each example belongs to one of d classes, that is, we can set Y={1,2,…,d} and let ρ(k|x), with k=1,…,d, denote the conditional probability of each class. A classifier is a function \(c:\mathcal{X}\to Y\), assigning each input point to one of the d classes. The classification performance can be measured via the misclassification error

$$R(c)=\mathrm{P} \bigl[ c(x)\ne y \bigr].$$

It is easy to check that the minimizer of the misclassification error is the Bayes rule, defined as

$$ {b(x)=\operatorname{argmax}_{k=1,\dots,d} \rho(k|x). }$$
(25)

A standard approach in the binary case is based on viewing classification as a regression problem with binary values. Following this idea we might consider real-valued functions to fit the labels Y={1,2,…,d}, but we would force an unnatural ordering among the classes. Another possibility is to define a coding, that is, a one-to-one map \(C:Y \to\mathcal{Y}\) where \(\mathcal{Y}=\{\bar{\ell}_{1},\dots,\bar{\ell}_{d}\}\) is a set of d distinct coding vectors \(\bar{\ell}_{k}\in{\mathbb{R}}^{d}\) for k=1,…,d. For example \(\bar{\ell}_{1}=(1,0,0,\dots,0)\), \(\bar{\ell }_{2}=(0,1,0,\dots,0), \ldots,\bar{\ell}_{d}=(0,0,0,\dots,1)\). Once we fix a coding, we can use algorithms for vector regression to fit the data, where the outputs are given by the coding. In practice the algorithm will return an estimator that takes values in the whole space \(\mathbb{R}^d\), rather than in the set of coding vectors, and we need to define a classification rule. In the binary case, a classification rule is usually defined by taking the sign of the estimator. In the vector-valued case there is no obvious strategy.

In summary, the use of vector-valued learning for multi-class problems requires the following choices:

  1. a coding scheme,

  2. a vector learning algorithm, and

  3. a classification rule.

If we measure errors using the squared loss, a simple calculation guides us through some of the above choices. We use upper indexes to indicate vector components, so that the squared loss can be written as \(\|{\bar{\ell}-f(x)}\|_{d}^{2}=\sum _{j=1}^{d}(\bar{\ell}^{j}-f^{j}(x))^{2}\). Note that, since the coding is one-to-one, the probability for each coding vector \(\bar{\ell}_{k}\) is given by ρ(k|x) for all k=1,…,d. The expected risk

$$\mathcal{E}(f)=\int_{\mathcal{X}\times\mathcal{Y}} \bigl\|{y-f(x)}\bigr\|_d^2d\rho(y|x) d\rho(x) =\int_{\mathcal{X}} \sum _{k=1}^d \bigl\|{\bar{\ell}_k-f(x)}\bigr\|_d^2 \rho(k|x) d\rho(x)$$

is minimized by the regression function, that we can express as

$$f_\rho(x)=\bigl(f_\rho^1(x),f_\rho^2(x),\dots, f_\rho^d(x)\bigr)=\int_\mathcal{Y}yd\rho(y|x)=\sum_{k=1}^d \bar{\ell}_k \rho(k|x).$$

Given a general coding

$$ \bar{\ell}_1=(a,b, \dots,b), \bar{\ell}_2=(b,a,\dots,b), \dots, \bar{\ell}_d=(b,b,\dots ,a),\quad a>b,$$
(26)

we can write the j-th component of the regression function as

$$f_\rho^j(x)= a\,\rho(j|x)+b\sum_{k\ne j}\rho(k|x)= b+(a-b)\rho(j|x),$$

since \(\sum_{k = 1}^{d} \rho(k|x)=1\). It follows that from the components of the regression function we can easily derive the conditional probabilities and, in particular

$$ {b(x)=\operatorname{argmax}_{j=1,\dots,d} f_\rho^j(x).}$$
(27)

The above calculation is simple, but shows us three useful facts. First, vector learning algorithms approximating the regression function can be used to learn the Bayes rule for a multi-class problem. Second, in this view the choice of the coding can be quite general, see (26). Third, once we have obtained an estimator of the regression function, (27) shows that the natural way to define a classification rule is to take the argmax of the components of the estimator. For loss functions other than the squared loss the above conclusions are not straightforward and a detailed discussion can be found in the papers by Tewari and Bartlett (2005) and Zhang (2004).
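A minimal end-to-end sketch of these three choices is given below, using the coding (26) with a=1, b=0, vector-valued Tikhonov regularization with a decomposable kernel, and the argmax rule (27); the specific kernel, the identity choice for A (which reproduces a one-versus-all scheme) and the toy data are assumptions made only for illustration.

```python
import numpy as np

def encode_labels(labels, d, a=1.0, b=0.0):
    """Map class labels in {0, ..., d-1} to coding vectors of the form (26)."""
    Y = np.full((len(labels), d), b)
    Y[np.arange(len(labels)), labels] = a
    return Y

def classify(F):
    """Classification rule (27): assign each input to the argmax component."""
    return np.argmax(F, axis=1)

# toy usage: A = I corresponds to independent components (a one-versus-all scheme);
# a non-trivial A would encode relations among the classes
rng = np.random.default_rng(0)
n, d = 40, 3
X = rng.standard_normal((n, 2))
labels = rng.integers(0, d, n)
Y = encode_labels(labels, d)
K = np.exp(-((X[:, None, :] - X[None, :, :])**2).sum(-1) / 2.0)
A = np.eye(d)
C = np.linalg.solve(np.kron(K, A) + 1e-2 * n * np.eye(n * d), Y.reshape(-1)).reshape(n, d)
F_train = K @ C @ A                      # f(x_i) = sum_j K(x_i, x_j) A c_j
pred = classify(F_train)
```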

In the case of the squared loss, the kernels and penalties given in Sect. 4.1 can be used to leverage the relations among the classes. In particular, for kernels of the form Γ=KA, the matrix A can be viewed as encoding the relationships among the classes. The choice of A is therefore crucial. In certain problems, the matrix A can be defined using prior information. For example, in the object recognition dataset Caltech-256 (Griffin et al. 2007), there are 256 object categories and the available hand-made taxonomy relating the categories can be exploited to design the matrix A. In general, empirically estimating the matrix A is much harder in multi-category classification than in vector-valued regression, since the covariance structure of the coding vectors does not yield any useful information.

The above discussion shows that, provided a coding of the form (26), we can use spectral regularization methods to solve multi-class problems and use (27) to obtain a classification rule. Also, (27) and the fact that spectral regularization algorithms estimate the regression function, suggest that Bayes consistency can be achieved by spectral regularization methods. Indeed, the next theorem can be easily proven using Theorem 1 and results by Tewari and Bartlett (2005) and Zhang (2004).

Theorem 2

(Bayes consistency)

Assume \(\overline{\nu} \geq\frac{1}{2}\) and \(f_{\rho}\in\mathcal{H}\). Choose a sequence of regularization parameters \(\lambda_n=\lambda(n)\) such that \(\lim_{n\to\infty}\lambda_n=0\) and \(\lim_{n \to \infty}\lambda_{n}\sqrt{n} = +\infty\). If we let \(f_{\mathbf{z}}=f_{\mathbf{z}}^{\lambda_{n}}\) and \(c_{\mathbf{z}}=\operatorname{argmax}_{j=1,\dots,d}f_{\mathbf{z}}^{j}\), then, for all ε>0,

$$ {\lim_{n\to\infty} \mathrm{P} \bigl[R(c_{\mathbf{z}})-R(b)>\varepsilon\bigr]=0, }$$
(28)

where b is the Bayes rule (25).

We add three comments. First, the proof of the above result is given in the Appendix and is based on the bound given in Theorem 1 together with a so-called comparison result relating expected risk and misclassification error. More precisely, we use a result given in Corollary 26 of the paper by Zhang (2004) (see also the results by Tewari and Bartlett 2005) to show that for the squared loss

$$R(c)-R(b)\le\psi\bigl(\mathcal{E}(f) -\mathcal{E}(f_\rho)\bigr),$$

where ψ is a decreasing function that goes to zero in the origin. Second, we note that the above result does not allow us to derive the convergence rates, since they would depend on the specific form of ψ. We leave to future work the task of investigating whether we can actually derive specific convergence rates. Third, in the above result we made the simplifying assumption that f ρ is in \(\mathcal{H}\). In fact, if the kernel is universal (Caponnetto et al. 2008) such an assumption can be dropped and (universal) Bayes consistency can be proven with similar techniques (Caponnetto 2006).

5.1.3 Comparison to other approaches

Different strategies to exploit prior information in multi-class classification have been proposed in different, but not unrelated settings. In particular, in the following we briefly discuss multi-category SVM, structured learning and error correcting output code strategies (Lee et al. 2004; Dietterich and Bakiri 1995; Szedmak and Shawe-Taylor 2005; Tsochantaridis et al. 2005).

Multi-category SVM

Results similar to ours are available for the modified hinge loss yielding the multi-category SVM studied by Lee et al. (2004): in this case the target function is the Bayes rule itself. It may be interesting to note that the results of Lee et al. (2004) require a sum-to-zero coding, e.g., a=1 and b=−1/(d−1), which is not needed in our setting. The multi-category SVM has a straightforward interpretation in terms of vector-valued regression. In fact, in our notation, given a scalar kernel K, the method corresponds to the problem

$$\min_{f^1,\dots,f^d\in\mathcal{H}_K} \Biggl\{\frac{1}{n} \sum_{i=1}^{n}\sum_{j=1}^d V\bigl(f^j(x_i),y_i^j\bigr) +\lambda\sum_{j=1}^d\bigl\|{f^j}\bigr\|^2_K\Biggr\} \quad \text{s.t.}\quad \sum_{j=1}^df^j(x_i)=0, \quad i=1,\dots,n,$$

where V is the modified hinge loss.

It is clear that we are considering a reproducing kernel Hilbert space \(\mathcal{H}\) of vector-valued functions \(f=(f^{1},\dots,f^{d})\) with no coupling among the components, since \(\|f\|^{2}_{\mathcal{H}}= \sum_{j=1}^{d} \|f^{j}\|^{2}_{K}\). The only coupling among the components of the estimator is enforced by the sum-to-zero constraint. If we drop such a constraint and consider the squared loss we obtain the following problem

$$\min_{f^1,\dots,f^d\in\mathcal{H}_K} \Biggl\{\frac{1}{n} \sum_{i=1}^{n}\sum_{j=1}^{d}\bigl(\bar{\ell}^j-f^j(x_i)\bigr)^2+\lambda\sum_{j=1}^{d} \bigl\|{f^j}\bigr\|^2_K\Biggr\}.$$

For a general coding of the form (26), the optimization can be done independently for each component and corresponds to classifying a given class against all the others. It is then clear that, by taking the argmax over the components, we recover the simple one-versus-all scheme, albeit with a common regularization parameter for all classes.
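To make this reduction concrete, the following is a minimal NumPy sketch of the one-versus-all scheme with the squared loss, assuming a precomputed scalar kernel matrix; the coding values and all names are illustrative, and the Tikhonov filter is used on each component for simplicity.

```python
import numpy as np

def one_vs_all_ridge(K, y, n_classes, lam, a=1.0, b=None):
    """Squared loss without the sum-to-zero constraint: independent kernel ridge
    regressions on the columns of a coding matrix, sharing one regularization
    parameter lam. K is the (n, n) scalar kernel matrix, y holds integer labels."""
    n = K.shape[0]
    if b is None:
        b = -1.0 / (n_classes - 1)   # one possible coding; (26) allows more general choices
    Y = np.full((n, n_classes), b)
    Y[np.arange(n), y] = a           # code the true class of each example
    # All components share (K + n*lam*I), so a single factorization suffices.
    return np.linalg.solve(K + n * lam * np.eye(n), Y)

def predict_classes(K_test_train, C):
    """Classification rule (27): argmax over the components of the estimator."""
    return np.argmax(K_test_train @ C, axis=1)
```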

Structured learning

In structured learning the empirical risk minimization approach is extended to a very general class of problems where the outputs are structured objects. In particular, in the case of multi-category classification, Tsochantaridis et al. (2005) propose to use a joint feature map on inputs and outputs which is in fact the product of a feature map on the input and a feature map on the output. Such an approach is essentially equivalent to using the decomposable kernels discussed in Sect. 20. The development of Szedmak and Shawe-Taylor (2005) does not explicitly use the machinery of vector-valued RKHS and it would be interesting to investigate more formally the connections between the two approaches while considering the general structured learning problem beyond multi-output settings. Error bounds in the context of structured learning are available (McAllester 2007), but to the best of our knowledge an in-depth discussion of multi-category classification is not available.

Error correcting output codes

Error correcting output code (ECOC) strategies (for example, Dietterich and Bakiri 1995) differ from the approach we described here in the way the correlation among tasks is exploited. More precisely, instead of simply considering the argmax of the one-versus-all output, more sophisticated decoding strategies are considered. These approaches are interesting as they try to take advantage of the full information contained in the different binary classifiers. On the other hand, they are hard to compare to our study and, more generally, to analyze within the framework of statistical learning theory.

5.2 Spectral regularization in multi-task learning

In this section, we briefly discuss the use of spectral regularization methods for a general multi-task problem. As we mentioned in the introduction, the latter can be seen as a generalization of vector-valued learning where, in particular, each output component might have samples of different cardinalities. Among many references, we mention the original paper by Caruana (1997), the works using regularization approaches (see references in the introduction and Sect. 20) and also Bayesian techniques using Gaussian processes (Bonilla et al. 2007; Chai et al. 2009).

In the following, we use the notation introduced in Remark 1. To use spectral regularization techniques for multi-task problems we need to slightly adapt the derivation we proposed for learning vector-valued functions. This is essentially due to the fact that, although we can simply view tasks as components of a vector-valued function, each task can now have different input points. The description of vector-valued RKHS given in Remark 3 turns out to be useful, since it allows us to work component-wise.

Recall that according to Remark 3 we can view vector-valued RKHS as defined by a (joint) kernel Q:(X,Π)×(X,Π)→ℝ, where Π={1,…,d} is the index set of the output components. A function in this RKHS is

$$f(x,t) = \sum_{i} Q\bigl((x,t),(x_i,t_i)\bigr) c_i,$$

with norm

$$\| f \|_Q^2 = \sum_{i,j} Q\bigl((x_j,t_j),(x_i,t_i)\bigr)c_ic_j.$$

The functions f(⋅,t)=f t(⋅) are simply the components corresponding to each task and the above notation can be thought of as a component-wise definition of a vector-valued function.

In view of the above representation, if n j training points are provided for each task j=1,…,d, it is natural to rewrite the empirical error

$$\sum_{j=1}^d \frac{1}{n_j} \sum _{i=1}^{n_j} \bigl(y^j_i-f^j(x_i)\bigr)^2$$

as

$$\frac{1}{n} \sum_{i=1}^{n}\bigl(y_i-f(x_i,t_i)\bigr)^2$$

where we consider a common training set \(\{(x_1,y_1,t_1),\dots,(x_n,y_n,t_n)\}\) with \(n=\sum_{j=1}^{d} n_{j}\), in which, for \(i=1,\dots,n\), the example \((x_i,y_i)\) belongs to task \(t_i\).

The representer theorem ensures that the solution of empirical risk minimization is of the form

$$f(x,t) = f_t(x) = \sum_{i=1}^nQ\bigl((x,t),(x_i,t_i)\bigr) c_i$$

with coefficients given by

$$\boldsymbol{\Gamma}\mathbf{C}= \mathbf{Y}$$

where \(\mathbf{C}=(c_1,\dots,c_n)\), \(\boldsymbol{\Gamma}_{ij}=Q((x_i,t_i),(x_j,t_j))\) and \(\mathbf{Y}=(y_1,\dots,y_n)\).

Directly inverting the matrix Γ leads to an unstable solution with very poor generalization performance; in other words, it overfits the training data. The spectral filters proposed in this paper tackle this issue by filtering out the unstable components of Γ and provide alternatives to Tikhonov regularization. The solution is obtained as

$$\mathbf{C}= g_\lambda(\boldsymbol{\Gamma})\mathbf{Y},$$

where g λ is one of the spectral filters described in Sect. 3.3.

We conclude this section by observing that, differently from vector-valued regression, in general the matrix Γ is not a block matrix. In particular, when the kernel is \(Q((x,t),(x',t'))=K(x,x')A_{t,t'}\), the kernel matrix is no longer the Kronecker product between the scalar kernel matrix K and A. This implies that it is no longer possible to reduce the complexity of the problem using the technique described at the end of Sect. 4.3. Therefore, iterative methods might be considerably more efficient than Tikhonov regularization, as we will show with some experiments in the next section.
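As a minimal sketch of the multi-task construction above, the code below builds the joint kernel matrix for \(Q((x,t),(x',t'))=K(x,x')A_{t,t'}\) on a pooled training set and applies a spectral filter through the eigendecomposition of Γ; the Tikhonov filter is shown only as an example, and the Gaussian scalar kernel and all names are assumptions of the sketch.

```python
import numpy as np

def gaussian_kernel(X1, X2, width):
    """Scalar Gaussian kernel matrix K(x, x')."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * width**2))

def joint_kernel_matrix(X, tasks, A, width):
    """Gamma_ij = Q((x_i, t_i), (x_j, t_j)) = K(x_i, x_j) * A[t_i, t_j] on the pooled
    training set; tasks holds the integer task index of each pooled example.
    In general this is not a Kronecker product, since each task may be sampled
    at different inputs."""
    return gaussian_kernel(X, X, width) * A[np.ix_(tasks, tasks)]

def spectral_filter_solution(Gamma, y, lam):
    """C = g_lambda(Gamma) y, here with the Tikhonov filter g_lambda(s) = 1/(s + lam);
    any other filter from Sect. 3.3 can be plugged in by changing one line."""
    evals, evecs = np.linalg.eigh(Gamma)
    return evecs @ ((evecs.T @ y) / (evals + lam))

def predict(X_new, t_new, X, tasks, A, width, c):
    """f(x, t) = sum_i Q((x, t), (x_i, t_i)) c_i."""
    return (gaussian_kernel(X_new, X, width) * A[np.ix_(t_new, tasks)]) @ c
```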

6 Empirical analysis

In this section we present an empirical analysis of the spectral regularization algorithms. We first consider an academic example aimed at comparing the computational cost of the various spectral filters while illustrating the difference between multi-task and vector-valued learning. Second, we present some artificial examples of 2D vector fields for which our approach outperforms regressing on each component independently with the same filter function. On these fields, we also compare the proposed approach with a state-of-the-art sparsity-enforcing method proposed by Obozinski et al. (2007) and discuss its drawbacks. Finally, we consider a real-world case where our methods run faster than standard Tikhonov regularization, while achieving a performance comparable to the best in the literature. Note that our simulations of 2D vector fields resemble the flow of an incompressible fluid. A common practical problem in experimental physics is that of estimating a velocity field from scattered spatial measurements. Using kernel functions tailored to physical vector fields, see Sect. 20, we show how this problem can be effectively solved.

6.1 Simulated data

Vector-valued regression vs. multi-task learning

We consider an academic situation where each task is given by the same function plus a task-specific perturbation. More precisely, we study the case where the input space is the interval [0,1] and we have four tasks. Each task t is given by a target function \(f_t = f_{\mathrm{com}} + \alpha f_{\mathrm{pert},t}\) corrupted by normal noise of variance 0.01. The target function common to all tasks is \(f_{\mathrm{com}}=\sin(2\pi x)\). The weight α is set equal to 0.6. The perturbation function is a weighted sum of three Gaussians of width σ=0.1 centred at \(x_1=0.05\), \(x_2=0.4\) and \(x_3=0.7\). We designed the task-specific weights of the perturbation so as to yield tasks that are still strongly related by the common target function, but also present local differences, as shown in Fig. 2. It may appear that the tasks are simply shifted versions of the common sine function and that an approach based on computing the sample covariance might be able to estimate the phase differences. This is not the case, since the perturbations added to each task are local and defined by Gaussian functions. Notwithstanding the simplicity of this example, we believe it is illustrative of the different behaviours of the multi-task and vector-valued settings and it allows us to compare the computational properties of three spectral filters in a controlled setting.

Fig. 2

The four tasks/components (before being corrupted by Gaussian noise of variance 0.01) used to compare multi-task and vector-valued learning. The tasks are generated by perturbing the common task (a sine function) with three Gaussians of width 0.1, centred at \(x_1=0.05\), \(x_2=0.4\) and \(x_3=0.7\). The Gaussians are multiplied by task-specific coefficients

We first verified that the predictive performances of the spectral filters are very similar (results not shown for brevity), in accordance with similar observations for the scalar case (Lo Gerfo et al. 2008). We then chose Accelerated L2 Boosting (see Sect. 3.3, in the following referred to as the ν-method) to illustrate the typical behaviour. In the multi-task case, each task is sampled at different input points, whereas in the vector-valued case the input points are the same for all the components. We used the kernel (19), Γ(x,x′)=K(x,x′)A, with K a scalar Gaussian kernel and A=ω 1+(1−ω)I, where 1 is the 4×4 matrix whose entries are all equal to 1 and I is the 4×4 identity matrix. This kernel forces all components to be similar to their mean, to an extent controlled by the parameter ω (see Appendix A for further details). The parameter ω and the regularization parameter were selected on a validation set of the same size as the training set. The performance of the algorithm is measured by the mean squared error (MSE) on an independent test set, as a function of the number of training points available for each task/component. To evaluate the average performance and its variance, we ran the simulation 10 times for each training and validation set size, resampling both sets.
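For reference, a short sketch of the kernel used in this experiment follows; in the vector-valued case, where all components share the same inputs, the full kernel matrix is the Kronecker product of the scalar kernel matrix with A (the output stacking order used below is an assumption of the sketch).

```python
import numpy as np

def similarity_matrix(n_tasks, omega):
    """A = omega * 1 + (1 - omega) * I: omega = 0 gives independent components,
    while larger omega pulls all components towards their mean."""
    return omega * np.ones((n_tasks, n_tasks)) + (1 - omega) * np.eye(n_tasks)

def vector_valued_kernel_matrix(K, A):
    """For Gamma(x, x') = K(x, x') A and inputs common to all components, the kernel
    matrix is kron(K, A), assuming outputs are stacked point by point."""
    return np.kron(K, A)
```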

We show the results for the multi-task case in Fig. 3 (left panel), where we compare the error obtained with the matrix-valued kernel and the error of learning each task independently with a Gaussian kernel of the same width. We observe that exploiting the coupling among the tasks is significantly advantageous. The median of the selected values for the kernel parameter ω is 0.6, indicating that the validation process selects an intermediate correlation between the tasks. The results for vector-valued learning are given in Fig. 3 (right panel), where we see that there is no gain in using a non-diagonal kernel.

Fig. 3

Results for the multi-task case (left) and for the vector-valued case (right) using the ν-method with a maximum of 200 iterations. Solid lines represent average test error, while dotted lines show the average test error plus/minus one standard deviation of the corresponding error. The test error is evaluated on an independent test set of 1000 examples as the mean squared error on all examples, counting the components of the vector-valued function as different points. For the multi-task case, the training points are sampled independently for each task, whereas in the vector-valued case the training points are the same for all the components. For each training set cardinality, the experiment was run 10 times with different sampling of the training and validation examples

In Fig. 4 (left panel) we report the time required to select the optimal regularization parameter on the validation set in the multi-task case. The vector-valued case presented the same behaviour (graph not shown). The algorithms are Tikhonov with 25 regularization parameter values, Landweber with a maximum of 1000 iterations and the ν-method with a maximum of 200 iterations. The number of parameters was chosen so that the validation error achieves its minimum within the range. As expected from the complexity considerations of Sect. 4.3, the ν-method outperforms the other methods. In Fig. 4 (right panel) we report the time required to select the optimal regularization parameter via Leave One Out Cross-Validation (LOO) in the vector-valued scenario. In this case it is possible to exploit the results of Sect. 4.2 and the closed-form solution for the LOO error for Tikhonov regularization. Indeed, Tikhonov regularization combined with these two results is faster than the iterative algorithms, which require evaluating the entire regularization path for each LOO loop.

Fig. 4

Time needed to select the best regularization parameter for different algorithms and settings. Note the log scale on the time axis. The left panel reports the time required to select the regularization parameter in the multi-task setting as a function of the number of training examples. The regularization parameter is chosen on a validation set of the same size as the training set. The right panel shows the time needed to select the regularization parameter via Leave One Out Cross-Validation on the training set only. We implemented the optimization described in Sect. 4.2 and the closed-form solution to compute the LOO errors for Tikhonov. The parameter ranges are 25 values for Tikhonov, a maximum of 1000 iterations for Landweber and a maximum of 200 iterations for the ν-method. The computations were performed with MATLAB on a notebook with 2 GB of RAM and a 2 GHz Intel Core 2 Duo Processor

2D vector field—1

The following set of simulations is aimed at showing the advantages of using the divergence-free and curl-free kernels, (22) and (23) respectively, for the estimation of a general 2-dimensional vector field defined on a 2-dimensional input space. By adopting a convex combination of the two kernels, weighted by a parameter \(\tilde{\gamma}\), it is possible to reconstruct the divergence-free and curl-free parts of the field (Macêdo and Castro 2008).

The vector field is generated from a scalar field defined as the sum of 5 Gaussians centred at (0,0), (1,0), (0,1), (−1,0) and (0,−1). The covariances are all set to be diagonal, namely 0.45I (I is the 2×2 identity matrix). By computing the gradient of the scalar field, we obtain an irrotational (curl-free) field. The vector field perpendicular to the latter (computed by applying a π/2 rotation) is a solenoidal (divergence-free) field. We consider a convex combination of these two vector fields, controlled by a parameter γ. One instance of the resulting field, for γ=0.5, is shown in Fig. 5.
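The construction can be sketched as follows; the normalization of the Gaussians, the orientation of the rotation, and which of the two parts the weight γ multiplies are assumptions of the sketch.

```python
import numpy as np

def gaussian_potential_grad(X, centres, var=0.45):
    """Gradient of a sum of isotropic Gaussians with covariance var*I, evaluated at
    the rows of X; this gives the irrotational (curl-free) part."""
    grad = np.zeros_like(X)
    for c in centres:
        diff = X - c
        g = np.exp(-np.sum(diff**2, axis=1) / (2 * var))
        grad += -(diff / var) * g[:, None]
    return grad

def make_field(X, gamma):
    """Convex combination of the curl-free field and its pi/2 rotation (divergence-free)."""
    centres = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1]], dtype=float)
    curl_free = gaussian_potential_grad(X, centres)
    div_free = np.stack([-curl_free[:, 1], curl_free[:, 0]], axis=1)   # rotate by pi/2
    return gamma * div_free + (1 - gamma) * curl_free

# Example: a field like the one of Fig. 5, on a 70x70 grid over [-2, 2] x [-2, 2].
xs = np.linspace(-2, 2, 70)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
V = make_field(grid, gamma=0.5)
```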

Fig. 5

Visualization of the first artificial 2-dimensional vector field for γ=0.5

We compare our vector-valued regression approach with estimating each component of the field independently. We use the ν-method, which is the fastest algorithm when the matrix-valued kernel is not of the form Γ=KA. We adopt 5-fold cross-validation to select the optimal number of iterations and the parameter \(\tilde{\gamma}\). The scalar kernel is a Gaussian kernel of width 0.8.
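Although (22) and (23) are not repeated here, the standard divergence-free and curl-free Gaussian kernels of Macêdo and Castro (2008) take the following form (up to normalization, which may differ from the expressions used in the paper); the sketch below also shows the convex combination weighted by \(\tilde{\gamma}\).

```python
import numpy as np

def curl_free_kernel(x, y, sigma):
    """Curl-free matrix-valued Gaussian kernel: minus the Hessian of
    exp(-|x - y|^2 / (2 sigma^2)); its columns are gradient (irrotational) fields."""
    u = x - y
    g = np.exp(-(u @ u) / (2 * sigma**2))
    return (g / sigma**2) * (np.eye(2) - np.outer(u, u) / sigma**2)

def div_free_kernel(x, y, sigma):
    """Divergence-free matrix-valued Gaussian kernel (Hessian minus Laplacian times
    the identity, in two dimensions); its columns are solenoidal fields."""
    u = x - y
    r2 = u @ u
    g = np.exp(-r2 / (2 * sigma**2))
    return (g / sigma**2) * (np.outer(u, u) / sigma**2 + (1 - r2 / sigma**2) * np.eye(2))

def combined_kernel(x, y, sigma, gamma_tilde):
    """Convex combination of the two kernels, weighted by gamma_tilde."""
    return gamma_tilde * div_free_kernel(x, y, sigma) + \
           (1 - gamma_tilde) * curl_free_kernel(x, y, sigma)
```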

First, we consider the noiseless case. The vector field is constructed by specifying a value of the parameter γ, which we vary from 0 to 1 in 0.1 increments. The field is then computed on a 70×70 point grid over the square [−2,2]×[−2,2]. The models are trained on a uniform random sample of points from this grid and their predictions on the whole grid (except the training points) are compared to the correct field. The number of training examples is varied from 10 to 200 and, for each cardinality of the training set, the training and prediction process is repeated 10 times with different samplings of the training points.

Following Barron et al. (1994), we use an angular measure of error to compare two fields. If \(v_{o} = (v_{o}^{1}, v_{o}^{2})\) and \(v_{e} =(v_{e}^{1}, v_{e}^{2})\) are the original and estimated fields, we consider the transformation \(v \to\tilde{v} = \frac{1}{\|(v^{1}, v^{2}, 1)\|}(v^{1}, v^{2},1)\). The error measure is then

$$ \mathit{err} = \arccos(\tilde{v}_e \cdot \tilde{v}_o).$$
(29)

This error measure was derived by interpreting the vector field as a velocity field and it is convenient because it handles large and small signals without the amplification inherent in a relative measure of vector differences.
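A direct implementation of (29), assuming the fields are stored as arrays of 2D vectors, reads:

```python
import numpy as np

def angular_error(v_est, v_orig):
    """Angular error (29) between estimated and original 2D fields, given as (n, 2)
    arrays: each vector is lifted to (v1, v2, 1), normalized, and the angle between
    the lifted vectors is returned (one value per grid point)."""
    def lift(v):
        w = np.concatenate([v, np.ones((v.shape[0], 1))], axis=1)
        return w / np.linalg.norm(w, axis=1, keepdims=True)
    cos = np.clip(np.sum(lift(v_est) * lift(v_orig), axis=1), -1.0, 1.0)  # guard rounding
    return np.arccos(cos)
```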

The results for the noiseless case are reported in Fig. 6, which clearly shows the advantage of using a vector-valued approach with the combination of curl-free and divergence-free kernels. We present only the results for the fields generated with γ=0 and γ=0.5, since for the remaining fields the errors lie between these two cases. The prediction errors of the proposed approach via the ν-method are always lower than the errors obtained by regressing on each component independently, even when the training set is quite large. The average value of the estimated parameter \(\tilde{\gamma}\) converges to the true value of γ as the number of training points increases, indicating that the model can learn the field decomposition automatically, see Fig. 7.

Fig. 6

Vector field 1—noiseless case. Test errors for the proposed vector-valued approach and for learning each component of the field independently as a function of the number of training points used for learning. Solid lines represent average test error, while dotted lines show the average test error plus/minus one standard deviation of the corresponding error. The test error is evaluated according to the error measure (29)

Fig. 7

Vector field 1—noiseless case. The solid lines represent the averages of the estimated kernel parameter \(\tilde{\gamma}\) that governs the balance between the divergence-free and the curl-free matrix-valued kernels. The dotted lines represent the values of the parameter γ that was used to design the vector field. The learning algorithm estimates these values correctly, allowing the irrotational and solenoidal parts of the field to be reconstructed separately

We then consider the case with normal noise whose standard deviation is independent of the signal and is chosen to be 0.3. We follow the same experimental protocol adopted for the noiseless case. The results are reported in Fig. 8 and indicate that also in the presence of noise the proposed approach consistently outperforms regressing on each component independently.

Fig. 8

Vector field 1—noise with standard deviation 0.3. Test errors for the proposed vector-valued approach and for learning each component of the field independently as a function of the number of training points used for learning. Solid lines represent average test error, while dotted lines show the average test error plus/minus one standard deviation of the corresponding error. The test error is evaluated according to the error measure (29)

It is now interesting to apply this approach to a vector field that is not directly given as the sum of a divergence-free and a curl-free part, but that satisfies the hypotheses of the Helmholtz decomposition of a vector field.

2D vector field—2

The Helmholtz theorem states that a vector field which is twice continuously differentiable and which vanishes faster than 1/r at infinity (r is the distance from the origin) can be decomposed as the sum of a divergence-free part and a curl-free part. Therefore, if we are dealing with such a vector field, we expect to be able to estimate it via a combination of the divergence-free and curl-free kernels. This second artificial experiment aims at showing that it is indeed possible to obtain a better estimate using these kernels when the vector field satisfies the assumptions of the Helmholtz theorem. We compare our approach with a state-of-the-art sparsity-enforcing method (Obozinski et al. 2007), which, being a linear model, requires suitable dictionaries of basis fields in order to deal with non-linear vector fields. In the following, we replicate the preprocessing chains of the papers by Mussa-Ivaldi (1992) and Haufe et al. (2009), corresponding to two different dictionaries of basis fields, in order to make our results comparable. An outcome of our experiments is that the method of Obozinski et al. (2007) is computationally much slower and critically depends on the choice of the dictionary (i.e. the feature map).

We generated a vector field on a grid of 70×70 points within [−2,2]×[−2,2], whose components are given by

In order to enforce the decay at infinity, the field is multiplied by a Gaussian function centred at the origin and of width 1.2. The field without noise is shown in Fig. 9.

Fig. 9

Visualization of the second artificial vector field without noise

We followed an experimental protocol similar to the one adopted for the previous artificial experiment. In this case there is no field parameter to vary, but only the amount of noise, which we consider proportional to the signal. This means that, for each point of the field, the standard deviation of the noise added at that point is proportional to the magnitude of the field. The model parameters are selected on a validation set of the same size as the training set, instead of performing the costly 5-fold cross-validation. Our approach consists of using the ν-method with a convex combination of the divergence-free and curl-free kernels, (22) and (23) respectively, controlled by a parameter γ, which is selected on the validation set alongside the optimal number of iterations. For the weight balancing the two kernels, we explored 11 values equally spaced between 0 and 1, and we set the maximum number of iterations to 700, which was also used for regressing on each field component independently. We set the width of the Gaussian part of the matrix-valued kernels and the width of the Gaussian kernel for scalar regression to 0.8.

For comparison, we use the algorithm proposed by Mosci et al. (2008) for minimizing the functional of the sparsity-enforcing method of Obozinski et al. (2007). The method consists of a linear multi-task model, a dictionary of basis fields and a two-step procedure. The first step is the selection of dictionary elements (or features) uniformly across tasks, which is followed by a regularized least squares step for optimally estimating the coefficients of the selected features. The algorithm depends on two regularization parameters, τ and λ. The first weighs the \(\ell_1\) penalty on the norms of the coefficient vectors for each task and is responsible for obtaining sparse solutions. The second weighs the \(\ell_2\) penalty on the coefficients of the regularized least squares step on the selected features. Both parameters were selected on the validation set among a geometric series of 30 values between \(10^{-8}\) and 1. Since the vector field is obviously non-linear, we consider two different feature maps from \(\mathbb{R}^2\) to a higher dimensional Hilbert space, where we can treat the estimation problem as linear. These feature maps are based on dictionaries of basis functions.
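As a rough sketch of this two-step procedure (not the algorithm of Mosci et al. (2008) actually used in our experiments), one can use scikit-learn's MultiTaskLasso, whose mixed-norm penalty selects features jointly across tasks, followed by a ridge refit on the selected features; the parameter names tau and lam mirror τ and λ above and are illustrative.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso, Ridge

def two_step_sparse_multitask(Phi, Y, tau, lam):
    """Phi: (n, p) dictionary evaluated at the inputs; Y: (n, d) outputs, one column
    per task/component. Step 1: joint feature selection across tasks. Step 2: ridge
    refit on the selected features only."""
    selector = MultiTaskLasso(alpha=tau).fit(Phi, Y)
    selected = np.any(selector.coef_ != 0, axis=0)   # coef_ has shape (d, p)
    if not selected.any():                           # degenerate case: keep everything
        selected = np.ones(Phi.shape[1], dtype=bool)
    refit = Ridge(alpha=lam).fit(Phi[:, selected], Y)
    W = np.zeros((Y.shape[1], Phi.shape[1]))
    W[:, selected] = refit.coef_
    return W, refit.intercept_
```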

The first dictionary contains basis vector fields with null divergence or null curl, centred on the nodes of a grid of L=17×17 lines spaced Δ=0.25 in either direction. Following Mussa-Ivaldi (1992), the elements of the curl-free field basis are obtained as the gradients of Gaussian potential functions centred on the nodes of the grid, in our case \(\phi(x,x_j)=-2(x-x_j)G(x,x_j)/\sigma^2\), where \(G(x,x_j)=\exp(-\|x-x_j\|^2/\sigma^2)\). To ensure a significant overlap between neighbouring fields, we set \(\sigma^2=2\Delta\). The divergence-free basis \(\varphi(x,x_j)\) is obtained from the curl-free one by a π/2 rotation, so that \(\varphi^1(x,x_j)=-\phi^2(x,x_j)\) and \(\varphi^2(x,x_j)=\phi^1(x,x_j)\). In the paper by Mussa-Ivaldi (1992), the estimated vector field is a linear combination of the basis fields, \(f(x) = \sum_{j=1}^{L} c_{j} \phi(x,x_{j}) + \sum_{j=1}^{L} d_{j} \varphi (x,x_{j})\), so that each component \(f^t\) of the field depends on the same coefficients \(c_j\) and \(d_j\). Conversely, we allow each component to depend on a different set of coefficients \(c_{j}^{t}\) and \(d_{j}^{t}\):

$$f^t(x) = \sum_{j=1}^{L}c_j^t \phi^t(x,x_j) + \sum _{j=1}^L d_j^t\varphi^t(x,x_j), \quad t = 1, 2 .$$

This approach no longer allows us to view the vector field as a linear combination of the elements of the field basis, because we are in fact adopting a different scalar basis for each component. In other words, each task is given by a linear model on a different set of features. Obviously, it is no longer possible to decompose the estimated field into its divergence-free and curl-free parts.
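The first dictionary can be generated as in the sketch below, which follows the formulas above; the grid construction and the array layout are assumptions.

```python
import numpy as np

def curl_div_free_dictionary(X, delta=0.25, half_width=2.0):
    """Evaluate the curl-free bases phi(x, x_j) = -2 (x - x_j) G(x, x_j) / sigma^2,
    with G(x, x_j) = exp(-|x - x_j|^2 / sigma^2) and sigma^2 = 2 * delta, and their
    pi/2 rotations (the divergence-free bases) on the rows of X.
    Returns two arrays of shape (n, L, 2)."""
    grid_1d = np.arange(-half_width, half_width + 1e-9, delta)           # 17 nodes per axis
    nodes = np.stack(np.meshgrid(grid_1d, grid_1d), -1).reshape(-1, 2)   # L = 17 * 17
    sigma2 = 2 * delta
    diff = X[:, None, :] - nodes[None, :, :]                             # (n, L, 2)
    G = np.exp(-np.sum(diff**2, axis=2) / sigma2)                        # (n, L)
    phi = -2 * diff * G[:, :, None] / sigma2                             # curl-free bases
    varphi = np.stack([-phi[:, :, 1], phi[:, :, 0]], axis=2)             # divergence-free bases
    return phi, varphi
```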

The second dictionary we consider is the one proposed and used by Haufe et al. (2009) for the estimation of electric currents in the brain from scattered EEG/MEG measurements. In this case, the vector field is modelled as a combination of elements of a field basis, c j (to be estimated), with weights given by spherical Gaussians b j,s (x) centred on L points x j and characterized by S widths σ s ,

$$f(x) = \sum_{j=1}^L \sum _{s=1}^S \mathbf{c}_{j,s}b_{j,s}(x) .$$

We can consider the Gaussians as defining a feature map \(\phi:\mathbb{R}^2\to\mathbb{R}^{LS}\),

$$\phi(x) = \bigl[b_{1,1}(x)\quad b_{1,2}(x)\quad \ldots\quad b_{1,S}(x)\quad \ldots\quad b_{L,1}(x)\quad \ldots \quad b_{L,S}(x)\bigr]^T ,$$

allowing us to write the field as \(f(x)=\mathbf{C}\phi(x)\), where \(\mathbf{C}=[\mathbf{c}_{1,1}\ \mathbf{c}_{1,2}\ \dots\ \mathbf{c}_{1,S}\ \dots\ \mathbf{c}_{L,1}\ \dots\ \mathbf{c}_{L,S}]\) is the coefficient matrix. We computed the spherical Gaussians on the same grid of 17×17 points used for the previous dictionary, and chose four different values of the standard deviation, \(\sigma_s=0.2\times 2^{s-1}\), s=1,…,4. All these choices were made by balancing the number and locality of the basis functions. In order to keep things simple, we kept the dictionary fixed as we varied the number of examples available for training and model selection. One could argue that a data-driven dictionary could allow for increased accuracy, but this analysis was beyond the scope of our experimental assessment.
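A sketch of this second feature map is given below; the exact form of the Gaussians b_{j,s} and the ordering of the features (here grouped by width rather than by node) are assumptions, equivalent up to a permutation of the columns.

```python
import numpy as np

def spherical_gaussian_features(X, half_width=2.0, delta=0.25, widths=(0.2, 0.4, 0.8, 1.6)):
    """Feature map phi: R^2 -> R^{L*S} made of spherical Gaussians centred on an
    L = 17 x 17 grid with S widths sigma_s = 0.2 * 2^(s-1). Returns an (n, L*S)
    matrix, so that the field is modelled as f(x) = C phi(x)."""
    grid_1d = np.arange(-half_width, half_width + 1e-9, delta)
    nodes = np.stack(np.meshgrid(grid_1d, grid_1d), -1).reshape(-1, 2)    # (L, 2)
    sq = np.sum((X[:, None, :] - nodes[None, :, :])**2, axis=2)           # (n, L)
    return np.concatenate([np.exp(-sq / (2 * s**2)) for s in widths], axis=1)
```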

In Fig. 10 (solid line), we report the test errors obtained using the proposed ν-method spectral filter with the convex combination of divergence-free and curl-free matrix-valued kernels. The dotted line shows the test errors achieved by regressing on each component of the field independently, with the same spectral filter and with a Gaussian kernel of the same width used for the matrix-valued kernels. It is evident that for estimating this general vector field a vector-valued approach is still advantageous, even though the gain in performance deteriorates with the amount of noise. In fact, the noise disrupts the smoothness of the field, which can no longer be exactly decomposed as the sum of a divergence-free and a curl-free part. Computationally, the ν-method applied to vector-valued regression is only slightly more expensive than for scalar regression, as shown in Fig. 12.

Fig. 10

Vector field 2. Test errors for the proposed vector-valued approach and for learning each component of the field independently with the ν-method spectral filter, in the noiseless case (left) and when the standard deviation of the noise is equal to 20% of the field magnitude (right). The test error is evaluated according to the error measure (29)

In Fig. 11 we compare the test errors of the proposed vector-valued approach with the sparsity-enforcing multi-task method using the two dictionaries described above. We observe that when we use a dictionary consisting of elementary vector fields with null divergence or null curl, we obtain results similar to those obtained using the corresponding matrix-valued kernels. On the other hand, if a more general dictionary is used, the results are slightly worse; this may be due to a more critical dependence on the tuning of the dictionary parameters, e.g., the number of nodes and the standard deviations of the Gaussians.

Fig. 11

Vector field 2. Test errors for the proposed vector-valued approach and for the sparsity-enforcing method (Obozinski et al. 2007) in the noiseless case (left) and when the standard deviation of the noise is equal to 20% of the field magnitude (right). The first dictionary is the one proposed by Mussa-Ivaldi (1992), while the second is adapted from Haufe et al. (2009). The test error is evaluated according to the error measure (29)

Figure 12 reports the computation times for training and model selection for all the methods assessed, for each replicate of the experiment. We observe that the proposed vector-valued approach using the ν-method spectral filter is significantly faster than the sparsity-enforcing multi-task method and comparable to regressing on each component independently. The differences in computation time between the two dictionaries can be partly explained by the different sizes of the feature maps: the second is twice as large, since we consider L×S=17×17×4=1156 Gaussians instead of L divergence-free and L curl-free bases.

Fig. 12

Vector field 2. Computation time for the proposed vector-valued approach, for the multi-task feature selection method and for learning each component of the field independently in the noiseless case. The computations were performed with MATLAB on a notebook with 2 GB of RAM and a 2 GHz Intel Core 2 Duo Processor

6.2 Real data

School data

This dataset from the Inner London Education Authority has been used in previous works on multi-task learning (Bakker and Heskes 2003; Evgeniou et al. 2005; Argyriou et al. 2008a) and has become a standard benchmark over recent years. It consists of the examination scores of 15362 students from 139 secondary schools in London during the years 1985, 1986 and 1987. Hence, there are 139 tasks, corresponding to predicting student performance in each school. The input data for each student consist of school attributes and personal attributes. The school attributes are: percentage of students eligible for free school meals, percentage of students in VR band one (highest band in a verbal reasoning test), school gender (male, female or mixed) and school denomination. Student-specific attributes are: gender, VR band (which can take values 1, 2 or 3) and ethnic group (among 13 possible values). Following the literature, we converted the categorical attributes using one binary variable for each possible attribute value, but we only considered student-specific attributes. Each student is thus characterized by a feature vector of 19 bits. The school attributes could be used to define a similarity score between the schools, which we leave for possible future work. For now we are only interested in comparing the results and computation times of the Landweber, ν-method and direct Tikhonov algorithms using the simple common similarity kernel (33).

We randomly selected only 60% of the students from each school and divided their data equally into three sets: training, validation and test. Each set has 3124 students, with on average 22 students per school. The validation set is used to select the regularization parameter and the value of the parameter ω for the kernel (33). On the test set we evaluated the generalization performance of the three algorithms using the measure of explained variance (Bakker and Heskes 2003). Explained variance is defined as one minus the ratio between the mean squared test error and the total variance of the data (across all tasks). We opted for a Gaussian scalar kernel whose width was chosen to be the mean distance of the k nearest neighbours to each training point, where k is set to 20% of the cardinality of the training set. We repeated this procedure ten times, with different random samplings of the students from each school, to evaluate the stability of the error estimates obtained with each filter.
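For clarity, the explained-variance measure can be computed as in the short sketch below, which assumes test outputs pooled across all tasks.

```python
import numpy as np

def explained_variance(y_true, y_pred):
    """One minus the ratio between the mean squared test error and the total
    variance of the test outputs, pooled across all tasks."""
    return 1.0 - np.mean((y_true - y_pred)**2) / np.var(y_true)
```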

In Table 1 we report the test performance and the time needed to select the optimal parameters on the validation set (without taking into account the time needed to compute the kernel matrices, since these are the same for all algorithms). The range of the parameter ω is [0,1], sampled in steps of 0.1. The three algorithms perform consistently and improve on the results of Argyriou et al. (2008a), who obtain a performance of 26.4%±1.9%, despite being trained only on 20% of the available data, plus an additional 20% for parameter selection. The results of Argyriou et al. (2008a) were achieved using 75% of the data for training and adopting 15-fold cross-validation to select the regularization parameter. No computation time is reported in previous works, while in our results the ν-method is almost two orders of magnitude faster than Tikhonov and more than one order of magnitude faster than Landweber. Obviously, the validation time depends on the number of iterations or the number of values of the regularization parameter to evaluate. For Landweber, after a preliminary assessment, we opted for a maximum of 3000 iterations, while for the ν-method a maximum of only 150 iterations. For Tikhonov, we chose 30 values sampled geometrically in the interval \([10^{-5},10^{-2}]\) and we performed a Singular Value Decomposition of the kernel matrix to compute the regularized inverse more efficiently for the different regularization parameters.

Table 1 Performance as measured by the explained variance and model selection time for the Landweber, ν-method and Tikhonov algorithms on the School dataset. The multi-task feature learning method proposed by Argyriou et al. (2008a) obtains a performance of 26.4%±1.9%. The computations were performed with MATLAB on a notebook with 2 GB of RAM and a 2 GHz Intel Core 2 Duo Processor

7 Conclusion

In this paper, we studied the problem of learning vector-valued functions using a class of regularized kernel methods called spectral regularization. Tikhonov regularization and (vector-valued) L2 boosting are examples of methods falling within our framework. We discussed computational issues, comparing the implementations of the different algorithms depending on the structure of the kernel. Some of the algorithms, in particular the iterative methods, provide interesting computational alternatives to Tikhonov regularization and were shown in the experiments to be much faster. A finite sample bound for all the methods was proven in a unified framework that highlights their different theoretical properties. Finally, we analyzed the problem of multi-class classification from a vector-valued learning perspective, discussing the role played by the coding strategy and Bayes consistency.

One outcome of the experiments is that the kernels proposed so far seem to be useful in the context of multi-task regularization and for vector fields that satisfy the assumptions of the Helmholtz theorem, but are potentially unable to capture the functional relations describing general real-world vector-valued functions.

Future work will focus on the problem of defining new kernels for learning vector-valued functions and their role in exploiting the correlations among classes in multi-category classification.