# Focused multi-task learning in a Gaussian process framework


## Abstract

Multi-task learning, learning of a set of tasks together, can improve performance in the individual learning tasks. Gaussian process models have been applied to learning a set of tasks on different data sets, by constructing joint priors for functions underlying the tasks. In these previous Gaussian process models, the setting has been symmetric in the sense that all the tasks have been assumed to be equally important, whereas in settings such as transfer learning the goal is *asymmetric*, to enhance performance in a target task given the other tasks. We propose a focused Gaussian process model which introduces an “explaining away” model for each of the additional tasks to model their non-related variation, in order to focus the transfer to the task-of-interest. This focusing helps reduce the key problem of *negative transfer*, which may cause performance to even decrease if the tasks are not related closely enough. In experiments, our model improves performance compared to single-task learning, symmetric multi-task learning using hierarchical Dirichlet processes, transfer learning based on predictive structure learning, and symmetric multi-task learning with Gaussian processes.

## Keywords

Gaussian processes · Multi-task learning · Transfer learning · Negative transfer

## 1 Introduction

In classification and regression tasks it is common that there are too few training data available from the task of interest to learn a good model for the task. Trying to learn a flexible model with many parameters from few data may result in overlearning, where the model mistakes artifacts of the specific available samples for actual properties of the underlying distribution; alternatively, a simple model with few parameters might overlearn less but may also be unable to represent the properties of the distribution needed for good classification or regression performance. Learning the classification or regression model from the data of the current task alone is called *single-task learning*.

The problem of having too few data is particularly pressing in data-analysis settings characterised by the “small *n*, large *p*” problem of having a large dimensionality *p* and small sample size *n*. In this paper we will use functional neuroimaging as one of the case studies, and in functional Magnetic Resonance Imaging (fMRI) in particular the number of volume elements (voxels) *p* in which brain activity is measured is huge. Another well-known example of “small *n*, large *p*” conditions is genome-wide measurements of gene expression or other cellular data, where it may be of interest to measure a large number of variables *p* (for instance genes) in parallel. In these applications the number of samples *n* (the number of stimuli per subject in an fMRI study or the number of biological samples in a gene expression study) is small because of the cost of one measurement, or availability of relevant subjects or samples. In patient studies of a brain disorder, for instance, there are practical limitations on how many patients can be accessed and measured, and in experimental neuroscience the problem is that the larger the number of replications and variants needed, the less new neuroscience can be done per measurement.

### Gaining more data from related tasks

The few training data available from the task of interest are *representative* in the sense that they are typically assumed to come from the same distribution as future test data. Even though there are few representative data available from the task of interest, there may be more data available from other *potentially related tasks*. The distribution of data in these other tasks is not the same as in the task of interest, but may be similar to it. If the distributions in several tasks are similar to each other, it may be possible to use the data from the other tasks to help learn each individual task. It has been shown that transferring knowledge between several potentially related learning tasks can improve performance. This scenario, termed *multi-task learning* (Caruana 1997) or *transfer learning* (Thrun 1996), has gained considerable attention in the machine learning community in recent years (see Pan and Yang 2010 for a recent review).

Transfer has usually been studied for regression or classification tasks; multi-task regression and classification scenarios arise in several application domains. For example, if the task of interest is to classify gene expression profiles of patients as having a particular cancer type, then other data sets of patients having other cancer types can be related tasks. If the task is to classify scientific articles from a conference into subject categories, then classifying articles from related conferences can be used as related tasks. In this paper we study a classification task in which the goal is to predict the stimulus given brain measurements of a certain user, utilising the measurements of other users on the same and different stimuli as related tasks; in this setting a multi-task learning setup is useful because, when generalising across subjects, the brain physiology and function are sufficiently similar that different brains can be matched, but the matching is only approximate.

### Multi-task learning by hierarchical modelling

Learning from several tasks is often done by constructing a *hierarchical model* over all tasks, where model parameters within each task are related to the corresponding parameters in the other tasks through an upper-level prior distribution. For example, if there are several related linear regression tasks, they can be learned together by assuming their regression weights are drawn from a common upper-level prior, which effectively constrains the weights to be similar across the tasks. Data from all tasks then affect the learning of the upper-level prior parameters, which in turn affect the learning of the parameters within each task; this effectively provides additional indirect evidence for learning the parameters of each task. Parameters in each task may then be learned closer to their actual underlying values, or in Bayesian learning may be inferred with less remaining uncertainty. This kind of learning process across tasks is sometimes called *sharing statistical strength* (Teh et al. 2005, 2006). Sharing statistical strength between tasks can potentially compensate for having very few samples in the desired learning task, and can make the inference more robust to noise.

### Negative transfer

Learning several tasks together may not always be beneficial. Both single-task and multi-task learning can suffer from misspecifying the model within the tasks; in multi-task settings, however, there is an additional potential danger: in order to enable learning from data across the tasks, assumptions must be made about the possible relationships between the models of the tasks. Transfer of knowledge between different tasks is useful only when the tasks are related; misspecifying the possible kinds of relationships between model parameters across tasks, or assuming relationships to be likely where they are unlikely in reality, can distort the model learned for a target task rather than providing additional statistical strength. The phenomenon where other tasks provided to help learning actually end up hurting it is called *negative transfer* (see, e.g., Rosenstein et al. 2005).

Negative transfer can happen in particular if the distributions in some of the other tasks are in reality not similar to a task of interest: then learning the tasks together in a hierarchical fashion, and assuming the parameters to be similar, can harmfully constrain the parameters for the task of interest. The precise effect of the negative transfer depends on the assumed task relationships and the model families used within the tasks. The harmful effect will typically be strongest for inputs where observations from the other tasks strongly outnumber those from the task of interest. For such inputs, the model learned for the task of interest may mistakenly predict outputs to be similar to the other tasks.

A crucial part of multi-task learning algorithms lies in the modelling of task relatedness, through the specification and the learning of the dependency structure between tasks. We discuss learning the dependency structure in an asymmetric setting: we propose a method that aims to avoid negative transfer by a flexible dependency structure where the dependency can be different from a task of interest to each other task, and even to different parts of the other tasks.

### 1.1 Symmetric and asymmetric multi-task learning

In general, existing multi-task learning approaches use a symmetric dependency structure between tasks. This type of set-up, which we term *symmetric multi-task learning*, assumes that all tasks are of equal importance. The set of related tasks is learned jointly, with the aim of improving over learning the tasks separately (the *no transfer* case), averaged over all tasks.

However, a common learning scenario is to learn a specific task (the *primary task*) while incorporating knowledge learned through other similar tasks (*secondary tasks*). An asymmetric scenario is natural especially when future test data will come only from the task of interest; for example, the term *transfer learning* is often used to denote a setting where several tasks have been learned at an earlier time and their knowledge is transferred to help a new task at hand. In one transfer learning setting, the task of interest may be to classify scientific papers for the current and next year of a particular conference, whereas the data of the related other tasks may be documents from earlier years of the conference, when the topics prevalent in the papers were different, and from other conferences, where even the scopes of the conferences differ. When classifying the presence of a particular disease in patients based on gene expression, historical data sets from related other diseases may be used as related tasks, but the aim is to learn to classify the disease of interest rather than the other diseases, which are only used as sources of related information. In the neuroscience scenario that we use as a case study in this paper, we are interested in learning about a specific patient’s response to a stimulus, but we can transfer information from other patients’ responses to related stimuli to improve learning. The data of the other patients may be historical measurements from persons who are not currently participating in the neuroscience study. An asymmetric setting can also arise in multi-task learning when tasks are learned simultaneously but one of them is more interesting than the others: for example, in a neuroscience scenario one task may be to detect an interesting stimulus based on the brain response, whereas other tasks may be to detect ordinary stimuli.

The asymmetric learning setting requires the assumption of an asymmetric dependency structure between tasks. Existing approaches include reweighting-based methods (Wu and Dietterich 2004; Bickel et al. 2008, 2009) or learning of shared feature spaces. An alternative has been to, in effect, use a symmetric multi-task learning method in an asymmetric mode, by using the model learned from auxiliary tasks as a prior for the target task (Marx et al. 2005; Raina et al. 2005; Xue et al. 2007).

Inspired by the Gaussian process (GP) models used earlier for symmetric multi-task learning, we propose a novel and simple dependency structure for asymmetric multi-task learning with GPs. The model focuses on learning a target task and learns to avoid negative transfer; this can be done conveniently in the GP formulation by adding task-specific processes which “explain away” irrelevant properties. At the same time, the flexibility of the GP framework is preserved.

This paper extends our conference paper (Leen et al. 2011). This journal version adds a new study of the tolerance of our focused multi-task learning model against irrelevant functions that are shared between the secondary tasks but not with the primary task, a new comparison between our model and a symmetric multi-task learning method based on Gaussian processes, and a new extension of our model to allow several shared Gaussian process functions.

## 2 Dependency structure in multi-task learning with Gaussian processes

Supervised learning tasks such as classification and regression can be viewed as function approximation problems given the task inputs and targets; accordingly, multi-task learning can be viewed as learning multiple related functions. The Gaussian process (GP) framework provides a principled and flexible approach for constructing priors over functions.

In brief, a GP is a prior over input-output functions that does not restrict outputs to a particular parametric function of input coordinates (such as a sinusoid or a polynomial); instead, for any fixed set of input points the prior for the corresponding set of outputs is represented as a multidimensional Gaussian distribution. The GP prior over the whole input-output function is then specified by a mean function and a covariance function; often a zero-mean prior is used. Typical covariance functions such as the squared exponential covariance function specify that nearby inputs should *a priori* have strongly related outputs. Given the GP prior and a set of observed input-output samples, Bayesian inference is used to infer the posterior over the possible underlying functions. If the observation model is Gaussian the inference of the posterior can be done analytically. The inferred posterior can be used for example to predict output values. Even though the mean and covariance function that specify the GP prior can be fairly simple functions, the inferred posterior can very flexibly represent complicated functions.
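As a concrete illustration of the inference described above, the following sketch computes the posterior mean of a zero-mean GP with a squared exponential covariance and Gaussian observation noise. The lengthscale, noise variance, and toy sine data are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sq_exp(X1, X2, lengthscale=1.0):
    # Squared exponential covariance: k(x, x') = exp(-||x - x'||^2 / (2 l^2))
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior_mean(X, y, X_star, noise_var=0.1):
    # Posterior mean of a zero-mean GP with Gaussian observation noise:
    #   E[f(X*) | y] = k(X*, X) (k(X, X) + sigma^2 I)^{-1} y
    K = sq_exp(X, X) + noise_var * np.eye(len(X))
    return sq_exp(X_star, X) @ np.linalg.solve(K, y)

# Toy data: sine observations; the posterior mean tracks them closely.
X = np.linspace(0.0, 2.0 * np.pi, 20)[:, None]
y = np.sin(X[:, 0])
mean = gp_posterior_mean(X, y, X)
print(float(np.abs(mean - y).max()))
```

Because the observation model is Gaussian, this posterior mean is available in closed form, exactly as stated above.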

The GP framework has subsequently been applied successfully to multi-task learning problems (Yu and Tresp 2005; Bonilla et al. 2008; Alvarez and Lawrence 2009). A crucial element of these models is the way in which the dependency structure between the multiple functions is encoded through the construction of the covariance function. However, current GP approaches do not address the problem of asymmetric multi-task learning, and only consider symmetric dependency structures, which we review in the following subsection.

### 2.1 Symmetric dependency structure

Assume there is a set of *N* distinct inputs, \(\mathbf{X} = [\mathbf{x}_{1},\ldots,\mathbf{x}_{N}]^{\top}\), and *M* tasks, such that \(y^{t}_{i}\) is the target for input *i* in task *t*. We denote the vector of outputs for task *t* as \(\mathbf{y}^{t} = [y_{1}^{t},\ldots,y_{N}^{t}]^{\top}\), and the length-*NM* vector of outputs for all *M* tasks as \(\mathbf{y} = [(\mathbf{y}^{1})^{\top},\ldots,(\mathbf{y}^{M})^{\top}]^{\top}\). Here we consider a set of tasks which all have the same inputs, for ease of notation, but the problem setting can easily be generalised to different inputs for each task. In the GP approach to the problem, it is assumed that there is a latent function underlying each task, \(f^{1},\ldots,f^{M}\). Denoting the latent function evaluated at input *i* for task *t* as \(f^{t}(\mathbf{x}_{i})\), a (zero-mean) GP prior is defined over the latent functions, with a covariance function of the form

\[
\bigl\langle f^{t}(\mathbf{x})\, f^{t'}(\mathbf{x}')\bigr\rangle = k^{T}(t,t')\, k^{x}(\mathbf{x},\mathbf{x}')
\tag{1}
\]

where 〈⋅〉 denotes expectation and \(\langle f^{t}(\mathbf{x}) f^{t'}(\mathbf{x}')\rangle\) is the usual definition of a covariance function: the expectation of the product of the two outputs \(f^{t}(\mathbf{x})\) and \(f^{t'}(\mathbf{x}')\), taken over the Gaussian process prior. On the right-hand side, \(k^{T}\) is a covariance function over tasks, specifying the intertask similarities, and \(k^{x}\) is a covariance function over inputs. For regression tasks, the observation model is \(y_{i}^{t} \sim\mathcal{N}(f^{t}(\mathbf{x}_{i}),\sigma^{2}_{t})\), where \(\sigma^{2}_{t}\) is the noise variance in task *t*.

The covariance function *k* ^{ x } over inputs can be any typical function used in Gaussian processes, such as a radial basis function \(k^{x}(\mathbf{x},\mathbf{x}')\propto \exp(-\|\mathbf{x}-\mathbf{x}'\|_{U}^{2}/2)\) where *U* denotes a metric, usually a Euclidean or Mahalanobis metric. In the experiments we use such a radial basis function, detailed in Sect. 5.

Bonilla et al. (2008) define *k* ^{ T } as a “free-form” covariance function, where \(k^{T}(i,j)= \mathbf{K}^{T}_{i,j}\) indexes a positive semidefinite intertask similarity matrix **K** ^{ T }. Other methods such as that of Yu et al. (2007) have included a parameterised similarity matrix over task descriptor features, but this could be restrictive in modelling similarities between tasks. These types of priors essentially assume that each of the task latent functions is a linear combination of a further set of latent functions, known as intrinsic correlation models in the geostatistics field (see, e.g., Wackernagel 1994). This idea was further generalised by Alvarez and Lawrence (2009) to generating the task latent functions by convolving a further set of latent functions with smoothing kernel functions.
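A minimal sketch of the “free-form” intertask covariance idea: parameterising the task matrix as \(\mathbf{L}\mathbf{L}^{\top}\) keeps it positive semidefinite by construction, and the joint prior covariance over all latent values is its Kronecker product with the input covariance. The particular factor, inputs, and sizes below are illustrative assumptions:

```python
import numpy as np

def sq_exp(X1, X2):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

# "Free-form" intertask similarity: any positive semidefinite M x M matrix.
# Parameterising it as K_T = L L^T keeps it PSD by construction.
L = np.array([[1.0, 0.0, 0.0],
              [0.8, 0.6, 0.0],
              [0.2, 0.1, 0.9]])
K_T = L @ L.T                       # task covariance (M x M)

X = np.random.default_rng(0).normal(size=(10, 2))
K_x = sq_exp(X, X)                  # input covariance (N x N)

# Joint prior covariance over all M*N latent function values.
K = np.kron(K_T, K_x)
print(K.shape, np.linalg.eigvalsh(K).min() >= -1e-9)
```

Since the Kronecker product of two positive semidefinite matrices is positive semidefinite, the joint covariance is a valid GP prior covariance.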

### 2.2 Predictive mean for symmetric multi-task GP

The mean prediction for a test input \(\mathbf{x}_{*}\) in task *j*, for the multi-task GP formulation of Bonilla et al. (2008), is given by

\[
\bar{f}(\mathbf{x}_{*}, j) = \bigl(\mathbf{k}_{j}^{T} \otimes \mathbf{k}_{*}^{x}\bigr)^{\top} \boldsymbol{\varSigma}^{-1} \mathbf{y},
\qquad
\boldsymbol{\varSigma} = \mathbf{K}^{T} \otimes k^{x}(\mathbf{X},\mathbf{X}) + \mathbf{D} \otimes \mathbf{I}
\tag{2}
\]

where \(\mathbf{k}_{j}^{T}\) is the *j*th column of the task similarity matrix \(\mathbf{K}^{T}\), ⊗ is the Kronecker product, and \(\mathbf{k}_{*}^{x} = [k(\mathbf{x}_{*},\mathbf{x}_{1}),\ldots,k(\mathbf{x}_{*},\mathbf{x}_{N})]^{\top}\) is the vector of covariances between the test input \(\mathbf{x}_{*}\) and the training inputs. Here \(k^{x}(\mathbf{X},\mathbf{X})\) is the matrix of covariance function values between all training input points, and \(\mathbf{D}\) is an *M*×*M* diagonal matrix whose (*t*,*t*)th element is \(\sigma^{2}_{t}\).

To interpret the prediction, define the length-*MN* vector \(\mathbf{w} = \boldsymbol{\varSigma}^{-1}\mathbf{y}\), and divide it into *M* blocks of *N* elements: \(\mathbf{w} = [\mathbf{w}_{1}^{\top},\ldots,\mathbf{w}_{M}^{\top}]^{\top}\). We can then rewrite (2) as

\[
\bar{f}(\mathbf{x}_{*}, j) = \sum_{m=1}^{M} \mathbf{K}^{T}_{m,j}\, \bigl(\mathbf{k}_{*}^{x}\bigr)^{\top} \mathbf{w}_{m} = \sum_{m=1}^{M} \mathbf{K}^{T}_{m,j}\, \mu_{*}^{m}
\tag{3}
\]

where \(\mu_{*}^{m} = (\mathbf{k}_{*}^{x})^{\top}\mathbf{w}_{m}\) can be interpreted as the posterior mean of the latent function at \(\mathbf{x}_{*}\) for task *m*; thus (2) is a weighted sum of posterior means for all tasks, and the weights \(\{\mathbf{K}^{T}_{m,j}\}_{m=1}^{M}\) are covariances between task *j* and all tasks. Since \(\mathbf{K}^{T}\) is positive semidefinite, the sharing of information between tasks is naturally symmetric, and all tasks are treated equally. However, we are interested in an asymmetric setup, where we learn a primary task together with several secondary tasks. Rather than modelling the relationships between secondary tasks, we want to focus on the aspects relevant to learning the primary task.
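The equivalence between the direct predictive mean (2) and its interpretation as a weighted sum of per-task posterior means can be checked numerically; all matrices and data below are synthetic illustrations:

```python
import numpy as np

def sq_exp(X1, X2):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

rng = np.random.default_rng(1)
M, N = 3, 8
X = rng.normal(size=(N, 1))
y = rng.normal(size=M * N)            # stacked outputs for all M tasks
L = rng.normal(size=(M, M))
K_T = L @ L.T                         # task similarity matrix
D = np.diag([0.1, 0.2, 0.3])          # per-task noise variances
Sigma = np.kron(K_T, sq_exp(X, X)) + np.kron(D, np.eye(N))

x_star = np.array([[0.5]])
k_star = sq_exp(x_star, X)[0]         # covariances to the training inputs
j = 0                                 # task we predict for

# Direct form of the predictive mean: (k_j^T kron k_*^x)^T Sigma^{-1} y
direct = np.kron(K_T[:, j], k_star) @ np.linalg.solve(Sigma, y)

# Weighted sum of per-task posterior means: w = Sigma^{-1} y split into blocks
w = np.linalg.solve(Sigma, y).reshape(M, N)
weighted = sum(K_T[m, j] * (k_star @ w[m]) for m in range(M))
print(np.allclose(direct, weighted))
```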

### 2.3 Asymmetric dependency structure

In the symmetric learning problem above, the tasks were modelled as conditionally independent given a set of *M* (i.i.d.) underlying functions, which capture the shared structure between all tasks. In this section, we derive an asymmetric version of the GP framework for multi-task learning by constraining the secondary tasks to be conditionally independent given the primary task, such that the shared structure between all the secondary tasks is due to the primary task function.

Similarly to the previous notation, let us denote the inputs to each task as **X**. Suppose that there is one primary task, with targets \(\mathbf{y}^{p} = [y_{1}^{p},\ldots,y_{N}^{p}]^{\top}\), with underlying latent function values **f** ^{ p }=[*f* ^{ p }(**x** _{1}),…,*f* ^{ p }(**x** _{ N })]^{⊤}. Suppose there are *M*−1 secondary tasks, where the targets for the *i*th secondary task are denoted by \(\mathbf{y}^{s_{i}} = [y_{1}^{s_{i}},\ldots,y_{N}^{s_{i}}]^{\top}\). The corresponding latent function values are \(\mathbf{f}^{\,s_{i}} = [f^{s_{i}}(\mathbf{x}_{1}),\ldots,f^{s_{i}}(\mathbf{x}_{N})]^{\top}\).

Our interest is in learning the underlying function \(f^{p}\) for the primary task. Here, potentially related secondary tasks can help to learn \(f^{p}\); conversely, if we knew \(f^{p}\), this could help to learn the functions underlying the secondary tasks \(\{f^{s_{i}}\}\). First we define a joint prior over the primary and secondary task function values. We start by making the assumption that the secondary task functions \(\{f^{s_{i}}\}\) can be decomposed into a “shared” component (which is shared with the primary task) and a “specific” component. That is, for the *n*th input,

\[
f^{s_{i}}(\mathbf{x}_{n}) = f^{s_{i},\mathrm{shared}}(\mathbf{x}_{n}) + f^{s_{i},\mathrm{specific}}(\mathbf{x}_{n}).
\tag{4}
\]

Further we assume that \(f^{s_{i},\mathrm{shared}} = \rho_{s_{i}} f^{p}\), that is, the shared component is correlated with the primary task function. This may seem like a restrictive assumption, but assuming linear relationships between task functions has proved successful, e.g., for Wackernagel (1994) and Bonilla et al. (2008). Now we can place a shared prior over each \(f^{s_{i},\mathrm{shared}}\) and \(f^{p}\). The corresponding graphical model is presented in Fig. 1.

#### 2.3.1 Sharing between primary and secondary task functions

Since the functions in the secondary tasks are composed of a shared and specific component as shown in (4), we can define the covariance function separately for both types of components. We first discuss the covariance between the primary function *f* ^{ p } and the shared components \(f^{s_{i},\mathrm{shared}}\) of the secondary tasks *s* _{ i }.

Let *t* and *t*′ be indices of two tasks, each of which can be the primary task or one of the secondary tasks. We place a zero-mean Gaussian process prior on \(f^{p}\), with covariance function \(k^{p}\), such that the prior on the shared function is also a GP, with covariance function

\[
\bigl\langle f^{t}(\mathbf{x})\, f^{t'}(\mathbf{x}')\bigr\rangle = \rho_{t}\,\rho_{t'}\, k^{p}(\mathbf{x},\mathbf{x}')
\tag{5}
\]

where \(\rho_{t}\) is the correlation of task *t* with the primary task, \(\rho_{p}=1\), and \(f^{t}\) can denote either the primary task function or the shared component in any of the secondary tasks. Denoting the shared components of the task functions for the *M*−1 secondary tasks as \(\mathbf{f}^{\,s,\mathrm{shared}} = [(\mathbf{f}^{\,s_{1},\mathrm{shared}})^{\top},\ldots,(\mathbf{f}^{\,s_{M-1},\mathrm{shared}})^{\top}]^{\top}\), the joint distribution over the shared function values is given by

\[
p\bigl(\mathbf{f}^{p}, \mathbf{f}^{\,s,\mathrm{shared}}\bigr)
= \mathcal{GP}\!\left(
\begin{bmatrix}\mathbf{f}^{p}\\ \mathbf{f}^{\,s,\mathrm{shared}}\end{bmatrix};\,
\mathbf{0},\,
\begin{bmatrix}\mathbf{K}_{pp} & \mathbf{K}_{sp}^{\top}\\ \mathbf{K}_{sp} & \mathbf{K}_{ss}\end{bmatrix}
\right)
\tag{6}
\]

where the expression on the right-hand side is of the form \(\mathcal{GP}(\mathbf{f}; 0, \mathbf{K})\), which denotes a Gaussian process prior with mean 0 and covariance matrix **K** evaluated at function values **f**. Here \(\mathbf{K}_{pp}\) is the matrix of covariance function values from (5) between the primary task points, \(\mathbf{K}_{sp}\) is evaluated between secondary and primary task inputs, and \(\mathbf{K}_{ss}\) between secondary task inputs; the matrices \(\mathbf{K}_{sp}\) and \(\mathbf{K}_{ss}\) represent variation due to the shared components in the secondary tasks.

#### 2.3.2 Explaining away secondary task-specific variation

We next treat the covariance between the specific components of the secondary tasks, and then put the types of covariances together to form the total covariance between the tasks.

We assume that the specific component of each secondary task is drawn from its own zero-mean GP, independently across tasks, so that the covariance matrix over the specific components of all secondary tasks, \(\mathbf{K}^{\mathrm{spec}}\), is block-diagonal. The covariance functions \(\mathbf{K}^{\mathrm{spec}}_{s_{i}}\) have parameters specific to each secondary task \(s_{i}\), and the specific functions over all secondary tasks are then drawn as \(\mathbf{f}^{\,s,\mathrm{specific}} \sim \mathcal{GP}(0,\mathbf{K}^{\mathrm{spec}})\). This creates flexible models for the secondary tasks, which can “explain away” variation that is specific to a secondary task and unshared with the primary task. The full secondary task functions are then generated according to (4) as \(\mathbf{f}^{\,s} = \mathbf{f}^{\,s,\mathrm{shared}} + \mathbf{f}^{\,s,\mathrm{specific}}\); since the shared components are independent of the specific components, the covariance of the full functions is simply the sum of the covariances of the shared and specific components. The model, which we call the focused GP-multitask learning model, takes the form

\[
p\bigl(\mathbf{f}^{p}, \mathbf{f}^{\,s}\bigr)
= \mathcal{GP}\!\left(
\begin{bmatrix}\mathbf{f}^{p}\\ \mathbf{f}^{\,s}\end{bmatrix};\,
\mathbf{0},\,
\begin{bmatrix}\mathbf{K}_{pp} & \mathbf{K}_{sp}^{\top}\\ \mathbf{K}_{sp} & \mathbf{K}_{ss} + \mathbf{K}^{\mathrm{spec}}\end{bmatrix}
\right).
\tag{7}
\]
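A sketch of assembling the covariance of (7) for one primary and two secondary tasks: the shared part has the rank-one task structure \(\rho_t \rho_{t'} k^{p}\), and the specific part is block-diagonal over the secondary tasks. The correlations and task-specific lengthscales below are illustrative assumptions:

```python
import numpy as np

def sq_exp(X1, X2, ls=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(2)
N = 6
X = rng.normal(size=(N, 1))            # same inputs for all tasks
K_p = sq_exp(X, X)                     # primary-task covariance k^p

# Correlations with the primary task: rho_p = 1, then two secondary tasks.
rho = np.array([1.0, 0.9, -0.4])
K_shared = np.kron(np.outer(rho, rho), K_p)   # shared-component covariance

# Block-diagonal "explaining away" covariance over the secondary tasks,
# with task-specific lengthscales (illustrative values).
K_spec = np.zeros_like(K_shared)
for i, ls in enumerate([0.5, 2.0]):
    blk = slice((i + 1) * N, (i + 2) * N)
    K_spec[blk, blk] = sq_exp(X, X, ls=ls)

K_full = K_shared + K_spec             # total covariance of the focused model
print(K_full.shape, np.linalg.eigvalsh(K_full).min() >= -1e-9)
```

Both summands are positive semidefinite, so the total covariance is a valid GP prior covariance.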

### 2.4 Sparse approximation to the focused GP-multitask learning model

Learning the hyperparameters of the covariance functions in (7) is computationally expensive, since it involves inverting the full covariance matrix of the Gaussian process prior across all points in all tasks, a matrix of size *MN*×*MN*. Inverting such a matrix takes *O*(*M*^{3} *N*^{3}) time in the general case; however, if the matrix has a sufficiently simple structure, the inversion can be computed faster. In this section, we derive an approximation to this covariance matrix based on assumptions about the sharing between secondary and primary tasks. The idea is to approximate the matrix, preserving the main part of our intended asymmetric multi-task dependency structure, but simplifying it enough so that we can apply the *Woodbury identity* to compute the inversion. Note that this approximation is not crucial to our method: if there are few enough data to compute the full inverse, the approximation can be omitted.

The inverse of the full covariance matrix can be written with the standard block-matrix inversion formula,

\[
\begin{bmatrix}\mathbf{A} & \mathbf{B}\\ \mathbf{C} & \mathbf{D}\end{bmatrix}^{-1}
=
\begin{bmatrix}
(\mathbf{A}-\mathbf{B}\mathbf{D}^{-1}\mathbf{C})^{-1} & -(\mathbf{A}-\mathbf{B}\mathbf{D}^{-1}\mathbf{C})^{-1}\mathbf{B}\mathbf{D}^{-1}\\
-\mathbf{D}^{-1}\mathbf{C}(\mathbf{A}-\mathbf{B}\mathbf{D}^{-1}\mathbf{C})^{-1} & \mathbf{D}^{-1}+\mathbf{D}^{-1}\mathbf{C}(\mathbf{A}-\mathbf{B}\mathbf{D}^{-1}\mathbf{C})^{-1}\mathbf{B}\mathbf{D}^{-1}
\end{bmatrix}
\tag{8}
\]

where **A** corresponds to the covariance block within the primary task, **D** corresponds to the covariance block between the secondary tasks, and **B** and **C** are the cross-covariance blocks from the primary task to the secondary tasks. On the right-hand side, the only large matrix that needs to be inverted is **D**, which corresponds to the covariance between secondary tasks; we must find an approximation for this covariance that will be efficient to invert.

The value of the joint prior of \(\mathbf{f}^{p}\) and \(\mathbf{f}^{\,s}\) can be evaluated as the product of the value of a conditional prior and the value of a marginal prior, so that \(p(\mathbf{f}^{p},\mathbf{f}^{\,s}) = p(\mathbf{f}^{\,s}\mid\mathbf{f}^{p})\,p(\mathbf{f}^{p})\). In particular, if the GP prior is of the form in (7), then by standard Gaussian identities we can write it in the equivalent form

\[
p\bigl(\mathbf{f}^{p},\mathbf{f}^{\,s}\bigr)
= \mathcal{GP}\bigl(\mathbf{f}^{\,s};\; \mathbf{K}_{sp}\mathbf{K}_{pp}^{-1}\mathbf{f}^{p},\; \boldsymbol{\varLambda} + \mathbf{K}^{\mathrm{spec}}\bigr)\;
\mathcal{GP}\bigl(\mathbf{f}^{p};\; 0,\; \mathbf{K}_{pp}\bigr)
\tag{9}
\]

where \(\boldsymbol{\varLambda}=\mathbf{K}_{ss} -\mathbf{K}_{sp}\mathbf{K}_{pp}^{-1}\mathbf{K}_{sp}^{\top}\). The first GP term on the right-hand side, \(\mathcal{GP}(\mathbf{f}^{\,s};\, \mathbf{K}_{sp}\mathbf{K}_{pp}^{-1}\mathbf{f}^{p},\, \boldsymbol{\varLambda} + \mathbf{K}^{\mathrm{spec}})\), is the GP predictive likelihood of the secondary task function values after training on the primary task. The second GP term is simply the marginal prior \(p(\mathbf{f}^{p})\) in the primary task. We will now approximate the first term by a simpler form, by approximating \(\boldsymbol{\varLambda}\) with a diagonal matrix \(\bar{\boldsymbol{\varLambda}}\): we simply set the diagonal of \(\bar{\boldsymbol{\varLambda}}\) to the diagonal of \(\mathbf{K}_{ss} -\mathbf{K}_{sp}\mathbf{K}_{pp}^{-1}\mathbf{K}_{sp}^{\top}\), and set the off-diagonal elements of \(\bar{\boldsymbol{\varLambda}}\) to zero. Then the total covariance matrix \(\bar{\boldsymbol{\varLambda}} + \mathbf{K}^{\mathrm{spec}}\) in the conditional prior \(p(\mathbf{f}^{\,s}\mid\mathbf{f}^{p})\) is a simple block-diagonal matrix.

Given the approximation, we can compute the resulting joint prior over \(\mathbf{f}^{p}\) and \(\mathbf{f}^{\,s}\), similar to (7) but with the approximation taken into account. To do this we must simply recompute the marginal covariance matrix of \(\mathbf{f}^{\,s}\), through the integral \(p(\mathbf{f}^{\,s}) = \int_{\mathbf{f}^{p}} p(\mathbf{f}^{\,s}|\mathbf{f}^{p})p(\mathbf{f}^{p})\, d\mathbf{f}^{p}\). This yields

\[
p\bigl(\mathbf{f}^{\,s}\bigr) = \mathcal{GP}\bigl(\mathbf{f}^{\,s};\; 0,\; \mathbf{K}_{sp}\mathbf{K}_{pp}^{-1}\mathbf{K}_{sp}^{\top} + \bar{\boldsymbol{\varLambda}} + \mathbf{K}^{\mathrm{spec}}\bigr),
\tag{10}
\]

which yields the final prior on all the task functions as

\[
p\bigl(\mathbf{f}^{p},\mathbf{f}^{\,s}\bigr)
= \mathcal{GP}\!\left(
\begin{bmatrix}\mathbf{f}^{p}\\ \mathbf{f}^{\,s}\end{bmatrix};\,
\mathbf{0},\,
\begin{bmatrix}\mathbf{K}_{pp} & \mathbf{K}_{sp}^{\top}\\ \mathbf{K}_{sp} & \mathbf{K}_{sp}\mathbf{K}_{pp}^{-1}\mathbf{K}_{sp}^{\top} + \bar{\boldsymbol{\varLambda}} + \mathbf{K}^{\mathrm{spec}}\end{bmatrix}
\right).
\tag{11}
\]

Since the above prior uses a reduced-rank approximation to the covariance matrix (the rank equals the number of primary task inputs), we can use the Woodbury identity to efficiently calculate the inverse and the determinant. In particular, by the Woodbury identity the inverse of the bottom-right block is

\[
\bigl(\mathbf{K}_{sp}\mathbf{K}_{pp}^{-1}\mathbf{K}_{sp}^{\top} + \bar{\boldsymbol{\varLambda}} + \mathbf{K}^{\mathrm{spec}}\bigr)^{-1}
= (\bar{\boldsymbol{\varLambda}} + \mathbf{K}^{\mathrm{spec}})^{-1}
- (\bar{\boldsymbol{\varLambda}} + \mathbf{K}^{\mathrm{spec}})^{-1}\mathbf{K}_{sp}
\bigl(\mathbf{K}_{pp} + \mathbf{K}_{sp}^{\top}(\bar{\boldsymbol{\varLambda}} + \mathbf{K}^{\mathrm{spec}})^{-1}\mathbf{K}_{sp}\bigr)^{-1}
\mathbf{K}_{sp}^{\top}(\bar{\boldsymbol{\varLambda}} + \mathbf{K}^{\mathrm{spec}})^{-1}
\tag{12}
\]

where \(\bar{\boldsymbol{\varLambda}} + \mathbf{K}^{\mathrm{spec}}\) is block-diagonal and the term inside the brackets is a small matrix of the same size as \(\mathbf{K}_{pp}\), having the same number of rows as there are samples in the primary task. This inverse can then be inserted into the general block-matrix inverse equation to yield the inverse of the complete covariance matrix. A similar efficient computation can be done for the determinant. We call this a “sparse” approximation because several entries of \(\boldsymbol{\varLambda}\) were approximated as zero to reduce the rank of the full covariance matrix; note, however, that the reduced-rank full matrix in (11) itself remains non-sparse.
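The saving from the Woodbury inversion above can be checked numerically. Here a plain diagonal matrix stands in for the block-diagonal \(\bar{\boldsymbol{\varLambda}} + \mathbf{K}^{\mathrm{spec}}\) for brevity; only it and a primary-sized matrix are inverted, yet the result matches the direct inverse. All matrix sizes and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
Np, Ns = 5, 40                          # primary inputs, total secondary inputs
A = rng.normal(size=(Np, Np))
K_pp = A @ A.T + 0.1 * np.eye(Np)       # primary-task covariance (well conditioned)
K_sp = rng.normal(size=(Ns, Np))        # cross-covariance stand-in
V = np.diag(rng.uniform(0.5, 1.5, Ns))  # stands in for Lambda_bar + K_spec

# Bottom-right block of the approximate prior covariance.
S = K_sp @ np.linalg.solve(K_pp, K_sp.T) + V

# Woodbury inverse: only V (cheap, block-diagonal) and an Np x Np
# matrix are inverted, instead of the full Ns x Ns matrix S.
V_inv = np.linalg.inv(V)
small = K_pp + K_sp.T @ V_inv @ K_sp    # Np x Np
S_inv = V_inv - V_inv @ K_sp @ np.linalg.solve(small, K_sp.T @ V_inv)

print(np.allclose(S_inv, np.linalg.inv(S)))
```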

#### 2.4.1 Influence of the primary observations on secondary task predictions

The conditional prior in (9) shows that the secondary task functions are modelled as predictions of \(f^{p}\) (the primary task function) after observing the primary task function values \(\mathbf{f}^{p}\), evaluated at all the secondary task inputs, with an added “specific” component modelled by \(\mathbf{K}^{\mathrm{spec}}\). The mean prediction \(\mathbf{K}_{sp}\mathbf{K}_{pp}^{-1}\mathbf{f}^{p}\) is similar to a standard GP predictive equation, with the difference that, according to the definition of \(\mathbf{K}_{sp}\) in (5), the posterior mean for each secondary task *s* is weighted by \(\rho_{s}\), which models the correlation with the primary task. To illustrate this, for secondary task *l*, the posterior mean \(\bar{\mathbf{f}}^{\,l,\mathrm{shared}}\) given \(\mathbf{f}^{p}\) becomes

\[
\bar{\mathbf{f}}^{\,l,\mathrm{shared}} = \rho_{l}\, k^{p}(\mathbf{X}_{l},\mathbf{X}_{p})\, k^{p}(\mathbf{X}_{p},\mathbf{X}_{p})^{-1}\, \mathbf{f}^{p} = \rho_{l}\, \boldsymbol{\mu}_{l}^{p}
\tag{13}
\]

where we have used the notation: \(\mathbf{X}_{i}\) is the set of input points for task *i*, and \(\boldsymbol{\mu}_{l}^{p}\) is the posterior mean given covariance function \(k^{p}\) and observations \(\mathbf{f}^{p}\), evaluated at \(\mathbf{X}_{l}\). Controlling \(\rho_{s}\) therefore directly controls the amount of influence the primary task predictions have on predictions in the secondary tasks, and hence the amount of influence the secondary task observations have on learning the primary task function. Learning \(\rho_{s}\) during training can help to avoid negative transfer from the secondary tasks to the primary task.

### 2.5 Hyperparameter learning

We next describe how the hyperparameters of the model are learned from observed data: inputs **x** and targets *y*. The hyperparameters needed for the covariance functions depend on the form of the covariance function; in the experiments we use a squared exponential form described in Sect. 5. For regression, the observation model is \(y_{i}^{t} \sim \mathcal{N}(f^{t}(\mathbf{x}_{i}),\sigma^{2}_{t})\), where \(\sigma^{2}_{t}\) is the noise variance in task *t*. The log marginal likelihood of the observed data has the same form as usual in Gaussian process regression: denoting the vector of observed outputs over all tasks as **y** and the corresponding matrix of inputs as **X**, we have

\[
\log p(\mathbf{y}\mid\mathbf{X}) = -\tfrac{1}{2}\,\mathbf{y}^{\top}\boldsymbol{\varSigma}^{-1}\mathbf{y} - \tfrac{1}{2}\log\lvert\boldsymbol{\varSigma}\rvert - \tfrac{n}{2}\log 2\pi
\tag{14}
\]

where *n* is the total number of observations over all tasks, \(\boldsymbol{\varSigma} = \mathbf{K} + \boldsymbol{\varSigma}_{\mathrm{noise}}\), **K** is the covariance matrix on the right-hand side of (11), and \(\boldsymbol{\varSigma}_{\mathrm{noise}}\) is a diagonal matrix whose diagonal entries corresponding to task *t* are \(\sigma^{2}_{t}\). If all tasks have the same number of inputs, then \(\boldsymbol{\varSigma}_{\mathrm{noise}}\) is the same as **D**⊗**I** in (2). The hyperparameters can be learned by optimising the above marginal log-likelihood with respect to the hyperparameters, which can be done by gradient methods (here we used standard conjugate gradient optimisation).
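A sketch of evaluating the log marginal likelihood above for a given covariance matrix, using a Cholesky factorisation for the inverse and log-determinant, as is standard in GP implementations; the covariance and data here are synthetic placeholders for the model quantities:

```python
import numpy as np

def gp_log_marginal_likelihood(K, noise_diag, y):
    # log p(y | X) = -1/2 y^T Sigma^{-1} y - 1/2 log|Sigma| - (n/2) log(2 pi),
    # with Sigma = K + Sigma_noise; a Cholesky factor gives a numerically
    # stable solve and log-determinant.
    Sigma = K + np.diag(noise_diag)
    L = np.linalg.cholesky(Sigma)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # Sigma^{-1} y
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()                    # = 1/2 log|Sigma|
            - 0.5 * len(y) * np.log(2.0 * np.pi))

rng = np.random.default_rng(4)
A = rng.normal(size=(10, 10))
K = A @ A.T                        # placeholder for the model covariance
y = rng.normal(size=10)
ll = gp_log_marginal_likelihood(K, 0.1 * np.ones(10), y)
print(np.isfinite(ll))
```

Gradient-based optimisation, such as the conjugate gradient method mentioned above, would maximise this quantity with respect to the covariance hyperparameters and noise variances.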

The mean prediction for a test input \(\mathbf{x}_{*}\) from the primary task is then given by

\[
\bar{f}(\mathbf{x}_{*}) = k^{\mathrm{shared}}(\mathbf{x}_{*},\mathbf{X})^{\top}\,\boldsymbol{\varSigma}^{-1}\,\mathbf{y}
\tag{15}
\]

where \(k^{\mathrm{shared}}(\mathbf{x}_{*},\mathbf{X})\) is the vector of covariances between the test input and the shared functions at the training inputs (primary and all secondary inputs): for a training input **x** in task *t*, the corresponding element of the vector has value \(k^{p}(\mathbf{x}_{*},\mathbf{x})\,k^{T}(p,t) = k^{p}(\mathbf{x}_{*},\mathbf{x})\,\rho_{t}\).

The classification case is similar: we simply use a probit noise model \(p(y_{i}^{t}\mid f_{i}^{t}) = \varPhi(y_{i}^{t}(f_{i}^{t} + b))\), where \(f_{i}^{t}\) is the predicted function value for point *i* in task *t*, *Φ* is the cumulative distribution function of the standard Gaussian \(\mathcal{N}(0,1)\), and *b* is a bias parameter. For the binary classification experiments in Sect. 5.2, we approximate the model likelihood using Expectation Propagation (Minka 2001).

## 3 Related work and discussion

The focused multi-task GP model derived in the previous sections is designed for asymmetric multi-task learning scenarios; we construct a joint GP prior over the functions underlying the tasks that assumes an asymmetric dependency structure. Our approach uses a simple idea to bias the model towards learning the underlying function for the primary task, rather than modelling and learning all the tasks symmetrically. The dependency structure does this by decomposing the underlying task functions for the secondary tasks into “shared” and “specific” components. The shared components come from a joint GP prior with the primary task function: they are conditioned on the primary task function values according to (9), which biases the shared variation between tasks to be due to the primary task function, weighted by a task-specific weight that is learned during training. We additionally assume that each of the secondary task functions can also be explained by a process specific to it, by defining a block-diagonal covariance structure over the secondary tasks. This allows the model to “explain away” secondary-task-specific variation and focuses the model on learning the primary task.

Recently there has been interest in asymmetrical GP multi-task learning (Chai 2009), where generalisation errors for the multi-task GP of Bonilla et al. (2008) were derived for an asymmetrical multi-task case, with one primary and one secondary task. However, this work did not derive a new model for asymmetric multi-task learning, and focused on analysing the symmetric model. In the next section we will analyse our asymmetric model in a similar manner.

Our sparse approximation bears similarities to the sparse GP method of Snelson and Ghahramani (2006), in which the covariance matrix of the GP is decomposed into a reduced-rank matrix. That model assumes a set of *M* pseudo-inputs which, along with their function values (pseudo-targets), act as a pseudo data set providing a compact summary of the real data (*N* data points; *M*≪*N*). The covariance function is parameterised by the pseudo-input locations, which are learned during optimisation by deriving the likelihood of the real data as a predictive likelihood given the pseudo data set. In our focused multi-task GP, the sparse approximation can be interpreted as a special case of the sparse GP model: the pseudo-input locations are fixed to the inputs of the primary task, so that they form a compact representation of the shared function underlying the primary and secondary tasks. The distribution over the secondary task functions can be viewed as the predictive distribution given the primary task function values. In Sect. 2.4 we then assume this compact representation suffices to represent the shared component of variation inside the secondary tasks, so that the remaining off-diagonal elements in the matrix **Λ** are approximated as zero.

In this paper we make the simplifying assumption that the task of interest is entirely composed of the shared function, and that there are no other strong shared functions between other tasks. This model already proves useful in a challenging fMRI task, demonstrating that the idea of asymmetric modelling with explaining-away yields useful results, and it can be extended to more general asymmetric modelling in later stages. For instance, there may be detrimental shared variation between other tasks, which may harm learning of the primary task. In Sect. 6 we briefly study the effect of such detrimental shared variation on our current model. The model could be extended by adding additional GP functions which are shared between other tasks but not with the primary task. The overall model can then learn which shared function is a better explanation. As the number of tasks increases, the number of possible sharing configurations increases (shared functions between 2,3,…,*M* tasks) and the complexity of the model quickly increases. This will be studied in further work.

## 4 Examining the generalisation error for asymmetric and symmetric models

To examine the effect of the processes that are specific to a secondary task, we look at the generalisation error on the primary task for the asymmetric two-task case, in a similar manner to Chai (2009). We investigate the influence of *ρ*, the degree of “relatedness” between the two tasks.

We want to study a continuum where single-task learning is one extreme, pooling all data into one task is the other extreme, and the asymmetric model lies in between them. Note that setting *ρ*=0 in the asymmetric and symmetric cases reduces to single-task learning. For our model, we will study the case where the covariance of the specific function in the secondary tasks has its overall scale set to (1−*ρ* ^{2}); then the extreme of *ρ*=1 will reduce to pooling all data into one task, in both the symmetric and asymmetric cases. This corresponds to using an overall scale of 1 and multiplying the resulting specific covariance by (1−*ρ* ^{2}); we use this notation to make the influence of *ρ* explicit.

Suppose that we have training inputs **X** _{ P } for the primary task, and **X** _{ S } for the secondary task. The covariance matrices **C** _{sym} and **C** _{asym}, for the symmetric and asymmetric cases respectively, of the noisy training data are given by:

**Symmetric case**

$$\mathbf{C}_{\mathrm{sym}}(\rho) = \mathbf{K}^{\mathrm{sym}}(\rho) + \sigma_{n}^{2}\mathbf{I}, \qquad \mathbf{K}^{\mathrm{sym}}(\rho) = \begin{pmatrix} \mathbf{K}^{p}_{PP} & \rho\,\mathbf{K}^{p}_{PS} \\ \rho\,\mathbf{K}^{p}_{SP} & \mathbf{K}^{p}_{SS} \end{pmatrix} \quad (15)$$

**Asymmetric case**

$$\mathbf{C}_{\mathrm{asym}}(\rho) = \mathbf{K}^{\mathrm{asym}}(\rho) + \sigma_{n}^{2}\mathbf{I}, \qquad \mathbf{K}^{\mathrm{asym}}(\rho) = \begin{pmatrix} \mathbf{K}^{p}_{PP} & \rho\,\mathbf{K}^{p}_{PS} \\ \rho\,\mathbf{K}^{p}_{SP} & \rho^{2}\mathbf{K}^{p}_{SS} + (1-\rho^{2})\mathbf{K}^{s}_{SS} \end{pmatrix} \quad (16)$$

where we have used the notation \(\mathbf{K}_{AB}^{p}\) to denote the matrix of covariance values, due to *k* ^{ p }, evaluated between **X** _{ A } and **X** _{ B }. In both the symmetric and asymmetric cases, the top-left terms in the covariances in (15) and (16) are simply the covariance within the primary task, and the cross-terms are due to the assumed correlation of strength *ρ* between the primary and secondary task. For the asymmetric case, the covariance matrix for the secondary task comes from the “shared” covariance function *k* ^{ p } with the primary task, and a “specific” covariance function *k* ^{ s }. The relationship between the primary and secondary tasks due to the *ρ*’s comes directly from (1) and (5) for the symmetric and asymmetric cases respectively; additionally, the multiplier (1−*ρ* ^{2}) in the bottom-right term of **K** ^{asym}(*ρ*) corresponds to setting the magnitude of the specific covariance function to (1−*ρ* ^{2}) to achieve a continuum between single-task learning and pooling all tasks.
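To make the two constructions concrete, the following Python sketch builds \(\mathbf{C}_{\mathrm{sym}}(\rho)\) and \(\mathbf{C}_{\mathrm{asym}}(\rho)\) of (15)–(16) for one-dimensional inputs; the squared exponential kernels, lengthscales and noise variance mirror the setup of Sect. 4.1, but the function names are our own:

```python
import numpy as np

def se_kernel(A, B, lengthscale):
    """1-D squared exponential kernel with unit overall scale."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * d**2 / lengthscale**2)

def cov_matrices(Xp, Xs, rho, lp=0.11, ls=1.0, noise=0.05):
    """Noisy-data covariances C_sym(rho) and C_asym(rho) of (15)-(16)."""
    Kpp = se_kernel(Xp, Xp, lp)
    Kps = se_kernel(Xp, Xs, lp)
    Kss_p = se_kernel(Xs, Xs, lp)   # shared covariance k^p on X_S
    Kss_s = se_kernel(Xs, Xs, ls)   # secondary-specific covariance k^s
    I = noise * np.eye(len(Xp) + len(Xs))
    top = np.hstack([Kpp, rho * Kps])
    C_sym = np.vstack([top, np.hstack([rho * Kps.T, Kss_p])]) + I
    C_asym = np.vstack([top, np.hstack(
        [rho * Kps.T, rho**2 * Kss_p + (1 - rho**2) * Kss_s])]) + I
    return C_sym, C_asym
```

At *ρ*=1 the two matrices coincide (pooling), while at *ρ*=0 the cross-blocks vanish in both (single-task learning).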

### 4.1 Generalisation error for a test point **x** _{∗}

The posterior variance at **x** _{∗} for the primary task (due to the noise-free *f* ^{ p }) is also the generalisation error at **x** _{∗}. It is given by

$$\sigma_{\mathrm{sym}}^{2}(\mathbf{x}_{*},\rho) = k_{**} - \mathbf{k}_{*}^{\top}\mathbf{C}_{\mathrm{sym}}(\rho)^{-1}\mathbf{k}_{*}, \qquad \sigma_{\mathrm{asym}}^{2}(\mathbf{x}_{*},\rho) = k_{**} - \mathbf{k}_{*}^{\top}\mathbf{C}_{\mathrm{asym}}(\rho)^{-1}\mathbf{k}_{*}, \quad (17,\ 18)$$

where *k* _{∗∗} is the prior variance at **x** _{∗}, *k* ^{ p }(**x** _{∗}, **x** _{∗}), and \(\mathbf{k}_{*}^{\top}= (k^{p}(\mathbf{x}_{*},\mathbf{X}_{P})\hspace{0.2cm} \rho k^{p}(\mathbf{x}_{*},\mathbf{X}_{S}))\). We note that the target values *y* do not affect the posterior variance at the test locations, and we have omitted the dependence on **X** _{ P }, **X** _{ S } and \(\sigma^{2}_{n}\) in the notation \(\sigma_{\mathrm{sym}}^{2}(\mathbf{x}_{*},\rho), \sigma_{\mathrm{asym}}^{2}(\mathbf{x}_{*},\rho)\) for clarity.

Figure 2 plots the posterior variance at test points **x** _{∗}, given two observations for the primary task and three observations of the secondary task (see figure for more details). Following the setup of Chai (2009), we use a squared exponential covariance function with lengthscale 0.11 for *k* ^{ p }, noise variance \(\sigma_{n}^{2} = 0.05\), and, for the asymmetric setup, a squared exponential covariance function with lengthscale 1 for *k* ^{ s }.

Each plot contains 6 curves corresponding to *ρ* ^{2}=[0,1/8,1/4,1/2,3/4,1], and the dashed line shows the prior noise variance. The training points from the primary task (⋄) create a depression that reaches the prior noise variance for all the curves. However, the depression created by the training points for the secondary task (∘) depends on *ρ*. For the single-task learning case (*ρ*=0), no knowledge is transferred from the secondary task. As *ρ* increases, the generalisation error at the secondary task test points decreases. For the intermediate *ρ* ^{2} values (i.e., neither 0 nor 1, full correlation), our asymmetric model gives a smaller posterior variance than the symmetric model at secondary task locations, and therefore suggests a smaller generalisation error.
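The curves in Fig. 2 can be reproduced qualitatively from (17)–(18); a minimal sketch, with our own helper names and unit prior variance *k* _{∗∗}=1:

```python
import numpy as np

def se(a, b, l):
    """1-D squared exponential kernel with unit overall scale."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / l**2)

def posterior_var(Xp, Xs, xstar, rho, asym, lp=0.11, ls=1.0, noise=0.05):
    """Posterior variance (17)/(18) at a primary-task test point xstar."""
    Kpp, Kps, Kss = se(Xp, Xp, lp), se(Xp, Xs, lp), se(Xs, Xs, lp)
    bottom = rho**2 * Kss + (1 - rho**2) * se(Xs, Xs, ls) if asym else Kss
    C = np.block([[Kpp, rho * Kps], [rho * Kps.T, bottom]])
    C += noise * np.eye(len(C))
    xs = np.array([xstar])
    kstar = np.concatenate([se(xs, Xp, lp)[0], rho * se(xs, Xs, lp)[0]])
    return 1.0 - kstar @ np.linalg.solve(C, kstar)  # prior variance k_** = 1
```

As a sanity check, the symmetric and asymmetric variances coincide at the two ends of the continuum, *ρ*=0 and *ρ*=1.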

### 4.2 Intuition about the generalisation errors

For both models, we define the matrix **A**(*ρ*) as the covariance of the secondary-task observations conditioned on the primary-task data:

$$\mathbf{A}_{\mathrm{sym}}(\rho) = \mathbf{K}^{p}_{SS} + \sigma_{n}^{2}\mathbf{I} - \rho^{2}\mathbf{K}^{p}_{SP}\bigl(\mathbf{K}^{p}_{PP} + \sigma_{n}^{2}\mathbf{I}\bigr)^{-1}\mathbf{K}^{p}_{PS},$$

$$\mathbf{A}_{\mathrm{asym}}(\rho) = \rho^{2}\mathbf{K}^{p}_{SS} + (1-\rho^{2})\mathbf{K}^{s}_{SS} + \sigma_{n}^{2}\mathbf{I} - \rho^{2}\mathbf{K}^{p}_{SP}\bigl(\mathbf{K}^{p}_{PP} + \sigma_{n}^{2}\mathbf{I}\bigr)^{-1}\mathbf{K}^{p}_{PS},$$

where the multiplier (1−*ρ* ^{2}) in the asymmetric case is equivalent to setting the overall scale of \(\mathbf{K}^{s}_{SS}\) to (1−*ρ* ^{2}), which is here done to establish a continuum from single-task learning at *ρ*=0 to pooling all tasks at *ρ*=1. If \(\mathbf{A}_{\mathrm{asym}}(\rho) \preceq \mathbf{A}_{\mathrm{sym}}(\rho)\) then

$$\sigma_{\mathrm{asym}}^{2}(\mathbf{x}_{*},\rho) = \sigma^{2}(\mathbf{x}_{*},0) - \mathbf{v}(\rho)^{\top}\mathbf{A}_{\mathrm{asym}}(\rho)^{-1}\mathbf{v}(\rho) \le \sigma^{2}(\mathbf{x}_{*},0) - \mathbf{v}(\rho)^{\top}\mathbf{A}_{\mathrm{sym}}(\rho)^{-1}\mathbf{v}(\rho) = \sigma_{\mathrm{sym}}^{2}(\mathbf{x}_{*},\rho),$$

where \(\sigma^{2}(\mathbf{x}_{*},0)\) is the single-task posterior variance, we have used the Banachiewicz inversion formula to evaluate the matrix inversions in (17) and (18), and we have defined \(\mathbf{v}(\rho) = \rho(k^{p}(\mathbf{X}_{S},\mathbf{x}_{*}) - \mathbf{K}_{SP}^{p}(\mathbf{K}_{PP}^{p} + \sigma_{n}^{2}\mathbf{I})^{-1}k^{p}(\mathbf{X}_{P},\mathbf{x}_{*}))\).

The asymmetric model has more flexibility than the symmetric model in the modelling of the secondary task, since it uses both *f* ^{ p } and *f* ^{ s }, rather than just *f* ^{ p }. We expect that **A**(*ρ*) for the asymmetric version would be smaller than for the symmetric since the additional flexibility should allow more accurate modelling of the covariances between the secondary task points, and hence the asymmetric generalisation error should be smaller than the symmetric.

## 5 Experiments

In this section, we demonstrate the performance of the focused multi-task GP model on a synthetic regression problem in Sect. 5.1. In Sect. 5.2, we compare our model’s performance with alternative models on an asymmetric multi-task classification problem on fMRI data. In all experiments, we use squared exponential covariance functions with automatic relevance determination (ARD) prior: \(k(\mathbf{x},\mathbf{x}') = \sigma_{s}^{2} \exp(-\frac{1}{2}\sum_{d} (\mathbf{x}_{d} - \mathbf{x}'_{d})^{2}/l_{d}^{2})\), where \(\sigma_{s}^{2}\) is the overall scale and *l* _{ d } is the lengthscale for the *d*th input dimension, initialised to 1. This prior is used for both primary and secondary task functions. With this choice, the hyperparameters to be learned are the lengthscales *l* _{ d } and overall scale \(\sigma_{s}^{2}\), with separate parameters for the covariance function of the shared components and for each covariance function of each specific component; additionally the parameters of the observation model are learned (noise variance \(\sigma_{t}^{2}\) for each task in a regression setting), and the task similarity coefficients *ρ* _{ t }.
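The ARD squared exponential covariance function above can be written in a few lines; a sketch under the stated parameterisation (the function name is ours):

```python
import numpy as np

def ard_se(X, X2, scale2=1.0, lengthscales=None):
    """ARD squared exponential:
    k(x, x') = scale2 * exp(-0.5 * sum_d (x_d - x'_d)^2 / l_d^2).

    X, X2: arrays of shape (n, D) and (m, D). Lengthscales default to 1
    per dimension, matching the initialisation used in the experiments."""
    D = X.shape[1]
    l = np.ones(D) if lengthscales is None else np.asarray(lengthscales, dtype=float)
    diff = (X[:, None, :] - X2[None, :, :]) / l
    return scale2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))
```

A large learned lengthscale *l* _{ d } effectively switches off an irrelevant input dimension, which is the point of the ARD parameterisation.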

### 5.1 Synthetic data

We use synthetic data to demonstrate how our focused multi-task model learns a regression function (the primary task) in conjunction with several related regression problems (the secondary tasks). The model is able to learn the primary task even where there is missing data, by using the shared signal learned from the secondary tasks. We also show how the number of secondary tasks affects the learning of the primary task: as the number of secondary tasks increases, the mean squared error of the predictions on the test set decreases. In this section we show this behaviour in a synthetic experiment where the generation of data follows our model and the ground-truth usefulness of the secondary tasks is known; in the next section we will show that good performance is also achieved for asymmetric learning in a real-life case study with several secondary tasks.

The input data are **x**, 100 samples evenly spaced on the interval [−5,5]. The primary task function is generated from \(\mathbf{f}^{p} \sim \mathcal{GP}(0,\mathbf{K}_{p})\), where the kernel function is squared exponential with lengthscale 1 and overall scale \(\sigma_{s}^{2}=1\). The secondary task function in each secondary task *s* _{ m } is generated according to \(\mathbf{f}^{\,s_{m}} \sim \mathcal{GP}(\alpha_{m}\mathbf{f}^{p}, \beta_{m}\mathbf{K}^{\mathrm{spec}}_{s_{m}})\); i.e., the mean is a scaled version (by *α* _{ m }) of the primary task function. Each specific kernel function \(\mathbf{K}^{\mathrm{spec}}_{s_{m}}\) is squared exponential with lengthscale 1, and *α* _{ m } is drawn at random from \(\mathcal{N}(0,1)\), *β* _{ m } at random from [0,1]. We assume a Gaussian observation noise model.
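The generative process just described can be sketched as follows; the seed, the number of secondary tasks, and the observation-noise level are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def se_cov(x, lengthscale=1.0, scale2=1.0, jitter=1e-8):
    """Squared exponential covariance matrix with a small jitter term."""
    d = x[:, None] - x[None, :]
    return scale2 * np.exp(-0.5 * d**2 / lengthscale**2) + jitter * np.eye(len(x))

x = np.linspace(-5.0, 5.0, 100)                          # shared inputs
f_p = rng.multivariate_normal(np.zeros(100), se_cov(x))  # primary task function

M, noise_sd = 5, 0.1   # number of secondary tasks and noise level (our choices)
secondary = []
for m in range(M):
    alpha = rng.standard_normal()          # alpha_m ~ N(0, 1)
    beta = rng.uniform(0.0, 1.0)           # beta_m  ~ U[0, 1]
    # secondary task: scaled primary function plus a specific GP component
    f_s = rng.multivariate_normal(alpha * f_p, beta * se_cov(x))
    secondary.append(f_s + noise_sd * rng.standard_normal(100))
```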

Figure 4(b) shows the mean of the posterior distribution (marked as “ppf”; black color in the online version) over the primary task function for one of the runs, for different numbers of secondary tasks. We also plot the true underlying primary function (marked as “true”; blue line in the online version), showing that the model can predict the missing part of the primary task function by transferring information from secondary tasks. The prediction gets nearer to the true underlying primary task function, as the number of relevant secondary tasks increases. Figure 4(a) shows that the mean squared error on the test set decreases as the number of secondary tasks increases.

### 5.2 fMRI data

In this section, we evaluate the performance of our model on fMRI data, obtained from Malinen et al. (2007). We consider the task of predicting whether a subject is reacting to a particular stimulus “touch”, given the fMRI data. We aim to improve the learning of this primary task by learning it in conjunction with other, related tasks from the other subjects in the experiment. We also include some less related tasks in the secondary task set to show how our model can overcome negative transfer, and focus on the relevant shared signal. The main goal of the experiment is to show that good performance of asymmetric learning can be achieved not only for the artificial data of the previous section but also in a real-life case study, and to moreover show that the asymmetric learning will outperform state of the art alternative methods.

The fMRI data comes from six healthy young adults who participated in two identical sessions, in which they received a continuous 8-min sequence comprising auditory, visual and tactile stimuli in blocks of 6×33 s. The stimuli of different senses never overlapped. Whole-head volumes were acquired with a Signa VH/i 3.0 T MRI scanner (General Electric, Milwaukee, WI) using a gradient EPI sequence (TR = 3 s, TE = 32 ms, FOV = 20 cm, flip = 90^{∘}, 64×64×44 voxels with resolution 3×3×3 mm^{3}). In each session, 165 volumes were recorded, with the first 4 time points excluded from further analysis. Preprocessing of the fMRI data included realignment, normalisation with skull stripping, and smoothing. For additional details on the measurements and applied preprocessing, see Ylipaavalniemi et al. (2009). After preprocessing, the dimensionality was reduced to 40 by spatial independent component analysis (ICA), which identified spatial brain activation patterns related to various aspects of the stimuli. For each adult, the resulting data is 161 sets of ICA features (40 dimensional), which can be classified according to one of 6 stimuli (“touch”, “auditory” (tones, history, instruction), “visual” (faces, hands, buildings)).

Asymmetric multi-task setup for the fMRI data study

| Subject | Classification task |
|---|---|
| 1 (primary) | “touch” against all |
| 2 (secondary) | “touch” against all |
| 3 (secondary) | “touch” against all |
| 4 (secondary) | “touch” against all |
| 5 (secondary) | “auditory” (instruction) against all |
| 6 (secondary) | “visual” (buildings) against all |

We compare the focused multi-task learning approach (“focused MT-GP”) with five reference models. The first baseline model is single-task learning using GP classification (“single task GP”), trained only on the samples of the primary task. The second (“pooled GP”) learns a GP classification model from the training examples of all tasks (i.e. treating all data as a single task). For “pooled GP” we use a sparse approximation when the number of training examples >300, using 30 pseudo-inputs. We also compare to three state-of-the-art methods, one transfer learning method and two (symmetric) multi-task learning methods: the predictive structure learning method of Ando and Zhang (2005, “AZ”), the symmetric multi-task learning with Dirichlet process priors method (“DP-MT”) from Xue et al. (2007), and the symmetric multi-task GP method (“MT-GP”) from Bonilla et al. (2008); the symmetric multi-task GP method was previously discussed in Sects. 2.1 and 2.2. For the “AZ” method, we fix the dimension of the shared predictive structure heuristically to *h*=26, by performing PCA across all the training samples (primary and secondary) and finding the dimension of the subspace that explains 80 % of the variance.

We evaluate the methods using a fixed number of training examples in the primary task (64 and 161), while varying the number of training examples in each secondary task (ranging from 4 to 160), over 5 repetitions. We change the amount of secondary task data to investigate how the models’ performance is affected by the tasks (2–4) that may help learning on the primary task, and the more unrelated tasks (5–6). Note that the number of secondary tasks is fixed, only the amount of data in the secondary tasks changes. Due to the class imbalance in the data, when randomly picking a subset of secondary training task examples, we ensure that there is at least one positive and one negative example. For the GP-based methods, we also fix the bias parameter *b*=*Φ* ^{−1}(*r*) of the probit noise model, where *r* is the ratio of positive samples to negative samples in the training data.

Pooling of samples appears to always be a bad choice on this data. We also find that the symmetric models (MT-GP and DP-MT) perform poorly: both perform only roughly on par with single-task learning for small amounts of secondary task data, and their performance worsens as the amount of secondary data increases. Hence it seems that the secondary data here differ from the primary data to an extent that causes negative transfer. AZ works better, but at most on the same level as single-task learning; more careful model selection might improve its performance.

Focused MT-GP seems able to leverage the secondary tasks, clearly outperforming the others, including single-task learning, when the amount of data in the primary task is small. Multi-task learning is most relevant when the primary task has little data, and focused MT-GP performs well in this scenario. When there is more primary data, single-task learning improves rapidly, although in Fig. 5 focused MT-GP still outperforms it. Focused MT-GP seems to need more than a few samples in the secondary tasks in order to perform well; the explanation is probably that for this data it is hard to distinguish between useful and negative transfer, and more data are needed to make the choice. The bad performance of pooling and of the symmetric multi-task approaches supports this interpretation. We will investigate the effect of small sample sizes on negative transfer in future work; the current results already show that asymmetric learning works well and outperforms the other methods given a reasonable number of samples in the secondary tasks.

## 6 Investigating effect of negative transfer in our model

Negative transfer essentially happens when a model mistakes non-related properties of a secondary task as being related. This might happen with small sample sizes even in a well-specified model, but it may become much more prevalent if the model assumptions are incorrect. Although our asymmetric learning model involves flexible assumptions about task relationships, it is important to examine how well the model performs when those assumptions are violated. In this section we study the effect of violating the model assumptions, and the resulting negative transfer, in a controlled setting.

Our model is based on the assumption that there is an asymmetrical sharing structure within the data, with the emphasis on learning the sharing between the primary and secondary tasks. However, if there is strong shared structure between the secondary tasks which is not shared with the primary task, this could cause the model to learn that shared structure rather than the primary task function, yielding negative transfer to the primary task.

The shared noise function **f** _{ a }, which is shared between the secondary tasks only, is generated from a GP with a squared exponential covariance function and lengthscale 1/3. The secondary task functions for the secondary tasks *s* _{ m } are generated according to \(\mathbf{f}^{\,s_{m}} \sim\mathcal{GP}((s-1)\alpha _{m,1}\mathbf{f}_{p} + s\alpha_{m,2}\mathbf{f}_{a}, \beta_{m}\mathbf{K}^{\mathrm{spec}}_{s_{m}})\), where **f** _{ a } are the values generated for the shared noise function. Each specific kernel function is squared exponential with lengthscale 0.5. The *α* _{ m,1}, *α* _{ m,2}, *β* _{ m } are drawn uniformly at random from [0,1], and *s* is an indicator of whether the secondary task shares **f** _{ a }. We also add Gaussian noise generated from \(\mathcal{N}(0,0.01)\). We generate 10 secondary task functions, vary the number of secondary tasks that share **f** _{ a } (the number which have *s*=1) from 0 to 10, and use 10 replications. Figure 6(a) shows the mean squared error between the true underlying function and the predictive posterior mean over the test inputs, for each number of sharing tasks; the mean squared error remains low up to 6 tasks with the shared noise function. Figure 6(b) shows the correlation coefficient *ρ* averaged over the secondary tasks, as the strength of **f** _{ a } increases; the average correlation coefficient decreases as ever more tasks use the shared noise function, showing that the model correctly learns that many of the secondary tasks are not useful for the primary task. Figures 6(c–f) show the posterior means for the test inputs for each run.

Overall, in this experiment our method appears reasonably tolerant to the presence of the non-useful shared noise function: as shown in Fig. 6(a), performance remains stable even with up to 6 of the 10 secondary tasks featuring it.

## 7 Asymmetric vs. symmetric multi-task learning

In the fMRI case study of Sect. 5.2 our asymmetric multi-task learning method outperformed several comparison methods, including the most closely related symmetric multi-task learning approach, the method of Bonilla et al. (2008) which is based on Gaussian processes and is here called “Symmetric multi-task GP”. Unlike our method, Symmetric MT-GP treats all tasks as equally important. The mathematical formulation of Symmetric MT-GP has been briefly discussed in Sects. 2.1 and 2.2. In this section we show that both our asymmetric model and the symmetric model will perform well for certain domains of problems, and both should be part of the multi-task learning “toolbox”.

We compare the performance of our method and the symmetric MT-GP on a continuum of multi-task learning problem domains. At the left end of the continuum, the problems follow the assumptions of our focused multi-task learning GP (“focused MT-GP”), and at the right end the problems follow the assumptions of symmetric MT-GP. For each domain, and each learning problem in the domain, the performance of the methods is evaluated by mean-square error over test samples in the primary task.

In detail, we evaluate the performance of the methods at 10 points along the continuum of domains, and we generate 30 multi-task learning problems from each domain along the continuum. All the learning problems are regression problems similar to Sect. 5.1: each problem contains 10 one-dimensional regression tasks (data sets), where the first task is the primary task and others are secondary tasks. Each secondary task has 50 input samples uniformly distributed along the interval [0,1]. The primary task has fewer samples, and moreover all primary task samples in the middle interval [0.25,0.75] have been left out of the training data, leaving 15 samples on average in the primary task. This design was chosen to highlight the multi-task learning ability of the methods: because training samples in the primary task are not provided for the middle of the input interval in the primary task, the primary function along the middle interval can only be learned well by learning across tasks.

In each task, the outputs for the inputs are generated from a weighted sum of Gaussian process functions, plus observation noise from a Gaussian observation noise model. The weighting of the GP functions is generated according to the domain: at the left end of the domain continuum the functions follow the Focused multi-task GP model, so that the primary task uses a single GP function \(\mathbf{f}^{p} \sim \mathcal{GP}(0,\mathbf{K}_{p})\), and in each secondary task *s* _{ m } the GP function \(\mathbf{f}^{\,s_{m}} = \alpha_{m} \mathbf{f}^{p} + \mathbf{f}^{\,s_{m},\mathrm{specific}}\) is a sum of the primary function and a specific function \(\mathbf{f}^{s_{m},\mathrm{specific}} \sim\mathcal{GP}(0,\mathbf{K}^{\mathrm {spec}}_{s_{m}})\), where the multiplier *α* _{ m } is drawn uniformly from [0,1]. All kernels **K** _{ p } and \(\mathbf{K}^{\mathrm{spec}}_{s_{m}}\) are squared exponential kernels with length scale 0.05. At the right end of the continuum, *K*=10 GP functions are shared across all tasks, so that \(\mathbf{f}_{k} \sim \mathcal{GP}(0,\mathbf{K}_{k})\) for *k*=1,…,10, and each primary or secondary task uses four of the 10 functions with random weights: \(\mathbf{f}^{p} = \sum_{k=1}^{K} w_{p,k} \mathbf{f}_{k}\) and \(\mathbf{f}^{\,s_{m}} = \sum_{k=1}^{K} w_{s,k}^{m} \mathbf{f}_{k}\) where a randomly chosen subset of four weights *w* _{ p,k } are drawn uniformly from [0,1] and the other six weights are zero, and similarly for weights \(w_{s,k}^{m}\) in each secondary task. This generative process at the right end of the continuum follows the assumptions of the Symmetric multi-task GP. In the intermediate domains the weights are generated with both procedures and linearly mixed together, yielding a smooth transition along the continuum from one kind of learning problems to the other. 
We ran both our Focused multi-task GP method and the Symmetric multi-task GP method for all problems in all domains and computed the error on test data from the primary task of each problem.^{1}
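The weight generation along the continuum can be sketched as follows, ignoring the task-specific components at the focused end for brevity; the function name, variable names, and seed are our own:

```python
import numpy as np

rng = np.random.default_rng(1)

def task_weights(K, mix, n_tasks):
    """Weights over K latent GP functions for each task, linearly mixing the
    focused scheme (mix = 0: every task loads only on the primary function,
    index 0) with the symmetric scheme (mix = 1: each task uses a random
    subset of four of the K functions with uniform weights)."""
    W_focused = np.zeros((n_tasks, K))
    W_focused[:, 0] = rng.uniform(0.0, 1.0, n_tasks)  # alpha_m ~ U[0, 1]
    W_focused[0, 0] = 1.0                             # the primary task itself
    W_symmetric = np.zeros((n_tasks, K))
    for t in range(n_tasks):
        idx = rng.choice(K, size=4, replace=False)
        W_symmetric[t, idx] = rng.uniform(0.0, 1.0, 4)
    return (1.0 - mix) * W_focused + mix * W_symmetric
```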

## 8 Model with several shared components

In our asymmetric learning model, only one shared function was used between tasks; this was already sufficient to yield very good performance in the previous experiments. We now point out that our model is not limited to one shared function: it is easy to extend our focused multi-task Gaussian process model to incorporate several shared functions, each contributing to the primary task and shared differently among the secondary tasks.

Note that in the extended model we will present in this section, there is still only one primary task; the difference is that the model can now handle more complex sharing between the primary task and the secondary tasks since the component functions of the primary task can each be shared differently with the secondary tasks.

Assume *L* shared functions *f* ^{ p,l }, *l*=1,…,*L*. The primary task GP function is a simple sum of the shared functions: \(f^{p} = \sum_{l=1}^{L} f^{p,l}\). In each secondary task *s* _{ i }, the GP function is a weighted sum of the shared functions plus a task-specific function: \(f^{s_{i}} = f^{s_{i},\mathrm{specific}}+\sum_{l=1}^{L} \rho^{s_{i},l}f^{p,l}\). Note that each secondary task uses each shared function with a different multiplier \(\rho^{s_{i},l}\), allowing different secondary tasks to share different kinds of shared functions, and the variation not shared with the primary task is again explained away with the task-specific GP function \(f^{s_{i},\mathrm{specific}}\).^{2}

Figure 8 shows the setup. The corresponding covariance function between outputs \(y^{s_{i}}_{j}\) and \(y^{s_{i'}}_{j'}\) in two secondary tasks *s* _{ i } and *s* _{ i′} is

$$k\bigl(y^{s_{i}}_{j}, y^{s_{i'}}_{j'}\bigr) = \sum_{l=1}^{L}\rho^{s_{i},l}\rho^{s_{i'},l}\,k^{p}(\mathbf{x}_{j},\mathbf{x}_{j'}) + \delta(s_{i},s_{i'})\,k^{s_{i},\mathrm{specific}}(\mathbf{x}_{j},\mathbf{x}_{j'}), \quad (23)$$

where the function *δ* is 1 if the task indices *s* _{ i } and *s* _{ i′} are the same and zero otherwise; the covariance between a secondary task and the primary task (say *s* _{ i } is the primary task) is the same except that the multipliers \(\rho^{s_{i},l}\) are replaced by ones and the function *δ* is replaced by zero; and the covariance within the primary task is simply *L* times the covariance function *k* ^{ p } over inputs. For simplicity we did not apply the computational speedup approximation used in (9) here, so (23) is directly used to compute the covariance function. This covariance function defines the GP prior over all tasks; Bayesian inference over the functions then proceeds as before, and the hyperparameters of the GP prior can again be learned by maximising the marginal log likelihood. Our model with a single shared component is a special case of this model with *L*=1.
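A direct (unoptimised) implementation of this covariance might look as follows; the O(n²) double loop and the names are ours, and the primary task is handled by fixing its *ρ* row to ones and omitting its specific term:

```python
import numpy as np

def multi_component_cov(X, tasks, rho, kp, ks_list, primary=0):
    """Covariance over all outputs with L shared components.

    X: 1-D inputs; tasks[j]: task index of output j; rho[i, l]: weight of
    shared function l in task i (the primary task's row is fixed to ones);
    ks_list[i]: specific kernel of task i (unused for the primary task)."""
    n = len(X)
    K = np.zeros((n, n))
    for j in range(n):
        for jp in range(n):
            i, ip = tasks[j], tasks[jp]
            # shared part: sum over L components, all with base kernel k^p
            K[j, jp] = np.dot(rho[i], rho[ip]) * kp(X[j], X[jp])
            if i == ip and i != primary:  # delta term: explain-away specific
                K[j, jp] += ks_list[i](X[j], X[jp])
    return K
```

With *L*=2 and the primary row of `rho` set to ones, the within-primary covariance reduces to *L* times the base kernel, as stated above.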

In the demonstration, the primary task function is a sum of two sinusoidal functions of *x*, one of them 0.8cos(30*x*); the secondary tasks each share the two sinusoids with different strengths and also contain a specific function generated from a Gaussian process; a small amount of Gaussian observation noise is added to the data of each task. The training data are shown in Fig. 9 (left). We use the extended version of our model, here with two shared functions, to learn the primary task. The resulting prediction is shown in Fig. 9 (right) and yields a mean squared error of 0.005 over 101 equally spaced test points, whereas a model learned with only one shared function yields a larger mean squared error of 0.022.

In our fMRI case study the model with a single shared component already yielded good results, but as shown here the flexibility provided by multiple shared components may be useful in future application domains.

## 9 Conclusion

We derived a multi-task Gaussian process learning method, the “focused multi-task GP”, designed for asymmetric multi-task learning scenarios, to facilitate improved learning on a primary task through the transfer of relevant information from a set of potentially related secondary tasks. The novel dependency structure was formulated from the GP predictive distribution over the secondary tasks given the primary task, constraining the secondary tasks to be conditionally independent. After observing the primary task, the primary task function can be used to predict a part of each secondary task, depending on the degree of task relatedness, which is learned during optimisation. The model also permits each secondary task to have its own task-specific variation unshared with the primary task, and this flexibility should cause the model to focus on modelling the primary task function well. We demonstrated the model on synthetic data and on an asymmetric multi-task learning problem with fMRI data, and showed improved performance over baseline approaches as well as state-of-the-art transfer learning and multi-task learning methods. We also experimentally demonstrated the behaviour of the model under increasing non-useful shared variation. We showed that the model outperforms the comparable symmetric multi-task approach over several problem domains, while overall both symmetric and asymmetric models should be part of the multi-task learning “toolbox”. Lastly, we presented an extension of the model to several components shared with the primary task, and demonstrated its good performance in an initial experiment.

The key idea in the model is to make simplifying conditional independence assumptions about the relationships of the secondary tasks, but to compensate for this simplicity by adding a flexible “explaining away” model for each secondary task to reduce negative transfer. This structure is expected to perform well when the data fulfil the independence assumptions, but, thanks to the “explaining away” models, also reasonably well in the ubiquitous case where the data do not exactly fit either this model or its alternatives. The performance was demonstrated empirically in this paper, and also analysed briefly. More theoretical analysis of the power of the “explaining away” models is still needed.

## Footnotes

- 1.
For Symmetric multi-task GP we used its authors’ implementation available at http://users.cecs.anu.edu.au/~u4882938/code.html, with random initialisations of the multitask matrix and fixed initialisation for other parameters. For our method, we did not use the computational speedup approximation of (9) since the data sets are fairly small, and we placed simple flat priors for the hyperparameters of the GPs. Both methods were run in Matlab and we kept their running times roughly equal (221 s per problem for our method, 279 s per problem for the symmetric multi-task GP); in that time, we were able to run three runs per problem of Focused multi-task GP from random initialisations, taking the run with the best internal cost function value, and one run per problem of Symmetric multi-task GP.

- 2.
Note that there is no need to consider more than one specific function for a secondary task: if several GP functions specific to some secondary task exist, their total contribution to the secondary task function is a weighted sum equivalent to a single specific GP function.
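The equivalence claimed in this footnote follows from the closure of Gaussian processes under independent weighted sums; in illustrative notation (not the paper's), if a secondary task had several independent specific functions $f^{(1)}, \dots, f^{(m)}$ with $f^{(i)} \sim \mathcal{GP}(0, k_i)$, then

$$
\sum_{i=1}^{m} w_i f^{(i)} \;\sim\; \mathcal{GP}\!\left(0,\; \sum_{i=1}^{m} w_i^2 k_i\right),
$$

since the means add and, by independence, the covariances add with squared weights. Replacing the several specific functions by a single GP with the summed kernel therefore loses no generality.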

## Notes

### Acknowledgements

The authors belonged to the Adaptive Informatics Research Centre, a national CoE of the Academy of Finland, and S.K. and J.P. currently belong to the Finnish Centre of Excellence in Computational Inference Research COIN (grant no 251170). J.P. was supported by the Academy of Finland, decision numbers 123983 and 252845. This work was also supported in part by the PASCAL2 Network of Excellence, ICT 216886, and by the Tekes Multibio project.

## References

- Alvarez, M., & Lawrence, N. D. (2009). Sparse convolved Gaussian processes for multi-output regression. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), *Advances in neural information processing systems* (Vol. 21, pp. 57–64). Cambridge: MIT Press.
- Ando, R. K., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. *Journal of Machine Learning Research*, *6*, 1817–1853.
- Bickel, S., Bogojeska, J., Lengauer, T., & Scheffer, T. (2008). Multi-task learning for HIV therapy screening. In A. McCallum & S. Roweis (Eds.), *Proceedings of the 25th annual international conference on machine learning (ICML 2008)* (pp. 56–63). Madison: Omnipress.
- Bickel, S., Sawade, C., & Scheffer, T. (2009). Transfer learning by distribution matching for targeted advertising. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), *Advances in neural information processing systems* (Vol. 21, pp. 145–152). Cambridge: MIT Press.
- Bonilla, E. V., Chai, K. M. A., & Williams, C. K. I. (2008). Multi-task Gaussian process prediction. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), *Advances in neural information processing systems* (Vol. 20, pp. 153–160). Cambridge: MIT Press.
- Chai, K. M. A. (2009). Generalization errors and learning curves for regression with multi-task Gaussian processes. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), *Advances in neural information processing systems* (Vol. 22, pp. 279–287). Cambridge: MIT Press.
- Leen, G., Peltonen, J., & Kaski, S. (2011). Focused multi-task learning using Gaussian processes. In D. Gunopulos, T. Hofmann, D. Malerba, & M. Vazirgiannis (Eds.), *Machine learning and knowledge discovery in databases, proceedings of ECML PKDD 2011* (Part II, pp. 310–325). Berlin: Springer.
- Malinen, S., Hlushchuk, Y., & Hari, R. (2007). Towards natural stimulation in fMRI—issues of data analysis. *Neuroimage*, *35*(1), 131–139.
- Marx, Z., Rosenstein, M. T., & Kaelbling, L. P. (2005). Transfer learning with an ensemble of background tasks. In *Inductive transfer: 10 years later, NIPS 2005 workshop*.
- Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. S. Breese & D. Koller (Eds.), *Proceedings of the 17th conference in uncertainty in artificial intelligence* (pp. 362–369). San Francisco: Morgan Kaufmann.
- Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. *IEEE Transactions on Knowledge and Data Engineering*, *22*, 1345–1359.
- Raina, R., Ng, A. Y., & Koller, D. (2005). Transfer learning by constructing informative priors. In *Inductive transfer: 10 years later, NIPS 2005 workshop*.
- Rosenstein, M. T., Marx, Z., & Kaelbling, L. P. (2005). To transfer or not to transfer. In *Inductive transfer: 10 years later, NIPS 2005 workshop*.
- Snelson, E., & Ghahramani, Z. (2006). Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), *Advances in neural information processing systems* (Vol. 18, pp. 1257–1264). Cambridge: MIT Press.
- Teh, Y. W., Seeger, M., & Jordan, M. I. (2005). Semiparametric latent factor models. In R. G. Cowell & Z. Ghahramani (Eds.), *Proceedings of AISTATS 2005, the tenth international workshop on artificial intelligence and statistics* (pp. 333–340). Society for Artificial Intelligence and Statistics. Available electronically at http://www.gatsby.ucl.ac.uk/aistats/.
- Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet processes. *Journal of the American Statistical Association*, *101*(476), 1566–1581.
- Thrun, S. (1996). Is learning the n-th thing any easier than learning the first? In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), *Advances in neural information processing systems* (Vol. 8, pp. 640–646). Cambridge: MIT Press.
- Wackernagel, H. (1994). Cokriging versus kriging in regionalized multivariate data analysis. *Geoderma*, *62*, 83–92.
- Wu, P., & Dietterich, T. G. (2004). Improving SVM accuracy by training on auxiliary data sources. In R. Greiner & D. Schuurmans (Eds.), *Proceedings of the 21st international conference on machine learning (ICML 2004)* (pp. 871–878). Madison: Omnipress.
- Xue, Y., Liao, X., & Carin, L. (2007). Multi-task learning for classification with Dirichlet process priors. *Journal of Machine Learning Research*, *8*, 35–63.
- Ylipaavalniemi, J., Savia, E., Malinen, S., Hari, R., Vigário, R., & Kaski, S. (2009). Dependencies between stimuli and spatially independent fMRI sources: towards brain correlates of natural stimuli. *Neuroimage*, *48*, 176–185.
- Yu, K., & Tresp, V. (2005). Learning to learn and collaborative filtering. In *Inductive transfer: 10 years later, NIPS 2005 workshop*.
- Yu, K., Chu, W., Yu, S., Tresp, V., & Xu, Z. (2007). Stochastic relational models for discriminative link prediction. In B. Schölkopf, J. Platt, & T. Hoffman (Eds.), *Advances in neural information processing systems* (Vol. 19, pp. 1553–1560). Cambridge: MIT Press.