# Fast rates by transferring from auxiliary hypotheses


## Abstract

In this work we consider the learning setting where, in addition to the training set, the learner receives a collection of auxiliary hypotheses originating from other tasks. We focus on a broad class of ERM-based linear algorithms that can be instantiated with any non-negative smooth loss function and any strongly convex regularizer. We establish generalization and excess risk bounds, showing that, if the algorithm is fed with a good combination of source hypotheses, generalization happens at the fast rate \(\mathcal {O}(1/m)\) instead of the usual \(\mathcal {O}(1/\sqrt{m})\). On the other hand, if the source hypotheses combination is a misfit for the target task, we recover the usual learning rate. As a byproduct of our study, we also prove a new bound on the Rademacher complexity of the smooth loss class under weaker assumptions compared to previous works.

### Keywords

Fast-rate generalization bounds, Transfer learning, Domain adaptation, Rademacher complexity, Smooth loss functions, Strongly-convex regularizers

## 1 Introduction

In the standard supervised machine learning setting the learner receives a set of labeled examples, known as the training set. However, very often we have additional information at hand that could be beneficial to the learning process. One such example is the use of unlabeled data drawn from the marginal distribution, which gives rise to the semi-supervised learning setting (Chapelle et al. 2006). Another example is when the training data comes from a related problem, as in multi-task learning (Caruana 1997), domain adaptation (Ben-David et al. 2010; Mansour et al. 2009), and transfer learning (Pan and Yang 2010; Taylor and Stone 2009). Among others, there is the use of structural information, such as taxonomy, different views on the same data (Blum and Mitchell 1998), or even a sort of privileged information (Vapnik and Vashist 2009; Sharmanska et al. 2013). In recent years all these directions have received considerable empirical and theoretical attention.

In this work we focus on a less theoretically studied direction in the use of supplementary information—learning with *auxiliary hypotheses*, that is classifiers or regressors originating from other tasks. In particular, in addition to the training set we assume that the learner is supplied with a collection of hypotheses and their predictions on the training set itself. The goal of the learner is to figure out which hypotheses are helpful and use them to improve the prediction performance of the trained classifier. We will call these auxiliary hypotheses the *source* hypotheses and we will say that helpful ones accelerate the learning on the *target* task. We focus on the linear setting, that is, we train a linear^{1} classifier and the source hypotheses are used additively in the prediction process, weighted by arbitrary weights. This generalizes the setting in which the outputs of the source hypotheses are concatenated with the feature vector, a widely used heuristic (Bergamo and Torresani 2014; Li et al. 2010; Tommasi et al. 2014).

The scenario described above is related to the Transfer Learning (TL) and Domain Adaptation (DA) ones, or learning effectively from a possibly small amount of data by reusing prior knowledge (Thrun and Pratt 1998; Pan and Yang 2010; Taylor and Stone 2009; Ben-David et al. 2010). However, transferring from hypotheses offers an advantage compared to the TL and DA frameworks, where one requires access to the data of the *source* domain. For example, in DA (Ben-David et al. 2010), one employs large unlabeled samples to estimate the relatedness of source and target domains to perform the adaptation. Even if unlabeled data are abundant, the estimation of adaptation parameters can be computationally prohibitive. This is the case, for example, when a large number of domains is involved or when one acquires new domains incrementally.

A recently proposed setting, closer to the one we consider, is Hypothesis Transfer Learning (HTL) (Kuzborskij and Orabona 2013; Ben-David and Urner 2013), where the practical limitations of TL and DA are alleviated through indirect access to the *source domain* by means of a *source hypothesis*. Also, in the HTL setting there are no restrictions on how the source hypotheses can be used to boost the performance on the target task.

Although the setting considered in this paper has already been extensively exploited empirically (Yang et al. 2007; Orabona et al. 2009; Tommasi et al. 2010; Luo et al. 2011; Kuzborskij et al. 2013), a first theoretical treatment of it was given by Kuzborskij and Orabona (2013), where we analyzed a linear HTL algorithm that solves a regularized least-squares problem with a single fixed, unweighted, source hypothesis. We proved a polynomial generalization bound that depends on the performance of the fixed source hypothesis on the target task.

### 1.1 Our contributions

We extend the formulation in Kuzborskij and Orabona (2013), with a general regularized Empirical Risk Minimization (ERM) problem with respect to any non-negative smooth loss function, not necessarily convex, and any strongly convex regularizer. We prove high-probability generalization bounds that exhibit *fast rate*, that is \(\mathcal {O}(1/m)\), of convergence whenever any *weighted combination* of multiple source hypotheses performs well on the target task. In addition, we show that, if the combination is perfect, the error on the training set becomes deterministically equal to the generalization error. Furthermore, we analyze the excess risk of our formulation, and conclude that a good source hypothesis also speeds up the convergence to the performance of the best-in-the-class. As a byproduct of our study, we prove an upper bound on the Rademacher complexity of a smooth loss class that provides extra information compared to that of Lipschitz loss classes. Our analysis, which might be of independent interest, is an alternative to the analysis of Srebro et al. (2010) and it holds under much weaker assumptions.

The rest of the paper is organized as follows. In the next section we briefly review previous work. After introducing the definitions in Sect. 3, we formally state our formulation in Sect. 4 and present the main results right after, in Sect. 5. In Sect. 5.1 we discuss the implications and compare them to the body of literature in learning with fast rates and transfer learning. Next, in Sect. 6, we present the proofs of our main results. Section 7 concludes the paper.

## 2 Related work

Kuzborskij and Orabona (2013) showed that the generalization ability of the regularized least-squares HTL algorithm improves if the supplied *source* hypothesis performs well on the target task. More specifically, we proposed a key criterion, *the risk of the source hypothesis on the target domain*, that captures the relatedness of the source and target domains. Later, Ben-David and Urner (2013) showed a similar bound, but with a different quantity capturing the relatedness between source and target. Instead of considering a general source hypothesis, they have confined their analysis to the linear hypothesis class. This allowed them to show that the target hypothesis generalizes better when it is close to the good source hypothesis. From this perspective it is easy to interpret the source hypothesis as an initialization point in the hypothesis class. Naturally, given a starting position that is close to the best in the class, one generalizes well.

Prior to these works there were few studies trying to understand the learning with auxiliary hypotheses subject to different conditions. Li and Bilmes (2007) have analyzed a Bayesian approach to HTL. Employing a PAC-Bayes analysis they showed that given a prior on the hypothesis class, the generalization ability of logistic regression improves if the prior is informative on the target task. Mansour et al. (2008) analyzed a setting of *multiple source hypotheses* combination. There, in addition to the source hypotheses, the learner receives unlabeled samples drawn from the source distributions, that are used to weight and combine these source hypotheses. They have studied the possibility of learning in such a scenario, however, they did not address the generalization properties of any particular algorithm.

Unlike these works, we focus on the generalization ability of a large family of HTL algorithms that generate the target predictor given a set of multiple source hypotheses. In particular, we analyze Regularized Empirical Risk Minimization with the choice of any non-negative smooth loss and any strongly convex regularizer. Thus our analysis covers a wide range of algorithms, explaining their empirical success. One category of those, prevalent in computer vision (Kienzle and Chellapilla 2006; Yang et al. 2007; Tommasi et al. 2010; Aytar and Zisserman 2011; Kuzborskij et al. 2013; Tommasi et al. 2014), employs the principle of biased regularization (Schölkopf et al. 2001). For example, instead of penalizing large weights by introducing the term \(\Vert \mathbf {w}\Vert ^2\) into the objective function, one enforces them to be close to some “prior” model, that is \(\Vert \mathbf {w}- \mathbf {w}^{\text {prior}}\Vert ^2\). This principle also found its applications in other fields, such as NLP (Daumé III 2007; Daumé III et al. 2010), and electromyography classification (Orabona et al. 2009; Tommasi et al. 2013). Many empirical works have also investigated the use of the source hypotheses in a “black box” sense, sometimes not even posing the problem as transfer learning (Duan et al. 2009; Li et al. 2010; Luo et al. 2011; Bergamo and Torresani 2014), and recently in conjunction with deep neural networks (Oquab et al. 2014).

In the literature there are several other machine learning directions conceptually similar to the one we consider in this work. Arguably, the best known is the Domain Adaptation (DA) problem. The standard machine learning assumption is that the training and the testing sets are sampled from the same probability distribution. In such a case, we expect that a hypothesis generated by the learner from that training set will lead to sensible predictions on the testing set. The difficulty arises when training and testing distributions differ, that is, we have a training set sampled from the *source domain* and a testing set from the *target domain*. Clearly, the hypothesis generated from the source domain can perform arbitrarily badly on the target one. The paradigm of DA, which addresses this issue, has received a lot of attention in recent years (Ben-David et al. 2010; Mansour et al. 2009). Although this framework is different from the one we study in this work, we identify similarities and compare our findings with the theory of learning from different domains in Sect. 5.2.

## 3 Definitions

In this section we introduce the definitions used in the rest of the paper.

We denote random variables by capital letters. The expected value of a random variable distributed according to a probability distribution \(\mathcal {D}\) is denoted by \({{\mathrm{\mathbb {E}}}}_{X \sim \mathcal {D}}[X]\) and the variance is denoted by \(\mathrm {Var}_{X \sim \mathcal {D}}[X]\). The small and capital bold letters will stand respectively for the vectors and matrices, e.g. \(\mathbf {x}= [x_1, \ldots , x_d]^{\top }\) and \(\mathbf {A}\in \mathbb {R}^{d_1 \times d_2 }~\).

Denoting by \(\mathcal {X}\) and \(\mathcal {Y}\) respectively the input and output space of the learning problem, the training set is \(S=\{(\mathbf {x}_i,y_i)\}_{i=1}^m\), drawn i.i.d. from the probability distribution \(\mathcal {D}\) defined over \(\mathcal {X}\times \mathcal {Y}\). Without the loss of generality we will have \(\mathcal {X}= \{\mathbf {x}: \Vert \mathbf {x}\Vert \le 1\}\) and we will focus on the problems where \(\mathcal {Y}= [-C, C]\).

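We measure the accuracy of a hypothesis *h* by a non-negative loss function \(\ell (h(\mathbf {x}), y)\), the cost incurred when predicting \(h(\mathbf {x})\) instead of *y*. The *risk* of a hypothesis *h*, with respect to a probability distribution \(\mathcal {D}\), and the *empirical risk* measured on the sample *S* are then defined as

\[ R(h) := \mathbb {E}_{(\mathbf {x}, y) \sim \mathcal {D}}\left[ \ell (h(\mathbf {x}), y) \right] , \qquad \hat{R}_S(h) := \frac{1}{m} \sum _{i=1}^m \ell (h(\mathbf {x}_i), y_i). \]

We assume that *S* is sampled from the *target* domain, unless stated otherwise. We capture the smoothness of the loss function via the following definition.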

### 3.1 *H*-smooth loss function

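We say that a differentiable loss function \(\ell : \mathcal {Y}\times \mathcal {Y}\mapsto \mathbb {R}_+\) is *H*-*smooth* iff, for all \(z_1, z_2, y\),

\[ |\ell '(z_1, y) - \ell '(z_2, y)| \le H |z_1 - z_2|, \]

where the derivative is taken with respect to the first argument.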

### 3.2 Strongly convex function

*S* and its expectation are defined as

## 4 Transferring from auxiliary hypotheses

We study transfer from *source hypotheses* \(\{h^{\text {src}}_i : \mathcal {X}\mapsto \mathcal {Y}\}_{i=1}^n\) within the framework of Regularized Empirical Risk Minimization (ERM). These problems typically involve a criterion for source hypothesis selection and combination with the goal of increasing performance on the *target task* (Yang et al. 2007; Tommasi et al. 2014; Kuzborskij et al. 2015). Indeed, some source hypotheses might come from tasks similar to the target task and the goal of an algorithm is to select only the relevant ones. In this work we will consider source combinations of the form \(h^{\text {src}}_{\varvec{\beta }}(\mathbf {x}) = \sum _{i=1}^n \beta _i h^{\text {src}}_i(\mathbf {x})\).

### 4.1 Regularized ERM for transferring from auxiliary hypotheses

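Let \(\ell \) be an *H*-smooth loss function and let \(\varOmega : \mathcal {H}\mapsto \mathbb {R}_+\) be a \(\sigma \)-strongly convex function w.r.t. a norm \(\Vert \cdot \Vert \). Given the target training set \(S = \{(\mathbf {x}_i, y_i)\}_{i=1}^m\), \(\lambda \in \mathbb {R}_+\), source hypotheses \(\{h^{\text {src}}_i\}_{i=1}^n\), and parameters \(\varvec{\beta }\) obeying \(\varOmega (\varvec{\beta }) \le \rho \), the algorithm generates the *target hypothesis* \(h_{\hat{\mathbf {w}}, \varvec{\beta }}(\mathbf {x}) = \left\langle \hat{\mathbf {w}}, \mathbf {x} \right\rangle + h^{\text {src}}_{\varvec{\beta }}(\mathbf {x})\), such that

\[ \hat{\mathbf {w}} = \mathop {\mathrm {arg\,min}}_{\mathbf {w}\in \mathcal {H}} \left\{ \frac{1}{m} \sum _{i=1}^m \ell \left( \left\langle \mathbf {w}, \mathbf {x}_i \right\rangle + h^{\text {src}}_{\varvec{\beta }}(\mathbf {x}_i), y_i \right) + \lambda \varOmega (\mathbf {w}) \right\} . \qquad (1) \]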

One special example covered by our analysis, commonly applied in transfer learning, is the *biased regularization* (Schölkopf et al. 2001). Consider the following least-squares based algorithm.

### 4.2 Least-squares with biased regularization

### Claim

Least-Squares with Biased Regularization is a special case of the Regularized ERM in (1).

### Proof

Albeit practically appealing, the formulation (3) is limited by the fact that the source hypothesis must be a linear predictor living in the same space as the target predictor. Instead, the formulation in (1) naturally generalizes the biased regularization formulation, allowing the source hypotheses to be treated as “black box” predictors.
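The relation between the two formulations can be illustrated numerically. The sketch below is not the paper's algorithm as stated: it assumes the square loss, the regularizer \(\Vert \mathbf {w}\Vert ^2\), and plain gradient descent, and it assumes that biased regularization minimizes \(\frac{1}{m}\sum _i (\langle \mathbf {u}, \mathbf {x}_i \rangle - y_i)^2 + \lambda \Vert \mathbf {u}- \mathbf {w}^{\text {prior}}\Vert ^2\). The function names `fit_residual` and `fit_biased` are hypothetical. Under these assumptions, a linear source \(\mathbf {w}^{\text {prior}}\) used as a black-box predictor gives \(\hat{\mathbf {u}} = \hat{\mathbf {w}} + \mathbf {w}^{\text {prior}}\), and a source that predicts every training label exactly leaves \(\hat{\mathbf {w}} = \mathbf {0}\).

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def fit_residual(xs, ys, src_preds, lam, lr=0.1, steps=2000):
    """Gradient descent on (1/m) sum_i (<w, x_i> + s_i - y_i)^2 + lam * ||w||^2,
    where s_i is the (fixed) combined source prediction on x_i."""
    m, d = len(xs), len(xs[0])
    w = [0.0] * d
    for _ in range(steps):
        errs = [dot(w, x) + s - y for x, s, y in zip(xs, src_preds, ys)]
        grad = [2 * lam * w[j] + (2 / m) * sum(e * x[j] for e, x in zip(errs, xs))
                for j in range(d)]
        w = [wj - lr * g for wj, g in zip(w, grad)]
    return w


def fit_biased(xs, ys, w_prior, lam, lr=0.1, steps=2000):
    """Gradient descent on (1/m) sum_i (<u, x_i> - y_i)^2 + lam * ||u - w_prior||^2
    (assumed form of least-squares with biased regularization)."""
    m, d = len(xs), len(xs[0])
    u = list(w_prior)
    for _ in range(steps):
        errs = [dot(u, x) - y for x, y in zip(xs, ys)]
        grad = [2 * lam * (u[j] - w_prior[j]) + (2 / m) * sum(e * x[j] for e, x in zip(errs, xs))
                for j in range(d)]
        u = [uj - lr * g for uj, g in zip(u, grad)]
    return u


# Toy data with ||x|| <= 1 (illustrative values only).
xs = [[0.6, 0.0], [0.0, 0.8], [-0.6, 0.0], [0.3, -0.4]]
ys = [1.0, -0.5, -0.8, 0.2]
w_prior = [1.0, -0.5]

# A linear source used as a black box: only its predictions are needed.
src_preds = [dot(w_prior, x) for x in xs]
w_hat = fit_residual(xs, ys, src_preds, lam=0.1)
u_hat = fit_biased(xs, ys, w_prior, lam=0.1)
```

Note that `fit_residual` never inspects how `src_preds` were produced, which is the sense in which the source is a black box in this sketch.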

## 5 Main results

In this section, we present the main results of this work: generalization and excess risk bounds for the Regularized ERM. In the next section we discuss in detail the implications of these results, while we defer the proofs to the subsequent sections.

The first bound demonstrates the utility of the perfect combination of source hypotheses, while the second lets us observe the dependency on the arbitrary combination. In particular, the first bound makes explicit the intuition that, given a perfect source hypothesis, learning is not required. In other words, when \(R^{\text {src}}=0\) we have that the empirical risk becomes equal to the risk with probability one.

### Theorem 1

Let \(h_{\hat{\mathbf {w}}, \varvec{\beta }}\) be generated by Regularized ERM, given the *m*-sized training set *S* sampled i.i.d. from the target domain, source hypotheses \(\{h^{\text {src}}_i : \Vert h^{\text {src}}_i\Vert _\infty \le 1 \}_{i=1}^n\), any source weights \(\varvec{\beta }\) obeying \(\varOmega (\varvec{\beta }) \le \rho \), and \(\lambda \in \mathbb {R}_+\). Assume that \(\ell (h_{\hat{\mathbf {w}}, \varvec{\beta }}(\mathbf {x}), y) \le M\) for any \((\mathbf {x}, y)\) and any training set. Then, denoting \(\kappa = \frac{H}{\sigma }\) and assuming that \(\lambda \le \kappa \), we have with probability at least \(1 - e^{-\eta }, \ \forall \eta \ge 0\)

Now we focus on the consistency of the HTL. Specifically, we show an upper bound on the excess risk of the Regularized ERM, which depends on \(R^{\text {src}}\), that is the risk of the combined source hypothesis \(h^{\text {src}}_{\varvec{\beta }}\) on the target domain. We observe that for a small \(R^{\text {src}}\), the excess risk shrinks at a fast rate of \(\mathcal {O}(1/m)\). In other words, good prior knowledge guarantees not only good generalization, but also fast recovery of the performance of the best hypothesis in the class.

This bound is similar in spirit to the results on localized complexities, as in the works of Bartlett et al. (2005) and Srebro et al. (2010). However, we focus on the linear HTL scenario rather than a generic learning setting. Later, in Sect. 5.1, we compare our bounds to these works and show that our analysis achieves superior results.

### Theorem 2

Let \(h_{\hat{\mathbf {w}}, \varvec{\beta }}\) be generated by Regularized ERM, given the *m*-sized training set *S* sampled i.i.d. from the target domain, source hypotheses \(\{h^{\text {src}}_i : \Vert h^{\text {src}}_i\Vert _\infty \le 1\}_{i=1}^n\), any source weights \(\varvec{\beta }\) obeying \(\varOmega (\varvec{\beta }) \le \rho \), and \(\lambda \in \mathbb {R}_+\). Then, denoting \(\kappa = \frac{H}{\sigma }\), assuming that \(\lambda \le \kappa \le 1\), and setting the regularization parameter

### 5.1 Implications

We start by discussing the effect of the source hypothesis combination on the generalization ability. Intuitively, a good source hypothesis combination should facilitate transfer learning, while a reasonable algorithm must not fail if we provide it with a bad one. That said, a natural question to ask here is: what makes a source hypothesis good or bad? As in previous works in transfer learning and domain adaptation, we capture this notion via a quantity that has a two-fold interpretation: (1) the performance of the source hypothesis combination on the target domain; (2) the relatedness of source and target domains. In the theorems presented in the previous section we denoted it by \(R^{\text {src}}\), that is the risk of the source hypothesis combination on the target domain. In this section we will consider various regimes of interest with respect to \(R^{\text {src}}\).

#### 5.1.1 When the source is a bad fit

First consider the case when the source hypothesis combination \(h^{\text {src}}_{\varvec{\beta }}\) is useless for the purpose of transfer learning, for example, \(h^{\text {src}}_{\varvec{\beta }}(\mathbf {x}) = 0\) for all \(\mathbf {x}\). This corresponds to learning with no auxiliary information. Then we can assume that \(R^{\text {src}}\le M\), and from Theorem 1 we obtain \( R(h_{\hat{\mathbf {w}}}) - \hat{R}_S(h_{\hat{\mathbf {w}}}) \le \mathcal {O}\left( 1/ (\sqrt{m} \lambda ) \right) \). This rate matches the one in the analysis of regularized least-squares (Vito et al. 2005; Bousquet and Elisseeff 2002), which is a special case of the smooth loss function that the Regularized ERM employs. On the other hand, Srebro et al. (2010) showed a better worst-case rate \(\mathcal {O}(1/\sqrt{m \lambda })\). However, their framework builds upon a worst-case Rademacher complexity, which does not involve the expectation over the sample and does not lead to the dependency on \(R^{\text {src}}\) that we have obtained in Theorem 1. We will discuss this problem in detail later.

#### 5.1.2 When the source is a good fit

Here we would like to consider the behavior of the algorithm in the finite-sample and asymptotic scenarios. We first look at the regime of small *m*, in particular \(m = \mathcal {O}(1/R^{\text {src}})\). In this case, the fast rate term will dominate the bound, and we obtain the convergence rate of \(\mathcal {O}( \sqrt{\rho } / (m \sqrt{\lambda }) )\). In other words, we can expect faster convergence when *m* is small, where “small” depends on \(R^{\text {src}}\), the quality of the combined source hypotheses. Now consider the asymptotic behavior of the algorithm, particularly when *m* goes to infinity. In such a case, the algorithm exhibits a rate of \(\mathcal {O}\left( R^{\text {src}}/ (\sqrt{m} \lambda ) + \sqrt{R^{\text {src}}\rho / (m \lambda )}\right) \), so \(R^{\text {src}}\) controls the constant factor of the rate. Hence, the quantity \(R^{\text {src}}\) governs both the transient regime for small *m* and the asymptotic behavior of the algorithm, predicting faster convergence in both regimes when it is small.
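The transition between the two regimes can be sketched numerically. The snippet below only evaluates the order-of-magnitude expressions quoted above, reading the slow rate as \(R^{\text {src}}/(\sqrt{m}\lambda ) + \sqrt{R^{\text {src}}\rho /(m\lambda )}\) and setting all hidden constants to one, which is an assumption made purely for illustration; the function names are hypothetical.

```python
import math


def fast_term(m, lam, rho):
    """Order of the fast-rate term: sqrt(rho) / (m * sqrt(lam))."""
    return math.sqrt(rho) / (m * math.sqrt(lam))


def slow_term(m, r_src, lam, rho):
    """Order of the slow-rate terms: R_src / (sqrt(m) * lam) + sqrt(R_src * rho / (m * lam))."""
    return r_src / (math.sqrt(m) * lam) + math.sqrt(r_src * rho / (m * lam))


def dominant_regime(m, r_src, lam=1.0, rho=1.0):
    """Return which of the two terms is larger for this sample size."""
    if fast_term(m, lam, rho) >= slow_term(m, r_src, lam, rho):
        return "fast"
    return "slow"


if __name__ == "__main__":
    # With R_src = 1e-4 the crossover is expected around m ~ 1 / R_src.
    for m in (10, 100, 10_000, 1_000_000):
        print(m, dominant_regime(m, r_src=1e-4))
```

With unit constants the two terms balance near \(m \approx 1/R^{\text {src}}\), in line with the regime \(m = \mathcal {O}(1/R^{\text {src}})\) discussed above.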

#### 5.1.3 When the source is a perfect fit

It is conceivable that the source hypothesis exploited is the perfect one, that is \(R^{\text {src}}= 0\). In other words, the source hypothesis combination is a perfect predictor for the target domain. Theorem 1 implies that \(R(h_{\hat{\mathbf {w}}, \varvec{\beta }}) = \hat{R}_S(h_{\hat{\mathbf {w}}, \varvec{\beta }})\) with probability one. We note that for many practically used smooth losses, such as the square loss, this setting is only realistic if the source and target domains match and the problem is noise-free. However, we can observe \(R^{\text {src}}= 0\), for example, when the squared hinge loss, \(\ell (z,y) = \max \{0, 1 - zy\}^2\), is used and all target domain examples are classified correctly, with margin at least one, by the source hypothesis combination, a case that is not unthinkable for related domains.
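A small sketch of the squared hinge case: the loss vanishes exactly when \(zy \ge 1\), so a source combination whose scores have margin at least one on every example has zero empirical risk, while correct signs with smaller margins do not. The scores and labels below are made-up illustrative values.

```python
def squared_hinge(z, y):
    """Squared hinge loss max{0, 1 - z*y}^2 for a score z and a label y in {-1, +1}."""
    return max(0.0, 1.0 - z * y) ** 2


def empirical_risk(scores, labels):
    """Average squared hinge loss of the given scores on the labeled sample."""
    return sum(squared_hinge(z, y) for z, y in zip(scores, labels)) / len(labels)


labels = [+1, -1, +1, -1]
confident_scores = [1.5, -2.0, 1.0, -1.2]  # every margin z*y is at least 1
hesitant_scores = [0.3, -0.4, 0.2, -0.1]   # correct signs, but margins below 1

risk_confident = empirical_risk(confident_scores, labels)
risk_hesitant = empirical_risk(hesitant_scores, labels)
```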

#### 5.1.4 Fast rates

There are a number of works in the literature investigating a rate of convergence faster than \(1/\sqrt{m}\) subject to different conditions. In particular, the localized Rademacher complexity bounds of Bartlett et al. (2005) and Bousquet (2002) can be used to obtain results similar to the second inequality of Theorem 1. Indeed, Theorem 4 shows a bound which is very similar to the localized ones, albeit with two differences. First, the r.h.s. of the first inequality in Theorem 4 vanishes when the loss class has zero variance. Though intuitively trivial, this allows us to prove a considerable result in the theory of transfer learning, as it quantifies the intuition that no learning is necessary if the source has perfect performance on the target task. Second, by applying the standard localized Rademacher complexity bounds of Bousquet (2002), and assuming the use of a Lipschitz loss function, we do not achieve a fast rate of convergence, as can be seen from Theorem 8, shown in the ‘Appendix’. We suspect that assuming the smoothness of the loss function is crucial to prove fast rates in our formulation.

Fast rates for ERM with the smooth loss have been thoroughly analyzed by Srebro et al. (2010). Yet, the analysis of our HTL algorithm within their framework would yield a bound that is inferior to ours in two respects. The first concerns the scenario when the combined source hypothesis is perfect, that is \(R^{\text {src}}= 0\). The generalization bound of Srebro et al. (2010) does not offer a way to show that the empirical risk converges to the risk with probability one; instead one can only hope to get a fast rate of convergence. The second problem is that such a bound would depend on the empirical performance of the combined source hypothesis. As we have noted before, the quantity \(R^{\text {src}}\) is essential because it captures the degree of relatedness between two domains. In their bounds, one cannot obtain this relationship through the Rademacher complexity term as we did in our analysis. The reason for this is the stronger notion of Rademacher complexity that is employed by that framework, involving a supremum over the sample instead of an expectation. The expectation over the sample of the target distribution is crucial here, because it allows us to quantify how well the source domain is aligned with the target domain, through the source hypothesis acting as a link. However, one can attempt to obtain the bound on the empirical risk in terms of \(R^{\text {src}}\). We prove such a bound in the ‘Appendix’, Theorem 6, and conclude that if one has a good source hypothesis or even a perfect one, the rate is \(\mathcal {O}(1/\sqrt[4]{m^3})\), which is worse than ours.

### 5.2 Comparison to theories of domain adaptation and transfer learning

*adapt*, the source training set. To answer this question, the DA literature introduces the notion of domain relatedness, which quantifies the dissimilarities between the marginal distributions of corresponding domains. Practically, in some cases the domain relatedness can be estimated through a large set of unlabeled samples drawn from both source and target domains. Theories of DA (Ben-David et al. 2010; Mansour et al. 2009; Ben-David and Urner 2012; Mansour et al. 2008; Cortes and Mohri 2014) have proposed a number of such domain relatedness criteria. Perhaps the most well known are the \(d_{\mathcal {H}\varDelta \mathcal {H}}\)-divergence (Ben-David et al. 2010) and its more general counterpart, the Discrepancy Distance (Mansour et al. 2009). Typically, this divergence is made explicit in the generalization bound along with other terms controlling the generalization on the target domain. Let \(R_{\mathcal {D}^{\text {trg}}}(h)\) and \(R_{\mathcal {D}^{\text {src}}}(h)\) denote the risks of the hypothesis *h*, measured w.r.t. the target and source distributions. Then a well-known result of Ben-David et al. (2010) suggests that for all \(h \in \mathcal {H}\)

L2-regularized least squares and a generalization bound involving the leave-one-out risk instead of the empirical one. The following result, obtained through an algorithmic stability argument (Bousquet and Elisseeff 2002), holds with probability at least \(1 - \delta \)

### 5.3 Combining source hypotheses in practice

Many HTL-like algorithms can be captured through the above by choosing among different loss functions and regularizers \(\varOmega \). The simplest case is just a concatenation of the source hypotheses predictions with the original feature vector. However, by choosing different regularizers and their parameters, we can treat the source hypotheses in a different way from the original features. For example, one might enforce sparsity over the source hypotheses, while using the usual L2 regularizer on the target solution \(\hat{\mathbf {w}}\).
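As one hedged illustration of treating the two groups of parameters differently, the sketch below learns \(\mathbf {w}\) and \(\varvec{\beta }\) jointly with the square loss, an L2 penalty on \(\mathbf {w}\) and an L1 penalty on \(\varvec{\beta }\), using proximal gradient descent (ISTA). The objective, the step size, the toy data and the name `fit_sparse_sources` are all assumptions made for this example rather than the procedure analyzed in the paper.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v| (shrinks v towards zero by t)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_sparse_sources(xs, ys, src_preds, lam=0.1, mu=0.1, lr=0.1, steps=500):
    """Proximal gradient descent on
        (1/m) sum_i (<w, x_i> + <beta, s_i> - y_i)^2 + lam * ||w||_2^2 + mu * ||beta||_1,
    where s_i collects the predictions of the n source hypotheses on x_i."""
    m, d, n = len(xs), len(xs[0]), len(src_preds[0])
    w, beta = [0.0] * d, [0.0] * n
    for _ in range(steps):
        errs = [sum(wj * xj for wj, xj in zip(w, x))
                + sum(bk * sk for bk, sk in zip(beta, s)) - y
                for x, s, y in zip(xs, src_preds, ys)]
        grad_w = [2 * lam * w[j] + (2 / m) * sum(e * x[j] for e, x in zip(errs, xs))
                  for j in range(d)]
        grad_b = [(2 / m) * sum(e * s[k] for e, s in zip(errs, src_preds))
                  for k in range(n)]
        w = [wj - lr * g for wj, g in zip(w, grad_w)]  # plain gradient step (smooth L2 part)
        beta = [soft_threshold(bk - lr * g, lr * mu)    # gradient step followed by L1 prox
                for bk, g in zip(beta, grad_b)]
    return w, beta


# Toy problem: the first source predicts the labels exactly, the second is unrelated.
xs = [[0.5], [-0.5], [0.5], [-0.5]]
src_preds = [[1, 1], [1, -1], [-1, -1], [-1, 1]]
ys = [1, 1, -1, -1]
w_hat, beta_hat = fit_sparse_sources(xs, ys, src_preds)
```

On this toy problem the L1 penalty keeps the weight of the unrelated source at exactly zero, while the weight of the useful source is shrunk slightly below one.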

## 6 Technical results and proofs

In this section we present general technical results that are used to prove our theorems.

First, we present the Rademacher complexity generalization bound in Theorem 4, which slightly differs from the usual ones. The difference comes in the assumption that the variance of the loss is uniformly bounded over the hypothesis class. This will allow us to state a generalization bound that obeys the fast empirical risk convergence rate subject to the small class complexity. Second, we will also show a generalization bound with a confidence term that vanishes if the complexity of the class is exactly zero.

Next, we focus on the Rademacher complexity of the smooth loss function class. We prove a bound on the empirical Rademacher complexity of a hypothesis class, Lemma 3, that depends on the point-wise bounds on the loss function. This novel bound might be of independent interest.

Finally, we employ this result to analyze the effect of the source hypotheses on the complexity of the target hypothesis class in Theorem 5.

### 6.1 Fast rate generalization bound

The proof of fast-rate and vanishing-confidence-term bounds, Theorem 4, stems from the functional generalization of Bennett’s inequality which is due to Bousquet (2002, Theorem 2.11) and we report it here for completeness.

### Theorem 3

The following technical lemma will be used to invert the right hand side of (10).

### Lemma 1

Let \(a,b>0\) such that \(b = (1+a) \log (1+a) - a\). Then \(a\le \frac{3 b}{2 \log (\sqrt{b}+1)}\).
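Before turning to the proof, the inequality can be sanity-checked numerically. The sketch below evaluates both sides on a logarithmic grid of values of *a*; the grid and the helper name `lemma1_gap` are arbitrary choices, and such a check is of course no substitute for the proof.

```python
import math


def lemma1_gap(a):
    """Return bound - a, where b = (1 + a) log(1 + a) - a and
    bound = 3 b / (2 log(sqrt(b) + 1)); Lemma 1 claims the gap is non-negative."""
    b = (1.0 + a) * math.log1p(a) - a
    bound = 3.0 * b / (2.0 * math.log1p(math.sqrt(b)))
    return bound - a


# Logarithmic grid of a-values from 1e-3 to 1e6.
grid = [10 ** (k / 4) for k in range(-12, 25)]
smallest_relative_gap = min(lemma1_gap(a) / a for a in grid)
```

On this grid the relative gap is smallest for small *a*, where a second-order expansion gives \(b \approx a^2/2\) and the bound behaves like \(\frac{3\sqrt{2}}{4} a\).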

### Proof

*a*, we need an upper bound to the Lambert function. We use Theorem 2.3 in Hoorfar and Hassani (2008), that says that

The following lemma is a standard tool (Mohri et al. 2012, (3.8)–(3.13); Bartlett and Mendelson 2003).

### Lemma 2

Now we are ready to present the proof of Theorem 4.

### Theorem 4

Let the training set *S* of size *m* be sampled i.i.d. from the probability distribution over \(\mathcal {X}\times \mathcal {Y}\). Also for any \(r \ge 0\), define the loss class with respect to the hypothesis class \(\mathcal {H}\) as,

*S* of size *m*, with probability at least \(1 - e^{-\eta }, \ \forall \eta \ge 0\)

### Proof

*t*, and provide an upper-bound on *v*. For the first part, recall that \(u(y) = (1+y) \log (1+y) - y\). To give an upper-bound on *t*, we apply Lemma 1 with \(a=\frac{t}{v}\), and \(b=\frac{1}{v}\eta \). This leads to the inequalities

*v*. We first show that the variance of the centered loss function, \(\sigma ^2\), is uniformly bounded by the Rademacher complexity. From the definition of variance we have

*v* by applying Lemma 2,

*v*,

### 6.2 Rademacher complexity of smooth loss class

In this section we study the Rademacher complexity of the hypothesis class populated by functions of the form (1), where the parameters \(\mathbf {w}\) and \(\varvec{\beta }\) are chosen by an algorithm with a strongly convex regularizer. For this purpose we employ the results of Kakade et al. (2008, 2012), who studied strongly convex regularizers in a more general setting. Furthermore, we will focus on the use of smooth loss functions as done by Srebro et al. (2010).
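To make the quantity concrete, the sketch below computes the empirical Rademacher complexity of a simple linear class exactly, by enumerating all sign vectors on a tiny sample. It assumes the normalisation \(\mathbb {E}_{\varvec{\varepsilon }}\left[ \sup _{\mathbf {w}} \frac{1}{m} \sum _i \varepsilon _i \langle \mathbf {w}, \mathbf {x}_i \rangle \right] \) and the class \(\{\mathbf {x}\mapsto \langle \mathbf {w}, \mathbf {x} \rangle : \Vert \mathbf {w}\Vert _2 \le B\}\), which corresponds to a level set of the strongly convex regularizer \(\frac{1}{2}\Vert \mathbf {w}\Vert _2^2\); both choices, and the function names, are assumptions of this example. The result is compared with the bound \(B \sqrt{\sum _i \Vert \mathbf {x}_i\Vert ^2} / m\) that follows from Jensen's inequality.

```python
import itertools
import math


def empirical_rademacher_linear(xs, radius):
    """Exact empirical Rademacher complexity of {x -> <w, x> : ||w||_2 <= radius},
    normalised as E_eps[ sup_w (1/m) sum_i eps_i <w, x_i> ]."""
    m, d = len(xs), len(xs[0])
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=m):
        # v = sum_i eps_i x_i; the supremum over the ball equals radius * ||v||_2.
        v = [sum(s * x[j] for s, x in zip(signs, xs)) for j in range(d)]
        total += radius * math.sqrt(sum(c * c for c in v)) / m
    return total / 2 ** m


def jensen_bound(xs, radius):
    """Upper bound radius * sqrt(sum_i ||x_i||^2) / m obtained via Jensen's inequality."""
    return radius * math.sqrt(sum(c * c for x in xs for c in x)) / len(xs)


# Eight points inside the unit ball (illustrative values only).
xs = [[0.6, 0.0], [0.0, 0.8], [-0.6, 0.0], [0.3, -0.4],
      [0.5, 0.5], [-0.2, 0.9], [0.1, -0.7], [-0.8, -0.1]]
exact = empirical_rademacher_linear(xs, radius=1.0)
bound = jensen_bound(xs, radius=1.0)
```

The enumeration costs \(2^m\) evaluations, so this is only meant for very small samples; for larger *m* one would average over randomly drawn sign vectors instead.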

The proof of the main result of this section, Theorem 5, depends essentially on the following lemma, which bounds the empirical Rademacher complexity of an *H*-smooth loss class.

### Lemma 3

Let \(\ell \) be an *H*-smooth loss function. Then for some function class \(\mathcal {F}\), let the loss class be

*S* of size *m* and the set

### Proof

*H*-smooth non-negative function \(\phi : \mathbb {R}\mapsto \mathbb {R}_+\) and any \(x,z \in \mathbb {R}\),

*S*, then, by definition,

To prove Theorem 5 we will also use the following lemma in Kakade et al. (2012, Corollary 4).

### Lemma 4

Now we are ready to give the proofs of the Rademacher complexity results.

### Theorem 5

*H*-smooth loss function \(\ell : \mathcal {Y}\times \mathcal {Y}\mapsto \mathbb {R}_+\). Finally, given the set of functions \(\{f_i : \mathcal {X}\mapsto \mathcal {Y}\}_{i=1}^n\) with \(\mathbf {f}(\mathbf {x}) := [f_1(\mathbf {x}), \ldots , f_n(\mathbf {x})]^{\top }\), a combination \(f_{\varvec{\beta }}(\mathbf {x}) = \left\langle \varvec{\beta }, \mathbf {f}(\mathbf {x}) \right\rangle \), a scalar \(\alpha > 0\), and any sample *S* drawn i.i.d. from a distribution over \(\mathcal {X}\times \mathcal {Y}\), define classes

### Proof

*t* gives us

*r*. First we obtain bounds on each \(\tau _i\). We start with the bound on the loss function, exploiting smoothness. Let \(\ell (\left\langle \mathbf {w}, \mathbf {x} \right\rangle + f_{\varvec{\beta }}(\mathbf {x}), y) = \phi (\left\langle \mathbf {w}, \mathbf {x} \right\rangle + f_{\varvec{\beta }}(\mathbf {x}))\), where \(\phi : \mathbb {R}\mapsto \mathbb {R}\) is an *H*-smooth function. From the definition of smoothness (Shalev-Shwartz and Ben-David 2014, (12.5)), we have that for all \(\mathbf {w}\) and \(\mathbf {v}\)

*H*-smooth non-negative function \(\phi \), we have that \(|\phi '(t)| \le \sqrt{4 H \phi (t)}\) (Srebro et al. 2010, Lemma 2.1). Now recall a property of a \(\sigma \)-strongly-convex function *F*, that holds for its minimizer \(\mathbf {v}\) and any \(\mathbf {w}\) (Shalev-Shwartz and Ben-David 2014, Lemma 13.5),

### 6.3 Proofs of main results

### Proof of Theorem 1

To show the statement we will apply Theorem 4. In particular, we will consider any choice of \(\mathbf {w}\) and \(\varvec{\beta }\) within the set induced by a strongly-convex function \(\varOmega \). To apply Theorem 4, we need to upper bound the Rademacher complexity of the loss class \(\mathcal {L}\) and also the quantity \(r = \sup _{f \in \mathcal {L}} {{\mathrm{\mathbb {E}}}}_{(\mathbf {x},y)} [f(\mathbf {x}, y)]\).

*r*

*r* into the statement of Theorem 4, and applying the inequality \(\sqrt{a+b} \le \sqrt{a}+ \frac{b}{2\sqrt{a}}\) to the \(\sqrt{v}\) term, gives the statement. \(\square \)

### Proof of Theorem 2

## 7 Conclusions

In this paper we have formally captured and theoretically analyzed a general family of learning algorithms transferring information from multiple supplied source hypotheses. In particular, our formulation stems from the regularized Empirical Risk Minimization principle with the choice of any non-negative smooth loss function and any strongly convex regularizer. Theoretically we have analyzed the generalization ability and excess risk of this family of HTL algorithms. Our analysis showed that a good source hypothesis combination facilitates faster generalization, specifically in \(\mathcal {O}(1/m)\) instead of the usual \(\mathcal {O}(1/\sqrt{m})\). Furthermore, given a perfect source hypothesis combination, our analysis is consistent with the intuition that learning is not required. As a byproduct of our investigation, we came up with new results in Rademacher complexity analysis of the smooth loss classes, which could be of independent interest.

Our conclusions suggest the key importance of a source hypothesis selection procedure. Indeed, when an algorithm is provided with an enormous pool of source hypotheses, how can it select the relevant ones on the basis of only a few labeled examples? This might sound similar to the feature selection problem under the condition that \(n \gg m\); however, earlier empirical studies by Tommasi et al. (2014) with hundreds of sources did not find much corroboration for this hypothesis when applying \(L_1\) regularization. Thus, it remains unclear whether having a few good sources among hundreds is a reasonable assumption.

## Footnotes

1. Non-linear classifiers can be easily produced with the use of kernels.

### References

- Aytar, Y., & Zisserman, A. (2011). Tabula rasa: Model transfer for object category detection. In *IEEE International Conference on Computer Vision (ICCV)* (pp. 2252–2259). IEEE.
- Bartlett, P. L., & Mendelson, S. (2003). Rademacher and Gaussian complexities: Risk bounds and structural results. *Journal of Machine Learning Research*, *3*, 463–482.
- Bartlett, P. L., Bousquet, O., & Mendelson, S. (2005). Local Rademacher complexities. *Annals of Statistics*, *33*(4), 1497–1537.
- Ben-David, S., & Urner, R. (2012). On the hardness of domain adaptation and the utility of unlabeled target samples. In *Algorithmic learning theory*, *lecture notes in computer science* (Vol. 7568, pp. 139–153). Springer.
- Ben-David, S., & Urner, R. (2013). Domain adaptation as learning with auxiliary information. In *New Directions in Transfer and Multi-Task - Workshop @ Advances in Neural Information Processing Systems*.
- Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Vaughan, J. W. (2010). A theory of learning from different domains. *Machine learning*, *79*(1–2), 151–175.
- Bergamo, A., & Torresani, L. (2014). Classemes and other classifier-based features for efficient object categorization. In *IEEE Transactions on Pattern Analysis and Machine Intelligence* (pp. 99).
- Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In *Conference on Computational learning theory, ACM* (pp. 92–100).
- Bousquet, O. (2002). Concentration inequalities and empirical processes theory applied to the analysis of learning algorithms. PhD thesis, Ecole Polytechnique.
- Bousquet, O., & Elisseeff, A. (2002). Stability and Generalization. *Journal of Machine Learning Research*, *2*, 499–526.
- Caruana, R. (1997). Multitask learning. *Machine Learning*, *28*(1), 41–75.
- Chapelle, O., Schölkopf, B., Zien, A., et al. (2006). *Semi-supervised learning* (Vol. 2). Cambridge: MIT Press.
- Cortes, C., & Mohri, M. (2014). Domain adaptation and sample bias correction theory and algorithm for regression. *Theoretical Computer Science*, *519*, 103–126.
- Daumé III, H. (2007). Frustratingly easy domain adaptation. In *Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics*.
- Daumé III, H., Kumar, A., & Saha, A. (2010). Frustratingly easy semi-supervised domain adaptation. In *Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, Association for Computational Linguistics* (pp. 53–59).
- Duan, L., Tsang, I. W., Xu, D., & Chua, T. (2009). Domain adaptation from multiple sources via auxiliary classifiers. In *International Conference on Machine Learning* (pp. 289–296).
- Hoorfar, A., & Hassani, M. (2008). Inequalities on the Lambert W function and hyperpower function. *Journal of Inequalities in Pure and Applied Mathematics*, *9*(2), 5–9.
- Kakade, S. M., Sridharan, K., & Tewari, A. (2008). On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In *Advances in Neural Information Processing Systems*, *21* (pp. 793–800).
- Kakade, S. M., Shalev-Shwartz, S., & Tewari, A. (2012). Regularization techniques for learning with matrices. *Journal of Machine Learning Research*, *13*, 1865–1890.
- Kienzle, W., & Chellapilla, K. (2006). Personalized handwriting recognition via biased regularization. In *International Conference on Machine Learning* (pp. 457–464).
- Kuzborskij, I., & Orabona, F. (2013). Stability and Hypothesis Transfer Learning. In *International Conference on Machine Learning* (pp. 942–950).
- Kuzborskij, I., Orabona, F., & Caputo, B. (2013). From N to N+1: Multiclass transfer incremental learning. In *Conference on Computer Vision and Pattern Recognition* (pp. 3358–3365).
- Kuzborskij, I., Orabona, F., & Caputo, B. (2015). Transfer Learning through Greedy Subset Selection. In *International Conference on Image Analysis and Processing*.
- Li, L., Su, H., Xing, E. P., & Fei-Fei, L. (2010). Object bank: A high-level image representation for scene classification & semantic feature sparsification. In *Advances in Neural Information Processing Systems*, *23* (pp. 1378–1386).
- Li, X., & Bilmes, J. (2007). A Bayesian divergence prior for classifier adaptation. In *International Conference on Artificial Intelligence and Statistics* (pp. 275–282).
- Luo, J., Tommasi, T., & Caputo, B. (2011). Multiclass transfer learning from unconstrained priors. In *International Conference on Computer Vision* (pp. 1863–1870).
- Mansour, Y., Mohri, M., & Rostamizadeh, A. (2008). Domain adaptation with multiple sources. In *Advances in Neural Information Processing Systems*, *21* (pp. 1041–1048).
- Mansour, Y., Mohri, M., & Rostamizadeh, A. (2009). Domain adaptation: Learning bounds and algorithms. In *The Conference on Learning Theory*.
- Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2012). *Foundations of machine learning*. Cambridge: The MIT Press.
- Oquab, M., Bottou, L., Laptev, I., & Sivic, J. (2014). Learning and transferring mid-level image representations using convolutional neural networks. In *Conference on Computer Vision and Pattern Recognition* (pp. 1717–1724).
- Orabona, F., Castellini, C., Caputo, B., Fiorilla, A., & Sandini, G. (2009). Model Adaptation with Least-Squares SVM for Adaptive Hand Prosthetics. In *IEEE International Conference on Robotics and Automation* (pp. 2897–2903). IEEE.
- Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. *IEEE Transactions on Knowledge and Data Engineering*, *22*(10), 1345–1359.
- Schölkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. In *Conference on Computational learning theory* (pp. 416–426). Springer.
- Shalev-Shwartz, S., & Ben-David, S. (2014). *Understanding machine learning: From theory to algorithms*. Cambridge: Cambridge University Press.
- Sharmanska, V., Quadrianto, N., & Lampert, C. H. (2013). Learning to rank using privileged information. In *IEEE International Conference on Computer Vision (ICCV)* (pp. 825–832). IEEE.
- Srebro, N., Sridharan, K., & Tewari, A. (2010). Smoothness, low noise and fast rates. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, & A. Culotta (Eds.), *Advances in neural information processing systems*, *23* (pp. 2199–2207). Red Hook: Curran Associates, Inc.
- Taylor, M. E., & Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. *Journal of Machine Learning Research*, *10*, 1633–1685.
- Thrun, S., & Pratt, L. (1998). *Learning to learn*. New York: Springer.
- Tommasi, T., Orabona, F., & Caputo, B. (2010). Safety in numbers: Learning categories from few examples with multi model knowledge transfer. In *The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR*, San Francisco, CA, USA, 13–18 June 2010 (pp. 3081–3088).
- Tommasi, T., Orabona, F., Castellini, C., & Caputo, B. (2013). Improving control of dexterous hand prostheses using adaptive learning. *IEEE Transactions on Robotics*, *29*(1), 207–219.
- Tommasi, T., Orabona, F., & Caputo, B. (2014). Learning categories from few examples with multi model knowledge transfer. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, *36*(5), 928–941.
- Vapnik, V., & Vashist, A. (2009). A new learning paradigm: Learning using privileged information. *Neural Networks*, *22*(5), 544–557.
- Vito, E. D., Caponnetto, A., & Rosasco, L. (2005). Model selection for regularized least-squares algorithm in learning theory. *Foundations of Computational Mathematics*, *5*(1), 59–85.
- Yang, J., Yan, R., & Hauptmann, A. (2007). Cross-Domain Video Concept Detection Using Adaptive SVMs. In *Proceedings of the 15th international conference on Multimedia, ACM* (pp. 188–197).