1 Introduction

In real-world tasks, test data usually differs from training data in distribution, features, class categories, etc. In many cases, the model must even be applied in a different domain without sufficient labels; to make full use of the original model when adapting to the target domain, transfer learning has been proposed.

Transfer learning algorithms can be grouped into two broad categories according to whether they use deep networks. The first category is shallow transfer learning, such as TCA [12], GFK [6], SA [4], KMM [8], ITL [15] and LSDT [22]. These algorithms can be further divided into instance-based and subspace-based ones according to what is transferred [13]. In the category of deep transfer learning, discrepancy-based, adversarial-based and reconstruction-based algorithms are the three main approaches [19], among which DAN [10] and RevGrad [5] are classical networks for transfer learning or domain adaptation (Footnote 1).

Although many transfer learning algorithms have been proposed, little research has been devoted to the three key issues in transfer learning, namely when to transfer, how to transfer and what to transfer [13]. In this paper, we treat the three issues as one problem: we need to answer whether tasks can be transferred (when) and, moreover, how to measure the Transferability. The latter implies the methods to transfer (how) and the information that can be transferred (what). Inspired by [3], we propose a novel method, MetaTrans, built on the two aspects of Transferability and Discriminability. Transferability measures the similarity between the source and target domains, and Discriminability measures how discriminative the features extracted by a specific algorithm are. In order to understand the internal mechanism of transfer learning algorithms and explain why they can improve performance substantially, we extract several critical features according to these two dominant factors, which we call Meta Transfer Features.

Inspired by meta-learning methods [21] and the recent work [20], we build a model that maps Meta Transfer Features to the transfer performance improvement ratio using historical transfer learning experiences. Different from [20], we propose a multi-task learning framework to exploit these historical experiences, because experiences from different algorithms vary considerably.

In this work, we make three contributions as follows:

  • We propose a novel method MetaTrans to map Meta Transfer Features to the transfer performance improvement, from both aspects of Transferability and Discriminability.

  • With the built mapping, we provide a detailed analysis of the success of both shallow and deep transfer algorithms.

  • We propose a multi-task learning framework utilizing varying historical transfer experiences from different transfer learning algorithms as much as possible.

2 Related Works

In this section, we introduce related work, including basic notations, theoretical analysis of transfer learning, deep domain adaptation and recent research.

2.1 Notations

In this work, we focus on the homogeneous unsupervised domain adaptation problem. The labeled source domain is denoted by \({\mathcal {D}}_{S}=\{{\mathbf {X}}_{S}, {\mathbf {Y}}_{S}\}\), and similarly, \({\mathcal {D}}_{T}=\{{\mathbf {X}}_{T}\}\) for the unlabeled target domain. In order to evaluate a specific transfer learning algorithm, the real labels of target domain are denoted by \({\mathbf {Y}}_{T}\). We denote by \(h \in {\mathcal {H}}\) the hypothesis (a.k.a. classifier in classification tasks) mapping from sample space \({\mathcal {X}}\) to label space \({\mathcal {Y}}\).

2.2 Theoretical Bound for Transfer Learning

From the classical theoretical result for domain adaptation [1], we have the following generalization bound on the target-domain error of a classifier trained on the source domain:

Theorem 1

Let \({\mathcal {H}}\) be a hypothesis space, and let \(\lambda = \min _{h \in {\mathcal {H}}}(\epsilon _{S}(h) + \epsilon _{T}(h))\) be the ideal joint error of the hypothesis space on the source and target domains; then for any \(h \in {\mathcal {H}}\),

$$\begin{aligned} \epsilon _{T}(h) \le \epsilon _{S}(h) + d_{{\mathcal {H}}}(\mathcal {D}_S, \mathcal {D}_T) + \lambda . \end{aligned}$$
(1)

This bound contains three terms. The first refers to the Discriminability of the features: it becomes smaller as the learned features become more discriminative. The second measures how similar the source and target domains are, the smaller the better, and is referred to as Transferability. The third term \(\lambda \) is the ideal joint error, which is usually assumed to be small.

2.3 Deep Domain Adaptation

Deep domain adaptation contains adversarial-based and discrepancy-based methods. Adversarial domain adaptation frameworks, such as RevGrad [5] and ADDA [18], train a domain discriminator to separate the source and target domains while the feature extractor learns to confuse it, making the learned features domain-invariant, that is, maximizing the Transferability between domains. In addition, the task classifier component maximizes the performance on the source domain using the extracted features, in order to preserve the Discriminability. Similarly, discrepancy-based frameworks, such as DDC [17] and DAN [10], consider both the discrepancy loss (e.g. the MMD loss) between the two domains (Transferability) and the task-specific loss (Discriminability).

2.4 Recent Research

Recently, [3] analyzes the relation between Transferability and Discriminability in adversarial domain adaptation via spectral analysis of feature representations, and proposes a batch spectral penalization algorithm that penalizes the largest singular values to boost feature discriminability. [20] proposes to use transfer learning experiences to automatically infer what and how to transfer in future tasks. [23] first addresses the gap between theories and algorithms, and then proposes new generalization bounds and a novel adversarial domain adaptation framework based on the introduced margin disparity discrepancy.

3 MetaTrans Method

In this section, we introduce the proposed MetaTrans, including Meta Transfer Features and the multi-task learning framework.

3.1 Approximate Transferability

Transferability is reflected by the discrepancy between the two domains, and we can approximate it using different distance metrics. In this paper, we select the proxy \({\mathcal {A}}\)-distance and the MMD distance as two approximations.

Proxy \(\varvec{{\mathcal {A}}}\) Distance. The second term in the generalization bound in Eq. 1 is called the \({\mathcal {H}}\)-divergence [9] between two domains. In order to approximate the \({\mathcal {H}}\)-divergence with finite samples from source and target, the empirical \({\mathcal {H}}\)-divergence is defined as

$$\begin{aligned} d_{{\mathcal {H}}}(D_S, D_T) = 2 \left( 1 - \min _{h\in {\mathcal {H}}}\left[ \frac{1}{n_S}\sum _{{\mathbf {x}}:h({\mathbf {x}})=0}I[{\mathbf {x}}\in D_S] + \frac{1}{n_T}\sum _{{\mathbf {x}}:h({\mathbf {x}})=1}I[{\mathbf {x}}\in D_T] \right] \right) \!\!, \end{aligned}$$
(2)

where \(D_S\) and \(D_T\) are sets sampled from the corresponding marginal distributions, with sizes \(n_S\) and \(n_T\), and \(I[\cdot ]\) is the indicator function.

The empirical \({\mathcal {H}}\)-divergence is also called the proxy \({\mathcal {A}}\) distance. We can train a binary classifier h to discriminate the source and target domains, and its classification error yields an approximation of the proxy \({\mathcal {A}}\) distance,

$$\begin{aligned} d_{{\mathcal {A}}}(D_S, D_T) = 2(1 - 2err(h)), \end{aligned}$$
(3)

where err(h) is the classification error of this domain classifier.
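The following sketch shows how the proxy \({\mathcal {A}}\) distance of Eq. 3 can be estimated in practice; the logistic-regression domain classifier and the 50/50 train/test split are illustrative assumptions, not choices prescribed by the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_a_distance(Xs, Xt):
    # Label source samples as 0 and target samples as 1, then train a domain classifier.
    X = np.vstack([Xs, Xt])
    d = np.concatenate([np.zeros(len(Xs)), np.ones(len(Xt))])
    X_tr, X_te, d_tr, d_te = train_test_split(X, d, test_size=0.5, stratify=d, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, d_tr)
    err = 1.0 - clf.score(X_te, d_te)        # domain classification error err(h)
    return 2.0 * (1.0 - 2.0 * err)           # Eq. (3)
```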

Maximum Mean Discrepancy. Another distance commonly used to measure the difference between two domains is the MMD distance [7], which matches higher-order moments of the domain distributions. The MMD distance is defined as

$$\begin{aligned} d_{mmd} = \left\| E_{{\mathbf {x}}\in {\mathcal {D}}_S}\left[ \phi ({\mathbf {x}}) \right] - E_{{\mathbf {x}}\in {\mathcal {D}}_T}\left[ \phi ({\mathbf {x}}) \right] \right\| _{{\mathcal {H}}}, \end{aligned}$$
(4)

where \(\phi \) is a function mapping samples into the reproducing kernel Hilbert space \(\mathcal {H}\). In order to approximate the MMD distance from finite samples, the empirical MMD distance is defined as

$$\begin{aligned} d_{mmd} = \left\| \frac{1}{n_S}\sum _{i=1}^{n_S}\phi ({\mathbf {x}}_i) - \frac{1}{n_T}\sum _{j=1}^{n_T} \phi ({\mathbf {x}}_j) \right\| _{{\mathcal {H}}}. \end{aligned}$$
(5)

In order to get the empirical MMD distance, a kernel function is needed, and the commonly used kernel is the RBF kernel defined as \(k({\mathbf {x}}, {\mathbf {x}}^\prime )=\exp \left( -\frac{\Vert {\mathbf {x}}- {\mathbf {x}}^\prime \Vert ^2}{\sigma ^2}\right) \). To avoid the trouble of selecting the best kernel bandwidth \(\sigma \), we use multi-kernel MMD (MK-MMD), and the multi-kernel is defined as a linear combination of N RBF kernels with the form \(\mathcal {K} = \sum _{k=1}^N \mathcal {K}_k\).
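As an illustration, a minimal sketch of the empirical MK-MMD of Eq. 5 follows; the median-heuristic base bandwidth and the geometric spread of the N RBF kernels are our own assumptions, not details fixed by the paper.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, euclidean_distances

def mk_mmd(Xs, Xt, n_kernels=5):
    X = np.vstack([Xs, Xt])
    d2 = euclidean_distances(X, squared=True)
    sigma2 = np.median(d2[d2 > 0])                       # median heuristic for the base bandwidth
    gammas = [1.0 / (sigma2 * 2.0 ** k)                  # geometric spread around the base kernel
              for k in range(-(n_kernels // 2), n_kernels // 2 + 1)]
    mmd2 = 0.0
    for g in gammas:                                     # multi-kernel K = sum_k K_k
        Kss = rbf_kernel(Xs, Xs, gamma=g)
        Ktt = rbf_kernel(Xt, Xt, gamma=g)
        Kst = rbf_kernel(Xs, Xt, gamma=g)
        mmd2 += Kss.mean() - 2.0 * Kst.mean() + Ktt.mean()   # squared mean-embedding distance
    return np.sqrt(max(mmd2, 0.0))                       # empirical MMD as in Eq. (5)
```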

3.2 Approximate Discriminability

The Discriminability measures the discriminative ability of feature representations. We propose three approximate features including the empirical source error, the supervised discriminant criterion and the unsupervised discriminant criterion.

Source Domain Error. In the generalization bound for domain adaptation (Eq. 1), the source error is an important factor determining the target generalization error. The empirical source error is defined as

$$\begin{aligned} \epsilon _S(h) = \frac{1}{n_S}\sum _{i=1}^{n_S} l( h({\mathbf {x}}_i), y_i ), \end{aligned}$$
(6)

where \(y_i\) is the real label for the i-th sample and l is the loss function.

Supervised Discriminant Criterion. According to supervised dimension reduction methods (such as LDA), the ratio of the between-class scatter to the inner-class scatter reflects how discriminative the features are.

Suppose there are C classes in the source domain with class mean vectors \(\{\mathbf {\mu }_c\}_{c=1}^C\); then the inner-class scatter is

$$\begin{aligned} d_{inner} = \frac{1}{n_S} \sum _{c=1}^C \sum _{j=1}^{n_c} \left\| \mathbf {x}_{cj} - \mathbf {\mu }_c \right\| _2^2, \end{aligned}$$
(7)

where the c-th class has \(n_c\) samples and \(\mathbf {x}_{cj}\) is the j-th sample of the c-th class. Meanwhile, the between-class scatter is defined as

$$\begin{aligned} d_{between} = \frac{1}{n_S} \sum _{c=1}^C n_c \left\| \mathbf {\mu }_{c} - \mathbf {\mu }_0 \right\| _2^2, \end{aligned}$$
(8)

where \(\mathbf {\mu }_0\) is the mean center of all samples in the source domain. We approximate the source discriminability with the formulation

$$\begin{aligned} c_{sdc} = \frac{d_{between}}{d_{inner} + d_{between}} \end{aligned}$$
(9)

where \(c_{sdc}\) denotes the supervised discriminant criterion.
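A minimal sketch of computing \(c_{sdc}\) from Eqs. 7-9, assuming source features Xs with integer labels ys, could look as follows.

```python
import numpy as np

def sdc(Xs, ys):
    n_S = len(Xs)
    mu0 = Xs.mean(axis=0)                                    # mean center of all source samples
    d_inner, d_between = 0.0, 0.0
    for c in np.unique(ys):
        Xc = Xs[ys == c]
        mu_c = Xc.mean(axis=0)
        d_inner += np.square(Xc - mu_c).sum()                # contribution to Eq. (7)
        d_between += len(Xc) * np.square(mu_c - mu0).sum()   # contribution to Eq. (8)
    d_inner, d_between = d_inner / n_S, d_between / n_S
    return d_between / (d_inner + d_between)                 # Eq. (9)
```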

Unsupervised Discriminant Criterion. If no labeled data is available, the supervised discriminant criterion cannot be used. To measure the discriminant ability of the feature representations in the unlabeled target domain, the unsupervised discriminant criterion is applied instead. Similarly, the unsupervised discriminant criterion involves two types of scatter, the local scatter and the global scatter.

The local-scatter is defined as

$$\begin{aligned} d_{local} = \frac{1}{n_T^2} \sum _{i=1}^{n_T} \sum _{j=1}^{n_T} \mathbf {H}_{ij} \left\| \mathbf {x}_i - \mathbf {x}_j \right\| _2^2, \end{aligned}$$
(10)

where \(\mathbf {H}\) is the neighbor affinity matrix, whose entry \(\mathbf {H}_{ij}\) equals \(\mathbf {K}_{ij}\) when \({\mathbf {x}}_i\) and \({\mathbf {x}}_j\) are neighbors of each other and 0 otherwise, with \(\mathbf {K}_{ij}\) the kernel matrix entry computed with the multi-kernel defined above. Similarly, the global scatter is defined as

$$\begin{aligned} d_{global} = \frac{1}{n_T^2} \sum _{i=1}^{n_T} \sum _{j=1}^{n_T} \left( \mathbf {K}_{ij} - \mathbf {H}_{ij} \right) \left\| \mathbf {x}_i - \mathbf {x}_j \right\| _2^2. \end{aligned}$$
(11)

Therefore, we use the ratio of the global scatter to the total scatter as an approximation of the discriminability of the feature representations in the target domain:

$$\begin{aligned} c_{udc} = \frac{d_{global}}{d_{local} + d_{global}}, \end{aligned}$$
(12)

where \(c_{udc}\) denotes the unsupervised discriminant criterion.
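A minimal sketch of computing \(c_{udc}\) from Eqs. 10-12 is given below; the mutual k-nearest-neighbor rule and the single RBF kernel used to build the affinity matrix are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, euclidean_distances
from sklearn.neighbors import kneighbors_graph

def udc(Xt, gamma=1.0, k=10):
    n_T = len(Xt)
    K = rbf_kernel(Xt, Xt, gamma=gamma)                            # kernel matrix (single RBF here)
    A = kneighbors_graph(Xt, n_neighbors=k, include_self=False).toarray()
    mutual = np.logical_and(A, A.T)                                # x_i and x_j are neighbors of each other
    H = np.where(mutual, K, 0.0)                                   # neighbor affinity matrix
    D2 = euclidean_distances(Xt, squared=True)                     # pairwise squared distances
    d_local = (H * D2).sum() / n_T ** 2                            # Eq. (10)
    d_global = ((K - H) * D2).sum() / n_T ** 2                     # Eq. (11)
    return d_global / (d_local + d_global)                         # Eq. (12)
```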

3.3 Problem Statements

With the above approximations, the Meta Transfer Features are denoted as a five-tuple \(\left( d_{{\mathcal {A}}}, d_{mmd}, \epsilon _S, c_{sdc}, c_{udc} \right) \). In transfer learning, we focus on the performance improvement ratio brought by a specific transfer learning algorithm compared to the case without it. We build a machine learning model on the source domain \({\mathcal {D}}_S = \{ {\mathbf {X}}_S, {\mathbf {Y}}_S \}\) and denote it by \(h_S\). Without any transfer learning algorithm, the target domain error is \(\epsilon _{wo} = \frac{1}{n_T}\sum _{i=1}^{n_T} l(h_S({\mathbf {X}}_{Ti}), {\mathbf {Y}}_{Ti})\), where l is the loss function and \({\mathbf {X}}_{Ti}\) is the i-th sample in the target domain. A specific transfer learning algorithm g, taking \({\mathbf {X}}_S, {\mathbf {X}}_T\) as input, outputs the aligned data samples \({\hat{\mathbf {X}}}_{S}, {\hat{\mathbf {X}}}_{T}\) (Footnote 2). The aligned source and target domains become \(\{ {\hat{\mathbf {X}}}_S, {\mathbf {Y}}_S \}\) and \(\{ {\hat{\mathbf {X}}}_T \}\), and similarly we can compute the new target domain error \(\epsilon _{w} = \frac{1}{n_T}\sum _{i=1}^{n_T} l(\hat{h}_S({\hat{\mathbf {X}}}_{Ti}), {\mathbf {Y}}_{Ti})\), where \(\hat{h}_S\) is the model learned from the new source domain samples. If \(\epsilon _{w}\) is smaller than \(\epsilon _{wo}\), we consider that g has made an improvement, with the ratio defined as \(r_{imp}\):

$$\begin{aligned} r_{imp} = \frac{\epsilon _{wo} - \epsilon _{w}}{\epsilon _{wo}} \end{aligned}$$
(13)

Given the source and target domains \({\mathcal {D}}_S = \{ {\mathbf {X}}_S, {\mathbf {Y}}_S \}\) and \({\mathcal {D}}_T = \{ {\mathbf {X}}_T \}\), a transfer learning algorithm g yields representations \({\hat{\mathcal {D}}}_S = \{ {\hat{\mathbf {X}}}_S, {\mathbf {Y}}_S \}\) and \({\hat{\mathcal {D}}}_T = \{ {\hat{\mathbf {X}}}_T \}\). From \({\mathcal {D}}_S\) and \({\mathcal {D}}_T\), we obtain a five-tuple of Meta Transfer Features \(\left( d_{{\mathcal {A}}}, d_{mmd}, \epsilon _S, c_{sdc}, c_{udc} \right) \), and similarly, from \({\hat{\mathcal {D}}}_S\) and \({\hat{\mathcal {D}}}_T\), we obtain another five-tuple \(\left( \hat{d}_{{\mathcal {A}}}, \hat{d}_{mmd}, \hat{\epsilon }_S, \hat{c}_{sdc}, \hat{c}_{udc} \right) \). We concatenate these two tuples to obtain the features denoted by \(\mathbf {x}^{meta}\). Using these features, we regress the transfer improvement ratio \(r_{imp}\), denoted by \(\mathbf {y}^{meta}\).

From historical transfer learning experiences, we can collect pairs \((\mathbf {x}^{meta}, \mathbf {y}^{meta})\) and build a model that maps Meta Transfer Features to the transfer improvement ratio. With this model, we can better understand the internal mechanism of transfer learning algorithms and provide prior knowledge for future transfer learning tasks.
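For illustration, the following sketch assembles one experience pair \((\mathbf {x}^{meta}, \mathbf {y}^{meta})\). It reuses the helper functions sketched above (proxy_a_distance, mk_mmd, sdc, udc), and `transfer` stands for an arbitrary transfer algorithm g returning aligned features; all of these names are assumptions introduced here, not part of the paper's code.

```python
import numpy as np

def meta_pair(Xs, ys, Xt, yt, clf, transfer):
    def five_tuple(Xs_, Xt_):
        eps_S = 1.0 - clf.fit(Xs_, ys).score(Xs_, ys)        # empirical source error, Eq. (6)
        return [proxy_a_distance(Xs_, Xt_), mk_mmd(Xs_, Xt_), eps_S, sdc(Xs_, ys), udc(Xt_)]

    err_wo = 1.0 - clf.fit(Xs, ys).score(Xt, yt)             # epsilon_wo: no transfer
    Xs_hat, Xt_hat = transfer(Xs, Xt)                        # aligned representations from g
    err_w = 1.0 - clf.fit(Xs_hat, ys).score(Xt_hat, yt)      # epsilon_w: with transfer
    x_meta = np.array(five_tuple(Xs, Xt) + five_tuple(Xs_hat, Xt_hat))   # 10-dim meta features
    y_meta = (err_wo - err_w) / err_wo                       # r_imp, Eq. (13)
    return x_meta, y_meta
```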

3.4 Multi-task Learning Framework

Since transfer learning algorithms are designed with different mechanisms, it is unwise to build a single mapping from all their experiences, which would lose their specialities. On the other hand, we also want to learn something common that can be applied to new transfer learning algorithms, so that we do not have to train a model for each one individually. Therefore, we propose a multi-task learning framework to learn both common and specific knowledge from varying transfer learning experiences.

To be specific, the transfer learning experiences of T different algorithms are denoted as \(\{ \{(\mathbf {x}^{meta}_{ti}, \mathbf {y}^{meta}_{ti})\}_{i=1}^{N_t} \}_{t=1}^T\). For simplicity, we use linear regression with regularization as the mapping function. We divide the mapping functions into two parts, a common one and specific ones, denoted by \(({\mathbf {w}}, b)\) and \(\{({\mathbf {w}}_t, b_t)\}_{t=1}^T\) respectively. Our optimization objective is then:

$$\begin{aligned} \min _{ \mathbf {\theta } } L = \frac{1}{T} \sum _{t=1}^T \sum _{i=1}^{N_t} \left( ({\mathbf {w}}+ {\mathbf {w}}_t)^T\mathbf {x}^{meta}_{ti} + b + b_t - \mathbf {y}^{meta}_{ti}\right) ^2 + \lambda R({\mathbf {w}}, \{{\mathbf {w}}_t\}_{t=1}^T), \end{aligned}$$
(14)

where \(R({\mathbf {w}}, \{{\mathbf {w}}_t\}_{t=1}^T)\) is the regularization term, such as the L2-norm regularization, and \(\mathbf {\theta } = \{{\mathbf {w}}, b, \{{\mathbf {w}}_t\}_{t=1}^T, \{b_t\}_{t=1}^T \}\) denotes the parameters to be learned. We solve this problem with an alternating optimization strategy: first we fix the global parameters \(({\mathbf {w}}, b)\) and optimize \(({\mathbf {w}}_t, b_t)\) for each task, and then we fix the local parameters \(\{({\mathbf {w}}_t, b_t)\}_{t=1}^T\) and optimize \(({\mathbf {w}}, b)\), alternating between the two steps.
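A minimal sketch of this alternating scheme under L2 regularization is shown below; solving each sub-problem with a closed-form ridge regression on residuals is an implementation choice of ours, not something prescribed by Eq. 14.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge solution on inputs augmented with a bias column.
    Xa = np.hstack([X, np.ones((len(X), 1))])
    wb = np.linalg.solve(Xa.T @ Xa + lam * np.eye(Xa.shape[1]), Xa.T @ y)
    return wb[:-1], wb[-1]

def fit_mtl(X_meta, y_meta, lam=1.0, n_iter=20):
    # X_meta[t]: (N_t x d) meta features of algorithm t; y_meta[t]: (N_t,) improvement ratios.
    T, d = len(X_meta), X_meta[0].shape[1]
    w, b = np.zeros(d), 0.0
    ws, bs = [np.zeros(d) for _ in range(T)], [0.0] * T
    for _ in range(n_iter):
        # Step 1: fix the shared (w, b), update each task-specific (w_t, b_t) on the residual.
        for t in range(T):
            ws[t], bs[t] = ridge_fit(X_meta[t], y_meta[t] - (X_meta[t] @ w + b), lam)
        # Step 2: fix the task-specific parts, update the shared (w, b) on all residuals.
        X_all = np.vstack(X_meta)
        r_all = np.concatenate([y_meta[t] - (X_meta[t] @ ws[t] + bs[t]) for t in range(T)])
        w, b = ridge_fit(X_all, r_all, lam)
    return w, b, ws, bs
```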

4 Experimental Studies

In this section, we present experiments on both synthetic and public data.

4.1 Understanding Meta Transfer Features

One of the contributions of this work is the proposed Meta Transfer Features, so we will provide some experimental results on synthetic data to understand why these features matter so much.

In order to understand Transferability, we sample data from two 2-dim Gaussian distributions as the source and target domains, as shown in the top row of Fig. 1. From the figure, the proxy \({\mathcal {A}}\) distance (HDIV in the figure) and the MMD distance become larger as the two domains move further apart. As to Discriminability, we sample data from five Gaussian distributions as five classes. The bottom row of Fig. 1 shows that both the supervised and unsupervised discriminant criteria become larger as the overlap among classes becomes smaller, which means the features are more discriminative for classification.

Fig. 1. Understanding Meta Transfer Features. The first row illustrates the Transferability between source and target domains, while the second row shows the Discriminability of features with five classes.

4.2 Understanding Transfer Learning Methods

As discussed above, different transfer learning algorithms have their own mechanisms, and we now provide experimental results illustrating this.

Shallow Transfer Methods. We implement TCA [12], SA [4] and ITL [15] as examples to show the different mechanisms among them.

Fig. 2. Understanding shallow transfer learning algorithms. From left to right: (a) The synthetic data. (b) The 1-dim features obtained from TCA. (c) The 1-dim features obtained from SA. (d) The 1-dim features obtained from ITL.

In order to visualize the learned representations, we construct synthetic data as follows: we sample data from two 2-dim Gaussian distributions as the two classes in the source domain (S0, S1 in Fig. 2 (a)), then rotate the Gaussian means by a fixed angle and use the new means to sample target data (T0, T1 in Fig. 2 (a)) with the same covariance. We then use TCA, SA and ITL to obtain aligned distributions in 1-dim space, and for each algorithm we select the best parameters so that all three achieve almost the same \(10\%\) improvement in classification accuracy compared to the case without using the algorithm. Because the two classes of the two domains overlap in 1-dim space, we plot them separately with different y-axis values in Fig. 2. From this visualization, it is obvious that ITL obtains a more discriminative representation than TCA and SA, as the samples from different classes are clearly separated in Fig. 2 (d). This result fits well with the information-theoretic factors considered in the design of ITL, and we refer readers to [15] for more details. In addition, TCA achieves a better alignment between the source and target domains, as shown in Fig. 2 (b).

Deep Transfer Methods. Aside from the shallow transfer learning algorithms, we explore how the Meta Transfer Features change during the learning process of deep transfer learning algorithms, taking DAN [10] as an example. We use Amazon (\(\mathbf {A}\)) and DSLR (\(\mathbf {D}\)) from the Office [14] dataset as the source and target domains. For each training epoch, we extract Meta Transfer Features from the hidden representations learned by the DAN network and plot their change in Fig. 3 (a) (the plot is normalized with min-max normalization). The MMD distance (MMD in the figure) keeps decreasing as the domain alignment mechanism of DAN is optimized, while the proxy \({\mathcal {A}}\) distance (HDIV in the figure) oscillates considerably. In addition, \(c_{sdc}\) becomes smaller, showing that the features become more confusable as the overlap between the two domains grows.

4.3 Prediction Results

Transfer learning experiences are constructed from sub-tasks sampled from the classical datasets: Office [14], Caltech [6], MNIST (\(\mathbf {M}\)) and USPS (\(\mathbf {U}\)). The Office and Caltech datasets have four domains in total: Amazon (\(\mathbf {A}\)), Caltech (\(\mathbf {C}\)), DSLR (\(\mathbf {D}\)) and Webcam (\(\mathbf {W}\)). For a specific source-target combination such as \(\mathbf {A} \rightarrow \mathbf {C}\), we sample tasks with a subset of the 10 shared classes. For example, sampling 4-class classification tasks gives \(\binom{10}{4} = 210\) unique tasks in total, as verified in the short check below.
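A quick illustrative check of this task count (not part of the experimental pipeline):

```python
from itertools import combinations

# All 4-class sub-tasks that can be drawn from the 10 shared classes.
tasks = list(combinations(range(10), 4))
assert len(tasks) == 210
```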

For the prediction experiments, we focus only on shallow transfer learning algorithms, including RPROJ (Footnote 3), PCA, TCA [12], MSDA [2], CORAL [16], GFK [6], ITL [15], LSDT [22], GTL [11] and KMM [8]. These algorithms cover almost all kinds of shallow transfer learning, such as instance-based, subspace-based, manifold-based, information-based and reconstruction-based methods. For each sampled task, we apply all of these algorithms with randomly selected hyperparameters and collect the \((\mathbf {x}^{meta}, \mathbf {y}^{meta})\) pairs.

Fig. 3. (a) Understanding deep transfer learning algorithms: the change of Meta Transfer Features in the training process. (b) Task visualization using MDS, mapping the learned weights into the 2-dim space.

We compare our proposed multi-task learning framework (Meta-MTL) with two baselines: the first trains a single model on all experiences together (Meta-Sin), and the second trains a model for each transfer algorithm individually (Meta-Ind). We use both MSE and MAE as evaluation criteria. The prediction results are shown in Table 1, which verifies the validity of our MTL framework: it predicts the transfer improvement ratio more accurately for unseen transfer tasks. The results also indicate that experiences from different transfer learning algorithms should not be utilized equally. The first column displays the source and target domain pairs used to obtain transfer learning experiences, and we find that the dataset information, which is ignored here, also matters a lot; we leave this for future work.

Table 1. Prediction results of different methods of utilizing the transfer learning experiences.

In addition, to visualize the differences among transfer learning algorithms, we use MDS to obtain 2-dim representations that preserve the Euclidean distances among their specific weights as much as possible, and plot the result in Fig. 3 (b). From this figure, we can identify similar transfer learning methods that may serve as alternatives to each other, while diverse algorithms can be combined for ensemble learning. Specifically, we find that MSDA and TCA may be interchangeable transfer learning methods in this experiment.
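A minimal sketch of this visualization step, assuming `ws` holds the algorithm-specific weight vectors (e.g. from the fit_mtl sketch above) and `names` the corresponding algorithm names:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

def plot_algorithm_map(ws, names):
    # Embed the algorithm-specific weight vectors into 2-dim space, preserving
    # their pairwise Euclidean distances as well as possible.
    coords = MDS(n_components=2, random_state=0).fit_transform(np.vstack(ws))
    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), name in zip(coords, names):
        plt.annotate(name, (x, y))
    plt.show()
```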

5 Conclusion

In this paper, we propose MetaTrans from both the Transferability and Discriminability aspects and give a comprehensive understanding of both shallow and deep transfer learning algorithms. To exploit historical transfer learning experiences, we propose a multi-task learning framework, and the experimental results show that it utilizes the experiences better and predicts future transfer performance improvement more accurately. Considering more meta-features, taking dataset information into account, and learning task embeddings are directions for future work.