Towards Accurate Knowledge Transfer via Target-awareness Representation Disentanglement

Fine-tuning deep neural networks pre-trained on large-scale datasets is one of the most practical transfer learning paradigms when only a limited number of training samples is available. To obtain better generalization, using the starting point as the reference (SPAR), either through weights or features, has been successfully applied to transfer learning as a regularizer. However, due to the domain discrepancy between the source and target tasks, straightforward knowledge preservation carries an obvious risk of negative transfer. In this paper, we propose a novel transfer learning algorithm that introduces the idea of Target-awareness REpresentation Disentanglement (TRED), where the knowledge relevant to the target task is disentangled from the original source model and used as a regularizer while fine-tuning the target model. Specifically, we design two alternative methods for the representation disentanglement: maximizing the Maximum Mean Discrepancy (Max-MMD) and minimizing the mutual information (Min-MI). Experiments on various real-world datasets show that our method consistently improves on standard fine-tuning by more than 2% on average. TRED also outperforms related state-of-the-art transfer learning regularizers such as L2-SP, AT, DELTA, and BSS.


I. INTRODUCTION
Deep convolutional networks achieve great success on large-scale vision tasks such as ImageNet [1] and Places365 [2]. In addition to their notable improvements in accuracy, deep representations learned by modern CNNs have been shown to be transferable across relevant tasks [3]. This is rather fortunate for many real-world applications with insufficient labeled examples. Transfer learning aims to obtain good performance on such tasks by leveraging knowledge learned from relevant large-scale datasets. The auxiliary and desired tasks are called the source and target tasks, respectively. Following the taxonomy of [4], we focus on inductive transfer learning, which addresses the situation where the source and target tasks have different label spaces.
The most popular practice is to fine-tune a model pre-trained on the source task with L2 regularization, which constrains the parameters around the origin. [5] points out that, since the parameters may be driven far from the starting point of the pre-trained model, a major disadvantage of naive fine-tuning is the risk of catastrophic forgetting [6] of the knowledge learned from the source. They recommend using the L2-SP regularizer instead of the popular L2. In parallel, knowledge distillation, which was originally designed for compressing the knowledge in a complex model into a simple one [7], has proved useful for transfer learning tasks where knowledge is distilled from a different dataset [8], [9]. Recent work [10] formulates knowledge distillation in transfer learning as a regularizer on features and further improves it by reusing unactivated channels to better fit the training samples. These methods share a common assumption: the relevant knowledge contained in the source model is expected to facilitate the generalization of the target model. This motivates regularizing the fine-tuned model using the starting point as the reference (SPAR) in a mainstream group of inductive transfer learning algorithms [5], [8], [10].
Although advanced methods with SPAR have succeeded in preserving the knowledge contained in the source model, fine-tuning also carries an obvious risk of negative transfer [11]. Intuitively, if the source and target data distributions are dissimilar to some extent, not all the knowledge from the source is transferable to the target, and an indiscriminate transfer could be detrimental to the target model. However, the impact of negative transfer has rarely been considered in inductive transfer learning studies. The most related work [12] proposes the Batch Spectral Shrinkage (BSS) regularizer to inhibit negative transfer, where the small singular values of feature representations are considered not transferable and are suppressed. Yet it is hard to adaptively determine the range of small singular values when faced with different target tasks. Moreover, BSS does not take the risk of catastrophic forgetting into account, which means it has to be combined with other fine-tuning techniques (e.g., L2-SP [5], Attention-Transfer [8], DELTA [10]) to achieve considerable performance.
Based on the above analysis, it is natural to seek a solution that simultaneously preserves relevant knowledge and avoids negative transfer. In this paper, we intend to improve the standard fine-tuning paradigm through accurate knowledge transfer. Assuming that the knowledge contained in the source model consists of one part relevant to the target task and another part that is irrelevant, we explicitly disentangle the former from the source model. Thus, a target-task-specific starting point is used as the reference instead of the original one.
Specifically, we design a novel regularizer for deep transfer learning through Target-awareness REpresentation Disentanglement (TRED). The whole algorithm includes two steps. First, we use a lightweight disentangler to separate intermediate representations of the pre-trained source model into positive and negative parts. The core component of the disentangler is a differentiable module capable of separating the positive and negative groups of features. We propose two alternative methods for the disentanglement: maximizing the Maximum Mean Discrepancy (Max-MMD) on the spatial dimension, and minimizing the Mutual Information (Min-MI) on the channel dimension. Intuitively, we intend to distill the relevant ingredients from noisy features by separating out irrelevant ingredients, which should (1) attend to different regions from the relevant ones, or (2) have semantic representations independent of the relevant ones. The disentanglement is achieved by simultaneously optimizing the disentangling module and ensuring that the original representation can be reconstructed. Supervision information from labeled target examples is used to distinguish the positive part from the negative part. The second step is to perform fine-tuning using the disentangled positive part of the representations as the reference.
In summary, our main contributions are as follows:
• We are the first to introduce the idea of representation disentanglement to improve inductive transfer learning.

A. Shrinkage Towards Chosen Parameters
Regularization techniques have a long history dating back to Stein's paradox [14], [15], which shows that shrinking towards chosen parameters yields an estimate more accurate than simply using observed averages. The most common choices, such as Lasso and Ridge Regression, pull the model towards zero, while it is widely believed that shrinking towards "true parameters" is more effective. In transfer learning, models pre-trained on relevant source tasks with sufficient labeled examples are often regarded as the "true parameters". Earlier works demonstrate the effectiveness of this idea on Maximum Entropy [16] and Support Vector Machine models [17]-[19].
Although the notion of "true parameters" of deep neural networks is intractable, several studies [3], [20], [21] have demonstrated the great transferability of representations trained on large-scale datasets for general purposes. [22] theoretically studies properties of the pre-trained model and explains why it outperforms training from scratch.

B. Deep Inductive Transfer Learning
Despite this great transferability, recent work [5] points out that naively fine-tuning the pre-trained model with L2 regularization may cause loss of the original knowledge. To overcome over-fitting, various transfer learning regularizers have been proposed. According to the type of regularized objective, they can be categorized as parameter-based [5], feature-based [8]-[10], or spectral-based [12] methods. [12] improves the regularization of transfer learning from another angle: it proposes Batch Spectral Shrinkage (BSS) to regularize spectral components of deep representations by penalizing small singular values. BSS is complementary to other regularizers. However, it does not deal with the issue of knowledge preservation.
Our paper adopts the general idea of preserving knowledge by regularizing features of the source model. Unlike previous methods, however, we do not directly use the original knowledge provided by the source model. Instead, we disentangle the useful part and use it as the reference to avoid negative transfer. The main differences are summarized in Table I.

C. Representation Disentanglement
The key assumption of representation disentanglement is that a satisfactory representation should separate the underlying factors of variation, which are compact, explanatory, and independent of each other [29], [30]. Representation disentanglement has been widely applied in advanced generative models such as Generative Adversarial Networks [31] and Variational Autoencoders [32]. Some works investigate general disentangling methods for data generation, including InfoGAN [33], AC-GAN [34], β-VAE [35], and FactorVAE [36].
Recently, representation disentanglement has also been shown to be useful in unsupervised image-to-image translation and domain adaptation, which are more related to our work. [37] proposes to perform joint representation disentanglement and domain adaptation with only attribute supervision available in the source domain. [38] further generalizes the previous study to a unified feature disentanglement network. In that work, the data domain is first regarded as an underlying factor of interest to be disentangled. [39] improves on the above studies by employing class disentanglement and minimizing the mutual information between disentangled features to further enhance the disentanglement.
Our work is highly inspired and encouraged by the progress in domain information disentanglement [38] and disentangling techniques [39], [40]. We are, however, the first to use disentanglement methods to improve inductive transfer learning, aiming at (target) task-specific feature disentanglement rather than the domain-invariant feature extraction pursued in unsupervised domain adaptation.

A. Problem Definition
In inductive transfer learning, we are given a model pre-trained on the source task, with parameter vector $\omega^0$. For the desired task, the training set contains $n$ tuples, each denoted as $(x_i, y_i)$, where $x_i$ and $y_i$ refer to the $i$-th example and its corresponding label.
Let us further denote $z$ as the function of the neural network and $\omega$ as the parameter vector of the target network. The objective is structural risk minimization:
$$\min_{\omega} \sum_{i=1}^{n} L(z(x_i, \omega), y_i) + \lambda \cdot \Omega(\omega), \tag{1}$$
where the first term is the empirical loss and the second is the regularization term. $\lambda$ is the coefficient that balances data fitting against reducing over-fitting.

B. Regularizers for Transfer Learning
Recent studies in the deep learning paradigm show that SGD itself has an implicit regularization effect that helps generalization in the over-parameterized regime [41]. In addition, since fine-tuning is usually performed with a smaller learning rate and fewer epochs, it can be regarded as a form of implicit regularization towards an initial solution with good generalization properties [22]. Beyond these, we briefly introduce state-of-the-art explicit regularizers for deep transfer learning.
L2 Penalty. The most common choice is the L2 penalty of the form $\|\omega\|_2^2$, also called weight decay in deep learning. From a Bayesian perspective, it corresponds to a Gaussian prior on the parameters with zero mean. Its shortcoming is that the meaningful initial point $\omega^0$ is ignored.
L2-SP. [5] follows the idea of shrinking towards chosen targets instead of zero. They propose to use the starting point as the reference:
$$\Omega(\omega) = \frac{\alpha}{2} \|\omega_s - \omega_s^0\|_2^2 + \frac{\beta}{2} \|\omega_{\bar{s}}\|_2^2, \tag{2}$$
where the first term constrains the parameters of the part responsible for representation learning, $\omega_s$, around the starting point, and the second term is weight decay on the remaining, task-specific part $\omega_{\bar{s}}$. Since the weight decay on $\omega_{\bar{s}}$ is common to all the methods mentioned, we omit it in the following formulas.
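The difference between shrinking towards zero and shrinking towards the starting point can be illustrated with a minimal sketch. It assumes parameters stored as flat Python lists; the function names, the toy weights, and the coefficients `alpha` and `beta` (with the conventional 1/2 factors) are illustrative assumptions rather than values from the experiments.

```python
def l2_penalty(weights):
    """Squared L2 norm of a flat list of weights (plain weight decay)."""
    return sum(w * w for w in weights)


def l2_sp_penalty(shared, shared_start, new_part, alpha=0.01, beta=0.01):
    """L2-SP style penalty (sketch): shared weights are pulled towards the
    pre-trained starting point, the task-specific weights towards zero."""
    dist_to_start = sum((w - w0) ** 2 for w, w0 in zip(shared, shared_start))
    return 0.5 * alpha * dist_to_start + 0.5 * beta * l2_penalty(new_part)


# Hypothetical toy weights, only for illustration.
shared_start = [0.8, -1.2, 0.5]   # pre-trained (source) weights
shared_now = [0.8, -1.2, 0.5]     # fine-tuned weights that have not moved
new_head = [0.0, 0.0]             # freshly initialized task-specific part

# Plain L2 penalizes the pre-trained weights themselves ...
plain = l2_penalty(shared_now)
# ... while L2-SP is zero as long as the model stays at the starting point.
sp = l2_sp_penalty(shared_now, shared_start, new_head)
```

In this toy case the plain L2 penalty is positive even though nothing has been forgotten, whereas the L2-SP penalty only grows once the shared weights drift away from the source model.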
DELTA. [10] extends the framework of feature distillation [8], [42] by incorporating an attention mechanism. They constrain the 2-D activation maps of different channels with different strengths according to their value to the target task. Given a training example $(x_i, y_i)$ and a distance metric $D$ between activation maps, the regularization is formulated as
$$\Omega(\omega, \omega^0, x_i, y_i) = \sum_{j=1}^{C} W_j(\cdot) \, D\big(FM_j(\omega, x_i), FM_j(\omega^0, x_i)\big), \tag{3}$$
where $C$ is the number of channels, $FM_j$ is the activation map of the $j$-th channel, and $W_j(\cdot)$ refers to the regularization weight assigned to the $j$-th channel. Specifically, each weight is independently evaluated by the performance drop when disabling that channel.
BSS. The authors of [12] propose Batch Spectral Shrinkage (BSS) to penalize untransferable spectral components of deep representations. They find that the less transferable spectral components are those corresponding to relatively small singular values. They apply differentiable SVD to compute all singular values of a feature matrix and penalize the smallest $k$ ones:
$$\Omega(\omega) = \sum_{i=1}^{k} \sigma_{b-i+1}^2, \tag{4}$$
where all singular values $[\sigma_1, \sigma_2, \ldots, \sigma_b]$ are in descending order. $\omega^0$ is not involved, as BSS does not consider preserving the knowledge in the source model.

Algorithm 1 (partial): ...; obtain $FM_{pos}$ by Eq. 5 given $(B, F_s, D)$; calculate $\Omega(\omega_s) = \sum_{j=1}^{B} \Omega_j(\omega_s)$ by Eq. 15; update $F_t$ by Eq. 1; end; return $F_t$.

IV. TARGET-AWARENESS DISENTANGLEMENT
Features extracted from the source model, which is usually pre-trained on a large-scale dataset with diverse categories, are often noisy for a specific target task. Ingredients of the general knowledge that are irrelevant to the target task may lead to negative transfer. Our aim is to disentangle the positive ingredients (relevant to the target task) from the entire representation. To achieve this goal, three conditions should be satisfied: the disentangled parts should be distinguishable, discriminative, and recoverable.
• The positive part should be distinguishable from the negative part. In other words, two features with similar patterns or semantic relations should not be disentangled into different parts. This is the most crucial component in this framework.

A. General definitions and notations
We first specify the general definitions and notation used in our algorithm. Unlike the mainstream of disentanglement studies, which try to separate various atomic attributes such as color or angle, we care about disentangling the components relevant to the target task from the whole representation produced by the source model. Formally, we disentangle the original representation $FM_{ori} \in \mathbb{R}^{C \times H \times W}$ obtained from the pre-trained model into a positive and a negative part with the disentangler module $D$:
$$FM_{pos}, FM_{neg} = D(FM_{ori}), \tag{5}$$
where $FM_{pos}$ and $FM_{neg}$ have the same shape as $FM_{ori}$.
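For efficient estimation and optimization of the disentanglement, we further denote the mapping functions $F_C: \mathbb{R}^{C \times H \times W} \to \mathbb{R}^{C}$ and $F_S: \mathbb{R}^{C \times H \times W} \to \mathbb{R}^{H \times W}$, representing dimension reduction along the spatial and channel directions, respectively. Therefore we get
$$f^c_* = F_C(FM_*), \quad f^s_* = F_S(FM_*), \tag{6}$$
where $*$ refers to either $pos$ or $neg$.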
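The two dimension reductions can be sketched as follows. The text above does not state which reduction operator is used, so plain averaging is assumed here; the function names and the nested-list layout `fm[channel][row][col]` are likewise illustrative assumptions.

```python
def reduce_spatial(fm):
    """F_C sketch: average every channel over its H x W grid -> length-C vector.
    Averaging is an assumption; the text only says 'dimension reduction'."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in fm
    ]


def reduce_channel(fm):
    """F_S sketch: average over the channel axis -> H x W spatial map."""
    c, h, w = len(fm), len(fm[0]), len(fm[0][0])
    return [
        [sum(fm[k][i][j] for k in range(c)) / c for j in range(w)]
        for i in range(h)
    ]


# Toy feature map with C=2, H=2, W=2 (hypothetical values).
fm = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
f_c = reduce_spatial(fm)   # one value per channel
f_s = reduce_channel(fm)   # one value per spatial location
```

The channel descriptor feeds the Min-MI variant and the spatial map feeds the Max-MMD variant, so both reductions operate on the same disentangled feature map.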

B. Feature Disentanglement
The design of our disentangler is motivated by the following two assumptions, to which the ideally disentangled relevant (positive) and irrelevant (negative) parts are expected to conform.
• Non-overlapped Visual Attention. Visual attention can be regarded as one of the most interpretable kinds of knowledge learned by DNNs [8]. Intuitively, the two disentangled parts should attend to different visual regions. Otherwise, similar patterns are more likely to exist concurrently in different parts, implying an incomplete disentanglement.
• Independent Semantic Representation. Another common assumption is that patterns relevant to the same target object are likely to depend on each other, whereas patterns from unrelated objects are not. For example, when recognizing different dogs, a pattern on the eyes often implies a pattern on the nose, but is much less likely to imply a pattern on indoor objects such as cups or floors.
Both assumptions are illustrated by the example in Figure 1. Next we describe the two methods in detail.
Non-overlapped Visual Attention with Max-MMD. Imitating the human visual attention mechanism, we force the two parts to attend to different spatial regions of the original image. This is achieved by enlarging the distance between their statistical distributions along the spatial dimension. We use the Maximum Mean Discrepancy (MMD) to measure the distance between the two spatial representations. MMD was originally designed to test whether two distributions are the same [43]. It is also widely used as a criterion for measuring the distance between two distributions in domain adaptation tasks [44]-[46] and generative adversarial networks [47], [48]. Under the commonly adopted Reproducing Kernel Hilbert Space (RKHS) assumption, the MMD can be approximated empirically in kernel form as follows.
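Denoting $X_s = \{x_s^1, x_s^2, \ldots, x_s^n\}$ and $X_t = \{x_t^1, x_t^2, \ldots, x_t^m\}$ as sets of random variables with distributions $P$ and $Q$, an empirical estimate [45], [49] of the MMD between $P$ and $Q$ compares the squared distance between the empirical kernel mean embeddings:
$$\mathrm{MMD}^2(P, Q) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} k(x_s^i, x_s^j) + \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} k(x_t^i, x_t^j) - \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} k(x_s^i, x_t^j), \tag{7}$$
where $k$ refers to the kernel, for which a Gaussian radial basis function (RBF) is usually used in practice [49], [50].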
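A kernel-based MMD estimate of this kind can be sketched in a few lines of standard-library Python. The sketch assumes the simple estimator that averages over all sample pairs, a single Gaussian RBF kernel with a hypothetical bandwidth `sigma`, and samples given as lists of equal-length vectors; none of these choices is confirmed as the exact configuration of the experiments.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel between two vectors given as lists of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, xt, sigma=1.0):
    """Squared MMD between two sample sets, averaging over all pairs."""
    n, m = len(xs), len(xt)
    k_ss = sum(rbf_kernel(a, b, sigma) for a in xs for b in xs) / (n * n)
    k_tt = sum(rbf_kernel(a, b, sigma) for a in xt for b in xt) / (m * m)
    k_st = sum(rbf_kernel(a, b, sigma) for a in xs for b in xt) / (n * m)
    return k_ss + k_tt - 2.0 * k_st


# Hypothetical toy samples: the second set is the first one shifted by 3.
samples_a = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
samples_b = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]

same = mmd_squared(samples_a, samples_a)      # identical sets
shifted = mmd_squared(samples_a, samples_b)   # clearly separated sets
```

Identical sample sets give a value of zero and well-separated sets give a clearly positive value, which is the quantity the Max-MMD variant tries to enlarge between the positive and negative spatial maps.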
Our objective is to enlarge the MMD between the disentangled positive and negative parts along the spatial dimension. Intuitively, this explicitly encourages the two parts to recognize different regions of the input image. For more stable optimization, we minimize the negative exponent of the MMD as follows:
$$L^D_{di} = \exp\big(-\mathrm{MMD}^2(f^s_{pos}, f^s_{neg})\big). \tag{8}$$
Independent Semantic Representation with Min-MI. In information theory, the mutual information (MI) between two random variables quantifies the amount of information obtained about one by observing the other. Formally, the mutual information between random variables $X$ and $Z$ is defined as
$$I(X; Z) = \int_{\mathcal{X}} \int_{\mathcal{Z}} P_{XZ}(x, z) \log \frac{P_{XZ}(x, z)}{P_X(x) P_Z(z)} \, dz \, dx, \tag{9}$$
where $P_X$ and $P_Z$ are the marginal distributions and $P_{XZ}$ is their joint distribution. In general, $I(X; Z)$ is non-negative, and it is zero only when $X$ and $Z$ are independent.
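The property that mutual information vanishes exactly under independence can be checked on a small discrete example. The sketch below uses the discrete form of the definition (sums instead of integrals) on two hypothetical 2 x 2 joint probability tables; it is only an illustration of the quantity, not of the estimator used for high-dimensional features.

```python
import math


def mutual_information(joint):
    """I(X;Z) in nats for a discrete joint probability table joint[x][z]."""
    p_x = [sum(row) for row in joint]
    p_z = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # zero-probability cells contribute nothing
                mi += p * math.log(p / (p_x[i] * p_z[j]))
    return mi


# X and Z independent: the joint table is the product of its marginals.
independent = [[0.25, 0.25], [0.25, 0.25]]
# X fully determines Z: all mass lies on the diagonal.
dependent = [[0.5, 0.0], [0.0, 0.5]]

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

The independent table yields zero, while the fully dependent table yields log 2 nats, the entropy of a fair binary variable.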
A recent study [35] demonstrated that disentanglement of the factors of interest can be enhanced by encouraging independence between them. Further, [39] minimized the mutual information between disentangled features in practice to strengthen class/domain disentanglement. Inspired by [39], to disentangle the original representation into two parts that are both meaningful and complementary, we minimize the mutual information between the probability distributions of $f^c_{pos}$ and $f^c_{neg}$. Since exact computation of mutual information for high-dimensional data is hard, we leverage a recent approach for efficient estimation of mutual information [40] with a neural network $T_\theta$:
$$I(X; Z) \geq \sup_{\theta} \; \mathbb{E}_{P_{XZ}}[T_\theta(x, z)] - \log \mathbb{E}_{P_X \otimes P_Z}\big[e^{T_\theta(x, z)}\big]. \tag{10}$$
Denoting the optimal solution as $\hat{\theta}$, the mutual information can be computed by Monte-Carlo approximation over a mini-batch of $b$ examples as
$$\hat{I}(X; Z) = \frac{1}{b} \sum_{i=1}^{b} T_{\hat{\theta}}(x_i, z_i) - \log \frac{1}{b} \sum_{i=1}^{b} e^{T_{\hat{\theta}}(x_i, \bar{z}_i)}, \tag{11}$$
where each $(x_i, z_i)$ is drawn from the joint distribution $P_{XZ}$ and $\bar{z}_i$ is drawn from the marginal distribution $P_Z$.
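The mini-batch estimate can be sketched for a fixed critic function. In the sketch below the critic is an ordinary Python function standing in for the trained network, and the marginal samples are supplied explicitly (in practice they are often obtained by shuffling the second variable within the batch); the names `mine_estimate` and `toy_critic` and the toy data are assumptions made for illustration.

```python
import math


def mine_estimate(critic, joint_pairs, marginal_z):
    """Monte-Carlo estimate of the mutual information bound for a fixed critic.

    joint_pairs: list of (x, z) pairs drawn from the joint distribution.
    marginal_z:  list of z values drawn independently of the paired x.
    """
    b = len(joint_pairs)
    joint_term = sum(critic(x, z) for x, z in joint_pairs) / b
    marginal_term = math.log(
        sum(math.exp(critic(x, z_bar))
            for (x, _), z_bar in zip(joint_pairs, marginal_z)) / b
    )
    return joint_term - marginal_term


def toy_critic(x, z):
    """Hypothetical critic: large when x and z agree in sign."""
    return x * z


# Toy batch in which z always equals x.
joint_pairs = [(1.0, 1.0), (-1.0, -1.0), (1.0, 1.0), (-1.0, -1.0)]
# Marginal samples of z, paired with x independently of the true coupling.
marginal_z = [-1.0, -1.0, 1.0, 1.0]

estimate = mine_estimate(toy_critic, joint_pairs, marginal_z)
```

A critic that ignores its inputs gives an estimate of zero, while a critic that scores matching pairs higher than mismatched ones gives a positive estimate; training the critic amounts to pushing this value up for the current representations.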
Mutual information estimation by Eq. 10 is a maximization problem over $T_\theta$, while our objective is to minimize the mutual information by learning the disentangled representations. Therefore, we adopt the common practice of alternately updating $T_\theta$ and $D$. Given $T_\theta$ estimated for the current distributions of $f^c_{pos}$ and $f^c_{neg}$, we update $D$ with
$$L^D_{di} = \hat{I}(f^c_{pos}; f^c_{neg}). \tag{12}$$

C. Reconstruction Requirement
Since both the positive and negative parts are produced by the flexible disentangler, it is easy to obtain two meaningless representations if disentangling with Max-MMD or Min-MI is the only objective. To ensure that the disentanglement stays within the knowledge of the source model rather than being an arbitrary transformation, we add a reconstruction requirement to constrain the disentangled results. Specifically, the disentangled positive and negative parts are required to recover the original representation by point-wise addition, measured here by the squared L2 distance:
$$L^D_{re} = \| FM_{pos} + FM_{neg} - FM_{ori} \|_2^2. \tag{13}$$

D. Distinguishing the Positive Part
Since the above representation disentanglement is symmetric in the two parts, an explicit signal is required to identify the features that are useful for the target task. In particular, the layer selected for representation transfer is followed by a classifier consisting of a global pooling layer and a fully connected layer, in that order. A regular cross-entropy loss is added to explicitly drive the disentangler to extract the components that are discriminative for the target task into the positive part. Denoting the involved classifier as $C$, we have
$$L^D_{ce} = \mathrm{CE}\big(C(FM_{pos}), y_i\big). \tag{14}$$

E. Regularizing the Disentangled Representation
After the representation disentanglement step, we perform fine-tuning on the target task. We regularize the distance between a feature map and its corresponding starting point. Unlike previous feature-map-based regularizers such as [8], [10], [42], the starting point here is the disentangled positive part of the original representation. The regularization term corresponding to an example $(x_i, y_i)$, measured here by the squared L2 distance, becomes
$$\Omega(\omega, \omega^0, \omega_{di}, x_i, y_i) = \| FM(\omega, x_i) - FM_{pos}(\omega^0, \omega_{di}, x_i) \|_2^2, \tag{15}$$
where $\omega_{di}$ refers to the parameters of the disentangler $D$, which is frozen during the fine-tuning stage. The complete training procedure is presented in Algorithm 1.
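The reconstruction requirement and the final regularizer both reduce to distances between feature maps, which the following sketch makes concrete. It assumes flattened feature maps stored as Python lists and a squared Euclidean distance; the distance choice, the function names, and the toy values are assumptions for illustration only.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two flattened feature maps."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def reconstruction_loss(fm_ori, fm_pos, fm_neg):
    """Penalty for failing to recover the original map by point-wise addition."""
    recon = [p + n for p, n in zip(fm_pos, fm_neg)]
    return squared_distance(recon, fm_ori)


def tred_regularizer(fm_target, fm_pos_reference):
    """Distance between the fine-tuned feature map and the frozen positive part."""
    return squared_distance(fm_target, fm_pos_reference)


# Hypothetical flattened feature maps.
fm_ori = [1.0, 2.0, 3.0]   # source-model representation
fm_pos = [1.0, 0.5, 3.0]   # part judged relevant to the target task
fm_neg = [0.0, 1.5, 0.0]   # complementary, irrelevant part

recon_error = reconstruction_loss(fm_ori, fm_pos, fm_neg)
# A target model that still reproduces the full source representation is
# penalized only on the entries that the disentangler moved to the negative part.
reg_value = tred_regularizer(fm_ori, fm_pos)
```

In this toy case the two parts add up exactly to the original map, so the reconstruction error is zero, while the regularizer is non-zero only where the reference differs from the original representation.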

A. Datasets
We select several popular transfer learning datasets to evaluate the effectiveness of our method.
Stanford Dogs. The Stanford Dogs [51] dataset consists of images of 120 breeds of dogs, each containing 100 examples for training and 72 for testing. It is a subset of ImageNet.
MIT Indoor-67. MIT Indoor-67 [52] is an indoor scene classification task consisting of 67 categories. There are 80 images for training and 20 for testing in each category.
CUB-200-2011. Caltech-UCSD Birds-200-2011 [53] contains 11,788 images of 200 bird species from around the world. Each species is associated with a Wikipedia article and organized by scientific classification.
Textures. The Describable Textures Dataset [57] is a texture database containing 5640 images organized into 47 categories according to perceptual properties of textures.

B. Settings and Hyperparameters
We implement transfer learning experiments based on ResNet [58]. For MIT Indoor-67, we use ResNet-50 pre-trained on the large-scale scene recognition dataset Places365 [2] as the source model. For the remaining datasets, we use ImageNet pre-trained ResNet-101 as the source model. Input images are resized so that the shorter edge is 256 and then randomly cropped to 224 × 224 during training.
For optimization, we first train the disentangler for 5 epochs using Adam with a learning rate of 0.01. All involved hyperparameters are set to the default values $\lambda_{di} = 10^{-2}$, $\lambda_{ce} = 10^{-3}$, and $\lambda_{re} = 10^{-2}$. Then we use SGD with a momentum of 0.9, a batch size of 64, and an initial learning rate of 0.01 to fine-tune the target model. We train for 40 epochs on each dataset. The learning rate is divided by 10 after 25 epochs. We run each experiment three times and report the average top-1 accuracy.
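The fine-tuning learning-rate schedule described above can be written down as a small sketch. It assumes zero-based epoch indices, so that the first 25 epochs use the initial rate and the remaining 15 use the reduced one; the function name and keyword arguments are hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, decay_epoch=25, factor=10.0):
    """Step schedule: base_lr for the first `decay_epoch` epochs, then base_lr / factor.
    `epoch` is assumed to be zero-based."""
    return base_lr if epoch < decay_epoch else base_lr / factor


# Learning rate for each of the 40 fine-tuning epochs.
schedule = [learning_rate(epoch) for epoch in range(40)]
```

Frameworks usually provide an equivalent built-in step scheduler; the explicit function is shown only to make the epoch boundaries unambiguous.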

C. Results
Since recent theoretical studies have shown that weight decay actually has no regularization effect [59], [60] when combined with the commonly used batch normalization, we use No Regularization as the most naive baseline and re-evaluate L2. From Table II we observe that L2 does not outperform fine-tuning without any regularization. This may imply that deep transfer learning hardly benefits from regularizers with non-informative priors.
Advanced works [5], [8], [10] adopt regularizers that use the starting point as the reference for knowledge preservation. From a Bayesian perspective, these are equivalent to an informative prior that trusts the knowledge contained in the source model, in the form of either parameters or behavior. Table II shows that these algorithms obtain significant improvements on some datasets, such as Stanford Dogs and MIT Indoor-67, where the target dataset is very similar to the source dataset. However, the benefits are much smaller on other datasets such as CUB-200-2011, Flower-102, and Stanford Cars.
Table II illustrates that TRED consistently outperforms all the above baselines on all evaluated datasets. It outperforms the naive fine-tuning regularizer L2 by more than 2% on average. Except on Stanford Dogs and MIT Indoor-67, the improvements remain clear even compared with the state-of-the-art regularizers L2-SP, AT, DELTA, and BSS.
To evaluate the scalability of our algorithm with more limited data, we conduct additional experiments on subsets of the standard dataset CUB-200-2011. Baseline methods include L2, BSS [12], L2-SP [5], AT [8], and DELTA [10]. Specifically, we randomly sample 50%, 30%, and 15% of the training examples in each category to construct new training sets. As presented in Table III, our proposed TRED achieves remarkable improvements over all competitors. We provide more explanation of the results in the supplementary material.
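The per-category subsampling used for the limited-data experiments can be sketched as follows. The sketch assumes integer class labels in a flat list, a fixed random seed, rounding to the nearest integer, and at least one example kept per class; these details and the function name are assumptions, since the text only specifies the sampling fractions.

```python
import random


def subsample_per_class(labels, fraction, seed=0):
    """Return sorted indices that keep `fraction` of the examples of every class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    kept = []
    for label in sorted(by_class):
        indices = by_class[label]
        k = max(1, round(fraction * len(indices)))  # keep at least one example
        kept.extend(rng.sample(indices, k))
    return sorted(kept)


# Hypothetical toy label list: two classes with 20 training examples each.
labels = [0] * 20 + [1] * 20
subsets = {f: subsample_per_class(labels, f) for f in (0.5, 0.3, 0.15)}
```

Sampling within each category, rather than over the whole training set, keeps the class balance of the reduced training sets identical to that of the full one.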

VI. DISCUSSIONS
In this section, we dive deeper into the mechanism and experimental results to explain why target-awareness disentanglement provides a better reference. In subsection Representation Visualization, we show the effect of our method by visualizing attention maps and feature embeddings. In subsection Shrinking Towards True Behavior, we briefly discuss the theoretical understanding related to shrinkage estimation and then provide more statistical evidence to validate the advantage of the disentangled positive representation. In subsection Mechanism Analysis, we empirically analyze why the disentanglement component is essential.

A. Representation Visualization
Show Cases. The authors of [8] show that the spatial attention map plays a critical role in knowledge transfer. We demonstrate the effect of representation disentanglement by visualizing the attention maps in Fig. 2. As observed in typical cases from CUB-200-2011 and Stanford Cars, the original representations generated by the ImageNet pre-trained model usually contain a wide range of semantic features, such as objects or backgrounds, in addition to parts of birds. Our proposed disentangler is able to "purify" the concepts of interest into the positive part, while the negative part pays more attention to the complementary constituents.
Embedding Visualization. Since the most important change in our method is to use the disentangled rather than the original representation as the reference, we are interested in comparing the properties of these two representations on the target task. We visualize the original and disentangled feature representations of Flower-102 and MIT Indoor-67. The dimension of the features is reduced along the spatial direction, and the features are then plotted in 2D space using t-SNE embeddings. As illustrated in Figure 3, deep representations derived by our proposed disentangler are separated more clearly across categories and clustered more tightly within the same category than the original ones.

B. Shrinking Towards True Behavior
Recent work [5] discusses the connection between the proposed L2-SP and the classical statistical theory of shrinkage estimation [15]. The key hypothesis is that shrinking towards a value close to the "true parameters" is more effective than shrinking towards an arbitrary one. [5] argues that the starting point is supposed to be closer to the "true parameters" than zero. [8], [10] regularize the features rather than the parameters, which can be interpreted as shrinking towards the "true behavior". Our proposed TRED further improves on them by explicitly disentangling a "truer behavior", using the global distribution and supervision information of the target dataset. To support this claim, we provide some additional evidence as follows.
Reducing Untransferable Components.We  [12].This is intuitively consistent to the hypothesis that features relevant to the target task are disentangled and strengthened.
Robustness to Regularization Strength.We also provide an empirical evidence to illustrate the effect of "truer behavior" obtained by our proposed disentangler.The intuition is very straightforward that, if the behavior (representation) used as the reference is "truer", it is supposed to be more robust to the larger regularization strength.We compare with DELTA which uses the original representation as the reference.We

C. Mechanism Analysis
In this part, we briefly discuss about the necessity and relationship of the main components in our method.As our  purpose is to disentangle the knowledge which is relevant to the target task, the supervision from the target dataset is of course necessary.We will first discuss whether a simple supervision is enough to "disentangle" the relevant knowledge from the source model.Next we will explain why the component of reconstruction is essential to ensure the effectiveness.
Why disentanglement is useful.It seems reasonable to obtain the discriminative representation only using the classifier corresponding to L D ce .This is equivalent to perform a preadaptation upon the source model before fine-tuning.Unfortunately, such straightforward adaptation in a pure supervised manner is prone to over-fitting the limited target examples, as which the resultant representation is not adequate to serve as the prior knowledge.Encouraging the disentanglement between the relevant and irrelevant part, however, provides a distribution-level guarantee to simultaneously preserve the generalization capacity of the source model and adapt for the new task.That is achieved by restricting the integrity of the underlying data structure in a self-supervised way, e.g.disentanglement and reconstruction.To verify the hypothesis, we conduct an ablation study to compare the simpler framework without the disentanglement part, which performs direct transformation on the original representation.This version is denoted by TRED-.
We can observe in Table IV that, all evaluated tasks get significant performance drop on the target task without disentanglement.A reasonable guess is that, disentangling helps preserve knowledge in the source model and restrain the representation transformation from over-fitting the classifier C. To verify this hypothesis, we compare the top-1 accuracy of ImageNet classification (the source task) between TRED (we only use TRED-MMD here) and TRED-.Specifically, we train a random initialized classifier to recognize the category of ImageNet, using the fixed transformed representation as input.The top-1 accuracy of the pre-trained ResNet-101 model is 75.99%.As shown in Table IV, TREDgets more performance drops on ImageNet than TRED, indicating that representation disentanglement performs better in preserving the general knowledge learned by the source task.
The interdependence between disentanglement and re-construction.Disentanglement and reconstruction compose a complete self-supervised requirement.Without the force of disentanglement between the positive and negative part, the transformation will fall into the case of no disentanglement.Therefore, the reconstruction requirement alone is not capable of ensuring the preservation of the source knowledge of the learned positive part.Removing the reconstruction requirement results in an analogy situation, as the transformation can be arbitrary, pursuing to enlarge the discrepancy or independence between the two parts with no knowledge constraint.Intuitively, given that the positive part focus on the supervision information, the negative part can be easily learned to adapt either one (alone) of the two requirements.That causes both cases are almost equivalent to removing the entire disentangler.
In condition of the reconstruction guarantee, maximizing the attention discrepancy or minimizing the semantic independence between the positive and negative part encourages the two parts to focus on patterns related with different semantic concepts, with as less overlaps or interactions as possible.This is exact a form of representation disentanglement.For example the positive part is activated on heads and feathers of birds, while the negative part is activated on objects such as trees and wires as shown in Fig. 2.Moreover, this is achieved at a distribution level so that the semantic separation is consistent in the entire dataset.

VII. CONCLUSION
In this paper, we extend the study of catastrophic forgetting and negative transfer in inductive transfer learning. Specifically, we propose a novel approach, TRED, which regularizes the disentangled deep representation to achieve accurate knowledge transfer. We implement the target-awareness disentanglement by maximizing the Maximum Mean Discrepancy (MMD) on visual attentions and minimizing the Mutual Information (MI) on semantic features. Extensive experimental results on various real-world transfer learning datasets show that TRED significantly outperforms state-of-the-art transfer learning regularizers. Moreover, we provide empirical analysis to verify that the disentangled target-awareness representation is closer to the expected "true behavior" of the target task.

Figure 1: The architecture of deep transfer learning through Target-awareness Representation Disentanglement (TRED).

Figure 2: The effectiveness of representation disentanglement on CUB-200-2011 (left) and Stanford Cars (right). For each dataset, we select three typical cases for demonstration. In addition to the input image (a), we overlay a spatial attention map on the original image in columns (b), (c), and (d), using the input image and the corresponding representation of the last convolutional layer of ResNet-101. Specifically, (b) is the original representation generated by the ImageNet pre-trained model; (c) and (d) are the disentangled positive and negative parts produced by TRED.

Figure 3: Visualization of the original (a, c) and disentangled (b, d) feature representations by t-SNE. Different colors and markers are used to denote different categories.

Figure 4: Singular values of the original and disentangled deep representation.
compute singular eigenvectors and values of the deep representation by SVD. All singular values are sorted in descending order and plotted in Fig. 4. The authors of [12] demonstrate that the spectral components corresponding to smaller singular values are less transferable. They find that these less transferable components can be suppressed by involving more training examples. Interestingly, we find a similar trend with the proposed representation disentanglement. As observed in Fig. 4, the smaller singular values of the disentangled positive representation are further reduced compared with the original representation. Fig. 4 also shows that the spectral components corresponding to larger singular values are increased, which does not exist in

select three transfer learning tasks for evaluation, which are Places365 → MIT indoor-67, ImageNet → Stanford Cars and Places365 → Stanford Dogs. The regularization strength α is gradually increased from 0.001 to 1. As illustrated in Fig. 5, the performance of DELTA falls rapidly as α increases, especially on ImageNet → Stanford Cars and Places365 → Stanford Dogs, indicating that a regularizer using the original representations as the reference suffers seriously from negative transfer. In contrast, TRED is much more robust to increasing α.
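As a rough illustration of the singular-value comparison plotted in Fig. 4, the sketch below computes sorted singular values for two feature matrices. The data are assumptions: `original` is random, and `disentangled` is a hypothetical stand-in built by halving the smaller half of the spectrum, so the output only demonstrates the computation and does not reproduce the paper's measurements.

```python
import numpy as np

def sorted_singular_values(features):
    """Singular values of an (n_samples, n_dims) matrix, in descending order."""
    # np.linalg.svd already returns singular values sorted in descending order.
    return np.linalg.svd(features, compute_uv=False)

rng = np.random.default_rng(0)
original = rng.normal(size=(64, 16))  # stand-in for the original representation

# Hypothetical "disentangled" features: shrink the 8 smallest spectral components.
u, s, vt = np.linalg.svd(original, full_matrices=False)
s_shrunk = s.copy()
s_shrunk[8:] *= 0.5
disentangled = (u * s_shrunk) @ vt

sv_orig = sorted_singular_values(original)
sv_dis = sorted_singular_values(disentangled)
for rank, (a, b) in enumerate(zip(sv_orig, sv_dis), start=1):
    print(f"rank {rank:2d}: original {a:.3f}  disentangled {b:.3f}")
```

Plotting `sv_orig` and `sv_dis` against rank gives a curve of the same kind as Fig. 4, where a lower tail indicates suppressed small spectral components.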

Figure 5: Top-1 accuracy of transfer learning tasks under different regularization strengths α.

Table I: Comparison among different fine-tuning approaches.
* CF=Catastrophic Forgetting, NT=Negative Transfer, SPAR=Starting Point As the Reference.
Algorithm 1: The framework of TRED
Input: labeled target dataset $\{(x_i, y_i)\}_{i=1}^{n}$, model pre-trained on the source task $F_s$, representation disentangler $D$, classifier for disentangled representations $C$, and target model $F_t$.
Output: well-trained target model $F_t$.
// Representation Disentanglement
Set $(D, C)$ trainable and $F_s$ frozen;
while not converged do
    Sample mini-batch $B$: $\{(x_j, y_j)\}_{j=1}^{B}$;
    Calculate $(FM_{pos}, FM_{neg})$ by Eq. 5 given $(B, F_s, D)$;
    Calculate $L^{D}_{di}$ by Max-MMD (Eq. 8) or Min-MI (Eq. 12);
    Calculate $L = L^{D}_{di} + L^{D}_{re} + L^{D}_{ce}$ by Eq. 13-14;
    Update $D$ and $C$ with SGD by minimizing $L$;
end
Obtain the well-trained disentangler $D$;
// Fine-tuning
Set $(F_s, D)$ frozen and $F_t$ trainable;
while not converged do
    Sample mini-batch $B$: $\{(x_j, y_j)\}_{j=1}^{B}$

• The original representation should be recoverable by the disentangled parts. For a disentangling operation, neither of the disentangled parts should represent knowledge beyond the original representation. Otherwise, the transformation may result in non-generalizable features.
• The positive part should be discriminative on the target task. The aforementioned disentangler is usually symmetric, i.e., we can define either part as the "positive" one without external supervision. Therefore, an extra signal is needed to discriminate the positive part.

consists of 102 flower categories. 1020 images are used for training and 6149 images for testing. Only 10 samples are provided for each category during training.

Stanford Cars. The Stanford Cars [55] dataset contains 16,185 images of 196 classes of cars. The data is split into 8,144 training and 8,041 testing images.

Table IV: Top-1 accuracy (%) of the target and source task. TRED- refers to TRED without disentanglement. Pre-trained on Places365 and evaluated on ImageNet. *