Abstract
Domain adaptation (DA) is a popular strategy for pattern recognition and classification tasks. It leverages a large amount of data from the source domain to help train a model that is applied in the target domain. Supervised domain adaptation (SDA) approaches are desirable when only a few labeled samples from the target domain are available. They can be easily adopted in many real-world applications where data collection is expensive. In this study, we propose a new supervision signal, namely center transfer loss (CTL), to efficiently align features under the SDA setting in the deep learning (DL) field. Unlike most previous SDA methods, which rely on pairing up training samples, the proposed loss can be trained with one-stream input using the mini-batch strategy. CTL serves two main functions in training that increase the performance of DL models, i.e., aligning the domains and increasing the discriminative power of the features. CTL needs no hyper-parameter to balance these two functions, which is its second improvement over previous approaches. Extensive experiments on well-known public datasets show that the proposed method performs better than recent state-of-the-art approaches.
1 Introduction
Deep learning (DL) methods have been successfully applied to various areas, such as computer vision [1], brain-computer interface [2], and medical diagnosis [3]. The outstanding performance of DL approaches depends on large amounts of training data. However, it is sometimes difficult to acquire sufficient data to train a DL model for a specific task, since data recording and label annotation are costly and labor-intensive. One popular way to address this issue is domain adaptation (DA) [4]. Its main idea is to use available large-scale datasets in the source domain (\(\mathcal {D}_{s}\)) to assist the model training in the target domain (\(\mathcal {D}_{t}\)), where the training data is scarce.
According to [5, 6], DA can be unsupervised, semi-supervised, or supervised, depending on the availability of labeled data in \(\mathcal {D}_{t}\). Unsupervised domain adaptation (UDA) [7] only has access to unlabeled target data. In a semi-supervised DA (SSDA) scheme [8], both a small amount of labeled data and a considerable amount of unlabeled data are accessible. Alternatively, supervised domain adaptation (SDA) [9, 10] supposes that all available target samples are annotated, although their number is small. Sophisticated SDA approaches can usually outperform UDA and SSDA ones when the amount of available data in \(\mathcal {D}_{t}\) is small [6]. Annotating a small dataset is likely to be practical and does not require much effort. Therefore, SDA methods are more appealing if only very few samples from \(\mathcal {D}_{t}\) are accessible. They have been applied in many areas such as cross-subject EEG emotion classification [11], CT scan-based Covid-19 diagnosis [12], emotion detection from speech [13], and radar-based human activity recognition [14]. These applications only allow recording a small set of data from the target domain, as a massive data collection is either extremely expensive or impossible. Therefore, they require a suitable method that uses such a small number of samples in \(\mathcal {D}_{t}\) to generate a reliable model for the target domain. SDA is also known as few-shot domain adaptation [15], a name that more directly expresses the scenario of using few samples from \(\mathcal {D}_{t}\) in DA problems.
The typical way to implement an SDA approach in the DL community for a classification task is to learn a deep transformation that draws same-class samples close together, regardless of whether they come from \(\mathcal {D}_{s}\) or \(\mathcal {D}_{t}\). A popular strategy to perform such a mapping is to operate a two-stream network, i.e., a siamese network [16] or a correlated network [17]. The training process of these networks typically starts with either a sample-based [6, 10, 15, 18, 19] or a batch-based [9, 17] pair-wise input. However, such a pairing mechanism leads to a quadratic increase of the sample size from the original dataset and unavoidably results in redundancy, slow convergence, and unstable performance [20]. For example, one standard protocol [19, 21] of the MNIST → USPS domain adaptation task is to use 2000 labeled samples from MNIST and 70 labeled samples from USPS to train a model for the classification task in USPS. Several recent state-of-the-art (SOTA) methods [6, 10, 19] utilize a siamese network that trains the model with 56,000 pairs of samples (2000 × 70 × σ, where σ is the ratio that controls the redundancy and is equal to 0.4 in these studies). Training such a network becomes impractical when the sample size of either the source or the target dataset grows further.
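The growth in training-set size can be checked with a few lines of arithmetic. The sketch below uses hypothetical helper names and assumes, as the example above suggests, that the paired set keeps a fraction σ of all source-target pairs; it compares that count with the number of samples a single-stream loader would see when the two sets are simply pooled.

```python
def paired_set_size(n_source, n_target, sigma):
    """Number of source-target pairs kept when a fraction sigma of all pairs is used."""
    return round(n_source * n_target * sigma)


def single_stream_size(n_source, n_target):
    """Number of samples seen when source and target data are simply pooled."""
    return n_source + n_target


# MNIST -> USPS protocol from the text: 2000 source and 70 target samples, sigma = 0.4.
pairs = paired_set_size(2000, 70, 0.4)
pooled = single_stream_size(2000, 70)
print(pairs, pooled)  # 56000 2070
```

The paired count scales with the product of the two set sizes, whereas the pooled count scales with their sum.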
This study proposes a simple but efficient loss function, namely center transfer loss (CTL), to address the abovementioned issues and increase the discriminative power of deep learning features. Specifically, we learn a center (a vector with the same dimensionality as the feature) for the features of each class in each domain and update these centers during training. In addition, we minimize the distance between the features and their corresponding class centers in the opposite domain rather than the same domain. In other words, we minimize the distance between features of \(\mathcal {D}_{s}\) and class centers of \(\mathcal {D}_{t}\), as well as the distance between features of \(\mathcal {D}_{t}\) and class centers of \(\mathcal {D}_{s}\). For example, the features of class 1 in \(\mathcal {D}_{s}\) are pushed to the feature center of class 1 in \(\mathcal {D}_{t}\), and the features of class 1 in \(\mathcal {D}_{t}\) are pushed to the feature center of class 1 in \(\mathcal {D}_{s}\). Deep neural networks (DNNs) are trained under the joint supervision of softmax loss and CTL. Intuitively, the softmax loss ensures that features of different classes stay apart, while CTL pushes samples toward the class center of the opposite domain. Same-class samples from different domains eventually align through repeated center updates and distance minimization. More interestingly, CTL aligns the features at the beginning of training (see Fig. 1(b)). In the later stage, when the feature distributions of the two domains are sufficiently aligned, CTL instead decreases the intra-class variation of the features and increases their discriminative power (Fig. 1(c)). We do not need to set hyper-parameters to control the shift between the early and later training stages.
In sum, the proposed method has two major contributions in comparison with previous approaches.
1. It is very convenient to employ CTL in DNNs. Our DL models are trainable by the mini-batch strategy in a single-stream setting without running two-stream architectures. The issues of redundancy, slow convergence, and unstable performance can be largely avoided.
2. The learned features can achieve both domain alignment and intra-class variation minimization when the proposed CTL is used in model training. Although several previous SDA approaches (e.g., [6, 10, 19]) can also provide a similar outcome, a trade-off value must be manually set in these methods to balance the domain alignment and intra-class variation minimization. The optimal choice for the trade-off value varies across datasets and tasks, resulting in a labour-intensive exhaustive search each time with no guarantee of finding the best value. In contrast, our CTL achieves both functionalities in different stages of training without the need to set a trade-off value to balance them.
To verify the effectiveness of our approach, we conduct extensive experiments on common DA benchmarks. The results show that our method achieves a better performance than current SOTAs. The remainder of this article is organized as follows. Section 2 introduces the related works. Section 3 explains the proposed method in detail. Section 4 presents the experiment protocols and results. The conclusion is drawn in Section 5.
2 Related works
SDA approaches focus on the specific scenario in which labeled target data are available in training, albeit with very few samples per class. There are diverse SDA strategies targeting different types of tasks, e.g., regression [22, 23], object detection [24, 25], and classification. Our study focuses on SDA methods for the classification problem.
Early SDA approaches for the classification task depend on matrix-based mappings between domains and linear classifiers. Zhou et al. [26] proposed an SDA method called SHFR-ECOC, which constructs a sparse feature transformation matrix to obtain domain-invariant features as inputs to a linear SVM classifier. Sukhija et al. [27] explored another SDA strategy that also learns a sparse feature transformation for feature generation but uses a random forest classifier instead. DL models have developed rapidly in recent years. They offer an end-to-end approach to the classification task and have naturally attracted growing research interest.
In [18], source and sparsely labeled target data are used to train a siamese network. The network learns domain-invariant features using a soft label distribution matching loss. Similarly, Motiian et al. [6] proposed a method called CCSA, also based on the siamese architecture. The model was trained under the joint supervision of categorical cross-entropy (i.e., softmax loss) and the point-wise contrastive loss introduced in [28]. The authors found that their method provided fast convergence and better classification accuracy. The same research group proposed another SDA method (i.e., FADA [15]) that uses adversarial training to attain feature alignment. They carefully designed four kinds of paired data that the discriminator in the network is augmented to distinguish. In the same year, Koniusz et al. [17] also presented a strategy, namely So-HoT, using a mixture of second- and/or third-order scatter alignment measures between source and target domains. They aim to align the within-class scatters of a two-stream network to a certain degree using a bespoke loss while keeping a good separation of the between-class scatters.
More recently, another SDA method, called Domain Adaptation using Stochastic Neighborhood Embedding (d-SNE), was proposed by Xu et al. [19]. Interestingly, this approach focuses only on minimizing the largest distance among same-class pairs across the source and target domains and maximizing the smallest distance among different-class pairs. Alternatively, Hedegaard et al. [10, 29] utilized the graph embedding technique to learn a domain-invariant and semantically meaningful feature space. Similar to CCSA, a siamese network was also used in this study as a feature generator. The generated features are finally fed into a linear discriminant analysis (LDA) classifier to perform the prediction. In addition, Tong et al. [9] presented a mathematical framework (MF) that considered DA as a convex optimization problem. This MF quantifies the transferability in transfer learning problems based on the number of samples, model complexity, and Chi-square distance between source and target tasks. The authors also designed an SDA approach using this framework and achieved encouraging performance on the DA benchmarks. Nevertheless, training the SDA methods above requires paired samples from \(\mathcal {D}_{s}\) and \(\mathcal {D}_{t}\). In contrast, our method can be trained with single-stream data using the mini-batch strategy.
In addition, another study similar to ours is the center loss [20], which calculates the class centers using the data from both source and target domains. Given that the number of source samples is much larger than that of the target domain, the computation of the class centers is dominated by the source samples. Thus, the model trained by the center loss [20] only directly pulls the scarce samples of \(\mathcal {D}_{t}\) to the centers of \(\mathcal {D}_{s}\) without accounting for the feature distribution of \(\mathcal {D}_{t}\). In general, the center loss rests on a different insight from ours and cannot address SDA classification tasks effectively.
3 Methodology
In this work, we are only concerned with the specific SDA problem in which a large-scale dataset in \(\mathcal {D}_{s}\) and very few annotated samples in \(\mathcal {D}_{t}\) are available. In other words, the full dataset \(\mathcal {D}_{all} = \{\mathbf {x}_{i},y_{i}\}^{m+n}_{i=1}\) combines the samples from the source domain \(\mathcal {D}_{s} = \{{\mathbf {x}^{s}_{i}},{y^{s}_{i}}\}^{m}_{i=1}\) and those from the target domain \(\mathcal {D}_{t} = \{{\mathbf {x}^{t}_{i}},{y^{t}_{i}}\}^{n}_{i=1}\). Here \(\mathbf {x}_{i} \in \mathbb {R}^{d}\) denotes the \(i\)th feature vector in \(d\)-dimensional space, and \(y_{i}\) is the corresponding label of \(\mathbf {x}_{i}\), taking one of \(a\) classes. The features \({\mathbf {x}^{s}_{i}}\) and \({\mathbf {x}^{t}_{i}}\) can be regarded as realizations of the random variables \(X^{s}\) and \(X^{t}\), which represent the source and target domains, respectively. Note that \(m \gg n\) in the SDA scenario. In the absence of domain shift, we can simply train a DNN model using all the data directly from \(\mathcal {D}_{all}\) with the softmax loss defined as
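Assuming the standard mini-batch softmax cross-entropy, with the symbols defined just below, (1) can be written as follows. The name \(\mathcal {L}_{S}\) and the choice of summing rather than averaging over the batch are assumptions.

\[
\mathcal {L}_{S} = - \sum _{i=1}^{k} \log \frac {e^{\mathbf {W}_{y_{i}}^{\top } \mathbf {x}_{i} + b_{y_{i}}}}{\sum _{j=1}^{a} e^{\mathbf {W}_{j}^{\top } \mathbf {x}_{i} + b_{j}}} \qquad (1)
\]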
where \({\mathbf {W}}_{j} \in \mathbb {R}^{d}\) represents the \(j\)th column of the weights \({\mathbf {W}} \in \mathbb {R}^{d\times a}\) in the last dense layer (which takes the features as input), \({\mathbf {b}} \in \mathbb {R}^{a}\) denotes the bias term, and \(k\) is the size of a mini-batch. However, the distributions of the two domains usually differ, so directly using \({\mathbf {x}^{s}_{i}}\) to train a classifier for \(\mathcal {D}_{t}\) is problematic. An alignment of features between the domains is therefore necessary. In the UDA setting [7], labels are assumed to be unavailable in \(\mathcal {D}_{t}\). A common strategy for domain alignment is to introduce a loss on the distance between the marginal distributions of \(\mathcal {D}_{s}\) and \(\mathcal {D}_{t}\) (i.e., \(p(X^{s})\) and \(p(X^{t})\)) as follows.
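Written out with a placeholder name \(\mathcal {L}_{DA}\) (the symbol is an assumption; the text fixes only the two arguments of \(D\)), (2) reads:

\[
\mathcal {L}_{DA} = D\big (p(X^{s}),\, p(X^{t})\big ) \qquad (2)
\]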
where D(⋅,⋅) is a metric between two distributions; once the distributions are aligned, one can no longer tell whether a feature comes from the source or the target domain. UDA methods have a natural limitation: even if the marginal distributions are perfectly aligned, there is no guarantee that features belonging to the same class but to different domains are mapped into the same region of the feature space. Such an alignment may therefore offer little benefit to a DNN model in a classification task. In our setting, however, labeled data from the target domain are available, albeit in a small amount. It is thus practical to achieve a better alignment, in which features from different domains but with the same class label are mapped to nearby distributions, by amending (2) as:
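One way to write this class-conditional version, assuming the distances are summed over the \(a\) classes and using a placeholder loss name \(\mathcal {L}_{SDA}\) that does not come from the text:

\[
\mathcal {L}_{SDA} = \sum _{c=1}^{a} D\big (p(X^{s} \mid Y = c),\, p(X^{t} \mid Y = c)\big ) \qquad (3)
\]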
Now, the core challenge is to find an appropriate metric D(⋅,⋅). To this end, we minimize the Euclidean distance between features and the corresponding class centers of the opposite domain. Mathematically, (3) can be reformulated as:
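A sketch of (4), assuming squared Euclidean distances and no constant factor (both are assumptions; the indices follow the definitions in the next paragraph, and \(\mathcal {L}\) is a placeholder name):

\[
\mathcal {L} = \sum _{i=1}^{m} \big \Vert \mathbf {x}_{i}^{s} - \mathbf {c}_{y_{i}}^{t} \big \Vert _{2}^{2} + \sum _{j=1}^{n} \big \Vert \mathbf {x}_{j}^{t} - \mathbf {c}_{y_{j}}^{s} \big \Vert _{2}^{2} \qquad (4)
\]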
where \({\mathbf {c}}_{y_{i}}^{t} \in \mathbb {R}^{d}\) denotes the center of class \(y_{i}\) for the features in \(\mathcal {D}_{t}\), and \({\mathbf {c}}_{y_{j}}^{s} \in \mathbb {R}^{d}\) represents the center of class \(y_{j}\) for the features in \(\mathcal {D}_{s}\). An intuitive example of (4) is illustrated in the “Early stage in training” panel of Fig. 2. Note that \(m\) is usually much larger than \(n\) in the SDA setting. In this case, the first half of (4) usually dominates the loss function, and the contribution of the samples in \(\mathcal {D}_{t}\) to the aligned latent space is neglected. To address this issue, we take the mean of the distances in each domain instead of summing them up, as follows
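Under the same assumption of squared distances, replacing each sum by a mean gives the form below; the name \({\mathscr{L}}_{F-CTL}\) is the one used in the text.

\[
{\mathscr{L}}_{F-CTL} = \frac {1}{m} \sum _{i=1}^{m} \big \Vert \mathbf {x}_{i}^{s} - \mathbf {c}_{y_{i}}^{t} \big \Vert _{2}^{2} + \frac {1}{n} \sum _{j=1}^{n} \big \Vert \mathbf {x}_{j}^{t} - \mathbf {c}_{y_{j}}^{s} \big \Vert _{2}^{2} \qquad (5)
\]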
Ideally, the calculations of \({{\mathscr{L}}}_{F-CTL}\) and the feature centers \(\mathbf {c}\) should take all features of the whole training set into account. However, due to the large sample size of the source training set and limited memory, such an implementation is impractical. We therefore update the centers and features on a mini-batch basis. \({\mathscr{L}}_{F-CTL}\) can then be rewritten as (6).
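With the per-domain batch counts \(k_{s}\) and \(k_{t}\), the mini-batch version would then take the following form, again assuming squared Euclidean distances:

\[
{\mathscr{L}}_{CTL} = \frac {1}{k_{s}} \sum _{i=1}^{k_{s}} \big \Vert \mathbf {x}_{i}^{s} - \mathbf {c}_{y_{i}}^{t} \big \Vert _{2}^{2} + \frac {1}{k_{t}} \sum _{j=1}^{k_{t}} \big \Vert \mathbf {x}_{j}^{t} - \mathbf {c}_{y_{j}}^{s} \big \Vert _{2}^{2} \qquad (6)
\]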
where \(k_{s}\) and \(k_{t}\) are the numbers of samples from \(\mathcal {D}_{s}\) and \(\mathcal {D}_{t}\) in the mini-batch, respectively. Equation (6) (\({{\mathscr{L}}}_{CTL}\)) is the final form of the proposed center transfer loss (CTL) based on the mini-batch update strategy. The equation shows that CTL can be optimized with one-stream input, without relying on the two-stream network used in previous methods [6, 10, 15, 18, 19]. The training of a two-stream network requires pairing up samples, leading to a quadratic increase in the sample size from the original dataset. Our loss is based on one-stream training and avoids this problem. In addition, CTL contributes to domain alignment at the initial training stage and increases the discriminative power of features afterwards by minimizing the intra-class variation (Fig. 2). Although previous methods introduce similar losses that also achieve domain alignment and increase discriminative power, they require a manually set trade-off value to balance these two functionalities in training. In contrast, the proposed loss, as the equation shows, does not require such a manual trade-off value.
Some centers may not be updated in each iteration of training, as the training is conducted by the mini-batch strategy. The updating equations of the class centers \(\mathbf {c}^{t}\) and \(\mathbf {c}^{s}\) are formulated as:
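A plausible form, modeled on the center-loss update of [20] with \(\rho \) in the denominator, is given below. The assumption here is that each domain's centers are updated only from features of that same domain in the current mini-batch.

\[
\Delta \mathbf {c}_{j}^{s} = \frac {\sum _{i=1}^{k_{s}} \delta (y_{i}^{s} = j)\, (\mathbf {c}_{j}^{s} - \mathbf {x}_{i}^{s})}{\rho + \sum _{i=1}^{k_{s}} \delta (y_{i}^{s} = j)}, \qquad \Delta \mathbf {c}_{j}^{t} = \frac {\sum _{i=1}^{k_{t}} \delta (y_{i}^{t} = j)\, (\mathbf {c}_{j}^{t} - \mathbf {x}_{i}^{t})}{\rho + \sum _{i=1}^{k_{t}} \delta (y_{i}^{t} = j)},
\]
\[
\mathbf {c}_{j}^{s} \leftarrow \mathbf {c}_{j}^{s} - \alpha \, \Delta \mathbf {c}_{j}^{s}, \qquad \mathbf {c}_{j}^{t} \leftarrow \mathbf {c}_{j}^{t} - \alpha \, \Delta \mathbf {c}_{j}^{t} \qquad (7)
\]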
where δ(⋅) is the indicator function, which is equal to 1 if the condition is satisfied and 0 otherwise. ρ is a small constant (i.e., \(10^{-5}\)) that prevents division by zero. A scalar α ∈ (0,1] controls the learning rate of the centers and mitigates the negative impact of noisy samples. We adopt the joint supervision of softmax loss and CTL to train the DNN models for DA in the classification task. The objective function is given as follows.
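Given that λ weights CTL against the softmax loss, the objective presumably has the additive form below; the symbol \(\mathcal {L}_{S}\) for the softmax loss is an assumption.

\[
\mathcal {L} = \mathcal {L}_{S} + \lambda \, {\mathscr{L}}_{CTL} \qquad (8)
\]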
where λ is a trade-off scalar that balances these two losses and ranges from 0 to 5 in our experiments. We summarize the training strategy based on the mini-batch in Algorithm 1.
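The per-batch computation behind this training strategy can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: it assumes squared Euclidean distances, assumes that each domain's centers are updated only from same-domain features in the batch, and uses hypothetical function names and a hypothetical `centers[domain][class]` layout. The softmax term and the gradient step on the network are omitted.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def ctl_loss(feats, labels, domains, centers):
    """Mini-batch CTL: compare each feature with the center of its class in the
    opposite domain, then average the distances separately per domain."""
    opposite = {"s": "t", "t": "s"}
    total = 0.0
    for dom in ("s", "t"):
        dists = [sq_dist(x, centers[opposite[dom]][y])
                 for x, y, d in zip(feats, labels, domains) if d == dom]
        if dists:  # a batch may contain no sample from one domain
            total += sum(dists) / len(dists)
    return total


def update_centers(centers, feats, labels, domains, alpha=0.5, rho=1e-5):
    """Return new centers; each center moves toward the same-domain batch
    features of its class. Centers of classes absent from the batch stay put."""
    new = {dom: [list(c) for c in centers[dom]] for dom in centers}
    for dom in centers:
        for cls, c in enumerate(centers[dom]):
            members = [x for x, y, d in zip(feats, labels, domains)
                       if d == dom and y == cls]
            for q in range(len(c)):
                delta = sum(c[q] - x[q] for x in members) / (rho + len(members))
                new[dom][cls][q] = c[q] - alpha * delta
    return new


# Toy batch: two source samples (classes 0 and 1) and one target sample (class 0).
centers = {"s": [[0, 0], [0, 0]], "t": [[1, 0], [0, 1]]}
feats = [[0, 0], [0, 1], [1, 0]]
labels = [0, 1, 0]
domains = ["s", "s", "t"]
print(ctl_loss(feats, labels, domains, centers))  # 1.5
centers = update_centers(centers, feats, labels, domains)
```

Because the loss only needs each sample's label and domain tag, a single pooled mini-batch is enough; no source-target pairs are formed.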
4 Experimental protocols and results
We evaluate the proposed loss function on several common SDA benchmarks, including Office31 [30], Office-Caltech-10 [31], Office-Home [32], and digit transfer (MNIST [33], USPS [34], SVHN [35], and MNIST-M [36]). The visualization of features and sensitivity analyses on λ and α are also presented. The impact of the batch size on the effectiveness of CTL is examined in the last part of this section. The source code for the experiments is publicly available (Footnote 1). All data generated or analysed during this study are included.
4.1 Office31
Office31 is a classical benchmark collected for the evaluation of DA methods. It comprises 31 object categories from three separate domains, namely Amazon (\({\mathcal {A}}\)), Webcam (\({\mathcal {W}}\)), and DSLR (\(\mathcal {D}\)). Amazon is the largest domain and contains 2,817 images. Webcam and DSLR are relatively compact and have 795 and 498 images, respectively. Examples of images in this dataset are shown in Fig. 3.
Our experiments follow the setting used in [9]. Six domain shifts, including \({\mathcal {A}} \rightarrow {\mathcal {D}}\), \({\mathcal {A}} \rightarrow {\mathcal {W}}\), \({\mathcal {W}} \rightarrow {\mathcal {A}}\), \({\mathcal {W}} \rightarrow {\mathcal {D}}\), \({\mathcal {D}} \rightarrow \mathcal {A}\), and \({\mathcal {D}} \rightarrow {\mathcal {W}}\), are examined in this dataset. All classes of the dataset and a validation scheme with five train-test splits are used in the experiments. For the source domain, 20 samples per class are randomly chosen when the source is \(\mathcal {A}\), and 8 samples per class when the source is \(\mathcal {W}\) or \(\mathcal {D}\), to train the model for each split. For the target domain, 3 samples per class are randomly selected for training in each split. The remaining target samples are used for testing.
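The per-class sampling in this protocol can be sketched as follows. The helper names, the seeding scheme, and the use of integer labels are illustrative assumptions; the actual split files used in [9] are not reproduced here.

```python
import random
from collections import defaultdict


def sample_per_class(labels, per_class, seed):
    """Pick `per_class` random sample indices for every class label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    chosen = []
    for y in sorted(by_class):
        chosen.extend(rng.sample(by_class[y], per_class))
    return sorted(chosen)


def target_split(labels, per_class=3, seed=0):
    """Target domain: a few labeled samples per class for training, the rest for testing."""
    train = sample_per_class(labels, per_class, seed)
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return train, test


# Toy target domain with 31 classes and 10 images per class (sizes are made up).
toy_labels = [c for c in range(31) for _ in range(10)]
train_idx, test_idx = target_split(toy_labels, per_class=3, seed=0)
print(len(train_idx), len(test_idx))  # 93 217
```

Calling `sample_per_class` with `per_class=20` or `per_class=8` on the source labels, and repeating with five different seeds, would give the five splits described above.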
The convolutional layers of VGG16 [37] followed by two dense layers with output sizes of 1024 and 128, respectively, are used as the base network in the experiment. We use this architecture for a fair comparison with most published SDA approaches. The weights of the convolutional layers were pre-trained on ImageNet [38]. Those of the dense layers are randomly initialized. All images are resized to 224 × 224, followed by normalization. The learning rates for the convolutional and dense layers are set to 0.001 and 0.01, respectively. The size of the mini-batch is 32. λ and α are fixed at 0.1 and 0.5, respectively. We compare our method with recent supervised domain adaptation SOTAs, including SDADT [18], CCSA [6], FADA [15], d-SNE [19], DAGE-LDA [10], and MF [9]. Three baselines are also involved in the comparison: (1) Model 1, trained on only source data using softmax loss; (2) Model 2, trained on source data and target samples using softmax loss; and (3) Model 3, trained on source data and target samples using a joint supervision of softmax loss and center loss. All baselines use the same backbone (i.e., VGG16) as the proposed method. We also compare our strategy with another SOTA approach, So-HoT [17]. Although the original So-HoT paper reports the model performance using VGG16 on Office31, it only covers two cross-domain schemes (i.e., \(\mathcal {A} \rightarrow \mathcal {D}\) and \(\mathcal {A} \rightarrow \mathcal {W}\)). However, the paper reports the performance of AlexNet [39] on all six schemes. Therefore, the proposed method based on AlexNet is also implemented for a fair comparison with So-HoT.
Table 1 shows the performance of different models on the Office31 dataset. Our experiments show that most SDA methods outperform Models 1 and 2, the baselines trained using the softmax loss without DA. It is worth noting that several previous studies, such as [6, 15, 19], only compare their methods with a weak baseline (i.e., Model 1, trained on only source data using softmax loss). They claim that SDA methods achieve significant improvements (over 15%) in the classification task of the Office31 dataset compared to this baseline. This may overstate the effectiveness of SDA methods. A stronger and fairer baseline, i.e., Model 2, trained on source and target data using softmax loss, should also be involved in the comparison. In Table 1, we can observe that SDA approaches improve the classification performance by less than 7% over the stronger baseline (Model 2), consistent with the findings in [10, 17]. Nevertheless, it is encouraging to see that our model performs better than other recent SOTAs and baselines on the Office31 dataset.
4.2 Office-Caltech-10
The Office-Caltech-10 dataset contains the ten categories (i.e., backpack, bike, calculator, headphones, keyboard, laptop-computer, monitor, mouse, mug, and projector) shared by Office31 and Caltech-256 [40]. The dataset contains four domains, namely Amazon (958 images; \(\mathcal {A}\)), Webcam (295 images; \(\mathcal {W}\)), DSLR (157 images; \(\mathcal {D}\)), and Caltech (1123 images; \(\mathcal {C}\)). Therefore, 12 cross-domain tasks can be formulated in this data collection. The same split-generation protocol as in the Office31 experiments is used but only applied to the ten classes above. Following the settings in [9], we implement our experiments using DeCAF-fc6 features [41] as model inputs.
Following the base architecture in [6, 9], we utilize two dense layers with output sizes of 1024 and 128 with PReLU activation as the feature embeddings and one fully-connected layer as a classifier. The learning rate is 0.001. The size of the mini-batch is 32. λ and α are fixed at 0.1 and 0.5, respectively. We compare the proposed method with CCSA, d-SNE, DAGE-LDA, MF, and three baselines (Models 1, 2, and 3 mentioned in the Office31 experiment). We either report the results from the original publications or run these recent SOTA methods using their open-source code.
Table 2 shows the performance of models using the DeCAF-fc6 features on the ten categories of the Office-Caltech-10 database. Again, our method gains a higher accuracy than other SDA approaches and baseline models on both Within-Office31 and Office-Caltech DA tasks. This result shows the advantage of the proposed strategy in DA classification tasks.
4.3 Office-Home
Office-Home [32] is a relatively large-scale dataset for DA experiments. It contains 15500 images with 65 classes. The dataset has four different domains, i.e., Art (\(\mathcal {A} r\)), Clip Art (\(\mathcal {C} a\)), Product (\(\mathcal {P} r\)), and Real World (\(\mathcal {R} w\)). Thus, it contains 12 (4 × 3) different DA tasks.
The “S+T” evaluation protocol in [42] is implemented in our experiments. Specifically, labeled source images and three labeled target images per class are used to train the model in each DA task. The exact data splits can be found via the link in Footnote 2. Following [42], AlexNet [39] pre-trained on ImageNet is used in our experiments. All images are resized to 227 × 227, followed by normalization. The size of the mini-batch is 32. λ and α are fixed at 0.001 and 0.5, respectively. We compare the proposed method with three recent SDA methods, i.e., CCSA [6], d-SNE [19], and DAGE-LDA [29]. The experiments with these three methods are conducted using their open-source code. The three baselines (Models 1, 2, and 3) described in Section 4.1 are also included in the comparison.
Table 3 presents the classification performance on twelve DA tasks of the 65-class Office-Home dataset. The proposed method achieves the highest accuracy in most DA tasks. It also obtains the best average accuracy across different DA tasks.
4.4 Digit transfer
The digit transfer dataset collection is also widely used to study the effectiveness of SDA approaches. We use four datasets, all containing digits from 0 to 9. These datasets include MNIST (\({\mathscr{M}}\)), USPS (\(\mathcal {U}\)), SVHN (\(\mathcal {S}\)), and MNIST-M (\({\mathscr{M}} {\mathscr{M}}\)). MNIST consists of 70,000 28 × 28 grayscale images; USPS contains 11,000 grayscale images with a 16 × 16 resolution; SVHN is a real-world image dataset having 99,280 RGB images extracted from street view house numbers; MNIST-M has 68,002 RGB images generated from MNIST by adding different backgrounds.
4.4.1 First experiment
The evaluation protocol in [10, 19] is followed in this experiment. We investigate the transfers from \({{\mathscr{M}}}\) to \({{\mathscr{M}}} {{\mathscr{M}}}\), between \({{\mathscr{M}}}\) and \({\mathcal {U}}\), and between \({{\mathscr{M}}}\) and \(\mathcal {S}\). The original train-test splits of the datasets are used in our experiments. For the target domain, we randomly select 10 samples per class from the training split. The evaluation is repeated five times.
We use the same architecture as LeNets++ [20], following [19]. Pre-processing techniques, including resizing, normalization, and RGB-to-grayscale conversion, are applied if necessary. The learning rate for the parameters of the network is 0.001. The size of the mini-batch is 64. λ and α are fixed at 0.75 and 0.5, respectively. We compare the proposed method with CCSA [6], d-SNE [19], DAGE-LDA [29], and three baselines (Models 1, 2, and 3 mentioned in the Office31 experiment). The details of the architecture are given in Fig. 4.
As shown in Table 4, our method outperforms the other SOTAs and baselines. We notice that the proposed model performs similarly to d-SNE when the domain shift is small, i.e., in the cross-domain tasks \({\mathscr{M}} \rightarrow {\mathscr{M}} {\mathscr{M}}\), \({\mathscr{M}} \rightarrow \mathcal {U}\), \(\mathcal {U} \rightarrow {\mathscr{M}}\), and \(\mathcal {S} \rightarrow {\mathscr{M}}\). In the \({\mathscr{M}} \rightarrow \mathcal {S}\) condition, which has a relatively large domain shift, our method demonstrates an evident advantage.
4.4.2 Second experiment
It is also interesting to see how the performance of the models varies with an even smaller number of samples per class from the target domain. Therefore, we adopt another evaluation protocol from [6, 19, 29], applied to the MNIST and USPS datasets. This protocol examines both \({\mathscr{M}} \rightarrow \mathcal {U}\) and \(\mathcal {U} \rightarrow {\mathscr{M}}\) cross-domain tasks, where 2000 and 1800 images are randomly selected from MNIST and USPS, respectively. In addition, a small number (N) of labeled samples per category are randomly picked from the target domain and used in the model training. The evaluation is repeated ten times for each N from 0 to 7. We use the same data splits generated in [6] (Footnote 3).
The implementation details are the same as those in the first experiment of digit transfer above. We compare the proposed method with CCSA [6], FADA [15], d-SNE [19], and DAGE-LDA [29]. As reported in [10], there are discrepancies between the network architectures described in the publications and those in the public source code of CCSA and d-SNE. Furthermore, the differences between the results reported in the original articles and those reproduced by [10, 43] are also relatively large for CCSA and d-SNE. Therefore, for a fair comparison, we change the network architectures of CCSA, d-SNE, and DAGE-LDA to LeNets++ and rerun the experiments using their publicly available code. To the best of our knowledge, the source code of FADA has not been released, so we directly report the results from the original publication. Two baselines, Model 2 and Model 3 mentioned in the Office31 experiment, are also included in the comparison. Models 2 and 3 are trained on only source data when N = 0.
Table 5 shows the average classification accuracies for different approaches on the MNIST-USPS collection. The standard deviations over the ten splits are minor for all methods, so we do not report them in the table. In our experiments, SDA-based approaches (except DAGE-LDA) clearly achieve better performance than the baselines (Models 2 and 3), which do not incorporate distribution alignment in training. We also plot the performance of the proposed model against Models 2 and 3 on the \({\mathscr{M}} \rightarrow \mathcal {U}\) task with different numbers N of samples per class from \(\mathcal {D}_{t}\) in Fig. 5(a). We observe that the proposed strategy significantly improves the classification performance over the baselines when N is small. As N increases, the accuracy and the improvement gradually converge.
More importantly, Table 5 shows that our model outperforms other SDA methods in most scenarios, except that it has a slightly lower classification accuracy than FADA when N = 5. The proposed approach improves accuracy by at least 1% and 1.5% against other SOTA methods when N = 7. When N = 1, the proposed method loses its computational advantage against other SDA methods, as the number of pair-up samples is equal to the number of original samples. However, we still recognize an accuracy-wise superiority of our strategy. The superiority may come from the improvement of the discriminative power of the features by decreasing the intra-class variation in addition to the feature alignment between domains during the training.
4.5 Visualization of deep learning features
We visualize the deep features of models trained with different N (N = 0, 1, 4, or 7) on \({\mathcal {U}} \rightarrow {{\mathscr{M}}}\) to understand the proposed CTL better. We train the model using only softmax loss when N = 0, and using the joint supervision of softmax loss and CTL otherwise. The models are trained on a random draw of 1800 samples from USPS and N samples per class from MNIST. Other settings are the same as those in the digit transfer experiments. The visualization is performed on another random draw of 1800 and 2000 samples from USPS and MNIST, respectively, to avoid visualizing training data. We apply the t-SNE technique [44] to project the high-dimensional features into 2-D vectors for easy illustration (Fig. 6).
Figure 6(a) shows that the features are not well aligned if no adaptation mechanism is involved. In contrast, features with the same label but from different domains stay close to each other even when N = 1, as shown in subfigure (b). A comparison of subfigures (b), (c), and (d) shows that the alignment obtained with the proposed CTL improves further as N increases. We can still observe discrepancies between the features of \(\mathcal {D}_{s}\) and \(\mathcal {D}_{t}\) in several classes, e.g., classes 0 and 6, when N is equal to 1 or 4. However, the feature distributions of \(\mathcal {D}_{s}\) and \(\mathcal {D}_{t}\) nearly overlap when N = 7, as demonstrated in subfigure (d).
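The projection step behind this kind of visualization can be sketched as follows. This is a minimal illustration, assuming scikit-learn's `TSNE` as the t-SNE implementation; the random arrays are placeholders standing in for the deep features of held-out source and target samples, and the array sizes and `perplexity` value are arbitrary choices, not the settings used for Fig. 6.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Placeholder features (assumption): random vectors standing in for the
# deep features of held-out source and target samples.
source_feats = rng.normal(size=(90, 32))
target_feats = rng.normal(size=(60, 32))

# Embed both domains jointly so that they share one 2-D coordinate system;
# embedding them separately would make the two point clouds incomparable.
all_feats = np.vstack([source_feats, target_feats])
embedded = TSNE(n_components=2, perplexity=20.0, random_state=0).fit_transform(all_feats)

# Split the joint embedding back into the two domains for plotting.
source_2d = embedded[: len(source_feats)]
target_2d = embedded[len(source_feats):]
print(source_2d.shape, target_2d.shape)
```

Plotting `source_2d` and `target_2d` with different markers and coloring the points by class label would give a figure in the style described above.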
4.6 Sensitivity analysis on λ and α
The hyper-parameters λ and α control the adaptation rate between domains and the negative impact of noisy samples, respectively. Both are important to the training of the DNN model. Therefore, we carry out analyses to examine the sensitivity of the model to them.
The analyses consider four cross-domain tasks, i.e., \({\mathcal {W}} \rightarrow {\mathcal {A}}\) (Office-31), \({\mathcal {C}} \rightarrow {\mathcal {A}}\) (Office-Caltech-10), \({\mathcal {A}} r \rightarrow {\mathcal {C}} a\) (Office-Home), and \({\mathcal {U}} \rightarrow {\mathscr{M}}\) (Digit transfer). The experimental protocols of \(\mathcal {W} \rightarrow \mathcal {A}\), \(\mathcal {C} \rightarrow \mathcal {A}\), and \(\mathcal {A} r \rightarrow \mathcal {C} a\) are the same as those described in previous sections, except that the values of λ and α are not fixed but vary in this sensitivity analysis. For the \(\mathcal {U} \rightarrow {\mathscr{M}}\) task, we sample 1800 images from USPS and 3 images per class from MNIST to form a training set. The evaluation is performed on 2000 samples randomly drawn from MNIST, excluding the samples that have been selected for training. The implementation details are the same as those in Section 4.4.2, except that the values of λ and α vary in this analysis.
First, we fix the center step α at 0.5 and vary λ to train different models. As CTL is based on the \(l_2\)-norm, the dimensionality of the deep feature is positively related to the loss value, i.e., a higher dimension leads to a larger CTL. The dimensionalities of the deep features differ across the four cross-domain tasks. Therefore, we test different ranges of λ values for different tasks as follows.
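The link between feature dimensionality and the magnitude of an \(l_2\)-based loss can be illustrated with a small simulation. The sketch below is only an illustration under simplifying assumptions: feature coordinates are drawn independently with zero mean and unit variance, and the class center sits at the origin, so the expected squared distance to the center equals the dimension. The function name `mean_squared_l2` is hypothetical.

```python
import random

def mean_squared_l2(dim, n_samples=200, seed=0):
    """Average squared l2 distance between random features and a center at the origin."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Assumption: independent, zero-mean, unit-variance coordinates.
        total += sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim))
    return total / n_samples

# The average squared distance grows roughly in proportion to the dimension.
for dim in (64, 256, 1024):
    print(dim, round(mean_squared_l2(dim), 1))
```

Under these assumptions, a feature space with 16 times more dimensions produces a loss value about 16 times larger, which is one way to see why a smaller λ is appropriate for higher-dimensional features.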
1. \({\mathcal {W}} \rightarrow {\mathcal {A}}\): {0, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1}
2. \({\mathcal {C}} \rightarrow {\mathcal {A}}\): {0, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1}
3. \({\mathcal {A}} r\rightarrow {\mathcal {C}} a\): {0, 0.00025, 0.0005, 0.00075, 0.001, 0.0025, 0.005, 0.0075, 0.01}
4. \({\mathcal {U}} \rightarrow {{\mathscr{M}}}\): {0, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5}
Figure 7 shows the evaluation accuracies of the proposed method with different values of λ. Using the softmax loss alone (λ = 0) is not a good option: the DNN models then have the lowest average classification accuracy. In all four tasks, the model performance follows a “Log” curve as λ varies over the investigated ranges. This shows that our model is insensitive to the value of λ over a relatively large range.
The DNN models achieve the best performance when λ is equal to 0.1, 0.1, 0.001, and 0.75 for the four evaluated tasks, respectively. We then fix λ at these values and train DNN models with different values of α from 0.01 to 1 (the same range for all tasks). The evaluation accuracies of these models for different values of α are shown in Fig. 8. The performance of these models is also stable across a wide range of α, i.e., from 0.2 to 1.
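The two-stage procedure described in this subsection (sweep λ with α fixed at 0.5, then sweep α with the best λ fixed) can be sketched as follows. This is a hedged illustration, not the authors' code: `two_stage_search` and `toy_score` are hypothetical names, `toy_score` is a made-up stand-in for "train a model and return its accuracy" that carries no experimental meaning, and the α grid is an assumed example because the text only gives the range 0.01 to 1.

```python
def two_stage_search(train_and_eval, lambdas, alphas, alpha_init=0.5):
    """Sweep lambda with alpha fixed, then sweep alpha with the best lambda fixed."""
    # Stage 1: fix the center step alpha and vary the loss weight lambda.
    score_by_lambda = {lam: train_and_eval(lam, alpha_init) for lam in lambdas}
    best_lambda = max(score_by_lambda, key=score_by_lambda.get)

    # Stage 2: fix lambda at its best value and vary alpha.
    score_by_alpha = {a: train_and_eval(best_lambda, a) for a in alphas}
    best_alpha = max(score_by_alpha, key=score_by_alpha.get)
    return best_lambda, best_alpha, score_by_lambda, score_by_alpha


def toy_score(lam, alpha):
    """Synthetic placeholder score (NOT a result from the paper)."""
    return -((lam - 0.75) ** 2) - (alpha - 0.5) ** 2


# Lambda grid of the digit transfer task as listed in the text;
# the alpha grid is an assumed example within the stated range.
lambda_grid = [0, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5]
alpha_grid = [0.01, 0.1, 0.2, 0.5, 1.0]

best_lambda, best_alpha, _, _ = two_stage_search(toy_score, lambda_grid, alpha_grid)
print(best_lambda, best_alpha)
```

A sequential sweep of this kind needs far fewer training runs than a full grid over both hyper-parameters, at the cost of assuming that the best λ does not depend strongly on α.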
4.7 Size of mini-batch
The proposed CTL can be trained with the mini-batch strategy. It is therefore interesting to explore how the mini-batch size influences the effectiveness of CTL. We conduct experiments on one cross-domain task, \(\mathcal {U} \rightarrow {\mathscr{M}}\). We sample 1800 images from USPS and 3 images per class from MNIST to form a training set. The evaluation is performed on 2000 samples randomly drawn from MNIST, excluding the samples that have been selected for training. The implementation details are the same as those in the digit transfer experiments, except that the batch size is no longer fixed at 64. Batch sizes that are powers of 2 are common in DL training. Thus, the values {1, 2, 4, 8, 16, 32, 64, 128, 256, 512} are tested in the analysis.
The accuracy of the proposed method for different batch sizes is shown in Fig. 9. The performance is unsatisfactory when the batch size is small. We further plot the learning curves of CTL for different batch sizes to examine how CTL is minimized (Fig. 10). The figure shows that the training of CTL is unstable when the batch size is small. The smaller the batch size is, the more volatile the training process becomes. Updating the centers based on very few samples each time naturally leads to large randomness and bias. When the batch size gets larger, the learning curves of CTL become stable. In addition, we observe accuracy drops in Fig. 9 when the batch size is relatively large (i.e., 256 and 512). This is consistent with the finding in [45]. That study states that a large batch size (over 10% of the full batch) may not be a good choice. A model trained with a larger batch size is more likely to converge to sharp minima, i.e., the model is reasonably good but does not offer the best solution to the classification task.
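The effect of batch size on center updates can be illustrated with a toy simulation. The sketch below is an assumption-laden illustration rather than the CTL implementation: it uses a center-loss-style update in the spirit of [20] (the center moves by α times the averaged difference to the batch features, with a `1 + n` denominator), one-dimensional features, and a single class whose features are drawn from a unit-variance Gaussian. The function name `center_fluctuation` is hypothetical.

```python
import random
import statistics

def center_fluctuation(batch_size, alpha=0.5, steps=400, seed=0):
    """Std. dev. of a class center over the second half of a toy training run."""
    rng = random.Random(seed)
    center = 0.0
    trace = []
    for step in range(steps):
        # Assumption: every sample in the batch belongs to the same class,
        # with 1-D features drawn from N(1, 1).
        batch = [rng.gauss(1.0, 1.0) for _ in range(batch_size)]
        # Center-loss-style update: averaged pull of the center towards the batch.
        delta = sum(center - x for x in batch) / (1 + len(batch))
        center -= alpha * delta
        if step >= steps // 2:
            trace.append(center)  # record only after a burn-in period
    return statistics.pstdev(trace)

# Smaller batches give a noisier center estimate.
for batch_size in (1, 4, 16, 64, 256):
    print(batch_size, round(center_fluctuation(batch_size), 3))
```

In a real mini-batch, the samples are spread over several classes and two domains, so the number of samples that update any one center is only a fraction of the batch size, which makes the fluctuation at small batch sizes even more pronounced than in this single-class toy setting.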
In sum, although the choice of batch size has an impact on the effectiveness of the proposed CTL, our loss is robust to the common batch sizes used in DL training. Unless the batch size is either too small or too large, the model performance remains satisfactory and stable.
5 Conclusion
Domain adaptation has drawn considerable interest in the DL community recently. It aims to make use of the copious amount of accessible data from different domains. In this work, we propose a new loss function, referred to as CTL. It is trainable using a single-stream network based on the mini-batch strategy. Under the joint supervision of the softmax loss and CTL, same-class features of the source and target domains achieve a desirable degree of alignment and a compact intra-class variation. At the same time, different-class features remain sufficiently separated. Using CTL yields domain alignment and the minimization of intra-class variation in the early and later training stages, respectively, without the need to set a trade-off value to balance these two functions. The “single-stream implementation” and the “manual-balance-waived simultaneous achievement of domain alignment and intra-class variation minimization” are the two main advantages of our approach compared to previous methods. Experiments in the present study show that our approach performs better than baselines and recent SOTAs under identical settings across standard DA benchmarks.
Although the proposed CTL offers an encouraging outcome, it is worthwhile to investigate whether its variants provide better performance. For example, referring to [46], we can try using only the feature points nearest to each class center, instead of all feature points, to update the class centers in each iteration. This implementation may decrease the negative impact of out-of-distribution feature points on the center update. Moreover, in addition to the \(l_2\)-norm distance, other metrics, such as the cosine distance and other types of norms, are worth exploring in future work.
References
Li Z, Liu F, Yang W, Peng S, Zhou J (2021) A survey of convolutional neural networks: analysis, applications, and prospects. IEEE Trans Neural Netw Learn Syst, 1–21. https://doi.org/10.1109/TNNLS.2021.3084827
Roy Y, Banville H, Albuquerque I, Gramfort A, Falk TH, Faubert J (2019) Deep learning-based electroencephalography analysis: a systematic review. J Neural Eng 16(5):051001
Aggarwal R, Sounderajah V, Martin G, Ting DS, Karthikesalingam A, King D, Ashrafian H, Darzi A (2021) Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ Digit Med 4(1):1–23
Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345–1359. https://doi.org/10.1109/TKDE.2009.191
Wang M, Deng W (2018) Deep visual domain adaptation: a survey. Neurocomputing 312:135–153. https://doi.org/10.1016/j.neucom.2018.05.083
Motiian S, Piccirilli M, Adjeroh DA, Doretto G (2017) Unified deep supervised domain adaptation and generalization. In: Proceedings of the IEEE international conference on computer vision (ICCV)
Wilson G, Cook DJ (2020) A survey of unsupervised deep domain adaptation. ACM Trans Intell Syst Technol 11(5). https://doi.org/10.1145/3400066
Singh A (2021) CLDA: contrastive learning for semi-supervised domain adaptation. In: Beygelzimer A, Dauphin Y, Liang P, Vaughan JW (eds) Advances in neural information processing systems. https://openreview.net/forum?id=1ODSsnoMBav
Tong X, Xu X, Huang S-L, Zheng L (2021) A mathematical framework for quantifying transferability in multi-source transfer learning. In: Beygelzimer A, Dauphin Y, Liang P, Vaughan JW (eds) Advances in neural information processing systems. https://openreview.net/forum?id=wQZWg82TWx
Hedegaard L, Sheikh-Omar OA, Iosifidis A (2021) Supervised domain adaptation: a graph embedding perspective and a rectified experimental protocol. IEEE Trans Image Proc 30:8619–8631. https://doi.org/10.1109/TIP.2021.3118978
Wang Y, Liu J, Ruan Q, Wang S, Wang C (2021) Cross-subject eeg emotion classification based on few-label adversarial domain adaption. Expert Syst Appl 185:115581. https://doi.org/10.1016/j.eswa.2021.115581
Sawyer D, Fiaidhi J, Mohammed S (2021) Few shot learning of covid-19 classification based on sequential and pretrained models: a thick data approach. In: 2021 IEEE 45th annual computers, software, and applications conference (COMPSAC), pp 1832–1836. https://doi.org/10.1109/COMPSAC51774.2021.00276
Abdelwahab M, Busso C (2015) Supervised domain adaptation for emotion recognition from speech. In: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 5058–5062. https://doi.org/10.1109/ICASSP.2015.7178934
Li X, He Y, Zhang JA, Jing X (2021) Supervised domain adaptation for few-shot radar-based human activity recognition. IEEE Sensors J 21(22):25880–25890. https://doi.org/10.1109/JSEN.2021.3117942
Motiian S, Jones Q, Iranmanesh SM, Doretto G (2017) Few-shot adversarial domain adaptation. In: Proceedings of the 31st international conference on neural information processing systems. NIPS’17, Curran Associates Inc, pp 6673–6683
Koch G, Zemel R, Salakhutdinov R (2015) Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop, vol 2. Lille, p 0
Koniusz P, Tas Y, Porikli F (2017) Domain adaptation by mixture of alignments of second-or higher-order scatter tensors. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp 7139–7148. https://doi.org/10.1109/CVPR.2017.755
Tzeng E, Hoffman J, Darrell T, Saenko K (2015) Simultaneous deep transfer across domains and tasks. In: 2015 IEEE international conference on computer vision (ICCV), pp 4068–4076. https://doi.org/10.1109/ICCV.2015.463
Xu X, Zhou X, Venkatesan R, Swaminathan G, Majumder O (2019) D-sne: domain adaptation using stochastic neighborhood embedding. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 2492–2501. https://doi.org/10.1109/CVPR.2019.00260
Wen Y, Zhang K, Li Z, Qiao Y (2016) A discriminative feature learning approach for deep face recognition. In: Leibe B, Matas J, Sebe N, Welling M (eds) Computer vision – ECCV 2016, Springer, pp 499–515
Fernando B, Tommasi T, Tuytelaars T (2015) Joint cross-domain classification and subspace learning for unsupervised adaptation. Pattern Recogn Lett 65. https://doi.org/10.1016/j.patrec.2015.07.009
Teshima T, Sato I, Sugiyama M (2020) Few-shot domain adaptation by causal mechanism transfer. In: International conference on machine learning, PMLR, pp 9458–9469
Taskesen B, Yue M-C, Blanchet J, Kuhn D, Nguyen VA (2021) Sequential domain adaptation by synthesizing distributionally robust experts. In: International conference on machine learning, PMLR, pp 10162–10172
Corral-Soto ER, Nabatchian A, Gerdzhev M, Bingbing L (2021) Lidar few-shot domain adaptation via integrated cyclegan and 3d object detector with joint learning delay. In: 2021 IEEE international conference on robotics and automation (ICRA), pp 13099–13105. https://doi.org/10.1109/ICRA48506.2021.9561466
Zhong C, Wang J, Feng C, Zhang Y, Sun J, Yokota Y (2022) Pica: Point-wise instance and centroid alignment based few-shot domain adaptive object detection with loose annotations. In: 2022 IEEE/CVF winter conference on applications of computer vision (WACV), pp 398–407. https://doi.org/10.1109/WACV51458.2022.00047
Zhou JT, Tsang IW, Pan SJ, Tan M (2014) Heterogeneous domain adaptation for multiple classes. In: Kaski S, Corander J (eds) Proceedings of the seventeenth international conference on artificial intelligence and statistics. Proceedings of machine learning research, vol 33. PMLR, pp 1095–1103. https://proceedings.mlr.press/v33/zhou14.html
Sukhija S, Krishnan NC, Singh G (2016) Supervised heterogeneous domain adaptation via random forests. In: IJCAI
Hadsell R, Chopra S, LeCun Y (2006) Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), vol 2. pp 1735–1742. https://doi.org/10.1109/CVPR.2006.100
Morsing LH, Sheikh-Omar OA, Iosifidis A (2021) Supervised domain adaptation using graph embedding. In: 2020 25th international conference on pattern recognition (ICPR), pp 7841–7847. https://doi.org/10.1109/ICPR48806.2021.9412422
Saenko K, Kulis B, Fritz M, Darrell T (2010) Adapting visual category models to new domains. In: Daniilidis K, Maragos P, Paragios N (eds) Computer vision – ECCV 2010, Springer, pp 213–226
Gong B, Shi Y, Sha F, Grauman K (2012) Geodesic flow kernel for unsupervised domain adaptation. In: 2012 IEEE conference on computer vision and pattern recognition, pp 2066–2073. https://doi.org/10.1109/CVPR.2012.6247911
Venkateswara H, Eusebio J, Chakraborty S, Panchanathan S (2017) Deep hashing network for unsupervised domain adaptation. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp 5385–5394
Lecun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324. https://doi.org/10.1109/5.726791
LeCun Y, Boser B, Denker J, Henderson D, Howard R, Hubbard W, Jackel L (1989) Handwritten digit recognition with a back-propagation network. In: Touretzky D (ed) Advances in neural information processing systems, vol 2. Morgan-Kaufmann. https://proceedings.neurips.cc/paper/1989/file/53c3bce66e43be4f209556518c2fcb54-Paper.pdf
Netzer Y, Wang T, Coates A, Bissacco A, Wu B, Ng AY (2011) Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on deep learning and unsupervised feature learning 2011
Ganin Y, Ustinova E, Ajakan H, Germain P, Larochelle H, Laviolette F, Marchand M, Lempitsky V (2016) Domain-adversarial training of neural networks. J Mach Learn Res 17(1):2096–2030
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: Bengio Y, LeCun Y (eds) 3rd international conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. arXiv:1409.1556
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg A, Fei-Fei L (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252. https://doi.org/10.1007/s11263-015-0816-y
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems, vol 25. Curran Associates Inc
Griffin G, Holub A, Perona P (2007) Caltech-256 object category dataset. CalTech Report
Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T (2014) Decaf: a deep convolutional activation feature for generic visual recognition. In: Xing EP, Jebara T (eds) Proceedings of the 31st international conference on machine learning. Proceedings of machine learning research, vol 32. PMLR, pp 647–655. https://proceedings.mlr.press/v32/donahue14.html
Saito K, Kim D, Sclaroff S, Darrell T, Saenko K (2019) Semi-supervised domain adaptation via minimax entropy. In: 2019 IEEE/CVF international conference on computer vision (ICCV), IEEE Computer Society, pp 8049–8057. https://doi.org/10.1109/ICCV.2019.00814
Wang Z, Du B, Guo Y (2020) Domain adaptation with neural embedding matching. IEEE Trans Neural Netw Learn Syst 31(7):2387–2397. https://doi.org/10.1109/TNNLS.2019.2935608
van der Maaten L, Hinton G (2008) Visualizing data using t-sne. J Mach Learn Res 9(86):2579–2605
Keskar NS, Mudigere D, Nocedal J, Smelyanskiy M, Tang PTP (2017) On large-batch training for deep learning: generalization gap and sharp minima. In: International conference on learning representations. https://openreview.net/forum?id=H1oyRlYgg
Wang X, Zheng Z, He Y, Yan F, Zeng Z, Yang Y (2021) Soft person reidentification network pruning via blockwise adjacent filter decaying. IEEE Trans Cybern, 1–15. https://doi.org/10.1109/TCYB.2021.3130047
Acknowledgements
The work in this paper was supported in part by the Hong Kong Research Grants Council (PolyU 152006/19E).
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest.
About this article
Cite this article
Huang, X., Zhou, N., Huang, J. et al. Center transfer for supervised domain adaptation. Appl Intell 53, 18277–18293 (2023). https://doi.org/10.1007/s10489-022-04414-2
DOI: https://doi.org/10.1007/s10489-022-04414-2