1 Introduction

Deep learning (DL) methods have been successfully applied to various areas, such as computer vision [1], brain-computer interfaces [2], and medical diagnosis [3]. The outstanding performance of DL approaches relies on abundant training data. However, it is often difficult to acquire sufficient data to train a DL model for a specific task at hand, since data recording and label annotation are costly and labor-intensive. A popular way to address this issue is domain adaptation (DA) [4]. Its main idea is to use available large-scale datasets in the source domain (\(\mathcal {D}_{s}\)) to assist model training in the target domain (\(\mathcal {D}_{t}\)), where training data are scarce.

According to [5, 6], DA can be unsupervised, semi-supervised, or supervised, depending on the availability of labeled data in \(\mathcal {D}_{t}\). Unsupervised domain adaptation (UDA) [7] uses only unlabeled target data. In a semi-supervised DA (SSDA) scheme [8], a small amount of labeled data and a considerable amount of unlabeled data are accessible. Supervised domain adaptation (SDA) [9, 10] assumes that all available target samples are annotated, although their number is small. Sophisticated SDA approaches can usually outperform UDA and SSDA ones when the amount of available data in \(\mathcal {D}_{t}\) is small [6]. Annotating a small dataset is practical and requires little effort. Therefore, SDA methods are more appealing when only very few samples from \(\mathcal {D}_{t}\) are accessible. They have been applied in many areas, such as cross-subject EEG emotion classification [11], CT scan-based Covid-19 diagnosis [12], emotion detection from speech [13], and radar-based human activity recognition [14]. These applications only allow a small set of data to be recorded from the target domain, as massive data collection is either extremely expensive or impossible. They therefore require a suitable method that uses such a small number of samples in \(\mathcal {D}_{t}\) to build a reliable model for the target domain. SDA is also known as few-shot domain adaptation [15], a name that more directly expresses the scenario of using only a few samples from \(\mathcal {D}_{t}\) in DA problems.

The typical way to implement an SDA approach for a classification task in the DL community is to learn a deep transformation that draws same-class samples close together, regardless of whether they come from \(\mathcal {D}_{s}\) or \(\mathcal {D}_{t}\). A popular strategy to perform such a mapping is to operate a two-stream network, i.e., a siamese network [16] or a correlated network [17]. The training of these networks typically starts with either a sample-based [6, 10, 15, 18, 19] or a batch-based [9, 17] pair-wise input. However, such a pairing mechanism leads to a quadratic increase in the sample size over the original dataset and unavoidably results in redundancy, slow convergence, and unstable performance [20]. For example, one standard protocol [19, 21] for the MNIST → USPS domain adaptation task is to use 2000 labeled samples from MNIST and 70 labeled samples from USPS to train a model for the classification task on USPS. Several recent state-of-the-art (SOTA) methods [6, 10, 19] employ a siamese network trained on 56,000 pairs of samples (2000 × 70 × σ, where σ is a ratio controlling the redundancy and equals 0.4 in these studies). Training such a network becomes impractical when the sample size of either the source or the target dataset grows further.

This study proposes a simple but efficient loss function, namely the center transfer loss (CTL), to address the issues above and increase the discriminative power of deep learning features. Specifically, we learn a center (a vector with the same dimensionality as the features) for the features of each class in each domain and update these centers during training. We then minimize the distance between the features and their corresponding class centers in the opposite domain rather than the same domain. In other words, we minimize the distance between the features of \(\mathcal {D}_{s}\) and the class centers of \(\mathcal {D}_{t}\), as well as the distance between the features of \(\mathcal {D}_{t}\) and the class centers of \(\mathcal {D}_{s}\). For example, the features of class 1 in \(\mathcal {D}_{s}\) are pushed toward the feature center of class 1 in \(\mathcal {D}_{t}\), and the features of class 1 in \(\mathcal {D}_{t}\) are pushed toward the feature center of class 1 in \(\mathcal {D}_{s}\). Deep neural networks (DNNs) are trained by the joint supervision of the softmax loss and CTL. Intuitively, the softmax loss ensures that features of different classes stay apart, while CTL pushes samples toward the class center of the opposite domain. Through continual center updates and distance minimization, same-class samples from different domains eventually align. Interestingly, CTL achieves feature alignment at the beginning of training, see Fig. 1(b). In the later stage, once the feature distributions of the two domains are sufficiently aligned, CTL instead acts to decrease the intra-class variation of the features and thus increases their discriminative power, see Fig. 1(c). No hyper-parameter is required to control the shift between the early and later training stages.

Fig. 1
figure 1

The distribution of features generated by the model trained with the combination of softmax loss and CTL in a two-class toy problem with 200 source (MNIST) and 7 target (USPS) samples per class. The training settings, except for the data and the loss function, are the same as in the toy example in [20]. The features are the outputs of the second-to-last layer and are reduced to two dimensions for visualization. (a) Features generated by the network without training. (b) Features generated by the network trained for only 5 epochs. (c) Features generated by the network trained for 20 epochs

In summary, the proposed method makes two major contributions compared with previous approaches.

  1. CTL is easy to employ in DNNs. Our DL models are trainable with the mini-batch strategy in a single-stream setting, without running two-stream architectures. The issues of redundancy, slow convergence, and unstable performance are thereby largely avoided.

  2. The learned features achieve both domain alignment and intra-class variation minimization when the proposed CTL is used in model training. Although several previous SDA approaches (e.g., [6, 10, 19]) can provide a similar outcome, a trade-off value must be manually set in these methods to balance domain alignment and intra-class variation minimization. The optimal trade-off value varies across datasets and tasks, requiring a labor-intensive exhaustive search each time with no guarantee of finding the best value. In contrast, our CTL achieves both functionalities in different stages of training without the need to set a trade-off value to balance them.

To verify the effectiveness of our approach, we conduct extensive experiments on common DA benchmarks. The results show that our method achieves better performance than current SOTA methods. The remainder of this article is organized as follows. Section 2 introduces the related works. Section 3 explains the proposed method in detail. Section 4 presents the experimental protocols and results. The conclusion is drawn in Section 5.

2 Related works

SDA approaches focus on the specific scenario where labeled target data are available during training, albeit with very few samples per class. There are diverse SDA strategies targeting different types of tasks, e.g., regression [22, 23], object detection [24, 25], and classification. Our study focuses on SDA methods for the classification problem.

Early SDA approaches for the classification task rely on matrix-based mappings between domains and linear classifiers. Zhou et al. [26] proposed an SDA method called SHFR-ECOC, which constructs a sparse feature transformation matrix to obtain domain-invariant features that serve as inputs to a linear SVM classifier. Sukhija et al. [27] explored another SDA strategy that also learns a sparse feature transformation for feature generation but uses a random forest classifier instead. DL models have been developing rapidly in recent years. They offer an end-to-end solution for classification tasks and have naturally attracted growing research interest.

In [18], source data and sparsely labeled target data are used to train a siamese network. The network learns domain-invariant features using a soft label distribution matching loss. Similarly, Motiian et al. [6] proposed a method called CCSA, also based on the siamese architecture. The model is trained by the joint supervision of categorical cross-entropy (i.e., softmax loss) and the point-wise contrastive loss introduced in [28]. The authors found that their method provided fast convergence and better classification accuracy. The same research group proposed another SDA method, FADA [15], which uses adversarial training to attain feature alignment. They carefully designed four kinds of paired data that the discriminator in the network is augmented to distinguish. In the same year, Piotr et al. [17] presented a strategy, namely So-HoT, using a mixture of second- and/or third-order scatter alignment measures between the source and target domains. They aim to align the within-class scatters of a two-stream network to a certain degree using a bespoke loss while keeping the between-class scatters well separated.

More recently, another SDA method, Domain Adaptation using Stochastic Neighborhood Embedding (d-SNE), was proposed by Xu et al. [19]. Interestingly, this approach focuses only on minimizing the largest distance among same-class pairs across the source and target domains and maximizing the distance of the closest different-class pair. Alternatively, Hedegaard et al. [10, 29] utilized the graph embedding technique to learn a domain-invariant and semantically meaningful feature space. Similar to CCSA, a siamese network is used in this study as a feature generator. The generated features are then fed into a linear discriminant analysis (LDA) classifier to perform the prediction. In addition, Tong et al. [9] presented a mathematical framework (MF) that treats DA as a convex optimization problem. This framework quantifies transferability in transfer learning problems based on the number of samples, the model complexity, and the Chi-square distance between the source and target tasks. The authors also designed an SDA approach using this framework and achieved encouraging performance on DA benchmarks. Nevertheless, training the SDA methods above requires pairing up samples between \(\mathcal {D}_{s}\) and \(\mathcal {D}_{t}\). In contrast, our method is trainable with single-stream data based on the mini-batch strategy.

Another study similar to ours is the center loss [20], which calculates the class centers using data from both the source and target domains. Given that the number of source samples is much larger than the number of target samples, the computation of the class centers is dominated by the source samples. Thus, a model trained with the center loss [20] only pulls the scarce samples of \(\mathcal {D}_{t}\) toward the centers of \(\mathcal {D}_{s}\) without accounting for the feature distribution of \(\mathcal {D}_{t}\). In general, the center loss follows a different insight from ours and cannot address SDA classification tasks effectively.

3 Methodology

In this work, we are concerned only with the specific SDA problem in which a large-scale dataset in \(\mathcal {D}_{s}\) and very few annotated samples in \(\mathcal {D}_{t}\) are available. In other words, the complete dataset \(\mathcal {D}_{all} = \{\mathbf {x}_{i},y_{i}\}^{m+n}_{i=1}\) combines the samples from the source domain \(\mathcal {D}_{s} = \{{\mathbf {x}^{s}_{i}},{y^{s}_{i}}\}^{m}_{i=1}\) and the target domain \(\mathcal {D}_{t} = \{{\mathbf {x}^{t}_{i}},{y^{t}_{i}}\}^{n}_{i=1}\). \(\mathbf {x}_{i} \in \mathbb {R}^{d}\) denotes the ith feature in a d-dimensional space, and \(y_{i} \in \{1,\ldots,a\}\) is the corresponding label of \(\mathbf {x}_{i}\), where a is the number of classes. The features \({\mathbf {x}^{s}_{i}}\) and \({\mathbf {x}^{t}_{i}}\) can be regarded as realizations of the random variables \(X^{s}\) and \(X^{t}\), which represent the source and target domains, respectively. Note that \(m \gg n\) in the SDA scenario. In the absence of domain shift, we could simply train a DNN model on all the data in \(\mathcal {D}_{all}\) with the softmax loss defined as

$$ {\mathcal{L}}_{S}=-\sum\limits_{i=1}^{k}\log \frac{e^{{\mathbf{W}}_{y_{i}}^{\top}{\mathbf{x}}_{i}+b_{y_{i}}}}{{\sum}_{j=1}^{a} e^{{\mathbf{W}}_{j}^{\top}{\mathbf{x}}_{i}+b_{j}}} $$
(1)

where \({\mathbf {W}}_{j} \in \mathbb {R}^{d}\) represents the jth column of the weights \({\mathbf {W}} \in \mathbb {R}^{d\times a}\) in the last dense layer (which takes the features as input), \({\mathbf {b}} \in \mathbb {R}^{a}\) denotes the bias term, and k is the size of a mini-batch. However, the distributions of the two domains usually differ, so directly using \({\mathbf {x}^{s}_{i}}\) to train a classifier for \(\mathcal {D}_{t}\) is problematic. An alignment of features between the domains is therefore necessary. In the UDA setting [7], labels are assumed to be unavailable in \(\mathcal {D}^{t}\). A common strategy for domain alignment is to introduce a distance loss on the marginal distributions of \(\mathcal {D}^{s}\) and \(\mathcal {D}^{t}\) (i.e., \(p(X^{s})\) and \(p(X^{t})\)) as follows.

$$ {\mathcal{L}}_{DIS}=D(p(X^{s}), p(X^{t})) $$
(2)

where D(⋅,⋅) is a metric between two distributions; once the distributions are aligned, a feature can no longer be recognized as coming from the source or the target domain. UDA methods have a natural limitation: even if the marginal distributions are perfectly aligned, there is no guarantee that features belonging to the same class but to different domains are mapped into the same region of the feature space. Such an alignment may therefore offer little benefit to the DNN model for a classification task. In our setting, by contrast, labeled data from the target domain are available, albeit in small amounts. It is thus practical to achieve a better, class-conditional alignment, in which features from different domains but with the same class label are mapped to nearby distributions, by amending (2) as:

$$ {\mathcal{L}}_{CDIS}=\sum\limits_{i=1}^{a}D(p({X^{s}_{i}}), p({X^{t}_{i}})) $$
(3)

The core challenge is then to find an appropriate metric D(⋅,⋅). To this end, we minimize the Euclidean distance between features and the corresponding class centers of the opposite domain. Mathematically, (3) can be reformulated as:

$$ {\mathcal{L}}_{F-CTL}=\frac{1}{2}\sum\limits_{i=1}^{m}\|{{\mathbf{x}}_{i}^{s}}-{\mathbf{c}}_{y_{i}}^{t}\|_{2}^{2}+\frac{1}{2}\sum\limits_{j=1}^{n}\|{{\mathbf{x}}_{j}^{t}}-{\mathbf{c}}_{y_{j}}^{s}\|_{2}^{2} $$
(4)

where \({\mathbf {c}}_{y_{i}}^{t} \in \mathbb {R}^{d}\) denotes the \(y_{i}\)th class center of the features in \(\mathcal {D}^{t}\), and \({\mathbf {c}}_{y_{j}}^{s} \in \mathbb {R}^{d}\) represents the \(y_{j}\)th class center of the features in \(\mathcal {D}^{s}\). An intuitive illustration of (4) is given by the “Early stage in training” panel of Fig. 2. Note that m is usually much larger than n in the SDA setting, so the first term of (4) dominates the loss and the contribution of the samples in \(\mathcal {D}^{t}\) to the aligned latent space is neglected. To address this issue, we take the mean of the distances in each domain instead of their sum, as follows

$$ {\mathcal{L}}_{F-CTL}=\frac{1}{2m}\sum\limits_{i=1}^{m}\|{{\mathbf{x}}_{i}^{s}}-{\mathbf{c}}_{y_{i}}^{t}\|_{2}^{2}+\frac{1}{2n}\sum\limits_{j=1}^{n}\|{{\mathbf{x}}_{j}^{t}}-{\mathbf{c}}_{y_{j}}^{s}\|_{2}^{2}{.} $$
(5)
Fig. 2
figure 2

An illustration of the proposed loss to achieve both a feature alignment and a decrease of the intra-class variation

Ideally, the calculation of \({{\mathscr{L}}}_{F-CTL}\) and the feature centers c should take all features of the whole training set into account. However, owing to the large sample size of the source training set and limited memory, such an implementation is impractical. We therefore update the centers and compute the loss on mini-batches, and \({\mathscr{L}}_{F-CTL}\) becomes (6).

$$ {\mathcal{L}}_{CTL}=\frac{1}{2k^{s}}\sum\limits_{i=1}^{k^{s}}\|{{\mathbf{x}}_{i}^{s}}-{\mathbf{c}}_{y_{i}}^{t}\|_{2}^{2}+\frac{1}{2k^{t}}\sum\limits_{j=1}^{k^{t}}\|{{\mathbf{x}}_{j}^{t}}-{\mathbf{c}}_{y_{j}}^{s}\|_{2}^{2} $$
(6)

where \(k^{s}\) and \(k^{t}\) are the numbers of samples from \(\mathcal {D}^{s}\) and \(\mathcal {D}^{t}\) in the mini-batch, respectively. Equation (6) (\({{\mathscr{L}}}_{CTL}\)) is the final form of the proposed center transfer loss (CTL) under the mini-batch update strategy. From the equation, CTL can be optimized with a single-stream input, without relying on the two-stream networks used in previous methods [6, 10, 15, 18, 19]. Training a two-stream network requires pairing up samples, which leads to a quadratic increase in the sample size over the original dataset; our loss is based on single-stream training and avoids this problem. In addition, CTL contributes to domain alignment in the initial training stage and increases the discriminative power of the features afterwards by minimizing the intra-class variation (Fig. 2). Although previous methods introduce similar losses that also achieve domain alignment and increased discriminative power, they require a manually set trade-off value to balance these two functionalities during training. In contrast, the proposed loss does not require such a manual trade-off value.
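A minimal PyTorch-style sketch of (6) is given below. The function and tensor names, and the convention that a boolean mask marks source samples, are our own assumptions rather than the released implementation.

import torch

def center_transfer_loss(feat, labels, is_source, centers_s, centers_t):
    """Sketch of Eq. (6): every feature is pulled toward the class center
    of the opposite domain.

    feat:      (k, d) mini-batch features
    labels:    (k,)   class indices
    is_source: (k,)   bool, True for source-domain samples
    centers_s: (a, d) class centers of the source domain
    centers_t: (a, d) class centers of the target domain
    """
    loss = feat.new_zeros(())
    src, tgt = is_source, ~is_source
    if src.any():   # source features vs. target centers (first term of Eq. (6))
        diff = feat[src] - centers_t[labels[src]]
        loss = loss + 0.5 * diff.pow(2).sum(dim=1).mean()
    if tgt.any():   # target features vs. source centers (second term of Eq. (6))
        diff = feat[tgt] - centers_s[labels[tgt]]
        loss = loss + 0.5 * diff.pow(2).sum(dim=1).mean()
    return loss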

Because training is conducted with the mini-batch strategy, some centers may not be updated in a given iteration. The update equations for the class centers \({\mathbf {c}}^{t}\) and \({\mathbf {c}}^{s}\) are formulated as:

$$ \begin{array}{@{}rcl@{}} {\Delta} {{\mathbf{c}}_{h}^{t}} &=& \frac{{\sum}_{j=1}^{k^{t}} \delta(y_{j}=h)({{\mathbf{c}}^{t}_{h}}-{{\mathbf{x}}^{t}_{j}})}{\rho+{\sum}_{j=1}^{k^{t}} \delta(y_{j}=h)};\ \ \ h=1,\ldots,a \end{array} $$
(7)
$$ \begin{array}{@{}rcl@{}} {\Delta} {{\mathbf{c}}_{h}^{s}} &=& \frac{{\sum}_{i=1}^{k^{s}} \delta(y_{i}=h)({{\mathbf{c}}^{s}_{h}}-{{\mathbf{x}}^{s}_{i}})}{\rho+{\sum}_{i=1}^{k^{s}} \delta(y_{i}=h)};\ \ \ h=1,\ldots,a \end{array} $$
(8)
$$ \begin{array}{@{}rcl@{}} {{\mathbf{c}}_{h}^{t}}&=&{{\mathbf{c}}_{h}^{t}} - \alpha {\Delta} {{\mathbf{c}}_{h}^{t}}. \end{array} $$
(9)
$$ \begin{array}{@{}rcl@{}} {{\mathbf{c}}_{h}^{s}}&=&{{\mathbf{c}}_{h}^{s}} - \alpha {\Delta} {{\mathbf{c}}_{h}^{s}}. \end{array} $$
(10)

where δ(⋅) is the indicator function, which equals 1 if the condition is satisfied and 0 otherwise. ρ is a small constant (i.e., \(10^{-5}\)) that prevents division by zero. A scalar α ∈ (0,1] controls the learning rate of the centers and mitigates the negative impact of noisy samples. We adopt the joint supervision of the softmax loss and CTL to train the DNN models for DA in the classification task. The objective function is given as follows.

$$ {\mathcal{L}} ={\mathcal{L}}_{S}+\lambda {\mathcal{L}}_{CTL} $$
(11)

where λ is a trade-off scalar that balances the two losses and ranges from 0 to 5 in our experiments. We summarize the mini-batch training strategy in Algorithm 1; a condensed code sketch follows the algorithm.

Algorithm 1
figure g

Training of Center Transfer Loss.
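As a complement to Algorithm 1, the sketch below combines the center update of (7)-(10) with one training iteration under the joint objective (11), reusing the center_transfer_loss helper sketched above. The assumption that the model exposes separate features(...) and classifier(...) calls, the mini-batch format, and the default hyper-parameter values are ours.

import torch
import torch.nn.functional as F

def update_centers(centers, feat, labels, alpha=0.5, rho=1e-5):
    """Sketch of Eq. (7)-(10) for one domain: each center present in the
    mini-batch moves toward the mean of its features with step alpha."""
    with torch.no_grad():
        for h in range(centers.size(0)):
            mask = labels == h                      # delta(y = h)
            count = mask.sum()
            if count == 0:
                continue                            # this center is not updated in this iteration
            delta = (centers[h] - feat[mask]).sum(dim=0) / (rho + count)  # Eq. (7)/(8)
            centers[h] -= alpha * delta             # Eq. (9)/(10)

def train_step(model, batch, centers_s, centers_t, optimizer, lam=0.1, alpha=0.5):
    """One mini-batch update with joint supervision of softmax loss and CTL."""
    images, labels, is_source = batch               # is_source: bool, True = samples from D_s
    feat = model.features(images)                   # second-to-last-layer features
    logits = model.classifier(feat)                 # last dense layer (W, b)

    loss_s = F.cross_entropy(logits, labels)        # Eq. (1), averaged over the batch
    loss_ctl = center_transfer_loss(feat, labels, is_source, centers_s, centers_t)  # Eq. (6)
    loss = loss_s + lam * loss_ctl                  # Eq. (11)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Each domain's centers follow the features of its own domain.
    feat = feat.detach()
    update_centers(centers_t, feat[~is_source], labels[~is_source], alpha)
    update_centers(centers_s, feat[is_source], labels[is_source], alpha)
    return loss.item()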

4 Experimental protocols and results

We evaluate the proposed loss function on common SDA benchmarks, including Office31 [30], Office-Caltech-10 [31], Office-Home [32], and digit transfer (MNIST [33], USPS [34], SVHN [35], and MNIST-M [36]). Feature visualizations and sensitivity analyses on λ and α are also presented. The impact of the batch size on the effectiveness of CTL is examined in the last part of this section. The source code for the experiments is publicly available (Footnote 1). All data generated or analyzed during this study are included.

4.1 Office31

Office31 is a classical benchmark collected for the evaluation of DA methods. It comprises 31 visual object categories from three separate domains, namely Amazon (\({\mathcal {A}}\)), Webcam (\({\mathcal {W}}\)), and DSLR (\(\mathcal {D}\)). Amazon is the largest domain and contains 2,817 images. Webcam and DSLR are relatively compact, with 795 and 498 images, respectively. Examples of images in this dataset are shown in Fig. 3.

Fig. 3
figure 3

Examples of Office31 dataset

Our experiments follow the setting used in [9]. Six domain shifts, namely \({\mathcal {A}} \rightarrow {\mathcal {D}}\), \({\mathcal {A}} \rightarrow {\mathcal {W}}\), \({\mathcal {W}} \rightarrow {\mathcal {A}}\), \({\mathcal {W}} \rightarrow {\mathcal {D}}\), \({\mathcal {D}} \rightarrow \mathcal {A}\), and \({\mathcal {D}} \rightarrow {\mathcal {W}}\), are examined on this dataset. All classes of the dataset and a five-split train-test validation scheme are used in the experiments. For the source domain, 20 samples per class are randomly chosen from \(\mathcal {A}\), whereas 8 samples per class are randomly chosen from \(\mathcal {W}\) or \(\mathcal {D}\), to train the model for each split. For the target domain, 3 samples per class are randomly selected for training in each split. The remaining target samples are used for testing.

The convolutional layers of VGG16 [37], followed by two dense layers with output sizes of 1024 and 128, respectively, are used as the base network in this experiment. We use this architecture for a fair comparison with most published SDA approaches. The weights of the convolutional layers are pre-trained on ImageNet [38], and those of the dense layers are randomly initialized. All images are resized to 224 × 224 and then normalized. The learning rates for the convolutional and dense layers are set to 0.001 and 0.01, respectively. The size of the mini-batch is 32. λ and α are fixed at 0.1 and 0.5, respectively. We compare our method with recent SDA SOTA methods, including SDADT [18], CCSA [6], FADA [15], d-SNE [19], DAGE-LDA [10], and MF [9]. Three baselines are also included in the comparison: (1) Model 1, trained on only source data with the softmax loss; (2) Model 2, trained on source data and target samples with the softmax loss; and (3) Model 3, trained on source data and target samples with the joint supervision of the softmax loss and the center loss. All baselines use the same backbone (i.e., VGG16) as the proposed method. We also compare our strategy with another SOTA approach, So-HoT [17]. Although the original So-HoT paper reports performance with VGG16 on Office31, it covers only two cross-domain schemes (i.e., \(\mathcal {A} \rightarrow \mathcal {D}\) and \(\mathcal {A} \rightarrow \mathcal {W}\)), whereas it reports AlexNet [39] performance on all six schemes. Therefore, the proposed method is also implemented with AlexNet for a fair comparison with So-HoT.
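A sketch of the base network we assume for this experiment is shown below. Beyond the 1024/128 layer sizes stated above, the pooling layer, activation choices, and the input size of the first dense layer (512 × 7 × 7 for 224 × 224 inputs) are our own assumptions.

import torch.nn as nn
import torchvision

# VGG16 conv layers pre-trained on ImageNet, followed by two randomly
# initialized dense layers (1024, 128) and a 31-way classifier for Office31.
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
backbone = nn.Sequential(
    vgg.features,                                   # pre-trained convolutional layers
    nn.AdaptiveAvgPool2d((7, 7)),
    nn.Flatten(),                                   # 512 * 7 * 7 = 25088 for 224x224 inputs
    nn.Linear(25088, 1024), nn.ReLU(inplace=True),
    nn.Linear(1024, 128), nn.ReLU(inplace=True),    # 128-d features used by CTL
)
classifier = nn.Linear(128, 31)                     # softmax loss over the 31 classes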

Table 1 shows the performance of the different models on the Office31 dataset. Our experiments show that most SDA methods outperform Models 1 and 2, the baselines trained with the softmax loss and no DA. It is worth noting that several previous studies, such as [6, 15, 19], only compare their methods with a weak baseline (i.e., Model 1, trained on only source data with the softmax loss). They claim that SDA methods yield significant improvements (over 15%) on the Office31 classification task compared with this baseline, which may overstate the effectiveness of SDA methods. A stronger and fairer baseline, i.e., Model 2, trained on source and target data with the softmax loss, should also be included in the comparison. In Table 1, we observe that SDA approaches improve classification performance by less than 7% over this stronger baseline (Model 2), consistent with the findings in [10, 17]. Nevertheless, it is encouraging that our model achieves higher performance than other recent SOTA methods and the baselines on the Office31 dataset.

Table 1 Average classification accuracy (%) of different methods on 31 classes of Office31 dataset

4.2 Office-caltech-10

The Office-Caltech-10 dataset contains the ten categories shared by Office31 and Caltech-256 [40] (i.e., backpack, bike, calculator, headphones, keyboard, laptop computer, monitor, mouse, mug, and projector). The dataset covers four domains: Amazon (958 images; \(\mathcal {A}\)), Webcam (295 images; \(\mathcal {W}\)), DSLR (157 images; \(\mathcal {D}\)), and Caltech (1,123 images; \(\mathcal {C}\)). Therefore, 12 cross-domain tasks can be formulated on this data collection. The same split-generation protocol as in the Office31 experiments is used but applied only to the ten classes above. Following the settings in [9], we implement our experiments using DeCAF-fc6 features [41] as model inputs.

Referring to the base architecture in [6, 9], we use two dense layers with output sizes of 1024 and 128 with PReLU activations as the feature embedding and one fully connected layer as the classifier. The learning rate is 0.001. The size of the mini-batch is 32. λ and α are fixed at 0.1 and 0.5, respectively. We compare the proposed method with CCSA, d-SNE, DAGE-LDA, MF, and the three baselines (Models 1, 2, and 3 described in the Office31 experiment). We either report the results from the original publications or reproduce these recent SOTA methods based on their open-source code.
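A sketch of the assumed embedding network for this experiment follows; the 4096-dimensional input size of the DeCAF-fc6 features is our assumption.

import torch.nn as nn

# Two PReLU dense layers (1024, 128) over DeCAF-fc6 inputs and a 10-way classifier.
embedding = nn.Sequential(
    nn.Linear(4096, 1024), nn.PReLU(),
    nn.Linear(1024, 128), nn.PReLU(),    # 128-d features used by CTL
)
classifier = nn.Linear(128, 10)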

Table 2 shows the performance of the models using DeCAF-fc6 features on the ten categories of the Office-Caltech-10 database. Again, our method achieves higher accuracy than the other SDA approaches and baseline models on both the within-Office31 and the Office-Caltech cross-domain tasks. This result shows the advantage of the proposed strategy in DA classification tasks.

Table 2 Average classification accuracy (%) of different methods on Office-Caltech-10 using DeCAF-fc6 features

4.3 Office-home

Office-Home [32] is a relatively large-scale dataset for DA experiments. It contains 15,500 images from 65 classes. The dataset has four domains, i.e., Art (\(\mathcal {A} r\)), Clip Art (\(\mathcal {C} a\)), Product (\(\mathcal {P} r\)), and Real World (\(\mathcal {R} w\)), and thus yields 12 (4 × 3) different DA tasks.

The “S+T” evaluation protocol of [42] is implemented in our experiments. Specifically, the labeled source images and three labeled target images per class are used to train the model in each DA task. The exact data splits can be found via the link in Footnote 2. Following [42], AlexNet [39] pre-trained on ImageNet is used in our experiments. All images are resized to 227 × 227 and then normalized. The size of the mini-batch is 32. λ and α are fixed at 0.001 and 0.5, respectively. We compare the proposed method with three recent SDA methods, i.e., CCSA [6], d-SNE [19], and DAGE-LDA [29]; the experiments for these three methods are conducted based on their open-source code. The three baselines (Models 1, 2, and 3) described in Section 4.1 are also included in the comparison.

Table 3 presents the classification performance on twelve DA tasks of the 65-class Office-Home dataset. The proposed method achieves the highest accuracy in most DA tasks. It also obtains the best average accuracy across different DA tasks.

Table 3 Average classification accuracy (%) of different methods on the Office-Home dataset

4.4 Digit transfer

The digit transfer dataset collection has also been widely used to study the effectiveness of SDA approaches. We use four datasets, all containing handwritten digits from 0 to 9: MNIST (\({\mathscr{M}}\)), USPS (\(\mathcal {U}\)), SVHN (\(\mathcal {S}\)), and MNIST-M (\({\mathscr{M}} {\mathscr{M}}\)). MNIST consists of 70,000 28 × 28 grayscale images; USPS contains 11,000 grayscale images with a 16 × 16 resolution; SVHN is a real-world image dataset with 99,280 RGB images extracted from street-view house numbers; and MNIST-M has 68,002 RGB images generated from MNIST by adding different backgrounds.
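Three of these datasets can be obtained directly through torchvision, as in the sketch below; MNIST-M is not included in torchvision and must be obtained separately. The root path and transform choice are placeholders.

from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()
mnist = datasets.MNIST(root="data", train=True, download=True, transform=to_tensor)
usps = datasets.USPS(root="data", train=True, download=True, transform=to_tensor)
svhn = datasets.SVHN(root="data", split="train", download=True, transform=to_tensor)
# Per-task pre-processing (resizing, grayscale conversion) is applied as described below.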

4.4.1 First experiment

The evaluation protocol in [10, 19] is followed in this experiment. We investigate the transfers from \({{\mathscr{M}}}\) to \({{\mathscr{M}}} {{\mathscr{M}}}\), between \({{\mathscr{M}}}\) and \({\mathcal {U}}\), and between \({{\mathscr{M}}}\) and \(\mathcal {S}\). The original train-test splits of the datasets are used in our experiments. For the target domain, we randomly select 10 samples per class from the training split. The evaluation is repeated five times.

We use the same architecture as LeNets++ [20], following [19]. Pre-processing steps, including resizing, normalization, and RGB-to-grayscale conversion, are applied where necessary. The learning rate for the network parameters is 0.001. The size of the mini-batch is 64. λ and α are fixed at 0.75 and 0.5, respectively. We compare the proposed method with CCSA [6], d-SNE [19], DAGE-LDA [29], and the three baselines (Models 1, 2, and 3 described in the Office31 experiment). The details of the architecture are given in Fig. 4.

Fig. 4
figure 4

CNN architecture used in MNIST-USPS experiments

As shown in Table 4, our method outperforms the other SOTA methods and the baselines. The proposed model performs similarly to d-SNE when the domain shift is small, i.e., on the cross-domain tasks \({\mathscr{M}} \rightarrow {\mathscr{M}} {\mathscr{M}}\), \({\mathscr{M}} \rightarrow \mathcal {U}\), \(\mathcal {U} \rightarrow {\mathscr{M}}\), and \(\mathcal {S} \rightarrow {\mathscr{M}}\). On the \({\mathscr{M}} \rightarrow \mathcal {S}\) task, which has a relatively large domain shift, our method demonstrates a clear advantage.

Table 4 Average classification accuracy (%) of different methods on digit transfer tasks

4.4.2 Second experiment

It is also interesting to see how the performance of the models varies with an even smaller number of samples per class from the target domain. Therefore, we follow another evaluation protocol [6, 19, 29] applied to the MNIST and USPS datasets. This protocol examines both the \({\mathscr{M}} \rightarrow \mathcal {U}\) and \(\mathcal {U} \rightarrow {\mathscr{M}}\) cross-domain tasks, where 2000 and 1800 images are randomly selected from MNIST and USPS, respectively. In addition, a small number (N) of labeled samples per category are randomly picked from the target domain and used in the model training. The evaluation is repeated ten times for each N from 0 to 7. We use the same data splits generated in [6] (Footnote 3).

The implementation details are the same as those in the first digit transfer experiment above. We compare the proposed method with CCSA [6], FADA [15], d-SNE [19], and DAGE-LDA [29]. As reported in [10], there are discrepancies between the network architectures described in the publications and those in the public source code of CCSA and d-SNE. Furthermore, for CCSA and d-SNE, the differences between the results reported in the original articles and those reproduced by [10, 43] are also relatively large. Therefore, for a fair comparison, we set the network architectures of CCSA, d-SNE, and DAGE-LDA to LeNets++ and rerun the experiments based on their publicly available code. To the best of our knowledge, the authors of FADA have not released their source code, so we directly report the results from the original publication. Two baselines, Model 2 and Model 3 described in the Office31 experiment, are also included in the comparison. Models 2 and 3 are trained with only source data when N = 0.

Table 5 shows the average classification accuracies of the different approaches on the MNIST-USPS collection. The standard deviations over the ten splits are small for all methods, so we do not report them in the table. Clearly, in our experiments the SDA-based approaches (except DAGE-LDA) achieve better performance than the baselines (Models 2 and 3), which do not incorporate distribution alignment in training. We also plot the performance of the proposed model against Models 2 and 3 on the \({\mathscr{M}} \rightarrow \mathcal {U}\) task for different numbers N of samples per class from \(\mathcal {D}_{t}\) in Fig. 5(a). The proposed strategy significantly improves the classification performance over the baselines when N is small. As N increases, the accuracy and the improvement gradually converge.

Table 5 Average classification accuracy (%) of different methods on the MNIST-USPS collection
Fig. 5
figure 5

Average classification accuracy for (a) \({{\mathscr{M}}} \rightarrow {\mathcal {U}}\) and (b) \({\mathcal {U}} \rightarrow {{\mathscr{M}}}\) tasks for different numbers (N) of labeled target samples per class from \(\mathcal {D}_{t}\)

More importantly, Table 5 shows that our model outperforms the other SDA methods in most scenarios, except that it has a slightly lower classification accuracy than FADA when N = 5. The proposed approach improves accuracy by at least 1% and 1.5% over the other SOTA methods when N = 7. When N = 1, the proposed method loses its computational advantage over the other SDA methods, as the number of paired samples equals the number of original samples. However, our strategy still shows superior accuracy. This superiority may come from the improved discriminative power of the features, achieved by decreasing the intra-class variation in addition to aligning the features between domains during training.

4.5 Visualization of deep learning features

To better understand the proposed CTL, we visualize the deep features of models trained with different values of N (N = 0, 1, 4, or 7) on the \({\mathcal {U}} \rightarrow {{\mathscr{M}}}\) task. When N = 0, the model is trained with only the softmax loss; otherwise, it is trained with the joint supervision of the softmax loss and CTL. The models are trained on a random draw of 1800 samples from USPS and N samples per class from MNIST. Other settings are the same as those in the digit transfer experiments. The visualization is performed on another random draw of 1800 and 2000 samples from USPS and MNIST, respectively, to avoid visualizing training data. We apply the t-SNE technique [44] to project the high-dimensional features into 2-D vectors for illustration (Fig. 6).
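A minimal sketch of this projection step with scikit-learn's t-SNE is given below; the random stand-in arrays and their sizes are placeholders for the actual deep features of the two evaluation draws.

import numpy as np
from sklearn.manifold import TSNE

# Stand-ins for the second-to-last-layer features of the two evaluation draws.
feats_usps = np.random.randn(1800, 128)    # source (USPS) features
feats_mnist = np.random.randn(2000, 128)   # target (MNIST) features

feats = np.concatenate([feats_usps, feats_mnist], axis=0)
feats_2d = TSNE(n_components=2, random_state=0).fit_transform(feats)  # 2-D embedding for plotting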

Fig. 6
figure 6

Visualization of deep learning feature distributions for varying numbers N of labeled samples per class from \(\mathcal {D}_{t}\) on the \({\mathcal {U}} \rightarrow {{\mathscr{M}}}\) task. (a) N = 0; without DA. (b) N = 1. (c) N = 4. (d) N = 7. Circles and triangles are from \(\mathcal {D}_{s}\) and \(\mathcal {D}_{t}\), respectively

Figure 6(a) shows that the features are not well aligned when no adaptation mechanism is involved. In addition, features with the same label but from different domains stay close to each other even when N = 1, as shown in subfigure (b). Comparing subfigures (b), (c), and (d), the alignment improves further as N increases under the proposed CTL. Discrepancies between the features of \(\mathcal {D}_{s}\) and \(\mathcal {D}_{t}\) can still be observed in several classes, e.g., classes 0 and 6, when N equals 1 or 4. However, the feature distributions of \(\mathcal {D}_{s}\) and \(\mathcal {D}_{t}\) nearly overlap when N = 7, as demonstrated in subfigure (d).

4.6 Sensitivity analysis on λ and α

The hyper-parameters λ and α control the adaptation rate between domains and the negative impact of noisy samples, respectively. Both are important to the training of the DNN model. Therefore, we carry out analyses to examine the model's sensitivity to them.

The analyses consider four cross-domain tasks, i.e., \({\mathcal {W}} \rightarrow {\mathcal {A}}\) (Office31), \({\mathcal {C}} \rightarrow {\mathcal {A}}\) (Office-Caltech-10), \({\mathcal {A}} r \rightarrow {\mathcal {C}} a\) (Office-Home), and \({\mathcal {U}} \rightarrow {\mathscr{M}}\) (digit transfer). The experimental protocols for \(\mathcal {W} \rightarrow \mathcal {A}\), \(\mathcal {C} \rightarrow \mathcal {A}\), and \(\mathcal {A} r \rightarrow \mathcal {C} a\) are the same as those described in the previous sections, except that the values of λ and α are not fixed but vary in this sensitivity analysis. For the \(\mathcal {U} \rightarrow {\mathscr{M}}\) task, we sample 1800 images from USPS and 3 images per class from MNIST to form a training set. The evaluation is performed on 2000 samples randomly drawn from MNIST, excluding the samples selected for training. The implementation details are the same as those in Section 4.4.2, except that the values of λ and α vary in this analysis.

First, we fix the center update step α at 0.5 and vary λ to train different models. Because CTL is based on the \(l_{2}\)-norm, the dimensionality of the deep features is positively related to the loss value, i.e., a higher dimensionality leads to a larger CTL. The dimensionalities of the deep features differ across the four cross-domain tasks, so we test different ranges of λ for each task, as listed below.

  1. \({\mathcal {W}} \rightarrow {\mathcal {A}}\): {0, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1}

  2. \({\mathcal {C}} \rightarrow {\mathcal {A}}\): {0, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1}

  3. \({\mathcal {A}} r\rightarrow {\mathcal {C}} a\): {0, 0.00025, 0.0005, 0.00075, 0.001, 0.0025, 0.005, 0.0075, 0.01}

  4. \({\mathcal {U}} \rightarrow {{\mathscr{M}}}\): {0, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5}

Figure 7 shows the evaluation accuracies of the proposed method for different values of λ. Simply using the softmax loss (λ = 0) is not a good option; in that case, the DNN models have the lowest average classification accuracy. We also observe a logarithm-like curve of model performance on all four tasks as λ varies within the investigated ranges, which shows that our model is insensitive to λ over a relatively large range.

Fig. 7
figure 7

Classification accuracies on DA tasks achieved by models with different λ and fixed α = 0.5. (a) \({\mathcal {W}} \rightarrow {\mathcal {A}}\); (b) \({\mathcal {C}} \rightarrow {\mathcal {A}}\); (c) \({\mathcal {A}} r\rightarrow {\mathcal {C}} a\), *values on the x-axis are on a \(10^{-4}\) basis; and (d) \({\mathcal {U}} \rightarrow {{\mathscr{M}}}\)

The DNN models achieve the best performance when λ equals 0.1, 0.1, 0.001, and 0.75 for the four evaluated tasks, respectively. We then fix λ at these values and train DNN models with different values of α from 0.01 to 1 (the same range for all tasks). The evaluation accuracies of these models are shown in Fig. 8. The performance is stable across a wide range of α values, i.e., from 0.2 to 1.

Fig. 8
figure 8

Classification accuracies on DA tasks achieved by models with different α and fixed λ. (a) \({\mathcal {W}} \rightarrow {\mathcal {A}}\) (λ = 0.1); (b) \({\mathcal {C}} \rightarrow {\mathcal {A}}\) (λ = 0.1); (c) \({\mathcal {A}} r \rightarrow {\mathcal {C}} a\) (λ = 0.001); and (d) \({\mathcal {U}} \rightarrow {{\mathscr{M}}}\) (λ = 0.75)

4.7 Size of mini-batch

The proposed CTL can be trained with the mini-batch strategy, so it is also interesting to explore how the mini-batch size influences its effectiveness. We conduct experiments on the \(\mathcal {U} \rightarrow {\mathscr{M}}\) cross-domain task. We sample 1800 images from USPS and 3 images per class from MNIST to form a training set. The evaluation is performed on 2000 samples randomly drawn from MNIST, excluding the samples selected for training. The implementation details are the same as those in the digit transfer experiments, except that the batch size is not fixed at 64. Batch sizes that are powers of 2 are common in DL training, so different values, i.e., {1, 2, 4, 8, 16, 32, 64, 128, 256, 512}, are tested in the analysis.

The accuracy of the proposed method for different batch sizes is shown in Fig. 9. The performance is unsatisfactory when the batch size is small. We further plot the learning curve of CTL for different batch sizes to examine the minimization process of CTL (Fig. 10). The figure shows that the training of CTL is unstable when the batch size is small; the smaller the batch size, the more volatile the training process becomes. Updating the centers based on very few samples at a time naturally leads to large randomness and bias. When the batch size grows, the learning curves of CTL become stable. In addition, Fig. 9 shows accuracy drops when the batch size is relatively large (i.e., 256 and 512), which is consistent with the finding in [45]. That study states that a large batch size (over 10% of the full batch) may not be a good choice: a model trained with a larger batch size is more likely to converge to sharp minima, i.e., the model is reasonably good but does not offer the best solution to the classification task.

Fig. 9
figure 9

Classification accuracies of the proposed method on \({\mathcal {U}} \rightarrow {{\mathscr{M}}}\) task using different batch sizes

Fig. 10
figure 10

Learning curves of CTL on the \({\mathcal {U}} \rightarrow {{\mathscr{M}}}\) task using different batch sizes. *The learning curve in subfigure (a) appears flat before the 18th epoch because the loss at the 18th epoch exceeds 300 and the losses at the other epochs are much smaller. In fact, the losses in this learning curve are volatile throughout the training process

In summary, although the choice of batch size affects the effectiveness of the proposed CTL, our loss is robust to the batch sizes commonly used in DL training. Unless the batch size is either too small or too large, the model performance remains satisfactory and stable.

5 Conclusion

Domain adaptation has recently drawn considerable interest in the DL community. It aims to make use of the copious amounts of accessible data from different domains. In this work, we propose a new loss function, referred to as CTL. It is trainable with a single-stream network based on the mini-batch strategy. Under the joint supervision of the softmax loss and CTL, same-class features from the source and target domains achieve a desirable degree of alignment and compact intra-class variation, while different-class features remain sufficiently separated. Using CTL yields domain alignment in the early training stage and intra-class variation minimization in the later stage, without the need to set a trade-off value to balance these two functions. The single-stream implementation and the achievement of both domain alignment and intra-class variation minimization without manual balancing are the two main advantages of our approach over previous methods. Experiments in the present study show that our approach outperforms the baselines and recent SOTA methods under identical settings across standard DA benchmarks.

Although the proposed CTL offers encouraging results, it is worthwhile to investigate whether its variants can perform even better. For example, following [46], one could use only the feature points nearest to each class center, rather than all feature points, to update the class centers in each iteration. This may reduce the negative impact of out-of-distribution feature points on the center update. Moreover, in addition to the \(l_{2}\)-norm distance, other metrics, such as the cosine distance and other norms, are also worth exploring in future work.