Distribution matching and structure preservation for domain adaptation

Cross-domain classification refers to completing a classification task in a target domain that lacks label information by exploiting useful knowledge in a related source domain with a different data distribution. Domain adaptation (DA) can address such cross-domain classification by reducing the divergence between domains and transferring relevant knowledge from the source to the target. To mine the discriminant information of the source-domain samples and the geometric structure information of the domains, and thus improve domain adaptation performance, this paper proposes a novel method involving distribution matching and structure preservation for domain adaptation (DMSP). First, it aligns the subspaces of the source and target domains on the Grassmann manifold and learns non-distorted embedded feature representations of the two domains. Second, in this embedded feature space, an empirical structural risk minimization framework with distribution adaptation regularization and intra-domain graph regularization is used to learn an adaptive classifier, further adapting the source and target domains. Finally, we perform extensive experiments on widely used cross-domain classification datasets to validate the superiority of DMSP; its average classification accuracy on these datasets is the highest among several state-of-the-art domain adaptation methods.


Introduction
Traditional machine learning methods usually need to meet two assumptions when dealing with classification problems. First, the training set and the testing set are independent and identically distributed. Second, there should be large amounts of labeled training samples. However, it is usually very expensive and time-consuming to collect labeled samples in real-world applications such as medical image analysis and autonomous vehicles [1]. It is, therefore, necessary to leverage other related data to assist in completing the corresponding classification tasks. Cross-domain classification refers to classifying the samples from a target domain that shares the same label space with a related source domain, aiming to explore useful knowledge in the source domain to classify the target data. Domain adaptation (DA) mainly comprises supervised DA and unsupervised DA, of which unsupervised DA is more challenging in practical applications since it has no label information in the target domain. In this paper, our principal focus is on unsupervised DA.
DA methods can be divided into two categories, deep DA methods and non-deep DA methods, depending on whether they learn deep adaptive features. Deep DA methods can work directly on original image data, using deep neural networks to perform end-to-end domain adaptation. However, such methods need to retrain deep network parameters; the training is not stable enough, and the tuning process is complex. Non-deep DA methods mainly include instance re-weighting methods [6][7][8], feature adaptation methods [9][10][11][12] and adaptive classifier learning methods [13][14][15]. Significantly, non-deep DA methods can also act on deep features, using labeled samples in the source domain to learn classifiers that generalize well to the target domain [16].
The usual strategy adopted by instance re-weighting methods is to weight or resample the samples to assist the target domain in its classification task. However, such methods are not suitable for cross-domain classification where the domains diverge in conditional distribution. Feature adaptation methods learn new feature representations of different domains by aligning their subspaces or minimizing a predefined distribution distance between them. In general, feature adaptation methods can only reduce, but cannot remove, domain divergence. Moreover, they need a traditional classification method to train a classifier on labeled source samples, and the residual divergence will affect the performance of that classifier on the target domain. Adaptive classifier learning methods directly train an adaptive classifier by jointly minimizing the classification loss and the distribution difference between the source and target domains. According to DA-related theories [17], the classification accuracy of the learned classifier on the target domain is affected to a certain extent by factors such as the distribution difference between the domains and its empirical error in the source domain.
However, adaptive classifier learning methods usually ignore the inter-class difference and intra-domain local structure [15]. Actually, the inter-class difference in the source domain has an important impact on the discriminant structure of the classifier, and the adjacent samples in a same domain usually belong to the same category [18]. Therefore, in this paper, we propose a new DA method, namely distribution matching and structure preservation for domain adaptation (DMSP). It learns the adaptive classifier by simultaneously minimizing the intra-domain graph regularization, the structural risk function, and the distribution adaptation regularization related to class-wise distribution distance between domains and inter-class distance in the source domain. In addition, adaptive classifier learning methods are usually designed for the original feature space, where feature distortion will negatively affect the DA performance [19]. To address this problem, DMSP learns the adaptive classifier on the Grassmann manifold (GM) where the feature distortion can be avoided.
The contributions of this paper are as follows: (1) We propose intra-domain graph regularization to preserve the respective local structures of the source and target domains, which is helpful to learn a discriminative classifier.
(2) Our DMSP jointly minimizes the structural risk function, the distribution adaptation regularization and the intra-domain graph regularization, to match the distributions of the source and target domains and obtain an adaptive classifier that is robust and discriminative on target data. Furthermore, the classifier is learned on the Grassmann manifold, where feature distortion can be avoided, so that its performance can be improved.
(3) We conduct comprehensive experiments on several cross-domain image datasets. Experimental results verify the effectiveness and superior performance of our method.

Related work
In this section, we review some previous works related to our proposed method in terms of distribution matching, adaptive classifier learning and local structure preservation.

Distribution matching
Distribution matching aims to reduce the distribution difference between the source and target domains. Maximum mean discrepancy (MMD) [20] is often used to measure the distribution difference. Based on MMD, joint distribution adaptation (JDA) [21] minimizes the class-wise distribution distance between the source and target domains to match their marginal and conditional distribution difference. Furthermore, to mine discriminative information within the two domains, domain invariant and class discriminative feature learning (DICD) [22] also minimizes the intra-class scatter and maximizes the inter-class dispersion simultaneously, while unsupervised metric transfer learning method (UMTL) [23] also maximizes the inter-class distance.
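MMD compares two distributions by the distance between their sample means in a kernel-induced feature space. As a minimal illustration (not the authors' implementation), the sketch below computes the empirical squared MMD with a linear kernel, which reduces to the squared Euclidean distance between the two domain means:

```python
import numpy as np

def mmd_linear(Xs, Xt):
    """Empirical squared MMD with a linear kernel: the squared
    distance between the source and target sample means."""
    diff = Xs.mean(axis=0) - Xt.mean(axis=0)
    return float(diff @ diff)

# Two toy domains with shifted means: the MMD grows with the shift.
rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(200, 5))
Xt = rng.normal(0.5, 1.0, size=(200, 5))
print(mmd_linear(Xs, Xt))  # nonzero for shifted domains
```

JDA-style class-wise matching applies the same statistic per class, using pseudo labels on the target side.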
These methods conduct distribution matching within a dimensionality-reduction procedure to exploit a shared feature space. However, they still need a traditional classification method for label prediction on the target data. Moreover, the cross-domain discrepancy still exists in the reduced latent feature space. Our DMSP minimizes the class-wise distribution distance and maximizes the inter-class distance simultaneously as in [23], but in a different way: it conducts distribution matching through a classifier learning procedure, which directly yields an adaptive classifier for the target data.

Local structure preservation
Local structure preservation aims to preserve the local geometric structure of given data. Graph regularization is devoted to preserving the local structure, encouraging samples from the same category to be close to each other. In DA problems, [24,25] minimize graph regularization over the whole of the source and target data to preserve their spatial relationship, where the graph regularization can be regarded as inter-domain graph regularization. [26,27] minimize such inter-domain graph regularization to ensure that the inferred labels comply with the local structure of the source and target data.
The methods mentioned above use inter-domain graph regularization to preserve the local structure of the whole data across domains (i.e., inter-domain local structure). However, the difference across domains might cause the local manifold structures of the source data and the target data to differ. Specifically, adjacent samples from different domains might belong to different categories. As a result, forcibly minimizing the inter-domain graph regularization will degrade DA performance. This paper proposes intra-domain graph regularization concerning the source and target data separately, which preserves the respective local manifold structures of the source data and the target data.

Adaptive classifier learning
Adaptive classifier learning aims to directly obtain an adaptive classifier, trained on labeled source samples, that performs well on target data. Adaptation regularization based transfer learning (ARTL) [15] learns the adaptive classifier by jointly minimizing inter-domain graph regularization, the structural risk function, and the class-wise distribution distance between domains. Clustering for domain adaptation (DAC) [28] further explores the cluster structure of target data. However, ARTL and DAC are designed only for the original feature space, where feature distortion will undermine performance [19]. On the basis of ARTL, manifold dynamic distribution adaptation (MDDA) [29] learns the adaptive classifier on the Grassmann manifold (GM) to overcome feature distortion.
DMSP is also based on ARTL aiming to learn the adaptive classifier on the GM, but it is different from ARTL and MDDA. On the one hand, ARTL and MDDA utilize the inter-domain graph regularization to preserve the manifold consistency underlying the whole data, which will result in different inferred labels for adjacent samples due to different local manifold structures of the source data and the target data. In contrast, DMSP uses the intra-domain graph regularization to induce adjacent samples in the same domain to be inferred as the same label. On the other hand, ARTL and MDDA fail to consider the impact of discriminative information within domains on the discriminant structure of the adaptive classifier, while DMSP maximizes the inter-class distance to improve the discriminant structure.

Proposed method
In this section, we detail the proposed distribution matching and structure preservation for domain adaptation (DMSP) method.

Problem statement
Given the source domain $\mathcal{D}_s = \{x_i, y_i\}_{i=1}^{n_s}$ and the target domain $\mathcal{D}_t = \{x_j\}_{j=n_s+1}^{n_s+n_t}$, where $x_i \in \mathbb{R}^{1 \times m}$ is a source sample associated with its label $y_i \in \{1, 2, \ldots, Cl\}$ and $x_j \in \mathbb{R}^{1 \times m}$ is a target sample, we assume that the two domains have different distributions but share the same label space and feature space. The goal of DMSP is to train an adaptive classifier on $\mathcal{D}_s$ that performs well on $\mathcal{D}_t$, with low expected error.

Problem formulation
DMSP works in two steps. In the first step, DMSP adopts the geodesic flow kernel (GFK) [30] to perform manifold feature learning on the Grassmann manifold (GM), which avoids feature distortion in the original space and aligns the subspaces of the source and target domains. Specifically, the bases of the source and target subspaces are denoted as $S_1 \in \mathbb{R}^{m \times d}$ and $S_2 \in \mathbb{R}^{m \times d}$, respectively, where $d$ is the dimension of the low-dimensional linear subspace. Considering $S_1$ and $S_2$ as two points on the GM, we know from the literature [30] that any $x_i, x_j \in \mathbb{R}^m$ in the original feature space can be projected into a manifold embedded feature space, denoted as $z_i, z_j$, and their inner product is
$$\langle z_i, z_j \rangle = x_i^{T} G x_j \tag{1}$$
where $G$ is the geodesic flow kernel matrix induced by the geodesic flow $\Phi(t)$ from $S_1$ to $S_2$. According to (1), the distance between samples on the GM is calculated, and then, based on the label information of the source samples, the initial pseudo labels $Y_t = \{y_{n_s+1}, \ldots, y_{n_s+n_t}\}$ of the target samples are obtained using the 1-nearest neighbor (1-NN) classifier. In addition, the manifold embedded feature representations of the source and target samples can be obtained in explicit form from [19], that is, $z_i = \sqrt{G}\, x_i$, $i = 1, 2, \ldots, n_s + n_t$.
In the second step, DMSP learns an adaptive classifier by simultaneously optimizing the structural risk function, inter-domain distribution matching and intra-domain local structure preservation. The objective function of DMSP is formulated as
$$f^{*} = \arg\min_{f \in \mathcal{H}_k} \sum_{i=1}^{n_s} \ell\big(f(z_i), y_i\big) + \sigma \|f\|_k^2 + \gamma M_f(\mathcal{D}_s, \mathcal{D}_t) + \lambda \bar{D}_{f,\beta}(\mathcal{D}_s, \mathcal{D}_t) \tag{2}$$
where $k(\cdot, \cdot)$ is the kernel function, $\mathcal{H}_k$ represents the corresponding Hilbert space, $\|f\|_k^2$ is the squared norm of $f$, the shrinkage regularization parameter $\sigma > 0$, the intra-domain graph regularization parameter $\gamma > 0$, the distribution adaptation regularization parameter $\lambda > 0$, and the trade-off parameter $\beta > 0$; $M_f$ denotes the intra-domain graph regularization and $\bar{D}_{f,\beta}$ the distribution adaptation regularization, both defined below.
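Since $\sqrt{G}$ is symmetric for a symmetric positive semi-definite $G$, the embedding $z_i = \sqrt{G}\,x_i$ reproduces the GFK inner product of formula (1) exactly. The sketch below illustrates this step with a placeholder $G$ (a random PSD matrix standing in for the geodesic flow kernel that GFK would compute from the subspace bases $S_1$, $S_2$):

```python
import numpy as np

def manifold_embed(X, G):
    """Embed rows of X as z_i = sqrt(G) x_i, so that
    z_i . z_j = x_i^T G x_j, the GFK inner product."""
    w, V = np.linalg.eigh(G)  # G is symmetric PSD
    G_sqrt = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
    return X @ G_sqrt         # sqrt(G) is symmetric

# Placeholder G: a random symmetric PSD matrix standing in for the
# geodesic flow kernel computed from the two subspace bases.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
G = A @ A.T
X = rng.normal(size=(10, 5))
Z = manifold_embed(X, G)
# The inner-product identity <z_i, z_j> = x_i^T G x_j holds:
assert np.allclose(Z @ Z.T, X @ G @ X.T)
```

In practice the 1-NN pseudo-labeling of the target samples is then run on these embedded rows rather than on the original features.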
The first two items in formula (2) are the structural risk minimization on the source domain. According to the representer theorem [31], $f(z) = \sum_{i=1}^{n_s+n_t} \omega_i k(z_i, z)$, so the structural risk minimization can be reformulated as
$$\min_{\omega}\; \big\|(Y - \omega^{T} K)E\big\|_F^2 + \sigma\, \mathrm{tr}\big(\omega^{T} K \omega\big) \tag{3}$$
where $\|\cdot\|_F^2$ denotes the squared Frobenius norm, $Y = [y_1, \ldots, y_{n_s}, y_{n_s+1}, \ldots, y_{n_s+n_t}]$ is the label matrix (for multi-class problems, $Y \in \mathbb{R}^{Cl \times (n_s+n_t)}$), $\omega = [\omega_1, \omega_2, \cdots, \omega_{n_s+n_t}]^{T}$ is the coefficient matrix, $K \in \mathbb{R}^{(n_s+n_t) \times (n_s+n_t)}$ is the kernel matrix with elements $K_{ij} = k(z_i, z_j)$, and $E = \mathrm{diag}(E_1, \ldots, E_{n_s+n_t})$ with $E_i = 1$ when $1 \le i \le n_s$ and $E_i = 0$ when $i \ge n_s + 1$; matrix $E$ removes the unreliable pseudo labels of the target domain from the loss.
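The two ingredients of this reformulation, the kernel matrix $K$ and the diagonal indicator $E$, can be built directly. The following sketch (our own illustration; the Gaussian kernel width is an assumed parameter) constructs both:

```python
import numpy as np

def gaussian_kernel(Z, width=1.0):
    """Gram matrix K with K_ij = exp(-||z_i - z_j||^2 / (2 width^2))."""
    sq = np.sum(Z**2, axis=1)
    d2 = np.clip(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0, None)
    return np.exp(-d2 / (2.0 * width**2))

def label_indicator(n_s, n_t):
    """Diagonal E with E_ii = 1 for source rows and 0 for target rows,
    which keeps target pseudo-labels out of the squared loss."""
    return np.diag(np.r_[np.ones(n_s), np.zeros(n_t)])

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 4))
K = gaussian_kernel(Z)
E = label_indicator(n_s=5, n_t=3)
assert np.allclose(np.diag(K), 1.0)  # k(z, z) = 1
assert E.trace() == 5.0              # only source samples enter the loss
```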
The third item in formula (2) is the proposed intra-domain graph regularization, which aims to preserve the respective local structures of the source and target domains and to induce adjacent samples in the same domain to be inferred as the same label. Let $N_p^s(\cdot)$ and $N_p^t(\cdot)$ be the sets of $p$-nearest neighbors in the source and target domains, respectively. The intra-domain graph regularization is defined as
$$M_f(\mathcal{D}_s, \mathcal{D}_t) = \sum_{i,j=1}^{n_s+n_t} \big(f(z_i) - f(z_j)\big)^2 W_{ij} \tag{4}$$
where
$$W_{ij} = \begin{cases} \mathrm{sim}(z_i, z_j), & z_i, z_j \in \mathcal{D}_s \ \text{and} \ \big(z_j \in N_p^s(z_i) \ \text{or} \ z_i \in N_p^s(z_j)\big) \\ \mathrm{sim}(z_i, z_j), & z_i, z_j \in \mathcal{D}_t \ \text{and} \ \big(z_j \in N_p^t(z_i) \ \text{or} \ z_i \in N_p^t(z_j)\big) \\ 0, & \text{otherwise} \end{cases} \tag{5}$$
with $\mathrm{sim}(\cdot, \cdot)$ a similarity measure, so that no graph edge crosses the domain boundary. Introducing the graph Laplacian
$$L = D - W, \qquad D_{ii} = \sum_{j=1}^{n_s+n_t} W_{ij} \tag{6}$$
the intra-domain graph regularization can be expressed as
$$M_f(\mathcal{D}_s, \mathcal{D}_t) = \mathrm{tr}\big(\omega^{T} K L K \omega\big) \tag{7}$$
The fourth item in formula (2) is the inter-domain distribution adaptation regularization, which simultaneously minimizes the class-wise distribution distance between domains and maximizes the inter-class distance in the source domain, as in [23]. The class-wise distribution distance is formulated as
$$D_f(\mathcal{D}_s, \mathcal{D}_t) = \sum_{c=0}^{Cl} \Bigg\| \frac{1}{n_s^{(c)}} \sum_{z_i \in \mathcal{D}_s^{(c)}} f(z_i) - \frac{1}{n_t^{(c)}} \sum_{z_j \in \mathcal{D}_t^{(c)}} f(z_j) \Bigg\|^2 \tag{8}$$
where $c = 0$ corresponds to the marginal distributions (the whole domains), and $\mathcal{D}_s^{(c)}$ and $\mathcal{D}_t^{(c)}$ denote the class-$c$ samples ($n_s^{(c)}$ and $n_t^{(c)}$ in number, with pseudo labels used on the target side). The inter-class distance is
$$Q_f(\mathcal{D}_s) = \sum_{c \ne c'} \Bigg\| \frac{1}{n_s^{(c)}} \sum_{z_i \in \mathcal{D}_s^{(c)}} f(z_i) - \frac{1}{n_s^{(c')}} \sum_{z_j \in \mathcal{D}_s^{(c')}} f(z_j) \Bigg\|^2 \tag{9}$$
Finally, the inter-domain distribution adaptation regularization becomes
$$\bar{D}_{f,\beta}(\mathcal{D}_s, \mathcal{D}_t) = D_f(\mathcal{D}_s, \mathcal{D}_t) - \beta\, Q_f(\mathcal{D}_s) = \mathrm{tr}\big(\omega^{T} K M K \omega\big) \tag{10}$$
where $M$ is the MMD-style coefficient matrix that aggregates the class-wise and inter-class terms.
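The defining property of the intra-domain graph is that neighbor edges never cross the domain boundary. The sketch below (an illustration under our own simplifications: binary edge weights and an unnormalized Laplacian) builds such a graph Laplacian with one $p$-nearest-neighbor block per domain:

```python
import numpy as np

def intra_domain_laplacian(Z, n_s, p=3):
    """Laplacian L = D - W, where W connects p-nearest neighbours only
    WITHIN each domain (source rows 0..n_s-1, target rows n_s..).
    Cross-domain entries stay zero, so each domain keeps its own
    local structure."""
    n = Z.shape[0]
    W = np.zeros((n, n))
    for lo, hi in [(0, n_s), (n_s, n)]:  # one block per domain
        B = Z[lo:hi]
        d2 = ((B[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(d2, np.inf)     # exclude self-neighbours
        for i in range(hi - lo):
            for j in np.argsort(d2[i])[:p]:   # p nearest in-domain
                W[lo + i, lo + j] = W[lo + j, lo + i] = 1.0
    D = np.diag(W.sum(axis=1))
    return D - W

rng = np.random.default_rng(0)
Z = rng.normal(size=(12, 4))
L = intra_domain_laplacian(Z, n_s=6, p=2)
# No edge crosses the domain boundary:
assert np.all(L[:6, 6:] == 0) and np.all(L[6:, :6] == 0)
```

Replacing the two per-domain blocks with a single block over all samples would recover the inter-domain graph of ARTL/MDDA.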

Problem solving and algorithm description
Substituting formulas (3), (7) and (10) into formula (2), the objective function of DMSP can be reformulated as
$$L(\omega) = \big\|(Y - \omega^{T} K)E\big\|_F^2 + \sigma\, \mathrm{tr}\big(\omega^{T} K \omega\big) + \mathrm{tr}\big(\omega^{T} K (\gamma L + \lambda M) K \omega\big) \tag{11}$$
Setting $\partial L(\omega) / \partial \omega = 0$, we obtain the closed-form solution
$$\omega^{*} = \big((E + \gamma L + \lambda M) K + \sigma I\big)^{-1} E Y^{T} \tag{12}$$
Then we obtain the adaptive classifier $f(z) = \sum_{i=1}^{n_s+n_t} \omega_i k(z_i, z)$. Like ARTL [15], DMSP iteratively updates the target pseudo labels, eventually optimizing itself.
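The closed-form solution of formula (12) is a single linear solve. The sketch below (our own illustration; the matrices $L$ and $M$ are passed in precomputed, and the sanity check uses trivial inputs rather than real data) implements it together with the resulting decision function:

```python
import numpy as np

def solve_dmsp(K, Y, E, L, M, sigma=0.1, gamma=1.0, lam=10.0):
    """Closed-form coefficients of formula (12):
    omega = ((E + gamma*L + lam*M) K + sigma*I)^{-1} E Y^T."""
    n = K.shape[0]
    A = (E + gamma * L + lam * M) @ K + sigma * np.eye(n)
    return np.linalg.solve(A, E @ Y.T)

def predict(omega, K):
    """f(z_j) = sum_i omega_i k(z_i, z_j); argmax over columns
    gives the predicted class of each sample."""
    return K @ omega

# Sanity check: with K = I, no graph/adaptation terms (L = M = 0) and
# all samples labelled (E = I), omega reduces to Y^T / (1 + sigma).
n, Cl = 6, 3
K = np.eye(n)
Y = np.eye(Cl, n)   # toy Cl x n one-hot label matrix
E = np.eye(n)
Z0 = np.zeros((n, n))
omega = solve_dmsp(K, Y, E, Z0, Z0, sigma=0.1)
assert np.allclose(omega, Y.T / 1.1)
```

In the iterative scheme, `predict` supplies new target pseudo labels, $M$ is rebuilt from them, and `solve_dmsp` is called again for $T$ rounds.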
Finally, DMSP is summarized in Algorithm 1. In addition, a chart showing the architecture of DMSP is given in Fig. 2.

Experimental analysis
In this section, we evaluate the performance of DMSP through extensive experiments on widely used cross-domain image datasets.

Dataset descriptions
This paper uses public cross-domain classification datasets for experiments, including Office + Caltech, ImageCLEF-DA and Office-31. The Office + Caltech dataset [30] consists of 4 domains, which are from Amazon, Webcam, DSLR and Caltech-256, respectively. These 4 domains are abbreviated as A10, W10, D10 and C10, with 10 common categories. Accordingly, 12 cross-domain classification tasks can be formed, i.e., A10 → C10, W10 → D10, etc. Note that D s → D t represents the cross-domain classification task from D s to D t . Office-31 [32] consists of three domains: Amazon, Webcam and DSLR. These three domains are abbreviated as A31, W31 and D31, with 31 shared categories. Similarly, we can construct 6 tasks. ImageCLEF-DA is composed of three domains, which are, respectively, from Caltech-256, ImageNet ILSVRC 2012 and Pascal VOC 2012. These three domains are abbreviated as P12, I12 and C12, with 12 common categories, and 6 tasks can be formed. For Office + Caltech, the SURF [30] features and Decaf6 [33] features are used for experiments. For ImageCLEF-DA and Office-31, we adopt the Resnet-50 features, extracted from the Resnet-50 model [34]. Figure 3 shows some exemplary images from these datasets and Table 1 lists the descriptions of these datasets.
The dimension d of the low-dimensional subspace in DMSP can be selected by the subspace disagreement measure proposed in [30]. In the comparative experiments, considering that the proposed method and ARTL share some parameters of the same type, for fair comparison we uniformly set σ = 0.1, λ = 10, p = 10, T = 10, and set the kernel function to the Gaussian kernel. In addition, for datasets using shallow features [i.e., Office + Caltech (SURF)], we set γ = 0.1, while for datasets using deep features [i.e., Office + Caltech (Decaf6), ImageCLEF-DA and Office-31], we set γ = 1. Furthermore, the trade-off parameter is set to β = 0.1 on all datasets. The parameters of the other DA methods are selected based on experience: we compare the results under the parameters used in the original literature and adopt the best result.
In our experiments, we use classification accuracy on the target domain as the evaluation measure, which is widely adopted in the existing literature [15,21,24,25,28,29]:
$$\mathrm{Accuracy} = \frac{\big|\{x : x \in \mathcal{D}_t \wedge f(x) = y\}\big|}{|\mathcal{D}_t|}$$
where $f(x)$ and $y$ are the predicted label and the ground-truth label of $x$, respectively.
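For concreteness, this evaluation measure is just the fraction of correctly labeled target samples:

```python
def accuracy(pred, truth):
    """Accuracy = |{x in D_t : f(x) = y}| / |D_t|."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

print(accuracy([0, 1, 1, 2], [0, 1, 2, 2]))  # 0.75
```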

Experimental results
The classification accuracy results on Office + Caltech using SURF and Decaf6 features are reported in Tables 2 and 3, respectively; the best results are marked in bold. We observe that the classification accuracy of the no-adaptation baseline (i.e., 1-NN) on Office + Caltech (SURF) and Office + Caltech (Decaf6) is lower than that of all the DA methods. Therefore, whether we use shallow or deep features, DA is necessary for cross-domain classification. Compared with the adaptive classifier learning methods (i.e., DAC, ARTL and MDDA), our DMSP (also an adaptive classifier learning method) preserves the intra-domain local structure and enhances the discriminant structure of the classifier, thus outperforming them on almost all tasks. Another observation is that DMSP achieves the highest average classification accuracy on both Office + Caltech (SURF) and Office + Caltech (Decaf6). Even though the Decaf6 features yield an obvious improvement over the SURF features, DMSP still improves the final classification significantly, by more than 1.8% compared with the best comparison method (MDDA). From Tables 2 and 3 we also note that on the tasks W10 → C10 (SURF) and D10 → C10 (SURF), DMSP dominates all other DA methods except OT-GL. A possible explanation is that the difference between these domains is large, and although DMSP preserves the respective local structure of each domain, this gives it no outright advantage over the sample-based matching of OT-GL. For the tasks W10 → D10 and D10 → W10, DMSP is beaten by JDA, JGSA, DAC and MDDA, as reported in Tables 2 and 3. This is because D10 and W10 are the closest pair of domains, so preserving the intra-domain local structure gives DMSP no particular advantage; that D10 and W10 are the closest pair is clearly seen from the 1-NN accuracies. Although DMSP does not achieve the best performance on all tasks, it performs best on most tasks (17/24).
Particularly, on C10 → D10, A10 → W10 (SURF), A10 → C10 (SURF), C10 → W10 (Decaf6) and A10 → D10 (Decaf6), our DMSP improves by around 5% over the best comparison.
The results on the ImageCLEF-DA and Office-31 datasets are reported in Table 4. Using Resnet-50 features, DMSP also outperforms the related methods ARTL and MDDA. This is because DMSP preserves the intra-domain local structure and makes the inter-class centroids separable. However, for the tasks D31 → W31 and W31 → D31, DMSP does not improve the classification accuracy, and is even beaten by the baseline method 1-NN. The main reason is that the two domains are very close, so the advantages of DMSP (e.g., intra-domain local structure preservation) are not fully realized and instead cause a negative impact. We also note that although ARTL, MDDA and DMSP are non-deep DA methods, they outperform the deep DA methods (e.g., DAN, DANN, JAN, CAN, DART and MRAN), which demonstrates the significance of non-deep DA methods. Moreover, our DMSP achieves the highest average classification accuracy (88.6%), which is 0.3% higher than the best comparison. In particular, on the task A31 → D31, DMSP improves by up to nearly 3 points over the best comparison result.

Effectiveness verification
DMSP mainly consists of the following components on top of the structural risk minimization: the inter-domain distribution adaptation regularization with inter-class distance maximization added, the intra-domain graph regularization, and the embedded features. We therefore performed an ablation study to investigate these components, removing each component of DMSP separately and observing how the performance changes. The classification accuracy results of the variants on several tasks are reported in Table 5. We see that DMSP outperforms all of the variants, suggesting that each component is important to DMSP. As expected, the overall classification performance of DMSP-L1 is the worst among the variants. This is because distribution matching is essential to reduce the distribution difference between domains. DMSP-original is obviously inferior to DMSP on several tasks (e.g., C10 → A10 (SURF), A10 → W10 (SURF), and D10 → A10 (SURF)), which indicates that embedded features learned on the Grassmann manifold (GM) can effectively alleviate feature distortion and achieve subspace alignment for those domain pairs from Office + Caltech (SURF). DMSP-L2 degrades the classification performance on these tasks, indicating that it is necessary for DMSP to preserve the intra-domain local structure. The inter-class distance maximization can separate the inter-class centroids and thus enhance the discriminant structure of the final classifier, so removing it degrades the classification performance: DMSP-L3 has lower classification accuracy than DMSP, indicating the necessity of inter-class distance maximization.
To verify the advantage of our proposed intra-domain graph regularization over the inter-domain graph regularization [15,29], we replace the intra-domain graph regularization with the inter-domain graph regularization to obtain another variant of DMSP, denoted DMSP-L4, and observe the change in performance. The classification accuracy results of DMSP and DMSP-L4 on several tasks are shown in Fig. 4, where C10 → W10, A10 → W10 and W10 → C10 are from the Office + Caltech (SURF) dataset, and C10 → D10, A10 → D10 and D10 → A10 are from the Office + Caltech (Decaf6) dataset. We find that the classification accuracy of DMSP on each task is higher than that of DMSP-L4, whether based on shallow features (i.e., SURF) or deep features (i.e., Decaf6 and ResNet50). This is because the intra-domain graph regularization aims to preserve the respective local structure of each domain, while the inter-domain graph regularization ignores the difference between domains and preserves the local structure of the whole data across domains. Therefore, the intra-domain graph regularization is more conducive to enhancing the final classification performance than the inter-domain graph regularization.
To further verify the performance of DMSP, we calculate and observe the average number of samples from a different class in each set of 10-nearest neighbors (ANDC), based on the label distribution in the target domain. The label distribution can be obtained from the final adaptive classifier learned by DMSP. If the final classifier is able to discriminate the target samples, adjacent samples are predicted as the same label and the ANDC value will be small. Figure 5 shows the ANDC values produced by DMSP and the related works (i.e., ARTL and MDDA) on several tasks from Office + Caltech (SURF). As can be observed, the ANDC values differ greatly. In particular, on tasks C10 → W10 and A10 → W10, the ANDC value of ARTL reaches almost 7; that is, on average, up to 7 of every 10 nearest neighbors are predicted to be of different categories. The main reason is that ARTL learns the adaptive classifier in the original space, where feature distortion negatively affects the performance of the classifier, and it ignores the discriminative information within domains. MDDA learns the adaptive classifier on the Grassmann manifold (GM), but it also ignores the impact of the intra-domain local structure and the inter-class distance on the discriminant structure of the adaptive classifier. Therefore, the ANDC value of MDDA is smaller than that of ARTL, but larger than that of our DMSP, which jointly minimizes the structural risk function, the distribution adaptation regularization with inter-class distance maximization added and the intra-domain graph regularization, and learns the adaptive classifier on the GM.
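The ANDC statistic above can be computed directly from the embedded samples and their predicted labels. A minimal sketch (our own illustration, checked on a toy two-cluster example rather than the paper's data):

```python
import numpy as np

def andc(Z, labels, k=10):
    """Average Number of samples from a Different Class among each
    sample's k nearest neighbours; small values mean adjacent samples
    share the same (predicted) label."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)   # a sample is not its own neighbour
    counts = [np.sum(labels[np.argsort(d2[i])[:k]] != labels[i])
              for i in range(len(Z))]
    return float(np.mean(counts))

# Two well-separated clusters labelled consistently give ANDC = 0.
rng = np.random.default_rng(0)
Z = np.vstack([rng.normal(0, 0.1, (20, 3)), rng.normal(5, 0.1, (20, 3))])
labels = np.array([0] * 20 + [1] * 20)
print(andc(Z, labels, k=10))  # 0.0
```

An ANDC near 7, as reported for ARTL, would mean 7 of every 10 neighbors carry a different predicted label.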

Parameter analysis
Experiments are conducted on the randomly selected tasks A10 → C10 with Decaf6 features, A10 → W10 with SURF features, C12 → P12 and W31 → A31 to analyze the parameter sensitivity and convergence of DMSP. Since we have verified that DMSP performs well when σ = 0.1, λ = 10 and p = 10 are fixed to the same values as in ARTL on all cross-domain classification tasks, we only evaluate the sensitivity of the intra-domain graph regularization parameter γ and the trade-off parameter β. The classification accuracy curves on the selected tasks are provided in Fig. 6.
γ is the intra-domain graph regularization parameter. If γ is too small, the intra-domain local structure cannot be preserved; if it is too large, the discriminant information of domains and the distribution adaptation will be ignored. Figure 6a shows that DMSP achieves better DA performance in the range γ ∈ [0.1, 1]. Besides, β is the trade-off parameter between inter-domain and inter-class distribution difference. Figure 6b reveals that DMSP has better DA performance in the range β ∈ [0.1, 0.4].
Finally, we check the convergence property of DMSP through empirical analysis. The convergence curves of the classification accuracy of DMSP are shown in Fig. 7. It can be observed that the classification accuracy of DMSP on each task increases steadily with more iterations and stabilizes within 10 iterations, indicating that DMSP converges within a few iterations.

Conclusion
In this paper, a novel method referred to as distribution matching and structure preservation for domain adaptation (DMSP) is proposed. DMSP aims to learn an adaptive classifier on the GM under the principle of structural risk minimization while preserving the intra-domain local structure and matching the distributions of different domains. First, the source and target samples are embedded into the manifold feature space using the GFK method, and their feature subspaces are geometrically aligned. Then, based on the principle of structural risk minimization, an adaptive classifier is learned. During this process, distribution matching is carried out by adding a regularization term based on the inter-domain and inter-class distribution differences, and the respective local structures of the source and target domains are preserved by adding an intra-domain graph regularization term. Comprehensive experiments on several cross-domain classification datasets validate the effectiveness of DMSP and its superiority over other state-of-the-art DA methods. Admittedly, DMSP performs adaptive classifier learning on the GM in two separate steps, which is not as simple as it could be; more simplified and efficient designs of DA models are worthy of further research. In addition, we would like to extend our work to more complex situations, such as those where both the distribution and the feature space differ between the source and target domains.

Code availability All authors declare that all software applications or custom codes comply with field standards.

Conflict of interest All authors declare no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.