
1 Introduction

Classification models trained on a labeled dataset perform poorly on data drawn from a different distribution, owing to data shift [14]. The problem of domain adaptation (DA) deals with adapting models trained on one data distribution (source domain) to different data distributions (target domains). For the purpose of this paper, we organize unsupervised DA procedures into two categories, linear and nonlinear, based on the feature representations used in the model. Linear techniques estimate a linear transformation of the source (target) data that aligns it with the target (source), or learn a linear classifier on the source data and adapt it to the target data [2, 13, 15]. Nonlinear procedures, on the other hand, apply nonlinear transformations to reduce cross-domain disparity [6, 11].

In this work we present the Nonlinear Embedding Transform (NET) procedure for unsupervised DA. The NET consists of two steps: (i) nonlinear domain alignment using Maximum Mean Discrepancy (MMD) [9], and (ii) a similarity-based embedding that clusters the data by category for improved classification. In addition, we introduce a procedure for sampling the source data to generate a validation set for model selection. We study the performance of the NET algorithm on popular computer vision DA datasets. Our results show significant improvements in classification accuracy compared to competitive DA procedures.

2 Related Work

In this section we provide a concise review of unsupervised DA procedures closely related to the NET. Among linear methods, Bruzzone et al. [2] proposed the DASVM algorithm, which iteratively adapts an SVM trained on the source data to the unlabeled target data. The state-of-the-art linear DA procedures are Subspace Alignment (SA) by Fernando et al. [5] and the CORAL algorithm by Sun et al. [13]. SA aligns the subspaces of the source and the target with a linear transformation, while CORAL transforms the source data so that the covariance matrices of the source and target are aligned.

Nonlinear procedures generally project the data to a high-dimensional space and align the source and target distributions in that space. The popular GFK algorithm by Gong et al. [6] projects the two distributions onto a manifold and learns a transformation to align them. The Transfer Component Analysis (TCA) [11], Transfer Joint Matching (TJM) [8], and Joint Distribution Adaptation (JDA) [9] algorithms apply MMD-based projections to nonlinearly align the domains. In addition, the TJM implements instance selection using \(\ell _{2,1}\)-norm regularization and the JDA performs a joint distribution alignment of the source and target domains. The NET implements nonlinear alignment of the domains along with a similarity-preserving projection, which ensures that the projected data is clustered by category. We compare the NET only with kernel-based nonlinear methods and do not include deep learning based DA procedures.

3 DA with Nonlinear Embedding

In this section we outline the problem of unsupervised DA and develop the NET algorithm. Let \({\mathbf {X}}_S = [{\mathbf {x}}_1^s, \ldots , {\mathbf {x}}_{n_s}^s] \in {\mathbb {R}}^{d\times n_s}\) and \({\mathbf {X}}_T = [{\mathbf {x}}_1^t, \ldots , {\mathbf {x}}_{n_t}^t] \in {\mathbb {R}}^{d\times n_t}\) be the source and target data points respectively. Let \(Y_S = [y_1^s, \ldots , y_{n_s}^s]\) and \(Y_T = [y_1^t, \ldots , y_{n_t}^t]\) be the source and target labels respectively. Here, \({\mathbf {x}}_i^s\) and \({\mathbf {x}}_i^t\) \(\in {\mathbb {R}}^d\) are data points and \(y_i^s\) and \(y_i^t\) \(\in \{1,\ldots ,C\}\) are the associated labels. We define the combined data matrix \({\mathbf {X}}= [{\mathbf {X}}_S, {\mathbf {X}}_T] \in {\mathbb {R}}^{d\times n}\), where \(n = n_s + n_t\). In the case of unsupervised DA, the labels \(Y_T\) are missing and the joint distributions of the two domains are different, i.e. \(P_S(X,Y) \ne P_T(X,Y)\). The task lies in learning a classifier \(f({\mathbf {x}}) = p(y|{\mathbf {x}})\) that predicts the labels of the target data points.

3.1 Nonlinear Embedding for DA

One technique to reduce domain disparity is to project the source and target data to a common subspace. KPCA is a popular nonlinear projection algorithm in which the data is first mapped to a high-dimensional (possibly infinite-dimensional) space given by \(\varPhi ({\mathbf {X}}) = [\phi ({\mathbf {x}}_1), \ldots , \phi ({\mathbf {x}}_n)]\). \(\phi :{\mathbb {R}}^d \rightarrow {\mathcal {H}}\) defines the mapping and \({\mathcal {H}}\) is an RKHS with a positive semidefinite (psd) kernel \(k({\mathbf {x}},{\mathbf {y}}) = \phi ({\mathbf {x}})^\top \phi ({\mathbf {y}})\). The kernel matrix for \({\mathbf {X}}\) is given by \({\mathbf {K}}= \varPhi ({\mathbf {X}})^\top \varPhi ({\mathbf {X}}) \in {\mathbb {R}}^{n\times n}\). The mapped data is then projected onto a subspace spanned by eigen-vectors (directions of maximum nonlinear variance in the RKHS). The top k eigen-vectors in the RKHS are obtained using the representer theorem, \({\mathbf {U}}= \varPhi ({\mathbf {X}}){\mathbf {A}}\), where \({\mathbf {A}}\in {\mathbb {R}}^{n\times k}\) is the matrix of coefficients to be determined. The nonlinearly projected data is then given by \({\mathbf {Z}}= [{\mathbf {z}}_1, \ldots , {\mathbf {z}}_n] = {\mathbf {A}}^\top {\mathbf {K}}\in {\mathbb {R}}^{k\times n}\), where \({\mathbf {z}}_i \in {\mathbb {R}}^k\), \(i = 1,\ldots ,n\), are the projected data points.
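
To make the notation concrete, the following numpy/scikit-learn sketch (our own illustration, not the authors' implementation) builds a kernel matrix over the combined source and target data and applies a coefficient matrix \({\mathbf {A}}\) to obtain \({\mathbf {Z}}= {\mathbf {A}}^\top {\mathbf {K}}\); the choice of an RBF kernel and its bandwidth are assumptions on our part.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def project_kernel(Xs, Xt, A, gamma=1.0):
    """Project combined source/target data with Z = A^T K.

    Xs : (d, n_s) source data, Xt : (d, n_t) target data,
    A  : (n, k) coefficient matrix, where n = n_s + n_t.
    """
    X = np.hstack([Xs, Xt])                 # (d, n) combined data matrix
    K = rbf_kernel(X.T, X.T, gamma=gamma)   # (n, n) kernel matrix
    Z = A.T @ K                             # (k, n) projected data points
    return Z, K

# toy usage with random data and a random (untrained) coefficient matrix A
rng = np.random.default_rng(0)
Xs, Xt = rng.normal(size=(10, 30)), rng.normal(size=(10, 20))
A = rng.normal(size=(50, 5))
Z, K = project_kernel(Xs, Xt, A)
print(Z.shape, K.shape)   # (5, 50) (50, 50)
```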

In order to reduce the domain discrepancy in the projected space, we implement joint distribution adaptation (JDA), as outlined in [9]. The JDA seeks to align the marginal and conditional probability distributions of the projected data (\({\mathbf {Z}}\)) by estimating the coefficient matrix \({\mathbf {A}}\) that minimizes:

$$\begin{aligned} \min _{{\mathbf {A}}} \sum _{c=0}^C{\mathrm {tr}}({\mathbf {A}}^\top {\mathbf {K}}{\mathbf {M}}_c{\mathbf {K}}^\top {\mathbf {A}}). \end{aligned}$$
(1)

\({\mathrm {tr}}(.)\) refers to trace and \({\mathbf {M}}_c\), where \(c = 0,\ldots ,C\), are \(n \times n\) matrices given by,

$$\begin{aligned} (M_c)_{ij} = {\left\{ \begin{array}{ll} \frac{1}{n_s^{(c)}n_s^{(c)}}, & {\mathbf {x}}_i, {\mathbf {x}}_j \in {\mathcal {D}}_s^{(c)}\\ \frac{1}{n_t^{(c)}n_t^{(c)}}, & {\mathbf {x}}_i, {\mathbf {x}}_j \in {\mathcal {D}}_t^{(c)}\\ \frac{-1}{n_s^{(c)}n_t^{(c)}}, & {\mathbf {x}}_i \in {\mathcal {D}}_s^{(c)}, {\mathbf {x}}_j \in {\mathcal {D}}_t^{(c)} ~\text {or}~ {\mathbf {x}}_j \in {\mathcal {D}}_s^{(c)}, {\mathbf {x}}_i \in {\mathcal {D}}_t^{(c)}\\ 0, & \text {otherwise}, \end{array}\right. } \end{aligned}$$
(2)
$$\begin{aligned} (M_0)_{ij} = {\left\{ \begin{array}{ll} \frac{1}{n_sn_s}, & {\mathbf {x}}_i, {\mathbf {x}}_j \in {\mathcal {D}}_s\\ \frac{1}{n_tn_t}, & {\mathbf {x}}_i, {\mathbf {x}}_j \in {\mathcal {D}}_t\\ \frac{-1}{n_sn_t}, & \text {otherwise}, \end{array}\right. } \end{aligned}$$
(3)

where \({\mathcal {D}}_s\) and \({\mathcal {D}}_t\) are the sets of source and target data points respectively. \({\mathcal {D}}_s^{(c)}\) is the set of source data points belonging to class c and \(n_s^{(c)} = |{\mathcal {D}}_s^{(c)}|\). Likewise, \({\mathcal {D}}_t^{(c)}\) is the set of target data points belonging to class c and \(n_t^{(c)} = |{\mathcal {D}}_t^{(c)}|\). Since the target labels are unknown, we use predicted labels for the target data points. We begin with predicting the target labels using a classifier trained on the source data and refine these labels over iterations, to arrive at the final prediction. For more details please refer to [9].
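
As a concrete illustration, the following numpy sketch constructs \({\mathbf {M}}_0, \ldots , {\mathbf {M}}_C\) directly from Eqs. (2) and (3), using the source labels and the predicted (pseudo-) labels for the target data; it is a minimal sketch rather than the reference implementation of [9].

```python
import numpy as np

def mmd_matrices(ys, yt_pred, C):
    """MMD matrices M_0, ..., M_C of Eqs. (2)-(3).

    ys      : (n_s,) source labels in {1, ..., C}
    yt_pred : (n_t,) predicted (pseudo-) labels for the target data
    Returns the list [M_0, M_1, ..., M_C], each of size (n, n).
    """
    ys, yt_pred = np.asarray(ys), np.asarray(yt_pred)
    n_s, n_t = len(ys), len(yt_pred)
    n = n_s + n_t
    # marginal term M_0: blocks 1/(n_s n_s), 1/(n_t n_t), -1/(n_s n_t)
    e0 = np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, -1.0 / n_t)])
    mats = [np.outer(e0, e0)]
    # conditional terms M_c, one per class c = 1, ..., C
    for c in range(1, C + 1):
        e_c = np.zeros(n)
        src = np.where(ys == c)[0]             # source points of class c
        tgt = n_s + np.where(yt_pred == c)[0]  # target points predicted as c
        if src.size:
            e_c[src] = 1.0 / src.size
        if tgt.size:
            e_c[tgt] = -1.0 / tgt.size
        mats.append(np.outer(e_c, e_c))
    return mats
```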

In addition to domain alignment, we would like the projected data \({\mathbf {Z}}\) to be classification friendly (easily classifiable). To this end, we introduce Laplacian eigenmaps to obtain a similarity-preserving projection in which data points with the same class label are clustered together. The similarity relations are captured by the \((n\times n)\) adjacency matrix \({\mathbf {W}}\) defined in (4), and the projected data \({\mathbf {Z}}\) is estimated by minimizing (5):

$$\begin{aligned} {\mathbf {W}}_{ij} = {\left\{ \begin{array}{ll} 1, & y_i = y_j\\ 0, & \text {otherwise}, \end{array}\right. } \end{aligned}$$
(4)
$$\begin{aligned} \min _{{\mathbf {Z}}} \frac{1}{2}\sum _{ij}\bigg |\bigg |\frac{{\mathbf {z}}_i}{\sqrt{d_i}} - \frac{{\mathbf {z}}_j}{\sqrt{d_j}}\bigg |\bigg |^2{\mathbf {W}}_{ij} = {\mathrm {tr}}({\mathbf {Z}}{\mathbf {L}}{\mathbf {Z}}^\top ). \end{aligned}$$
(5)

\({\mathbf {D}}= \textit{diag}(d_1, \ldots , d_n)\) is the \((n\times n)\) diagonal matrix where \(d_i = \sum _j{\mathbf {W}}_{ij}\), and \({\mathbf {L}}\) is the normalized graph Laplacian matrix, which is symmetric positive semidefinite and is given by \({\mathbf {L}}= {\mathbf {I}}-{\mathbf {D}}^{-1/2}{\mathbf {W}}{\mathbf {D}}^{-1/2}\), where \({\mathbf {I}}\) is the identity matrix. When \({\mathbf {W}}_{ij} = 1\), the projected data points \({\mathbf {z}}_i\) and \({\mathbf {z}}_j\) are drawn close together (as they belong to the same category). The normalized distance between the vectors, \(||{\mathbf {z}}_i/\sqrt{d_i} - {\mathbf {z}}_j/\sqrt{d_j}||^2\), is a more robust measure of data point clustering than the un-normalized distance \(||{\mathbf {z}}_i - {\mathbf {z}}_j||^2\) [4].
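
A minimal sketch of the graph construction follows, assuming (as in (4)) that \({\mathbf {W}}_{ij} = 1\) whenever points i and j share a (source or predicted target) label:

```python
import numpy as np

def normalized_laplacian(labels):
    """Adjacency W from (pseudo-)labels and L = I - D^{-1/2} W D^{-1/2}.

    labels : (n,) class labels for all n points (source labels followed by
             predicted target labels); W_ij = 1 when the labels match.
    """
    labels = np.asarray(labels)
    W = (labels[:, None] == labels[None, :]).astype(float)  # (n, n) adjacency
    d = W.sum(axis=1)                        # node degrees d_i = sum_j W_ij
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # D^{-1/2}
    L = np.eye(len(labels)) - D_inv_sqrt @ W @ D_inv_sqrt
    return W, np.diag(d), L
```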

3.2 Optimization Problem

The optimization problem for NET is obtained from (1) and (5) by substituting, \({\mathbf {Z}}= {\mathbf {A}}^\top {\mathbf {K}}\). Along with regularization and a constraint, we get,

$$\begin{aligned} \min _{{\mathbf {A}}^\top {\mathbf {K}}{\mathbf {D}}{\mathbf {K}}^\top {\mathbf {A}}= {\mathbf {I}}} \alpha .{\mathrm {tr}}({\mathbf {A}}^\top {\mathbf {K}}\sum _{c=0}^C{\mathbf {M}}_c{\mathbf {K}}^\top {\mathbf {A}}) + \beta .{\mathrm {tr}}({\mathbf {A}}^\top {\mathbf {K}}{\mathbf {L}}{\mathbf {K}}^\top {\mathbf {A}}) + \gamma ||{\mathbf {A}}||_F^2. \end{aligned}$$
(6)

\({\mathbf {A}}\in {\mathbb {R}}^{n\times k}\) is the projection matrix. The regularization term \(||{\mathbf {A}}||_F^2\) (Frobenius norm) controls the smoothness of the projection, and the magnitudes of \((\alpha , \beta , \gamma )\) control the importance of the individual terms in (6). The constraint prevents the data points from collapsing onto a subspace of dimensionality less than k [1]. Equation (6) can be solved by constructing the Lagrangian \(L({\mathbf {A}}, \mathbf {\Lambda })\), where \(\mathbf {\Lambda } = \textit{diag}(\lambda _1, \ldots , \lambda _k)\) is the diagonal matrix of Lagrange multipliers (see [8]). Setting the derivative \(\frac{\partial L}{\partial {\mathbf {A}}} = 0\) yields the generalized eigen-value problem,

$$\begin{aligned} \big (\alpha {\mathbf {K}}\sum _{c=0}^C{\mathbf {M}}_c {\mathbf {K}}^\top + \beta {\mathbf {K}}{\mathbf {L}}{\mathbf {K}}^\top + \gamma {\mathbf {I}}\big ){\mathbf {A}}= {\mathbf {K}}{\mathbf {D}}{\mathbf {K}}^\top {\mathbf {A}}\mathbf {\Lambda }. \end{aligned}$$
(7)

\({\mathbf {A}}\) is the matrix whose columns are the eigen-vectors of (7) corresponding to the k smallest eigen-values, and \(\mathbf {\Lambda }\) is the diagonal matrix of those eigen-values. The projected data points are given by \({\mathbf {Z}}= {\mathbf {A}}^\top {\mathbf {K}}\).
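
Putting the pieces together, the generalized eigen-value problem (7) can be solved with a standard routine such as scipy.linalg.eigh. The sketch below is our own; the small jitter added to the right-hand side (to keep it numerically positive definite) is an assumption rather than part of the published algorithm.

```python
import numpy as np
from scipy.linalg import eigh

def solve_net_projection(K, M_sum, L, D, alpha, beta, gamma, k):
    """Solve the generalized eigen-value problem of Eq. (7).

    K     : (n, n) kernel matrix
    M_sum : sum of the MMD matrices M_0 + ... + M_C
    L, D  : normalized graph Laplacian and degree matrix
    Returns A (n, k), the eigen-vectors of the k smallest eigen-values,
    and the projected data Z = A^T K.
    """
    n = K.shape[0]
    lhs = alpha * K @ M_sum @ K.T + beta * K @ L @ K.T + gamma * np.eye(n)
    rhs = K @ D @ K.T
    rhs = rhs + 1e-6 * np.eye(n)       # jitter keeps rhs positive definite
    eigvals, eigvecs = eigh(lhs, rhs)  # eigen-values in ascending order
    A = eigvecs[:, :k]                 # k smallest
    Z = A.T @ K                        # (k, n) projected data
    return A, Z
```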

3.3 Model Selection

Current DA methods use the target data to validate the optimum parameters for their models [8, 9]. We introduce a new technique to estimate \((\alpha , \beta , \gamma , k)\) using a subset of the source data as a validation set. The subset is selected by weighting the source data points using Kernel Mean Matching (KMM). KMM computes source instance weights \(w_i\) by minimizing \(\big |\big |\frac{1}{n_s}\sum _{i=1}^{n_s}w_i\phi ({\mathbf {x}}_i^s) - \frac{1}{n_t}\sum _{j=1}^{n_t}\phi ({\mathbf {x}}_j^t)\big |\big |_\mathcal {H}^2\). Defining \(\kappa _i := \frac{n_s}{n_t}\sum _{j=1}^{n_t}k({\mathbf {x}}_i^s, {\mathbf {x}}_j^t)\), \(i = 1,\ldots ,n_s\), and \({\mathbf {K}}_{S_{ij}} = k({\mathbf {x}}_i^s, {\mathbf {x}}_j^s)\), the minimization can be written as the quadratic program:

$$\begin{aligned} \min _{{\mathbf {w}}} \frac{1}{2} {\mathbf {w}}^\top {\mathbf {K}}_S{\mathbf {w}}- \mathbf {\kappa }^\top {\mathbf {w}}, ~~\text {s.t.}~ w_i\in [0, B],~\bigg |\sum _{i=1}^{n_s}w_i - n_s \bigg |\le n_s\epsilon . \end{aligned}$$
(8)

The first constraint bounds the discrepancy between the source and target distributions; as \(B\rightarrow 1\), the solution approaches the unweighted case (\(w_i = 1\) for all i). The second constraint ensures that the measure \(w(x)P_S(x)\) is a probability distribution [7]. In our experiments, the validation set consists of the 30% of the source data points with the largest weights. This validation set is used to estimate the best values for \((\alpha , \beta , \gamma , k)\).
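
A hypothetical sketch of the KMM step in (8) using cvxpy is given below; the paper does not prescribe a solver, and the RBF kernel, the bound B, and the default \(\epsilon \) are our assumptions.

```python
import numpy as np
import cvxpy as cp
from sklearn.metrics.pairwise import rbf_kernel

def kmm_weights(Xs, Xt, B=10.0, eps=None, gamma=1.0):
    """Kernel Mean Matching weights for the source points (Eq. (8)).

    Xs : (n_s, d) source data (rows are samples), Xt : (n_t, d) target data.
    Returns w : (n_s,) instance weights; the 30% of source points with the
    largest weights form the validation set.
    """
    n_s, n_t = len(Xs), len(Xt)
    # common heuristic for eps; the paper does not specify a value
    eps = eps if eps is not None else (np.sqrt(n_s) - 1) / np.sqrt(n_s)
    K_S = rbf_kernel(Xs, Xs, gamma=gamma)
    K_S = K_S + 1e-6 * np.eye(n_s)          # keep the quadratic form PSD
    kappa = (n_s / n_t) * rbf_kernel(Xs, Xt, gamma=gamma).sum(axis=1)
    w = cp.Variable(n_s)
    objective = cp.Minimize(0.5 * cp.quad_form(w, K_S) - kappa @ w)
    constraints = [w >= 0, w <= B,
                   cp.abs(cp.sum(w) - n_s) <= n_s * eps]
    cp.Problem(objective, constraints).solve()
    return w.value
```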

4 Experiments

We compare the NET algorithm with the following baseline and state-of-the-art methods. NA (No Adaptation - classifier trained on the source and tested on the target), SA (Subspace Alignment [5]), CA (Correlation Alignment (CORAL) [13]), GFK (Geodesic Flow Kernel [6]), TCA (Transfer Component Analysis [11]), JDA (Joint Distribution Adaptation [9]). \(\text {NET}_v\) is a special case of the NET algorithm where parameters \((\alpha , \beta , \gamma , k)\), have been estimated using (8) (see Sect. 3.3). For \(\text {NET}^*\), the optimum values for \((\alpha , \beta , \gamma , k)\) are estimated using the target data for cross validation.

4.1 Datasets

Office-Caltech Datasets: This object recognition dataset [6] consists of images of everyday objects organized into 4 domains: Amazon, Caltech, Dslr, and Webcam. It has 10 categories of objects and a total of 2533 images. We experiment with two kinds of features: (i) SURF features obtained from [6], and (ii) deep features. To extract deep features, we use an 'off-the-shelf' deep convolutional neural network (the VGG-F model [3]). We use the 4096-dimensional features from the fc8 layer and apply PCA to reduce the feature dimension to 500.

MNIST-USPS Datasets: We use a subset of the popular handwritten digit (0–9) recognition datasets (2000 images from MNIST and 1800 images from USPS based on [8]). The images are resized to \(16 \times 16\) pixels and represented as 256-dimensional vectors.

CKPlus-MMI Datasets: The CKPlus [10] and MMI [12] datasets consist of facial expression videos. From these videos, we select the frames with the most intense expression to create the domains CKPlus and MMI, with around 1500 images each and 6 categories: anger, disgust, fear, happy, sad, and surprise. We use a pre-trained deep neural network to extract features (see Office-Caltech).

4.2 Results and Discussion

For k, we explore optimum values in the set {10, 20, \(\ldots \), 100, 200}. For \((\alpha , \beta , \gamma )\), we select from {0, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10}. For the sake of brevity, we evaluate and present one set of parameters \((\alpha , \beta , \gamma , k)\) for all the DA experiments in a dataset. For all the experiments, we run 10 iterations of label refinement when estimating \({\mathbf {M}}_c\), converging to the predicted test/validation labels. Figure 1 depicts the variation in validation set accuracy for each of the parameters. We select the parameter value with the highest validation set accuracy as the optimal value in the \(\text {NET}_v\) experiments.
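
The selection procedure itself amounts to scoring each candidate value on the validation set. The sketch below varies k with the other parameters held fixed, as in Fig. 1; validation_accuracy is a hypothetical helper standing in for a full NET training run scored on the KMM-selected validation set.

```python
# candidate values for the subspace dimension k
K_GRID = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200]

def select_k(validation_accuracy, alpha, beta, gamma, k_grid=K_GRID):
    """Return the k with the highest validation-set accuracy.

    `validation_accuracy` is a hypothetical callable that trains NET with
    the given (alpha, beta, gamma, k) and scores it on the KMM-selected
    source validation set; (alpha, beta, gamma) are held fixed while k varies.
    """
    return max(k_grid, key=lambda k: validation_accuracy(alpha, beta, gamma, k))
```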

Fig. 1. Each panel depicts the accuracies over the validation set for a range of values of one parameter. When studying a parameter (say k), the remaining parameters \((\alpha , \beta , \gamma )\) are fixed at their optimum values.

Table 1. Recognition accuracies (%) for DA experiments on the digit and face datasets. {MNIST(M), USPS(U), CKPlus(CK), MMI(MM)}. M\(\rightarrow \)U implies M is source domain and U is target domain. The best and second best results are in bold and italic.
Table 2. Recognition accuracies (%) for DA experiments on the Office-Caltech dataset with SURF and Deep features. {Amazon(A), Webcam(W), Dslr(D), Caltech(C)}. A\(\rightarrow \)W implies A is source and W is target. The best and second best results are in bold and italic.

For fair comparison with existing methods, we follow the same experimental protocol as in [6, 8]. We train a nearest neighbor (NN) classifier on the projected source data and test it on the projected target data. Table 1 captures the results for the digit and face datasets, and Table 2 outlines the results for the Office-Caltech dataset. The accuracies reflect the percentage of correctly classified target data points. The accuracies obtained with \(\text {NET}_v\) demonstrate that the validation set generated from the source data is a good option for validating model parameters in unsupervised DA. The parameters for the \(\text {NET}^*\) experiment are estimated using the target dataset: \((\alpha =1, \beta =1,\gamma =1, k=20)\) for the object recognition datasets, \((\alpha =1, \beta =0.01,\gamma =1, k=20)\) for the digit dataset, and \((\alpha =0.01, \beta =0.01,\gamma =1, k=20)\) for the face dataset. The accuracies obtained with the NET algorithm are consistently better than those of existing methods, demonstrating the value of nonlinear embedding combined with domain alignment.
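
A minimal sketch of this evaluation protocol (our illustration, using scikit-learn's nearest neighbor classifier) is:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def evaluate_target_accuracy(Z, ys, yt_true, n_s):
    """1-NN classifier trained on projected source data, scored on the target.

    Z       : (k, n) projected data, source columns first, target columns last
    ys      : (n_s,) source labels
    yt_true : (n_t,) ground-truth target labels (used only for scoring)
    """
    Zs, Zt = Z[:, :n_s].T, Z[:, n_s:].T        # samples as rows
    clf = KNeighborsClassifier(n_neighbors=1).fit(Zs, ys)
    yt_pred = clf.predict(Zt)
    return np.mean(yt_pred == np.asarray(yt_true))
```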

5 Conclusions and Acknowledgments

We have proposed the NET algorithm for unsupervised DA, along with a procedure for generating a validation set for model selection using only the source data. Both the validation procedure and NET achieve better recognition accuracies than competitive visual DA methods across multiple vision datasets. This material is based upon work supported by the National Science Foundation (NSF) under Grant No. 1116360. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.