1 Introduction

When we apply machine learning [19] to real-world problems, e.g., in image recognition [18] or speech recognition [13], a significant challenge is the need for large amounts of (labeled) training data, which may not always be available. Consequently, there has been longstanding interest in developing machine learning techniques that can transfer knowledge across domains, thereby alleviating to some extent the need for training data as well as the time required to train the machine learning system. A detailed survey of transfer learning is given in [22].

Traditional machine learning assumes that the distribution of test examples follows that of the training examples [8], whereas in transfer learning, this assumption is usually violated. Domain adaptation (DA) is a well-studied formulation of transfer learning that is based on developing methods that deal with the change of distribution in test instances as compared with training instances [5, 11]. In this paper, we propose a new framework for domain adaptation, based on formulating transfer from source to target as a problem of geometric mean metric learning on manifolds. Our proposed approach enables integrating multiple sources of variation between source and target in a unified framework with a theoretically optimal solution. We also present detailed experiments using standard transfer learning testbeds from computer vision, showing how our proposed algorithms give improved results compared to existing methods.

A general way to model domain adaptation is using the framework of optimal transport (OT) [9, 23]. The OT problem, originally studied by Monge in 1781, is defined as the effort needed to move a given quantity of dirt (formally modeled by some source distribution) to a new location with no loss in volume (modeled formally by a target distribution), minimizing some cost of transportation per unit volume of dirt. A revised formulation by Kantorovich in 1942 allowed the transport plan to be (in the discrete case) a nonnegative coupling matrix whose rows sum to the source distribution and whose columns sum to the target distribution. When the cost of transportation is given by a distance metric, the resulting optimal transport cost defines the Wasserstein distance. The solution we propose in this paper, based on the geometric mean of the source and target covariances, can be derived from optimal transport theory under the assumption that the source and target distributions are multivariate Gaussian distributions. Our approach goes beyond the simple solution proposed by optimal transport theory in that we take into account not only the cost of transportation, but also other factors, such as the geometry of the source and target domains. In addition, we use the weighted geometric mean, giving us additional flexibility in tuning the solution.

Background: One standard approach to domain adaptation is based on modeling covariate shift [1]. Unlike traditional machine learning, in DA the training and test examples are assumed to have different distributions. It is usual in DA to categorize the problem into different types: (i) semi-supervised domain adaptation, (ii) unsupervised domain adaptation, (iii) multi-source domain adaptation, and (iv) heterogeneous domain adaptation.

Another popular approach to domain adaptation is based on aligning the distributions between source and target domains. A common strategy is based on the maximum mean discrepancy (MMD) metric [7], which is a nonparametric technique for measuring the dissimilarity of empirical distributions between source and target domains. Domain-invariant projection is one method that seeks to minimize the MMD measure using optimization on the Grassmannian manifold of fixed-dimensional subspaces of n-dimensional Euclidean space [2].

Linear approaches to domain adaptation involve the alignment of lower-dimensional subspaces or covariances from a source data domain \(D_s = \{ x^s_i \}\) with labels \(L_s = \{ y_i \}\) to a target data domain \(D_t = \{ x^t_i \}\). We assume both \(x^s_i\) and \(x^t_i\) are n-dimensional Euclidean vectors, representing the values of n features of each training example. One popular approach to domain adaptation relies on first projecting the data from the source and target domains onto a low-dimensional subspace, and then finding correspondences between the source and target subspaces. Of these approaches, the most widely used one is Canonical Correlation Analysis (CCA) [17], a standard statistical technique used in many applications of machine learning and bioinformatics. Several nonlinear versions [15] and deep learning variants [27] of CCA have been proposed. These methods often require explicit correspondences between the source and target domains to learn a common subspace. Because CCA finds only a linear subspace, a family of manifold alignment methods has been developed that extends CCA [10, 25] to exploit the nonlinear structure present in many datasets.

In contrast to using a single shared subspace across source and target domains, subspace alignment finds a linear mapping that transforms the source data subspace into the target data subspace [14]. To explain the basic algorithm, let \(P_S, P_T \in \mathbb {R}^{n \times d}\) denote the two sets of basis vectors that span the subspaces for the “source” and “target” domains. Subspace alignment attempts to find a linear mapping M that minimizes

$$\begin{aligned} F(M) = \Vert P_S M - P_T \Vert _F^2 . \end{aligned}$$

It can be shown that the solution to the above optimization problem is simply the matrix product of \(P_S^T\) and \(P_T\), i.e.:

$$\begin{aligned} M^* = \text {argmin}_M F(M) = P_S^T P_T. \end{aligned}$$
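To make the procedure concrete, the following is a minimal numpy sketch of subspace alignment under common assumptions (row-wise data matrices, PCA bases obtained by SVD); the function names are ours and not from [14].

```python
import numpy as np

def pca_basis(X, d):
    """Top-d principal directions (columns) of the centered data matrix X (samples x features)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:d].T                  # shape (n_features, d), orthonormal columns

def subspace_alignment(Xs, Xt, d):
    """Project source and target data onto their PCA subspaces and align the source basis."""
    Ps, Pt = pca_basis(Xs, d), pca_basis(Xt, d)
    M = Ps.T @ Pt                    # closed-form minimizer of ||Ps M - Pt||_F
    Zs = Xs @ (Ps @ M)               # source features in the aligned subspace
    Zt = Xt @ Pt                     # target features in the target subspace
    return Zs, Zt
```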

Another approach exploits the property that the set of k-dimensional subspaces of n-dimensional Euclidean space forms a curved manifold called the Grassmannian [12], a type of matrix manifold. The domain adaptation method called geodesic flow kernels (GFK) [16] constructs a distance function between source and target subspaces based on the geodesic, or shortest path, between these two elements on the Grassmannian.

Rather than aligning subspaces, a popular technique called CORAL [24] aligns correlations between source and target domains. Let \(\mu _s, \mu _t\) and \(A_s, A_t\) represent the mean and covariance of the source and target domains, respectively. CORAL finds a linear transformation A that minimizes the distance between the second-order statistics of the source and target features (which can be assumed as normalized with zero means). Using the Frobenius (Euclidean) norm as the matrix distance metric, CORAL is based on solving the following optimization problem:

$$\begin{aligned} \min _{A} \ \Vert A^T A_s A - A_t \Vert _F^2 , \end{aligned}$$
(1)

where \(A_s, A_t\) are of size \(n\times n\). Using the singular value decomposition of \(A_s\) and \(A_t\), CORAL [24] computes a particular closed-form solution to find the desired linear transformation A.
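For intuition, the following is a minimal sketch (not the code of [24]) of a CORAL-style whitening and re-coloring transformation of the source features; the regularization term eps and the helper names are our own additions.

```python
import numpy as np

def spd_power(M, p):
    """Fractional power of a symmetric positive definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * w ** p) @ V.T

def coral_align(Xs, Xt, eps=1e-5):
    """Whiten the source features, then re-color them with the target covariance."""
    d = Xs.shape[1]
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(d)   # regularized source covariance
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(d)   # regularized target covariance
    return (Xs - Xs.mean(axis=0)) @ spd_power(Cs, -0.5) @ spd_power(Ct, 0.5)
```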

Novelty of Our Approach: Our proposed solution differs from the above previous approaches in several fundamental ways. First, we explicitly model the space of covariance matrices as a curved Riemannian manifold of symmetric positive definite (SPD) matrices. Note that the difference of two SPD matrices is not necessarily an SPD matrix, and hence SPD matrices do not form a vector space. Second, our approach can be shown to be both unique and globally optimal, unlike some of the above approaches. Uniqueness and optimality derive from the fact that we reduce all domain adaptation computations to nested equations involving the well-known Riccati equation [6].

The organization of the paper is as follows. In Sect. 2, we show the connection between the domain adaptation problem and the metric learning problem. In particular, we discuss the Riccati point of view on the domain adaptation problem. Section 3 briefly discusses the Riemannian geometry of the space of SPD matrices. Sections 4 and 5 discuss additional domain adaptation formulations. Our proposed algorithms are presented in Sects. 6 and 7. Finally, in Sect. 8 we show the experimental results on the standard Office and the extended Office-Caltech10 datasets, where our algorithms show clear improvements over CORAL.

2 Domain Adaptation Using Metric Learning

In this section, we will describe the central idea of this paper: modeling the problem of domain adaptation as a geometric mean metric learning problem. Before explaining the specific approach, it will be useful to introduce some background. The metric learning problem [4] involves taking input data in \(\mathbb {R}^n\) and constructing a (non)linear mapping \(\varPhi : \mathbb {R}^n \rightarrow \mathbb {R}^m\), so that the distance between two points x and y in \(\mathbb {R}^n\) can be measured using the distance \(\Vert \varPhi (x) - \varPhi (y) \Vert \). A simple approach is to learn a squared Mahalanobis distance: \(\delta _A^2(x, y) = (x - y)^T A (x - y)\), where \(x, y \in \mathbb {R}^n\) and A is an \(n \times n\) symmetric positive definite (SPD) matrix. If we represent \(A = W^T W\), for some linear transformation matrix W, then it is easy to see that \(\delta _A^2(x, y) = \Vert W x - W y\Vert _F^2\), thereby showing that the Mahalanobis distance is tantamount to projecting the data into a potentially lower-dimensional space, and measuring distances using Euclidean (Frobenius) norm. Typically, the matrix A is learned using some weak supervision, given two sets of training examples of the form:

$$\begin{aligned} \mathcal{S} = \{ (x_i, x_j) | x_i \text { and } x_j \text { are in the same class} \}, \end{aligned}$$
$$\begin{aligned} \mathcal{D} = \{ (x_i, x_j) | x_i \text { and } x_j \text { are in different classes} \}. \end{aligned}$$
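As a quick numerical check of the equivalence noted above between the Mahalanobis distance with \(A = W^T W\) and the Euclidean distance after projection by W (the dimensions below are arbitrary and chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 4
W = rng.standard_normal((m, n))              # linear map from R^n to R^m
A = W.T @ W                                  # positive semidefinite; strictly SPD when W has rank n
x, y = rng.standard_normal(n), rng.standard_normal(n)

d_metric    = (x - y) @ A @ (x - y)          # delta_A^2(x, y)
d_projected = np.sum((W @ x - W @ y) ** 2)   # ||Wx - Wy||^2
assert np.isclose(d_metric, d_projected)
```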

A large variety of metric learning methods can be designed by formulating different optimization objectives over the S and D sets to extract information about the distance matrix A.

For our purposes, the method that provides the closest inspiration to our goal of designing a domain adaptation method based on metric learning is the recently proposed geometric mean metric learning (GMML) algorithm [28]. GMML exploits the geometry of SPD matrices: it models the distance between points in the S set by the Mahalanobis distance \(\delta _A(x_i, x_j), x_i, x_j \in S\), and, crucially, models the distance between points in the disagreement set D by the inverse metric \(\delta _{A^{-1}}(x_i, x_j), x_i, x_j \in D\). GMML is based on solving the following objective function over all SPD matrices A:

$$\begin{aligned} \min _{A \succ 0} \sum _{(x_i, x_j) \in S} \delta ^2_A(x_i, x_j) + \sum _{(x_i, x_j) \in D} \delta ^2_{A^{-1}}(x_i, x_j), \end{aligned}$$

where \(A \succ 0\) denotes that A is a symmetric positive definite matrix.

Several researchers have previously explored the connection between domain adaptation and metric learning. One recent approach constructs a transformation matrix A that both minimizes the difference between the source and target distributions based on the previously noted MMD metric and captures the manifold geometry of the source and target domains, while attempting to preserve the discriminative power of the label information in the source domain [26]. Our approach builds on these ideas, with some significant differences. First, we use an objective function that is based on finding a solution that lies on the geodesic between the (estimated) source and target covariance matrices (which are modeled as symmetric positive definite matrices). Second, we use a cascaded series of geometric mean computations to balance multiple factors. We describe these ideas in more detail in this and the following sections.

We now describe how the problem of domain adaptation can be considered as a type of metric learning problem, called geometric mean metric learning (GMML) [28]. Recall that in domain adaptation, we are given a source dataset \(D_s\) (usually with a set of training labels) and a target dataset \(D_t\) (unlabeled). The aim of domain adaptation, as reviewed above, is to construct an intermediate representation that combines some of the features of both the source and target domains, with the rationale being that the distribution of target features differs from that of the source. Relying purely on either the source or the target features is therefore suboptimal, and the challenge is to determine what intermediate representation will provide optimal transfer between the domains.

To connect metric learning to domain adaptation, note that we can define the two sets S and D in the metric learning problem as associated with the source and target domains respectively, whereby

$$\begin{aligned} S &\subseteq D_s \times D_s = \{ (x_i, x_j) \mid x_i \in D_s, x_j \in D_s \}, \\ D &\subseteq D_t \times D_t = \{ (x_i, x_j) \mid x_i \in D_t, x_j \in D_t \}. \end{aligned}$$

Our approach seeks to exploit the nonlinear geometry of covariance matrices to find a Mahalanobis distance matrix A, such that we represent distances in the source domain using A while, crucially, measuring distances in the target domain using the inverse \(A^{-1}\):

$$\begin{aligned} \min _{A \succ 0} \sum _{(x_i, x_j) \in S} (x_i - x_j)^T A (x_i - x_j) + \sum _{(x_i, x_j) \in D} (x_i - x_j)^T A^{-1} (x_i - x_j). \end{aligned}$$

To provide some intuition here, we observe that as we vary A to reduce the distance \(\sum _{(x_i, x_j) \in S} (x_i - x_j)^T A (x_i - x_j) \) in the source domain, we simultaneously increase the distance (as measured by A) in the target domain by minimizing the inverse-metric term \(\sum _{(x_i, x_j) \in D} (x_i - x_j)^T A^{-1} (x_i - x_j)\), and vice versa. Consequently, by appropriately choosing A, we seek to minimize the sum of the two terms. We can now use the matrix trace to reformulate the Mahalanobis distances:

$$\begin{aligned} \min _{A \succ 0} \sum _{(x_i, x_j) \in S} \mathrm{tr}(A (x_i - x_j) (x_i - x_j)^T ) + \sum _{(x_i, x_j) \in D} \mathrm{tr}(A^{-1} (x_i - x_j) (x_i - x_j)^T ). \end{aligned}$$

Denoting the source and target covariance matrices \(A_s\) and \(A_t\) as:

$$\begin{aligned} A_s := \sum _{(x_i, x_j) \in S} (x_i - x_j) (x_i - x_j)^T , \end{aligned}$$
(2)
$$\begin{aligned} A_t := \sum _{(x_i, x_j) \in D} (x_i - x_j) (x_i - x_j)^T , \end{aligned}$$
(3)

we can finally formulate the domain adaptation problem as finding the SPD matrix A that minimizes the following objective function:

$$\begin{aligned} \min _{A \succ 0} \ \omega (A) := \mathrm{tr}(A A_s) + \mathrm{tr}(A^{-1} A_t) . \end{aligned}$$
(4)
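As a small illustrative sketch (the helper names and the choice of using all within-domain pairs are our own assumptions), the matrices \(A_s, A_t\) of Eqs. (2)–(3) and the objective of Eq. (4) can be computed as follows:

```python
import numpy as np
from itertools import combinations

def pairwise_scatter(X, pairs):
    """Sum of (x_i - x_j)(x_i - x_j)^T over the given index pairs (Eqs. 2-3)."""
    S = np.zeros((X.shape[1], X.shape[1]))
    for i, j in pairs:
        d = X[i] - X[j]
        S += np.outer(d, d)
    return S

def omega(A, A_s, A_t):
    """Objective of Eq. (4): tr(A A_s) + tr(A^{-1} A_t)."""
    return np.trace(A @ A_s) + np.trace(np.linalg.inv(A) @ A_t)

rng = np.random.default_rng(1)
Xs = rng.standard_normal((40, 5))            # source samples (rows) in R^5
Xt = rng.standard_normal((30, 5)) + 1.0      # target samples, shifted for illustration
A_s = pairwise_scatter(Xs, combinations(range(len(Xs)), 2))
A_t = pairwise_scatter(Xt, combinations(range(len(Xt)), 2))
print(omega(np.eye(5), A_s, A_t))            # objective value at A = I
```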

3 Riemannian Geometry of SPD Matrices

In this section, we briefly review the Riemannian geometry of the space of SPD matrices, which underlies our overall approach.

Fig. 1. The space of symmetric positive definite matrices forms a Riemannian manifold, as illustrated here. The methods we propose are based on computing geodesics (shortest paths), shown as the dotted line, between source domain information and target domain information.

As Fig. 1 shows, our proposed approach to domain adaptation builds on the nonlinear geometry of the space of SPD (or covariance) matrices; we review some of this material first [6]. Consider a simple example of a \(2 \times 2\) SPD matrix M:

$$\begin{aligned} M = \left[ \begin{array}{cc} a & b \\ b & c \end{array} \right] , \end{aligned}$$

where \(a > 0\), and the SPD requirement implies the positivity of the determinant, \(ac - b^2 > 0\). Thus, the set of all SPD matrices of size \(2 \times 2\) forms the interior of a cone in \(\mathbb {R}^3\). More generally, the space of all \(n \times n\) SPD matrices forms a manifold of non-positive curvature in \(\mathbb {R}^{n^2}\) [6]. In the CORAL objective function in Eq. (1), the goal is to find a transformation A that makes the source covariance resemble the target covariance as closely as possible. Our approach simplifies Eq. (1) by restricting the transformation matrix A to be an SPD matrix, i.e., \(A \succ 0\); furthermore, we solve the resulting nonlinear equation exactly on the manifold of SPD matrices. More formally, we solve the Riccati equation [6]:

$$\begin{aligned} A A_s A = A_t, \text { for } A \succ 0, \end{aligned}$$
(5)

where \(A_s\) and \(A_t\) are the source and target covariance (SPD) matrices, respectively. Note that in comparison with the CORAL approach in Eq. (1), the matrix A here is symmetric (and positive definite), so A and \(A^T\) are the same. The solution to the above Riccati equation is the well-known geometric mean, or sharp mean, of the two SPD matrices \(A_s^{-1}\) and \(A_t\):

$$\begin{aligned} A = A_s^{-1} \sharp _{\frac{1}{2}} A_t = A_s^{-\frac{1}{2}} (A_s^\frac{1}{2} A_t A_s^{\frac{1}{2}})^\frac{1}{2} A_s^{-\frac{1}{2}}, \end{aligned}$$

where \(\sharp _{\frac{1}{2}}\) denotes the geometric mean of SPD matrices [28]. The sharp mean has an intuitive geometric interpretation: it is the midpoint of the geodesic connecting the matrices \(A_s^{-1}\) (source) and \(A_t\) (target), where length is measured on the Riemannian manifold of SPD matrices. As mentioned earlier, the geometric mean is also the solution to the domain adaptation problem from optimal transport theory when the source and target distributions are given by multivariate Gaussians [23].

In a manifold, the shortest path between two elements, if it exists, can be represented by a geodesic. For the SPD manifold, it can be shown that the geodesic \(\gamma (t)\) for a scalar \(0 \le t \le 1\) between \(A_s^{-1}\) and \(A_t\), is given by [6]:

$$\begin{aligned} \gamma (t) = A_s^{-1/2} (A_s^{1/2} A_t A_s^{1/2})^t A_s^{-1/2}. \end{aligned}$$

It is common to denote \(\gamma (t)\) as the so-called “weighted” sharp mean \(A_s^{-1} \sharp _t A_t\). It is easy to see that for \(t=0\), \(\gamma (0) = A_s^{-1}\), and for \(t=1\), we have \(\gamma (1) = A_t\). For the distinguished value \(t = 1/2\), \(\gamma (1/2)\) is the geometric mean of \(A_s^{-1}\) and \(A_t\), and satisfies all the properties of a geometric mean [6]:

$$\begin{aligned} \gamma (1/2) = A_s^{-1} \sharp _{1/2} A_t = A_s^{-1/2} (A_s^{1/2} A_t A_s^{1/2})^{1/2} A_s^{-1/2}. \end{aligned}$$
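For concreteness, here is a minimal numpy sketch (not the authors' implementation) of the weighted sharp mean, together with a numerical check that the midpoint \(t = 1/2\) satisfies the Riccati equation \(A A_s A = A_t\) of Eq. (5):

```python
import numpy as np

def spd_power(M, p):
    """Fractional power of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * w ** p) @ V.T

def sharp_mean(X, Y, t=0.5):
    """Weighted geometric (sharp) mean X #_t Y, i.e. the point gamma(t) on the geodesic from X to Y."""
    Xh, Xmh = spd_power(X, 0.5), spd_power(X, -0.5)
    return Xh @ spd_power(Xmh @ Y @ Xmh, t) @ Xh

rng = np.random.default_rng(0)
B, C = rng.standard_normal((5, 5)), rng.standard_normal((5, 5))
A_s, A_t = B @ B.T + np.eye(5), C @ C.T + np.eye(5)        # random SPD matrices

A = sharp_mean(np.linalg.inv(A_s), A_t, t=0.5)             # A_s^{-1} #_{1/2} A_t
assert np.allclose(A @ A_s @ A, A_t)                       # Riccati equation (5)
assert np.allclose(sharp_mean(np.linalg.inv(A_s), A_t, 0.0), np.linalg.inv(A_s))  # gamma(0)
assert np.allclose(sharp_mean(np.linalg.inv(A_s), A_t, 1.0), A_t)                 # gamma(1)
```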

The following theorem summarizes some of the properties of the objective function given by Eq. (4).

Theorem 1

[28, Theorem 3] The cost function \(\omega (A)\) in Eq. (4) is both strictly convex as well as strictly geodesically convex on the SPD manifold.

Theorem 1 ensures the uniqueness of the solution to Eq. (4).

4 Statistical Alignment Across Domains

A key strength of our approach is that it can exploit both geometric and statistical information, integrating multiple sources of alignment by solving nested sets of Riccati equations. To illustrate this point, in this section we explicitly introduce a secondary criterion of aligning the source and target domains so that the underlying (marginal) distributions are similar. As our results show later, we obtain a significant improvement over CORAL on a standard computer vision benchmark (the Office/Caltech/Amazon problem). The reason our approach outperforms CORAL is twofold: we solve the Riccati equation uniquely, whereas the solution proposed in CORAL [24] is only a particular solution owing to non-uniqueness, and we can exploit multiple sources of information.

A common way to incorporate the statistical alignment constraint is based on minimizing the maximum mean discrepancy (MMD) metric [7], a nonparametric measure of the difference between two distributions:

$$\begin{aligned} \mathrm{MMD}^2(X_s, X_t) = \Big \Vert \frac{1}{n} \sum _{i=1}^{n} W x^s_i - \frac{1}{m} \sum _{j=1}^{m} W x^t_j \Big \Vert ^2 = \mathrm{tr}(A X L X^T), \end{aligned}$$
(6)

where \(A = W^T W\) as before, \(X = [x^s_1, \ldots , x^s_n, x^t_1, \ldots , x^t_m]\) is the combined data matrix whose columns are the n source and m target instances, and \(L \in \mathbb {R}^{(n+m) \times (n+m)}\) with \(L(i,j) = \frac{1}{n^2}\) if \(x_i, x_j \in X_{s}\), \(L(i,j) = \frac{1}{m^2}\) if \(x_i, x_j \in X_t\), and \(L(i,j) = -\frac{1}{mn}\) otherwise. It is straightforward to show that \(L \succeq 0\), i.e., L is a symmetric positive-semidefinite matrix [26]. We can now combine the MMD objective in Eq. (6) with the previous geometric mean objective in Eq. (4) to give rise to the following modified objective function:

$$\begin{aligned} \min _{A \succ 0} \ \xi (A) := \mathrm{tr}(A A_s) + \mathrm{tr}(A^{-1} A_t) + \mathrm{tr}(A X L X^T). \end{aligned}$$
(7)
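To illustrate the structure of the MMD term, here is a small numpy sketch (our own construction, with the source samples stored as the first n columns of X) verifying that \(X L X^T\) collapses to the outer product of the difference of domain means:

```python
import numpy as np

def mmd_coefficient_matrix(n, m):
    """The (n+m) x (n+m) matrix L defined below Eq. (6)."""
    L = np.empty((n + m, n + m))
    L[:n, :n] = 1.0 / n ** 2
    L[n:, n:] = 1.0 / m ** 2
    L[:n, n:] = L[n:, :n] = -1.0 / (n * m)
    return L

rng = np.random.default_rng(2)
n, m, d = 8, 5, 4
Xs = rng.standard_normal((d, n))             # source samples as columns
Xt = rng.standard_normal((d, m)) + 2.0       # target samples as columns, shifted
X = np.hstack([Xs, Xt])
L = mmd_coefficient_matrix(n, m)

gap = Xs.mean(axis=1) - Xt.mean(axis=1)      # difference of domain means
assert np.allclose(X @ L @ X.T, np.outer(gap, gap))
# hence tr(A X L X^T) = (mu_s - mu_t)^T A (mu_s - mu_t), the squared MMD under the metric A
```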

We can once again find a closed-form solution to the modified objective in Eq. (7) by taking gradients:

$$\begin{aligned} \nabla \xi (A) = A_s - A^{-1} A_t A^{-1} + X L X^T = 0, \end{aligned}$$

whose solution is now given by \(A = A_m^{-1} \sharp _{\frac{1}{2}} A_t\), where \(A_m = A_s + X L X^T\).
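A short numerical check of this closed form (redefining the sharp-mean helper from the previous section so the snippet is self-contained; the matrix standing in for \(X L X^T\) is synthetic):

```python
import numpy as np

def spd_power(M, p):
    w, V = np.linalg.eigh(M)
    return (V * w ** p) @ V.T

def sharp_mean(X, Y, t=0.5):
    Xh, Xmh = spd_power(X, 0.5), spd_power(X, -0.5)
    return Xh @ spd_power(Xmh @ Y @ Xmh, t) @ Xh

rng = np.random.default_rng(3)
B, C, D = (rng.standard_normal((5, 5)) for _ in range(3))
A_s, A_t = B @ B.T + np.eye(5), C @ C.T + np.eye(5)
XLXt = D @ D.T                                # stands in for X L X^T (positive semidefinite)
A_m = A_s + XLXt
A = sharp_mean(np.linalg.inv(A_m), A_t)       # A_m^{-1} #_{1/2} A_t
# Stationarity of Eq. (7): A_s + X L X^T - A^{-1} A_t A^{-1} = 0, i.e. A A_m A = A_t.
assert np.allclose(A @ A_m @ A, A_t)
```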

5 Geometrical Diffusion on Manifolds

So far we have shown how the solution to the domain adaptation problem can be shown to involve finding the geometric mean of two terms, one involving the source covariance information and the Maximum Mean Discrepancy (MMD) of source and target training instances, and the second involving the target covariance matrix. In this section, we impose additional geometrical constraints on the solution that involve modeling the nonlinear manifold geometry of the source and target domains.

The usual approach is to model the source and target domains as nonlinear manifolds and set up a diffusion on a discrete graph approximation of the continuous manifold [20], using a random walk on a nearest-neighbor graph connecting nearby points. Standard results have been established showing asymptotic convergence of the graph Laplacian to the underlying manifold Laplacian [3]. We can use this construction to find two graph kernels \(K_s\) and \(K_t\) that are based on the eigenvectors of the random walk on the source and target domain manifolds, respectively:

$$\begin{aligned} K_s &= \sum _{i=1}^m e^{-\frac{\sigma ^2_s}{2 \lambda ^s_i}} v^s_i (v^s_i)^T, \\ K_t &= \sum _{i=1}^n e^{-\frac{\sigma ^2_t}{2 \lambda ^t_i}} v^t_i (v^t_i)^T. \end{aligned}$$

Here, \(v^s_i\) and \(v^t_i\) refer to the eigenvectors of the random walk diffusion matrix on the source and target manifolds, respectively, and \(\lambda ^s_i\) and \(\lambda ^t_i\) refer to the corresponding eigenvalues.
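The following rough sketch shows one way such a kernel could be constructed from a nearest-neighbor graph; the neighborhood size, the symmetric normalization, and the spectral weighting below are illustrative assumptions on our part rather than the paper's exact recipe.

```python
import numpy as np

def diffusion_kernel(X, k=5, sigma=1.0, n_eig=10):
    """Graph diffusion kernel built from the spectrum of a normalized kNN graph of the rows of X."""
    N = X.shape[0]
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)     # pairwise squared distances
    W = np.exp(-D2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # keep only the k nearest neighbours of each point, then symmetrize the graph
    nbrs = np.argsort(D2, axis=1)[:, 1:k + 1]
    mask = np.zeros((N, N), dtype=bool)
    mask[np.repeat(np.arange(N), k), nbrs.ravel()] = True
    W = np.where(mask | mask.T, W, 0.0)
    # symmetrically normalized walk operator; its eigenvalues match those of the random-walk matrix
    deg = np.maximum(W.sum(axis=1), 1e-12)
    P = W / np.sqrt(np.outer(deg, deg))
    lam, V = np.linalg.eigh(P)
    lam, V = lam[::-1][:n_eig], V[:, ::-1][:, :n_eig]
    # weight each eigenvector by exp(-sigma^2 / (2 lambda)), following the expression above
    return sum(np.exp(-sigma ** 2 / (2.0 * l)) * np.outer(v, v)
               for l, v in zip(lam, V.T) if l > 1e-8)
```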

We can now introduce a new objective function that incorporates the source and target domain manifold geometry:

$$\begin{aligned} \min _{A \succ 0} \ \eta (A) := \mathrm{tr}(A X (K + \mu L) X^T) + \mathrm{tr}(A A_s) + \mathrm{tr}(A^{-1} A_t), \end{aligned}$$
(8)

where \(K \succ 0\) and \(K = \left( \begin{array}{cc} K^{-1}_s & 0 \\ 0 & K^{-1}_t \end{array} \right) \), and \(\mu \) is a weighting term that combines the geometric and statistical constraints over A.

Once again, we can exploit the SPD nature of the matrices involved: the closed-form solution to Eq. (8) is \(A = A_{gs}^{-1} \sharp _{\frac{1}{2}} A_t\), where \(A_{gs} = A_s + X (K + \mu L) X^T\).

6 Cascaded Weighted Geometric Mean

One additional refinement that we use is the notion of a weighted geometric mean. To explain this idea, we introduce the following Riemannian distance metric on the nonlinear manifold of SPD matrices:

$$\begin{aligned} \delta _R(X, Y) = \Vert \log (X^{-\frac{1}{2}} Y X^{-\frac{1}{2}}) \Vert _F, \end{aligned}$$

for two SPD matrices X and Y, where \(\log \) denotes the matrix logarithm. Using this metric, we now introduce weighted versions of the previous objective functions in Eqs. (4), (7), and (8).
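This distance can be computed directly from the eigenvalues of the congruence \(X^{-1/2} Y X^{-1/2}\); a minimal numpy sketch (not the authors' code):

```python
import numpy as np

def delta_R(X, Y):
    """Affine-invariant Riemannian distance ||log(X^{-1/2} Y X^{-1/2})||_F between SPD matrices."""
    w, V = np.linalg.eigh(X)
    Xmh = (V * w ** -0.5) @ V.T                     # X^{-1/2}
    lam = np.linalg.eigvalsh(Xmh @ Y @ Xmh)         # eigenvalues of the SPD congruence
    return np.sqrt(np.sum(np.log(lam) ** 2))        # symmetric in X, Y; zero iff X == Y
```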

For the first objective function in Eq. (4), we get:

$$\begin{aligned} \min _{A \succ 0}\ \omega _t(A) := (1-t) \ \delta _R^2(A, A_s^{-1}) + t \ \delta _R^2(A, A_t), \end{aligned}$$
(9)

where \(0 \le t \le 1\) is the weight parameter. The unique solution to (9) is given by the weighted geometric mean \(A = A_s^{-1} \sharp _t A_t\) [28]. Note that the weighted objective is no longer strictly convex (in the Euclidean sense), but it remains strictly geodesically convex [28], [6, Chap. 6].

Similarly, we introduce the weighted variant of the objective function given by Eq. (7):

$$\begin{aligned} \min _{A \succ 0} \ \xi _t(A) := (1 -t) \ \delta ^2_R(A, (A_s + XLX^T)^{-1}) + t \ \delta ^2_R(A, A_t), \end{aligned}$$
(10)

whose unique solution is given by \(A = A_{m}^{-1} \ \sharp _t \ A_t\), where \(A_m = A_s + X L X^T\) as before. A cascaded variant is obtained when we further exploit the SPD structure of \(A_s\) and \(XLX^T\), i.e., we set \(A_m = A_s \sharp _\gamma (X L X^T)\) (the weighted geometric mean of \(A_s\) and \(X L X^T\)) instead of \(A_m = A_s + X L X^T\) (which is akin to the Euclidean mean of \(A_s\) and \(X L X^T\)). Here, \(0 \le \gamma \le 1\) is the weight parameter.

Finally, we obtain the weighted variant of the third objective function in Eq. (8):

$$\begin{aligned} \min _{A \succ 0} \ \eta _t(A) := (1-t) \ \delta _R^2(A, (A_s + X(K + \mu L)X^T)^{-1}) + t \ \delta _R^2(A, A_t), \end{aligned}$$
(11)

whose unique solution is given by \(A = A_{gs}^{-1} \ \sharp _t \ A_t\), where \(A_{gs} = A_s + X (K + \mu L) X^T\) as previously noted. Again, the cascaded variant is obtained by setting \(A_{gs} = A_s \sharp _\gamma (X (K + \mu L) X^T)\) instead of \(A_{gs} = A_s + X (K + \mu L) X^T\).

7 Domain Adaptation Algorithms

We now describe the proposed domain adaptation algorithms, based on the above development of approaches reflecting geometric and statistical constraints on the inferred solution. All the proposed algorithms are summarized in Algorithm 1. The algorithms are based on finding a Mahalanobis distance matrix A interpolating the source and target covariances (GCA1), incorporating an additional MMD metric (GCA2), and finally incorporating the source and target manifold geometry (GCA3). It is noteworthy that all the variants rely on computation of the sharp mean, a unifying motif that ties together the various proposed methods. Modeling the Riemannian manifold underlying SPD matrices ensures the optimality and uniqueness of our proposed methods.

Algorithm 1. The proposed GCA domain adaptation methods (GCA1, GCA2, GCA3, and their cascaded variants).
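To complement Algorithm 1, the following is a minimal Python sketch of the three methods and their cascaded variants, assembled from Eqs. (4)–(11); the function signature, the regularization term eps, and the final feature-projection comment are illustrative choices of ours rather than a transcription of the original pseudocode.

```python
import numpy as np

def spd_power(M, p):
    """Fractional power of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * w ** p) @ V.T

def sharp_mean(X, Y, t):
    """Weighted geometric (sharp) mean X #_t Y of two SPD matrices (Sect. 3)."""
    Xh, Xmh = spd_power(X, 0.5), spd_power(X, -0.5)
    return Xh @ spd_power(Xmh @ Y @ Xmh, t) @ Xh

def gca(A_s, A_t, X=None, L=None, K=None, t=0.5, mu=1.0,
        variant="GCA1", gamma=None, eps=1e-6):
    """Return the Mahalanobis matrix A for GCA1/GCA2/GCA3 and the cascaded variants.

    A_s, A_t : source/target second-order matrices of Eqs. (2)-(3)
    X        : combined data matrix, samples as columns (source first)
    L, K     : MMD coefficient matrix (Sect. 4) and diffusion-kernel block matrix (Sect. 5)
    gamma    : if not None, use the cascaded (geometric-mean) combination of Sect. 6
    """
    d = A_s.shape[0]
    if variant == "GCA1":
        A_src = A_s                                   # A = A_s^{-1} #_t A_t
    else:
        if variant == "GCA2":
            C = X @ L @ X.T                           # statistical (MMD) term
        elif variant == "GCA3":
            C = X @ (K + mu * L) @ X.T                # statistical + manifold term
        else:
            raise ValueError(variant)
        C = C + eps * np.eye(d)                       # keep the term safely SPD
        A_src = A_s + C if gamma is None else sharp_mean(A_s, C, gamma)
    return sharp_mean(np.linalg.inv(A_src), A_t, t)

# One plausible use of the learned metric: since A = W^T W with W = A^{1/2},
# source features (rows of Xs) can be projected as  Xs @ spd_power(A, 0.5).
```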

8 Experimental Results

We present experimental results using the standard computer vision testbed used in prior work: the Office [16] and extended Office-Caltech10 [14] benchmark datasets. The Office-Caltech10 dataset contains 10 object categories from an office environment (e.g., keyboard, laptop, and so on) in four image domains: Amazon (A), Caltech256 (C), DSLR (D), and Webcam (W). The Office dataset has 31 categories (the previous 10 categories and 21 additional ones).

An exhaustive comparison of the three proposed methods with a variety of previous methods is summarized in Table 1. The methods compared in the table all follow the unsupervised domain adaptation setting, in which a support vector machine (SVM) classifier is used. The experiments follow the standard protocol established by previous works in domain adaptation using this dataset. The features used (SURF) are encoded with 800-bin bag-of-words histograms and normalized to have zero mean and unit standard deviation in each dimension. As there are four domains, there are 12 ensuing transfer learning problems, denoted in Table 1 below as \(\mathbf{A} \rightarrow \mathbf{D}\) (for Amazon to DSLR, and so on). For each of the 12 transfer learning tasks, the best performing method is indicated in boldface. We used 30 randomized trials for each experiment, in each trial randomly sampling the same number of labeled images in the source domain as the training set and using all the unlabeled data in the target domain as the test set. All experiments used an SVM classifier to measure accuracy, implemented with the standard libsvm package. The methods compared against in Table 1 include the following alternatives:

  • Baseline-S: This approach uses the projection defined by using PCA in the source domain to project both source and target data.

  • Baseline-T: Here, PCA is used in the target domain to extract a low-dimensional subspace.

  • NA: No adaptation is used and the original subspace is used for both source and target domains.

  • GFK: This approach refers to the geodesic flow kernel [16], which computes a kernel based on the geodesic on the Grassmannian between the PCA-derived source and target subspaces.

  • TCA: This approach refers to the transfer component analysis method [21].

  • SA: This approach refers to the subspace alignment method [14].

  • CORAL: This approach refers to the correlational alignment method [24].

  • GCA1: This is a new proposed method, based on finding the weighted geometric mean of the inverse of the source matrix \(A_s\) and the target matrix \(A_t\).

  • GCA2: This is a new proposed method, based on finding the (non-cascaded) weighted geometric mean of the inverse of the revised source matrix \(A_m\) (Sect. 4) and the target matrix \(A_t\).

  • GCA3: This is a new proposed method, based on finding the (non-cascaded) weighted geometric mean of the inverse of the revised source matrix \(A_{gs}\) (Sect. 5) and the target matrix \(A_t\).

  • Cascaded-GCA2: This is a new proposed method, based on finding the cascaded weighted geometric mean of the inverse of the revised source matrix \(A_m\) and the target matrix \(A_t\).

  • Cascaded-GCA3: This is a new proposed method, based on finding the cascaded weighted geometric mean of the inverse of the revised source matrix \(A_{gs}\) and the target matrix \(A_t\).

Table 1. Recognition accuracy with unsupervised domain adaptation using SVM classifier (Office dataset + Caltech10). Our proposed methods (labeled GCAXX and Cascaded-GCAXX) perform better than all other methods in a majority of the 12 transfer learning tasks.
Fig. 2. Comparison of the cascaded GCA3 method with correlational alignment (CORAL). The parameters t and \(\gamma \) were varied between 0.1 and 0.9.

Fig. 3. Comparison of the cascaded GCA3 method with correlational alignment (CORAL) in terms of percentage improvement.

Fig. 4. Comparison of the three proposed GCA methods (GCA1, GCA2, and GCA3) with correlational alignment (CORAL) on the Amazon to DSLR task.

Fig. 5. Comparison of the three proposed GCA methods (GCA1, GCA2, and GCA3) with correlational alignment (CORAL) on the Caltech10 to Webcam task.

One question that arises in the proposed algorithms is how to choose the value of t in computing the weighted sharp mean. Figure 2 illustrates the variation in performance of the cascaded GCA3 method over CORAL over the range \(t, \gamma \in (0.1, 0.9)\), with \(\mu \) fixed for simplicity. Repeating such experiments over all 12 transfer learning tasks, Fig. 3 shows the percentage improvement of the cascaded GCA3 method over correlational alignment (CORAL), using the best values of all three hyperparameters found by cross-validation. Figure 4 compares the performance of the proposed GCA1, GCA2, and GCA3 methods when just the t hyperparameter is varied between 0.1 and 0.9, for the Amazon to DSLR domain adaptation task. Note that the variation in performance with t peaks at different points for the three methods, and while their performance is superior overall to CORAL, their relative performances at the maximum values are not very different from each other. Figure 5 repeats the same comparison for the Caltech10 to Webcam domain adaptation task. As these plots clearly reveal, the values of the hyperparameters have a crucial influence on the performance of all the proposed GCAXX methods. In each of these plots, the performance of the GCA methods is compared against the fixed performance of the CORAL method.

9 Summary and Future Work

In this paper, we introduced a novel formulation of the classic domain adaptation problem in machine learning, based on computing the cascaded geometric mean of second-order statistics from the source and target domains to align them. Our approach builds on the nonlinear Riemannian geometry of the open cone of symmetric positive definite (SPD) matrices, on which the geometric mean lies along the shortest-path geodesic connecting the source and target covariances. Our approach has three key advantages over previous work: (a) Simplicity: the Riccati equation is a mathematically elegant solution to the domain adaptation problem, enabling the integration of geometric and statistical information. (b) Theory: our approach exploits the Riemannian geometry of SPD matrices. (c) Extensibility: as our algorithm development indicates, it is possible to easily extend our approach to capture more types of constraints, from geometric to statistical.

There are many directions for extending our work. We briefly alluded to optimal transport theory as providing an additional theoretical justification for our solution of using the geometric mean of the source and target covariances, a link that deserves further exploration in a subsequent paper. Also, while we did not explore nonlinear variants of our approach, it is possible to extend our approach to a deep learning version in which the gradients of the three objective functions are used to tune the weights of a multi-layer neural network. As in the case of correlational alignment (CORAL), we anticipate that such deep learning variants may perform better due to the construction of improved features of the training data. The experimental results also show that the performance improvement tends to be more significant in some transfer tasks than in others.