Learning Kernels for Unsupervised Domain Adaptation with Applications to Visual Object Recognition

Abstract

Domain adaptation aims to correct the mismatch in statistical properties between the source domain, on which a classifier is trained, and the target domain, to which the classifier is applied. In this paper, we address the challenging scenario of unsupervised domain adaptation, where the target domain provides no annotated data to assist in adapting the classifier. Our strategy is to learn robust features that are resilient to the mismatch across domains and then use them to construct classifiers that perform well on the target domain. To this end, we propose novel kernel learning approaches to infer such features for adaptation. Concretely, we explore two closely related directions. In the first direction, we propose unsupervised learning of a geodesic flow kernel (GFK). The GFK summarizes the inner products in an infinite sequence of feature subspaces that smoothly interpolates between the source and target domains. In the second direction, we propose supervised learning of a kernel that discriminatively combines multiple base GFKs. Those base kernels model the source and the target domains at fine-grained granularities. In particular, each base kernel pivots on a different set of landmarks: the data instances that best reveal the similarity between the source and the target domains, thus bridging them to achieve adaptation. Our approaches are computationally convenient, automatically infer important hyper-parameters, and are capable of learning features and classifiers discriminatively without demanding labeled data from the target domain. In extensive empirical studies on standard benchmark recognition datasets, our approaches yield state-of-the-art results compared to a variety of competing methods.

Notes

  1. Note that we assume the set of possible labels is the same across domains.
  2. A similar idea was pursued in Gopalan et al. (2011). We contrast it to our work in Sect. 5.
  3. The unit-ball condition allows the difference to be represented as a metric in the form of Eq. (13), and the universality ensures that the means are injective, such that the difference in the means is zero if and only if the two distributions are the same. For more detailed theoretical analysis, please refer to Gretton et al. (2006).
  4. Note that we do not require the landmarks to be i.i.d. samples from \(P_S(X)\); they only need to be representative samples of \(P_L(X)\).
  5. In the supplementary material for our previously published work (Gong et al. 2012), we report our results on 31 categories common to Amazon, Webcam and DSLR, to compare directly to published results from the literature (Saenko et al. 2010; Kulis et al. 2011; Gopalan et al. 2011). Despite occasional discrepancies between the published results and the results obtained by our own experimentation on these 31 categories, they demonstrate the same trend: our proposed methods significantly outperform competing approaches.
  6. http://www-scf.usc.edu/~boqinggo/da.html
  7. We did not use DSLR as the source domain in these experiments, as it is too small to select landmarks.

References

  1. Ando, R., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR, 6, 1817–1853.
  2. Bay, H., Tuytelaars, T., & Van Gool, L. (2006). SURF: Speeded up robust features. In ECCV.
  3. Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Wortman, J. (2010). A theory of learning from different domains. Machine Learning, 79, 151–175.
  4. Ben-David, S., Blitzer, J., Crammer, K., & Pereira, F. (2007). Analysis of representations for domain adaptation. In NIPS.
  5. Bergamo, A., & Torresani, L. (2010). Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In NIPS.
  6. Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, Bollywood, boomboxes and blenders: Domain adaptation for sentiment classification. In ACL.
  7. Blitzer, J., Foster, D., & Kakade, S. (2011). Domain adaptation with coupled subspaces. In AISTATS.
  8. Blitzer, J., McDonald, R., & Pereira, F. (2006). Domain adaptation with structural correspondence learning. In EMNLP.
  9. Bruzzone, L., & Marconcini, M. (2010). Domain adaptation problems: A DASVM classification technique and a circular validation strategy. IEEE PAMI, 32(5), 770–787.
  10. Chen, M., Weinberger, K., & Blitzer, J. (2011). Co-training for domain adaptation. In NIPS.
  11. Daumé, H., III. (2007). Frustratingly easy domain adaptation. In ACL.
  12. Daumé, H., Kumar, A., & Saha, A. (2010). Co-regularization based semi-supervised domain adaptation. In NIPS.
  13. Daumé, H., III, & Marcu, D. (2006). Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26(1), 101–126.
  14. Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR.
  15. Dollár, P., Wojek, C., Schiele, B., & Perona, P. (2009). Pedestrian detection: A benchmark. In CVPR.
  16. Dredze, M., & Crammer, K. (2008). Online methods for multi-domain learning and adaptation. In Proceedings of the conference on empirical methods in natural language processing (EMNLP ’08) (pp. 689–697).
  17. Duan, L., Tsang, I., Xu, D., & Maybank, S. (2009). Domain transfer SVM for video concept detection. In CVPR.
  18. Duan, L., Xu, D., & Tsang, I. (2012). Domain adaptation from multiple sources: A domain-dependent regularization approach. IEEE Transactions on Neural Networks and Learning Systems, 23(3), 504–518.
  19. Duan, L., Xu, D., Tsang, I., & Luo, J. (2010). Visual event recognition in videos by learning from web data. In CVPR.
  20. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2007). The PASCAL visual object classes challenge 2007.
  21. Fei-Fei, L., Fergus, R., & Perona, P. (2007). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1), 59–70.
  22. Gong, B., Grauman, K., & Sha, F. (2013a). Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In ICML.
  23. Gong, B., Grauman, K., & Sha, F. (2013b). Reshaping visual datasets for domain adaptation. In NIPS.
  24. Gong, B., Shi, Y., Sha, F., & Grauman, K. (2012). Geodesic flow kernel for unsupervised domain adaptation. In CVPR.
  25. Gopalan, R. (2013). Learning cross-domain information transfer for location recognition and clustering. In CVPR.
  26. Gopalan, R., Li, R., & Chellappa, R. (2011). Domain adaptation for object recognition: An unsupervised approach. In ICCV.
  27. Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., & Smola, A. (2006). A kernel method for the two-sample-problem. In NIPS.
  28. Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., & Schölkopf, B. (2009). Covariate shift by kernel mean matching. In J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, & N. Lawrence (Eds.), Dataset shift in machine learning. Cambridge: MIT Press.
  29. Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset. Tech. rep., Caltech.
  30. Ham, J., Lee, D. D., Mika, S., & Schölkopf, B. (2004). A kernel view of the dimensionality reduction of manifolds. In ICML.
  31. Hamm, J., & Lee, D. (2008). Grassmann discriminant analysis: A unifying view on subspace-based learning. In ICML.
  32. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. Berlin: Springer.
  33. Huang, J., Smola, A., Gretton, A., Borgwardt, K., & Schölkopf, B. (2006). Correcting sample selection bias by unlabeled data. In NIPS.
  34. Jain, V., & Learned-Miller, E. (2011). Online domain adaptation of a pre-trained cascade of classifiers. In CVPR.
  35. Kulis, B., Saenko, K., & Darrell, T. (2011). What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In CVPR.
  36. Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. (2004). Learning the kernel matrix with semidefinite programming. JMLR, 5, 27–72.
  37. Leggetter, C., & Woodland, P. (1995). Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, 9(2), 171–185.
  38. Li, R., & Zickler, T. (2012). Discriminative virtual views for cross-view action recognition. In CVPR.
  39. Mansour, Y., Mohri, M., & Rostamizadeh, A. (2009a). Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430.
  40. Mansour, Y., Mohri, M., & Rostamizadeh, A. (2009b). Multiple source adaptation and the Rényi divergence. In UAI.
  41. Pan, S., Tsang, I., Kwok, J., & Yang, Q. (2009). Domain adaptation via transfer component analysis. IEEE Trans Neural Nets, 99, 1–12.
  42. Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Trans Knowledge and Data Engineering, 22(10), 1345–1359.
  43. Perronnin, F., Sánchez, J., & Liu, Y. (2010). Large-scale image categorization with explicit data embedding. In CVPR.
  44. Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326.
  45. Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2008). LabelMe: A database and web-based tool for image annotation. IJCV, 77, 157–173.
  46. Saenko, K., Kulis, B., Fritz, M., & Darrell, T. (2010). Adapting visual category models to new domains. In ECCV.
  47. Shi, Y., & Sha, F. (2012). Information-theoretical learning of discriminative clusters for unsupervised domain adaptation. In ICML.
  48. Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2), 227–244.
  49. Torralba, A., & Efros, A. (2011). Unbiased look at dataset bias. In CVPR.
  50. Vedaldi, A., Gulshan, V., Varma, M., & Zisserman, A. (2009). Multiple kernels for object detection. In ICCV.
  51. Wang, M., & Wang, X. (2011). Automatic adaptation of a generic pedestrian detector to a specific traffic scene. In CVPR.
  52. Weinberger, K. Q., & Saul, L. K. (2006). Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70(1), 77–90.
  53. Zheng, J., Liu, M. Y., Chellappa, R., & Phillips, P. J. (2012). A Grassmann manifold-based domain adaptation approach. In ICPR.

Acknowledgments

This work is partially supported by DARPA D11-AP00278 and NSF IIS-1065243 (B.G. and F.S.), and ONR ATL #N00014-11-1-0105 (K.G.). We thank the anonymous reviewers for their constructive comments and suggestions. The Flickr images in Fig. 1 are under a CC Attribution 2.0 Generic license, courtesy of berzowska, IvanWalsh.com, warrantedarrest, HerryLawford, yuichi.sakuraba, zimaowuyu, GrahamAndDairne, Bernt Rostad, Keith Roper, flavorrelish, and deflam.

Author information

Corresponding author

Correspondence to Boqing Gong.

Additional information

Communicated by Hal Daumé.

Appendices

Appendix 1: Derivation of Geodesic Flow Kernel (GFK)

Let \(\varvec{\varOmega }^{\mathrm{T}}\) denote the following matrix

$$\begin{aligned} \varvec{\varOmega }^{\mathrm{T}}= [\varvec{P}_\mathcal {S}\quad \varvec{R}_\mathcal {S}]\left[ \begin{aligned} \varvec{U}_1&\varvec{0}\\ \varvec{0}&\varvec{U}_2\end{aligned} \right] . \end{aligned}$$
(21)

The geodesic flow \(\varvec{\varPhi }(t),\; t\in (0,1)\), between \(\varvec{P}_\mathcal {S}\) and \(\varvec{P}_\mathcal {T}\) can be written as

$$\begin{aligned} \varvec{\varPhi }(t) = \varvec{P}_\mathcal {S}\varvec{U}_1 \varvec{\varGamma }(t) - \varvec{R}_\mathcal {S}\varvec{U}_2 \varvec{\varSigma }(t) = \varvec{\varOmega }^{\mathrm{T}}\left[ \begin{aligned} \varvec{\varGamma }(t)\\ -\varvec{\varSigma }(t)\end{aligned} \right] . \end{aligned}$$
(22)

Recall that the geodesic flow kernel (GFK) is defined as

$$\begin{aligned} \langle \varvec{z}_i^\infty , \varvec{z}_j^\infty \rangle =\int \limits _0^1 (\varvec{\varPhi }(t)^{\mathrm{T}}\varvec{x}_i)^{\mathrm{T}}(\varvec{\varPhi }(t)^{\mathrm{T}}\varvec{x}_j)\ dt = \varvec{x}_i^{\mathrm{T}}\varvec{G}\varvec{x}_j, \end{aligned}$$
(23)

where

$$\begin{aligned} \varvec{G} = \int \limits _{0}^1{\varvec{\varPhi }(t)\varvec{\varPhi }(t)^{\mathrm{T}}}dt. \end{aligned}$$
(24)

Substituting the expression for \(\varvec{\varPhi }(t)\) from Eq. (22) into Eq. (24), we have (ignoring \(\varvec{\varOmega }\) for the moment),

$$\begin{aligned} \varvec{G} \propto \int \limits _{0}^1 \left[ \begin{aligned} \varvec{\varGamma }(t)\varvec{\varGamma }(t)&-\varvec{\varGamma }(t)\varvec{\varSigma }(t)\\ -\varvec{\varSigma }(t)\varvec{\varGamma }(t)&\varvec{\varSigma }(t)\varvec{\varSigma }(t)\end{aligned} \right] dt \end{aligned}$$
(25)

Both \(\varvec{\varGamma }(t)\) and \(\varvec{\varSigma }(t)\) are diagonal matrices, with \(i\)th diagonal elements \(\cos (t\theta _i)\) and \(\sin (t\theta _i)\), respectively. Thus, we can integrate in closed form,

$$\begin{aligned} \lambda _{1i}&= \int \limits _0^1 \cos ^2(t\theta _i) dt =\frac{1}{2}\left[ 1+\frac{\sin (2\theta _i)}{2\theta _i}\right] , \end{aligned}$$
(26)
$$\begin{aligned} \lambda _{2i}&= - \int \limits _0^1 \cos (t\theta _i) \sin (t\theta _i)dt = \frac{\cos (2\theta _i)-1}{4\theta _i}, \end{aligned}$$
(27)
$$\begin{aligned} \lambda _{3i}&= \int \limits _0^1 \sin ^2(t\theta _i) dt = \frac{1}{2}\left[ 1-\frac{\sin (2\theta _i)}{2\theta _i}\right] , \end{aligned}$$
(28)

which become the \(i\)th diagonal elements of diagonal matrices \(\varvec{\varLambda }_1,\; \varvec{\varLambda }_2\), and \(\varvec{\varLambda }_3\), respectively. In terms of these matrices, the inner product Eq. (23) is a linear kernel \(\varvec{x}_i^{\mathrm{T}}\varvec{G}\varvec{x}_j\) with the matrix \(\varvec{G}\) given by

$$\begin{aligned} \varvec{G}= \varvec{\varOmega }^{\mathrm{T}}\left[ \begin{aligned} \varvec{\varLambda }_1&\varvec{\varLambda }_2\\ \varvec{\varLambda }_2&\varvec{\varLambda }_3\end{aligned} \right] \varvec{\varOmega }. \end{aligned}$$
(29)
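The closed form of \(\varvec{G}\) can be checked numerically against direct integration of \(\varvec{\varPhi }(t)\varvec{\varPhi }(t)^{\mathrm{T}}\). The following is a minimal NumPy sketch, not the authors' implementation. It assumes column-orthonormal bases of size \(\mathsf{D}\times \mathsf{d}\) with nonzero principal angles, and it assumes the sign convention \(\varvec{R}_\mathcal {S}^{\mathrm{T}}\varvec{P}_\mathcal {T}= -\varvec{U}_2\varvec{\varSigma }\varvec{V}^{\mathrm{T}}\), chosen so that \(\varvec{\varPhi }(1)\) spans the target subspace. The product \(\varvec{R}_\mathcal {S}\varvec{U}_2\) is obtained from a projection rather than by forming \(\varvec{R}_\mathcal {S}\), which is a shortcut taken for illustration. The three integrals are written with their exact normalization, a common factor of \(1/2\) that only rescales \(\varvec{G}\). All function names are hypothetical.

```python
import numpy as np


def closed_form_lambdas(theta):
    """Exact values of the three integrals over t in [0, 1]."""
    # np.sinc(x) = sin(pi x) / (pi x), so s2 = sin(2 theta) / (2 theta),
    # with the correct limit of 1 at theta = 0.
    s2 = np.sinc(2.0 * theta / np.pi)
    lam1 = 0.5 * (1.0 + s2)                               # int cos^2(t theta) dt
    lam2 = -0.5 * np.sin(theta) * np.sinc(theta / np.pi)  # -int cos * sin dt
    lam3 = 0.5 * (1.0 - s2)                               # int sin^2(t theta) dt
    return lam1, lam2, lam3


def geodesic_pieces(P_S, P_T, eps=1e-12):
    """Return A = P_S U_1, B = R_S U_2 and the principal angles theta."""
    U1, gamma, Vt = np.linalg.svd(P_S.T @ P_T)
    theta = np.arccos(np.clip(gamma, 0.0, 1.0))
    A = P_S @ U1
    PTV = P_T @ Vt.T
    # Component of P_T V orthogonal to span(P_S); under the assumed sign
    # convention it equals -(R_S U_2) Sigma.
    resid = PTV - P_S @ (P_S.T @ PTV)
    B = -resid / np.maximum(np.sin(theta), eps)
    return A, B, theta


def gfk_closed_form(P_S, P_T):
    """G assembled from the diagonal matrices Lambda_1, Lambda_2, Lambda_3."""
    A, B, theta = geodesic_pieces(P_S, P_T)
    lam1, lam2, lam3 = closed_form_lambdas(theta)
    return ((A * lam1) @ A.T + (A * lam2) @ B.T
            + (B * lam2) @ A.T + (B * lam3) @ B.T)


def gfk_quadrature(P_S, P_T, n_nodes=32):
    """G from Gauss-Legendre quadrature of Phi(t) Phi(t)^T on [0, 1]."""
    A, B, theta = geodesic_pieces(P_S, P_T)
    x, w = np.polynomial.legendre.leggauss(n_nodes)
    G = np.zeros((P_S.shape[0], P_S.shape[0]))
    for t, wt in zip(0.5 * (x + 1.0), 0.5 * w):
        Phi = A * np.cos(t * theta) - B * np.sin(t * theta)
        G += wt * (Phi @ Phi.T)
    return G


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P_S, _ = np.linalg.qr(rng.standard_normal((8, 3)))
    P_T, _ = np.linalg.qr(rng.standard_normal((8, 3)))
    G = gfk_closed_form(P_S, P_T)
    print("max deviation:", np.abs(G - gfk_quadrature(P_S, P_T)).max())
```

Given data vectors `x_i` and `x_j`, the kernel value is then `x_i @ G @ x_j`.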

Appendix 2: Derivation of the Rank of Domain (ROD) Metric

Principal Angles and Vectors

Let \(\varvec{P}_\mathcal {S}\) and \(\varvec{P}_\mathcal {T}\) be bases of two subspaces. The principal angles \(\theta _i\) between the two subspaces are recursively defined as

$$\begin{aligned} \cos (\theta _i) = \max _{\varvec{s}_i \in \mathrm {span}(\varvec{P}_\mathcal {S})}\max _{\varvec{t}_i \in \mathrm {span}(\varvec{P}_\mathcal {T})} \frac{\langle \varvec{s}_i, \varvec{t}_i\rangle }{\Vert \varvec{s}_i\Vert \Vert \varvec{t}_i\Vert }, \end{aligned}$$
(30)

such that

$$\begin{aligned} \begin{array}{cc} \varvec{s}_k \in \mathrm {span}(\varvec{P}_\mathcal {S}), &{} \varvec{s}_i \bot \varvec{s}_k, \\ \varvec{t}_k \in \mathrm {span}(\varvec{P}_\mathcal {T}), &{} \varvec{t}_i \bot \varvec{t}_k, \\ \end{array} \quad k = 1,2,\ldots ,i-1. \end{aligned}$$

In the above, \(\varvec{s}_i\) and \(\varvec{t}_i\) are called the principal vectors associated with \(\theta _i\). Essentially, the principal vectors form new bases for the two subspaces such that, after the change of basis, the two subspaces maximally overlap. The degree of overlap is measured by the principal angles, the smallest angles between the bases.

Given the singular value decomposition,

$$\begin{aligned} \varvec{P}_\mathcal {S}^{\mathrm{T}}\varvec{P}_\mathcal {T}= \varvec{U}_1\varvec{\varGamma }\varvec{V}^{\mathrm{T}}\end{aligned}$$
(31)

both the principal angles and vectors can be computed efficiently:

$$\begin{aligned} \theta _i = \arccos \gamma _{i}, \quad \varvec{s}_i = (\varvec{P}_\mathcal {S}\varvec{U}_1)_{\cdot ,i}, \quad \varvec{t}_i = (\varvec{P}_\mathcal {T}\varvec{V})_{\cdot ,i}, \end{aligned}$$
(32)

where \(\gamma _{i}\) is the \(i\)th diagonal element of the diagonal matrix \(\varvec{\varGamma }\). \((\varvec{M})_{\cdot ,i}\) returns the \(i\)th column of the matrix \(\varvec{M}\).
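As an illustration of Eqs. (31) and (32), the following minimal NumPy sketch computes principal angles and vectors from the SVD. It assumes the columns of each basis are orthonormal. The function name and the small example in three dimensions are hypothetical and not taken from the paper.

```python
import numpy as np


def principal_angles(P_S, P_T):
    """Principal angles and vectors of two subspaces via the SVD.

    P_S, P_T: (D, d) arrays with orthonormal columns.
    Returns theta (ascending), and matrices whose i-th columns are the
    principal vectors s_i and t_i.
    """
    U1, gamma, Vt = np.linalg.svd(P_S.T @ P_T)
    # Clip guards against singular values drifting outside [0, 1]
    # by floating-point rounding before taking arccos.
    theta = np.arccos(np.clip(gamma, 0.0, 1.0))
    S = P_S @ U1       # columns are s_i
    T = P_T @ Vt.T     # columns are t_i
    return theta, S, T


if __name__ == "__main__":
    # Two planes in R^3 that share the first axis and are tilted by 45 degrees.
    P_S = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
    r = 1.0 / np.sqrt(2.0)
    P_T = np.array([[1.0, 0.0], [0.0, r], [0.0, r]])
    theta, S, T = principal_angles(P_S, P_T)
    print(np.degrees(theta))
```

The inner product of each pair of principal vectors recovers \(\cos \theta _i\), which is a quick way to validate an implementation.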

Computing ROD

Let \(\varvec{X}_\mathcal {S}\in \mathbb {R}^{\mathsf {N}_S \times \mathsf {D}}\) and \(\varvec{X}_\mathcal {T}\in \mathbb {R}^{\mathsf {N}_T \times \mathsf {D}}\) denote the data from the source and the target domains. We use their PCA subspaces to compute the ROD metric. The optimal dimensionality \(\mathsf {d}^*\) of the subspaces is selected with our subspace disagreement measure, described in Sect. 2.4 in the main text.

The ROD metric integrates both geometrical and statistical information between two domains by

$$\begin{aligned} \mathcal {R}(\mathcal {S},\mathcal {T}) = \frac{1}{\mathsf {d}^*}\sum _{i=1}^{{\mathsf {d}^*}}\theta _i \left[ KL(\mathcal {S}_i \Vert \mathcal {T}_i) + KL(\mathcal {T}_i\Vert \mathcal {S}_i)\right] , \end{aligned}$$
(33)

where \(\mathcal {S}_i\) and \(\mathcal {T}_i\) are the one-dimensional distributions of the projections \(\varvec{X}_\mathcal {S}\varvec{s}_i\) and \(\varvec{X}_\mathcal {T}\varvec{t}_i\), respectively. In other words, we project the data onto the principal vectors and compare how (dis)similarly the data are distributed across domains.

We approximate these two distributions with one-dimensional Gaussians. Note that \(\varvec{X}_\mathcal {S}\) and \(\varvec{X}_\mathcal {T}\) have zero mean. We thus need only to compute the variances in order to specify the Gaussians. These variances can be readily computed from the projections and the covariance matrices of the original data:

$$\begin{aligned} \sigma _{i\mathcal {S}}^2 = \frac{1}{\mathsf {N}_S} \varvec{s}_i^{\mathrm{T}}\varvec{X}_\mathcal {S}^{\mathrm{T}}\varvec{X}_\mathcal {S}\varvec{s}_i,\quad \sigma _{i\mathcal {T}}^2 = \frac{1}{\mathsf {N}_T} \varvec{t}_i^{\mathrm{T}}\varvec{X}_\mathcal {T}^{\mathrm{T}}\varvec{X}_\mathcal {T}\varvec{t}_i. \end{aligned}$$
(34)

In terms of the approximating Gaussians, the ROD metric is computed in closed form:

$$\begin{aligned} \mathcal {R}(\mathcal {S},\mathcal {T}) = \frac{1}{\mathsf {d}^*}\sum _{i=1}^{{\mathsf {d}^*}}\theta _i \left[ \frac{1}{2}\frac{\sigma _{i\mathcal {S}}^2}{\sigma _{i\mathcal {T}}^2} + \frac{1}{2}\frac{\sigma _{i\mathcal {T}}^2}{\sigma _{i\mathcal {S}}^2} -1\right] . \end{aligned}$$
(35)
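The computation in Eqs. (31) to (35) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the subspace dimensionality `d` is passed in directly instead of being selected by the subspace disagreement measure, the data are centered inside the function, and the names `pca_basis` and `rank_of_domain` are hypothetical.

```python
import numpy as np


def pca_basis(X, d):
    """Top-d PCA directions of centered data X (N, D) as a (D, d) matrix."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:d].T


def rank_of_domain(X_S, X_T, d):
    """ROD metric with one-dimensional zero-mean Gaussian approximations."""
    X_S = X_S - X_S.mean(axis=0)
    X_T = X_T - X_T.mean(axis=0)
    P_S, P_T = pca_basis(X_S, d), pca_basis(X_T, d)

    # Principal angles and vectors, Eqs. (31)-(32).
    U1, gamma, Vt = np.linalg.svd(P_S.T @ P_T)
    theta = np.arccos(np.clip(gamma, 0.0, 1.0))
    S, T = P_S @ U1, P_T @ Vt.T

    # Variances of the projections onto the principal vectors, Eq. (34).
    var_S = np.mean((X_S @ S) ** 2, axis=0)
    var_T = np.mean((X_T @ T) ** 2, axis=0)

    # Symmetrized KL between N(0, var_S) and N(0, var_T), Eq. (35).
    sym_kl = 0.5 * var_S / var_T + 0.5 * var_T / var_S - 1.0
    return float(np.mean(theta * sym_kl))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_S = rng.standard_normal((200, 6))
    X_T = rng.standard_normal((150, 6)) @ rng.standard_normal((6, 6))
    print("ROD(S, T):", rank_of_domain(X_S, X_T, d=3))
    print("ROD(S, S):", rank_of_domain(X_S, X_S, d=3))
```

Each summand is non-negative, because the bracketed term is a symmetrized KL divergence and the principal angles lie in \([0, \pi /2]\). A domain compared with itself gives a value of zero up to rounding.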

Appendix 3: Proof of Theorem 1

We first prove the following lemma.

Lemma 1

Under the condition of Theorem 1, the following inequality holds:

$$\begin{aligned} KL(P_{S}\Vert P_L)\le KL(P_{S}\Vert P_T) \end{aligned}$$
(36)

Proof

We start with

$$\begin{aligned} KL(P_{S}\Vert P_T)&= KL(\alpha P_N+(1-\alpha )P_L\Vert P_T) \\&=\int \left[ \alpha P_N+(1-\alpha )P_L\right] \\&\qquad \log \frac{\alpha P_N+(1-\alpha )P_L}{P_T}\, dX \end{aligned}$$

We now use the concavity of the \(\log \) function to arrive at

$$\begin{aligned} KL(P_{S}\Vert P_T)&\ge \int \left[ \alpha P_N+(1-\alpha )P_L\right] \left[ \alpha \log \frac{P_N}{P_T}\right. \nonumber \\&\quad \left. +(1-\alpha )\log \frac{P_L}{P_T}\right] \, dX \nonumber \\&=\alpha ^{2}KL(P_N\Vert P_T)+(1-\alpha )^{2}KL(P_L\Vert P_T) \nonumber \\&\quad + \alpha (1-\alpha )C(P_L,P_N,P_T), \end{aligned}$$
(37)

where

$$\begin{aligned}&C(P_L,P_N,P_T)\nonumber \\&\quad =\int \left( P_N\log \frac{P_L}{P_T}+P_L\log \frac{P_N}{P_T}\right) \, dX \nonumber \\&\quad =\int \left( P_N\log \frac{P_N}{P_T}-P_N\log \frac{P_N}{P_L}+P_L\log \frac{P_L}{P_T}-P_L\log \frac{P_L}{P_N}\right) \, dX \nonumber \\&\quad =KL(P_N\Vert P_T)-KL(P_N\Vert P_L)+KL(P_L\Vert P_T)-KL(P_L\Vert P_N) \end{aligned}$$
(38)

Substituting Eq. (38) into Eq. (37), we have

$$\begin{aligned} KL(P_{S}\Vert P_T)&\ge \alpha KL(P_N\Vert P_T)+(1-\alpha )KL(P_L\Vert P_T) \nonumber \\&\quad -\alpha (1-\alpha )\left[ KL(P_N\Vert P_L)+KL(P_L\Vert P_N)\right] \end{aligned}$$
(39)

Applying the condition of Theorem 1 to the right-hand side of the inequality, we have

$$\begin{aligned} KL(P_{S}\Vert P_T) \ge \left[ \frac{9}{8} - 2\alpha (1-\alpha )\right] A \end{aligned}$$
(40)

where \(A = \max \left\{ KL(P_N\Vert P_L),\, KL(P_L\Vert P_N)\right\} \).

Note that

$$\begin{aligned} \frac{9}{8} - 2\alpha (1-\alpha ) \ge \alpha \end{aligned}$$

as the maximum of \(2\alpha (1-\alpha )+\alpha \) is \(9/8\), attained at \(\alpha = 3/4\). This leads to
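The same bound can also be verified by completing the square, written out here as a short check:

$$\begin{aligned} \frac{9}{8} - 2\alpha (1-\alpha ) - \alpha = 2\alpha ^2 - 3\alpha + \frac{9}{8} = 2\left( \alpha - \frac{3}{4}\right) ^2 \ge 0. \end{aligned}$$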

$$\begin{aligned} KL(P_{S}\Vert P_T) \ge \alpha A \ge \alpha KL(P_N\Vert P_L) \end{aligned}$$
(41)

To complete the proof of the lemma, note that due to the convexity of the KL-divergence and \(KL(P_L\Vert P_L)=0\), we have

$$\begin{aligned} KL(P_{S}\Vert P_L)&= KL(\alpha P_N+(1-\alpha )P_L\Vert P_L)\\&\le \alpha KL(P_N\Vert P_L)+(1-\alpha )KL(P_L\Vert P_L)\\&= \alpha KL(P_N\Vert P_L). \end{aligned}$$

Combining the last two inequalities, we complete the proof of the lemma.\(\square \)

Proof of the Theorem

We start by applying the convexity of the KL-divergence again,

$$\begin{aligned} KL(P_{S}\Vert Q_{T})&= KL(P_{S}\Vert \beta P_T+(1-\beta )P_L) \nonumber \\&\le \beta KL(P_{S}\Vert P_T)+(1-\beta )KL(P_{S}\Vert P_L) \nonumber \\&\le \beta KL(P_{S}\Vert P_T)+(1-\beta )KL(P_{S}\Vert P_T) \nonumber \\&= KL (P_S \Vert P_T), \end{aligned}$$
(42)

where we have applied Lemma 1 in the second inequality. The resulting bound, \(KL(P_{S}\Vert Q_{T})\le KL(P_S\Vert P_T)\), is the desired result of the theorem.
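Several steps in the proof hold for any distributions, independent of the condition of Theorem 1: the lower bound of Eq. (39) and the two convexity steps. They can be sanity-checked numerically. The sketch below does so on random discrete distributions with strictly positive entries. It is an illustration only. The helper names are hypothetical, and the check does not test the theorem's condition itself, which is stated in the main text.

```python
import numpy as np


def kl(p, q):
    """KL divergence between two strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))


def random_dist(rng, k):
    """Random distribution on k outcomes, bounded away from zero."""
    return 0.9 * rng.dirichlet(np.ones(k)) + 0.1 / k


def check_once(rng, k=6, tol=1e-12):
    P_N, P_L, P_T = (random_dist(rng, k) for _ in range(3))
    alpha, beta = rng.uniform(0.05, 0.95, size=2)
    P_S = alpha * P_N + (1 - alpha) * P_L
    Q_T = beta * P_T + (1 - beta) * P_L

    # Eq. (39): lower bound on KL(P_S || P_T) from the concavity of log.
    lower = (alpha * kl(P_N, P_T) + (1 - alpha) * kl(P_L, P_T)
             - alpha * (1 - alpha) * (kl(P_N, P_L) + kl(P_L, P_N)))
    ok_eq39 = kl(P_S, P_T) >= lower - tol

    # Convexity in the first argument, used at the end of Lemma 1.
    ok_lemma = kl(P_S, P_L) <= alpha * kl(P_N, P_L) + tol

    # Convexity in the second argument, first inequality of Eq. (42).
    ok_thm = kl(P_S, Q_T) <= (beta * kl(P_S, P_T)
                              + (1 - beta) * kl(P_S, P_L) + tol)
    return ok_eq39 and ok_lemma and ok_thm


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(all(check_once(rng) for _ in range(1000)))
```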

About this article

Cite this article

Gong, B., Grauman, K. & Sha, F. Learning Kernels for Unsupervised Domain Adaptation with Applications to Visual Object Recognition. Int J Comput Vis 109, 3–27 (2014). https://doi.org/10.1007/s11263-014-0718-4

Keywords

  • Domain adaptation
  • Kernels
  • Object recognition
  • Cross-dataset bias