
Online Meta-learning for Multi-source and Semi-supervised Domain Adaptation

Conference paper in Computer Vision – ECCV 2020 (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12361)

Abstract

Domain adaptation (DA) is the topical problem of adapting models from labelled source datasets so that they perform well on target datasets where only unlabelled or partially labelled data is available. Many methods have been proposed to address this problem through different ways to minimise the domain shift between source and target datasets. In this paper, we take an orthogonal perspective and propose a framework to further enhance performance by meta-learning the initial conditions of existing DA algorithms. This is challenging compared to the more widely considered setting of few-shot meta-learning, due to the length of the computation graph involved. Therefore we propose an online shortest-path meta-learning framework that is both computationally tractable and practically effective for improving DA performance. We present variants for both multi-source unsupervised domain adaptation (MSDA) and semi-supervised domain adaptation (SSDA). Importantly, our approach is agnostic to the base adaptation algorithm and can be applied to improve many techniques. Experimentally, we demonstrate improvements on classic (DANN) and recent (MCD and MME) techniques for MSDA and SSDA, and ultimately achieve state-of-the-art results on several DA benchmarks, including the largest-scale DomainNet.


Notes

  1. One may not think of domain adaptation as being sensitive to initial conditions, but given the lack of target-domain supervision to guide learning, different initializations can lead to a significant 10–15% difference in accuracy (see Supplementary material).

  2. Other settings, such as the optimizer, number of iterations and data augmentation, are not clearly stated in [38], making it hard to replicate their results.

  3. We tried training for up to 50k iterations and found no clear improvement, so we train all models for 10k iterations to minimise cost.

  4. Using a GeForce RTX 2080 GPU and a Xeon Gold 6130 CPU @ 2.10 GHz.

  5. https://github.com/fmcarlucci/JigenDG.

  6. https://github.com/VisionLearningGroup/VisionLearningGroup.github.io.

References

  1. Andrychowicz, M., et al.: Learning to learn by gradient descent by gradient descent. In: NeurIPS (2016)
  2. Balaji, Y., Chellappa, R., Feizi, S.: Normalized Wasserstein distance for mixture distributions with applications in adversarial learning and domain adaptation. In: ICCV (2019)
  3. Balaji, Y., Sankaranarayanan, S., Chellappa, R.: MetaReg: towards domain generalization using meta-regularization. In: NeurIPS (2018)
  4. Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., Vaughan, J.W.: A theory of learning from different domains. Mach. Learn. 79, 151–175 (2010)
  5. Ben-David, S., Blitzer, J., Crammer, K., Pereira, F.: Analysis of representations for domain adaptation. In: NeurIPS (2006)
  6. Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., Erhan, D.: Domain separation networks. In: NeurIPS (2016)
  7. Carlucci, F.M., D’Innocente, A., Bucci, S., Caputo, B., Tommasi, T.: Domain generalization by solving jigsaw puzzles. In: CVPR (2019)
  8. Chang, W.G., You, T., Seo, S., Kwak, S., Han, B.: Domain-specific batch normalization for unsupervised domain adaptation. In: CVPR (2019)
  9. Daumé, H.: Frustratingly easy domain adaptation. In: ACL (2007)
  10. Donahue, J., Hoffman, J., Rodner, E., Saenko, K., Darrell, T.: Semi-supervised domain adaptation with instance constraints. In: CVPR (2013)
  11. Dou, Q., Castro, D.C., Kamnitsas, K., Glocker, B.: Domain generalization via model-agnostic learning of semantic features. In: NeurIPS (2019)
  12. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)
  13. Finn, C., Rajeswaran, A., Kakade, S.M., Levine, S.: Online meta-learning. In: ICML (2019)
  14. Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., Pontil, M.: Bilevel programming for hyperparameter optimization and meta-learning. In: ICML (2018)
  15. French, G., Mackiewicz, M., Fisher, M.: Self-ensembling for visual domain adaptation. In: ICLR (2018)
  16. Ganin, Y., et al.: Domain-adversarial training of neural networks. J. Mach. Learn. Res. (2016)
  17. Grandvalet, Y., Bengio, Y.: Semi-supervised learning by entropy minimization. In: NeurIPS (2005)
  18. Hoffman, J., et al.: CyCADA: cycle-consistent adversarial domain adaptation. In: ICML (2018)
  19. Kim, M., Sahu, P., Gholami, B., Pavlovic, V.: Unsupervised visual domain adaptation: a deep max-margin Gaussian process approach. In: CVPR (2019)
  20. Lee, C.Y., Batra, T., Baig, M.H., Ulbricht, D.: Sliced Wasserstein discrepancy for unsupervised domain adaptation. In: CVPR (2019)
  21. Li, D., Yang, Y., Song, Y.Z., Hospedales, T.: Learning to generalize: meta-learning for domain generalization. In: AAAI (2018)
  22. Li, D., Yang, Y., Song, Y.Z., Hospedales, T.M.: Deeper, broader and artier domain generalization. In: ICCV (2017)
  23. Li, D., Zhang, J., Yang, Y., Liu, C., Song, Y.Z., Hospedales, T.M.: Episodic training for domain generalization. In: ICCV (2019)
  24. Li, Z., Zhou, F., Chen, F., Li, H.: Meta-SGD: learning to learn quickly for few-shot learning. arXiv:1707.09835 (2017)
  25. Liu, H., Simonyan, K., Yang, Y.: DARTS: differentiable architecture search. In: ICLR (2019)
  26. Liu, M.Y., Tuzel, O.: Coupled generative adversarial networks. In: NeurIPS (2016)
  27. Long, M., Cao, Y., Wang, J., Jordan, M.I.: Learning transferable features with deep adaptation networks. In: ICML (2015)
  28. Long, M., Cao, Z., Wang, J., Jordan, M.I.: Conditional adversarial domain adaptation. In: NeurIPS (2018)
  29. Long, M., Zhu, H., Wang, J., Jordan, M.I.: Unsupervised domain adaptation with residual transfer networks. In: NeurIPS (2016)
  30. Long, M., Zhu, H., Wang, J., Jordan, M.I.: Deep transfer learning with joint adaptation networks. In: ICML (2017)
  31. Luo, Y., Zheng, L., Guan, T., Yu, J., Yang, Y.: Taking a closer look at domain shift: category-level adversaries for semantics consistent domain adaptation. In: CVPR (2019)
  32. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
  33. Maclaurin, D., Duvenaud, D., Adams, R.: Gradient-based hyperparameter optimization through reversible learning. In: ICML (2015)
  34. Mancini, M., Porzi, L., Rota Bulò, S., Caputo, B., Ricci, E.: Boosting domain adaptation by discovering latent domains. In: CVPR (2018)
  35. Mansour, Y., Mohri, M., Rostamizadeh, A.: Domain adaptation with multiple sources. In: NeurIPS (2009)
  36. Nichol, A., Achiam, J., Schulman, J.: On first-order meta-learning algorithms. arXiv:1803.02999 (2018)
  37. Parisotto, E., Ghosh, S., Yalamanchi, S.B., Chinnaobireddy, V., Wu, Y., Salakhutdinov, R.: Concurrent meta reinforcement learning. arXiv:1903.02710 (2019)
  38. Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., Wang, B.: Moment matching for multi-source domain adaptation. In: CVPR (2019)
  39. Rajeswaran, A., Finn, C., Kakade, S., Levine, S.: Meta-learning with implicit gradients. In: NeurIPS (2019)
  40. Ravi, S., Larochelle, H.: Optimization as a model for few-shot learning. In: ICLR (2016)
  41. Saito, K., Kim, D., Sclaroff, S., Darrell, T., Saenko, K.: Semi-supervised domain adaptation via minimax entropy. In: ICCV (2019)
  42. Saito, K., Ushiku, Y., Harada, T., Saenko, K.: Adversarial dropout regularization. In: ICLR (2018)
  43. Saito, K., Watanabe, K., Ushiku, Y., Harada, T.: Maximum classifier discrepancy for unsupervised domain adaptation. In: CVPR (2018)
  44. Schmidhuber, J.: Learning to control fast-weight memories: an alternative to dynamic recurrent networks. Neural Comput. 4, 131–139 (1992)
  45. Schmidhuber, J., Zhao, J., Wiering, M.: Shifting inductive bias with success-story algorithm, adaptive Levin search, and incremental self-improvement. Mach. Learn. 28, 105–130 (1997)
  46. Thrun, S., Pratt, L. (eds.): Learning to Learn. Kluwer Academic Publishers, Boston (1998)
  47. Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: CVPR (2017)
  48. Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., Darrell, T.: Deep domain confusion: maximizing for domain invariance. arXiv:1412.3474 (2014)
  49. Venkateswara, H., Eusebio, J., Chakraborty, S., Panchanathan, S.: Deep hashing network for unsupervised domain adaptation. In: CVPR (2017)
  50. Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shot learning. In: NeurIPS (2016)
  51. Xu, P., Gurram, P., Whipps, G., Chellappa, R.: Wasserstein distance based domain adaptation for object detection. arXiv:1909.08675 (2019)
  52. Xu, R., Chen, Z., Zuo, W., Yan, J., Lin, L.: Deep cocktail network: multi-source unsupervised domain adaptation with category shift. In: CVPR (2018)
  53. Xu, Z., van Hasselt, H.P., Silver, D.: Meta-gradient reinforcement learning. In: NeurIPS (2018)
  54. Yao, T., Pan, Y., Ngo, C.W., Li, H., Mei, T.: Semi-supervised domain adaptation with subspace learning for visual recognition. In: CVPR (2015)
  55. Zhao, H., Zhang, S., Wu, G., Moura, J.M., Costeira, J.P., Gordon, G.J.: Adversarial multiple source domain adaptation. In: NeurIPS (2018)
  56. Zheng, Z., Oh, J., Singh, S.: On learning intrinsic rewards for policy gradient methods. In: NeurIPS (2018)


Author information

Correspondence to Da Li.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 467 KB)

Appendices

A Shortest-Path Gradient Descent

Optimizing Eq. 3 naively via Algorithm 1 would be costly and ineffective. It is costly because, in the case of domain adaptation (unlike, for example, few-shot learning [12]), the inner loop requires many iterations, so back-propagating through the whole optimization path to update the initial condition \(\varTheta \) in the outer loop involves multiple higher-order gradients. For example, if the inner loop applies j iterations, we have

$$\begin{aligned} \varTheta ^{(1)}&= \varTheta ^{(0)} - \alpha \nabla _{\varTheta ^{(0)}} \mathcal {L}_{\text {uda}}(.) \\ &\;\;\vdots \\ \varTheta ^{(j)}&= \varTheta ^{(j-1)} - \alpha \nabla _{\varTheta ^{(j-1)}}\mathcal {L}_{\text {uda}}(.) \end{aligned}$$
(13)

then the outer loop will update the initial condition as

$$\begin{aligned} \varTheta ^* = \varTheta - \alpha \overbrace{\nabla _{\varTheta }\mathcal {L}_{\text {sup}}(\varTheta ^{(j)} , \mathcal {D}_{\text {val}})}^{\text {Meta Gradient}} \end{aligned}$$
(14)

where higher-order gradients are required for all of the terms \(\nabla _{\varTheta ^{(0)}}\mathcal {L}_{\text {uda}}(.), \dots ,\nabla _{\varTheta ^{(j-1)}}\mathcal {L}_{\text {uda}}(.)\) in the update of Eq. 14.
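For concreteness, this expansion is not given explicitly in the main text, but it follows from the standard chain rule applied to the nested updates of Eq. 13:

$$\begin{aligned} \nabla _{\varTheta }\mathcal {L}_{\text {sup}}(\varTheta ^{(j)} , \mathcal {D}_{\text {val}}) = \Big [\prod _{k=0}^{j-1}\big (I - \alpha \nabla ^{2}_{\varTheta ^{(k)}}\mathcal {L}_{\text {uda}}(.)\big )\Big ]\,\nabla _{\varTheta ^{(j)}}\mathcal {L}_{\text {sup}}(\varTheta ^{(j)} , \mathcal {D}_{\text {val}}), \end{aligned}$$

so each inner step contributes a Hessian(-vector) factor, and the cost of the exact meta gradient grows with the number of inner iterations j.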

One intuitive way to eliminate the higher-order gradients when computing Eq. 14 is to treat \(\nabla _{\varTheta ^{(0)}}\mathcal {L}_{\text {uda}}(.),\dots ,\nabla _{\varTheta ^{(j-1)}}\mathcal {L}_{\text {uda}}(.)\) as constants during the optimization. Eq. 14 is then equivalent to

$$\begin{aligned} \begin{aligned} \varTheta ^* = \varTheta - \alpha \overbrace{\nabla _{\varTheta ^{(j)}}\mathcal {L}_{\text {sup}}(\varTheta ^{(j)} , \mathcal {D}_\text {val})}^{\text {First-order Meta Gradient}} \end{aligned} \end{aligned}$$
(15)

However, in order to compute Eq. 15 in this way, one still needs to store the optimization path of Eq. 13 in memory and back-propagate through it to optimize \(\varTheta \), which incurs a high computational and memory load. Therefore, we propose a practical solution: an iterative meta-learning algorithm that optimizes the model parameters throughout training.

Shortest Path Optimization.  To obtain the meta gradient of Eq. 15 more efficiently, we use a scalable meta-learning method based on the shortest-path gradient (S-P.G.) [36]. Before the optimization of Eq. 13, we copy the parameters \(\varTheta \) as \(\tilde{\varTheta }^{(0)}\) and use \(\tilde{\varTheta }^{(0)}\) in the inner-level algorithm:

$$\begin{aligned} \tilde{\varTheta }^{(1)}&= \tilde{\varTheta }^{(0)} - \alpha \nabla _{\tilde{\varTheta }^{(0)}}\mathcal {L}_{\text {uda}}(\tilde{\varTheta }^{(0)}, \mathcal {D}_{\text {tr}}) \\ &\;\;\vdots \\ \tilde{\varTheta }^{(j)}&= \tilde{\varTheta }^{(j-1)} - \alpha \nabla _{\tilde{\varTheta }^{(j-1)}}\mathcal {L}_{\text {uda}}(\tilde{\varTheta }^{(j-1)}, \mathcal {D}_{\text {tr}}) \end{aligned}$$
(16)

Then, after the optimization of Eq. 16 has finished, we obtain the shortest-path gradient between \(\tilde{\varTheta }^{(j)}\) and \(\varTheta \):

$$\begin{aligned} \begin{aligned} \nabla _{\varTheta }^{\text {short}} = \varTheta - \tilde{\varTheta }^{(j)} \end{aligned} \end{aligned}$$
(17)

In contrast to Eq. 15, we use this shortest-path gradient \(\nabla _{\varTheta }^{\text {short}}\) together with the initial parameters \(\varTheta \) to compute \(\mathcal {L}_{\text {sup}}(.)\) as

$$\begin{aligned} \mathcal {L}_{\text {sup}}(\varTheta - \nabla _{\varTheta }^{\text {short}} , \mathcal {D}_{\text {val}}) \end{aligned}$$
(18)

The one-step meta update based on Eq. 18 is then

$$\begin{aligned} \varTheta ^*&= \varTheta - \alpha \nabla _{\varTheta }\mathcal {L}_{\text {sup}}(\varTheta - \nabla _{\varTheta }^{\text {short}} , \mathcal {D}_{\text {val}}) \\&=\varTheta - \alpha \nabla _{\varTheta -\nabla _{\varTheta }^{\text {short}}}\mathcal {L}_{\text {sup}}(\varTheta - \nabla _{\varTheta }^{\text {short}} , \mathcal {D}_{\text {val}}) \\&= \varTheta - \alpha \nabla _{\tilde{\varTheta }^{(j)}}\mathcal {L}_{\text {sup}}(\tilde{\varTheta }^{(j)} , \mathcal {D}_{\text {val}}) \end{aligned}$$
(19)

Effectiveness: one update of Eq. 19 corresponds exactly to an update of Eq. 15, which shows that shortest-path optimization is equivalent to first-order meta optimization. Scalability/Efficiency: the memory cost of first-order meta-learning grows linearly with the number of inner-loop update steps and is therefore constrained by total GPU memory, whereas shortest-path optimization does not need to store the optimization graph at all, which makes it scalable and efficient. Empirically, one shortest-path optimization step is about 7x faster than one first-order meta-optimization step in our setting. The overall algorithm flow is shown in Algorithm 2.
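To make the resulting update concrete, the following is a minimal PyTorch-style sketch of one online shortest-path meta-update in the spirit of Eqs. 16–19 and Algorithm 2. It is an illustrative reconstruction, not the authors' released code: the helpers da_loss and sup_loss (standing in for \(\mathcal {L}_{\text {uda}}\) and \(\mathcal {L}_{\text {sup}}\)), the data handles d_tr and d_val, and the use of a single learning rate alpha at both levels are assumptions of this sketch.

    import copy
    import torch

    def meta_da_step(model, da_loss, sup_loss, d_tr, d_val, alpha=0.01, j=1):
        # 1) Copy the current initial condition Theta as tilde-Theta^(0)  (Eq. 16).
        fast_model = copy.deepcopy(model)
        inner_opt = torch.optim.SGD(fast_model.parameters(), lr=alpha)
        for _ in range(j):                      # j inner DA updates on the copy
            inner_opt.zero_grad()
            da_loss(fast_model, d_tr).backward()
            inner_opt.step()

        # 2) Gradient of L_sup evaluated at the adapted parameters tilde-Theta^(j).
        fast_model.zero_grad()
        sup_loss(fast_model, d_val).backward()

        # 3) Meta update of the ORIGINAL parameters (Eq. 19):
        #    Theta <- Theta - alpha * grad_{tilde-Theta^(j)} L_sup(tilde-Theta^(j), D_val)
        with torch.no_grad():
            for p, fp in zip(model.parameters(), fast_model.parameters()):
                if fp.grad is not None:
                    p.sub_(alpha * fp.grad)
        return model

Note that only the copied model is optimized in the inner loop, so no optimization graph over the j inner steps needs to be retained; the meta step touches the original parameters exactly once, as in Eq. 19.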

B Additional Illustrative Schematics

To better contrast our online meta-learning domain adaptation approach with the sequential meta-learning approach, we provide a schematic illustration in Fig. 4. The main difference between the two is how the meta and DA updates are distributed: the sequential approach performs all meta updates first and then the DA updates, whereas the online approach alternates meta and DA updates throughout the whole training procedure (a minimal scheduling sketch is given after Fig. 4).

Fig. 4. Illustrative schematic of sequential and online meta domain adaptation. Left: optimization paths of the different approaches on the domain adaptation loss (shading). (Solid line) Vanilla gradient descent on a DA objective from a fixed start point. (Multi-segment line) Online meta-learning alternates meta and gradient-descent updates. (Two-segment line) Sequential meta-learning provides an alternative approximation: update the initial condition, then perform gradient descent. Right: (Top) Sequential meta-learning performs meta updates and DA updates sequentially. (Bottom) Online meta-learning alternates between meta-optimization and domain adaptation.
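To make the scheduling difference concrete, the sketch below contrasts the two schedules. The helper names (da_step for one update of the base DA algorithm, meta_step for one meta-update such as the meta_da_step sketch in Appendix A) and the reading of S as the number of base DA updates per meta update (the "update ratio" of Table 7) are assumptions of this illustration, not the authors' code.

    def train_sequential(model, da_step, meta_step, n_meta, n_da):
        """Sequential schedule: tune the initial condition first, then run plain DA."""
        for _ in range(n_meta):
            meta_step(model)          # meta-updates of the initial condition
        for _ in range(n_da):
            da_step(model)            # vanilla DA training from the tuned init

    def train_online(model, da_step, meta_step, n_da, S):
        """Online schedule: alternate meta and DA updates throughout training."""
        for t in range(n_da):
            da_step(model)
            if (t + 1) % S == 0:      # every S DA updates, one meta update
                meta_step(model)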

C Additional Experiments

Visualization of the Learned Features.  We visualize the features learned by MCD and Meta-MCD on PACS with sketch as the target domain in Fig. 5. Both MCD and Meta-MCD learn discriminative features; however, the features learned by Meta-MCD are more separable than those of vanilla MCD, which helps explain why Meta-MCD performs better than the vanilla MCD method.

Effect of Varying  S. Our online meta-learning method has iteration hyper-parameters S and J. We fix \(J=1\) throughout and analyze the effect of varying S here, using the DomainNet MSDA experiment with a ResNet-18 backbone. The results in Table 7 show that Meta-DA is rather insensitive to this hyperparameter.

Varying the Number of Source Domains in MSDA.  For multi-source DA, the performance of both Meta-DA and the baselines is expected to drop with fewer sources (and similarly for SSDA with fewer labelled target-domain points). To disentangle the impact of the number of sources on the baseline vs. Meta-DA, we compare MSDA using Meta-MCD on PACS with two vs. three sources. The results for Meta-MCD vs. vanilla MCD are 82.30% vs. 80.07% (two sources, gap 2.23%) and 87.24% vs. 84.79% (three sources, gap 2.45%). The Meta-DA margin is thus similar as the number of domains is reduced; most of the performance difference is accounted for by the impact on the base DA algorithm.

Fig. 5. t-SNE [32] visualization of learned MCD (left) and Meta-MCD (right) features on PACS (sketch as target domain). Different colors indicate different categories.

Table 7. Meta-DA is insensitive to the update ratio hyperparameter S. Results for MSDA performance on DomainNet (ResNet-18).
Table 8. Test accuracy on PACS. * our run.
Table 9. Test accuracy on Digit-Five.

Other Base DA Methods.  Besides the base DA methods evaluated in the main paper (DANN, MCD and MME), our method is applicable to any base domain adaptation method. We use the published code of JiGen (footnote 5) and M\(^3\)SDA (footnote 6) and apply our Meta-DA on top of the existing code. The results are shown in Tables 8 and 9. We can see that Meta-JiGen and Meta-M\(^3\)SDA-\(\beta \) improve over their base methods by 3.42% and 1.2% accuracy respectively, which confirms Meta-DA's generality. We excluded these from the main results because: (i) re-running JiGen's published code in our compute environment failed to replicate the published numbers; and (ii) M\(^3\)SDA is too slow as a base algorithm to run comprehensive experiments on. Nevertheless, these results provide further evidence that Meta-DA can be a useful module to plug in and improve future base DA methods as well as those evaluated here.

Table 10. Test accuracy of MCD on PACS (sketch) with different initialization.

Initialization Dependence of Domain Adaptation.  One may not think of domain adaptation as being sensitive to its initial condition, but given the lack of target-domain supervision to guide learning, different initializations can lead to significant differences in accuracy. To illustrate this, we re-ran MCD-based DA on PACS with sketch as the target domain using different initializations. From the results in Table 10, we can see that both different classic initialization heuristics and simple perturbation of a given initial condition with noise lead to significant differences in final performance. This confirms that studying methods for tuning the initialization is a valid research direction for advancing DA performance.
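As an illustration of the noise-perturbation probe, here is a minimal sketch; the noise scale sigma and the torchvision ResNet-18 backbone are assumptions of this example, not taken from the paper's experimental code.

    import torch
    from torchvision import models

    def perturb_init(model, sigma=0.01):
        """Add i.i.d. Gaussian noise to all parameters of a given initial condition."""
        with torch.no_grad():
            for p in model.parameters():
                p.add_(sigma * torch.randn_like(p))
        return model

    # e.g. perturb an ImageNet-pretrained backbone before running the base DA algorithm
    backbone = perturb_init(models.resnet18(pretrained=True), sigma=0.01)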


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Li, D., Hospedales, T. (2020). Online Meta-learning for Multi-source and Semi-supervised Domain Adaptation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12361. Springer, Cham. https://doi.org/10.1007/978-3-030-58517-4_23

  • DOI: https://doi.org/10.1007/978-3-030-58517-4_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58516-7

  • Online ISBN: 978-3-030-58517-4

