Domain adversarial neural networks for domain generalization: when it works and how to improve

Theoretically, domain adaptation is a well-researched problem. Further, this theory has been well-used in practice. In particular, we note the bound on target error given by Ben-David et al. (Mach Learn 79(1–2):151–175, 2010) and the well-known domain-aligning algorithm based on this work using Domain Adversarial Neural Networks (DANN) presented by Ganin and Lempitsky (in International conference on machine learning, pp 1180–1189). Recently, multiple variants of DANN have been proposed for the related problem of domain generalization, but without much discussion of the original motivating bound. In this paper, we investigate the validity of DANN in domain generalization from this perspective. We investigate conditions under which application of DANN makes sense and further consider DANN as a dynamic process during training. Our investigation suggests that the application of DANN to domain generalization may not be as straightforward as it seems. To address this, we design an algorithmic extension to DANN in the domain generalization case. Our experimentation validates both theory and algorithm.


Introduction
In general, in machine learning, we assume the training data for our learning algorithm is representative of the testing data. That is, we assume our training data follows the same distribution as our testing data. Of primary interest to this paper is the case where this assumption fails to hold: we consider learning in the presence of multiple domains. We formalize the multiple-domain problem of interest as the case where (at train-time) we observe k domains referred to as sources, which have distributions P_1, P_2, ..., P_k over some space X. At test-time, we are evaluated on a distinct target domain which has distribution Q over X. All of these feature distributions have (potentially) distinct labeling functions, and our goal is to learn the labeling function on the target. Typically, we assume some restriction on observation of the target domain at train-time. In the literature, a large amount of work is concerned with the problem of Domain Adaptation (DA), which assumes access to samples from Q but restricts access to the labels of these samples. More recently, there has also been active investigation of the problem of Domain Generalization (DG), which instead assumes absolutely no access to the target domain. In spite of these restrictions, in both cases, the goal is for our learning algorithm trained on sources to perform well when evaluated on the target.
One popular approach to DA is the use of a Domain Adversarial Neural Network (DANN), originally proposed by Ganin and Lempitsky [1]. Intuitively, this approach attempts to align the source and target domains by learning feature representations of both which are indiscernible by a domain discriminator trained to distinguish between the two distributions. Informally speaking, this seems like a sensible approach to DA. By accomplishing this domain alignment, the neural network should still be adept at the learned task when it is evaluated on the target domain at test-time. While DANN was originally proposed for DA, the adoption of this reasoning has motivated adaptations of this approach for DG [2][3][4][5]. In fact, very early works in DG [6] are similarly motivated by the goal of domain-agnostic feature representations.
Still, it is worth noting that the original proposal of DANN [1] was motivated by theory. In particular, Ganin and Lempitsky base their algorithm on the target-error bound given by Ben-David et al. [7,8]. Under appropriate assumptions, interpretation of the bound suggests domain alignment as achieved through DANN should improve performance on the target distribution, but importantly, it motivates alignment between the source and target. Counter to this, DANN variants for DG generally align multiple source domains because no access to target data is permitted. This shortcoming gives rise to the question of primary interest to this paper: Is there a justification for source alignment using DANN in DG? Specifically, we are concerned with a target-error bound similar to those provided by Ben-David et al. [8]. To answer this question, we appeal to a recent theoretical proposal by Albuquerque et al. [3] which uses a reference object (i.e., the set of mixture distributions of the sources) to derive a target-error bound in the domain generalization setting. Building on this framework, we provide answers to two important considerations:

1. What additional reference objects (besides sets of mixture distributions) satisfy the primary condition used to derive target-error bounds in DG?
2. How does the target-error bound behave as a dynamic quantity during the training process?

Ultimately, answering these two questions allows us to formulate a novel extension of the Domain Adversarial Neural Network. We validate experimentally that this extension improves performance and otherwise agrees with our theoretical expectations.

Domain Adversarial Neural Network (DANN)
In this section, we cover the necessary background on Domain Adversarial Neural Networks (DANN). We first present the original bound on target-error in the case of unsupervised DA [7,8] which motivates the DANN algorithm proposed by Ganin and Lempitsky [1]. Following this, we outline the key differences introduced by a DANN variant proposed by Matsuura and Harada [2]. Although this variant achieves state-of-the-art performance among DANN-style methods in DG, we point out the main concerns we have regarding the justification of this approach.

In Domain Adaptation
As mentioned, we begin with a motivating result of Ben-David et al. [8]. Intuitively, this result describes bounds on the target-error controlled, in part, by a computable measure of divergence between distributions. While we provide a more detailed exposition of the problem setup in Appendix A, we begin by listing here the key terms to familiarize the reader.

Setup
For a binary hypothesis h, a distribution P, and a labeling function f for P, we define the error E_P(h) of h on the distribution P as follows:

E_P(h) = E_{x∼P} [ |h(x) − f(x)| ].

This is our primary measure of the quality of a hypothesis when predicting on a distribution P. To measure differences in distribution, we use the H-divergence, which is an adaptation of the A-distance [9]. In particular, given two distributions P, Q over a space X and a corresponding hypothesis class H ⊆ {h | h : X → {0, 1}}, the H-divergence [8] is defined

d_H(P, Q) = 2 sup_{h∈H} | Pr_P(I_h) − Pr_Q(I_h) |,

where I_h = {x ∈ X | h(x) = 1}. Generally, it is more useful to consider the H∆H-divergence d_{H∆H}(P, Q), where Ben-David et al. [8] define the symmetric difference hypothesis class H∆H as the set of functions characteristic to disagreements between hypotheses. This special case of the H-divergence will be the measure of divergence in all considered bounds.
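In practice, a common surrogate for this divergence is a "proxy A-distance": train a domain classifier to separate samples of the two distributions and plug its error into 2(1 − 2·err). The following is a minimal numpy sketch of this idea (the helper name and the choice of a linear logistic model are our own illustrative assumptions, not the construction of [8]):

```python
import numpy as np

def proxy_a_distance(sample_p, sample_q, epochs=500, lr=0.1):
    """Estimate d_H(P, Q) as 2 * (1 - 2 * err), where err is the error of a
    linear domain classifier trained to separate the two samples
    (hypothetical helper; a crude plug-in for the divergence)."""
    X = np.vstack([sample_p, sample_q])
    X = np.hstack([X, np.ones((len(X), 1))])          # bias feature
    y = np.concatenate([np.zeros(len(sample_p)), np.ones(len(sample_q))])
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                           # logistic regression by GD
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    err = np.mean((X @ w > 0).astype(float) != y)     # 0-1 error of the classifier
    return 2.0 * (1.0 - 2.0 * err)

rng = np.random.default_rng(0)
near = proxy_a_distance(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
far = proxy_a_distance(rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2)))
print(far > near)  # well-separated domains give the larger estimate
```

With a richer classifier class in place of the linear model, the same recipe approximates the divergence that DANN-style training manipulates.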

The Motivating Bound
We can now present the result of Ben-David et al. [8] based on the triangle inequality of classification error [7,10]. This bound is the key motivation behind DANN [1]. For proof and a discussion on sample complexity, see Appendix A.
Theorem 1 (modified from Ben-David et al. [8]; Theorem 2) Let X be a space and H be a class of hypotheses corresponding to this space. Suppose P and Q are distributions over X. Then for any h ∈ H,

E_Q(h) ≤ λ + E_P(h) + (1/2) d_{H∆H}(P, Q),

with λ the error of an ideal joint hypothesis for Q, P; that is, λ = min_{h'∈H} [ E_Q(h') + E_P(h') ].
This statement provides an upper bound on the target-error. Thus, minimizing this upper bound is a good proxy for the minimization of the target-error itself. The first term λ is a property of the dataset and hypothesis class which we typically assume to be small, but should not be ignored. As Ben-David et al. [8] note, this may be interpreted as a realizability assumption which requires the existence of some hypothesis in our search space that does well on both distributions (simultaneously). If this hypothesis does not exist, we cannot hope to do adaptation by minimizing the source-risk [11]. Notably, λ also plays an important role in algorithms like DANN which modify the distributions over which they learn, since these algorithms implicitly change λ. We discuss this issue in detail in Section 2.3.
The latter terms are more explicitly controllable. The source-error E_P(h) can be minimized as usual by Empirical Risk Minimization (ERM). The divergence can be empirically computed using another result of Ben-David et al. [8]. While we give this result in the Appendix (Proposition 7 and Proposition 8, respectively), previous interpretation by Ganin and Lempitsky [1] suggests to minimize the divergence by learning indiscernible representations of the distributions, i.e., aligning the domains. As we describe in the following, this may be accomplished by maximizing the errors of a domain discriminator trained to distinguish the distributions.

The DANN Algorithm
Ganin and Lempitsky [1] separate the neural network used to learn the task into a feature extractor network r_θ and a task-specific network c_σ, parameterized respectively by θ and σ. A binary domain discriminator d_µ outputting probabilities is trained to distinguish between the source and target distribution based on the representation learned by r_θ. Meanwhile, r_θ is trained to learn a representation that is not only useful for the task at hand, but also adept at "fooling" the domain discriminator (i.e., maximizing its errors). In detail, given an empirical sample P̂ = (x_i)_{i=1}^n from the source distribution P and a sample Q̂ = (x'_i)_{i=1}^n from the target distribution Q, the domain adversarial training objective is built on the discriminator loss

L_D(θ, µ) = −(1/n) Σ_{x∈P̂} log(1 − d_µ(r_θ(x))) − (1/n) Σ_{x'∈Q̂} log d_µ(r_θ(x')).

By this specification, d_µ ∘ r_θ(x) is meant to estimate the probability that x was drawn from Q, and L_D represents the binary cross-entropy loss for a domain discriminator trained to distinguish P̂ and Q̂. Combining this with a task-specific loss L_T, we get the formulation given by Ganin and Lempitsky [1]:

min_{θ,σ} max_µ [ L_T(θ, σ) − λ L_D(θ, µ) ],    (6)

where λ (in this context) is a trade-off parameter. The above is generally implemented by simultaneous gradient descent. We remark a solution to this optimization problem is easily approximated by incorporating a Gradient Reversal Layer (GRL) between r_θ and d_µ [1].
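The gradient-reversal trick can be illustrated end-to-end in plain numpy. In this sketch (our own toy setup, not the architecture of [1]: a one-layer "extractor" that scales features and a linear discriminator), we compute L_D, backpropagate by hand, and check that a descent step taken through a reversed gradient increases the discriminator's loss, i.e., fools d_µ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def domain_bce(p_hat, from_q):
    # L_D: binary cross-entropy; p_hat estimates Pr[x was drawn from Q]
    eps = 1e-12
    return -np.mean(from_q * np.log(p_hat + eps)
                    + (1.0 - from_q) * np.log(1.0 - p_hat + eps))

rng = np.random.default_rng(1)
theta = rng.normal(size=3)              # feature extractor r_theta (elementwise scale)
mu = rng.normal(size=3)                 # linear discriminator d_mu
x = rng.normal(size=(8, 3))
from_q = (np.arange(8) % 2).astype(float)   # domain labels: 0 = source, 1 = target

p_hat = sigmoid((x * theta) @ mu)       # d_mu(r_theta(x))
loss0 = domain_bce(p_hat, from_q)

# Hand-computed gradient of L_D with respect to theta (chain rule).
grad_theta = ((p_hat - from_q) @ (x * mu)) / len(x)

# Gradient Reversal Layer: identity on the forward pass, negated gradient on
# the backward pass, so a "descent" step on theta actually ascends L_D.
theta_new = theta - 0.1 * (-grad_theta)
loss1 = domain_bce(sigmoid((x * theta_new) @ mu), from_q)
```

In a full implementation the GRL sits between r_θ and d_µ so that a single backward pass updates µ to decrease L_D while θ moves to increase it.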

In Domain Generalization
Recent adaptations of the above formulation have been proposed in the context of DG. Here, we focus on the proposal of Matsuura and Harada [2] since their empirical results are among the most competitive DG methods to date. In DG, since no access to Q is given, one cannot actually compute L_D as described above: it assumes at least unlabeled examples from Q. Given this, Matsuura and Harada [2] propose a modification which operates on k source samples,

L_SD(θ, µ) = −(1/(kn)) Σ_{j=1}^k Σ_{x∈P̂_j} Σ_{c=1}^k 1[c = j] log [d_µ(r_θ(x))]_c,

where 1[•] is the indicator function. Now, d_µ is a multi-class domain discriminator trained to distinguish between sources; it outputs the estimated probabilities that x is drawn from each source. Hence, L_SD is essentially a multi-class cross-entropy loss. Given the source samples P̂_j = (x_i^j)_{i=1}^n for all j ∈ [k], drawn respectively from the source distributions P_1, P_2, ..., P_k, we substitute this into Eq. (6):

min_{θ,σ} max_µ [ L_T(θ, σ) − λ L_SD(θ, µ) ],    (8)

which gives a domain adversarial training objective aimed at aligning the sources (while also maintaining good task performance). Hereon, we often refer to this as a source-source DANN, rather than a source-target DANN as was given in Eq. (6). On the surface, there seems to be no justification for the source-source DANN. If we recall the interpretation of Theorem 1, there is one key difference: rather than aligning the source and target domains P and Q as suggested by the divergence term in Theorem 1, the objective in Eq. (8) aligns source domains P_i and P_j for all (i, j) ∈ [k]², whose divergences do not appear in the upper bound. Thus, the motivating argument is lost in this new formulation.
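Concretely, L_SD is a k-way cross-entropy. A small numpy sketch (our own notation; `source_domain_loss` is a hypothetical helper) makes the adversarial target explicit: a perfectly fooled discriminator is pinned at the uniform-prediction loss log k:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)      # stabilized softmax
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def source_domain_loss(logits, domain_ids, k):
    """L_SD as a k-way cross-entropy: logits[i, j] scores 'x_i came from
    source j' (hypothetical helper; names are ours)."""
    probs = softmax(logits)
    onehot = np.eye(k)[domain_ids]            # 1[x_i drawn from source j]
    return -np.mean(np.sum(onehot * np.log(probs + 1e-12), axis=1))

rng = np.random.default_rng(0)
k, n = 3, 12
domain_ids = rng.integers(0, k, size=n)
uninformative = np.zeros((n, k))              # discriminator fully "fooled"
confident = 10.0 * np.eye(k)[domain_ids]      # discriminator identifies every source
print(source_domain_loss(uninformative, domain_ids, k))  # log(3) ≈ 1.0986
```

The feature extractor's adversarial goal in Eq. (8) is to push L_SD from the "confident" regime toward this log k ceiling.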
If we look to recent literature, preliminary theoretical work to motivate this modification of DANN does exist [3]. We start from this work in the derivation of our own results.

A Gap Between Theory and Algorithm
To be totally precise, the algorithm given above does not actually minimize d_{H∆H}(P_i, P_j) for any i, j. As we have noted, the idea to "align domains" through a common feature representation is simply an interpretation following the convention of Ganin and Lempitsky [1]. If the class from which we select d_µ is G and the class from which we select r_θ is F, the algorithm actually approximates minimization of d_{G∆G}(P_i ∘ r_θ^{−1}, P_j ∘ r_θ^{−1}) with respect to θ. Here, the notation P_i ∘ r_θ^{−1} denotes the pushforward of P_i by r_θ, which is (intuitively) the image of P_i in the feature space. While this technicality will be unimportant for our discussions in the remainder of this text, it can potentially have significant negative ramifications. So, we discuss it in some detail here.
In particular, this gap between theory and algorithm implies that learning indiscernible representations of the source and target distributions while also minimizing the source error is not always sufficient for reducing the bound in Theorem 1. The problem arises because the ideal joint error (which is usually assumed small in the original problem) does not always remain small after feature transformation as in DANN. That is, while the ideal joint error between P_i and P_j may be small, this may not be true of P_i ∘ r_θ^{−1} and P_j ∘ r_θ^{−1}. This fact was recently observed independently by Johansson et al. [12] and Zhao et al. [13]. Johansson et al. point out that learning a particular feature representation will always increase the ideal joint error (as compared to the original problem) whenever this feature representation is not invertible. Zhao et al. complement this result by providing a lower bound on target error in case the marginal label distributions have large deviation. In particular, the Jensen-Shannon (JS) divergence between the label distributions should be at least as large as the JS divergence between the source and target feature distributions for the lower bound to hold. If it is, the lower bound shows simultaneous minimization of the source-error and the H∆H-divergence actually increases target-error.
In practice, as far as we are aware, it is not clear to what extent non-invertible feature representations increase the ideal joint error. Further, it is not easy to test whether the JS-divergence of the label distributions is larger than the JS-divergence of the source and target feature distributions. For this reason, in this work, we will simply assume the ideal joint error remains small after feature transformation; i.e., we do not explicitly consider any settings in which there are negative ramifications of the known gap between theory and algorithm for DANN. If these issues are of significant concern for a particular application (i.e., if the marginal label shift is known to be large), a recent modification of DANN which uses importance weighting has been proposed by Tachet des Combes, Zhao, et al. [14]. This modification aims to correct the shortcomings of standard DANN in case of label-shift. While we do not explicitly experiment with this method, our theoretical discussion and algorithmic extension still apply in context of this variation on DANN.

Understanding Domain Alignment in Domain Generalization
Our discussion of source-source DANN for DG begins with the motivating target-error bound proposed by Albuquerque et al. [3]. Originally, given a set of source distributions {P_i}, the bound uses the set of mixture distributions having these sources as components; we refer to this set as M. Below, we consider a more general adaptation of this result. Although the proof strategy is largely similar, we do provide proof for this more general re-statement.
Proposition 2 (adapted from Albuquerque et al. [3]; Proposition 2) Let X be a space and let H be a class of hypotheses corresponding to this space. Let Q and the collection {P_i}_{i=1}^k be distributions over X and let {ϕ_i}_{i=1}^k be a collection of non-negative coefficients with Σ_i ϕ_i = 1. Let the object O be a set of distributions such that for every S ∈ O the following holds:

Σ_i ϕ_i d_{H∆H}(P_i, S) ≤ max_{i,j} d_{H∆H}(P_i, P_j).    (9)
Then, for any h ∈ H,

E_Q(h) ≤ λ_ϕ + Σ_i ϕ_i E_{P_i}(h) + (1/2) min_{S∈O} d_{H∆H}(S, Q) + (1/2) max_{i,j} d_{H∆H}(P_i, P_j),    (10)

where λ_ϕ = Σ_i ϕ_i λ_i and each λ_i is the error of an ideal joint hypothesis for Q and P_i.
Proof Let h ∈ H. For each P_i, apply Theorem 1 and multiply the inequality by ϕ_i to achieve

ϕ_i E_Q(h) ≤ ϕ_i λ_i + ϕ_i E_{P_i}(h) + (ϕ_i/2) d_{H∆H}(Q, P_i).

Taking λ_ϕ = Σ_i ϕ_i λ_i, we may sum over all k of these inequalities as below:

(Σ_i ϕ_i) E_Q(h) ≤ λ_ϕ + Σ_i ϕ_i E_{P_i}(h) + (1/2) Σ_i ϕ_i d_{H∆H}(Q, P_i).

Since Σ_i ϕ_i = 1, we can rewrite this as

E_Q(h) ≤ λ_ϕ + Σ_i ϕ_i E_{P_i}(h) + (1/2) Σ_i ϕ_i d_{H∆H}(Q, P_i).

Now, let S* ∈ O be a minimizer of d_{H∆H}(Q, ·) over O. For each P_i, the following is true because the H-divergence abides by the triangle inequality:

d_{H∆H}(Q, P_i) ≤ d_{H∆H}(Q, S*) + d_{H∆H}(S*, P_i).

Since this is true for each P_i, we may write

Σ_i ϕ_i d_{H∆H}(Q, P_i) ≤ d_{H∆H}(Q, S*) + Σ_i ϕ_i d_{H∆H}(S*, P_i) ≤ d_{H∆H}(Q, S*) + max_{i,j} d_{H∆H}(P_i, P_j),

where the last inequality is due to the choice S* ∈ O. Recalling S* is also a minimizer of d_{H∆H}(Q, ·) yields the result.
As suggested by Albuquerque et al. [3], interpreting this result provides a reasonable motivation for the use of source-source DANN in DG. The first term is a convex combination of ideal joint errors between each source and the target. As before, we assume this is small and remains small after feature transformation by r_θ when we apply DANN; i.e., recall Section 2.3. Later, we discuss some differences between the ideal-error terms we give in our bound and the ideal-error terms in the original bound of Albuquerque et al. [3]. The second term is a convex combination of the source errors. ERM on a mixture of the sources is appropriate for controlling this term. In both of the previous convex sums, the coefficients are assumed to be fixed, but arbitrary, replicating a natural data generation process where amounts of data from each source are not assumed. Ben-David et al. [8] model data arising from multiple sources in this way and provide generalization bounds as well. For the third term, when O is fixed as the set of mixtures M, Albuquerque et al. [3] suggest this term demonstrates the importance of diverse source distributions, so that the unseen target Q might be "near" M. We extend this discussion later, showing how this term can change dynamically throughout the training process. The final term is a maximum over the source-source divergences. Application of the interpretation by Ganin and Lempitsky [1], to align domains through representation learning, motivates the suggestion of Matsuura and Harada [2] to maximize the errors of a multi-class (source-source) domain discriminator. A more precise application might be to train all combinations of binary domain discriminators, but as Albuquerque et al. [3] point out, this leads to a polynomial number of discriminators. As a practical surrogate, we opt to employ the best empirical strategy to date [2]. Another option might be to instead use a collection of one-versus-all classifiers in place of a multi-class classifier [3]. Note, neither method precisely minimizes Eq. (10), so we treat this as an implementation choice.

A Remark on Differences
As mentioned briefly, a reader familiar with the original statement of Albuquerque et al. [3] will notice two differences: (1) rather than limiting consideration to the set of mixtures M, this statement holds for all sets O which satisfy Condition (9); and (2) λ_ϕ is a different quantity for the ideal joint error between Q and {P_i}.
On the latter point, rather than λ_ϕ, Albuquerque et al. [3] use the following definition of the ideal joint error, given by Zhao et al. [15]:

λ* = min_{h∈H} [ E_Q(h) + E_{S*}(h) ],

where S* ∈ M is the mixture distribution closest to Q. As the original statement of Albuquerque et al. [3] defines O = M, this definition is a perfectly reasonable choice. But, since our re-statement considers more general objects O, we have removed this dependence on M. As is visible in the proof, λ_ϕ does remove this dependence. In general, λ* and λ_ϕ are incomparable. If one attempts to compare them, it will become evident that some assumptions must be made, e.g., on the relationship between the {ϕ_i}_i (which are arbitrary but fixed) and the coefficients used to form the mixture for S* (which are dependent on Q). One reason to prefer λ_ϕ is that it does not require a single hypothesis to have low error on all sources simultaneously. Ben-David et al. [8] provide a larger discussion on the benefits of various approaches when combining data from multiple sources. The former difference is of primary interest in this paper. Condition (9) may be considered to be the key fact about M which allows the derivation of Eq. (10). By identifying this, we open the possibility of considering more general objects satisfying Condition (9). In the following, we demonstrate the existence of such objects O and discuss the benefit they add.

Beyond Mixture Distributions
Consideration of general objects O which satisfy Condition (9) is only useful if such objects exist (besides M). The following example provides proof. See Figure 1 for an illustrative picture. (Figure 1, caption: if we take an interval hypothesis and slide it back and forth, we can never perfectly discern P_1 or P_2 from S, and therefore we will never achieve the maximum divergence 2.)
Example 1 Let X be the real line (−∞, ∞) and let H be the set of hypotheses {h_a}_{a∈R}, where h_a is characteristic to the ray (−∞, a]. Then, H∆H is the set of hypotheses {h_{a,b}}_{(a,b)∈R²}, where h_{a,b} is characteristic to the interval [a, b]. Let P_1 be the uniform distribution U(0, 2), let P_2 be U(2, 4), and let S be U(1, 3). Then S is not a mixture distribution of the components P_1 and P_2, but Σ_i ϕ_i d_{H∆H}(P_i, S) = 1 < 2 = max_{i,j} d_{H∆H}(P_i, P_j) for all non-negative coefficients {ϕ_i}_i which sum to 1.
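Example 1 can be checked numerically. The sketch below (our own verification; the grid search over interval endpoints is a crude but sufficient approximation here) computes d_{H∆H} for interval hypotheses on the three uniform distributions:

```python
import numpy as np

def uniform_mass(l, u, a, b):
    """Mass that U(l, u) places on the interval [a, b]."""
    return max(0.0, min(b, u) - max(a, l)) / (u - l)

def d_hdh(p, q, lo=-1.0, hi=5.0, steps=121):
    """H-delta-H divergence for interval hypotheses:
    2 * sup over [a, b] of |P[a, b] - Q[a, b]|, via grid search."""
    grid = np.linspace(lo, hi, steps)
    best = 0.0
    for a in grid:
        for b in grid:
            if b > a:
                best = max(best, abs(uniform_mass(*p, a, b) - uniform_mass(*q, a, b)))
    return 2.0 * best

P1, P2, S = (0, 2), (2, 4), (1, 3)
print(d_hdh(P1, P2))               # 2.0: the interval [0, 2] separates P1 from P2
print(d_hdh(P1, S), d_hdh(P2, S))  # 1.0 each
```

Since d(P_1, S) = d(P_2, S) = 1, every convex combination equals 1 < 2, so S satisfies Condition (9) even though it is not a mixture of P_1 and P_2.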
In the context of this example, we might consider the object O = M ∪ {S} to quickly see that more than just M can satisfy Condition (9). If S is a unique minimizer of the third term in Eq. (10) and does not increase the final term, then using O in place of M actually produces a strictly tighter bound. Later we more generally expand on this and other benefits of considering O ≠ M. Still, one simple example cannot fully justify the existence of useful O ≠ M. For a more general perspective, it is useful to think of things geometrically. Albuquerque et al. [3] often refer to M as the convex hull of the sources. In this same vein, we point out that d_{H∆H} is a pseudometric and therefore shares most of the nice properties required of metrics used in the vast mathematical literature on metric spaces. Viewing a metric space as a topological space, it is common to think of open balls as the "fundamental unit" or "basis" of the topology. Analogously, we consider the closed ball B_ρ(P) = {S | d_{H∆H}(P, S) ≤ ρ} of radius ρ centered at a distribution P. Using this object, the following result provides some useful information on the types of objects O which satisfy Condition (9). See Figure 2 for a helpful visualization of our results.
Proposition 3 Let X be a space and let H be a class of hypotheses corresponding to this space. Let the collection {P_i}_{i=1}^k be distributions over X and let {ϕ_i}_{i=1}^k be a collection of non-negative coefficients with Σ_i ϕ_i = 1. Now, set ρ = max_{u,v} d_{H∆H}(P_u, P_v). We show three results:
1. M ⊆ ∩_i B_ρ(P_i).
2. Every S ∈ ∩_i B_ρ(P_i) satisfies Condition (9).
3. Every S ∉ ∪_i B_ρ(P_i) violates Condition (9).
Proof We begin with a proof of (1). Let S ∈ M arbitrarily. The result follows by first observing, for all P_i,

d_{H∆H}(P_i, S) ≤ max_j d_{H∆H}(P_i, P_j) ≤ ρ.

The first inequality follows by a property of M shown by Albuquerque et al. [3]; for reference, we provide proof of this in Lemma 2 in the Appendix. The second inequality follows because ρ is defined as the largest source-source divergence. Now, if this is true for all P_i, then S is by definition contained in every closed ρ-ball in the intersection ∩_i B_ρ(P_i). If an element is contained in every component set of an intersection, then it is contained in the intersection. And, we have shown (1). Next, we show (2). By definition of B_ρ(P_i), if S ∈ ∩_i B_ρ(P_i), then d_{H∆H}(P_i, S) ≤ ρ for every i, and so

Σ_i ϕ_i d_{H∆H}(P_i, S) ≤ ρ Σ_i ϕ_i = ρ.

We again recall that ρ = max_{u,v} d_{H∆H}(P_u, P_v). Hence, we have shown (2). Finally, we demonstrate (3). To see this, note that if S ∉ ∪_i B_ρ(P_i), then by definition, for all i, we have that d_{H∆H}(P_i, S) > ρ. We follow the chain of inequalities below to arrive at our result:

Σ_i ϕ_i d_{H∆H}(P_i, S) > Σ_i ϕ_i ρ = ρ = max_{u,v} d_{H∆H}(P_u, P_v).

Hence, we have shown (3) and are done.
Statements 1 and 2 in conjunction show there are intuitive objects O, i.e., ∩_i B_ρ(P_i), which both contain M and satisfy Condition (9). Statement 3 provides an intuitive boundary for O. Thus, comparison of O to the union and intersection of closed balls, respectively, provides necessary and sufficient conditions for satisfying Condition (9).

The Benefits of Looking Beyond Mixtures
While the above discussion is useful in its own right, a more careful discussion of practical ramifications is needed.

Computationally Tighter Bounds
First, we point out that different objects O can lead to computationally tighter bounds in Eq. (10). For a concrete example, we prove below that ∩_i B_ρ(P_i) can lead to tighter bounds than M. The proof follows a similar logic as presented following Example 1. In fact, for Example 1, it is true that ∩_i B_ρ(P_i) contains M ∪ {S}, and thus, may reap the discussed benefit.
Proposition 4 Let X be a space and let H be a class of hypotheses corresponding to this space. Let Q and the collection {P_i}_{i=1}^k be distributions over X. Let P* be the distribution in ∩_i B_ρ(P_i) closest to Q and let S* ∈ M be the mixture distribution closest to Q. Then,

d_{H∆H}(P*, Q) ≤ d_{H∆H}(S*, Q).

Now, further, suppose the only solutions to

argmin_{P ∈ ∩_i B_ρ(P_i)} d_{H∆H}(P, Q)

are contained in ∩_i B_ρ(P_i) \ M. Then, we have

d_{H∆H}(P*, Q) < d_{H∆H}(S*, Q).

Proof To see the first claim, note by Proposition 3, M ⊆ ∩_i B_ρ(P_i). So it is clear that

min_{P ∈ ∩_i B_ρ(P_i)} d_{H∆H}(P, Q) ≤ min_{S ∈ M} d_{H∆H}(S, Q).

Since P* and S* are arguments minimizing the left- and right-hand side, respectively, we are done. Now, we show the second claim. The first claim holds regardless of our additional assumption, so we need only show that the inequality is strict. But this is clear because if we assume the contrary, that the two quantities are equal, the implication is that a solution to the above minimization problem is contained in M, a contradiction. Therefore, we have our result.
Now, for DANN, our hypothesis will usually be a neural network. In this case, the benefit of tightness may be considered irrelevant because the large VC dimension of neural networks [16] is the dominant term in any bound on error (i.e., using the PAC framework). Still, this conversation is not complete without considering the recent success of PAC-Bayesian formulations (e.g., see Dziugaite et al. [17]) which provide much tighter bounds when the hypothesis is a stochastic neural network. In Appendix A, we discuss a PAC-Bayesian distribution pseudometric [18] analogous to d_{H∆H}. Because this pseudometric shares the important properties of d_{H∆H}, these results are easily re-framed in this more modern formulation as well, where tightness may be a primary concern.

Intuitive Analysis
Second, we point out that a particular object O can be easier to analyze. This fact will become evident as we develop an algorithmic extension to DANN for DG. Ultimately, we find that the novel object ∩_i B_ρ(P_i) may be manipulated to provide key motivating insights in algorithm design.

The H∆H-Divergence as a Dynamic Quantity
As mentioned, Albuquerque et al. [3] interpret Proposition 2 as showing the necessity of diverse source distributions to control the third term min_{S∈O} d_{H∆H}(S, Q) when O = M. Logically, when distributions are heterogeneous, M presumably contains more elements, and so, the unseen target is more likely to be "close." When O = ∩_i B_ρ(P_i), this is easier to see because the size of O is directly dependent on the maximum divergence between the sources (by the definition of ρ). In particular, reducing the maximum divergence and re-computing O could lead to removal of a unique minimizer for min_{S∈O} d_{H∆H}(S, Q). In the context of the DANN algorithm, this is worrisome. Namely, during training, the point of using DANN is to effectively reduce the maximum divergence between sources, and we expect this divergence to be decreasing as the feature representations of the source distributions are modified. In fact, under mild assumptions, we can formally show that DANN acts like a contraction mapping, and therefore, can only decrease the pairwise source-divergences. So, it is possible min_{S∈O} d_{H∆H}(S, Q) increases as the changing object O shrinks during training. Below we consider gradient descent on a smooth proxy of the H∆H-divergence in the simple, two-distribution case. The map r_θ acts as the feature extractor affected by DANN.
Proposition 5 Let D be a space of empirical samples over X. Let r_θ : X → X be a deterministic representation function parameterized by the real vector θ ∈ R^m, and denote by r_θ(P̂) the application of r_θ to every point of P̂ ∈ D. Let ℓ(θ) be a smooth, non-negative proxy of the H∆H-divergence between two samples r_θ(P̂_1) and r_θ(P̂_2), and suppose it is differentiable with K-Lipschitz gradients. Further, suppose θ* is the unique local minimum of ℓ on a bounded subset Ω ⊂ R^m. Then for θ ∈ Ω such that θ ≠ θ*, the function τ : Ω → R^m defined τ(θ) = θ − γ∇_θ ℓ(θ) has the property

ℓ(τ(θ)) ≤ β_θ ℓ(θ)    (28)

for some constant β_θ dependent on θ. In particular, for all θ ∈ Ω, there is γ so that 0 < β_θ < 1.
Proof We proceed by first showing an important inequality for functions with the assumed properties, in particular, using a derivation presented by Wright [19]. Note first, by Taylor's Theorem, for vectors u, v ∈ R^m, we have

ℓ(u + v) = ℓ(u) + ∫_0^1 ∇ℓ(u + tv)^T v dt
= ℓ(u) + ∇ℓ(u)^T v + ∫_0^1 [∇ℓ(u + tv) − ∇ℓ(u)]^T v dt
≤ ℓ(u) + ∇ℓ(u)^T v + ∫_0^1 ||∇ℓ(u + tv) − ∇ℓ(u)|| ||v|| dt
≤ ℓ(u) + ∇ℓ(u)^T v + (K/2) ||v||²,

where the first line, as mentioned, is by Taylor's Theorem, the second is by addition and subtraction of ∇ℓ(u)^T v, the third is because an inner product is never larger than the product of the norms, and the fourth is by the Lipschitz property assumed on the gradients of ℓ. With this inequality, we let θ ∈ Ω with θ ≠ θ*. Taking u = θ and v = −γ∇ℓ(θ) achieves

ℓ(τ(θ)) ≤ ℓ(θ) − γ(1 − Kγ/2) ||∇ℓ(θ)||².    (30)

Next, we note that for θ ≠ θ* we have 0 ≤ ℓ(θ*) < ℓ(θ) because θ* was assumed to be the unique local minimum on Ω. Then, we may set

β_θ = 1 − γ(1 − Kγ/2) ||∇ℓ(θ)||² / ℓ(θ),

which, in combination with Eq. (30), yields our first desired result (Eq. (28)).
Next, we show that for all θ ≠ θ*, we can pick γ which forces 0 < β_θ < 1. By the definition of β_θ, it is sufficient to show

0 < γ(1 − Kγ/2) ||∇ℓ(θ)||² / ℓ(θ) < 1.    (32)

Next, we point out that there is some constant M > 0 such that ||∇ℓ(θ)|| ≤ KM. This follows by

||∇ℓ(θ)|| = ||∇ℓ(θ) − ∇ℓ(θ*)|| ≤ K ||θ − θ*|| ≤ KM,    (33)

where the equality holds because θ* is a local minimum, the first inequality holds by the assumed Lipschitz property, and the second inequality holds because Ω was assumed to be bounded. Without loss of generality, suppose M ≥ 1 (Eq. (33) holds regardless). Then our problem reduces further. In particular, since ||∇ℓ(θ)||² ≤ K²M², it suffices to pick γ such that

0 < γ(1 − Kγ/2) < ℓ(θ) / (K²M²),

since satisfying this forces both inequalities of Eq. (32); note ∇ℓ(θ) ≠ 0 for θ ≠ θ* because θ* is the unique local minimum of ℓ on Ω. First, clearly, the left inequality holds when 0 < γ < 2/K, so this immediately restricts our choice of γ. For the right inequality, note γ(1 − Kγ/2) ≤ γ, so it further suffices to take γ < ℓ(θ)/(K²M²). Any γ with 0 < γ < min{2/K, ℓ(θ)/(K²M²)} therefore yields 0 < β_θ < 1, and we have our result.
A key takeaway from the above is the presence of competing objectives during training. These objectives require balance. While DANN reduces the source-divergences to account for the final term in Eq. (10), we should also (somehow) consider the diversity of our sources throughout training to account for the affected term min_{S∈O} d_{H∆H}(S, Q). Another insight the reader gains (i.e., from reading the proof) is that one admissible bound on γ is constant (2/K) while the other goes to 0 as ℓ(θ) → 0. An interpretation of these bounds suggests the practical importance of an annealing schedule on γ during DANN training. In our own experiments, we anneal γ by a constant factor (i.e., step decay).
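For concreteness, the step-decay annealing of γ can be written in a few lines (the constants below are illustrative placeholders, not our experimental settings):

```python
def step_decay(gamma0, factor, every, epoch):
    """Anneal gamma by a constant factor every `every` epochs (step decay)."""
    return gamma0 * factor ** (epoch // every)

# Example: start at 0.1 and halve every 10 epochs.
schedule = [step_decay(0.1, 0.5, 10, e) for e in range(30)]
print(schedule[0], schedule[10], schedule[20])  # 0.1 0.05 0.025
```

The shrinking step size mirrors the proof's shrinking admissible range for γ as the divergence proxy ℓ(θ) decreases.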

An Algorithmic Extension to DANN
Motivated by the argument presented in Section 3.3, this section devises an extension to DANN. While DANN acts to align domains, as noted, its success in the context of domain generalization is also dependent on the heterogeneity of the source distributions throughout the training process. Therefore, in an attempt to balance these objectives, we propose an addition to source-source DANN which acts to diversify the sources throughout the training.

Theoretical Motivation
We recall the intersection of closed balls O = ∩_i B_ρ(P_i); this is the main object of interest as it controls the size of the divergences in the upper bound of Proposition 2. More specifically, we are concerned with the quantity min_{P ∈ ∩_i B_ρ(P_i)} d_{H∆H}(P, Q). Intuitively, if we want to reduce this quantity, we should find some means to increase ρ. One might propose to accomplish this by modifying our source distributions, e.g., through data augmentation, but clearly, modifying our source distributions in an uncontrolled manner is not wise. This ignores the structure of the space of distributions under consideration and whichever distribution governs our sampling from this space, information that is, in part, given by our sample of sources itself. In this sense, while increasing ρ, we should preserve the structure of ∩_i B_ρ(P_i) as much as possible. Proposition 6 identifies conditions we must satisfy if we wish to increase ρ and modify our source distributions in a way that is guaranteed to reduce the third term of the upper bound in Eq. (10).
Proposition 6 Let X be a space and let H be a class of hypotheses corresponding to this space. Let D be the space of distributions over X and let the collections {P_i}_{i=1}^k and {R_i}_{i=1}^k be contained in D. Now, consider the collection of mixture distributions {S_i}_i defined so that for each set A, Pr_{S_i}(A) = α Pr_{P_i}(A) + β Pr_{R_i}(A) with α + β = 1. Further, set ρ = max_{i,j} d_{H∆H}(P_i, P_j) and ρ* = max_{i,j} d_{H∆H}(S_i, S_j). If ρ* − β max_i d_{H∆H}(R_i, P_i) ≥ ρ, then ∩_i B_ρ(P_i) ⊆ ∩_i B_{ρ*}(S_i).
Proof Let Q ∈ ∩_i B_ρ(P_i) be arbitrary. Then, by definition, for all i, we have that d_{H∆H}(P_i, Q) ≤ ρ. Then, for all i, we have

d_{H∆H}(S_i, Q) ≤ α d_{H∆H}(P_i, Q) + β d_{H∆H}(R_i, Q)
≤ αρ + β d_{H∆H}(R_i, Q)
≤ αρ + β (d_{H∆H}(R_i, P_i) + d_{H∆H}(P_i, Q))
≤ αρ + β d_{H∆H}(R_i, P_i) + βρ
= ρ + β d_{H∆H}(R_i, P_i)
≤ ρ*,

where the first inequality follows by Lemma 2, the second inequality follows because Q ∈ ∩_i B_ρ(P_i) so the divergence is bounded by ρ for all i, the third inequality follows because, in general, the H-divergence abides by the triangle inequality, the fourth inequality follows again because Q ∈ ∩_i B_ρ(P_i), and the last inequality follows because we have assumed ρ* − β max_i d_{H∆H}(R_i, P_i) ≥ ρ. Now, this is true for all i, so by definition of ∩_i B_{ρ*}(S_i), we have that Q ∈ ∩_i B_{ρ*}(S_i).
Since Q was an arbitrary element of ∩_i B_ρ(P_i), we have shown ∩_i B_ρ(P_i) ⊆ ∩_i B_{ρ*}(S_i), and we have our result.
The above statement suggests that if we want to diversify our training distributions, we should train on a collection of modified source distributions {S_i}_i. The modified distributions are mixture distributions whose components are pairs of our original source distributions {P_i}_i and new auxiliary distributions {R_i}_i. The choice of {R_i}_i is constrained to guarantee the new intersection ∩_i B_{ρ*}(S_i) (with modified sources) contains the original intersection ∩_i B_ρ(P_i). Ultimately, this means we can guarantee min_{S∈∩_i B_{ρ*}(S_i)} d_{H∆H}(S, Q) ≤ min_{P∈∩_i B_ρ(P_i)} d_{H∆H}(P, Q).
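In practice, drawing a batch from the mixture S_i = α P_i + β R_i (with β = 1 − α) amounts to filling each slot from the original source with probability α and from the auxiliary distribution otherwise. A minimal sketch, where the function name and the list-based batch representation are our own illustrative choices:

```python
import random

def sample_mixture(source_batch, aux_batch, alpha, seed=None):
    """Draw a batch from S_i = alpha * P_i + (1 - alpha) * R_i: each slot is
    filled from the original source with probability alpha, otherwise from
    the auxiliary distribution."""
    rng = random.Random(seed)
    return [s if rng.random() < alpha else r
            for s, r in zip(source_batch, aux_batch)]
```

Setting alpha = 1 recovers the original source sample; setting alpha = 0 uses only auxiliary examples.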

Algorithm
Empirically speaking, our modified source samples {Ŝ_i}_i will be a mix of examples from the original sources {P_i}_i and the auxiliary distributions {R_i}_i, drawn from each proportionally to the mixture weights α and β. We plan to generate samples from the auxiliary distributions {R_i}_i, and our interpretation of Proposition 6 suggests we should do so subject to the constraint below:

max_{i,j} d_{H∆H}(S_i, S_j) − β max_i d_{H∆H}(R_i, P_i) ≥ ρ.

Because ρ is a property of our original dataset, it is independent of the distributions {R_i}_i. This suggests that we should generate {R̂_i}_i to maximize the left-hand side. Maximizing this requires: (Req. I) maximizing the largest divergence between the new source samples {Ŝ_i}_i and (Req. II) minimizing the largest divergence between our auxiliary samples {R̂_i}_i and our original source samples {P̂_i}_i. Algorithmically, we can coarsely approximate these divergences, again appealing to the interpretation provided by Ben-David et al. [8] and Ganin and Lempitsky [1]: (Req. I) requires that our domain discriminator make fewer errors when discriminating the new source samples {Ŝ_i}_i, and (Req. II) requires that the auxiliary samples {R̂_i}_i and the original sources {P̂_i}_i be indiscernible by our domain discriminator.
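This style of approximation is commonly reported as the proxy A-distance of Ben-David et al.: train a classifier to separate two samples and convert its held-out error into a divergence estimate. A sketch of the conversion (the helper name is ours):

```python
def proxy_a_distance(domain_clf_error):
    """Proxy for the H-divergence: 2 * (1 - 2 * err), where err is the test
    error of a classifier trained to tell two samples apart.  Chance-level
    error (0.5, indiscernible samples) gives divergence 0; perfect
    separation (err = 0) gives the maximum value 2."""
    return 2.0 * (1.0 - 2.0 * domain_clf_error)
```

Under this reading, (Req. I) pushes the proxy up for pairs of modified sources, while (Req. II) pushes it down between each auxiliary sample and its original source.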
To implement these requirements, we modify our dataset through gradient descent. Suppose that P̂_i is an empirical sample from the distribution P_i. We can alter data-points a_j ∼ P̂_j to generate data-points b_j ∼ R̂_j by setting x_j(0) = a_j, iterating the update rule below to minimize L_SD for T steps, and then taking b_j = x_j(T). Importantly, we do not modify the domain labels in this modification. So, our updates satisfy requirement (Req. I) because minimization of L_SD approximates minimization of our domain discriminator's errors, and further, satisfy (Req. II) because a_i and b_i are identically labeled, so minimization of the domain discriminator's errors suggests that these examples should be indiscernible (i.e., assigned the same correct label). While this update rule seemingly accomplishes our algorithmic goals, we must recall the final upper bound we wish to minimize (see Eq. (10)). The first two terms in this bound, λ_ϕ and Σ_i ϕ_i E_{P_i}(h), relate to our classification error, i.e., to the task-specific network c_σ. If our generated distributions {R̂_i}_i distort the underlying class information, these terms may grow uncontrollably. To account for this, we further modify the update rule of Eq. (42) to minimize the change in the probability distribution output by the task classifier. We measure the change caused by our updates using the loss L_KL, i.e., the KL-Divergence [20]. This gives the modified update rule.
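Since Eqs. (42) and (43) are not reproduced in this excerpt, the following is only our reading of the modified update: a gradient step on the image that descends the domain-discrimination loss L_SD plus a weighted KL penalty on the task classifier's output. The function name, gradient arguments, and penalty weight lam_kl are all hypothetical; in practice the gradients would come from automatic differentiation.

```python
def cooperative_update(x, grad_domain, grad_kl, eta, lam_kl):
    """One (assumed) Eq. (43)-style step: move the image x against the
    gradient of L_SD (so the domain discriminator errs less on it) while
    penalizing KL drift in the task classifier's output distribution.
    grad_domain and grad_kl are the gradients of L_SD and L_KL w.r.t. x."""
    return [xi - eta * (gd + lam_kl * gk)
            for xi, gd, gk in zip(x, grad_domain, grad_kl)]
```

Iterating this step T times from x(0) = a_j yields the cooperative example b_j = x(T).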

Algorithm 1 DANNCE (DANN with Cooperative Examples)
Input: A collection of mini-batches {x^j_i}_i with labels {y^j_i}_i for each j ∈ [k]; classifier c_σ parameterized by σ; feature extractor r_θ parameterized by θ; domain discriminator d_µ parameterized by µ.
Parameters: Perturbation probability β between 0 and 1. Number of update steps T. Learning rate η.
Dependencies: SourceSourceDANN (i.e., any DANN algorithm to optimize for main-text Eq. 8 given a batch).

for j = 1 : k do
  for example x^j ∈ batch {x^j_i}_i do
    p ← random uniform draw from [0, 1]
    if p ≤ β then
      for t = 1 : T do
        Update x^j(t) using Eq. (43)  ▷ DANNCE update
      end for
    end if
  end for
end for
Run DANN as usual.

Interpretation
In totality, this algorithm may be seen as employing a style of adversarial training where, rather than generating examples to fool a task classifier - e.g., the single-source DG approach of Volpi et al. [21] - we instead generate examples to exploit the weaknesses of the feature extractor r_θ, whose goal is to fool the domain discriminator. In this sense, the generated examples can be interpreted as cooperating with the domain discriminator. Hence, we refer to the technique as DANN with Cooperative Examples, or DANNCE. For details on our implementation of DANNCE, please see the pseudo-code in Algorithm Block 1. Additional details can also be found in Appendix B.

Experimentation
In this section, we aim to address the primary point argued throughout this paper: the application of DANN to DG can benefit from (algorithmic) consideration of source diversity. To this end, our modus operandi is comparison to recent state-of-the-art methods using a source-source DANN, or other domain alignment techniques, for domain generalization. See Appendix B and the code provided in the supplement for all implementation details and additional experiments.

Datasets and Hyper-Parameters
We evaluate our method on two multi-source DG datasets. (1) PACS [22] contains 4 different styles of images (Photo, Art, Cartoon, and Sketch) with 7 common object categories. (2) Office-Home [23] also contains 4 different styles of images (Art, Clipart, Product, and Real-World) with 65 common categories of daily objects. For both datasets, we follow standard experimental setups. We use 1 domain as the target and the remaining 3 domains as sources. We report the average classification accuracy on the unseen target over 3 runs, using the model state at the last epoch to avoid peeking at the target. We select our hyper-parameters using leave-one-source-out CV [24]; this again avoids using the target in any way. Because some methods select parameters using a source train/val split, we use only the training data of the standard splits for fairness. Other parameters of our setup, unrelated to our own method, are selected based on the environment of Matsuura and Harada [2] (MMLD) - a SOTA source-source DANN technique. For full details, see Appendix B.

Our Models
For the feature extractor r_θ, we use AlexNet [25] for PACS and ResNet-18 [26] for PACS and OfficeHome. Both are pretrained on ImageNet with the last fully connected (FC) layer removed. For the task classifier c_σ and domain discriminator d_µ, we use only FC layers. For ERM (often called Vanilla or Deep All), only r_θ and c_σ are used and the model is trained on a mixture of all sources; this is a traditional DG baseline. For DANN, we add the domain discriminator d_µ and additionally update r_θ with L_SD (see Eq. (8)). Because we ultimately compare against DANN as a baseline, we must ensure our implementation is state-of-the-art. Therefore, we generally follow the implementation described by Matsuura and Harada [2], adding a commonly used Entropy Loss [27, 28] and phasing in the impact of L_SD on r_θ by setting λ = 2/(1 + exp(−κ · p)) − 1 in Eq. (8) with p = epoch/max_epoch and κ = 10.
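The phase-in schedule above is easy to compute directly; a minimal sketch (the function name is ours):

```python
import math

def dann_lambda(epoch, max_epoch, kappa=10.0):
    """Phase-in weight lambda = 2 / (1 + exp(-kappa * p)) - 1 with
    p = epoch / max_epoch: starts at 0 and saturates near 1."""
    p = epoch / max_epoch
    return 2.0 / (1.0 + math.exp(-kappa * p)) - 1.0
```

The schedule suppresses the noisy domain-loss signal early in training and hands it full weight once the feature extractor has stabilized.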
For our proposed method, DANNCE, we use the same baseline DANN, but update 50% of the images (i.e., β = 0.5) to cooperate with the domain discriminator following Eq. (43). The number of update steps per image is 5 (i.e., T = 5).

Experimental Baselines
As mentioned, we focus on comparison to other methods proposing domain alignment for DG. Albuquerque et al. [3] (G2DM) and Li et al. [5] (MMD-AAE) propose variants of DANN, and in particular, align domains by making updates to the feature extractor. As noted, Matsuura and Harada [2] (MMLD) propose the DANN setup most similar to our baseline DANN. For MMLD, Matsuura and Harada [2] additionally propose a source-domain mixing algorithm - we denote this by MMLD-K with K the number of domains after re-clustering. Shankar et al. [30] (CrossGRAD) and Zhou et al. [31] (DDAIG), contrary to our work, generate examples which maximize the domain loss. Because they do not update the feature extractor with the domain loss L_SD as we do, this may actually be viewed as domain alignment by data generation (see Liu et al. [32], who first propose this technique). For MMD-AAE and CrossGRAD, we use results reported by Zhou et al. [31] because the original methods do not test on our datasets.

Analysis of Performance
Generally, in DG, the comparison of performance is subjective across different experimental setups - a problem highlighted by a recent commentary on the experimental rigor of DG setups [33]. As such, we include reported results from other experimental setups predominantly to show our DANN implementation is a competitive baseline. This much is visible in Table 1. For 2 out of 3 setups, our DANN alone has higher overall accuracy than any other method.
Our focus, then, is the validation of our main argument using our strong DANN baseline. In this context, as shown in Table 1, ablation of DANNCE reveals substantial improvement upon the traditional source-source DANN in all PACS setups and marginal improvement in the OfficeHome setup. This is somewhat intuitive, as OfficeHome has a staggering 65 categories to classify compared to 7 in PACS. Ultimately, the performance gains demonstrated by the addition of DANNCE agree with our main argument: increasing diversity when aligning domains can have practical benefits in DG.

Analysis of Loss Curves
To measure domain diversity, we use the loss of the domain discriminator (averaged per epoch). This loss proxies the H-divergence (an inverse relationship). A lower loss should then indicate more domain diversity and has the benefit of dynamically measuring diversity during training. Figure 3 shows the domain discriminator loss across epochs for our implementations of DANN and DANNCE using AlexNet on PACS. We generally see that after epoch 15, the loss for DANNCE is lowest. Figure 4 further shows the effect of increasing the number of steps per image update. This suggests that increasing the number of updates has some control over the source domain diversity, as intended. Finally, in both figures, epochs 10 to 24 show the (inverted) smooth proxy for the domain divergence is increasing. This agrees with the formal claim made in Prop. 5. Although the trend changes after epoch 24, this is likely due to a decrease in γ at this epoch, and thus does not necessarily disagree with our formal claim.

Domain Adaptation Theory
Many works extend the theoretical framework of Ben-David et al. [8] to motivate new variants of DANN. Zhao et al. [15] consider the multi-source setting, Schoenauer-Sebag et al. [34] consider a multi-domain setting in which all domains have labels but large variability across domains must be handled, and Zhang et al. [35] consider theoretical extensions to the multi-class setting using a margin loss. Besides the theoretical perspective of Ben-David et al. in DA, there are many other works to consider. Mansour et al. [36] consider the case of general loss functions rather than the 0-1 error. Kuroki et al. [37] consider a domain divergence which depends on the source domain and, through this dependence, always produces a tighter bound. Many works also consider integral probability metrics, including Redko et al. [38], Shen et al. [39], and Johansson et al. [12]. As has been discussed in this paper, the assumptions of various domain adaptation theories are of particular importance. Consequently, these assumptions are also important for DG. We discuss some assumptions in more detail in the next part.

Assumptions in DA
Ben-David et al. [11] show that control of their divergence term as well as the ideal joint error λ (so that both are small) gives necessary and sufficient conditions for a large class of domain adaptation learners. These are the conditions which we control (in the case of the divergence term) and assume (in the case of the ideal joint error). Other assumptions for DA include the covariate shift assumption, in which the marginal feature distributions are assumed to change but the feature-conditional label distributions across domains remain constant. As we have discussed, Zhao et al. [13] show that this assumption is not always enough in the context of DANN, and Johansson et al. [12] provide similar conceptualizations. Still, this assumption can be useful in the context of model selection [40, 41]. Another common assumption is label shift: the marginal label distributions disagree, but the label-conditional feature distributions are the same. Again, this is related to the concern of Zhao et al. [13], since significant disagreement in the label distributions can cause DANN to fail miserably. Lipton et al. [42] provide adaptation algorithms for this particular situation. Another assumption one can make for the benefit of algorithm design is the notion of generalized label shift, in which the label distributions may disagree and the label-conditional feature distributions agree in an intermediate feature space. As we have noted, Tachet des Combes, Zhao, et al. [14] propose this assumption, devise new theoretical arguments under it, and suggest a number of algorithms based on their proposal.

Domain Generalization Theory
For DG, there is decidedly less theoretical work, but throughout our text, we have attempted to compare to the most relevant (and recent) - a bound proposed by Albuquerque et al. [3]. That said, some different theoretical perspectives on DG do exist. In particular, Blanchard et al. [43, 44] and Deng et al. [45] consider domain generalization from the perspective of a meta-distribution which governs our observation of domains. Asymptotically, as we observe more domains, we can be more confident in the success of our algorithm. While this approach is interesting, our paper instead focuses on the case where we only have a relatively small number of domains from which to learn.

Algorithms in DG
Besides DANN and the other domain-aligning algorithms mentioned in this text, there are of course additional algorithmic perspectives on DG. An early work in DG by Muandet et al. [6] proposes a kernel-based algorithm aimed at producing domain-invariant features with a strong theoretical justification. More recently, a common thread is the use of meta-learning (e.g., to simulate domain-holdout), seen in Li et al. [46], Balaji et al. [24], and Dou et al. [47]. Some authors, such as Wang et al. [48] and Carlucci et al. [49], make additional assumptions on the domains to be seen and use this in algorithm design. Lastly, some works focus on the neural network components themselves, e.g., Li et al. [22]. These architecture changes can be very effective (see Seo et al. [50] for impressive results when modifying batch normalization). Related to our paper's main point, we primarily focus on comparison to other methods proposing domain alignment for DG, especially those which are, in some sense, model agnostic. These additional references are discussed in Experimental Baselines.

Discussion
In this work, we investigate the applicability of source-source DANN for domain generalization. Our theoretical results and interpretation suggest a complex relationship between the heterogeneity of the source domains and the usual process of domain alignment. Motivated by this, we construct an algorithmic extension of DANN which diversifies the sources via gradient-based image updates. Our empirical results and analyses support our findings.
One of the motivations of our algorithm is also one of the predominant limitations of our study. In particular, the behavior of DANN as a dynamic process is not well understood. Studying it as such can reveal new information. For example, in the proof of Proposition 5, we saw the importance of annealing the learning rate for DANN. We also use Proposition 5 to motivate our algorithm design, but there are certainly open questions on the dynamic behavior of DANN and DANNCE. For example, it would be interesting to consider the competing objectives we have discussed in a more analytically tractable environment. Even for simple distributions, it is an open question how the hyper-parameters of DANNCE - which intuitively balance the competing objectives - may be optimally selected. On a related note, although we have assumed the ideal joint error is generally small, we have also pointed out that this is not always the case [13]. While our promising results indicate this may not be an issue in practice, it is still interesting to consider this from a more theoretical perspective as well. Finally, it is important to point out that our empirical investigation was limited to images. It would be interesting to consider how our technique might extend to natural language or other areas where gradient-based algorithms are used for learning.
approximating the empirical H∆H-divergence by training a classifier to distinguish between the source and target distributions. To minimize the empirical H∆H-divergence, we should maximize this classifier's errors. Thus, this proposition can be viewed as motivation for our - and many other authors' - choice to approximate minimization of the divergence by maximization of a domain classifier's errors.
Proposition 7 (Ben-David et al. [8] Lemma 2) Provided a symmetric hypothesis class H and samples P̂, Q̂ each of size n,

d̂_H(P̂, Q̂) = 2 (1 − min_{h∈H} [ (1/n) Σ_{x: h(x)=0} 1[x ∈ P̂] + (1/n) Σ_{x: h(x)=1} 1[x ∈ Q̂] ]).

Proof We proceed in a similar fashion to [8]. Let h ∈ H and consider the quantity (1/n) Σ_{x: h(x)=0} 1[x ∈ P̂] + (1/n) Σ_{x: h(x)=1} 1[x ∈ Q̂]. We note two obvious facts: every x must belong to the sample P̂ or the sample Q̂, and every x must have h(x) ∈ {0, 1}. Using these facts, the quantity above may be rewritten as 1 − Pr_P̂(I_h) + Pr_Q̂(I_h), where I_h = {x : h(x) = 1}. Taking a minimum over h ∈ H on both sides, we therefore have min_{h∈H} [·] = 1 − max_{h∈H} (Pr_P̂(I_h) − Pr_Q̂(I_h)). Finally, since H is assumed symmetric, the maximum of the difference equals the maximum of its absolute value, so d̂_H(P̂, Q̂) = 2 max_{h∈H} |Pr_P̂(I_h) − Pr_Q̂(I_h)| = 2 (1 − min_{h∈H} [·]), and we are done.

A.1.3 Proof of Theorem 1
Here, we present Theorem 2.1 of the main text (referenced in the Appendix as Theorem 1). We begin with a Lemma required for the final proof.
Lemma 1 ([8] Lemma 3) Let X be a space and H a class of hypotheses corresponding to this space. Let P and Q be distributions over X. Then for any hypotheses h_1, h_2 ∈ H, we have

|E_P(h_1, h_2) − E_Q(h_1, h_2)| ≤ (1/2) d_{H∆H}(P, Q).

Proof We proceed in a similar fashion to [8]. By definition of the H-divergence, we have

d_{H∆H}(P, Q) = 2 sup_{h∈H∆H} |Pr_P(I_h) − Pr_Q(I_h)|
= 2 sup_{h_1,h_2∈H} |Pr_P[h_1(x) ≠ h_2(x)] − Pr_Q[h_1(x) ≠ h_2(x)]|
= 2 sup_{h_1,h_2∈H} |E_P(h_1, h_2) − E_Q(h_1, h_2)|
≥ 2 |E_P(h_1, h_2) − E_Q(h_1, h_2)|.

Here, the second equality follows directly from the definition of H∆H, the third equality follows from Main Text Eq. 6, and the last inequality follows by properties of the supremum.
Using Lemma 1, we may present the proof of Theorem 1. Our statement is modified, omitting empirical quantities. We invite the reader to view Theorem 2 of [8] for the full result.
Proof We proceed in a similar fashion to [8]. First, note the triangle inequality of classification error [7, 10], which states that given any labeling functions h_1, h_2, h_3, we have E(h_1, h_2) ≤ E(h_1, h_3) + E(h_3, h_2). Then, let h ∈ H, let η = arg min_{h∈H} E_Q(h) + E_P(h), and let f_P, f_Q be the true labeling functions of P, Q on X, respectively. Given this, we have

E_Q(h) ≤ E_Q(η) + E_Q(η, h)
≤ E_Q(η) + E_P(η, h) + |E_Q(η, h) − E_P(η, h)|
≤ E_Q(η) + E_P(η, h) + (1/2) d_{H∆H}(P, Q)
≤ E_Q(η) + E_P(η) + E_P(h) + (1/2) d_{H∆H}(P, Q)
= λ + E_P(h) + (1/2) d_{H∆H}(P, Q).

Here, the second inequality comes from considering both the cases E_Q(η, h) > E_P(η, h) and E_P(η, h) > E_Q(η, h), the third inequality comes from Lemma 1, and all other inequalities follow from the triangle inequality of classification error.

A.1.4 Sample Complexity of Theorem 1
Here, we present Proposition 8. This Proposition contributes the main result required to derive generalization bounds for Theorem 1. Since Theorem 1 is modified from Theorem 2 of Ben-David et al. [8], we direct the reader to that proof for further details.

Remark on Sample Complexity
In general, we choose to omit discussion of sample complexity from the main text. In the usual case, where H is a class of neural networks, the VC dimension [51] is usually larger than the number of samples. As can be seen in the statement of Proposition 8, this causes problems in the interpretation and assumptions of the generalization bound. Despite this fact, Ganin and Lempitsky [1] have shown that (empirically) this is a non-issue for application of DANN. With this said, some readers may rightly desire tighter bounds on empirical quantities when dealing with neural networks. Recently, some works have shown success in deriving much tighter bounds on empirical quantities (like error) for stochastic neural networks using the PAC-Bayesian framework (Dziugaite et al. [17]). Within the PAC-Bayesian framework, Germain et al. [18] provide a distribution divergence pseudometric very similar to the H∆H-divergence. As mentioned in the main text, the important property we use in our results is the pseudometric property, so we expect our results to hold in this more modern formulation as well.
In any case, we remark that generalization bounds can be derived for Theorem 1 and other results in this paper by application of the statement below. For a more detailed discussion, where empirical quantities are considered and generalization bounds are derived in a variety of circumstances, we direct the reader to the original work of Ben-David et al. [8].
Proposition 8 (Ben-David et al. [8] Lemma 2; Kifer et al. [9] Theorem 3.2) Let X be a space and H be a class of hypotheses corresponding to this space with VC dimension d. Suppose P and Q are distributions over X with corresponding samples P̂ and Q̂ of size n. Suppose d̂_H(P̂, Q̂) is the empirical H-divergence between samples. Then, for any δ ∈ (0, 1), the following holds with probability at least 1 − δ:

d_H(P, Q) ≤ d̂_H(P̂, Q̂) + 4 √((d log(2n) + log(2/δ)) / n).

A.2 On the H∆H-divergence with Comparison to a PAC-Bayesian Distribution Divergence
Here, we prove some useful facts about the H∆H-divergence. In fact, these are the essential properties used to prove our formal claims in the main text. Most of these are known and have been used by other authors, but we restate and prove them here for completeness. One important point of this discussion is to demonstrate the relation to a second distribution divergence proposed by Germain et al. [18] within the PAC-Bayes framework. As we will show, this PAC-Bayesian divergence exhibits the same properties. The consequence is that much of our formal discussion holds for this more modern divergence as well.

A.2.1 A Nice Property of Mixture Distributions
Below, we provide a nice property of mixture distributions when considering their divergence. We are aware of variants of this result which have been observed by both Zhao et al. [15] and Albuquerque et al. [3] in derivations of their own bounds. We use this result in most of our proofs involving mixtures.
Lemma 2 Let X be a space and let H be a class of hypotheses corresponding to this space. Let the collection {P_i}_{i=1}^k be distributions over X. Now, suppose also that Q is a mixture of the component distributions {P_i}_i; that is, for any set A, we have Pr_Q(A) = Σ_i α_i Pr_{P_i}(A) with Σ_i α_i = 1 and α_i ≥ 0 for all i. Then, for any distribution P, the following holds:

d_{H∆H}(P, Q) ≤ Σ_i α_i d_{H∆H}(P, P_i). (A19)

The proof proceeds by the chain of equalities and inequalities in Eq. (A20). There, the results follow mostly by definition or arithmetic, but we highlight some exceptions: the third equality follows because the coefficients {α_j}_j sum to 1, and the only inequality follows by application of the triangle inequality (for absolute values).

A.2.3 Comparison to the Domain Disagreement [18]
The domain disagreement is another distribution divergence proposed by Germain et al. [18]. As noted by these authors, the divergence is in fact designed as the PAC-Bayesian analog of the H∆H-divergence. We define the domain disagreement below for a distribution ρ over H:

dis_ρ(P, S) = |E_{(h,h′)∼ρ²}[E_P(h, h′) − E_S(h, h′)]|. (A25)

As it turns out, the domain disagreement abides by a triangle inequality and further satisfies Lemma 2. The former is a simple consequence of the fact that the domain disagreement is also a pseudometric [18]. The latter is not so trivial to see, but we provide a quick sketch of the required steps below. Assuming S is a mixture as in Lemma 2, we have

dis_ρ(P, S) = |E_{(h,h′)∼ρ²}[E_P(h, h′) − E_S(h, h′)]|
= |E_{(h,h′)∼ρ²}[Σ_j α_j (E_P(h, h′) − E_{P_j}(h, h′))]|
≤ Σ_j α_j |E_{(h,h′)∼ρ²}[E_P(h, h′) − E_{P_j}(h, h′)]|
= Σ_j α_j dis_ρ(P, P_j).

The steps above generally follow from arithmetic similar to Eq. (A20) or by linearity of the expectation. The last inequality uses properties of the absolute value.
Harking back to the main point, we remind the reader that Lemma 2 and the triangle inequality are the primary tools needed for our results. As such, much of our formal discussion holds for this more modern divergence as well.
All experiments were run on an NVIDIA GeForce RTX 2080 Ti GPU (11 GB). We used the helpful Weights and Biases tool [53] during experimentation for visualizing our model training and results.

B.3 Network Architectures
In this section, we provide details of the network architectures of our model components.

Feature Extractors
We use AlexNet [25] for PACS and ResNet-18 [26] for PACS and OfficeHome. In both cases, we pretrain on ImageNet [54] with the last FC layer removed. We note that we used a Caffe version of AlexNet implemented in PyTorch to follow related recent works [2, 49], which showed consistently competitive Deep All baseline accuracies. The exact implementations can be found in our codebase (/src/models/caffenet/models.py and /src/models/resnet.py). The pretrained model for AlexNet is also included in our supplementary materials. For ResNet-18, it is loaded from torchvision in the code.

Classifiers
The class classifier is a single fully connected (FC) layer. The domain discriminator follows the design of Matsuura and Harada [2] and is a simple stack of fully connected layers. Again following [2], the class classifier has a Xavier (Glorot) uniform initialization [55] with gain set to 0.1, while the domain classifier uses the PyTorch [52] default initialization (version 1.4). The exact implementation can be found in the code, specifically in the module /src/models.

B.4 Hyper-Parameters of DANNCE
While we generally try to follow Matsuura and Harada [2] as closely as possible to ensure our baseline DANN is state-of-the-art, we cannot use existing hyper-parameter choices for our novel algorithm (i.e., DANNCE). To perform the image updates (Line 5, Algorithm 1), we use the Adam optimizer [56]. Generally, we fix β = 0.5 and T = 5 in Algorithm 1. To maintain realistic image values, we clamp the pixel values of the resulting images after each update based on the max and min pixel values of the PACS dataset. Yosinski et al. [57] also use image-space gradient updates and further identify the addition of Gaussian blurring as an important parameter for producing realistic images. Based on one of the optimal settings described by Yosinski et al. [57], we use Gaussian blurring once every 4 steps of our algorithm. We provide an ablation of the effect of blurring in Table B1, which reveals that blurring may indeed be important for our method when applied to images, but importantly also shows that our gain in performance does not only come from

Fig. 1: A visualization of Example 1. Best viewed in color. The green line gives the value b of a hypothesis in {h_{a,b}(·)}_{(a,b)} with a ≤ 0. Such a hypothesis would perfectly discern P_1 and P_2. From this, it follows that d_{H∆H}(P_1, P_2) = 2 because a hypothesis in {h_{a,b}(·)}_{(a,b)} can achieve 2, and 2 is the maximum value for any divergence. Note, from this, it already follows that Eq. (18) holds because each term on the right-hand side is bounded above by 2, and therefore, so is their convex combination. Still, we can analyze the example further. If we imagine the red line also gives the value b of a hypothesis in {h_{a,b}(·)}_{(a,b)} with a ≤ 0 and slide it back and forth, we can never perfectly discern P_1 or P_2 from S, and therefore we will never achieve the maximum divergence 2.

Fig. 2: An informal visualization. Blue dots represent sources. Purple lines define the boundaries of M. Grey lines give the boundaries of the closed (H, ρ)-balls around each source (defined in Proposition 3). Green colored areas define the boundary of ∩_i B_ρ(P_i). Distributions within the yellow area may satisfy Condition (9). Distributions outside the yellow area (red dots) do not satisfy Condition (9).

d_{H∆H}(P, Q) ≤ Σ_j α_j d_{H∆H}(P, P_j). (A19)

Proof The result follows from the chain below:

d_{H∆H}(P, Q) = 2 sup_{h∈H∆H} |Pr_P(I_h) − Pr_Q(I_h)|
= 2 sup_{h∈H∆H} |Pr_P(I_h) − Σ_j α_j Pr_{P_j}(I_h)|
= 2 sup_{h∈H∆H} |Σ_j α_j Pr_P(I_h) − Σ_j α_j Pr_{P_j}(I_h)|
= 2 sup_{h∈H∆H} |Σ_j α_j (Pr_P(I_h) − Pr_{P_j}(I_h))|
≤ 2 Σ_j α_j sup_{h∈H∆H} |Pr_P(I_h) − Pr_{P_j}(I_h)|
= Σ_j α_j d_{H∆H}(P, P_j). (A20)

Table 1: PACS and OfficeHome results in accuracy (%). avg: average of the target domain accuracies. gain: avg gain over the respective ERM (if reported).