1 Introduction

In machine learning, we generally assume that the training data for our learning algorithm is representative of the testing data. That is, we assume our training data follows the same distribution as our testing data. Of primary interest to this paper is the case where this assumption fails to hold: we consider learning in the presence of multiple domains. We formalize the multiple domain problem of interest as the case where (at train-time) we observe k domains referred to as sources which have distributions \({\mathbb {P}}_1, {\mathbb {P}}_2, \ldots , {\mathbb {P}}_k\) over some space \({\mathcal {X}}\). At test-time, we are evaluated on a distinct target domain which has distribution \({\mathbb {Q}}\) over \({\mathcal {X}}\). All of these feature distributions have (potentially) distinct labeling functions and our goal is to learn the labeling function on the target. Typically, we assume some restriction on observation of the target domain at train-time. In the literature, a large amount of work is concerned with the problem of Domain Adaptation (DA) which assumes access to samples from \({\mathbb {Q}}\), but restricts access to the labels of these samples. More recently, there has also been active investigation into the problem of Domain Generalization (DG) which instead assumes absolutely no access to the target domain. In spite of these restrictions, in both cases, the goal is for our learning algorithm trained on sources to perform well when evaluated on the target.

One popular approach to DA is the use of a Domain Adversarial Neural Network (DANN) originally proposed by Ganin and Lempitsky (2015). Intuitively, this approach attempts to align the source and target domains by learning feature representations of both which are indiscernible by a domain discriminator trained to distinguish between the two distributions. Informally speaking, this seems like a sensible approach to DA. By accomplishing this domain alignment, the neural network should still be adept at the learned task when it is evaluated on the target domain at test-time. While DANN was originally proposed for DA, the adoption of this reasoning has motivated adaptations of this approach for DG (Albuquerque et al., 2020; Li et al., 2018b, c; Matsuura & Harada, 2020). In fact, very early works in DG (Muandet et al., 2013) are similarly motivated by the goal of domain-agnostic feature representations.

Still, it is worth noting that the original proposal of DANN (Ganin & Lempitsky, 2015) was motivated by theory. In particular, Ganin and Lempitsky base their algorithm on the target-error bound given by Ben-David et al. (2007, 2010a). Under appropriate assumptions, interpretation of the bound suggests domain alignment as achieved through DANN should improve performance on the target distribution, but importantly, it motivates alignment between the source and target. Counter to this, DANN variants for DG generally align multiple source domains because no access to target data is permitted. This shortcoming gives rise to the question of primary interest to this paper:

Is there a justification for source alignment using DANN in DG?

Specifically, we are concerned with a target-error bound similar to those provided by Ben-David et al. (2010a). To answer this question, we appeal to a recent theoretical proposal by Albuquerque et al. (2020) which uses a reference object (i.e., the set of mixture distributions of the sources) to derive a target-error bound in the domain generalization setting. Building on this framework, we provide answers to two important considerations:

  1. What additional reference objects (besides sets of mixture distributions) satisfy the primary condition used to derive target-error bounds in DG?

  2. How does the target-error bound behave as a dynamic quantity during the training process?

Ultimately, answering these two questions allows us to formulate a novel extension of the Domain Adversarial Neural Network. We validate experimentally that this extension improves performance and otherwise agrees with our theoretical expectations.

2 Domain Adversarial Neural Network (DANN)

In this section, we cover the necessary background on Domain Adversarial Neural Networks (DANN). We first present the original bound on target-error in the case of unsupervised DA (Ben-David et al., 2007, 2010a) which motivates the DANN algorithm proposed by Ganin and Lempitsky (2015). Following this, we outline the key differences introduced by a DANN variant proposed by Matsuura and Harada (2020). Although this variant achieves state-of-the-art (DANN) performance in DG, we point out our main concerns regarding the justification of this approach.

2.1 In domain adaptation

As mentioned, we begin with a motivating result of Ben-David et al. (2010a). Intuitively, this result describes bounds on the target-error controlled, in part, by a computable measure of divergence between distributions. While we provide a more detailed exposition of the problem setup in Appendix A, we begin by listing here the key terms to familiarize the reader.

2.1.1 Setup

For a binary hypothesis h, a distribution \({\mathbb {P}}\), and a labeling function f for \({\mathbb {P}}\), we define the error \({\mathcal {E}}_{\mathbb {P}}(h)\) of h on the distribution \({\mathbb {P}}\) as follows

$$\begin{aligned} {\mathcal {E}}_{\mathbb {P}}(h) = {\textbf{E}}_{x \sim {\mathbb {P}}} \left|h(x) - f(x) \right|= {\textbf{E}}_{x \sim {\mathbb {P}}} \left[ 1[h(x) \ne f(x)]\right] . \end{aligned}$$
(1)

This is our primary measure of the quality of a hypothesis when predicting on a distribution \({\mathbb {P}}\). To measure differences in distribution, we use the \({\mathcal {H}}\)-divergence which is an adaptation of the \({\mathcal {A}}\)-distance (Kifer et al., 2004). In particular, given two distributions \({\mathbb {P}}\), \({\mathbb {Q}}\) over a space \({\mathcal {X}}\) and a corresponding hypothesis class \({\mathcal {H}} \subseteq \{h \mid h: {\mathcal {X}} \rightarrow \{0,1\}\}\), the \({\mathcal {H}}\)-divergence (Ben-David et al., 2010a) is defined

$$\begin{aligned} d_{\mathcal {H}}({\mathbb {P}}, {\mathbb {Q}}) = 2 \sup _{h \in {\mathcal {H}}} \left|\textrm{Pr}_{\mathbb {P}}(I_h) - \textrm{Pr}_{\mathbb {Q}}(I_h)\right|\end{aligned}$$
(2)

where \(I_h = \{x \in {\mathcal {X}} \mid h(x) = 1\}\). Generally, it is more useful to consider the \({\mathcal {H}}\Delta {\mathcal {H}}\)-divergence, specifically, where Ben-David et al. (2010a) define the symmetric difference hypothesis class \({\mathcal {H}}\Delta {\mathcal {H}}\) as the set of functions characteristic to disagreements between hypotheses. This special case of the \({\mathcal {H}}\)-divergence will be the measure of divergence in all considered bounds.
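In practice, the supremum in Eq. (2) is not computed exactly; a common surrogate trains a domain classifier to separate samples of the two distributions and converts its error into a divergence estimate. Below is a minimal sketch of this style of estimate, assuming logistic regression as the hypothesis class; the helper name, the toy Gaussian data, and the use of training (rather than held-out) error are illustrative choices of ours, not the estimator analyzed in the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def proxy_h_divergence(X_p, X_q):
    """Rough empirical surrogate for d_H(P, Q): train a domain classifier to
    separate samples of P from samples of Q and map its error to [0, 2]."""
    X = np.vstack([X_p, X_q])
    y = np.concatenate([np.zeros(len(X_p)), np.ones(len(X_q))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    err = 1.0 - clf.score(X, y)      # domain-classification error
    return 2.0 * (1.0 - 2.0 * err)   # small error -> large estimated divergence

# Toy check: nearby Gaussians give a small estimate, distant ones a large one.
rng = np.random.default_rng(0)
close = proxy_h_divergence(rng.normal(0, 1, (500, 2)), rng.normal(0.1, 1, (500, 2)))
far = proxy_h_divergence(rng.normal(0, 1, (500, 2)), rng.normal(5, 1, (500, 2)))
print(f"close: {close:.2f}, far: {far:.2f}")
```

A discriminator that cannot beat chance yields an estimate near 0, while a perfect discriminator yields the maximal value 2.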

2.1.2 The motivating bound

We can now present the result of Ben-David et al. (2010a) based on the triangle inequality of classification error (Crammer et al., 2007; Ben-David et al., 2007). This bound is the key motivation behind DANN (Ganin & Lempitsky, 2015). For proof and a discussion on sample complexity, see Appendix A.

Theorem 1

(modified from Ben-David et al. (2010a), Theorem 2) Let \({\mathcal {X}}\) be a space and \({\mathcal {H}}\) be a class of hypotheses corresponding to this space. Suppose \({\mathbb {P}}\) and \({\mathbb {Q}}\) are distributions over \({\mathcal {X}}\). Then for any \(h \in {\mathcal {H}}\),

$$\begin{aligned} {\mathcal {E}}_{\mathbb {Q}}(h) \le \lambda + {\mathcal {E}}_{\mathbb {P}}(h) + \tfrac{1}{2} d_{{\mathcal {H}} \Delta {\mathcal {H}}}({\mathbb {Q}}, {\mathbb {P}}) \end{aligned}$$
(3)

with \(\lambda\) the error of an ideal joint hypothesis for \({\mathbb {Q}}\), \({\mathbb {P}}\).

This statement provides an upper bound on the target-error. Thus, minimizing this upper bound is a good proxy for the minimization of the target-error itself. The first term \(\lambda\) is a property of the dataset and hypothesis class which we typically assume to be small, but should not be ignored. As Ben-David et al. (2010a) note, this may be interpreted as a realizability assumption which requires the existence of some hypothesis in our search space that does well on both distributions (simultaneously). If this hypothesis does not exist, we cannot hope to do adaptation by minimizing the source-risk (Ben-David et al., 2010b). Notably, \(\lambda\) also plays an important role in algorithms like DANN which modify the distributions over which they learn since these algorithms implicitly change \(\lambda\). We discuss this issue in detail in Sect. 2.3.

The latter terms are more explicitly controllable. The source-error \({\mathcal {E}}_{\mathbb {P}}(h)\) can be minimized as usual by Empirical Risk Minimization (ERM). The divergence can be empirically computed using another result of Ben-David et al. (2010a). While we give these results in the Appendix (Propositions 7 and 8, respectively), previous interpretation by Ganin and Lempitsky (2015) suggests minimizing the divergence by learning indiscernible representations of the distributions—i.e., aligning the domains. As we describe in the following, this may be accomplished by maximizing the errors of a domain discriminator trained to distinguish the distributions.

2.1.3 The DANN algorithm

Ganin and Lempitsky (2015) separate the neural network used to learn the task into a feature extractor network \(r_\theta\) and task-specific network \(c_\sigma\), parameterized respectively by \(\theta\) and \(\sigma\). A binary domain discriminator \(d_\mu\) outputting probabilities is trained to distinguish between the source and target distribution based on the representation learned by \(r_\theta\). Meanwhile, \(r_\theta\) is trained to learn a representation that is not only useful for the task at hand, but also adept at “fooling” the domain discriminator (i.e., maximizing its errors). In detail, given an empirical sample \(\hat{{\mathbb {P}}} = (x_i)_{i=1}^n\) from the source distribution \({\mathbb {P}}\) and a sample \(\hat{{\mathbb {Q}}} = (x'_i)_{i=1}^n\) from the target distribution \({\mathbb {Q}}\), the domain adversarial training objective is described

$$\begin{aligned} \begin{aligned} \min _\mu \max _\theta \ \frac{1}{2n} \sum _{i=1}^n \left[ {\mathcal {L}}_D(\mu , \theta , x_i, 0) + {\mathcal {L}}_D(\mu , \theta , x'_i, 1) \right] \end{aligned} \end{aligned}$$
(4)

where

$$\begin{aligned} \begin{aligned} -{\mathcal {L}}_D(\mu , \theta , x, y) =&(1-y) \log (1 - d_\mu \circ r_\theta (x)) + y \log (d_\mu \circ r_\theta (x)). \end{aligned} \end{aligned}$$
(5)

By this specification, \(d_\mu \circ r_\theta (x)\) is meant to estimate the probability x was drawn from \({\mathbb {Q}}\) and \({\mathcal {L}}_D\) represents the binary cross-entropy loss for a domain discriminator trained to distinguish \({\mathbb {P}}\) and \({\mathbb {Q}}\). Combining this with a task-specific loss \({\mathcal {L}}_T^{\mathbb {P}}\) we get the formulation given by Ganin and Lempitsky (2015)

$$\begin{aligned} \begin{aligned} \min _{\sigma , \theta }&\max _\mu \frac{1}{2n} \sum _{i=1}^n {\mathcal {L}}_T^{\mathbb {P}}(\sigma , \theta , x_i) - \frac{\lambda }{2n} \sum _{j=1}^n \left[ {\mathcal {L}}_D(\mu ,\theta , x_j, 0) + {\mathcal {L}}_D(\mu , \theta , x'_j, 1) \right] \end{aligned} \end{aligned}$$
(6)

where \(\lambda\) (in this context) is a trade-off parameter. The above is generally implemented by simultaneous gradient descent. We remark that a solution to this optimization problem is easily approximated by incorporating a Gradient Reversal Layer (GRL) between \(r_\theta\) and \(d_\mu\) (Ganin & Lempitsky, 2015).
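To make the preceding description concrete, the following is a minimal PyTorch sketch of a GRL and one domain adversarial update in the spirit of Eqs. (4)-(6). The module sizes, the optimizer, and the names r_theta, c_sigma, and d_mu are illustrative placeholders of ours; they are not the architecture or hyperparameters of Ganin and Lempitsky (2015).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies the gradient by -lam on the
    backward pass, so the feature extractor ascends the discriminator loss."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Placeholder networks standing in for r_theta, c_sigma, and d_mu.
r_theta = nn.Sequential(nn.Linear(64, 32), nn.ReLU())   # feature extractor
c_sigma = nn.Linear(32, 10)                             # task-specific head
d_mu = nn.Linear(32, 1)                                 # binary domain discriminator
params = list(r_theta.parameters()) + list(c_sigma.parameters()) + list(d_mu.parameters())
opt = torch.optim.SGD(params, lr=1e-2)

def dann_step(x_src, y_src, x_tgt, lam=1.0):
    """One simultaneous gradient update in the spirit of Eq. (6)."""
    z_src, z_tgt = r_theta(x_src), r_theta(x_tgt)
    task_loss = F.cross_entropy(c_sigma(z_src), y_src)
    # Domain loss L_D: label 0 for source examples, 1 for target examples.
    dom_logits = d_mu(grad_reverse(torch.cat([z_src, z_tgt]), lam)).squeeze(1)
    dom_labels = torch.cat([torch.zeros(len(x_src)), torch.ones(len(x_tgt))])
    dom_loss = F.binary_cross_entropy_with_logits(dom_logits, dom_labels)
    opt.zero_grad()
    (task_loss + dom_loss).backward()
    opt.step()
    return task_loss.item(), dom_loss.item()
```

As in the text, the discriminator descends the binary cross-entropy while the reversed gradient pushes r_theta to ascend it, so a single backward pass implements the saddle-point update.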

2.2 In domain generalization

Recent adaptations of the above formulation have been proposed in the context of DG. Here, we focus on the proposal of Matsuura and Harada (2020) since it remains one of the more competitive DG methods to date. In DG, since no access to \({\mathbb {Q}}\) is given, one cannot actually compute \({\mathcal {L}}_D\) as described above—it assumes at least unlabeled examples from \({\mathbb {Q}}\). Given this, Matsuura and Harada (2020) propose a modification which operates on k source samples

$$\begin{aligned} -{\mathcal {L}}_{SD}(\mu , \theta , x, y) = \sum _{i=1}^k 1[i = y] \log ((d_\mu \circ r_\theta (x))_i) \end{aligned}$$
(7)

where \(1[\cdot ]\) is the indicator function. Now, \(d_\mu\) is a multi-class domain discriminator trained to distinguish between sources; it outputs the estimated probabilities that x is drawn from each source. Hence, \({\mathcal {L}}_{SD}\) is essentially a multi-class cross-entropy loss. Given the source samples \(\hat{{\mathbb {P}}}_j = (x_i^j)_{i=1}^n \ \forall j \in [k]\) drawn respectively from the source distributions \({\mathbb {P}}_1, {\mathbb {P}}_2, \ldots , {\mathbb {P}}_k\), we substitute this into Eq. (6):

$$\begin{aligned} \begin{aligned} \min _{\sigma , \theta } \max _\mu \ \frac{1}{kn} \sum _{i=1}^n \sum _{j=1}^k {\mathcal {L}}_T^{{\mathbb {P}}_j}(\sigma , \theta , x_i^j) - \frac{\lambda }{kn} \sum _{i=1}^n \sum _{j=1}^k {\mathcal {L}}_{SD}(\mu , \theta , x_i^j, j) \end{aligned} \end{aligned}$$
(8)

which gives a domain adversarial training objective aimed at aligning the sources (while also maintaining good task performance). Hereon, we often refer to this as a source-source DANN, rather than a source-target DANN as was given in Eq. (6). On the surface, there seems to be no justification for the source-source DANN. If we recall the interpretation of Theorem 1, there is one key difference: rather than aligning the source and target domains \({\mathbb {P}}\) and \({\mathbb {Q}}\) as suggested by the divergence term in Theorem 1, the objective in Eq. (8) aligns source domains \({\mathbb {P}}_i\) and \({\mathbb {P}}_j \ \forall (i,j) \in [k]^2\) whose divergences do not appear in the upper bound. Thus, the motivating argument is lost in this new formulation. If we look to recent literature, preliminary theoretical work to motivate this modification of DANN does exist (Albuquerque et al., 2020). We start from this work in the derivation of our own results.
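Relative to the DA sketch above, the only structural change is the discriminator: it becomes a k-way classifier over source indices and the unlabeled target batch disappears. The sketch below is one way to implement the objective in Eq. (8), reusing the hypothetical grad_reverse, r_theta, and c_sigma defined in the previous sketch; the new head and optimizer names are again placeholders of ours.

```python
k = 3                                     # number of observed source domains
d_mu_src = nn.Linear(32, k)               # k-way source discriminator
params_src = list(r_theta.parameters()) + list(c_sigma.parameters()) + list(d_mu_src.parameters())
opt_src = torch.optim.SGD(params_src, lr=1e-2)

def source_source_dann_step(batches, lam=1.0):
    """One update in the spirit of Eq. (8); `batches` holds k pairs (x_j, y_j),
    one per source, and the source index j doubles as the discriminator label."""
    task_loss, dom_loss = 0.0, 0.0
    for j, (x_j, y_j) in enumerate(batches):
        z_j = r_theta(x_j)
        task_loss = task_loss + F.cross_entropy(c_sigma(z_j), y_j)
        # L_SD of Eq. (7): multi-class cross-entropy on the fixed domain index j.
        # The gradient reversal trains d_mu_src to separate the sources while
        # r_theta is pushed to make their representations indiscernible.
        dom_logits = d_mu_src(grad_reverse(z_j, lam))
        dom_labels = torch.full((len(x_j),), j, dtype=torch.long)
        dom_loss = dom_loss + F.cross_entropy(dom_logits, dom_labels)
    opt_src.zero_grad()
    ((task_loss + dom_loss) / k).backward()
    opt_src.step()
```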

2.3 A gap between theory and algorithm

To be totally precise, the algorithm given above does not actually minimize \(d_{{\mathcal {H}} \Delta {\mathcal {H}}}({\mathbb {P}}_i, {\mathbb {P}}_j)\) for any \(i, j\). As we have noted, the idea to “align domains” through a common feature representation is simply an interpretation following the convention of Ganin and Lempitsky (2015). If the class from which we select \(d_\mu\) is \({\mathcal {G}}\) and the class from which we select \(r_\theta\) is \({\mathcal {F}}\), the algorithm actually approximates minimization of \(d_{{\mathcal {G}} \Delta {\mathcal {G}}}({\mathbb {P}}_i \circ r_\theta ^{-1}, {\mathbb {P}}_j \circ r_\theta ^{-1})\) with respect to \(\theta\). Here, the notation \({\mathbb {P}}_i \circ r_\theta ^{-1}\) denotes the pushforward of \({\mathbb {P}}_i\) by \(r_\theta\) which is (intuitively) the image of \({\mathbb {P}}_i\) in the feature space. While this technicality will be unimportant for our discussions in the remainder of this text, it can potentially have significant negative ramifications. So, we discuss it in some detail here.

In particular, this gap between theory and algorithm implies that learning indiscernible representations of the source and target distributions while also minimizing the source error is not always sufficient for reducing the bound in Theorem 1. The problem arises because the ideal joint error (which is usually assumed small in the original problem) does not always remain small after feature transformation as in DANN. That is, while the ideal-joint error between \({\mathbb {P}}_i\) and \({\mathbb {P}}_j\) may be small, this may not be true of \({\mathbb {P}}_i \circ r_\theta ^{-1}\) and \({\mathbb {P}}_j \circ r_\theta ^{-1}\). This fact was recently observed independently by Johansson et al. (2019) and Zhao et al. (2019). Johansson et al. point out that learning a particular feature representation will always increase the ideal joint error (as compared to the original problem) whenever this feature representation is not invertible. Zhao et al. complement this result by providing a lower bound on target error in case the marginal label distributions differ substantially. In particular, the Jensen–Shannon (JS) divergence between the label distributions should be at least as large as the JS divergence between the source and target feature distributions for the lower bound to hold. If it is, the lower bound shows simultaneous minimization of the source-error and the \({\mathcal {H}}\Delta {\mathcal {H}}\)-divergence actually increases target-error.

In practice, as far as we are aware, it is not clear to what extent non-invertible feature representations increase the ideal joint error. Further, it is not easy to test whether the JS-divergence of the label distributions is larger than the JS-divergence of the source and target feature distributions. For this reason, in this work, we will simply assume the ideal joint error remains small after feature transformation; i.e., we do not explicitly consider any settings in which there are negative ramifications of the known gap between theory and algorithm for DANN. If these issues are of significant concern for a particular application (i.e., if the marginal label shift is known to be large), a recent modification of DANN which uses importance weighting has been proposed by Tachet et al. (2020). This modification aims to correct the shortcomings of standard DANN in the case of label shift. While we do not explicitly experiment with this method, our theoretical discussion and algorithmic extension still apply in the context of this variation on DANN.

3 Understanding domain alignment in domain generalization

Our discussion of source-source DANN for DG begins with the motivating target-error bound proposed by Albuquerque et al. (2020). Originally, given a set of source distributions \(\{{\mathbb {P}}_i\}\), the bound uses the set of mixture distributions having these sources as components—we refer to this set as \({\mathcal {M}}\). Below, we consider a more general adaptation of this result. Although the proof strategy is largely similar, we provide a proof of this more general re-statement.

Proposition 2

(adapted from Albuquerque et al. (2020); Proposition 2) Let \({\mathcal {X}}\) be a space and let \({\mathcal {H}}\) be a class of hypotheses corresponding to this space. Let \({\mathbb {Q}}\) and the collection \(\{{\mathbb {P}}_i\}_{i=1}^k\) be distributions over \({\mathcal {X}}\) and let \(\{\varphi _i\}_{i=1}^k\) be a collection of non-negative coefficients with \(\sum _i \varphi _i = 1\). Let the object \({\mathcal {O}}\) be a set of distributions such that for every \({\mathbb {S}} \in {\mathcal {O}}\) the following holds

$$\begin{aligned} \sum \nolimits _i \varphi _i d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_i, {\mathbb {S}}) \le \max \nolimits _{i,j} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_i, {\mathbb {P}}_j). \end{aligned}$$
(9)

Then, for any \(h \in {\mathcal {H}}\),

$$\begin{aligned} \begin{aligned} {\mathcal {E}}_{\mathbb {Q}}(h) \le \lambda _\varphi + \sum \nolimits _i \varphi _i {\mathcal {E}}_{{\mathbb {P}}_i}(h)&+ \tfrac{1}{2}\min \nolimits _{{\mathbb {S}} \in {\mathcal {O}}}d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {S}}, {\mathbb {Q}}) \\ {}&+ \tfrac{1}{2}\max \nolimits _{i,j} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_i, {\mathbb {P}}_j) \end{aligned} \end{aligned}$$
(10)

where \(\lambda _\varphi = \sum _i \varphi _i \lambda _i\) and each \(\lambda _i\) is the error of an ideal joint hypothesis for \({\mathbb {Q}}\) and \({\mathbb {P}}_i\).

Proof

Let \(h \in {\mathcal {H}}\). For each \({\mathbb {P}}_i\), apply Theorem 1 and multiply the resulting inequality by \(\varphi _i\) to obtain

$$\begin{aligned} \varphi _i{\mathcal {E}}_{\mathbb {Q}}(h) \le \varphi _i \lambda _i + \varphi _i{\mathcal {E}}_{{{\mathbb {P}}}_i}(h) + \frac{\varphi _i}{2} d_{{\mathcal {H}} \Delta {\mathcal {H}}}({\mathbb {Q}}, {\mathbb {P}}_i) \end{aligned}$$
(11)

Taking \(\lambda _\varphi = \sum _i \varphi _i \lambda _i\), we may sum over all k of these inequalities as below

$$\begin{aligned} \sum \nolimits _i \varphi _i{\mathcal {E}}_{\mathbb {Q}}(h) \le \lambda _\varphi + \sum \nolimits _i \left[ \varphi _i{\mathcal {E}}_{{{\mathbb {P}}}_i}(h) + \frac{\varphi _i}{2} d_{{\mathcal {H}} \Delta {\mathcal {H}}}({\mathbb {Q}}, {\mathbb {P}}_i)\right] . \end{aligned}$$
(12)

Since \(\sum _i \varphi _i = 1\) we can rewrite this as

$$\begin{aligned} {\mathcal {E}}_{\mathbb {Q}}(h) \le \lambda _\varphi + \sum \nolimits _i \varphi _i{\mathcal {E}}_{{{\mathbb {P}}}_i}(h) + \frac{1}{2}\sum \nolimits _i \varphi _i d_{{\mathcal {H}} \Delta {\mathcal {H}}}({\mathbb {Q}}, {\mathbb {P}}_i). \end{aligned}$$
(13)

Now, for each \({\mathbb {P}}_i\), the following is true because the \({\mathcal {H}}\)-divergence abides by the triangle inequality

$$\begin{aligned} d_{{\mathcal {H}} \Delta {\mathcal {H}}}({\mathbb {Q}}, {\mathbb {P}}_i) \le d_{{\mathcal {H}} \Delta {\mathcal {H}}}({\mathbb {Q}}, {\mathbb {S}}^*) + d_{{\mathcal {H}} \Delta {\mathcal {H}}}({\mathbb {S}}^*, {\mathbb {P}}_i) \end{aligned}$$
(14)

where

$$\begin{aligned} {\mathbb {S}}^* \in \mathop {\mathrm {arg\,min}}\limits \nolimits _{{\mathbb {S}}\in {\mathcal {O}}} d_{{\mathcal {H}} \Delta {\mathcal {H}}}({\mathbb {Q}}, {\mathbb {S}}). \end{aligned}$$
(15)

Since this is true for each \({\mathbb {P}}_i\), we may write

$$\begin{aligned} \begin{aligned} \frac{1}{2}\sum \nolimits _i \varphi _i d_{{\mathcal {H}} \Delta {\mathcal {H}}}({\mathbb {Q}}, {\mathbb {P}}_i)&\le \frac{1}{2}\sum \nolimits _i \varphi _i d_{{\mathcal {H}} \Delta {\mathcal {H}}}({\mathbb {Q}}, {\mathbb {S}}^*) + \frac{1}{2}\sum \nolimits _i \varphi _i d_{{\mathcal {H}} \Delta {\mathcal {H}}}({\mathbb {S}}^*, {\mathbb {P}}_i) \\&= \frac{1}{2} d_{{\mathcal {H}} \Delta {\mathcal {H}}}({\mathbb {Q}}, {\mathbb {S}}^*) + \frac{1}{2}\sum \nolimits _i \varphi _i d_{{\mathcal {H}} \Delta {\mathcal {H}}}({\mathbb {S}}^*, {\mathbb {P}}_i) \\&\le \frac{1}{2} d_{{\mathcal {H}} \Delta {\mathcal {H}}}({\mathbb {Q}}, {\mathbb {S}}^*) + \frac{1}{2}\max \nolimits _{i,j} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_i, {\mathbb {P}}_j) \end{aligned} \end{aligned}$$
(16)

where the last inequality is due to the choice \({\mathbb {S}}^* \in {\mathcal {O}}\). Recalling \({\mathbb {S}}^*\) is also a minimizer of \(d_{{\mathcal {H}} \Delta {\mathcal {H}}}({\mathbb {Q}}, \cdot )\) yields the result. \(\square\)

As suggested by Albuquerque et al. (2020), interpreting this result provides a reasonable motivation for the use of source-source DANN in DG. The first term is a convex combination of ideal-joint errors between each source and the target. As before, we assume this is small and remains small after feature transformation by \(r_\theta\) when we apply DANN; i.e., recall Sect. 2.3. Later, we discuss some differences between the ideal-error terms we give in our bound and the ideal-error terms in the original bound of Albuquerque et al. (2020). The second term is a convex combination of the source errors. ERM on a mixture of the sources is appropriate for controlling this term. In both of the previous convex sums, the coefficients are assumed to be fixed, but arbitrary, replicating a natural data generation process where amounts of data from each source are not assumed. Ben-David et al. (2010a) model data arising from multiple sources in this way and provide generalization bounds as well. For the third term, when \({\mathcal {O}}\) is fixed as the set of mixtures \({\mathcal {M}}\), Albuquerque et al. (2020) suggest this term demonstrates the importance of diverse source distributions, so that the unseen target \({\mathbb {Q}}\) might be “near” \({\mathcal {M}}\). We extend this discussion later, showing how this term can change dynamically throughout the training process. The final term is a maximum over the source-source divergences. Application of the interpretation by Ganin and Lempitsky (2015)—to align domains through representation learning—motivates the suggestion of Matsuura and Harada (2020) to maximize the errors of a multi-class (source-source) domain discriminator. A more precise application might be to train all pairwise combinations of binary domain discriminators, but as Albuquerque et al. (2020) point out, this leads to a polynomial number of discriminators. As a practical surrogate, we opt to employ the best empirical strategy to date (Matsuura & Harada, 2020). Another option might be to instead use a collection of one-versus-all classifiers in place of a multi-class classifier (Albuquerque et al., 2020). Note, neither method precisely minimizes Eq. (10), so we treat this as an implementation choice.

A remark on differences

As mentioned briefly, a reader familiar with the original statement of Albuquerque et al. (2020) will notice two differences: (1) rather than limiting consideration to the set of mixtures \({\mathcal {M}}\), this statement holds for all sets \({\mathcal {O}}\) which satisfy Condition (9) and (2) \(\lambda _\varphi\) is a different quantity for the ideal joint-error between \({\mathbb {Q}}\) and \(\{{\mathbb {P}}_i\}\).

On the latter point, rather than \(\lambda _\varphi\), Albuquerque et al. (2020) use the following definition of the ideal joint error, given by Zhao et al. (2018):

$$\begin{aligned} \lambda _* = \min _{h \in {\mathcal {H}}} \left\{ {\mathcal {E}}_{\mathbb {Q}}(h) + {\mathcal {E}}_{{\mathbb {S}}^*}(h) \right\} \end{aligned}$$
(17)

where \({\mathbb {S}}^* \in {\mathcal {M}}\) is the mixture distribution closest to \({\mathbb {Q}}\). As the original statement of Albuquerque et al. (2020) defines \({\mathcal {O}} = {\mathcal {M}}\), this definition is a perfectly reasonable choice. But, since our re-statement considers more general objects \({\mathcal {O}}\), we have removed this dependence on \({\mathcal {M}}\). As is visible in the proof, \(\lambda _\varphi\) does remove this dependence. In general, \(\lambda _*\) and \(\lambda _\varphi\) are incomparable. If one attempts to compare them, it will become evident that some assumptions must be made—e.g., on the relationship between the \(\{\varphi _i\}_i\) (which are arbitrary but fixed) and the coefficients used to form the mixture for \({\mathbb {S}}^*\) (which are dependent on \({\mathbb {Q}}\)). One reason to prefer \(\lambda _\varphi\) is that it does not require a single hypothesis to have low error on all sources simultaneously. Ben-David et al. (2010a) provide a larger discussion on the benefits of various approaches when combining data from multiple sources.

The former difference is of primary interest in this paper. Condition (9) may be considered to be the key fact about \({\mathcal {M}}\) which allows the derivation of Eq. (10). By identifying this, we open the possibility of considering more general objects satisfying Condition (9). In the following, we demonstrate the existence of such objects \({\mathcal {O}}\) and discuss the benefit they add.

3.1 Beyond mixture distributions

Consideration of general objects \({\mathcal {O}}\) which satisfy Condition (9) is only useful if such objects exist (besides \({\mathcal {M}}\)). The following example provides proof. See Fig. 1 for an illustrative picture.

Example 1

Let \({\mathcal {X}}\) be the real line \((-\infty , \infty )\) and let \({\mathcal {H}}\) be the set of hypotheses \(\{h_a(.)\}_{a \in {\mathbb {R}}}\) where \(h_a(.)\) is characteristic to the ray \((-\infty , a]\). Then, \({\mathcal {H}}\Delta {\mathcal {H}}\) is the set of hypotheses \(\{h_{a,b}(.)\}_{(a,b) \in {\mathbb {R}}^2}\) where \(h_{a,b}(.)\) is characteristic to the interval \([a, b]\). Let \({\mathbb {P}}_1\) be the uniform distribution \({\mathcal {U}}(0,2)\), let \({\mathbb {P}}_2\) be \({\mathcal {U}}(2,4)\), and let \({\mathbb {S}}\) be \({\mathcal {U}}(1,3)\). Then \({\mathbb {S}}\) is not a mixture distribution of the components \({\mathbb {P}}_1\) and \({\mathbb {P}}_2\), but

$$\begin{aligned} \begin{aligned} 2&= \max \nolimits _{i,j} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_i, {\mathbb {P}}_j) \ge \sum \nolimits _i \varphi _i d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_i, {\mathbb {S}}) \end{aligned} \end{aligned}$$
(18)

for all non-negative coefficients \(\{\varphi _i\}_i\) which sum to 1.

In the context of this example, we might consider the object \({\mathcal {O}} = {\mathcal {M}} \cup \{{\mathbb {S}}\}\) to quickly see that more than just \({\mathcal {M}}\) can satisfy Condition (9). If \({\mathbb {S}}\) is a unique minimizer of the third term in Eq. (10) and does not increase the final term, then using \({\mathcal {O}}\) in place of \({\mathcal {M}}\) actually produces a strictly tighter bound. Later we more generally expand on this and other benefits of considering \({\mathcal {O}} \ne {\mathcal {M}}\).
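Because every hypothesis in \({\mathcal {H}}\Delta {\mathcal {H}}\) here is characteristic to an interval, the divergence between two distributions on the line reduces to twice the largest difference in interval probabilities, which can be checked numerically. The short sketch below only illustrates Example 1; the grid resolution and the helper names are our own choices.

```python
import numpy as np

def interval_prob(lo, hi, a, b):
    """Probability that U(lo, hi) assigns to the interval [a, b]."""
    overlap = max(0.0, min(hi, b) - max(lo, a))
    return overlap / (hi - lo)

def d_hdh(p, q, grid=np.linspace(-1.0, 5.0, 241)):
    """2 * sup over intervals [a, b] of |P([a, b]) - Q([a, b])|, taken on a finite grid."""
    best = 0.0
    for i, a in enumerate(grid):
        for b in grid[i:]:
            best = max(best, abs(interval_prob(*p, a, b) - interval_prob(*q, a, b)))
    return 2.0 * best

P1, P2, S = (0.0, 2.0), (2.0, 4.0), (1.0, 3.0)
print(d_hdh(P1, P2))  # 2.0: the interval [0, 2] separates P1 from P2 perfectly
print(d_hdh(P1, S))   # 1.0: no interval separates P1 from S by probability more than 1/2
print(d_hdh(P2, S))   # 1.0: so any convex combination of the two stays below 2
```

The maximum value 2 on the left-hand side of Eq. (18) is attained by the sources themselves, while \({\mathbb {S}}\) stays at divergence 1 from each of them, so the condition holds for any choice of coefficients.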

Fig. 1

A visualization of Example 1. Best viewed in color. The green line gives the value b of a hypothesis in \(\{h_{a,b}(.)\}_{(a,b)}\) with \(a \le 0\). Such a hypothesis would perfectly discern \({\mathbb {P}}_1\) and \({\mathbb {P}}_2\). From this, it follows that \(d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_1, {\mathbb {P}}_2)=2\) because a hypothesis in \(\{h_{a,b}(.)\}_{(a,b)}\) can achieve 2 and 2 is the maximum value for any divergence. Note, from this, it already follows that Eq. (18) holds because each term on the right-hand-side is bounded above by 2, and therefore, so is their convex combination. Still, we can analyze the example further. If we imagine the red line also gives the value b of a hypothesis in \(\{h_{a,b}(.)\}_{(a,b)}\) with \(a \le 0\) and slide it back and forth, we can never perfectly discern \({\mathbb {P}}_1\) or \({\mathbb {P}}_2\) from \({\mathbb {S}}\) and therefore we will never achieve the maximum divergence 2

Still, one simple example cannot fully justify the existence of useful \({\mathcal {O}} \ne {\mathcal {M}}\). For a more general perspective, it is useful to think of things geometrically. Albuquerque et al. (2020) often refer to \({\mathcal {M}}\) as the convex-hull of the sources. In this same vein, we point out that \(d_{{\mathcal {H}}\Delta {\mathcal {H}}}\) is a pseudometric and therefore, shares most of the nice properties required of metrics used in the vast mathematical literature on metric spaces. Viewing a metric space as a topological space, it is common to think of open balls as the “fundamental unit” or “basis” of the metric space. Loosely, borrowing this idea, we can define the (closed) \({\mathcal {H}},\rho\)-ball as below

$$\begin{aligned} {\mathcal {B}}_\rho ({\mathbb {P}}) = \{{\mathbb {S}} \mid d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}, {\mathbb {S}}) \le \rho \}. \end{aligned}$$
(19)

Using this object, the following result provides some useful information on the types of objects \({\mathcal {O}}\) which satisfy Condition (9). See Fig. 2 for a helpful visualization of our results.

Fig. 2

An informal visualization. Blue dots represent sources. Purple lines define the boundaries of \({\mathcal {M}}\). Grey lines give the boundaries of the closed \({\mathcal {H}},\rho\)-balls around each source (defined in Proposition 3). Green colored areas define the boundary of \(\bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)\). Distributions within the yellow area may satisfy Condition (9). Distributions outside the yellow area (red dots) do not satisfy Condition (9) (Color figure online)

Proposition 3

Let \({\mathcal {X}}\) be a space and let \({\mathcal {H}}\) be a class of hypotheses corresponding to this space. Let the collection \(\{{\mathbb {P}}_i\}_{i=1}^k\) be distributions over \({\mathcal {X}}\) and let \(\{\varphi _i\}_{i=1}^k\) be a collection of non-negative coefficients with \(\sum _i \varphi _i = 1\). Now, set \(\rho = \max \nolimits _{u,v} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_u, {\mathbb {P}}_v)\). We show three results:

  1. \({\mathcal {M}} \subseteq \bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)\).

  2. If \({\mathbb {S}} \in \bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)\), then Condition (9) holds.

  3. If \({\mathbb {S}} \notin \bigcup \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)\), then Condition (9) fails to hold.

Proof

We begin with a proof of (1). Let \({\mathbb {S}} \in {\mathcal {M}}\) be arbitrary and write \({\mathbb {S}} = \sum _j \alpha _j {\mathbb {P}}_j\) for some non-negative coefficients \(\{\alpha _j\}_j\) summing to 1. The result follows by first observing, for all \({\mathbb {P}}_i\),

$$\begin{aligned} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_i, {\mathbb {S}}) \le \sum \nolimits _j \alpha _j d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_i, {\mathbb {P}}_j) \le \rho . \end{aligned}$$
(20)

The first inequality follows by a property of \({\mathcal {M}}\) shown by Albuquerque et al. (2020); for reference, we provide proof of this in Lemma 2 in the Appendix. The second inequality follows because \(\rho\) is defined as the largest source-source divergence. Now, since this is true for all \({\mathbb {P}}_i\), \({\mathbb {S}}\) is by definition contained in every \({\mathcal {H}},\rho\)-ball in the intersection \(\bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)\). If an element is contained in every component set of an intersection, then it is contained in the intersection. Thus, we have shown (1).

Next, we show (2). By definition of \({\mathcal {B}}_\rho ({\mathbb {P}}_i)\), if \({\mathbb {S}} \in {\mathcal {B}}_\rho ({\mathbb {P}}_i)\) then \(d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_i, {\mathbb {S}}) \le \rho\). Since \({\mathbb {S}} \in \bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)\) this is true for all \(i \in [k]\). Then,

$$\begin{aligned} \begin{aligned} \sum \nolimits _i \varphi _i d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_i, {\mathbb {S}}) \le \sum \nolimits _j \varphi _j \rho = \rho . \end{aligned} \end{aligned}$$
(21)

We again recall that \(\rho = \max \nolimits _{u,v} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_u, {\mathbb {P}}_v)\). Hence, we have shown (2).

Finally, we demonstrate (3). To see this, note that if \({\mathbb {S}} \notin \bigcup \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)\), then by definition for all i we have that \(d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_i, {\mathbb {S}}) > \rho\). We follow the chain of inequalities below to arrive at our result

$$\begin{aligned} \begin{aligned}&\sum \nolimits _i \varphi _i d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_i, {\mathbb {S}}) \\&\quad > \sum \nolimits _i \varphi _i \rho \\&\quad = \max \nolimits _{i,j} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_i, {\mathbb {P}}_j). \end{aligned} \end{aligned}$$
(22)

Hence, we have shown (3) and are done. \(\square\)

Statements 1 and 2 in conjunction show there are intuitive objects \({\mathcal {O}}\)—i.e., \(\bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)\)—which both contain \({\mathcal {M}}\) and satisfy Condition (9). Statement 3 provides an intuitive boundary for \({\mathcal {O}}\). Thus, comparison of \({\mathcal {O}}\) to the union and intersection of closed balls, respectively, provides necessary and sufficient conditions for satisfying Condition (9).
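In the one-dimensional setting of Example 1, these membership tests only require pairwise divergences, so they are easy to check directly; below is a small sketch reusing the hypothetical d_hdh helper from the earlier example (the candidate distributions are again our own choices).

```python
def in_intersection(candidate, sources):
    """True if `candidate` lies in the intersection of the closed H,rho-balls
    around the sources, with rho the largest pairwise source divergence."""
    rho = max(d_hdh(p, q) for p in sources for q in sources)
    return all(d_hdh(p, candidate) <= rho for p in sources)

# Overlapping sources U(0, 2) and U(1, 3) give rho = 1 (not the maximal value 2),
# so membership in the intersection is a non-trivial constraint.
sources = [(0.0, 2.0), (1.0, 3.0)]
print(in_intersection((0.5, 2.5), sources))  # True: within divergence 1 of both sources
print(in_intersection((2.0, 4.0), sources))  # False: divergence 2 from U(0, 2) exceeds rho
```

By statement 2, any distribution passing this test satisfies Condition (9).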

3.2 The benefits of looking beyond mixtures

While the above discussion is useful in its own right, a more careful discussion of practical ramifications is needed.

Computationally tighter bounds

First, we point out that different objects \({\mathcal {O}}\) can lead to computationally tighter bounds in Eq. (10). For a concrete example, we prove \(\bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)\) can lead to tighter bounds than \({\mathcal {M}}\) below. The proof follows a similar logic as presented following Example 1. In fact, for Example 1, it is true that \(\bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)\) contains \({\mathcal {M}} \cup \{{\mathbb {S}}\}\), and thus, may reap the discussed benefit.

Proposition 4

Let \({\mathcal {X}}\) be a space and let \({\mathcal {H}}\) be a class of hypotheses corresponding to this space. Let \({\mathbb {Q}}\) and the collection \(\{{\mathbb {P}}_i\}_{i=1}^k\) be distributions over \({\mathcal {X}}\) and set \(\rho = \max \nolimits _{i,j} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_i, {\mathbb {P}}_j)\). Let \({\mathbb {P}}^*\) be the distribution in \(\bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)\) closest to \({\mathbb {Q}}\) and let \({\mathbb {S}}^* \in {\mathcal {M}}\) be the mixture distribution closest to \({\mathbb {Q}}\). Then,

$$\begin{aligned} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}^*, {\mathbb {Q}}) \le d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {S}}^*, {\mathbb {Q}}). \end{aligned}$$
(23)

Now, further, suppose the only solution to

$$\begin{aligned} \min _{{\mathbb {P}} \in \bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i) }d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}, {\mathbb {Q}}) \end{aligned}$$
(24)

is contained in \(\bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i) {\setminus } {\mathcal {M}}\). Then, we have

$$\begin{aligned} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}^*, {\mathbb {Q}}) < d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {S}}^*, {\mathbb {Q}}). \end{aligned}$$
(25)

Proof

To see the first claim, note by Proposition 3, \({\mathcal {M}} \subseteq \bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)\). So it is clear that

$$\begin{aligned} \min _{{\mathbb {P}} \in \bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}, {\mathbb {Q}}) \le \min _{{\mathbb {S}} \in {\mathcal {M}}} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {S}}, {\mathbb {Q}}). \end{aligned}$$
(26)

Since \({\mathbb {P}}^*\) and \({\mathbb {S}}^*\) are arguments minimizing left- and right-hand-side, respectively, we are done.

Now, we show the second claim. Equation 26 holds regardless of our additional assumption, so we need only show that

$$\begin{aligned} \min _{{\mathbb {P}} \in \bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}, {\mathbb {Q}}) \ne \min _{{\mathbb {S}} \in {\mathcal {M}}} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {S}}, {\mathbb {Q}}). \end{aligned}$$
(27)

But this is clear because if we assume the contrary—that the two quantities are equal—the implication is that a solution to Eq. 24 is contained in \({\mathcal {M}}\), a contradiction. Therefore, we have our result. \(\square\)

Now, for DANN, our hypothesis will usually be a neural network. In this case, the benefit of tightness may be considered irrelevant because the large VC-Dimension of neural networks (Bartlett et al., 2019) is the dominant term in any bound on error (i.e., using the PAC framework). Still, this conversation is not complete without considering the recent success of PAC-Bayesian formulations (e.g., see Dziugaite and Roy (2017)) which provide much tighter bounds when the hypothesis is a stochastic neural network. In Appendix A, we discuss a PAC-Bayesian distribution pseudometric (Germain et al., 2020) analogous to \(d_{{\mathcal {H}} \Delta {\mathcal {H}}}\). Because this pseudometric shares the important properties of \(d_{{\mathcal {H}} \Delta {\mathcal {H}}}\), these results are easily re-framed in this more modern formulation as well—where tightness may be a primary concern.

Intuitive analysis

Second, we point out that a particular object \({\mathcal {O}}\) can be easier to analyze. This fact will become evident as we develop an algorithmic extension to DANN for DG. Ultimately, we find that the novel object \(\bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)\) may be manipulated to provide key motivating insights in algorithm design.

3.3 The \({\mathcal {H}}\Delta {\mathcal {H}}\)-divergence as a dynamic quantity

As mentioned, Albuquerque et al. (2020) interpret Proposition 2 as showing the necessity of diverse source distributions to control the third term \(\min _{{\mathbb {S}} \in {\mathcal {O}}} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {S}}, {\mathbb {Q}})\) when \({\mathcal {O}} = {\mathcal {M}}\). Logically, when the source distributions are heterogeneous, \({\mathcal {M}}\) presumably contains more elements, and so, the unseen target is more likely to be “close.” When \({\mathcal {O}} = \bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)\), this is easier to see because the size of \({\mathcal {O}}\) is directly dependent on the maximum divergence between the sources (by the definition of \(\rho\)). In particular, reducing the maximum divergence and re-computing \({\mathcal {O}}\) could lead to removal of a unique minimizer for \(\min _{{\mathbb {S}} \in {\mathcal {O}}} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {S}}, {\mathbb {Q}})\). In the context of the DANN algorithm, this is worrisome. Namely, during training, the point of using DANN is to effectively reduce the maximum divergence between sources and we expect this divergence to be decreasing as the feature representations of the source distributions are modified. In fact, under mild assumptions, we can formally show that DANN acts like a contraction mapping, and therefore, can only decrease the pairwise source-divergences. So, it is possible \(\min _{{\mathbb {S}} \in {\mathcal {O}}} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {S}}, {\mathbb {Q}})\) increases as the changing object \({\mathcal {O}}\) shrinks during training. Below we consider gradient descent on a smooth proxy of the \({\mathcal {H}} \Delta {\mathcal {H}}\)-Divergence in the simple, two-distribution case. The map \(r_\theta\) acts as the feature extractor affected by DANN.

Proposition 5

Let \({\mathfrak {D}}\) be a space of empirical samples over \({\mathcal {X}}\). Let \(r_\theta : {\mathcal {X}} \rightarrow {\mathcal {X}}\) be a deterministic representation function parameterized by the real vector \(\theta \in {\mathbb {R}}^m\). Further, denote by \(r_\theta (\widehat{{\mathbb {P}}})\) the application of \(r_\theta\) to every point of \(\widehat{{\mathbb {P}}} \in {\mathfrak {D}}\). Fix \(\widehat{{\mathbb {P}}}, \widehat{{\mathbb {Q}}} \in {\mathfrak {D}}\) and let \({\mathcal {L}}: {\mathfrak {D}} \times {\mathfrak {D}} \rightarrow [0, \infty )\) be a non-negative loss comparing two samples. Define \(\ell (\theta ) = {\mathcal {L}}(r_\theta (\widehat{{\mathbb {P}}}), r_\theta (\widehat{{\mathbb {Q}}}))\) and suppose it is differentiable with K-Lipschitz gradients. Further, suppose \(\theta ^*\) is the unique local minimum of \(\ell\) on a bounded subset \(\Omega \subset {\mathbb {R}}^m\). Then for \(\theta \in \Omega\) such that \(\theta \ne \theta ^*\), the function \(\tau : \Omega \rightarrow {\mathbb {R}}^m\) defined by \(\tau (\theta ) = \theta - \gamma \nabla _\theta \ell (\theta )\) has the property

$$\begin{aligned} {\mathcal {L}}(r_{\tau (\theta )}(\widehat{{\mathbb {P}}}), r_{\tau (\theta )}(\widehat{{\mathbb {Q}}})) \le \beta _\theta {\mathcal {L}}(r_\theta (\widehat{{\mathbb {P}}}), r_\theta (\widehat{{\mathbb {Q}}})) \end{aligned}$$
(28)

for some constant \(\beta _\theta\) dependent on \(\theta\). In particular, for all \(\theta \in \Omega\), there is \(\gamma\) so that \(0< \beta _\theta < 1\).

Proof

We proceed by first showing an important inequality for functions \(\ell\) with the assumed properties, in particular, using a derivation presented by Wright (2016). Note first, by Taylor’s Theorem, for vectors \(u,v \in {\mathbb {R}}^m\), we have

$$\begin{aligned} \begin{aligned} \ell (u + v)&= \ell (u) + \int _{0}^1 \nabla \ell (u + \xi v)^\textrm{T}v \ d\xi \\&= \ell (u) + \nabla \ell (u)^\textrm{T}v + \int _{0}^1 \left[ \nabla \ell (u + \xi v) - \nabla \ell (u) \right] ^\textrm{T}v \ d\xi \\&\le \ell (u) + \nabla \ell (u)^\textrm{T}v + \int _{0}^1 ||\nabla \ell (u + \xi v) - \nabla \ell (u) ||\ ||v ||\ d\xi \\&\le \ell (u) + \nabla \ell (u)^\textrm{T}v + \int _{0}^1 \xi K||v ||^2 \ d\xi \\&= \ell (u) + \nabla \ell (u)^\textrm{T}v + \frac{1}{2}K ||v ||^2. \end{aligned} \end{aligned}$$
(29)

where the first line, as mentioned, is by Taylor’s Theorem, the second is by addition and subtraction of \(\nabla \ell (u)^\textrm{T}v\), the third is by the Cauchy–Schwarz inequality (the inner product is bounded by the product of the norms), and the fourth is by the Lipschitz property assumed on the gradients of \(\ell\).

With this inequality, we let \(\theta \in \Omega\) with \(\theta \ne \theta ^*\). Taking \(u = \theta\) and \(v = -\gamma \nabla \ell (\theta )\) achieves

$$\begin{aligned} \begin{aligned} \ell (\tau (\theta ))&\le \ell (\theta ) - \gamma \nabla \ell (\theta )^\textrm{T}\nabla \ell (\theta ) + \frac{\gamma ^2K}{2} ||\nabla \ell (\theta )||^2 \\&= \ell (\theta ) + \gamma (\tfrac{1}{2}\gamma K - 1)||\nabla \ell (\theta )||^2. \end{aligned} \end{aligned}$$
(30)

Next, we note that for \(\theta \ne \theta ^*\) we have \(0 \le \ell (\theta ^*) < \ell (\theta )\) because \(\theta ^*\) was assumed to be the unique local minimum of \(\ell\) on \(\Omega\). Then, we may set

$$\begin{aligned} \beta _\theta = 1 + \gamma \left( \tfrac{1}{2} \gamma K - 1\right) \frac{||\nabla \ell (\theta )||^2}{\ell (\theta )} \end{aligned}$$
(31)

which, in combination with Eq. (30) yields our first desired result (Eq. (28)).

Next, we show that for all \(\theta \ne \theta ^*\), we can pick \(\gamma\) which forces \(0< \beta _\theta < 1\). We first note that it is sufficient to show

$$\begin{aligned} \frac{-\ell (\theta )}{||\nabla \ell (\theta )||^2}< \gamma \left( \tfrac{1}{2} \gamma K - 1\right) < 0 \end{aligned}$$
(32)

since we may simply multiply by the reciprocal of the lower bound and add one to realize the result. Next, we point out that there is some constant \(M > 0\) such that \(||\nabla \ell (\theta )|| \le KM\). This follows by

$$\begin{aligned} ||\nabla \ell (\theta )|| = ||\nabla \ell (\theta ) - \nabla \ell (\theta ^*)|| \le K ||\theta - \theta ^*|| \le KM \end{aligned}$$
(33)

where the equality holds because \(\theta ^*\) is a local minimum, the first inequality holds by the assumed Lipschitz property, and the second inequality holds because \(\Omega\) was assumed to be bounded. Without loss of generality, suppose \(M \ge 1\) (Eq. (33) holds regardless). Then our problem reduces further. In particular, it suffices to pick \(\gamma\) such that

$$\begin{aligned} \frac{-\ell (\theta )}{K^2M^2}< \gamma \left( \tfrac{1}{2} \gamma K - 1\right) < 0 \end{aligned}$$
(34)

since this lower bound is larger than or equal to that of Eq. (32). First, clearly, the upper bound holds when \(0< \gamma < \tfrac{2}{K}\), so this immediately restricts our choice of \(\gamma\). For the lower bound, we consider two cases for the value of \(\ell (\theta )\) and demonstrate there is \(\gamma\) with \(0< \gamma < \tfrac{2}{K}\) in both.

First, suppose \(\ell (\theta ) \ge \tfrac{1}{2}KM^2\). Then, if \(\tfrac{2}{K}> \gamma > \tfrac{1}{K}\) we have

$$\begin{aligned} \begin{aligned} 0&> \gamma \left( \tfrac{1}{2} \gamma K - 1\right)> \frac{-1}{2K} = \frac{-KM^2}{2K^2M^2} > \frac{-\ell (\theta )}{K^2M^2}. \end{aligned} \end{aligned}$$
(35)

Second, suppose \(\ell (\theta ) < \tfrac{1}{2}KM^2.\) Then if \(\gamma\) is such that

$$\begin{aligned} \frac{2}{K}> \gamma > \frac{1 - \sqrt{1 - \tfrac{2\ell (\theta )}{KM^2}}}{K} \end{aligned}$$
(36)

we have

$$\begin{aligned} \begin{aligned}&\gamma \left( \tfrac{1}{2} \gamma K - 1\right) + \frac{\ell (\theta )}{K^2M^2} \\&\qquad > \frac{K}{2} \left( \frac{1 - \sqrt{1 - \tfrac{2\ell (\theta )}{KM^2}}}{K}\right) ^2 - \frac{1 - \sqrt{1 - \tfrac{2\ell (\theta )}{KM^2}}}{K} + \frac{\ell (\theta )}{K^2M^2} \\&\qquad = 0. \end{aligned} \end{aligned}$$
(37)

Subtracting \(\tfrac{\ell (\theta )}{K^2M^2}\) from both sides of this inequality yields the desired lower bound. Further, we still have \(\gamma < \frac{2}{K}\), so the desired upper bound holds and we have our result.

Then, in any case, for each \(\theta \ne \theta ^*\), we can select \(\gamma\) so that \(0< \beta _\theta < 1\). \(\square\)

A key takeaway from the above is the presence of competing objectives during training. These objectives require balance. While DANN reduces the source-divergences to account for the final term in Eq. (10), we should also (somehow) consider the diversity of our sources throughout training to account for the affected term \(\min _{{\mathbb {S}} \in {\mathcal {O}}} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {S}}, {\mathbb {Q}})\). Another insight the reader gains (i.e., from reading the proof) is that the upper bound on \(\gamma\) is constant and the lower bound goes to 0 as \(\ell (\theta ) \rightarrow 0\). An interpretation of these bounds suggests the practical importance of an annealing schedule on \(\gamma\) during DANN training. In our own experiments, we anneal \(\gamma\) by a constant factor (i.e., step decay).
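As a toy numerical illustration of Proposition 5 and of the annealing remark (this is not the DANN objective itself), the sketch below runs gradient descent on a smooth divergence proxy, namely the squared distance between the means of two samples mapped by a simple scaling feature map, and applies step decay to \(\gamma\); the data, feature map, and schedule are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
P_hat = rng.normal(0.0, 1.0, size=(200, 2))   # empirical sample from one source
Q_hat = rng.normal(3.0, 1.0, size=(200, 2))   # empirical sample from another source
mean_gap = P_hat.mean(axis=0) - Q_hat.mean(axis=0)

def loss(theta):
    """Smooth proxy L(r_theta(P), r_theta(Q)) with the toy feature map
    r_theta(x) = theta * x (coordinate-wise scaling): squared distance between
    the mapped sample means. Any differentiable alignment loss plays this role."""
    return float(np.sum((theta * mean_gap) ** 2))

def grad(theta):
    return 2.0 * theta * mean_gap ** 2

theta, gamma = np.ones(2), 0.01
for step in range(1, 11):
    prev = loss(theta)
    theta = theta - gamma * grad(theta)   # tau(theta) from Proposition 5
    beta = loss(theta) / prev             # observed per-step contraction factor
    print(f"step {step:2d}  loss {loss(theta):.4f}  beta {beta:.3f}  gamma {gamma:.4f}")
    if step % 3 == 0:
        gamma *= 0.5                      # step decay: the annealing schedule
```

Each step reports a contraction factor \(\beta _\theta < 1\), and as \(\gamma\) decays the observed \(\beta _\theta\) moves back toward 1, i.e., the contraction of the proxy divergence slows.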

4 An algorithmic extension to DANN

Motivated by the argument presented in Sect. 3.3, this section devises an extension to DANN. While DANN acts to align domains, as noted, its success in the context of domain generalization is also dependent on the heterogeneity of the source distributions throughout the training process. Therefore, in an attempt to balance these objectives, we propose an addition to source-source DANN which acts to diversify the sources throughout the training. Note, while the theoretical principles of our approach are certainly applicable to other feature matching methods in the literature (see Sect. 6), the implementation of the algorithm we devise in this section may be different (i.e., if the feature matching method is not based on loss-modification and gradient updates).

Theoretical motivation

We recall the intersection of closed balls \({\mathcal {O}} = \bigcap \nolimits _i {\mathcal {B}}_{\rho } ({\mathbb {P}}_i)\); this is the main object of interest as it controls the size of the divergences in the upper bound of Proposition 2. More specifically, we are concerned with the quantity \(\min _{{\mathbb {P}} \in \bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}, {\mathbb {Q}})\). Intuitively, if we want to reduce this quantity, we should find some means to increase \(\rho\). One might propose to accomplish this by modifying our source distributions (e.g., through data augmentation), but clearly, modifying our source distributions in an uncontrolled manner is not wise. This ignores the structure of the space of distributions under consideration and whichever distribution governs our sampling from this space; such information is, in part, given by our sample of sources itself. In this sense, while increasing \(\rho\), we should preserve the structure of \(\bigcap \nolimits _i {\mathcal {B}}_{\rho } ({\mathbb {P}}_i)\) as much as possible. Proposition 6 identifies conditions we must satisfy if we wish to increase \(\rho\) and modify our source distributions in a way that is guaranteed to reduce the third term of the upper bound in Eq. (10).

Proposition 6

Let \({\mathcal {X}}\) be a space and let \({\mathcal {H}}\) be a class of hypotheses corresponding to this space. Let \({\mathfrak {D}}\) be the space of distributions over \({\mathcal {X}}\) and let the collection \(\{{\mathbb {P}}_i\}_{i=1}^k\) and the collection \(\{{\mathbb {R}}_i\}_{i=1}^k\) be contained in \({\mathfrak {D}}\). Now, fix non-negative mixture weights \(\alpha , \beta\) with \(\alpha + \beta = 1\) and consider the collection of mixture distributions \(\{{\mathbb {S}}_i\}_i\) defined so that for each set A, \(\textrm{Pr}_{{\mathbb {S}}_i}(A) = \alpha \textrm{Pr}_{{\mathbb {P}}_i}(A) + \beta \textrm{Pr}_{{\mathbb {R}}_i}(A)\). Further, set \(\rho = \max \nolimits _{i,j} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_i, {\mathbb {P}}_j)\) and \(\rho ^* = \max \nolimits _{i,j} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {S}}_i, {\mathbb {S}}_j)\). Then \(\bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i) \subseteq \bigcap \nolimits _i {\mathcal {B}}_{\rho ^*} ({\mathbb {S}}_i)\) whenever \(\rho ^* - \beta \max _i d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {R}}_i, {\mathbb {P}}_i) \ge \rho .\)

Proof

Let \({\mathbb {Q}} \in \bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)\) be arbitrary. Then, by definition, for all i, we have that

$$\begin{aligned} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_i, {\mathbb {Q}}) \le \rho . \end{aligned}$$
(38)

Then, for all i, we have

$$\begin{aligned} \begin{aligned}&d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {S}}_i, {\mathbb {Q}}) \le \alpha d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_i, {\mathbb {Q}}) + \beta d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {R}}_i, {\mathbb {Q}}) \\&\quad \le \alpha \rho + \beta d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {R}}_i, {\mathbb {Q}}) \\&\quad \le \alpha \rho + \beta d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {R}}_i, {\mathbb {P}}_i) + \beta d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}_i, {\mathbb {Q}}) \\&\quad \le (\alpha + \beta ) \rho + \beta d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {R}}_i, {\mathbb {P}}_i) \\&\quad = \rho + \beta d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {R}}_i, {\mathbb {P}}_i) \\&\quad \le \rho + \beta \max \nolimits _i d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {R}}_i, {\mathbb {P}}_i) \\&\quad \le \rho ^* \end{aligned} \end{aligned}$$
(39)

where the first inequality follows by Lemma 2, the second inequality follows because \({\mathbb {Q}} \in \bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)\) so the divergence is bounded by \(\rho\) for all i, the third inequality follows because, in general, the \({\mathcal {H}}\)-divergence abides by the triangle-inequality, the fourth inequality follows again because \({\mathbb {Q}} \in \bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)\), and the last inequality follows because we have assumed

$$\begin{aligned} \rho ^* - \beta \max \nolimits _i d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {R}}_i, {\mathbb {P}}_i) \ge \rho . \end{aligned}$$
(40)

Now, this is true for all i, so by definition of \(\bigcap \nolimits _i {\mathcal {B}}_{\rho ^*} ({\mathbb {S}}_i)\), we have that \({\mathbb {Q}} \in \bigcap \nolimits _i {\mathcal {B}}_{\rho ^*} ({\mathbb {S}}_i)\). Since \({\mathbb {Q}}\) was an arbitrary element of \(\bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)\), we have shown \(\bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i) \subseteq \bigcap \nolimits _i {\mathcal {B}}_{\rho ^*} ({\mathbb {S}}_i)\) and we have our result. \(\square\)

The above statement suggests that if we want to diversify our training distributions, we should train on a collection of modified source distributions \(\{{\mathbb {S}}_i\}_i\). The modified distributions are mixture distributions whose components are pairs of our original source distributions \(\{{\mathbb {P}}_i\}_i\) and new auxiliary distributions \(\{{\mathbb {R}}_i\}_i\). The choice of \(\{{\mathbb {R}}_i\}_i\) is constrained to guarantee the new intersection \(\bigcap \nolimits _i {\mathcal {B}}_{\rho ^*} ({\mathbb {S}}_i)\) (with modified sources) contains the original intersection \(\bigcap \nolimits _i {\mathcal {B}}_{\rho } ({\mathbb {P}}_i)\). Ultimately, this means we can guarantee \(\min _{{\mathbb {S}} \in \bigcap \nolimits _i {\mathcal {B}}_{\rho ^*}({\mathbb {S}}_i)} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {S}}, {\mathbb {Q}}) \le \min _{{\mathbb {P}} \in \bigcap \nolimits _i {\mathcal {B}}_\rho ({\mathbb {P}}_i)} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {P}}, {\mathbb {Q}})\).

Algorithm

Empirically speaking, our modified source samples \(\{\hat{{\mathbb {S}}}_i\}_i\) will be a mix of examples from the original sources \(\{{\mathbb {P}}_i\}_i\) and the auxiliary distributions \(\{{\mathbb {R}}_i\}_i\)—drawn from each proportionally to the mixture weights \(\alpha\) and \(\beta\). We plan to generate samples from the auxiliary distributions \(\{{\mathbb {R}}_i\}_i\) and our interpretation of Proposition 6 suggests we should do so subject to the constraint below

$$\begin{aligned} \max \nolimits _{i,j} d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {S}}_i, {\mathbb {S}}_j)- \beta \max \nolimits _i d_{{\mathcal {H}}\Delta {\mathcal {H}}}({\mathbb {R}}_i, {\mathbb {P}}_i) \ge \rho . \end{aligned}$$
(41)

Because \(\rho\) is a property of our original dataset, it is independent of the distributions \(\{{\mathbb {R}}_i\}_i\). This suggests that we should generate \(\{\hat{{\mathbb {R}}}_i\}_i\) to maximize the left hand side. Maximizing this requires: (Req.I) maximizing the largest divergence between the new source samples \(\{\hat{{\mathbb {S}}}_i\}_i\) and (Req.II) minimizing the largest divergence between our auxiliary samples \(\{\hat{{\mathbb {R}}}_i\}_i\) and our original source samples \(\{\hat{{\mathbb {P}}}_i\}_i\). Algorithmically, we can coarsely approximate these divergences, again appealing to the interpretation provided by Ben-David et al. (2010a) and Ganin and Lempitsky (2015): (Req.I) requires that our domain discriminator make fewer errors when discriminating the new source samples \(\{\hat{{\mathbb {S}}}_i\}_i\) and (Req.II) requires that the auxiliary samples \(\{\hat{{\mathbb {R}}}_i\}_i\) and the original sources \(\{\hat{{\mathbb {P}}}_i\}_i\) be indiscernible by our domain discriminator.

To implement these requirements, we modify our dataset through gradient descent. Suppose that \(\hat{{\mathbb {P}}}_j\) is an empirical sample from the distribution \({\mathbb {P}}_j\). We can alter data points \(a^j \sim \hat{{\mathbb {P}}}_j\) to generate data points \(b^j \sim \hat{{\mathbb {R}}}_j\) by setting \(x^j(0) = a^j\) and iterating the update rule below for T steps to minimize \({\mathcal {L}}_{SD}\)

$$\begin{aligned} \begin{aligned} x^j(t) \leftarrow x^j(t-1) - \eta \nabla _{x} {\mathcal {L}}_{SD}(\mu , \theta , x^j(t-1), j) \end{aligned} \end{aligned}$$
(42)

and then taking \(b^j = x^j(T)\). Importantly, we do not modify the domain labels during this procedure. Our updates therefore satisfy requirement (Req.I) because minimization of \({\mathcal {L}}_{SD}\) approximates minimization of our domain discriminator’s errors, and they satisfy (Req.II) because \(a^j\) and \(b^j\) are identically labeled, so minimization of the domain discriminator’s errors suggests that these examples should be indiscernible (i.e., assigned the same correct label).

While this update rule seemingly accomplishes our algorithmic goals, we must recall the final upper bound we wish to minimize (see Eq. (10)). The first two terms in this bound, \(\lambda _\varphi\) and \(\sum _i \varphi _i{\mathcal {E}}_{{\mathbb {P}}_i}(h)\), relate to our classification error—i.e., to the task-specific network \(c_\sigma\). If our generated distributions \(\{\hat{{\mathbb {R}}}_i\}_i\) distort the underlying class information, these terms may grow uncontrollably. To account for this, we further modify the update rule of Eq. (42) to minimize the change in the probability distribution output by the task classifier. We measure the change caused by our updates using the loss \({\mathcal {L}}_{KL}\)—i.e., the KL-Divergence (Kullback, 1997). This gives the modified update rule

$$\begin{aligned} \begin{aligned} x^j_i(t) \leftarrow&x^{j}_i(t-1) - \eta \nabla _{x} \left[ {\mathcal {L}}_{SD}(\mu , \theta , x^{j}_i(t-1), j) \right. \\ {}&\quad \left. + {\mathcal {L}}_{KL}(c_\sigma \circ r_\theta (x^j_i(0)), c_\sigma \circ r_\theta (x^{j}_i(t-1)))\right] . \end{aligned} \end{aligned}$$
(43)
Algorithm Block 1 (DANNCE pseudo-code)
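To make the update of Eq. (43) concrete, the following is a minimal PyTorch-style sketch, assuming `r_theta`, `c_sigma`, and `d_mu` denote the feature extractor, task classifier, and domain discriminator; dropping the KL term recovers Eq. (42). This is a sketch under these naming assumptions, not our exact implementation (see Algorithm Block 1 and Appendix B).

```python
import torch
import torch.nn.functional as F

def cooperative_update(x, domain_idx, r_theta, c_sigma, d_mu, steps=5, eta=1.0):
    """Generate cooperative examples from a batch x of domain `domain_idx`
    via T gradient steps on the images themselves (Eq. (43)).
    Domain labels are kept fixed throughout."""
    x0 = x.detach()
    # Reference class probabilities of the unmodified images (for the KL term).
    with torch.no_grad():
        p0 = F.softmax(c_sigma(r_theta(x0)), dim=1)

    x_t = x0.clone()
    for _ in range(steps):
        x_t.requires_grad_(True)
        feats = r_theta(x_t)
        # L_SD: domain-discriminator loss w.r.t. the (unchanged) domain label.
        domain_target = torch.full((x_t.size(0),), domain_idx,
                                   dtype=torch.long, device=x_t.device)
        loss_sd = F.cross_entropy(d_mu(feats), domain_target)
        # L_KL: keep the task classifier's output close to its original prediction.
        log_p_t = F.log_softmax(c_sigma(feats), dim=1)
        loss_kl = F.kl_div(log_p_t, p0, reduction="batchmean")
        grad, = torch.autograd.grad(loss_sd + loss_kl, x_t)
        x_t = (x_t - eta * grad).detach()   # gradient step on the image
    return x_t  # b^j = x^j(T)
```

In a training batch, a \(\beta\) fraction of the images would then be replaced by their cooperative versions before the usual DANN update.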

Interpretation

In totality, this algorithm may be seen as employing a style of adversarial training where, rather than generating examples to fool a task classifier (e.g., the single-source DG approach of Volpi et al. (2018)), we instead generate examples to exploit the weaknesses of the feature extractor \(r_\theta\), whose goal is to fool the domain discriminator. In this sense, the generated examples can be interpreted as cooperating with the domain discriminator. Hence, we refer to the technique as DANN with Cooperative Examples, or DANNCE. For details on our implementation of DANNCE, please see the pseudo-code in Algorithm Block 1. Additional details can also be found in Appendix B.

5 Experimentation

In this section, we aim to address the primary point argued throughout this paper: the application of DANN to DG can benefit from (algorithmic) consideration of source diversity. While our theoretical discussion focuses heavily on convex hulls and \({\mathcal {H}},\rho\)-balls, we remind the reader that our theoretical results and algorithm are applicable to any distribution; these geometric objects are only used as a theoretical reference to compare the target to the training data. Since these objects are challenging to compute for arbitrary distributions, we instead validate our theoretical insights through the algorithm they produce. Namely, our modus operandi is comparison to recent state-of-the-art methods that use a source-source DANN, or other domain alignment techniques, for domain generalization. See Appendix B and the code provided in the supplement for all implementation details and additional experiments.

Datasets and hyper-parameters

We evaluate our method on two multi-source DG datasets. (1) PACS (Li et al., 2017) contains 4 different styles of images (Photo, Art, Cartoon, and Sketch) with 7 common object categories. (2) Office-Home (Venkateswara et al., 2017) also contains 4 different styles of images (Art, Clipart, Product, and Real-World) with 65 common categories of daily objects. For both datasets, we follow standard experimental setups. We use 1 domain as the target and the remaining 3 domains as sources. We report the average classification accuracy on the unseen target over 3 runs, using the model state at the last epoch to avoid peeking at the target. We select our hyper-parameters using leave-one-source-out CV (Balaji et al., 2018); this again avoids using the target in any way. Because some methods select parameters using a source train/val split, we use only the training data of the standard splits for fairness. Other parameters of our setup, unrelated to our own method, are selected based on the environment of Matsuura and Harada (2020) (MMLD), a SOTA source-source DANN technique. For full details, see Appendix B.
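For concreteness, the loop below sketches the leave-one-domain-out protocol with accuracy averaged over 3 runs; `train_model` and `evaluate` are hypothetical callables standing in for the training and evaluation pipeline detailed in Appendix B.

```python
import statistics

PACS_DOMAINS = ["photo", "art_painting", "cartoon", "sketch"]

def leave_one_domain_out(domains, train_model, evaluate, n_runs=3):
    """For each held-out target, train on the remaining sources and report
    the mean last-epoch accuracy over n_runs; the target is never used
    for model or hyper-parameter selection."""
    results = {}
    for target in domains:
        sources = [d for d in domains if d != target]
        accs = [evaluate(train_model(sources, seed=s), target) for s in range(n_runs)]
        results[target] = statistics.mean(accs)
    return results
```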

Our models

For the feature extractor \(r_\theta\) we use AlexNet (Krizhevsky et al., 2012) for PACS, and ResNet-18 (He et al., 2016) for both PACS and OfficeHome. Both are pretrained on ImageNet with the last fully-connected (FC) layer removed. For the task classifier \(c_\sigma\) and domain discriminator \(d_\mu\) we use only FC layers. For ERM (often called Vanilla or Deep All), only \(r_\theta\) and \(c_\sigma\) are used and the model is trained on a mixture of all sources; this is a traditional DG baseline. For DANN, we add the domain discriminator \(d_\mu\) and additionally update \(r_\theta\) with \({\mathcal {L}}_{SD}\) (see Eq. (8)). Because we ultimately compare against DANN as a baseline, we must ensure our implementation is state-of-the-art. Therefore, we generally follow the implementation described by Matsuura and Harada (2020), adding a commonly used entropy loss (Bengio et al., 1992; Shu et al., 2018) and phasing in the impact of \({\mathcal {L}}_{SD}\) on \(r_\theta\) by setting \(\lambda =2/(1+\exp (-\kappa \cdot p))-1\) in Eq. (8) with \(p=\text {epoch}/\text {max\_epoch}\) and \(\kappa = 10\).
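As a small reference for the phase-in schedule just described, the helper below computes \(\lambda\) as a function of the epoch; the surrounding DANN machinery (gradient reversal, entropy loss) follows the cited implementations and is not reproduced here, and the function name is illustrative.

```python
import math

def dann_lambda(epoch, max_epoch, kappa=10.0):
    """Phase-in coefficient lambda = 2 / (1 + exp(-kappa * p)) - 1,
    with p = epoch / max_epoch, so lambda grows from 0 toward 1."""
    p = epoch / max_epoch
    return 2.0 / (1.0 + math.exp(-kappa * p)) - 1.0

# For example, with 30 training epochs:
# [round(dann_lambda(e, 30), 3) for e in (0, 5, 15, 30)] -> [0.0, 0.682, 0.987, 1.0]
```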

For our proposed method, DANNCE, we use the same baseline DANN, but update 50% of the images (i.e., \(\beta =0.5\)) to cooperate with the domain discriminator following Eq. (43). The number of update steps per image is 5 (i.e., \(T=5\)).

Table 1 PACS and OfficeHome results in accuracy (%)

Experimental baselines

As mentioned, we focus on comparison to other methods proposing domain alignment for DG. Albuquerque et al. (2020) (G2DM) and Li et al. (2018b) (MMD-AAE) propose variants of DANN, and in particular, align domains by making updates to the feature extractor. As noted, Matsuura and Harada (2020) (MMLD) propose the DANN setup most similar to our baseline DANN. For MMLD, Matsuura and Harada (2020) additionally propose a source domain mixing algorithm; we denote this by MMLD-K, with K the number of domains after re-clustering. Shankar et al. (2018) (CrossGRAD) and Zhou et al. (2020) (DDAIG), contrary to our work, generate examples which maximize the domain loss. Because they do not update the feature extractor with the domain loss \({\mathcal {L}}_{SD}\) as we do, this may actually be viewed as domain alignment by data generation (see Liu et al. (2019), who first propose this technique). For MMD-AAE and CrossGRAD, we use results reported by Zhou et al. (2020) because the original works do not evaluate on our datasets.

Analysis of performance

Generally, in DG, comparisons of performance across different experimental setups are difficult to make fairly, a problem highlighted by a recent commentary on the experimental rigor of DG setups (Gulrajani & Lopez-Paz, 2020). As such, we include reported results from other experimental setups predominantly to show that our DANN implementation is a competitive baseline. This much is visible in Table 1: for 2 out of 3 setups, our DANN alone has higher overall accuracy than any other method.

Our focus, then, is the validation of our main argument using our strong DANN baseline. In this context, as shown in Table 1, ablation of DANNCE reveals substantial improvement upon the traditional source-source DANN in all PACS setups and (seemingly) marginal improvement in the OfficeHome setup. While the performance improvements on OfficeHome may seem marginal, they actually represent a reasonable gain, since OfficeHome has 65 categories to classify compared to 7 in PACS. Ultimately, the performance gains demonstrated by the addition of DANNCE agree with our main argument: increasing diversity when aligning domains can have practical benefits in DG.

Fig. 3

Cooperative Examples (bottom) and corresponding original image after pre-processing (top) for PACS setup with target sketch and ResNet-18. Gradient updates appear to introduce relatively large changes in color hues/tints. Changes in image texture are also present (see Fig. 4)

Fig. 4

Cooperative Examples (bottom) and original images (top) magnified to illustrate change in image texture. Setup is identical to Fig. 3

Fig. 5

Domain Discriminator Loss of DANN and DANNCE on PACS. For each target, we show the loss of its corresponding sources during training

Fig. 6

Domain Discriminator Loss of DANNCE with 5 and 20 Steps of Image Updates

Analysis of loss curves

To measure domain diversity, we use the loss of the domain discriminator (averaged per epoch). This loss serves as a proxy for the \({\mathcal {H}}\)-divergence (an inverse relationship): a lower loss should indicate more domain diversity, and it has the benefit of dynamically measuring diversity during training. Figure 5 shows the domain discriminator loss across epochs for our implementations of DANN and DANNCE using AlexNet on PACS. We generally see that, after epoch 15, the loss for DANNCE is lowest. Figure 6 further shows the effect of increasing the number of steps per image update. This suggests that increasing the number of updates gives some control over the source domain diversity, as intended. Finally, in both figures, epochs 10 to 24 show that the (inverted) smooth proxy for the domain divergence is increasing. This agrees with the formal claim made in Proposition 5. Although the trend changes after epoch 24, this is likely due to a decrease in \(\gamma\) at this epoch, and thus does not necessarily disagree with our formal claim.
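For reference, the snippet below sketches how such a per-epoch diversity proxy can be computed; the module and loader names are illustrative rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def epoch_domain_loss(feature_extractor, discriminator, loader, device="cpu"):
    """Average domain-discriminator cross-entropy over one pass of the sources.
    Lower values mean the discriminator separates the sources more easily,
    i.e., more domain diversity (an inverse proxy for the divergence)."""
    total, count = 0.0, 0
    feature_extractor.eval()
    discriminator.eval()
    with torch.no_grad():
        for images, domain_labels in loader:  # batches of (image, domain index)
            images = images.to(device)
            domain_labels = domain_labels.to(device)
            logits = discriminator(feature_extractor(images))
            total += F.cross_entropy(logits, domain_labels, reduction="sum").item()
            count += images.size(0)
    return total / count
```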

6 Related works

6.1 Domain adaptation theory

Many works extend the theoretical framework of Ben-David et al. (2010a) to motivate new variants of DANN. Zhao et al. (2018) consider the multi-source setting, Schoenauer-Sebag et al. (2019) consider a multi-domain setting in which all domains have labels but large variability across domains must be handled, and Zhang et al. (2019, 2020) consider theoretical extensions to the multi-class setting using a margin loss. Besides the theoretical perspective of Ben-David et al. in DA, there are many other works to consider. Mansour et al. (2009) consider the case of general loss functions rather than the 0-1 error. Kuroki et al. (2019) consider a domain divergence which depends on the source domain and, through this dependence, always produces a tighter bound. Flamary et al. (2016) frame domain adaptation in terms of optimal transport. Many works also consider integral probability metrics, including Redko et al. (2017), Shen et al. (2018), and Johansson et al. (2019). As has been discussed in this paper, the assumptions of various domain adaptation theories are of particular importance. Consequently, these assumptions are also important for DG. We discuss some assumptions in more detail in the next subsection.

6.2 Assumptions in DA

Ben-David et al. (2010b) show that controlling their divergence term as well as the ideal joint error \(\lambda\) (so that both are small) gives necessary and sufficient conditions for a large class of domain adaptation learners. These are the conditions which we control (in the case of the divergence term) and assume (in the case of the ideal joint error). Other assumptions for DA include the covariate shift assumption, in which the marginal feature distributions may change but the label distributions conditioned on features remain constant across domains. As we have discussed, Zhao et al. (2019) show that this assumption is not always enough in the context of DANN, and Johansson et al. (2019) provide similar conceptualizations. Still, this assumption can be useful in the context of model selection (Sugiyama et al., 2007; You et al., 2019). Another common assumption is label shift: the marginal label distributions disagree, but the feature distributions conditioned on labels are the same. Again, this is related to the concern of Zhao et al. (2019), since significant disagreement in the label distributions can cause DANN to fail. Lipton et al. (2018) provide adaptation algorithms for this particular situation. Another assumption one can make for the benefit of algorithm design is the notion of generalized label shift, in which the label distributions may disagree but the feature distributions conditioned on labels agree in an intermediate feature space. As we have noted, Tachet et al. (2020) propose this assumption, devise new theoretical arguments under it, and suggest a number of algorithms based on their proposal.

6.3 Domain generalization theory

For DG, there is decidedly less theoretical work, but throughout our text we have attempted to compare to the most relevant (and recent): a bound proposed by Albuquerque et al. (2020). Nevertheless, some different theoretical perspectives on DG do exist. Li et al. (2020) consider the case where the feature conditional distribution of the target's latent space is a linear combination of the sources', effectively moving the convex-hull concept to a learned feature-conditional latent space. Ye et al. (2021) consider the learnability of a DG problem, providing rigorous definitions of which problems one can expect to solve and which problems one cannot. The accompanying generalization bounds assume this definition of learnability, whereas the bounds in our work do not. Instead, our bounds are applicable to all distributions and may be thought of as incorporating some idea of "learnability" into the bound itself via the hypothesis class and reference objects like the set of mixtures. For the case where the number of sampled domains may be larger, Blanchard et al. (2011, 2021) and Deng et al. (2020) consider domain generalization from the perspective of a meta-distribution which governs our observation of domains. Asymptotically, as we observe more domains, we can be more confident in the success of our algorithm. While this approach is interesting, our paper instead focuses on the case where we only have a relatively small number of domains from which to learn. In general, it is important to realize that DG is a challenging problem where some assumptions must be made in order to provably guarantee the success of a learning algorithm. Different theoretical frameworks with different assumptions may be more or less applicable to different real-world problems.

6.4 Algorithms in DG

Besides DANN and other domain-aligning algorithms mentioned in this text, there are, of course, additional algorithmic perspectives on DG. An early work in DG by Muandet et al. (2013) proposes a kernel-based algorithm aimed at producing domain-invariant features with a strong theoretical justification. More recently, a common thread is the use of meta-learning (e.g., to simulate domain holdout), seen in Li et al. (2018a), Balaji et al. (2018), and Dou et al. (2019). Some authors, such as Wang et al. (2019) and Carlucci et al. (2019), make additional assumptions on the domains to be seen and use this in algorithm design. As mentioned, similar to our own algorithm, many works emphasize the importance of increasing the diversity of the source data during training: Volpi et al. (2018), Albuquerque et al. (2020), Zhou et al. (2021), and Zhang et al. (2022). In addition to the distinctions in algorithm design, our work also differs from these in its emphasis on the competing objectives this diversification produces in feature-matching algorithms and in its accompanying theoretical analysis. Lastly, some works focus on the neural network components themselves, e.g., Li et al. (2017). These architecture changes can be very effective (see Seo et al. (2019) for impressive results when modifying batch normalization). Related to our paper's main point, we primarily focus on comparison to other methods proposing domain alignment for DG, especially those which are, in some sense, model agnostic. These additional references are discussed in Experimental Baselines.

7 Discussion

In this work, we investigate the applicability of source-source DANN for domain generalization. Our theoretical results and interpretation suggest a complex relationship between the heterogeneity of the source domains and the usual process of domain alignment. Motivated by this, we construct an algorithmic extension for DANN which diversifies the sources via gradient-based image updates. Our empirical results and analyses support our findings.

One of the motivations of our algorithm is also one of the predominant limitations of our study. In particular, the behavior of DANN as a dynamic process is not well understood. Studying it as such can reveal new information. For example, in the proof of Proposition 5, we saw the importance of annealing the learning rate for DANN. We also use Proposition 5 to motivate our algorithm design, but there are certainly open questions on the dynamic behavior of DANN and DANNCE. For example, it would be interesting to consider the competing objectives we have discussed in a more analytically tractable environment. Even for simple distributions, it is an open question how the hyper-parameters of DANNCE, which intuitively balance the competing objectives, may be optimally selected. On a related note, although we have assumed the ideal joint error is generally small, we have also pointed out that this is not always the case (Zhao et al., 2019). While our promising results indicate this may not be an issue in practice, it would still be interesting to consider this from a more theoretical perspective. Finally, it is important to point out that our empirical investigation was limited to images. It would be interesting to consider how our technique might extend to natural language or other areas where gradient-based algorithms are used for learning.