On the benefits of representation regularization in invariance based domain generalization

A crucial aspect of reliable machine learning is to design a deployable system that generalizes to new, related but unobserved environments. Domain generalization aims to alleviate the prediction gap between the observed and unseen environments. Previous approaches commonly incorporate learning an invariant representation to achieve good empirical performance. In this paper, we reveal that merely learning the invariant representation is vulnerable to related but unseen environments. To this end, we derive a novel theoretical analysis that controls the unseen test-environment error in representation learning, highlighting the importance of controlling the smoothness of the representation. In practice, our analysis further inspires an efficient regularization method to improve robustness in domain generalization. The proposed regularization is orthogonal to, and can be straightforwardly adopted in, existing domain generalization algorithms that ensure invariant representation learning. Empirical results show that our algorithm outperforms the base versions across various datasets and invariance criteria.


Introduction
Most research in deep learning assumes that models are trained and tested on a fixed distribution. However, such deep models generally fail in real-world applications, because the test environment is often different from the training (or observed) environments. Thus, the capacity to generalize to a new environment is crucial for developing reliable and deployable deep learning algorithms (e.g., [9]).
To this end, Domain Generalization has recently been proposed and studied to alleviate the prediction gap between the observed training (S) and unseen test (T) environments. Taking advantage of the inductive bias learned from multiple observed sources, the prediction on the test environment can be guaranteed in some specific scenarios [4].
Meanwhile, extrapolation to a new environment is challenging since the environmental distribution shifts are inevitable and unknown in advance. Such changes typically include covariate shift [25], conditional shift [15,3], or both. Based on different distribution-shift assumptions, a widely adopted principle is to learn a representation that satisfies several invariance criteria [6] among the observed environments (i.e., the sources S). Through minimizing the source prediction risk and enforcing the invariance, the prediction performance can be improved in many empirical scenarios [18,15].
Although learning invariance is popular in domain generalization with certain practical success, its theoretical counterpart remains elusive. For instance: Is it sufficient to merely learn an invariant representation and minimize source risks to guarantee good performance in a new environment? What are the sufficient conditions to guarantee a small test-environment error?
Contributions In this paper, we aim to address these fundamental problems in domain generalization. Concretely, (1) We reveal the limitation of representation learning in domain generalization through barely ensuring invariance criteria, which can lead to an over-matching on the observed environments, e.g., a complex or non-smooth representation function will be vulnerable to an unseen distribution shift; (2) We derive a novel theoretical analysis to upper bound the unseen test-environment error in the context of representation learning, which highlights the importance of controlling the complexity of the representation function. We further formally demonstrate the Lipschitz property as a sufficient condition for ensuring the smoothness of the representation function; (3) In practice, we propose Jacobian-matrix regularization as a new criterion, evaluated across various invariance criteria and datasets, and the empirical results suggest improved performance in predicting the test environment.

Background and Motivation
Throughout this paper, we have T observed (source) environments S_1(x, y), . . ., S_T(x, y) with x ∈ X, y ∈ Y. The goal of domain generalization is to learn a proper representation φ : X → Z and classifier h : Z → Y that perform well on the (unseen) test environment T(x, y).
Specifically, let L denote the prediction loss; domain generalization can be formulated as minimizing the following loss:

min_{h,φ} (1/T) Σ_{t=1}^T E_{(x,y)∼S_t} L(h ∘ φ(x), y) + λ_0 INV(φ, S_1, . . ., S_T),    (1)

where INV(φ, S_1, . . ., S_T) is an auxiliary task that ensures invariance among the observable source environments, and takes various forms:
1. Marginal feature invariance [8], through enforcing ∀i, j: S_i(z) = S_j(z);
2. Feature-conditional invariance [14,28], through enforcing ∀i, j and ∀y ∈ Y: S_i(z|Y = y) = S_j(z|Y = y);
3. Label-conditional invariance [3,12], through enforcing ∀i, j: S_i(y|z) = S_j(y|z).
The aforementioned invariance principles have been broadly adopted in domain generalization with various empirical algorithms. However, the following counterexamples reveal that merely optimizing Eq. (1) with different invariance criteria is not a sufficient condition for guaranteeing a reliable prediction in the unseen (test) environment.
In Fig. 1, we illustrate the limitations of these three invariance principles. Specifically: (1) Enforcing marginal invariance is problematic when the conditional distributions are different. In Fig. 1(a), the two observed environments have different label proportions. A simple linear embedding function φ can ensure S_1(z) = S_2(z). However, when we adopt a shared classifier h, the output prediction distributions ŷ = h(z) are identical. Clearly, this is problematic since the label distributions between the environments can be significantly different.
(2) Compared with marginal invariance, feature- and label-conditional invariances impose stronger principles. However, the prediction can still be vulnerable in the test environment due to over-matching. Specifically, in Fig. 1(b, Left), if we adopt the embedding function φ and classifier h shown there, then in the latent space z, ∀y ∈ Y we have the conditional invariances S_1(y|z) = S_2(y|z) and S_1(z|y) = S_2(z|y), together with zero prediction error in the observed environments: E_{(x,y)∼S_t} L(h ∘ φ(x), y) = 0. However, at test time, if the unseen environment has a consistent shift as in Fig. 1(b, Right) such that ∀y, d_TV(T(x|Y = y) ‖ S_2(x|Y = y)) = ε with 0 < ε < 0.5, then the prediction error w.r.t. the (0-1) binary loss is E_{(x,y)∼T} L(h ∘ φ(x), y) = ε, which is non-negligible under the consistent distribution shift. Moreover, this problem can be much more severe on high-dimensional datasets and with over-parametrized deep neural networks.
The limitation of Eq. (1) is the potential over-matching of the embedding function: there exist infinitely many φ that minimize Eq. (1) in Fig. 1(b). However, some embeddings are rather complex and generalize poorly to the new environment. In fact, only a subset of φ is robust to the consistent environment shift, which suggests a proper model selection w.r.t. φ. In the following sections, we derive theoretical results to demonstrate the influence of model selection w.r.t. φ.

Theoretical Analysis
We aim at a formal understanding of the regularization term in predicting the unseen test environment. Let the embedding be a random transformation (or transition probability kernel) Φ(z|x) : X → Z, where the deterministic representation function is a special case with Φ(z|X = x) = δ_{φ(x)}, δ being the delta function. The distributions induced on the latent space Z are denoted as S(z) = ∫ Φ(z|x)S(x)dx and S(z|Y = y) = ∫ Φ(z|x)S(x|Y = y)dx. Before presenting the theoretical results, there are two additional elements to clarify. Performance Metric Throughout this paper we use the Balanced Error Rate (BER) rather than the conventional risk to measure performance, since the training and test environments can be highly label-imbalanced. Specifically, the prediction risk w.r.t. the classifier h and embedding distribution Φ is

BER_S(h, Φ) = (1/|Y|) Σ_{y∈Y} E_{z∼S(z|Y=y)} L(h(z), y).

Intuitively, BER measures the uniform average of the classification error over the classes.
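As a concrete illustration, the metric can be sketched in a few lines of pure Python (with hypothetical toy data; any real pipeline would compute it on a held-out set):

```python
# A minimal sketch of the Balanced Error Rate: the classification error is
# averaged uniformly over classes, so a majority class cannot dominate the
# metric the way it does in plain empirical risk.
def balanced_error_rate(y_true, y_pred):
    """BER = (1/|Y|) * sum_y P(h(x) != y | Y = y)."""
    classes = sorted(set(y_true))
    per_class_err = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        errors = sum(1 for i in idx if y_pred[i] != c)
        per_class_err.append(errors / len(idx))
    return sum(per_class_err) / len(classes)

# Imbalanced toy data: 8 samples of class 0, 2 of class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10          # classifier always predicts the majority class
print(balanced_error_rate(y_true, y_pred))  # -> 0.5 (err 0.0 on class 0, 1.0 on class 1)
```

Note the contrast with the plain error rate, which would report only 0.2 for this degenerate majority-class predictor.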
Invariance Criteria In our analysis, we mainly focus on the feature-conditional invariance, since the label information is generally discrete or low-dimensional, which makes this criterion relatively straightforward to realize in practice. We will further justify that feature-conditional invariance can also induce label-conditional invariance and marginal invariance, as shown in Lemma 1.
Based on these two elements, we can bound the test-environment risk in the context of representation learning.
Theorem 1 Suppose i) the observed source environments are S_1(x, y), . . ., S_T(x, y) and the unseen test environment is T(x, y); ii) the prediction loss L is bounded in [0, 1]; iii) the embedding distribution Φ satisfies a small feature-conditional total variation distance on the latent space Z: ∀i, j ∈ {1, . . ., T}, ∀y ∈ Y, d_TV(S_i(z|Y = y) ‖ S_j(z|Y = y)) ≤ κ; iv) ∀y ∈ Y, on the raw feature space X: min_{t∈{1,...,T}} d_TV(T(x|Y = y) ‖ S_t(x|Y = y)) ≤ ε. Then the Balanced Error Rate on the test environment is upper bounded by:

BER_T(h, Φ) ≤ (1/T) Σ_{t=1}^T BER_{S_t}(h, Φ) + κ + α_TV(Φ) · ε,

where α_TV(Φ) is the Dobrushin coefficient [22]: α_TV(Φ) = sup_{x,x'} d_TV(Φ(·|x) ‖ Φ(·|x')).

Discussions The prediction risk on an unseen test environment is controlled by the following terms (Fig. 2 illustrates ε: the distance between T and its nearest source S_3).
(1) The first term suggests learning h and Φ to minimize the BER over the labeled data from the source environments; (2) A small κ indicates learning Φ to match the feature-conditional distributions. Specifically, when κ = 0, we have S_1(z|Y = y) = · · · = S_T(z|Y = y), achieving feature-conditional invariance; (3) ε in the third term is an unobservable factor in learning. As Fig. 2 shows, ε reveals the inherent relation between the test and source environments. Intuitively, a small ε indicates the test environment T is similar to one of the observed sources, so we can more easily predict the test through leveraging knowledge from the sources; (4) α_TV(Φ) in the third term is the controllable factor, acting as a regularization of Φ. Specifically, α_TV(Φ) reflects the smoothness of the embedding distribution. At test time, regularization of Φ is crucial since ε is unknown, uncontrollable and possibly non-negligible. That is, merely minimizing Eq. (1) by ensuring BER_{S_t}(h, Φ) = 0 and κ = 0 is not sufficient: if α_TV(Φ) is large, the upper bound becomes vacuous and generalization in the test environment is not necessarily guaranteed; (5) There is a trade-off in learning Φ. Although a small α_TV(Φ) suggests a smooth representation, over-smoothing is harmful for learning a meaningful representation. For instance, if the embedding distribution Φ is constant, then α_TV(Φ) = 0, but the network does not learn an embedding and BER_{S_t}(h, Φ) will be inherently large.
Compared with most previous theoretical results, ours highlight the role of representation learning in domain generalization. In particular, Theorem 1 motivates a novel algorithm to control the Dobrushin coefficient, which is shown in Sec. 3.2 and Sec. 4.

Relation with other invariance criteria
Theorem 1 justifies the importance of regularizing Φ under feature-conditional invariance; the following lemma reveals its relations with the other two invariance criteria.
Lemma 1 If the embedding distribution Φ satisfies a small feature-conditional total variation distance on the latent space Z, ∀i, j ∈ {1, . . ., T}, ∀y ∈ Y, d_TV(S_i(z|Y = y) ‖ S_j(z|Y = y)) ≤ κ, and the source label distributions are uniform with S_i(Y = y) = 1/|Y|, then we have d_TV(S_i(z) ‖ S_j(z)) ≤ κ and, on Ω, d_TV(S_i(y|z) ‖ S_j(y|z)) ≤ C_+ κ, where C_+ is a positive constant and Ω = supp(S_i(z)) ∩ supp(S_j(z)) denotes the intersection of the latent supports of the two environments.
Lemma 1 reveals that feature-conditional invariance can induce the other two types of invariance if the label distribution among the sources is balanced, which is practically feasible through re-sampling the dataset to a uniform label distribution. Specifically, if κ = 0, we achieve the other two invariances as well.
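The re-sampling step mentioned above can be sketched as follows; `resample_balanced` is a hypothetical helper (not part of any released code) that draws the same number of examples from every class so the empirical label distribution of a source becomes uniform:

```python
import random

# A minimal sketch of class-balanced re-sampling: subsample each class down
# to the size of the smallest class, so every label appears equally often.
def resample_balanced(samples, labels, seed=0):
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    n = min(len(v) for v in by_class.values())  # size of the smallest class
    out = []
    for y, group in sorted(by_class.items()):
        out += [(s, y) for s in rng.sample(group, n)]
    return out

data = list(range(10))
labels = [0] * 7 + [1] * 3            # imbalanced source: 7 vs. 3
balanced = resample_balanced(data, labels)
counts = {}
for _, y in balanced:
    counts[y] = counts.get(y, 0) + 1
print(counts)  # -> {0: 3, 1: 3}
```

Oversampling the minority class (with replacement) is an equally valid alternative when discarding majority-class data is too wasteful.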

Sufficient conditions for controlling Dobrushin Coefficient
We now discuss sufficient conditions for controlling the Dobrushin coefficient, which is intuitively interpreted as a smoothness property of the representation. Lemma 2 shows that a Lipschitz condition is one sufficient condition to control α_TV(Φ).
Lemma 2 Suppose the embedding distribution Φ is induced by an L_φ-Lipschitz function φ; then the Dobrushin coefficient can be upper-bounded in terms of L_φ. In particular, if L_φ → 0, then α_TV(Φ) → 0. In conventional deep neural networks, the deterministic parametric embedding can be approximated as the mean φ(x) of the conditional distribution with a small variance [1]. Therefore, Lemma 2 suggests learning a Lipschitz embedding to promote a better generalization property in the test environment T.
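A concrete version of this bound can be reconstructed under the Gaussian-channel reading of [1], where the embedding concentrates around its mean; here σ and d_max = sup_{x,x'} ‖x − x'‖_2 are notation assumed for the sketch rather than taken from the text:

```latex
% Sketch under the assumption \Phi(z\mid x)=\mathcal{N}(\varphi(x),\sigma^2 I)
% with an L_\varphi-Lipschitz mean \varphi.
\alpha_{TV}(\Phi)
  \;=\; \sup_{x,x'} d_{TV}\!\big(\Phi(\cdot\mid x)\,\|\,\Phi(\cdot\mid x')\big)
  \;\le\; \sup_{x,x'} \frac{\|\varphi(x)-\varphi(x')\|_2}{2\sigma}
  \;\le\; \frac{L_\varphi\, d_{\max}}{2\sigma}.
```

The middle inequality follows from Pinsker's inequality together with KL(N(μ, σ²I) ‖ N(μ′, σ²I)) = ‖μ − μ′‖²_2 / (2σ²); under this assumption, L_φ → 0 indeed forces α_TV(Φ) → 0.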

Practical Implementations
We have demonstrated that the Lipschitz property of the embedding function φ induces a better generalization property. In this section, we elaborate practical implementations that realize the Lipschitz property of the embedding function using the multiple observed source environments.
Fig. 3: Illustration of the virtual sample regularization.
Penalizing the Frobenius norm of the Jacobian matrix of φ controls the (local) Lipschitz constant of φ, since the Frobenius norm upper-bounds the spectral norm of the Jacobian [20]. In order to take advantage of the multiple environments, we create virtual samples x̃ through a linear combination of samples from the sources, as shown in Fig. 3. The linear combination coefficients (γ_1, . . ., γ_T) are generated from the Dirichlet distribution with hyper-parameter β = 1. The aim of creating virtual samples x̃ is to enforce a smooth prediction behavior on the unobserved regions between the environments, which can be broadly viewed as a data-augmentation based approach (discussed further in the related work).

Algorithm 1 Regularization of φ
Require: Multi-source datasets S_1, . . ., S_T, embedding φ, hyper-parameter β.
1: Sample (γ_1, . . ., γ_T) ∼ Dirichlet(β) and form virtual samples x̃ = Σ_t γ_t x_t with x_t drawn from S_t.
2: Compute the Frobenius norm of the Jacobian matrix of φ at x̃.

Regularization is independent of learning invariance We denote the black-box algorithms that achieve invariance (e.g., feature, label and feature-conditional invariance) as INV(φ, S_1, . . ., S_T), which covers a broad range of algorithms. Then the improved loss can be expressed as:

min_{h,φ} (1/T) Σ_{t=1}^T E_{(x,y)∼S_t} L(h ∘ φ(x), y) + λ_0 INV(φ, S_1, . . ., S_T) + λ_1 E_{x̃} ‖∇_{x̃} φ(x̃)‖²_F.

In the experimental part, we will investigate different invariance approaches and the benefits of the regularization.
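A minimal numerical sketch of the regularizer follows. The toy embedding, the finite-difference Jacobian estimate, and the helper names are all illustrative assumptions; in a deep-learning framework the Jacobian norm would instead be obtained by automatic differentiation:

```python
import random

# Sketch of Algorithm 1's regularizer: draw Dirichlet(beta=1) mixing weights
# (via normalized Gamma samples), form a virtual sample between environments,
# and estimate ||J_phi(x)||_F^2 by finite differences.
def dirichlet(rng, T, beta=1.0):
    g = [rng.gammavariate(beta, 1.0) for _ in range(T)]
    s = sum(g)
    return [v / s for v in g]

def jacobian_frobenius_sq(phi, x, eps=1e-5):
    """Finite-difference estimate of ||J_phi(x)||_F^2."""
    fx = phi(x)
    total = 0.0
    for j in range(len(x)):
        xp = list(x)
        xp[j] += eps                       # perturb one input coordinate
        col = [(a - b) / eps for a, b in zip(phi(xp), fx)]
        total += sum(c * c for c in col)   # squared j-th Jacobian column
    return total

rng = random.Random(0)
phi = lambda x: [2.0 * x[0] + x[1], x[1]]   # toy linear embedding
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one point per "environment"
gamma = dirichlet(rng, len(xs))             # mixing weights, sum to 1
x_virtual = [sum(g * x[d] for g, x in zip(gamma, xs)) for d in range(2)]
reg = jacobian_frobenius_sq(phi, x_virtual)
print(round(reg, 3))  # J = [[2,1],[0,1]] everywhere, so ||J||_F^2 = 6.0
```

Because the toy embedding is linear, the finite-difference estimate is exact up to floating-point error; for a neural φ the same quantity would simply be added to the training loss with weight λ_1.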

Related Work
Learning invariance is a popular and widely adopted approach in domain generalization. Inspired by techniques in deep domain adaptation [5], various approaches have been proposed to enable different invariance criteria, such as marginal invariance S_1(z) = · · · = S_T(z) [8,15,24,2]. However, the accompanying theoretical results are mainly inspired by unsupervised domain adaptation and do not consider the specific scenarios of domain generalization, i.e., that the label information is known during the alignment, which can induce better alignments. As for feature-conditional invariance S_1(z|y) = · · · = S_T(z|y) [14,28,30,11], it considers the label information and enforces stronger conditions among the sources. However, as our counterexample indicates, merely learning the conditional invariance is not sufficient to provably guarantee the unseen test prediction risk. In contrast, we formally reveal the limitation of representation learning w.r.t. conditional invariance, which remained elusive in previous work. A more recent approach is to learn label-conditional invariance, i.e., ensuring the same decision boundary across the different environments (IRM [3,17]). However, recent work reveals failure scenarios of IRM, which can be explained through our theoretical analysis.
Relation with Data-Augmentation Based Approaches It has recently been observed that data-augmentation based approaches are quite effective in various practical domain generalization settings [27,16,31,32,21]. Intuitively, augmentation-based approaches aim to generate new samples from observed environments to enable smoother prediction results. In this part, we analyze the role of data augmentation, showing that it implicitly learns a smooth representation, consistent with our theoretical results.
Specifically, we consider one typical case with a conditional black-box interpolation function INP with x̃ = INP(x_1, . . ., x_T; y), where x_1 ∼ S_1(x|y), . . ., x_T ∼ S_T(x|y). For instance, in object classification under different backgrounds, the conditional augmentation aims at creating the same object through combining information from different environments. We further suppose a binary classification problem with Y = {−1, +1}, a linear classifier h(z) = wᵀz, and the logistic prediction loss L(ŷ, y) = log(1 + exp(−ŷ y)). If we use a second-order Taylor approximation at E_x̃[φ(x̃)], the centroid of the augmented features in the embedding space, the augmented prediction loss is approximated by two terms: (1) suggests a small loss at the centroid of the generated features, and (2) indicates a smooth prediction on the newly generated samples. Since the second derivative of the logistic loss is bounded by 1 and φ is a Lipschitz function, (2) can be further upper-bounded. Therefore, if the embedding function is smooth with a small Lipschitz constant, the second-order approximation of the augmentation loss is controlled, and minimizing the prediction loss on the augmented data can be viewed as an implicit approach to enforce a smooth representation.
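Written out explicitly (a reconstruction consistent with the appendix derivation; φ̄ := E_x̃[φ(x̃)] denotes the centroid of the augmented features), the second-order approximation and the bound on its smoothness term read:

```latex
\mathbb{E}_{\tilde{x}}\,L\big(w^{\top}\varphi(\tilde{x}),y\big)
 \;\approx\; \underbrace{L\big(w^{\top}\bar{\varphi},y\big)}_{(1)}
 \;+\; \underbrace{\tfrac{1}{2}\,L''\big(w^{\top}\bar{\varphi},y\big)\,
        \operatorname{Var}_{\tilde{x}}\!\big(w^{\top}\varphi(\tilde{x})\big)}_{(2)},
\qquad
(2)\;\le\;\tfrac{1}{2}\,\|w\|_2^2\,L_{\varphi}^2\,
        \mathbb{E}_{\tilde{x}}\big\|\tilde{x}-\mathbb{E}\tilde{x}\big\|_2^2 ,
```

using L'' ≤ 1 for the logistic loss and Var_x̃(wᵀφ(x̃)) ≤ E_x̃(wᵀφ(x̃) − wᵀφ(E x̃))² ≤ ‖w‖²_2 L_φ² E_x̃‖x̃ − E x̃‖²_2 for an L_φ-Lipschitz φ.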

Experiments
In the experimental part, we aim to address the following questions: Is the regularization term effective for generalizing to unseen environments? In what scenarios is the regularization beneficial?

Choice of Invariance Criteria and Loss
We evaluate the proposed regularization on top of typical invariant-representation algorithms to verify its effectiveness.
where 1_t is the one-hot source indicator. Intuitively, the discriminator minimizes the cross-entropy loss to differentiate the sources, while the embedding is trained adversarially against it to learn an invariant representation.
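For concreteness, the discriminator's per-sample objective can be sketched as follows (the logits are illustrative numbers; the actual discriminator in the experiments is a neural network trained adversarially against the embedding):

```python
import math

# Sketch of the discriminator's cross-entropy: given an embedding z from
# source t, the discriminator outputs logits over the T sources and pays
# -1_t^T log softmax(logits), i.e. -log softmax(logits)[t].
def domain_cross_entropy(logits, t):
    m = max(logits)                                        # stable log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[t]

logits = [2.0, 0.5, -1.0]          # discriminator scores for T = 3 sources
loss_correct = domain_cross_entropy(logits, 0)   # true source is S_1
loss_wrong = domain_cross_entropy(logits, 2)
print(loss_correct < loss_wrong)   # -> True: confident, correct domain => small loss
```

The embedding φ is then updated in the opposite direction (via a gradient-reversal layer or a min-max objective), so that no discriminator can tell the sources apart in latent space.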

Dataset description and Experimental setup
The experimental validation evaluates toy and real-world datasets to verify the effectiveness of the regularization.
ColorMNIST [3] Each MNIST image is colored either red or green so as to correlate strongly (but spuriously) with the class label; the class label is thus more strongly correlated with the color than with the digit shape. An algorithm purely minimizing training error will tend to exploit the spurious correlation with color, which leads to poor generalization in unseen distributions with different color relations.
Following [3], the dataset is constructed as follows. (1) Preliminary binary label: we randomly select 5K samples from MNIST and construct a preliminary binary label ỹ = 0 for digits 0-4 and ỹ = 1 for digits 5-9. (2) Label noise: we obtain the final label y by flipping ỹ with probability 0.25. (3) Spurious color feature: we color the gray-scale digit image according to a color label obtained by flipping y with probability P_S (i.e., y = 1 is colored red and y = 0 green with probability 1 − P_S).
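The three construction steps can be sketched in a few lines of pure Python (a hypothetical sketch in which digit identities stand in for images):

```python
import random

# Sketch of the ColorMNIST labeling scheme: digits 0-4 map to y~=0 and 5-9 to
# y~=1, the label is flipped with probability 0.25, and the color agrees with
# the (noisy) label except with probability P_S.
def make_environment(digits, p_s, seed=0):
    rng = random.Random(seed)
    env = []
    for d in digits:
        y_tilde = 0 if d <= 4 else 1                          # preliminary label
        y = 1 - y_tilde if rng.random() < 0.25 else y_tilde   # label noise
        color_flip = rng.random() < p_s                       # spurious feature
        if not color_flip:
            color = 'red' if y == 1 else 'green'
        else:
            color = 'green' if y == 1 else 'red'
        env.append((d, color, y))
    return env

env = make_environment(list(range(10)) * 500, p_s=0.1)
agree = sum(1 for _, c, y in env if (c == 'red') == (y == 1))
print(round(agree / len(env), 2))  # color agrees with the label ~90% of the time
```

Varying `p_s` per environment reproduces the controllable shift used in the experiments: a classifier that latches onto color does well when P_S is small and fails when the test environment reverses the correlation.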
ColorMNIST creates a controllable environment through assigning various P_S, which enables us to evaluate the generalization performance under different unobserved environments.
PACS [13] and Office-Home [26] are real-world datasets with high-dimensional images. PACS consists of four domains, Photo (P), Art (A), Cartoon (C) and Sketch (S), with 7 classes. Office-Home includes four domains, Art (A), Clipart (C), Product (P) and Real World (R), with 65 classes.
Experimental Setup We use the standard domain generalization framework DomainBed [10] to implement our algorithm. On ColorMNIST, we adopt a LeNet structure with three CNN layers as φ and three fc-layers as h; the mini-batch size is 128 with the Adam optimizer, λ_0 = 1 and λ_1 ∈ [10^-3, 1]. On the PACS and Office-Home datasets, we adopt a pre-trained ResNet-18 as φ and three fc-layers as h, with batch size 64, λ_0 ∈ [10^-7, 10^-2] and λ_1 ∈ [10^-5, 1]. We adopt training-domain validation [10] to search for the best hyper-parameter configuration: we randomly split the observed environments into training and validation sets and tune the configuration on the validation set w.r.t. S; the test environment is never seen during tuning. We run the experiments five times and report the average and standard deviation. The detailed network structures are deferred to the appendix.

Empirical Results
Table 1: Empirical results (per-class accuracy in %; bold indicates a statistically significant result) on ColorMNIST. We have three environments with different P_S = {0.1, 0.2, 0.9}; we train on two environments and test on the untrained one.

The results are presented in Tables 1, 2 and 3. Across all datasets and invariance criteria, the regularization yields a consistent improvement (ranging from 1.2% to 6.2%). Specifically, the improvement on the synthetic ColorMNIST dataset is significant, which reveals the effectiveness of the proposed regularization. Moreover, on the real-world datasets Office-Home and PACS, the regularization also delivers consistently better performance.

Analysis
We further conduct various analyses to understand the properties and role of the regularization.
Table 2: Empirical results (per-class accuracy in %; bold indicates a statistically significant result) on PACS. We have four environments, Photo (P), Art (A), Cartoon (C) and Sketch (S); we train the model on three environments and test on the untrained one.

Influence of regularization For a better understanding of the regularization, we gradually change λ_1 to show its influence. The empirical results are consistent with our theoretical analysis: with a small regularization, the prediction performance improves; however, a strong regularization (over-smoothing) of the representation is harmful and degrades the prediction performance.

Evolution of Training
We additionally visualize the evolution of the adversarial loss and the norm of the Jacobian matrix in two training modes: conditional alignment with and without regularization. Clearly, training without explicit regularization leads to a relatively large Jacobian norm. During optimization, the Jacobian norm gradually but slowly diminishes, which is possibly caused by the implicit regularization of SGD-based approaches [23]. Therefore, adding an explicit regularization term helps achieve a better generalization property.

Generalization on different test environments We further evaluate test environments with P_T = {0.05, . . ., 0.85}, shown in Fig. 6. On the observed environments P_S = {0.2, 0.9}, both approaches achieve high prediction accuracy (> 95%). However, the generalization behaviors are quite different: adding the regularization term consistently improves out-of-distribution prediction by 3-5%.

Conclusion
In this paper, we analyzed representation-learning based domain generalization. Concretely, we highlighted the importance of regularizing the representation function.
We then theoretically demonstrated the benefits of regularization, which plays the key role in controlling the prediction error in the unseen test environment. In practice, we evaluated the Jacobian-matrix regularization on various invariance criteria and datasets, and the results confirm the benefits of regularization.

Appendix: Proof
Proof of Theorem 1 The prediction error on the test environment can be decomposed via the nearest source environment S', the source most similar to the test environment (i.e., in the raw feature space, d_TV(S'(x|Y = y) ‖ T(x|Y = y)) ≤ ε). The decomposition uses the property of the TV distance and the fact that the prediction loss is bounded by 1 ([22], Remark 3.1). The bound over all sources is derived analogously: we apply the inequality T times and then take the average. Since we adopt the feature-conditional invariance criterion, we have d_TV(S'(z|Y = y) ‖ S_t(z|Y = y)) ≤ κ; this holds since during training we enforce a small conditional TV distance among all the sources. The first term can then be upper-bounded accordingly. We next upper bound the second term through the strong data-processing inequality [22], a tighter form of the data-processing inequality: it reveals the decay rate of information loss, characterized by the Dobrushin coefficient.
Strong data-processing inequality For distributions P_0, P_1 defined on X and a channel Q from space X to space Z, define the marginal distributions M_i(z) = ∫ Q(z|x)P_i(x)dx, i ∈ {0, 1}. The channel Q satisfies a strong data-processing inequality with constant α ≤ 1 for a given f-divergence if D_f(M_0 ‖ M_1) ≤ α D_f(P_0 ‖ P_1).
The tightest such constant is α_f(Q) = sup_{P_0 ≠ P_1} D_f(M_0 ‖ M_1) / D_f(P_0 ‖ P_1). For any convex f-divergence, we have α_f(Q) ≤ α_TV(Q), where α_TV(Q) is the Dobrushin coefficient, equivalently expressed as α_TV(Q) = sup_{x,x'} d_TV(Q(·|x) ‖ Q(·|x')). In our problem, the embedding distribution Φ can be viewed as the information channel, with input distributions T(x|Y = y) and S'(x|Y = y) and output distributions T(z|Y = y) and S'(z|Y = y). Then the TV distance on the latent space can be upper-bounded as d_TV(T(z|Y = y) ‖ S'(z|Y = y)) ≤ α_TV(Φ) d_TV(T(x|Y = y) ‖ S'(x|Y = y)) ≤ α_TV(Φ) · ε. Finally, we assume φ is L_φ-Lipschitz, i.e., ∀x, x': ‖φ(x) − φ(x')‖ ≤ L_φ ‖x − x'‖_2, and let d_max = sup_{x,x'} ‖x − x'‖_2; then the Dobrushin coefficient can be bounded in terms of L_φ and d_max.
We suppose a differentiable embedding function φ : X → Z, the logistic loss L(ŷ, y) = log(1 + exp(−ŷ y)), a linear predictor w, and binary classification with a balanced label distribution. The objective is written with x̃ = INP(x_1, . . ., x_T), x_1 ∼ S_1(x|Y = y), . . ., x_T ∼ S_T(x|Y = y), for any interpolation function of samples from multiple environments. We suppose the data augmentation aims at improving the local property of the representation φ. Using a first-order Taylor expansion at a local representation φ_0, we have G_1(w) = L(wᵀφ_0, y) + E_x̃[wᵀ(φ(x̃) − φ_0)] L'(wᵀφ_0, y). If we take φ_0 = E_x̃[φ(x̃)], the second term vanishes, giving the first-order approximation. We then compute the second-order approximation at φ_0. If L is the logistic loss, its second derivative is independent of the label y and bounded by 1; the second-order term is therefore bounded by Var_x̃(wᵀφ(x̃)) / 2.

Fig. 1 :
Fig. 1: Limitations of optimizing Eq. (1) with different invariance criteria. Intuitively, the marginal invariance fails when conditional shift occurs. The conditional invariance learns an over-matched embedding with a non-negligible prediction error in the test environment.
min_{t∈{1,...,T}} d_TV(T(x|Y = y) ‖ S_t(x|Y = y)) ≤ ε. Then the Balanced Error Rate on the test environment is upper bounded by: BER_T(h, Φ) ≤ (1/T) Σ_{t=1}^T BER_{S_t}(h, Φ) + κ + α_TV(Φ) · ε.

Fig. 4 :
Fig. 4: Influence of regularization on the PACS dataset with CDANN. We gradually change the importance of regularization (i.e., different λ_1). The accuracy first increases with a larger λ_1, then drops due to over-smoothing of the representation.

Fig. 5 :
Fig. 5: Loss evolution on the Office-Home dataset (training environments: Clipart, Product, Real-World) with CDANN. Left: the evolution of the adversarial loss and the regularization term (norm of the Jacobian matrix) when the regularization loss is adopted. Right: the same quantities without the regularization loss. The results reveal that without the explicit regularization loss, the Jacobian norm still gradually (but slowly) diminishes; in contrast, adding the explicit term accelerates the optimization procedure.

Fig. 6 :
Fig. 6: Generalization on different test environments. The observed environments are P_S = {0.2, 0.9} with high prediction performance. However, when generalizing to other test environments, the regularization term consistently improves the prediction performance.

(i.e., in the raw feature space, d_TV(S'(x|Y = y) ‖ T(x|Y = y)) ≤ ε.) We have the upper bound since the prediction loss is bounded by 1, together with the property of the TV distance ([22], Remark 3.1). We first bound the first term: since S' is an unknown source during training, we upper bound it through all the sources, i.e., ∀t ∈ {1, . . ., T}:

E_{z∼S'(z|Y=y)} L(h(z), y) ≤ (1/T) Σ_{t=1}^T E_{z∼S_t(z|Y=y)} L(h(z), y) + (1/T) Σ_{t=1}^T d_TV(S'(z|Y = y) ‖ S_t(z|Y = y)).

Table 3 :
Empirical results (per-class accuracy in %; bold indicates a statistically significant result) on Office-Home. We have four environments, Art (A), Clipart (C), Product (P) and Real-World (R); we train the model on three environments and test on the untrained one.