Conditional variance penalties and domain shift robustness

When training a deep neural network for image classification, one can broadly distinguish between two types of latent features of images that will drive the classification. We can divide latent features into (i) ‘core’ or ‘conditionally invariant’ features $C$ whose distribution $C \vert Y$, conditional on the class $Y$, does not change substantially across domains and (ii) ‘style’ features $S$ whose distribution $S \vert Y$ can change substantially across domains. Examples for style features include position, rotation, image quality or brightness but also more complex ones like hair color or posture for images of persons. Our goal is to minimize a loss that is robust under changes in the distribution of these style features. In contrast to previous work, we assume that the domain itself is not observed and hence a latent variable.
We do assume that we can sometimes observe a typically discrete identifier or “$\mathrm{ID}$ variable”. In some applications we know, for example, that two images show the same person, and $\mathrm{ID}$ then refers to the identity of the person. The proposed method requires only a small fraction of images to have $\mathrm{ID}$ information. We group observations if they share the same class and identifier $(Y,\mathrm{ID})=(y,\mathrm{id})$ and penalize the conditional variance of the prediction or the loss if we condition on $(Y,\mathrm{ID})$.
Using a causal framework, this conditional variance regularization (CoRe) is shown to protect asymptotically against shifts in the distribution of the style variables in a partially linear structural equation model. Empirically, we show that the CoRe penalty improves predictive accuracy substantially in settings where domain changes occur in terms of image quality, brightness and color while we also look at more complex changes such as changes in movement and posture.


Introduction
Deep neural networks (DNNs) have achieved outstanding performance on prediction tasks like visual object and speech recognition (Krizhevsky et al. 2012; Szegedy et al. 2015; He et al. 2015). Issues can arise when the learned representations rely on dependencies that vanish in test distributions (see for example Quiñonero-Candela et al. (2009), Torralba and Efros (2011), Csurka (2017) and references therein). Such domain shifts can be caused by changing conditions such as color, background or location changes. Predictive performance is then likely to degrade. For example, consider the analysis presented in Kuehlkamp et al. (2017) which is concerned with the problem of predicting a person's gender based on images of their iris. The results indicate that this problem is more difficult than previous studies have suggested due to the remaining effect of cosmetics after segmenting the iris from the whole image. Previous analyses obtained good predictive performance on certain datasets, but accuracy dropped when testing on a dataset that only included images without cosmetics. In other words, the high predictive performance previously reported relied to a significant extent on exploiting the confounding effect of mascara on the iris segmentation, which is highly predictive for gender. Rather than acquiring the desired ability to discriminate based on the iris' texture, the systems would mostly learn to detect the presence of cosmetics.
More generally, existing biases in datasets used for training machine learning algorithms tend to be replicated in the estimated models (Bolukbasi et al. 2016). For an example involving Google's photo app, see Crawford (2016) and Emspak (2016). In Sect. 5 we show many examples where unwanted biases in the training data are picked up by the trained model. As any bias in the training data is in general used to discriminate between classes, these biases will persist in future classifications, raising also considerations of fairness and discrimination (Barocas and Selbst 2016).
Addressing the issues outlined above, we propose Conditional variance Regularization (CoRe) to give differential weight to different latent features. Conceptually, we take a causal view of the data generating process and categorize the latent data generating factors into 'conditionally invariant' (core) and 'orthogonal' (style) features, as in Gong et al. (2016). The core and style features are unobserved and can in general be highly nonlinear transformations of the observed input data. It is desirable that a classifier only extracts the latent core features from the input data as they pertain to the target of interest in a stable and coherent fashion. Basing a prediction on the core features alone yields stable predictive accuracy even if the style features are altered. Under suitable assumptions, CoRe yields an estimator which is approximately invariant under changes in the conditional distribution of the style features (conditional on the class labels) and it is asymptotically robust with respect to domain shifts, arising through interventions on the style features. CoRe relies on the fact that for certain datasets we can observe grouped observations in the sense that we observe the same object under different conditions. For instance, such grouping information is available (i) in natural image data when several pictures of the same person are taken; (ii) in medical imaging when several images belonging to the same patient are made; (iii) in speech recognition when multiple recordings from the same speaker are available; (iv) in video data where nearby frames showing the same objects can be exploited to group observations; (v) in data augmentation where a transformed data point can be grouped together with the original one.
We will show examples for the first and last category. For the last category, we will show that pairing the augmented data with the original image they were generated from helps to improve accuracy and robustness with respect to the chosen transformation.
Rather than pooling over all examples, CoRe exploits knowledge about this grouping, i.e., that a number of instances relate to the same object. By penalizing between-object variation of the prediction less than variation of the prediction for the same object, we can steer the prediction to be based more on the latent core features and less on the latent style features. While the proposed methodology can be motivated from the desire to achieve representational invariance with respect to the style features, the causal framework we use throughout this work allows us to formulate precisely the distribution shifts we aim to protect against.
The remainder of this manuscript is structured as follows: Sect. 1.1 starts with a few motivating examples, showing simple settings where the style features change in the test distribution such that standard empirical risk minimization approaches would fail. In Sect. 1.2 we review related work, introduce notation in Sect. 2 and in Sect. 3 we formally introduce conditional variance regularization CoRe. In Sect. 4, CoRe is shown to be asymptotically equivalent to minimizing the risk under a suitable class of strong interventions in a partially linear classification setting, provided one chooses sufficiently strong CoRe penalties. We also show that the population CoRe penalty induces domain shift robustness for general loss functions to first order in the intervention strength. The size of the conditional variance penalty can be shown to determine the size of the distribution class over which we can expect distributional robustness. In Sect. 5 we evaluate the performance of CoRe in a variety of experiments.
To summarize, our contributions are the following: (i) Causal framework and distributional robustness. We build on the causal framework from Gong et al. (2016) to define distributional shifts for style variables. This allows us to formulate the objective of interest in terms of distributionally robust inference. Specifically, the distribution class, on which the estimator should achieve a guaranteed performance bound, consists of those distributions that are generated by interventions on the latent style variables in a causal model. Our framework allows that the domain variable itself is latent. (ii) Conditional variance penalties. We introduce conditional variance penalties and show two robustness properties in Theorems 1 and 2. (iii) Software. We illustrate our ideas using synthetic and real-data experiments. A TensorFlow implementation of CoRe as well as code to reproduce some of the experimental results are available at https://github.com/christinaheinze/core.

Motivating examples
To motivate the methodology we propose, consider the examples shown in Figs. 1 and 2. Example 1 shows a setting where a nonlinear decision boundary is required. Here, the core feature corresponds to the distance from the origin while the style feature corresponds to the angle between the $x_1$-axis and the vector from the origin to $(x_1, x_2)$. Panel (a) shows a subsample of the training data where class 1 is associated with red points and class 0 with dark blue points. Panel (b) additionally shows a subsample of the test data where the style, i.e. the distribution of the angle, is intervened upon: class 1 is associated with orange squares and class 0 with cyan squares. Clearly, a circular decision boundary yields optimal performance on both training and test set but is unlikely to be found by a standard classification algorithm when only using the training set for the estimation. We will return to these examples in Sect. 3.4. Secondly, we introduce a strong dependence between the class label and the style feature "image quality" in the example of Fig. 2 by manipulating the face images from the CelebA dataset: in the training set, images of class "wearing glasses" are associated with a lower image quality than images of class "not wearing glasses". Examples are shown in Fig. 2a. In the test set, this relation is reversed, i.e. images showing persons wearing glasses are of higher quality than images of persons without glasses, with examples in Fig. 2b. We will return to this example in Sect. 5.3 and show that training a convolutional neural network to distinguish between people wearing glasses or not works well on test data that are drawn from the same distribution (with error rates below 2%) but fails entirely on the shown test data, with error rates worse than 65%.

Fig. 2 The goal is to predict whether a person is wearing glasses. The distributions are shifted in test data by style interventions where style is the image quality. A 5-layer CNN achieves 0% training error and 2% test error for images that are sampled from the same distribution as the training images (a), but a 65% error rate on images where the confounding between image quality and glasses is changed (b). See Sect. 5.3 for more details.

Related work
For general distributional robustness, the aim is to learn
$$\operatorname{argmin}_{\theta} \; \sup_{F \in \mathcal{F}} E_F\big(\ell(Y, f_\theta(X))\big) \qquad (1)$$
for a given set $\mathcal{F}$ of distributions, twice differentiable and convex loss $\ell$, and prediction $f_\theta(x)$. The set $\mathcal{F}$ is the set of distributions on which one would like the estimator to achieve a guaranteed performance bound. Causal inference can be seen to be a specific instance of distributional robustness, where we take $\mathcal{F}$ to be the class of all distributions generated under do-interventions on the predictors $X$ (Meinshausen 2018; Rothenhäusler et al. 2018). Causal models thus have the defining advantage that the predictions will be valid even under arbitrarily large interventions on all predictor variables (Haavelmo 1944; Aldrich 1989; Pearl 2009; Schölkopf et al. 2012; Peters et al. 2016; Zhang et al. 2013, 2015; Yu et al. 2017; Rojas-Carulla et al. 2018; Magliacane et al. 2018). There are two difficulties in transferring these results to the setting of domain shifts in image classification. The first hurdle is that the classification task is typically anti-causal since the image we use as a predictor is a descendant of the true class of the object we are interested in rather than the other way around. The second challenge is that the input data consist of pixel intensities and we do not want to (and could not) guard against arbitrary interventions on any or all variables but would only like to guard against a shift of the unobserved style features. It is hence not immediately obvious how standard causal inference can be used to guard against large domain shifts.
Another line of work uses a class of distributions of the form $\mathcal{F} = \mathcal{F}_\epsilon(F_0)$ with
$$\mathcal{F}_\epsilon(F_0) := \{\text{distributions } F \text{ such that } D(F, F_0) \le \epsilon\}, \qquad (2)$$
with $\epsilon > 0$ a small constant and $D(F, F_0)$ being, for example, a $\phi$-divergence (Namkoong and Duchi 2017; Ben-Tal et al. 2013; Bagnell 2005; Volpi et al. 2018) or a Wasserstein distance (Shafieezadeh-Abadeh et al. 2017; Sinha et al. 2018; Gao et al. 2017). The distribution $F_0$ can be the true (but generally unknown) population distribution $P$ from which the data were drawn or its empirical counterpart $P_n$. The distributionally robust targets in Eq. (2) can often be expressed in penalized form (Gao et al. 2017; Sinha et al. 2018; Xu et al. 2009). A Wasserstein-ball is a suitable class of distributions for example in the context of adversarial examples (Sinha et al. 2018; Szegedy et al. 2014; Goodfellow et al. 2015).
In this work, we do not try to achieve robustness with respect to a set of distributions that are pre-defined by a Kullback-Leibler divergence or a Wasserstein metric as in Eq. (2). Instead, we try to achieve robustness against a set of distributions that are generated by interventions on latent style variables in a causal model (we will make this precise in Sect. 2). We will formulate the class of distributions over which we try to achieve robustness as in Eq. (1) but with the class of distributions in Eq. (2) now replaced with the class of distributions $\mathcal{F}_\xi$ defined as $\{F : D_{style}(F, F_0) \le \xi\}$, where $F_0$ is again the distribution the training data are drawn from. The difference to standard distributional robustness approaches listed below Eq. (2) is now that the metric $D_{style}$ measures the shift of the orthogonal style features. We do not know a priori which features are prone to distributional shifts and which features have a stable (conditional) distribution. The metric is hence not known a priori and needs to be inferred in a suitable sense from the data.
Similar to this work in terms of their goals are the method of Gong et al. (2016) and Domain-Adversarial Neural Networks (DANN) proposed in Ganin et al. (2016), an approach motivated by the work of Ben-David et al. (2007). The main idea of Ganin et al. (2016) is to learn a representation that contains no discriminative information about the origin of the input (source or target domain). This is achieved by an adversarial training procedure: the loss on domain classification is maximized while the loss of the target prediction task is minimized simultaneously. The data generating process assumed in Gong et al. (2016) is similar to our model, introduced in Sect. 2.1, where we detail the similarities and differences between the models (cf. Fig. 3). Gong et al. (2016) identify the conditionally independent features by adjusting a transformation of the variables to minimize the squared MMD distance between distributions in different domains. The fundamental difference between these very promising methods and our approach is that we use a different data basis. The domain identifier is explicitly observable in Gong et al. (2016) and Ganin et al. (2016), while it is latent in our approach. In contrast, we exploit the presence of an identifier variable ID that relates to the identity of an object (for example identifying a person). In other words, we do not assume that we have data from different domains but just different realizations of the same object under different interventions. This also differentiates this work from latent domain adaptation papers from the computer vision literature (Hoffman et al. 2012; Gong et al. 2013). Further related work is discussed in Sect. 6.

Fig. 3 Observed quantities are shown as shaded nodes; nodes of latent quantities are transparent. Left (a): data generating process for the considered model as in Gong et al. (2016), where the effect of the domain on the orthogonal features $S$ is mediated via unobserved noise $\Delta$. The style interventions and all their descendants are shown as nodes with dashed borders to highlight variables that are affected by style interventions. Right (b): our setting. The domain itself is unobserved but we can now observe the (typically discrete) ID variable we use for grouping. The arrow between ID and $Y$ can be reversed, depending on the sampling scheme.

Setting
We introduce the assumed underlying causal graph and some notation before discussing notions of domain shift robustness.

Causal graph
Let $Y \in \mathcal{Y}$ be a target of interest. Typically $\mathcal{Y} = \mathbb{R}$ for regression or $\mathcal{Y} = \{1, \ldots, K\}$ in classification with $K$ classes. Let $X \in \mathbb{R}^p$ be predictor variables, for example the $p$ pixels of an image. The causal structural model for all variables is shown in panel (b) of Fig. 3. The domain variable $D$ is latent, in contrast to Gong et al. (2016) whose model is shown in panel (a) of Fig. 3. We add the ID variable to the graph. In Fig. 3, $Y \to ID$ but in some settings it might be more plausible to consider $ID \to Y$. For the proposed method both options are possible. Together with $Y$, the ID variable is used to group observations. It is typically discrete and relates to the identity of the underlying object. The ID variable can be assumed to be latent in the setting of Gong et al. (2016).
The rest of the graph is in analogy to Gong et al. (2016). The prediction is anti-causal, that is, the predictor variables $X$ that we use for $\hat{Y}$ are non-ancestral to $Y$. In other words, the class label is here seen to be causal for the image and not the other way around. The causal effect from the class label $Y$ on the image $X$ is mediated via two types of latent variables: the so-called core or 'conditionally invariant' features $C$ and the orthogonal or style features $S$. The distinguishing factor between the two is that external interventions $\Delta$ are possible on the style features but not on the core features. If the interventions $\Delta$ have different distributions in different domains, then the conditional distributions $C \mid Y = y, ID = id$ are invariant for all $(y, id)$ while $S \mid Y = y, ID = id$ can change. The style variable can include point of view, image quality, resolution, rotations, color changes, body posture, movement etc. and will in general be context-dependent. The style intervention variable $\Delta$ influences the latent style $S$, and hence also the image $X$. In potential outcome notation, we let $S(\Delta = \delta)$ be the style under intervention $\Delta = \delta$ and $X(Y, ID, \Delta = \delta)$ the image for class $Y$, identity ID and style intervention $\Delta$. The latter is sometimes abbreviated as $X(\Delta = \delta)$ for notational simplicity. Finally, $f_\theta(X(\Delta = \delta))$ is the prediction under the style intervention $\Delta = \delta$. For a formal justification of using a causal graph and potential outcome notation simultaneously see Richardson and Robins (2013).
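To be specific, if not mentioned otherwise we will assume a causal graph as follows. For independent $\varepsilon_Y$, $\varepsilon_{ID}$, $\varepsilon_{style}$ in $\mathbb{R}$, $\mathbb{R}$, $\mathbb{R}^q$ respectively with positive density on their support and continuously differentiable functions $k_y$, $k_{id}$, $k_{style}$, $k_{core}$, $k_x$,
$$\begin{aligned}
\text{class} \quad & Y \leftarrow k_y(D, \varepsilon_Y) \\
\text{identifier} \quad & ID \leftarrow k_{id}(Y, \varepsilon_{ID}) \\
\text{core or conditionally invariant features} \quad & C \leftarrow k_{core}(Y, ID) \\
\text{style or orthogonal features} \quad & S \leftarrow k_{style}(Y, ID, \varepsilon_{style}) + \Delta \\
\text{image} \quad & X \leftarrow k_x(C, S).
\end{aligned} \qquad (3)$$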

The distribution of $\varepsilon_{style}$ is assumed to be identical across domains, while $\Delta$ can change. In more generality, one could discard $\Delta$ and instead allow the distribution of $\varepsilon_{style}$ to change.
Here the assumption is slightly more restrictive because of additivity of $\Delta$ in the structural equation for the style variable. The core features are here assumed to be a deterministic function of $Y$ and ID to allow for theoretical analysis. In more generality (and as indicated in the graph), these would also be non-deterministic relations. The theoretical results will also require positive density for the style features in an $\epsilon$-ball around the origin, as made precise in assumption (A1) later.
The prediction $\hat{y}$ for $y$, given $X = x$, is of the form $f_\theta(x)$ for a suitable function $f_\theta$ with parameters $\theta \in \mathbb{R}^d$, where the parameters $\theta$ correspond to the weights in a DNN, for example.
We would like to stress that the above model is fairly general and subsumes many simpler ones as special cases. To give a concrete example, consider the task of classifying a health condition Y from medical images X. Style features could be, for example, technical noise, orientation or resolution. The unobserved domain D could correspond to different hospitals or doctors. Due to the usage of different measuring devices in each of these locations, the conditional distribution S|Y will change substantially across different domains. In contrast, the core features C , i.e. those image features that carry the actual signal, will remain invariant conditional on the underlying health condition Y.
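The structural model above can be illustrated with a small simulation. The sketch below is only a toy instance: the functional forms (radius as core feature, angle as style feature), the noise scale, the ID range and the helper name `sample_point` are all assumptions made for illustration and are not taken from the experiments in this work.

```python
import math
import random


def sample_point(rng, delta=0.0):
    """Draw one (x, y, id) triple from a toy instance of the causal model.

    Every functional form below is a hypothetical choice for illustration.
    """
    y = rng.choice([0, 1])                    # class label Y
    obj_id = rng.randrange(1000)              # identifier ID (toy: ignores Y)
    core = 1.0 + y                            # C: deterministic given Y (ID unused here)
    style = rng.gauss(0.5 * y, 0.2) + delta   # S: noisy, shifted additively by delta
    # X is a nonlinear function of the core (radius) and the style (angle).
    x = (core * math.cos(style), core * math.sin(style))
    return x, y, obj_id


rng = random.Random(0)
train = [sample_point(rng) for _ in range(5)]            # training domain
test = [sample_point(rng, delta=2.0) for _ in range(5)]  # style-shifted domain

for x, y, _ in train + test:
    # The core feature (radius) is unaffected by the style intervention.
    print(y, round(math.hypot(*x), 6))
```

In this toy setting a classifier that only uses the radius is unaffected by the shift `delta`, whereas one that uses the angle is not.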

Data
We assume we have $n$ data points $(x_i, y_i, id_i)$ for $i = 1, \ldots, n$, where the observations $id_i$ with $i = 1, \ldots, n$ of variable ID can also contain unobserved values. Let $m \le n$ be the number of unique realizations of $(Y, ID)$ and let $G_1, \ldots, G_m$ be a partition of $\{1, \ldots, n\}$ such that, for each $j \in \{1, \ldots, m\}$, the realizations $(y_i, id_i)$ are identical for all $i \in G_j$. While our prime application is classification, regression settings with continuous $Y$ can be approximated in this framework by slicing the range of the response variable into distinct bins in analogy to the approach in sliced inverse regression (Li 1991). The cardinality of $G_j$ is denoted by $n_j := |G_j| \ge 1$. Then $n = \sum_{j=1}^m n_j$ is again the total number of samples and $c = n - m$ is the total number of grouped observations in the following sense: if we count all samples in a group except the first one we have, if summing over all groups, a total of $c = n - m = \sum_{j=1}^m (n_j - 1)$ observations left that are 'grouped' with the first example in their corresponding group.
Typically $n_j = 1$ for most groups and occasionally $n_j \ge 2$ but one can also envisage scenarios with larger groups of the same identifier $(y, id)$.
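The bookkeeping for the groups and the counts n, m and c can be sketched as follows. The helper name `build_groups` and the toy data are hypothetical, and treating a sample with an unobserved identifier (here `None`) as its own singleton group is an assumption of this sketch.

```python
from collections import defaultdict


def build_groups(labels, ids):
    """Partition sample indices by (y, id).

    Assumption of this sketch: a sample whose id is None (unobserved)
    forms a singleton group of its own.
    """
    groups = defaultdict(list)
    for i, (y, obj_id) in enumerate(zip(labels, ids)):
        key = (y, obj_id) if obj_id is not None else ("unobserved", i)
        groups[key].append(i)
    return list(groups.values())


labels = [1, 1, 0, 0, 1, 0]
ids = ["a", "a", "b", None, "a", None]

groups = build_groups(labels, ids)
n = len(labels)   # total number of samples
m = len(groups)   # number of unique (y, id) realizations
c = n - m         # number of grouped observations

print(groups, n, m, c)
```

Here the three samples with (y, id) = (1, "a") form one group, so two of them count as grouped observations.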

Domain shift robustness
In this section, we clarify against which classes of distributions we hope to achieve robustness. Let $\ell$ be a suitable loss that maps $y$ and $\hat{y} = f_\theta(x)$ to $\mathbb{R}^+$. The risk under distribution $F$ and parameter $\theta$ is given by
$$E_F\big[\ell(Y, f_\theta(X))\big].$$
Let $F_0$ be the joint distribution of $(ID, Y, S)$ in the training distribution. A new domain and explicit interventions on the style features can now shift the distribution of $(ID, Y, S)$ to $F$. We can measure the distance between distributions $F_0$ and $F$ in different ways. Below we will define the distance considered in this work and denote it by $D_{style}(F, F_0)$. Once defined, we get a class of distributions
$$\mathcal{F}_\xi = \{F : D_{style}(F, F_0) \le \xi\}$$
and the goal will be to optimize a worst-case loss over this distribution class in the sense of Eq. (1), where larger values of $\xi$ afford protection against larger distributional changes. The relevant loss for distribution class $\mathcal{F}_\xi$ is then
$$L_\xi(\theta) = \sup_{F \in \mathcal{F}_\xi} E_F\big[\ell(Y, f_\theta(X))\big]. \qquad (5)$$
In the limit of arbitrarily strong interventions on the style features $S$, the loss is given by
$$L_\infty(\theta) = \lim_{\xi \to \infty} \sup_{F \in \mathcal{F}_\xi} E_F\big[\ell(Y, f_\theta(X))\big]. \qquad (6)$$
Minimizing the loss $L_\infty(\theta)$ with respect to $\theta$ yields predictions that remain accurate under arbitrarily large shifts in the conditional distribution of the style features.
A natural choice to define $D_{style}$ is to use a Wasserstein-type distance (see e.g. Villani 2003). We will first define a distance $D_{y,id}$ for the conditional distributions and then set $D(F_0, F) = E(D_{Y,ID})$, where the expectation is with respect to random ID and labels $Y$. The distance $D_{y,id}$ between the two conditional distributions of the style features will be defined as a Wasserstein $W_2^2(F_0, F)$-distance for a suitable cost function $c(x, \tilde{x})$. Specifically, let $\Pi_{y,id}$ be the couplings between the conditional distributions of $S$ (under $F_0$) and $\tilde{S}$ (under $F$), meaning measures supported on $\mathbb{R}^q \times \mathbb{R}^q$ such that the marginal distribution over the first $q$ components is equal to the distribution of $S$ and the marginal distribution over the remaining $q$ components equal to the distribution of $\tilde{S}$. Then the distance between the conditional distributions is defined as
$$D_{y,id} = \min_{M \in \Pi_{y,id}} E_M\big[c(S, \tilde{S})\big],$$
where $c : \mathbb{R}^q \times \mathbb{R}^q \to \mathbb{R}^+$ is a nonnegative, lower semi-continuous cost function. Here, we focus on a Mahalanobis distance as cost
$$c(x, \tilde{x}) = (x - \tilde{x})^\top \Sigma_{y,id}^{-1} (x - \tilde{x}).$$

The cost of a shift is hence measured against the variability under the distribution $F_0$, $\Sigma_{y,id} = \mathrm{Var}(S \mid Y, ID)$ (see footnote 6). Clearly, since the core and style features are unobserved, we cannot directly optimize the loss (5) but need to infer the metric $D_{style}$ from the input data $X$. In the next section, we will show how this can be achieved. Intuitively, the model in Fig. 3 implies that the variance conditional on $(Y, ID)$ stems from the difference in the style features. Hence, we would like to minimize the conditional variance of the prediction or loss when we condition on $(Y, ID)$. This enforces the desired invariance with respect to the style features.
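To see what measuring a shift against the natural variability means, the following sketch evaluates a Mahalanobis cost of the form $(x - \tilde{x})^\top \Sigma^{-1} (x - \tilde{x})$ for q = 2. The covariance values and the helper name `mahalanobis_cost` are hypothetical and chosen only for illustration.

```python
def mahalanobis_cost(x, x_tilde, cov):
    """Return (x - x_tilde)^T cov^{-1} (x - x_tilde) for a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix must be invertible")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    diff = (x[0] - x_tilde[0], x[1] - x_tilde[1])
    return sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))


# Hypothetical conditional covariance of the style features:
# large variance in the first direction, small variance in the second.
sigma = ((4.0, 0.0), (0.0, 0.25))

cost_high_var = mahalanobis_cost((0.0, 0.0), (1.0, 0.0), sigma)
cost_low_var = mahalanobis_cost((0.0, 0.0), (0.0, 1.0), sigma)
print(cost_high_var, cost_low_var)
```

A unit shift along the direction with large training variability is cheap, while the same shift along the direction with little variability is expensive, so a fixed budget on the distance permits larger shifts where the style already varies strongly.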

Pooled estimator
Let $(x_i, y_i)$ for $i = 1, \ldots, n$ be the observations that constitute the training data and $\hat{y}_i = f_\theta(x_i)$ the prediction for $y_i$. The standard approach is to simply pool over all available observations, ignoring any grouping information that might be available. The pooled estimator thus treats all examples identically by summing over the empirical loss as
$$\hat{\theta}^{pool} = \operatorname{argmin}_\theta \; \hat{L}_n(\theta) + \gamma \cdot \mathrm{pen}(\theta), \qquad (7)$$
where the first part is simply the empirical loss over the training data,
$$\hat{L}_n(\theta) = \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, f_\theta(x_i)\big).$$
In the second part, $\mathrm{pen}(\theta)$ is a complexity penalty, for example a squared $\ell_2$-norm of the weights $\theta$ in a convolutional neural network as a ridge penalty.

CoRe estimator
The CoRe estimator is defined in Lagrangian form for penalty $\lambda \ge 0$ as
$$\hat{\theta}^{core}(\lambda) = \operatorname{argmin}_\theta \; \hat{L}_n(\theta) + \lambda \cdot \hat{C}_\theta. \qquad (8)$$
The penalty $\hat{C}_\theta$ is a conditional variance penalty of the form
$$\text{conditional-variance-of-prediction:} \quad \hat{C}_{f,\nu,\theta} := \hat{E}\big[\mathrm{Var}(f_\theta(X) \mid Y, ID)^\nu\big] \qquad (9)$$
$$\text{conditional-variance-of-loss:} \quad \hat{C}_{\ell,\nu,\theta} := \hat{E}\big[\mathrm{Var}(\ell(Y, f_\theta(X)) \mid Y, ID)^\nu\big], \qquad (10)$$
where typically $\nu \in \{1/2, 1\}$. For $\nu = 1/2$, we also refer to the respective penalties as "conditional-standard-deviation" penalties. In practice in the context of classification and DNNs, we apply the penalty (9) to the predicted logits. The conditional-variance-of-loss penalty (10) takes a similar form to Namkoong and Duchi (2017). The crucial difference of our approach to Namkoong and Duchi (2017) is that we penalize with the expected conditional variance or standard deviation. The fact that we take a conditional variance is here important as we try to achieve distributional robustness with respect to interventions on the style variables. Conditioning on ID allows us to guard specifically against these interventions. An unconditional variance penalty, in contrast, can achieve robustness against a predefined class of distributions such as a ball of distributions defined in a Kullback-Leibler or Wasserstein metric. The population CoRe estimator is defined as in Eq. (8), with empirical quantities replaced by their population counterparts. Domain shift robustness for a partially linear version of the structural equation model (3) is shown in Sect. 4.1. Furthermore, we discuss the population limit of $\hat{\theta}^{core}(\lambda)$ in Sect. 4.2, where we show that the regularization parameter $\lambda \ge 0$ is proportional to the size of the future style interventions that we want to guard against for future test data.

Footnote 6: As an example, if the change in distribution for $S$ is caused by random shift-interventions $\Delta$, then $\tilde{S} \leftarrow S + \Delta$, and the distance $D_{style}$ induced in the distributions satisfies
$$D_{style}(F_0, F) \le E\big[E(\Delta^\top \Sigma_{Y,ID}^{-1} \Delta \mid Y, ID)\big],$$
ensuring that the strength of the shifts is measured against the natural variability $\Sigma_{y,id}$ of the style features.

Estimating the expected conditional variance
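Recall that $G_j \subseteq \{1, \ldots, n\}$ contains samples with identical realizations of $(Y, ID)$ for $j \in \{1, \ldots, m\}$. For each $j \in \{1, \ldots, m\}$, define $\hat{\mu}_{\theta,j}$ as the arithmetic mean across all $f_\theta(x_i)$, $i \in G_j$. The canonical estimator of the conditional variance $\hat{C}_{f,1,\theta}$ is then
$$\hat{C}_{f,1,\theta} := \frac{1}{m} \sum_{j=1}^m \frac{1}{|G_j|} \sum_{i \in G_j} \big(f_\theta(x_i) - \hat{\mu}_{\theta,j}\big)^2, \quad \text{where } \hat{\mu}_{\theta,j} = \frac{1}{|G_j|} \sum_{i \in G_j} f_\theta(x_i),$$
and analogously for the conditional-variance-of-loss, defined in Eq. (10) (see footnote 7). If there are no groups of samples that share the same identifier $(y, id)$, we define $\hat{C}_{f,1,\theta}$ to vanish. The CoRe estimator is then identical to pooled estimation in this special case.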
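A minimal sketch of this estimator is given below. It assumes that the canonical estimator averages, over all m groups, the within-group mean squared deviation of the predictions from their group mean (the case with exponent 1); the function name and the toy numbers are hypothetical.

```python
def conditional_variance_penalty(preds, groups):
    """Average over all groups of the within-group variance of predictions.

    preds  : list of predictions f(x_i), one per sample
    groups : list of index lists, a partition of range(len(preds))
    Singleton groups contribute zero, so the penalty vanishes when no
    two samples share the same (y, id).
    """
    if not groups:
        return 0.0
    total = 0.0
    for g in groups:
        mean = sum(preds[i] for i in g) / len(g)           # group mean
        total += sum((preds[i] - mean) ** 2 for i in g) / len(g)
    return total / len(groups)


preds = [1.0, 3.0, 5.0, 2.0]
groups = [[0, 1], [2], [3]]   # samples 0 and 1 share the same (y, id)

penalty = conditional_variance_penalty(preds, groups)
no_groups = conditional_variance_penalty(preds, [[0], [1], [2], [3]])
print(penalty, no_groups)
```

In a DNN the same computation would be applied to the logits of a mini-batch inside an automatic differentiation framework; the pure-Python version above is only meant to make the arithmetic explicit.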

Motivating examples (continued)
We revisit the first example from Sect. 1.1. Figure 4 shows subsamples of the training and test set with the estimated decision boundaries for different values of the penalty parameter $\lambda$ when using a 2-layer fully connected neural network. Here, $n = 20{,}000$ and $c = 500$. Additionally, grouped examples that share the same $(y, id)$ are visualized: two grouped observations are connected by a line or curve, respectively. Ten such groups are shown. Panel (a) shows the decision boundaries for $\lambda = 0$, equivalent to the pooled estimator, and for CoRe with $\lambda \in \{0, 0.05, 0.1, 1\}$. The pooled estimator misclassifies a large number of test points as can be seen in panel (b), suffering from a test error of ≈ 58%. In contrast, the decision boundary of the CoRe estimator with $\lambda = 1$ aligns with the direction along which the grouped observations vary, classifying the test set with almost perfect accuracy (test error is ≈ 0%).

Footnote 7: The right hand side of the canonical estimator of $\hat{C}_{f,1,\theta}$ can also be interpreted as the graph Laplacian (Belkin et al. 2006) of an appropriately weighted graph that fully connects all observations $i \in G_j$ for each $j \in \{1, \ldots, m\}$.

Domain shift robustness for the CoRe estimator
We show two properties of the CoRe estimator. First, consistency is shown under the risk definition (6) for an infinitely large conditional variance penalty and the logistic loss in a partially linear structural equation model. Second, the population CoRe estimator is shown to achieve distributional robustness against shift interventions in a first order expansion.

Asymptotic domain shift robustness under strong interventions
We analyze the loss under strong domain shifts, as given in Eq. (6), for the pooled and the CoRe estimator in a one-layer network for binary classification (logistic regression) in an asymptotic setting of large sample size and strong interventions.
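Fig. 4 The decision boundary as function of the penalty parameter $\lambda$ for Example 1 from Fig. 1; panel (a) shows the training set. There are ten pairs of samples visualized that share the same identifier $(y, id)$ and these are connected by a curve in the figures. The decision boundary associated with a solid line corresponds to $\lambda = 0$, the standard pooled estimator that ignores the groupings. The broken lines are decision boundaries for increasingly strong penalties, taking into account the groupings in the data. Here, we only show a subsample of the data to avoid overplotting.

Assume the structural equation for the image $X \in \mathbb{R}^p$ is linear in the style features $S \in \mathbb{R}^q$ (with generally $p \gg q$) and we use logistic regression to predict the class label $Y \in \{-1, 1\}$. Let the interventions $\Delta \in \mathbb{R}^q$ act additively on the style features $S$ (this is only for notational convenience) and let the style features $S$ act in a linear way on the image $X$ via a matrix $W \in \mathbb{R}^{p \times q}$ (this is an important assumption without which results are more involved). The core or 'conditionally invariant' features are $C \in \mathbb{R}^r$, where in general $r \le p$ but this is not important for the following. For independent $\varepsilon_Y$, $\varepsilon_{ID}$, $\varepsilon_{style}$ in $\mathbb{R}$, $\mathbb{R}$, $\mathbb{R}^q$ respectively with positive density on their support and continuously differentiable functions $k_y$, $k_{id}$, $k_{style}$, $k_{core}$, $k_x$,
$$\begin{aligned}
\text{class} \quad & Y \leftarrow k_y(D, \varepsilon_Y) \\
\text{identifier} \quad & ID \leftarrow k_{id}(Y, \varepsilon_{ID}) \\
\text{core or conditionally invariant features} \quad & C \leftarrow k_{core}(Y, ID) \\
\text{style or orthogonal features} \quad & S \leftarrow k_{style}(Y, ID, \varepsilon_{style}) + \Delta \\
\text{image} \quad & X \leftarrow k_x(C) + W S.
\end{aligned}$$
We assume a logistic regression as a prediction of $Y$ from the image data $X$:
$$f_\theta(x) := \frac{\exp(x^\top \theta)}{1 + \exp(x^\top \theta)}.$$
Given training data with $n$ samples, we estimate $\theta$ with $\hat{\theta}$ and use here a logistic loss $\ell_\theta(y_i, x_i) := \log\big(1 + \exp(-y_i (x_i^\top \theta))\big)$. The formulation of Theorem 1 relies on the following assumptions.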

Assumption 1
We require the following conditions:
(A1) Assume the conditional distribution S|Y = y, ID = id under the training distribution F_0 has positive density (with respect to the Lebesgue measure) in an ε-ball in ℓ2-norm around the origin for some ε > 0 for all y ∈ 𝒴 and id ∈ ℐ.
(A2) Assume the matrix W has full rank q.
(A3) Let M ≤ n be the number of unique realizations among n iid samples of (Y, ID) and let p_n := P(M ≤ n − q). Assume that p_n → 1 for n → ∞.
Assumption (A1) is a key assumption about the style variations we observe in the training set. It requires that we observe some variance in those directions that we expect to be subject to domain shifts in the future. If, on the other hand, the conditional variance in a particular direction is vanishing, we also expect it to vanish in the future. A violation of this assumption would imply that the guarantee of the CoRe regularization no longer holds. Assumption (A3) guarantees that the number c = n − m of grouped examples is at least as large as the dimension of the style variables. If we have too few or no grouped examples (small c), we cannot estimate the conditional variance accurately. Under these assumptions we can prove domain shift robustness.
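The counting behind Assumption (A3) can be made concrete with a small sketch. The helper names `group_counts` and `enough_grouped_examples` are hypothetical and introduced only for illustration; the check c ≥ q is the finite-sample analogue of the event M ≤ n − q.

```python
def group_counts(y, ids):
    """Return (n, m, c): samples, unique (y, id) groups, grouped examples c = n - m."""
    pairs = list(zip(y, ids))
    n = len(pairs)
    m = len(set(pairs))
    return n, m, n - m

def enough_grouped_examples(y, ids, q):
    """True if the number of grouped examples c is at least the style dimension q."""
    _, _, c = group_counts(y, ids)
    return c >= q

# Hypothetical toy labels: only the first two samples share the same (y, id).
labels = [1, 1, 0, 0, 1]
identifiers = [5, 5, 6, 7, 8]
print(group_counts(labels, identifiers))
print(enough_grouped_examples(labels, identifiers, q=1))
```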

Theorem 1 (Asymptotic domain shift robustness under strong interventions)
Under model (11) and Assumption 1, with probability 1, the pooled estimator (7) has infinite loss (6) under arbitrarily large shifts in the distribution of the style features. The CoRe estimator (8) θ̂^core with λ → ∞ is, for n → ∞, domain shift robust under strong interventions.

A proof is given in "Appendix A". The respective ridge penalties in both estimators (7) and (8) are assumed to be zero for the proof, but the proof can easily be generalized to include ridge penalties that vanish sufficiently fast for large sample sizes. The Lagrangian regularizer is assumed to be infinite for the CoRe estimator to achieve domain shift robustness under these strong interventions. The next section considers the population CoRe estimator in a setting with weak interventions and finite values of the penalty parameter.
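The mechanism behind the theorem can be illustrated numerically under the linearity assumption stated above, where a style shift moves the image by W times the shift. The sketch below is an illustration and not part of the proof: the matrix `W`, the vectors `x` and `theta_pooled`, and the projection used to obtain `theta_invariant` are all hypothetical choices. A parameter vector with a component along the columns of W has a logistic loss that grows without bound as the shift grows, whereas a vector orthogonal to the columns of W is unaffected.

```python
import numpy as np

# Hypothetical style-to-image map with full column rank q = 2 (cf. A2).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])
x = np.array([0.2, -0.1, 0.4, 1.0])            # one unshifted image
theta_pooled = np.array([1.0, -2.0, 0.5, 3.0])  # uses style directions

# Project onto the orthogonal complement of span(W): W^T theta_invariant = 0.
projector = np.eye(W.shape[0]) - W @ np.linalg.pinv(W)
theta_invariant = projector @ theta_pooled

def logistic_loss(y, x, theta):
    # log(1 + exp(-y * x^T theta)) for y in {-1, 1}, computed stably.
    return float(np.logaddexp(0.0, -y * (x @ theta)))

def loss_under_shift(theta, delta, y=-1):
    # Additive shift delta on the style features moves the image by W @ delta.
    return logistic_loss(y, x + W @ delta, theta)

delta = np.array([100.0, -50.0])
for scale in (0.0, 1.0, 10.0):
    print(scale,
          loss_under_shift(theta_pooled, scale * delta),
          loss_under_shift(theta_invariant, scale * delta))
```

The projection is only a convenient way to construct an invariant parameter vector for the illustration; it is not how the CoRe estimator is computed.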

Population domain shift robustness under weak interventions
The previous theorem states that the CoRe estimator can achieve domain shift robustness under strong interventions for an infinitely strong penalty in an asymptotic setting. An open question is how the loss (5) behaves under interventions of small to medium size and correspondingly smaller values of the penalty. Here, we aim to minimize this loss for a given value of ξ and show that domain shift robustness can be achieved to first order with the population CoRe estimator using the conditional-standard-deviation-of-loss penalty, i.e., Eq. (10) with ν = 1/2, by choosing an appropriate value of the penalty λ. Below we will show this appropriate choice of the penalty weight is λ = √ξ.

Assumption 2
We require the following conditions:
(B1) Define the loss under a deterministic shift δ as
$$h_\theta(\delta) = E_{F_\delta}\big[\ell(Y, f_\theta(X))\big],$$
where the expectation is with respect to random (ID, Y, S̃) ∼ F_δ, with F_δ defined by the deterministic shift intervention S̃ = S + δ and (ID, Y, S) ∼ F_0. Assume that for all θ ∈ Θ, h_θ(δ) is twice continuously differentiable with bounded second derivative for a deterministic shift δ ∈ ℝ^q.
(B2) The spectral norm of the conditional variance Σ_{y,id} of S|Y, ID under F_0 is assumed to be smaller or equal to some ζ ∈ ℝ for all y ∈ 𝒴 and id ∈ ℐ.
The first assumption (B1) ensures that the loss is well behaved under interventions on the style variables. The second assumption (B2) allows us to take the limit of small conditional variances in the style variables.
If we set λ = √ξ and use the conditional-standard-deviation-of-loss penalty, the CoRe estimator minimizes the expected loss under F_0 plus √ξ times the penalty C_{l,1/2,θ}. The next theorem shows that this is to first order equivalent to minimizing the worst-case loss over the distribution class F_ξ. The following result holds for the population CoRe estimator; see below for a discussion about consistency.

Theorem 2 The supremum of the loss over the class of distributions F_ξ is to first order given by the expected loss under distribution F_0 with an additional conditional-standard-deviation-of-loss penalty C_{l,1/2,θ}.
A proof is given in "Appendix B". The objective of the population CoRe estimator thus matches to first order the loss under domain shifts if we set the penalty weight λ = √ξ. Larger anticipated domain shifts thus naturally require a larger penalty λ in the CoRe estimation. The result is possible as we have chosen the Mahalanobis distance to measure shifts in the style variable and define F_ξ, ensuring that the strength of shifts on style variables is measured against the natural variance on the training distribution F_0.
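Written out, the first-order statement of Theorem 2 can be sketched as follows. This is a reconstruction from the verbal statement above and not the paper's displayed equation: the symbols ξ (the shift strength indexing the distribution class), θ (the network parameters) and ℓ (the loss) are notation assumed here, and the remainder term is deliberately left unspecified.

```latex
\sup_{F \in \mathcal{F}_\xi} E_F\big[\ell(Y, f_\theta(X))\big]
  \;\approx\;
  E_{F_0}\big[\ell(Y, f_\theta(X))\big] + \sqrt{\xi}\, C_{l,1/2,\theta},
\qquad
C_{l,1/2,\theta} = E\big[\mathrm{Var}\big(\ell(Y, f_\theta(X)) \mid Y, \mathrm{ID}\big)^{1/2}\big].
```

Read this way, the penalty weight plays the role of the square root of the anticipated shift strength, which is why larger anticipated shifts call for a larger penalty.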
In practice, the choice of λ involves a somewhat subjective choice about the strength of the distributional robustness guarantee. A stronger distributional robustness property is traded off against a loss in predictive accuracy if the distribution is not changing in the future. One option for choosing λ is to choose the largest penalty weight before the validation loss increases considerably. This approach would provide the best distributional robustness guarantee that keeps the loss of predictive accuracy in the training distribution within a pre-specified bound.
As a caveat, the result takes the limit of small conditional variance of S in the training distribution and small additional interventions. Under larger interventions higher-order terms could start to dominate, depending on the geometry of the loss function and f_θ. A further caveat is that the result looks at the population CoRe estimator. For finite sample sizes, we would optimize a noisy version of the rhs of (12). To show domain shift robustness in an asymptotic sense, we would need additional uniform convergence (in θ) of both the empirical loss and the conditional variance for n → ∞. While this is in general a reasonable assumption to make, the validity of the assumption will depend on the specific function class and on the chosen estimator of the conditional variance.
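The rule of thumb for choosing the penalty weight (take the largest weight before the validation loss increases considerably) can be sketched as below. The function name `select_penalty_weight`, the use of the smallest weight as the baseline, and the example numbers are assumptions for illustration; they are not results from the paper.

```python
def select_penalty_weight(weights, val_losses, tolerance):
    """Largest weight whose validation loss stays within `tolerance` of the baseline.

    The baseline is the validation loss at the smallest weight. The scan stops
    at the first weight that exceeds the tolerance, so later weights are ignored.
    """
    pairs = sorted(zip(weights, val_losses))
    baseline = pairs[0][1]
    chosen = pairs[0][0]
    for weight, loss in pairs:
        if loss - baseline <= tolerance:
            chosen = weight
        else:
            break
    return chosen

# Hypothetical validation losses for four candidate penalty weights.
candidate_weights = [0.0, 0.1, 1.0, 10.0]
validation_losses = [0.30, 0.29, 0.33, 0.60]
print(select_penalty_weight(candidate_weights, validation_losses, tolerance=0.05))
```

The tolerance encodes the pre-specified bound on the loss of predictive accuracy in the training distribution.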

Experiments
We perform an array of different experiments, showing the applicability and advantage of the conditional variance penalty for two broad settings:
1. Settings where we do not know what the style variables correspond to but still want to protect against a change in their distribution in the future. In the examples we show cases where the style variable is fashion (Sect. 5.2), image quality (Sect. 5.3), movement (Sect. 5.4) or brightness ("Appendix D.1"), none of which is known explicitly to the method. We also include genuinely unknown style variables in Sect. 5.1 (in the sense that they are unknown not only to the methods but also to us as we did not explicitly create the style interventions).
2. Settings where we do know what type of style interventions we would like to protect against. This is usually dealt with by data augmentation (adding images which are, say, rotated or shifted compared to the training data if we want to protect against rotations or translations in the test data; see for example Schölkopf et al. (1996)). The conditional variance penalty is here exploiting that some augmented samples were generated from the same original sample and we use as ID variable the index of the original image. We show that this approach generalizes better than simply pooling the augmented data, in the sense that we need fewer augmented samples to achieve the same test error. This setting is shown in Sect. 5.5.
We compare against the pooled estimator which has the same architecture as the network to which we add the CoRe penalty. For both the pooled and the CoRe estimator we apply an ℓ2 penalty as regularization. We would like to stress that the related work discussed in Sects. 1.2 and 6 cannot be directly compared to the CoRe estimator as these approaches cannot exploit the ID information but rely on having data from different domains available at training time instead. Since this is a different problem setting, we can only compare against the pooled estimator which is a standard approach to classification. As a downside, our approach requires an ID variable, which might not always be available.
To further understand the behavior of the CoRe penalty, we perform a number of analyses and ablation studies to show
(i) How sensitive the performance of CoRe is to the value of the penalty weight (Sects. 5.1.1, 5.2);
(ii) How the CoRe penalty differs from a standard ℓ2 penalty (Sect. 5.1.1);
(iii) How the value of the CoRe penalty can be used as a qualitative measure for the presence of sample bias (Sects. 5.1.1, 5.2);
(iv) How sensitive the performance of both the CoRe and the pooled estimator is to label shift in the grouped observations (Sect. 5.2.1);
(v) How the relative performance of both estimators is affected when using pre-trained InceptionV3 features (Sect. 5.2.2);
(vi) How sensitive the performance is to different grouping strategies (Sects. 5.
Details of the network architectures can be found in "Appendix C". All reported error rates are averaged over five runs of the respective method. A TensorFlow (Abadi et al. 2015) implementation of CoRe can be found at https://github.com/christinaheinze/core.

Eyeglasses detection with small sample size
In this example, we explore a setting where training and test data are drawn from the same distribution, so we might not expect a distributional shift between the two. However, we consider a small training sample size which gives rise to statistical fluctuations between training and test data. We assess to what extent the conditional variance penalty can help to improve test accuracies in this setting. Specifically, we use a subsample of the CelebA dataset and try to classify images according to whether or not the person in the image wears glasses. For construction of the ID variable, we exploit the fact that several photos of the same person are available and set ID to be the identifier of the person in the dataset. Figure 5 shows examples from both the training and the test dataset. The conditional variance penalty is estimated across groups of observations that share a common (Y, ID). Here, this corresponds to pictures of the same person where all pictures show the person either with glasses (if Y = 1) or all pictures show the person without glasses (Y = 0). Statistical fluctuations between training and test set could for instance arise if by chance the background of eyeglass wearers is darker in the training sample than in test samples, or if the eyeglass wearers happen to be outdoors more often or are more often female than male.
Below, we present the following analyses. First, we look at five different datasets and analyze the effect of adding the CoRe penalty (using conditional-variance-of-prediction) to the cross-entropy loss. Second, we focus on one dataset and compare the four different variants of the CoRe penalty in Eqs. (9) and (10) with ν ∈ {1/2, 1}.

CoRe penalty using the conditional variance of the predicted logits
We consider five different training sets which are created as follows. For each person in the standard CelebA training data we count the number of available images and select the 50 identities for which most images are available individually. We partition these 50 identities into 5 disjoint subsets of size 10 and consider the resulting 5 datasets. Given a training dataset, the standard approach would be to pool all examples. The only additional information we exploit is that some observations can be grouped. If using a 5-layer convolutional neural network with a standard ridge penalty (details can be found in Table 5) and pooling all data, the test error on unseen images ranges from 18.08 to 25.97%. Exploiting the group structure with the CoRe penalty (in addition to a ridge penalty) results in test errors ranging from 14.79 to 21.49%, see Table 1. The relative improvements when using the CoRe penalty range from 9 to 28.6%.
The test error is not very sensitive to the weight of the CoRe penalty as shown in Fig. 6a: for a large range of penalty weights, adding the CoRe penalty decreases the test error compared to the pooled estimator (identical to a CoRe penalty weight of 0). This holds true for various ridge penalty weights.
While the test error rates shown in Fig. 6 suggest already that the CoRe penalty differentiates itself clearly from a standard ridge penalty, we examine next the differential effect of the CoRe penalty on the between- and within-group variances. Concretely, the variance of the predictions can be decomposed as
$$\mathrm{Var}(f_\theta(X)) = E\big[\mathrm{Var}(f_\theta(X) \mid Y, \mathrm{ID})\big] + \mathrm{Var}\big[E(f_\theta(X) \mid Y, \mathrm{ID})\big],$$
where the first term on the rhs is the within-group variance that CoRe penalizes, while a ridge penalty would penalize both the within- and also the between-group variance (the second term on the rhs above). In Fig. 6b we show the ratio between the CoRe penalty and the between-group variance where groups are defined by conditioning on (Y, ID). Specifically, the ratio is computed as
$$\frac{\hat{E}\big[\widehat{\mathrm{Var}}(f_\theta(X) \mid Y, \mathrm{ID})\big]}{\widehat{\mathrm{Var}}\big[\hat{E}(f_\theta(X) \mid Y, \mathrm{ID})\big]}. \qquad (13)$$
The results shown in Fig. 6b are computed on dataset 1 (DS 1). While increasing ridge penalty weights do lead to a smaller value of the CoRe penalty, the between-group variance is also reduced such that the ratio between the two terms does not decrease with larger weights of the ridge penalty. With increasing weight of the CoRe penalty, the variance ratio decreases, showing that the CoRe penalty indeed penalizes the within-group variance more than the between-group variance.
Table 1 also reports the value of the CoRe penalty after training when evaluated for the pooled and the CoRe estimator on the training and the test set. As a qualitative measure to assess the presence of sample bias in the data (provided the model assumptions hold), we can compare the value the CoRe penalty takes after training when evaluated for the pooled estimator and the CoRe estimator. The difference yields a measure for the extent to which the respective estimators are functions of the style interventions Δ. If the respective hold-out values are both small, this would indicate that the style features are not very predictive for the target variable. If, on the other hand, the CoRe penalty evaluated for the pooled estimator takes a much larger value than for the CoRe estimator (as in this case), this would indicate the presence of sample bias.

Fig. 6 The results can be seen to be fairly insensitive to the ridge penalty. b The variance ratio (13) on test data as a function of both the CoRe and ridge penalty weights. The CoRe penalty can be seen to penalize the within-group variance selectively, whereas a strong ridge penalty decreases both the within- and between-group variance
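The decomposition of the prediction variance into within-group and between-group parts, and the resulting variance ratio, can be checked numerically. The sketch below is a plain NumPy illustration with hypothetical function names; it weights each group by its size so that the two parts add up exactly to the overall variance, which may differ slightly from how the penalty is averaged over groups in the actual implementation.

```python
import numpy as np

def within_between_variance(preds, y, ids):
    """Size-weighted within-group and between-group variance for groups (y, id)."""
    preds = np.asarray(preds, dtype=float)
    keys = list(zip(np.asarray(y).tolist(), np.asarray(ids).tolist()))
    overall_mean = preds.mean()
    within, between = 0.0, 0.0
    for key in set(keys):
        members = [i for i, k in enumerate(keys) if k == key]
        group = preds[members]
        weight = group.size / preds.size
        within += weight * group.var()
        between += weight * (group.mean() - overall_mean) ** 2
    return within, between

def variance_ratio(preds, y, ids):
    """Ratio of within-group to between-group variance."""
    within, between = within_between_variance(preds, y, ids)
    return within / between

# Hypothetical predictions for two groups of two samples each.
preds = [1.0, 3.0, 5.0, 9.0]
labels = [0, 0, 1, 1]
identifiers = [1, 1, 2, 2]
within, between = within_between_variance(preds, labels, identifiers)
print(within, between, np.var(preds), variance_ratio(preds, labels, identifiers))
```

A regularizer that shrinks all predictions uniformly scales both parts by the same factor and leaves the ratio unchanged, whereas a penalty on the within-group part alone lowers it.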

Other CoRe penalty types
We now compare all CoRe penalty types, i.e., penalizing with (i) the conditional variance of the predicted logits Ĉ_{f,1,θ}, (ii) the conditional standard deviation of the predicted logits Ĉ_{f,1/2,θ}, (iii) the conditional variance of the loss Ĉ_{l,1,θ} and (iv) the conditional standard deviation of the loss Ĉ_{l,1/2,θ}. For this comparison, we use the training dataset 1 (DS 1) from above. Table 2 contains the test error (training error was 0% for all methods) as well as the value the respective CoRe penalty took after training on the training set and the test set. The performance differences between the four CoRe penalty variants are not statistically significant. Hence, we mostly focus on the conditional variance of the predicted logits Ĉ_{f,1,θ} in the other experiments.

Discussion
While the distributional shift in this example arises due to statistical fluctuations which will diminish as the sample size grows, the following examples are more concerned with biases that will persist even if the number of training and test samples is very large. A second difference to the subsequent examples is the grouping structure: in this example, we consider only a few identities, namely m = 10, with a relatively large number n_i of associated observations (about thirty observations per individual). In the following examples, m is much larger while n_i is typically smaller than five.

Gender classification with unknown confounding
In the following set of experiments, we work again with the CelebA dataset and the 5-layer convolutional neural network architecture described in Table 5. This time we consider the problem of classifying whether the person shown in the image is male or female. We create a confounding in training and test set 1 by including mostly images of men wearing glasses and women not wearing glasses. In test set 2 the association between gender and glasses is flipped: women always wear glasses while men never do (see Fig. 7). The training set, test set 1 and test set 2 are subsampled such that they are balanced with respect to Y, resulting in 16,982, 4224 and 1120 observations, respectively. To compute the conditional variance penalty, we use again images of the same person. The ID variable is, in other words, the identity of the person and gender Y is constant across all examples with the same ID. Conditioning on (Y, ID) is hence identical to conditioning on ID alone. Another difference to the other experiments is that we consider a binary style feature here.

Label shift in grouped observations
We compare six different datasets that vary with respect to the distribution of Y in the grouped observations. In all training datasets, the total number of observations is 16,982 and the total number of grouped observations is 500. In the first dataset, 50% of the grouped observations correspond to males and 50% correspond to females. In the remaining 5 datasets, we increase the proportion of grouped observations with Y = "man", denoted by κ, to 75%, 90%, 95%, 99% and 100%, respectively. Table 3 shows the performance obtained for these datasets when using the pooled estimator compared to the CoRe estimator with Ĉ_{f,1,θ}. The results show that both the pooled estimator and the CoRe estimator perform better if the distribution of Y in the grouped observations is more balanced. The CoRe estimator improves the error rate of the pooled estimator by ≈ 28 to 39% on a relative scale. Figure 8 shows the performance for κ = 50% as a function of the CoRe penalty weight. Significant improvements can be obtained across a large range of values for the CoRe penalty and the ridge penalty. Test errors become more sensitive to the chosen value of the CoRe penalty for very large values of the ridge penalty weight as the overall amount of regularization is already large.

Using pre-trained Inception V3 features
To verify that the above conclusions do not change when using more powerful features, we here compare ℓ2-regularized logistic regression using pre-trained Inception V3 features with and without the CoRe penalty. Table 4 shows the results for κ = 0.5. While the results show that both the pooled estimator and the CoRe estimator perform better using pre-trained Inception features, the relative improvement with the CoRe penalty is still 28% on test set 2.

Ablation experiments
In Sect. D.3.1, we report results for the following two additional baselines: (i) we group all examples sharing the same class label and penalize with the conditional variance of the predicted logits, computed over these two groups; (ii) we penalize the overall variance of the predicted logits, i.e., a form of unconditional variance regularization.

Fig. 8 Panels a and b show the test error on test data sets 1 and 2, respectively, as a function of the CoRe and ridge penalty. Panels c and d show the variance ratio (13) (comparing within- and between-group variances) for females and males separately

Eyeglasses detection with known and unknown image quality intervention
We now revisit the second example from Sect. 1.1. We again use the CelebA dataset and consider the problem of classifying whether the person in the image is wearing eyeglasses.
Here, we modify the images in the following way: in the training set and in test set 1, we sample the image quality for all samples {i : y_i = 1} (all samples that show glasses) from a Gaussian distribution with mean μ = 30 and standard deviation σ = 10. Samples with y_i = 0 (no glasses) are unmodified. In other words, if the image shows a person wearing glasses, the image quality tends to be lower. In test set 2, the quality is reduced in the same way for y_i = 0 samples (no glasses), while images with y_i = 1 are not changed. Figure 9 shows examples from the training set and test sets 1 and 2. For the CoRe penalty, we calculate the conditional variance across images that share the same ID if Y = 1, that is across images that show the same person wearing glasses on all images. Observations with Y = 0 (not wearing glasses) are not grouped. Two examples are shown in the red box of Fig. 9.

Table 3 We compare six different datasets that vary with respect to the distribution of Y in the grouped observations. Specifically, we vary the proportion of images showing men between κ = 0.5 and κ = 1. In all training datasets, the total number of observations is 16,982 and the total number of grouped observations is 500. Both the pooled estimator and the CoRe estimator perform better if the distribution of Y in the grouped observations is more balanced. The CoRe estimator improves the error rate of the pooled estimator by ≈ 28 to 39% on a relative scale.
Here, we have c = 5000 grouped observations among a total sample size of n = 20,000. Figure 9 shows misclassification rates for CoRe and the pooled estimator on test sets 1 and 2. The pooled estimator (only penalized with an ℓ2 penalty) achieves a low error rate of 2% on test set 1, but suffers from a 65% misclassification error on test set 2, as now the relation between Y and the implicit S variable (image quality) has been flipped. The CoRe estimator has a larger error of 13% on test set 1 as image quality as a feature is penalized by CoRe implicitly and the signal is less strong if image quality has been removed as a dimension. However, on test set 2 the error of the CoRe estimator is 28%, a substantial improvement on the 65% error of the pooled estimator. The reason is again the same: the CoRe penalty ensures that image quality is not used as a feature to the same extent as for the pooled estimator. This increases the test error slightly if the samples are generated from the same distribution as training data (as here for test set 1) but substantially improves the test error if the distribution of image quality, conditional on the class label, is changed on test data (as here for test set 2).
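The label-dependent image quality intervention described in this experiment can be sketched as follows. Only the sampling of quality levels is shown; the actual re-encoding of the images is left to an image library. The function name `sample_quality_intervention`, the rounding, and the clipping to the range 1 to 100 are assumptions made here and are not taken from the paper.

```python
import numpy as np

def sample_quality_intervention(y, modified_label, mean=30.0, sd=10.0, seed=0):
    """Draw a quality level for every sample whose label equals `modified_label`.

    Samples with the other label are left unmodified (NaN marks "no change").
    Assumption: draws are rounded and clipped to a valid quality range [1, 100].
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    quality = np.full(y.shape, np.nan)
    mask = y == modified_label
    draws = rng.normal(mean, sd, size=int(mask.sum()))
    quality[mask] = np.clip(np.rint(draws), 1, 100)
    return quality

labels = np.array([1, 0, 1, 1, 0])
# Training set and test set 1: quality is lowered for y = 1 (glasses).
train_quality = sample_quality_intervention(labels, modified_label=1)
# Test set 2: the association is flipped, quality is lowered for y = 0.
test2_quality = sample_quality_intervention(labels, modified_label=0)
print(train_quality)
print(test2_quality)
```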

Eyeglasses detection with known image quality intervention
To compare to the above results, we repeat the experiment by changing the grouped observations as follows. Above, we grouped images that had the same person ID when Y = 1. We refer to this scheme of grouping observations with the same (Y, ID) as 'Grouping setting 2'. Here, we use an explicit augmentation scheme and augment c = 5000 images with Y = 1 in the following way: each image is paired with a copy of itself and the image quality is adjusted as described above. In other words, the only difference between the two images is that image quality differs slightly, depending on the value that was drawn from the Gaussian distribution with mean μ = 30 and standard deviation σ = 10, determining the strength of the image quality intervention. Both the original and the copy get the same value of the identifier variable ID. We call this grouping scheme 'Grouping setting 1'. Compare the left panels of Figs. 9 and 10 for examples.

Fig. 9 Eyeglass detection for CelebA dataset with image quality interventions (which are unknown to any procedure used). The JPEG compression level is lowered for Y = 1 (glasses) samples on training data and test set 1 and lowered for Y = 0 (no glasses) samples for test set 2. To the human eye, these interventions are barely visible but the CNN that uses pooled data without CoRe penalty has exploited the correlation between image quality and outcome Y to achieve an (arguably spurious) low test error of 2% on test set 1. However, if the correlation between image quality and Y breaks down, as in test set 2, the CNN that uses pooled data without a CoRe penalty has a 65% misclassification rate. The training data on the left show paired observations in two red boxes: these observations share the same label Y and show the same person ID. They are used to compute the conditional variance penalty for the CoRe estimator that does not suffer from the same degradation in performance for test set 2
While we used explicit changes in image quality both above and here, we referred to grouping setting 2 as 'unknown image quality interventions' as the training sample as in the left panel of Fig. 9 does not immediately reveal that image quality is the important style variable. In contrast, the augmented data samples (grouping setting 1) we use here differ only in their image quality for a constant (Y, ID).
Figure 10 shows examples and results. The pooled estimator performs more or less identically to the previous dataset. The explicit augmentation did not help as the association between image quality and whether eyeglasses are worn is not changed in the pooled data after including the augmented data samples. The misclassification error of the CoRe estimator is substantially better than the error rate of the pooled estimator. The error rate on test set 2 of 13% also improves on the rate of 28% of the CoRe estimator in grouping setting 2. We see that using grouping setting 1 works best since we could explicitly control that only S ≡ image quality varies between grouped examples. In grouping setting 2, different images of the same person can vary in many factors, making it more challenging to isolate image quality as the factor to be invariant against.

Fig. 10 The difference to Fig. 9 is in the training data where the paired images now use the same underlying image in two different JPEG compressions. The compression level is drawn from the same distribution. The CoRe penalty performs better than for the experiment in Fig. 9
A similar example where S ≡ brightness is summarized in "Appendix D.1".

Stickmen image-based age classification with unknown movement interventions
In this example we consider synthetically generated stickmen images; see Fig. 11 for some examples. The target of interest is Y ∈ {adult, child}. The core feature C is here the height of each person. The class Y is causal for height and height cannot be easily intervened on or change in different domains. Height is thus a robust predictor for differentiating between children and adults. As style feature we have here the movement of a person (distribution of angles between body, arms and legs). For the training data we created a dependence between age and the style feature 'movement', which can be thought to arise through a hidden common cause D, namely the place of observation. For instance, the images of children might mostly show children playing while the images of adults typically show them in more "static" postures. The left panel of Fig. 11 shows examples from the training set where large movements are associated with children and small movements are associated with adults. Test set 1 follows the same distribution, as shown in the middle panel. A standard CNN will exploit this relationship between movement and the label Y of interest, whereas this is discouraged by the conditional variance penalty of CoRe. The latter is pairing images of the same person in slightly different movements as shown by the red boxes in the leftmost panel of Fig. 11. If the learned model exploits this dependence between movement and age for predicting Y, it will fail when presented with images of, say, dancing adults. The right panel of Fig. 11 shows such examples (test set 2). The standard CNN suffers in this case from a 41% misclassification rate, as opposed to the 3% on test set 1 data. For as few as c = 50 paired observations, the network with an added CoRe penalty, in contrast, achieves an error of 4% on test set 1 and of 9% on test set 2, whereas the pooled estimator fails on test set 2 with a test error of 41%.
These results suggest that the learned representation of the pooled estimator uses movement as a predictor for age while CoRe does not use this feature due to the conditional variance regularization. Importantly, including more grouped examples would not improve the performance of the pooled estimator as these would be subject to the same bias and hence also predominantly have examples of heavily moving children and "static" adults (also see Fig. 23 which shows results for c ∈ {20, 500, 2000}).

MNIST: more sample efficient data augmentation
The goal of using CoRe in this example is to make data augmentation more efficient in terms of the required samples. In data augmentation, one creates additional samples by modifying the original inputs, e.g. by rotating, translating, or flipping the images (Schölkopf et al. 1996). In other words, additional samples are generated by interventions on style features. Using this augmented data set for training results in invariance of the estimator with respect to the transformations (style features) of interest. For CoRe we can use the grouping information that the original and the augmented samples belong to the same object. This enforces the invariance with respect to the style features more strongly compared to normal data augmentation which just pools all samples. We assess this for the style feature 'rotation' on MNIST (LeCun et al. 1998) and only include c = 200 augmented training examples for m = 10,000 original samples, resulting in a total sample size of n = 10,200. The degree of the rotations is sampled uniformly at random from [35, 70]. Figure 12 shows examples from the training set. By using CoRe the average test error on rotated examples is reduced from 22% to 10%. Very few augmented samples are thus sufficient to lead to stronger rotational invariance. The standard approach of creating augmented data and pooling all images requires, in contrast, many more samples to achieve the same effect. Additional results for m ∈ {1000, 10,000} and c ranging from 100 to 5000 can be found in Fig. 22 in Appendix Sect. D.5.

Fig. 13 Elmer-the-Elephant dataset. The left panel shows training data with a few additional grayscale elephants. The pooled estimator learns that color is predictive for the animal class and achieves test error of 24% on test set 1 where this association is still true but suffers a misclassification error of 53% on test set 2 where this association breaks down. By adding the CoRe penalty, the test error is consistently around 30%, irrespective of the color distribution of horses and elephants
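The bookkeeping for the MNIST augmentation setup (c = 200 rotated copies of m = 10,000 originals, each copy sharing the ID of its source image) can be sketched as below. The function name `build_augmented_ids` is hypothetical, sampling the source images without replacement is an assumption, and the rotation itself is left to an image library.

```python
import numpy as np

def build_augmented_ids(m, c, low=35.0, high=70.0, seed=0):
    """Pick c of m originals, draw a rotation angle for each, and assign IDs.

    Returns the source indices, the rotation angles in degrees, the ID of every
    sample in the pooled data set of size m + c, and a flag for augmented samples.
    Assumption: source images are drawn without replacement.
    """
    rng = np.random.default_rng(seed)
    source = rng.choice(m, size=c, replace=False)
    angles = rng.uniform(low, high, size=c)
    ids = np.concatenate([np.arange(m), source])  # a copy shares its original's ID
    is_augmented = np.concatenate([np.zeros(m, dtype=bool), np.ones(c, dtype=bool)])
    return source, angles, ids, is_augmented

source, angles, ids, is_augmented = build_augmented_ids(m=10_000, c=200)
_, counts = np.unique(ids, return_counts=True)
print(len(ids), int((counts == 2).sum()))
```

The pooled estimator would use all `m + c` samples without the `ids`; the conditional variance penalty is computed only over the `c` pairs that share an ID.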

Elmer the Elephant
In this example, we want to assess whether invariance with respect to the style feature 'color' can be achieved. In the children's book 'Elmer the elephant' one instance of a colored elephant suffices to recognize it as being an elephant, making the color 'gray' no longer an integral part of the object 'elephant'. Motivated by this process of concept formation, we would like to assess whether CoRe can exclude 'color' from its learned representation by penalizing conditional variance appropriately. We work with the 'Animals with attributes 2' (AwA2) dataset (Xian et al. 2017) and consider classifying images of horses and elephants. We include additional examples by adding grayscale images for c = 250 images of elephants. These additional examples do not distinguish themselves strongly from the original training data as the elephant images are already close to grayscale images. The total training sample size is 1850. Figure 13 shows examples from the training set and test sets, together with misclassification rates for CoRe and the pooled estimator on different test sets. Examples from these and more test sets can be found in Fig. 24. Test set 1 contains original, colored images only. In test set 2 images of horses are in grayscale and the colorspace of elephant images is modified, effectively changing the color gray to red-brown. We observe that the pooled estimator does not perform well on test set 2 as its learned representation seems to exploit the fact that 'gray' is predictive for 'elephant' in the training set. This association is no longer valid for test set 2. In contrast, the predictive performance of CoRe is hardly affected by the changing color distributions. More details can be found in "Appendix D.7".
It is noteworthy that a colored elephant can be recognized as an elephant by adding a few examples of a grayscale elephant to the very lightly colored pictures of natural

Further related work
Encoding certain invariances in estimators is a well-studied area in computer vision and machine learning with an extensive body of literature. While a large part of this work assumes the desired invariance to be known, fewer approaches aim to learn the required invariances from data, and the focus often lies on geometric transformations of the input data or on explicitly creating augmented observations (Sohn and Lee 2012; Khasanova and Frossard 2017; Hashimoto et al. 2017; Devries and Taylor 2017). The main difference between this line of work and CoRe is that we do not need to know the style feature explicitly, the set of possible style features is not restricted to a particular class of transformations, and we do not aim to create augmented observations in a generative framework.
Recently, various approaches have been proposed that leverage causal motivations for deep learning or use deep learning for causal inference, related to e.g. the problems of cause-effect inference and generative adversarial networks (Chalupka et al. 2014; Lopez-Paz and Oquab 2017; Goudet et al. 2017; Bahadori et al. 2017; Besserve et al. 2018; Kocaoglu et al. 2018). Kilbertus et al. (2017) exploit causal reasoning to characterize fairness considerations in machine learning. Distinguishing between the protected attribute and its proxies, they derive causal non-discrimination criteria. The resulting algorithms avoiding proxy discrimination require classifiers to be constant as a function of the proxy variables in the causal graph, thereby bearing some structural similarity to our style features.
Distinguishing between core and style features can be seen as some form of disentangling factors of variation. Estimating disentangled factors of variation has gathered a lot of interest in the context of generative modeling. As in CoRe, Bouchacourt et al. (2018) exploit grouped observations. In a variational autoencoder framework, they aim to separate style and content: they assume that samples within a group share a common but unknown value for one of the factors of variation while the style can differ. Denton and Birodkar (2017) propose an autoencoder framework to disentangle style and content in videos using an adversarial loss term where the grouping structure induced by clip identity is exploited. Here we try to solve a classification task directly without estimating the latent factors explicitly as in a generative framework.
In the computer vision literature, various works have used identity information to achieve pose invariance in the context of face recognition (Bartlett and Sejnowski 1996; Tran et al. 2017). More generally, the idea of exploiting various observations of the same underlying object is related to multi-view learning (Xu et al. 2013). In the context of adversarial examples, Kannan et al. (2018) recently proposed the defense "Adversarial logit pairing", which is methodologically equivalent to the CoRe penalty $C_{f,1,\theta}$ when using the squared error loss. Several empirical studies have shown mixed results regarding the performance on $\ell_\infty$ perturbations (Engstrom et al. 2018; Mosbach et al. 2018). So far this setting has not been analyzed theoretically, and hence it is an open question whether a CoRe-type penalty constitutes an effective defense against adversarial examples.

Conclusion
Distinguishing the latent features in an image into core and style features, we have proposed conditional variance regularization (CoRe) to achieve robustness with respect to interventions on the style or "orthogonal" features. The main idea of the CoRe estimator is to exploit the fact that we often have instances of the same object in the training data. By demanding invariance of the classifier amongst a group of instances that relate to the same object, we can achieve invariance of the classification performance with respect to interventions on style features such as image quality, fashion type, color, or body posture. The training also works despite sampling biases in the data.
There are two main application areas: 1. If the style features are known explicitly, we can achieve the same classification performance as standard data augmentation approaches with substantially fewer augmented samples, as shown for example in Sect. 5.5. An interesting line of work would be to use larger models such as Inception or large ResNet architectures (He et al. 2016). These models have been trained to be invariant to an array of explicitly defined style features. In Sect. 5.2.2 we include results which show that using Inception V3 features does not guard against interventions on more implicit style features. We would thus like to assess what benefits CoRe can bring for training Inception-style models end-to-end, both in terms of sample efficiency and in terms of generalization performance. While we showed some examples where the necessary grouping information is available, an interesting possible future direction would be to use video data, since objects display temporal constancy and the temporal information can hence be used for grouping and conditional variance regularization. Beyond that, our results show that it can be worthwhile to collect ID information when new datasets are created. As CoRe only requires a subset of the observations to have ID annotations, in many cases this information might be cheap to collect while it can improve performance substantially when future test data is subject to domain shifts.

Proof of Theorem 1
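First part To show the first part, namely that with probability 1,

$$L_\infty(\hat{\theta}^{pool}) = \infty,$$

we need to show that $W^t \hat{\theta}^{pool} \neq 0$ with probability 1. The reason this is sufficient is as follows: if $W^t \theta \neq 0$, then $L_\infty(\theta) = \infty$ as we can then find a $v \in \mathbb{R}^q$ such that $\gamma := \theta^t W v \neq 0$. Assume without loss of generality that $v$ is suitably normed. Under the linear model $x = x^0 + Ws$ used below, shifting the style features by $\xi v$ with $\xi \in \mathbb{R}$ changes $x^t\theta$ by $\xi\gamma$, so the logistic loss grows without bound for either $\xi \to \infty$ or $\xi \to -\infty$.

To show that $W^t \hat{\theta}^{pool} \neq 0$ with probability 1, let $\hat{\theta}^*$ be the oracle estimator that is constrained to be orthogonal to the column space of $W$:

$$\hat{\theta}^* = \operatorname{argmin}_{\theta :\, W^t\theta = 0} L_n(\theta), \qquad L_n(\theta) := \frac{1}{n}\sum_{i=1}^n \ell(y_i, f_\theta(x_i)). \tag{14}$$

We show $W^t \hat{\theta}^{pool} \neq 0$ by contradiction. Assume hence that $W^t \hat{\theta}^{pool} = 0$. If this is indeed the case, then the constraint $W^t\theta = 0$ in (14) becomes non-active and we have $\hat{\theta}^{pool} = \hat{\theta}^*$. This would imply that the directional derivative of the training loss with respect to any $\delta \in \mathbb{R}^p$ in the column space of $W$ vanishes at the solution $\hat{\theta}^*$. In other words, define the gradient as $g(\theta) = \nabla_\theta L_n(\theta) \in \mathbb{R}^p$. The implication is then that for all $\delta$ in the column space of $W$,

$$\delta^t g(\hat{\theta}^*) = 0, \tag{15}$$

and we will show the latter condition is violated almost surely.

As we work with the logistic loss and $Y \in \{-1, 1\}$, the loss is given by $\ell(y_i, f_\theta(x_i)) = \log(1 + \exp(-y_i x_i^t \theta))$. Define $r_i(\theta) := y_i / (1 + \exp(y_i x_i^t \theta))$. For all $i = 1, \ldots, n$ we have $r_i \neq 0$. Then

$$g(\hat{\theta}^*) = -\frac{1}{n}\sum_{i=1}^n r_i(\hat{\theta}^*)\, x_i. \tag{16}$$

The training images can be written according to the model as $x_i = x_i^0 + W s_i$, where $X^0 := k_x(C, \varepsilon_X)$ are the images in absence of any style variation. Since the style features only have an effect on the column space of $W$ in $X$, the oracle estimator $\hat{\theta}^*$ is identical under the true training data and the (hypothetical) training data $x_i^0$, $i = 1, \ldots, n$, in absence of style variation. As $X - X^0 = WS$, Eq. (16) can also be written as

$$g(\hat{\theta}^*) = -\frac{1}{n}\sum_{i=1}^n r_i(\hat{\theta}^*)\, x_i^0 - \frac{1}{n}\sum_{i=1}^n r_i(\hat{\theta}^*)\, W s_i. \tag{17}$$

Since $\delta$ is in the column space of $W$, there exists $u \in \mathbb{R}^q$ such that $\delta = Wu$ and we can write (17) as

$$\delta^t g(\hat{\theta}^*) = -\frac{1}{n}\sum_{i=1}^n r_i(\hat{\theta}^*)\, (x_i^0)^t W u - \frac{1}{n}\sum_{i=1}^n r_i(\hat{\theta}^*)\, (s_i)^t W^t W u. \tag{18}$$

From (A2) we have that the eigenvalues of $W^t W$ are all positive. Also $r_i(\hat{\theta}^*)$ is not a function of the interventions $s_i$, $i = 1, \ldots, n$, since, as above, the estimator $\hat{\theta}^*$ is identical whether trained on the original data $x_i$ or on the intervention-free data $x_i^0$, $i = 1, \ldots, n$. If we condition on everything except for the random interventions by conditioning on $(x_i^0, y_i)$ for $i = 1, \ldots, n$, then the rhs of (18) can be written as

$$-(a + B)^t u,$$

where $a = \frac{1}{n}\sum_{i=1}^n r_i(\hat{\theta}^*)\, W^t x_i^0 \in \mathbb{R}^q$ is fixed (conditionally) and $B = \frac{1}{n}\sum_{i=1}^n r_i(\hat{\theta}^*)\, W^t W s_i \in \mathbb{R}^q$ is a random vector, and $B \neq -a$ with probability 1 by (A1) and (A2). Hence the left hand side of (18) is not identically 0 with probability 1 for any given $\delta$ in the column space of $W$. This shows that the implication (15) is incorrect with probability 1 and hence completes the proof of the first part by contradiction.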
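As a sanity check on the gradient computation in this part of the proof, the following sketch compares the closed form of the gradient of the mean logistic loss, written with $r_i(\theta) = y_i/(1+\exp(y_i x_i^t\theta))$ and with the sign kept explicit, against a finite-difference approximation. It also splits the directional derivative along $\delta = Wu$ into a fixed part $a = \frac{1}{n}\sum_i r_i W^t x_i^0$ and a style-dependent part $B = \frac{1}{n}\sum_i r_i W^t W s_i$. All function names are hypothetical, and the code evaluates the expressions at an arbitrary $\theta$ rather than at the oracle estimator.

```python
import numpy as np

def logistic_loss(theta, x, y):
    """Mean logistic loss for labels y in {-1, +1} and linear predictor x @ theta."""
    return np.mean(np.log1p(np.exp(-y * (x @ theta))))

def r(theta, x, y):
    """r_i(theta) = y_i / (1 + exp(y_i * x_i^t theta)); never zero."""
    return y / (1.0 + np.exp(y * (x @ theta)))

def gradient(theta, x, y):
    """Closed-form gradient of the mean logistic loss: -(1/n) sum_i r_i x_i."""
    return -(r(theta, x, y)[:, None] * x).mean(axis=0)

def numerical_gradient(theta, x, y, eps=1e-6):
    """Central finite differences, used only to check the closed form."""
    g = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        g[k] = (logistic_loss(theta + step, x, y) - logistic_loss(theta - step, x, y)) / (2 * eps)
    return g

def split_directional_derivative(theta, x0, s, W, y, u):
    """Return delta^t g for delta = W u, and the same quantity written as -(a + B)^t u."""
    x = x0 + s @ W.T                                  # rows are x_i = x_i^0 + W s_i
    ri = r(theta, x, y)
    a = (ri[:, None] * (x0 @ W)).mean(axis=0)         # (1/n) sum_i r_i W^t x_i^0
    B = (ri[:, None] * (s @ W.T @ W)).mean(axis=0)    # (1/n) sum_i r_i W^t W s_i
    lhs = (W @ u) @ gradient(theta, x, y)
    return lhs, -(a + B) @ u
```

Both returned values of `split_directional_derivative` agree up to floating-point error, which mirrors the step from the gradient to its decomposition along the column space of $W$.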
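Invariant parameter space Before continuing with the second part of the proof, we need some definitions. Let $I$ be the invariant parameter space, i.e. the set of parameters $\theta$ for which $f_\theta$ is invariant under interventions on the style features. For all $\theta \in I$, the loss (6) for any $F \in \mathcal{F}_\xi$ is identical to the loss under $F_0$, for all $\xi \geq 0$. The optimal predictor in the invariant space $I$ is the minimizer $\theta^*$ of this loss over $\theta \in I$ (19). If $f_\theta$ is only a function of the core features $C$, then $\theta \in I$. The challenge is that the core features are not directly observable and we have to infer the invariant space $I$ from data.

Second part For the second part, we first show that with probability at least $p_n$, as defined in (A3), $\hat{\theta}^{core} = \hat{\theta}^*$ with $\hat{\theta}^*$ defined as in (14). The invariant space for this model is the linear subspace $I = \{\theta : W^t\theta = 0\}$ and by their respective definitions,

$$\hat{\theta}^* = \operatorname{argmin}_{\theta \in I} L_n(\theta), \qquad \hat{\theta}^{core} = \operatorname{argmin}_{\theta \in I_n} L_n(\theta).$$

Since we use $I_n = I_n(\tau)$ with $\tau = 0$, $I_n = \{\theta : \hat{E}(\mathrm{Var}(f_\theta(X) \mid Y, \mathrm{ID})) = 0\}$.
Since $S$ has a linear influence on $X$ in (11), $x_i - x_{i'} = W(s_i - s_{i'})$ if $i$ and $i'$ are in the same group $G_j$ of observations for some $j \in \{1, \ldots, m\}$. Note that the number of grouped examples $n - m$ is equal to or exceeds the rank $q$ of $W$ with probability $p_n$, using (A3), and $p_n \to 1$ for $n \to \infty$. By (A2), it follows then with probability at least $p_n$ that $I_n \subseteq \{\theta : W^t\theta = 0\} = I$. As, by definition, $I \subseteq I_n$ is always true, we have with probability $p_n$ that $I = I_n$. Hence, with probability $p_n$ (and $p_n \to 1$ for $n \to \infty$), $\hat{\theta}^{core} = \hat{\theta}^*$. It thus remains to be shown that

$$L_\infty(\hat{\theta}^*) \to_p L_\infty^*.$$

Since $\hat{\theta}^*$ is in $I$, we have $\ell(y, x(\Delta)) = \ell(y, x^0)$, where $x^0$ are the previously defined data in absence of any style variance. Hence

$$\hat{\theta}^* = \operatorname{argmin}_{\theta \in I} \frac{1}{n}\sum_{i=1}^n \ell(y_i, f_\theta(x_i^0)), \tag{21}$$

that is, the estimator is unchanged if we use the (hypothetical) data $x_i^0$, $i = 1, \ldots, n$, as training data. The population optimal parameter vector defined in (19) is for all $\xi \geq 0$ identical to

$$\theta^* = \operatorname{argmin}_{\theta \in I} E\big[\ell(Y, f_\theta(X^0))\big]. \tag{22}$$

Hence (21) and (22) can be written as

$$\hat{\theta}^* = \operatorname{argmin}_{\theta \in I} L_n^{(0)}(\theta), \qquad \theta^* = \operatorname{argmin}_{\theta \in I} L^{(0)}(\theta),$$

where $L_n^{(0)}$ and $L^{(0)}$ denote the empirical and the population loss on the data in absence of style variation. By uniform convergence of $L_n^{(0)}$ to the population loss $L^{(0)}$, we have $L^{(0)}(\hat{\theta}^*) \to_p L^{(0)}(\theta^*)$. By definition of $I$ and $\theta^*$, we have $L_\infty^* = L_\infty(\theta^*) = L^{(0)}(\theta^*)$. As $\hat{\theta}^*$ is in $I$, we also have $L_\infty(\hat{\theta}^*) = L^{(0)}(\hat{\theta}^*)$. Since, from above, $L^{(0)}(\hat{\theta}^*) \to_p L^{(0)}(\theta^*)$, this also implies $L_\infty(\hat{\theta}^*) \to_p L_\infty(\theta^*) = L_\infty^*$. Using the previously established result that $\hat{\theta}^{core} = \hat{\theta}^*$ with probability at least $p_n$ and $p_n \to 1$ for $n \to \infty$, this completes the proof.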

Proof of Theorem 2
Let $F_0$ be the training distribution of $(\mathrm{ID}, Y, S)$ and $F$ a distribution for $(\mathrm{ID}, Y, \tilde{S})$ in $\mathcal{F}_\xi$. By definition of $\mathcal{F}_\xi$, we can write $\tilde{S} = S + \Delta$ for a suitable random variable $\Delta \in \mathbb{R}^q$ with $\Delta \in \mathcal{U}_\xi$. Vice versa: if we can write $\tilde{S} = S + \Delta$ with $\Delta \in \mathcal{U}_\xi$, then the distribution is in $\mathcal{F}_\xi$. While $X$ under $F_0$ can be written as $X(\Delta = 0)$, the distribution of $X$ under $F$ is of the form $X(\Delta)$ or, alternatively, X( Adopting from now on the latter constraint that $U \in \mathcal{U}_1$, and using (B2), where $\nabla h_\theta$ is the gradient of $h_\theta(\Delta)$ with respect to $\Delta$, evaluated at $\Delta \equiv 0$. Hence The proof is complete if we can show that On the one hand, This follows for a matrix $\Sigma$ with Cholesky decomposition $\Sigma = V^t V$, On the other hand, the conditional-variance-of-loss can be expanded as which completes the proof.

Network architectures
We implemented the considered models in TensorFlow (Abadi et al. 2015). The model architectures used are detailed in Table 5. CoRe and the pooled estimator use the same network architecture and training procedure; only the loss function differs, by the CoRe regularization term. In all experiments we use the Adam optimizer (Kingma and Ba 2015). All experimental results are based on training the respective model five times (using the same data) to assess the variance due to the randomness in the training procedure. In each epoch of the training, the training data $x_i$, $i = 1, \ldots, n$, are randomly shuffled, keeping the grouped observations $(x_i)_{i \in I_j}$ for $j \in \{1, \ldots, m\}$ together to ensure that mini batches will contain grouped observations. In all experiments the mini batch size is set to 120. For small c this implies that not all mini batches contain grouped observations, making the optimization more challenging.
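One simple way to shuffle the training data while keeping grouped observations together is to permute the groups rather than the individual samples. The sketch below illustrates this idea with NumPy. It is an assumption about how such a shuffle could be done, not a description of the original TensorFlow pipeline, and the function name `grouped_epoch_batches` is hypothetical.

```python
import numpy as np

def grouped_epoch_batches(ids, batch_size, rng):
    """Yield index arrays for one epoch; members of the same group stay adjacent.

    ids[i] is the group identifier of sample i (ungrouped samples have unique IDs).
    A group that straddles a batch boundary can still be split across two batches.
    """
    unique_ids, inverse = np.unique(ids, return_inverse=True)
    order = rng.permutation(len(unique_ids))             # shuffle groups, not single samples
    rank = np.empty(len(unique_ids), dtype=int)
    rank[order] = np.arange(len(unique_ids))             # position of each group in the epoch
    shuffled = np.argsort(rank[inverse], kind="stable")  # samples ordered by their group's position
    for start in range(0, len(ids), batch_size):
        yield shuffled[start:start + batch_size]
```

Calling `grouped_epoch_batches(ids, 120, np.random.default_rng(0))` once per epoch would give a fresh ordering each time. When only a few of the n samples are grouped, many of the resulting batches contain no group with more than one member, which matches the remark above about small c.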

Eyeglasses detection: known and unknown brightness interventions
As in Sect. 5.3 we work with the CelebA dataset and try to classify whether the person in the image is wearing eyeglasses. Here we analyze a confounded setting that could arise as follows. Say the hidden common cause D of Y and S is a binary variable and indicates whether the image was taken outdoors or indoors. If it was taken outdoors, then the person tends to wear (sun-)glasses more often and the image tends to be brighter. If the image was taken indoors, then the person tends not to wear (sun-)glasses and the image tends to be darker. In other words, the style variable S is here equivalent to brightness and the structure of the data generating process is equivalent to the one shown in Fig. 3. Figure 14 shows examples from the training set and test sets. As previously, we compute the conditional variance over images of the same person, sharing the same class label (and the CoRe estimator is hence not using the knowledge that brightness is important). Two alternatives for constructing grouped observations in this setting are discussed further below. We use c = 2000 and n = 20,000. For the brightness intervention, we sample the magnitude of the brightness increase or decrease from an exponential distribution with mean β = 20. In the training set and test set 1, we sample the brightness value as $b_{i,j} = [100 + y_i e_{i,j}]_+$ where $e_{i,j} \sim \mathrm{Exp}(\beta^{-1})$ and $y_i \in \{-1, 1\}$, where $y_i = 1$ indicates presence of glasses and $y_i = -1$ indicates absence. For test set 2, we use instead $b_{i,j} = [100 - y_i e_{i,j}]_+$, so that the relation between brightness and glasses is flipped. Figure 14 shows misclassification rates for CoRe and the pooled estimator on different test sets. Examples from all test sets can be found in Fig. 15. First, we notice that the pooled estimator performs better than CoRe on test set 1. This can be explained by the fact that it can exploit the predictive information contained in the brightness of an image while CoRe is restricted not to do so. Second, we observe that the pooled estimator does not perform well on test set 2, as its learned representation seems to use the image's brightness as a predictor for the response, which fails when the brightness distribution in the test set differs significantly from the training set. In contrast, the predictive performance of CoRe is hardly affected by the changing brightness distributions.
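The sampling of the brightness values described above can be sketched as follows, reading $[\cdot]_+$ as the positive part and $\mathrm{Exp}(\beta^{-1})$ as an exponential distribution with rate $1/\beta$, i.e. mean β. The function name `sample_brightness` and the `flip` flag are hypothetical, and how the sampled value is then applied to the image is not covered here.

```python
import numpy as np

def sample_brightness(y, beta, rng, flip=False):
    """Sample b = [100 + y * e]_+ with e ~ Exp(mean beta); flip=True gives [100 - y * e]_+."""
    e = rng.exponential(scale=beta, size=y.shape)  # NumPy's scale parameter is the mean
    sign = -1.0 if flip else 1.0
    return np.maximum(100.0 + sign * y * e, 0.0)   # positive part

# Training set and test set 1: glasses (y = 1) go with brighter images.
# Test set 2 (flip=True): the association is reversed.
```

With `flip=False`, images with glasses receive values of at least 100 and images without glasses values between 0 and 100. With `flip=True` the roles are swapped, which is what breaks the pooled estimator on test set 2.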
We now discuss two alternatives for constructing different test sets and we vary the number of grouped observations in c ∈ {200, 2000, 5000} as well as the strength of the brightness interventions in β ∈ {5, 10, 20}, all with sample size n = 20,000. Generation of the training set and of test sets 1 and 2 was already described above. Here, we consider additionally test set 3, where all images are left unchanged (no brightness interventions at all), and test set 4, where the brightness of all images is increased. Furthermore, we consider three different ways of grouping images. Above, we used images of the same person to create a grouped observation by sampling a different value for the brightness intervention. We refer to this as 'Grouping setting 2' here. An alternative is to use the same image of the same person in different brightnesses (drawn from the same distribution) as a group over which the conditional variance is calculated. We call this 'Grouping setting 1' and it can be useful if we know that we want to protect against brightness interventions in the future. For comparison, we also evaluate grouping with an image of a different person (but sharing the same class label) as a baseline ('Grouping setting 3'). Examples from the training sets using grouping settings 1, 2 and 3 can be found in Fig. 15.
Results for all grouping settings, β ∈ {5, 10, 20} and c ∈ {200, 5000} can be found in Fig. 16. We see that using grouping setting 1 works best since we could explicitly control that only S ≡ brightness varies between grouping examples. In grouping setting 2, different images of the same person can vary in many factors, making it more challenging to isolate brightness as the factor to be invariant against. Lastly, we see that if we group images of different persons ('Grouping setting 3'), the difference between the CoRe estimator and the pooled estimator becomes much smaller than in the previous settings. Figure 17 shows some examples of misclassified observations for Grouping setting 1. Figure 18 shows the numerator and the denominator of the variance ratio defined in Eq. (13) separately as a function of the CoRe penalty weight. In conjunction with Fig. 6b, we observe that a ridge penalty decreases both the within- and between-group variance while the CoRe penalty penalizes the within-group variance selectively.

Additional baselines: Unconditional variance regularization and grouping by class label
As additional baselines, we consider the following two schemes: (i) we group all examples sharing the same class label and penalize with the conditional variance of the predicted logits, computed over these two groups; (ii) we penalize the overall variance of the predicted logits, i.e., a form of unconditional variance regularization. Figure 19 shows the performance of these two approaches. In contrast to the CoRe penalty, regularizing with the variance of the predicted logits conditional on Y only does not yield performance improvements on test set 2, compared to the pooled estimator (corresponding to a penalty weight of 0). Interestingly, using baseline (i) without a ridge penalty does yield an improvement on test set 1, compared to the pooled estimator with various strengths of the ridge penalty. Table 6 additionally reports the standard errors for the results discussed in Sect. 5.2.

Fig. 18 Panel a shows the numerator of the variance ratio defined in Eq. (13) on test data as a function of both the CoRe and ridge penalty weights. Panel b shows the equivalent plot for the denominator. A ridge penalty decreases both the within- and between-group variance while the CoRe penalty penalizes the within-group variance selectively (the latter can be seen more clearly in Fig. 6b)

Fig. 19 Classification for Y ∈ {woman, man} with κ = 0.5, using the baselines which (i) penalize the variance of the predicted logits conditional on the class label Y only; and (ii) penalize the overall variance of the predicted logits (cf. Sect. D.3.1). For baseline (i), panels (a) and (b) show the test error on test data sets 1 and 2 respectively as a function of the "baseline penalty weight" for various ridge penalty strengths. For baseline (ii), the equivalent plots are shown in panels (c) and (d). In contrast to the CoRe penalty, regularizing with these two baselines does not yield performance improvements on test set 2, compared to the pooled estimator (corresponding to a penalty weight of 0)
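The difference between the CoRe penalty and the two baselines lies only in what the variance of the predicted logits is conditioned on. The following sketch makes this explicit for scalar logits. It rests on several assumptions that are not taken from the paper: the within-group variance is the population variance, it is averaged over groups with at least two members, and all function names are hypothetical.

```python
from collections import defaultdict
import numpy as np

def within_group_variance(logits, keys):
    """Average variance of the logits within groups of samples that share a key."""
    groups = defaultdict(list)
    for value, key in zip(logits, keys):
        groups[key].append(value)
    # Singleton groups carry no within-group information and are skipped (assumption).
    variances = [np.var(v) for v in groups.values() if len(v) > 1]
    return float(np.mean(variances)) if variances else 0.0

def core_penalty(logits, y, ids):
    """CoRe-style penalty: condition on both the class label and the identifier."""
    return within_group_variance(logits, list(zip(y, ids)))

def class_only_penalty(logits, y):
    """Baseline (i): condition on the class label only."""
    return within_group_variance(logits, list(y))

def unconditional_penalty(logits):
    """Baseline (ii): overall variance of the logits."""
    return float(np.var(logits))
```

Baseline (i) also penalizes variation between different objects of the same class, and baseline (ii) additionally penalizes variation between classes, whereas conditioning on (Y, ID) targets only the variation within an object.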

Eyeglasses detection: image quality intervention
Here, we show further results for the experiments introduced in Sect. 5.3. Specifically, we consider interventions of different strengths by varying the mean of the quality intervention in μ ∈ {30, 40, 50}. Recall that we use ImageMagick to modify the image quality. In the training set and in test set 1, we sample the image quality value as $q_{i,j} \sim N(\mu, \sigma = 10)$ and apply the command convert -quality q_ij input.

In test set 1 all digits are rotated by an angle sampled uniformly at random from [35, 70] degrees. Test set 2 is the usual MNIST test set. Figure 22 shows the misclassification rates. We see that the misclassification rates of CoRe are always lower on test set 1, showing that it makes data augmentation more efficient. For m = 1000, it even turns out to be beneficial for performance on test set 2.

Stickmen image-based age classification
Here, we show further results for the experiment introduced in Sect. 5.4. Recall that test set 1 follows the same distribution as the training set. In test sets 2 and 3 large movements are associated with both children and adults, while the movements are heavier in test set 3 than in test set 2. Figure D.10b shows results for different numbers of grouping examples. For c = 20 the misclassification rate of the CoRe estimator has a large variance. For c ∈ {50, 500, 2000}, the CoRe estimator shows similar results. Its performance is thus not sensitive to the number of grouped examples, once there are sufficiently many grouped observations in the training set. The pooled estimator fails to achieve good predictive performance on test sets 2 and 3 as it seems to use "movement" as a predictor for "age" (Fig. 23).