Joint-product representation learning for domain generalization in classification and regression

In this work, we study the problem of generalizing a prediction (classification or regression) model trained on a set of source domains to an unseen target domain, where the source and target domains are different but related, i.e., the domain generalization problem. The challenge lies in the domain difference, which can degrade the generalization ability of the prediction model. To tackle this challenge, we propose to learn a neural network representation function that aligns a joint distribution and a product distribution in the representation space, and show that such joint-product distribution alignment conveniently leads to the alignment of multiple domains. In particular, we align the joint distribution and the product distribution under the L²-distance, and show that this distance can be analytically estimated by exploiting its variational characterization together with a linear variational function. This allows us to align the two distributions simply by minimizing the estimated distance with respect to the network representation function. Our experiments on synthetic and real-world datasets for classification and regression demonstrate the effectiveness of the proposed solution. For example, it achieves the best average classification accuracy of 82.26% on the text dataset Amazon Reviews, and the best average regression error of 0.114 on the WiFi dataset UJIIndoorLoc.


Introduction
Supervised learning models (e.g., classification and regression models) with appropriately learned parameters can generalize well to the test data, under the assumption that both the training and test data are governed by the same domain P(x, y), where x and y represent the input and output variables [44]. While this is a reasonable assumption to make, it is likely to be violated in practical applications. In computer vision, the training and test images can be acquired under different imaging conditions (e.g., background and illumination) representing different probability distributions [31]. In indoor WiFi localization, the training and test data often follow different distributions, as they are collected at different time periods [18] or from different places [13]. Under such circumstances, supervised learning models trained by merely following the Empirical Risk Minimization (ERM) principle [44] may perform sub-optimally and fail to make accurate predictions for the test data [38].
As an important problem in machine learning and computer vision, domain generalization [5,6] is exactly concerned with such a non-identically-distributed supervised learning scenario. In this problem, the training data consist of n (n ≥ 2) datasets respectively drawn from n source domains P_1(x, y), …, P_n(x, y), while the test data are sampled from an unseen target domain P_t(x, y). The source and target domains are different but related [17,28,31,37], and the goal of domain generalization is to train a prediction (classification or regression) model on the collection of the n source datasets and generalize it to the target domain. In the following paragraphs, we use mathematical notations to describe the domain generalization works. For clarity and easy readability, we first give an overview of these notations in Table 1.
Prior works [1,17,28,30,38] aim to learn a representation function (i.e., a projection matrix or a neural network) to align the n source domains P_1(x, y), …, P_n(x, y) as a key solution to the problem, and train a classifier/regressor in the representation space. The prediction model, comprising the representation function and the classifier/regressor, is then expected to generalize well to the unseen target domain P_t(x, y) [30,37,52]. Specifically, since a domain P(x, y) can be decomposed as P(x, y) = P(x)P(y|x), early works [17,28,37] align the n domains by learning a representation function to align the marginal distributions (marginals) P_1(x), …, P_n(x), assuming that the posterior distribution P(y|x) is stable. However, as noted in several works [30,38,52], the stability of P(y|x) is often violated in practice. Therefore, later works [30,31,52] propose to align the n domains P_1(x, y), …, P_n(x, y) in other manners. They learn a representation function to align (1) a set of n marginals and c sets of n class-conditional distributions (class-conditionals) P_1(x|y = i), …, P_n(x|y = i) for i ∈ {1, …, c}, or (2) a set of n marginals and n sets of c class-conditionals P_l(x|y = 1), …, P_l(x|y = c) for l ∈ {1, …, n}, where c is the number of classes when y is a discrete variable for classification. However, these works face the heavy workload of aligning multiple sets of marginals and class-conditionals, with each set containing multiple distributions. As noted in [38], such alignment may be difficult to achieve when the number of classes c or the number of domains n is large. Additionally and importantly, in the regression tasks that arise widely in real-world applications [7,13,26], aligning the class-conditionals may not be feasible, since the output variable y is continuous in regression.
In this work, we propose to learn a neural network representation function that aligns the n domains P_1(x, y), …, P_n(x, y) in a different way. Specifically, we first introduce a domain variable l, l ∈ {1, …, n}, and define a joint distribution P(x, y, l) and a product distribution P(x, y)P(l). We then view the domains P_1(x, y), …, P_n(x, y) as P(x, y|l = 1), …, P(x, y|l = n), respectively. Our idea is to learn the network representation function φ such that the joint distribution and the product distribution are aligned in the representation space, i.e., P(φ(x), y, l) = P(φ(x), y)P(l). We show through a proposition that such joint-product distribution alignment leads to the alignment of the n domains, i.e., P(φ(x), y|l = 1) = ⋯ = P(φ(x), y|l = n). Therefore, the problem of aligning multiple domains P_1(x, y), …, P_n(x, y) is conveniently transformed into the problem of aligning two distributions: the joint distribution P(x, y, l) and the product distribution P(x, y)P(l). The benefits of our alignment proposal are twofold. (1) It only needs to align two distributions, regardless of the number of classes c or the number of domains n, which is straightforward and easy to achieve. (2) It naturally applies to the regression tasks, since it does not rely on aligning the class-conditionals.
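The implication "joint equals product ⇒ all domain conditionals coincide" can be checked numerically on a small discrete example. The sketch below (toy bin counts and probabilities, chosen only for illustration) builds a joint P(z, y, l) that factorizes as P(z, y)P(l) and verifies that every domain conditional P(z, y | l) equals P(z, y):

```python
import numpy as np

rng = np.random.default_rng(0)

# Discretized toy setting: 3 representation bins (z), 2 labels (y), 4 domains (l).
pzy = rng.random((3, 2))
pzy /= pzy.sum()                     # P(z, y), a valid probability table
pl = np.array([0.1, 0.2, 0.3, 0.4])  # P(l)

# Build the joint as the product P(z, y) P(l).
joint = pzy[:, :, None] * pl[None, None, :]   # P(z, y, l)

# Conditionals P(z, y | l) = P(z, y, l) / P(l): all equal to P(z, y),
# i.e., the "domains" P(z, y | l = 1), ..., P(z, y | l = n) are aligned.
for l in range(4):
    assert np.allclose(joint[:, :, l] / pl[l], pzy)
print("all domain conditionals coincide")
```

This is exactly the mechanism the proposition exploits: forcing the joint and the product to match in the representation space forces every P(φ(x), y | l) to equal the common P(φ(x), y).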
More specifically, we align P(x, y, l) and P(x, y)P(l) under the L²-distance. This distance, as we will show, can be analytically estimated, and is better suited to our case than the Maximum Mean Discrepancy (MMD) [19]; the next section includes a more detailed discussion justifying this choice. To estimate the L²-distance, we first exploit the Legendre-Fenchel convex duality [40] to obtain its variational characterization, i.e., the maximal value of a quadratic functional with respect to a variational function. We then design the variational function as a linear-in-parameter model and derive an analytic estimate of the L²-distance. As a result, our joint-product distribution alignment can be readily conducted by learning the representation function that minimizes the estimated L²-distance between the joint distribution and the product distribution. In the representation space, we train a downstream classifier/regressor for the inference task in the target domain. Both the representation function and the classifier/regressor are optimized via the minibatch Stochastic Gradient Descent (SGD) algorithm. See Fig. 1 for an illustration of our solution, which we denote JPRL for ''Joint-Product Representation Learning'' in the remainder. We demonstrate the effectiveness of the solution through comprehensive experiments on synthetic and real-world datasets for classification and regression.
This paper is structured as follows. Section 2 reviews the related works. Section 3 introduces the JPRL solution. Section 4 discusses the assumption behind the solution. Section 5 describes the datasets and the experimental settings and reports the experimental results. Section 6 presents the conclusion.

Related work
The study of domain generalization can be traced back to the early works of Blanchard et al. [5] and Khosla et al. [24]. Since then, many strategies have been proposed to tackle the problem. Domain alignment is a popular strategy for domain generalization, which, to a certain extent, is inspired by the domain adaptation works [11,12,16,29,55]. Here, we focus on discussing the domain alignment works [1,28,31,37,38,52], since they are most relevant to our solution. In essence, most of these works learn a representation function (i.e., a projection matrix or a neural network) to align the marginal distributions (marginals) or the class-conditional distributions (class-conditionals) of the domains under various metrics, e.g., MMD and the Jensen-Shannon (JS) divergence. For clarity, we first present in Table 2 an overview of the main differences between our work and the related works from the perspectives of distribution alignment, representation function, distribution discrepancy metric, and optimization. We then elaborate on the details in the following.
Early works learn a representation function to align the marginals P_1(x), …, P_n(x) under MMD or JS divergence. Muandet et al. [37] learned a projection to align the marginals and preserve the functional relationship between the input and output variables. Similarly, Ghifary et al. [17] reduced the dimensionality of the data such that the marginals are aligned, and the separability of classes and the separability of unlabeled data are also maximized. These MMD-based works place the projection matrix outside the kernel mapping to the Reproducing Kernel Hilbert Space (RKHS)¹ [41]. Consequently, the resulting optimization problems can be solved via eigenvalue decomposition. Moreover, Li et al. [28] learned a neural network to align the distributions of the coded source features under MMD, and match the aligned marginals to a prior Laplacian distribution under JS divergence, which is achieved by adversarial training. Since these works [17,28,37] assume that the posterior distribution P(y|x) is stable, the n domains P_1(x, y), …, P_n(x, y) are aligned by aligning the n marginals. However, as discussed in [30,38,52], the stability of P(y|x) is often violated in practice, e.g., in speaker recognition and object recognition, resulting in the under-alignment of domains.

Fig. 1 Illustration of our Joint-Product Representation Learning (JPRL) solution to domain generalization. Here, the network prediction model h = g ∘ φ could be a shallow network or a deep convolutional neural network, both of which are implemented in Sect. 5. We propose to perform joint-product distribution alignment in the representation space, and derive an analytic estimate L̂²(P(φ(x), y, l), P(φ(x), y)P(l)) of the L²-distance to serve as the alignment loss. We utilize the common minibatch SGD to optimize the parameters of the network, such that the classification/regression loss and the alignment loss are jointly minimized. With the network model trained, we can apply it to the inference task in the target domain.

¹ According to [2], this could be viewed as nonlinear distribution alignment under MMD with a linear kernel. Alternatively, one can also place the projection matrix inside the kernel mapping and perform linear or nonlinear distribution alignment under MMD with a Gaussian or Laplacian kernel, which is known as a universal/characteristic kernel [19].
Being aware of this point, later works align the n domains by learning a representation function to align the marginals and the class-conditionals under MMD or JS divergence. Li et al. [30] searched for a projection matrix to align a set of n class prior-normalized marginals and c sets of n class-conditionals P_1(x|y = i), …, P_n(x|y = i) for i ∈ {1, …, c} under MMD, and derived an optimization problem that is solved via eigenvalue decomposition. As an extension of [30], Conditional Invariant Deep Domain Generalization (CIDDG) [31] shares a similar distribution alignment idea, but replaces the projection matrix with a deep neural network, and the MMD with the JS divergence, for better performance. Zhao et al. [52] learned a network representation function to align a set of n marginals and n sets of c class-conditionals P_l(x|y = 1), …, P_l(x|y = c) for l ∈ {1, …, n} under the JS divergence. These works [31,52] characterize the JS divergence as the maximal value of a log loss functional; consequently, minimizing the divergence leads to an adversarial training problem. However, when the number of classes c or the number of domains n is large, it may be difficult to achieve the alignment of the domains P_1(x, y), …, P_n(x, y) via the alignments of marginals and class-conditionals [38]. Furthermore, in the regression tasks, which arise widely in real-world applications such as indoor WiFi localization [13], age estimation [7], and human pose estimation [26], it may not be feasible to align the class-conditionals, since the output variable y is continuous.
Our work learns a neural network representation function to align the joint distribution P(x, y, l) and the product distribution P(x, y)P(l) under the L²-distance. (1) We show that aligning these two distributions conveniently leads to the alignment of the n domains P_1(x, y), …, P_n(x, y) (see details in Sect. 3.2). Such joint-product distribution alignment is straightforward to achieve, since it only aligns two distributions. Moreover, it can handle the regression tasks, since it is free from aligning the class-conditionals. (2) In the neural network context, we align distributions under the L²-distance rather than the JS divergence, since the JS divergence is usually minimized by adversarial training [28,31,52], which is known to be unstable and time-consuming [38,49]. While MMD and its extensions [33,56] circumvent the drawbacks of adversarial training, to the best of our knowledge, these metrics are mainly designed and employed for the discrepancy between the marginals [17,37], the class-conditionals [22,30,56], or the joint distributions of multiple input variables [33], i.e., P(x_1, …, x_k) and Q(x_1, …, x_k). According to [19], they require the kernel function to be a universal/characteristic kernel to become proper metrics². In our work, however, it may not be trivial to formulate a proper MMD metric between the joint distribution P(x, y, l) and the product distribution P(x, y)P(l). These considerations motivate us to opt for the L²-distance, which quantifies the discrepancy between the joint distribution and the product distribution in a straightforward and intuitive manner. Importantly, we show that the L²-distance can be analytically estimated (see details in Sect. 3.3). (3) In Sect. 5.1, we conduct experiments to reinforce our proposal that joint-product distribution alignment under the L²-distance leads to effective domain alignment.
Of course, in addition to domain alignment, there are also other strategies for tackling domain generalization [35,48,54]. Notable works include, but are not limited to, the ones that are based on meta-learning [3], parameter decomposition [39], and optimization [36]. Balaji et al. [3] encoded the notion of domain generalization using a regularization function, and learned the function in a meta-learning framework. Piratla et al. [39] decomposed the parameters of a neural network into a common component which is expected to generalize to the unseen target domain, and a low-rank domain-specific component that overfits the source domains. Mansilla et al. [36] conducted gradient surgery to enhance the generalization capability of deep neural network models. In Sect. 5.2, we experimentally compare our work with some of these works for completeness.

Problem formulation
Let X be an input space and Y be an output space. In particular, Y is a discrete set of c categories for classification or a continuous space for regression. Following [17,31,37], we define the domain generalization problem as follows. A domain is a distribution P(x, y) defined on X × Y. In domain generalization, we have n (n ≥ 2) source domains P_1(x, y), …, P_n(x, y), which are reflected by the associated datasets D_1, …, D_n, and an unseen target domain P_t(x, y), whose samples are not available during training. The source and target domains are different but related. Given the n source datasets, the goal is to learn a prediction (classification or regression) model h : X → Y that performs well on the target domain.

Joint-product distribution alignment
We model h as a neural network comprising a representation function φ and a downstream classifier/regressor g, i.e., y = h(x) = g(φ(x)). Here, φ maps from the input space to the representation space Z, i.e., φ : X → Z, and g maps from the representation space to the output space, i.e., g : Z → Y. In this work, we propose to learn a representation function that aligns the source domains P_1(x, y), …, P_n(x, y) with available training data as a key solution to the domain generalization problem.
We show that the alignment of the n domains can be achieved by simply aligning two distributions. Specifically, we first introduce a domain variable l, l ∈ L = {1, …, n}, and define a joint distribution P(x, y, l) and a product distribution P(x, y)P(l) on X × Y × L. Then, we view the domains P_1(x, y), …, P_n(x, y) as P(x, y|l = 1), …, P(x, y|l = n), respectively, which is inspired by the probabilistic formulation of the multi-task learning problem [4]. From this viewpoint, the joint distribution P(x, y, l) is reflected by the dataset that pools the n source datasets together with their domain labels. Finally, we present the following proposition, showing that joint-product distribution alignment leads to domain alignment.
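Empirically, samples of the joint and of the product distribution can both be obtained from the pooled source data: the joint is reflected by (x, y) pairs with their true domain labels, while the product keeps the (x, y) pairs but draws the domain variable independently. A minimal sketch (the helper names are hypothetical, not the paper's code):

```python
import numpy as np

def pool_sources(datasets):
    """Pool n source datasets into samples of the joint P(x, y, l) by
    attaching the domain index l to every example of the l-th dataset."""
    X = np.concatenate([x for x, _ in datasets])
    Y = np.concatenate([y for _, y in datasets])
    L = np.concatenate([np.full(len(y), l) for l, (_, y) in enumerate(datasets)])
    return X, Y, L

def product_samples(X, Y, L, rng):
    """Samples reflecting the product P(x, y)P(l): keep the (x, y) pairs
    but draw the domain variable independently by permuting L."""
    return X, Y, rng.permutation(L)

# Toy usage with two small 2-d source datasets.
rng = np.random.default_rng(0)
d1 = (rng.normal(0.0, 1.0, (5, 2)), np.zeros(5))
d2 = (rng.normal(3.0, 1.0, (4, 2)), np.ones(4))
X, Y, L = pool_sources([d1, d2])
Xp, Yp, Lp = product_samples(X, Y, L, rng)
```

Permuting the domain labels is one simple way to simulate independence between (x, y) and l; the marginal of l is preserved exactly.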

Analytic estimation of L 2 -Distance
Under the representation function φ, we write the joint distribution and the product distribution as P(φ(x), y, l) and P(φ(x), y)P(l). We first introduce the L²-distance between these two distributions, and then elaborate on the distance estimation. For clarity, Fig. 2 gives an overview of the estimation in this subsection. The estimated L²-distance between P(φ(x), y, l) and P(φ(x), y)P(l) will serve as the alignment loss for learning the representation function.
The L²-distance between P(φ(x), y, l) and P(φ(x), y)P(l) is defined as the squared L²-norm of the difference between their density functions. By the Legendre-Fenchel convex duality [40], it admits the variational characterization

L² = max_r { 2 E_{P(φ(x), y, l)}[r(φ(x), y, l)] − 2 E_{P(φ(x), y)P(l)}[r(φ(x), y, l)] − ∫∫ Σ_l r(φ(x), y, l)² dφ(x) dy },   (5)

where the maximum over the variational function r is attained at r = p(φ(x), y, l) − p(φ(x), y)p(l). We design the variational function as a linear-in-parameter model

r(φ(x), y, l; θ) = Σ_{i=1}^{m} θ_i k₁(φ(x), φ(x_i)) k₂(y, y_i) k₃(l, l_i),   (6)

where {(x_i, y_i, l_i)}_{i=1}^{m} are the pooled training samples with domain labels, k₁(φ(x), φ(x_i)) = exp(−π‖φ(x) − φ(x_i)‖²/2) is a Gaussian kernel, and k₃(l, l_i) = δ(l, l_i) is a delta kernel that evaluates to 1 if l = l_i and 0 otherwise. Additionally, k₂(y, y_i) = δ(y, y_i) is a delta kernel when y is a discrete variable for classification, and k₂(y, y_i) = exp(−π‖y − y_i‖²/2) is a Gaussian kernel when y is a continuous variable for regression. We note that similar linear-in-parameter models have also been employed in prior works [4,12], but those models differ from our model (6) in terms of the input variables. Importantly, model (6) leads to an analytic estimate of the L²-distance, as we show below. Plugging the linear variational function (6) into Eq. (5), approximating the two expectations by their sample averages (with the product-distribution average taken over all pairs r(φ(x_i), y_i, l_j; θ)), and computing the integral term analytically, we obtain the unconstrained quadratic maximization problem

max_θ { (2/m) Σ_i r(φ(x_i), y_i, l_i; θ) − (2/m²) Σ_i Σ_j r(φ(x_i), y_i, l_j; θ) − θᵀHθ }   (7)

= max_θ { 2θᵀ[(1/m)P1 − (1/m²)(K1) ⊙ (S1)] − θᵀHθ }   (8)

= max_θ { 2θᵀb − θᵀHθ }.   (9)

Equation (8) rewrites the terms in Eq. (7) using matrix and vector notations, where 1 is an m-dimensional column vector of ones, P, K, S, and H are m × m symmetric matrices, and ⊙ is the Hadamard product. The (i, j)-th elements of P, K, and S are p_ij = k₁(φ(x_i), φ(x_j)) k₂(y_i, y_j) k₃(l_i, l_j), k_ij = k₁(φ(x_i), φ(x_j)) k₂(y_i, y_j), and s_ij = k₃(l_i, l_j). The matrix H collects the analytic integrals of the products of the kernels: h_ij = exp(−π‖φ(x_i) − φ(x_j)‖²/4) δ(y_i, y_j) δ(l_i, l_j) when y is a discrete variable in the classification tasks, and h_ij = exp(−π‖φ(x_i) − φ(x_j)‖²/4) exp(−π‖y_i − y_j‖²/4) δ(l_i, l_j) when y is a continuous variable in the regression tasks. Please refer to Appendix 1 for the detailed mathematical derivation of h_ij. Eq. (9) defines the vector b = (1/m)P1 − (1/m²)[(K1) ⊙ (S1)] to make the unconstrained quadratic maximization problem explicit. Solving the maximization problem analytically yields the analytic estimate of the L²-distance,

L̂²(P(φ(x), y, l), P(φ(x), y)P(l)) = bᵀ(H + εI)⁻¹b.   (10)

Note that, here, a diagonal matrix εI is added to H to ensure that the matrix is always numerically invertible in practice, where ε is a small positive value and I is the identity matrix.
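As a sanity check of this estimation pipeline, the sketch below computes the vector b, the matrix H, and the estimate bᵀ(H + εI)⁻¹b on a minibatch for the classification case, following the structure described in the text. The kernel constants and the exact vectorization are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def l2_alignment_loss(Z, Y, L, eps=1e-5):
    """Sketch of the analytic L2-distance estimate between the joint
    P(phi(x), y, l) and the product P(phi(x), y)P(l) on one minibatch.
    Z: (m, d) representations phi(x_i); Y: (m,) discrete labels;
    L: (m,) domain indices.  Classification case (delta kernel on y)."""
    m = Z.shape[0]
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)  # pairwise ||z_i - z_j||^2
    k1 = np.exp(-np.pi * sq / 2.0)                       # Gaussian kernel on phi(x)
    k2 = (Y[:, None] == Y[None, :]).astype(float)        # delta kernel on y
    k3 = (L[:, None] == L[None, :]).astype(float)        # delta kernel on l
    P = k1 * k2 * k3                                     # full kernel matrix
    K = k1 * k2                                          # (phi(x), y) part
    S = k3                                               # domain part
    one = np.ones(m)
    b = P @ one / m - (K @ one) * (S @ one) / m**2       # joint term minus product term
    H = np.exp(-np.pi * sq / 4.0) * k2 * k3              # analytic integral term
    return float(b @ np.linalg.solve(H + eps * np.eye(m), b))
```

Since H is positive semidefinite (a Hadamard product of valid kernel matrices), the estimate is always non-negative, and it grows as the domains drift apart in the representation space.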

Optimization problem
We combine the alignment loss (the estimated L²-distance) and the classification/regression loss, and present the optimization problem of the proposed JPRL solution as follows:

min_{φ, g} (1/m) Σ_{i=1}^{m} ℓ(g(φ(x_i)), y_i) + λ · L̂²(P(φ(x), y, l), P(φ(x), y)P(l)).   (11)

Here, in line with the common practice in [14,52], ℓ is the cross-entropy loss for classification or the square loss for regression, and λ (> 0) is a tradeoff parameter balancing the alignment loss and the task loss. We optimize the network prediction model h = g ∘ φ, where φ is the representation function and g is the classifier/regressor, to jointly minimize the two losses. We employ minibatch SGD to solve problem (11). In every iteration of the algorithm, a minibatch consists of n sub-minibatches respectively sampled from the n source datasets, together with the corresponding domain variables, and the objective in Eq. (11) is calculated on these sub-minibatches.
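One iteration of this scheme can be sketched as follows: assemble a minibatch from per-domain sub-minibatches, then combine the task loss and the alignment loss. The helper names and dummy losses are illustrative only, not the authors' code:

```python
import numpy as np

def compose_minibatch(datasets, batch_per_domain, rng):
    """Draw one JPRL minibatch: a sub-minibatch from each of the n source
    datasets, with the corresponding domain variables attached."""
    xs, ys, ls = [], [], []
    for l, (X, Y) in enumerate(datasets):
        idx = rng.choice(len(Y), size=batch_per_domain, replace=False)
        xs.append(X[idx])
        ys.append(Y[idx])
        ls.append(np.full(batch_per_domain, l))
    return np.concatenate(xs), np.concatenate(ys), np.concatenate(ls)

def jprl_objective(task_loss, alignment_loss, lam):
    """Objective of the form in problem (11): task loss + lam * alignment loss."""
    return task_loss + lam * alignment_loss

# Toy usage with two source datasets of 10 examples each.
rng = np.random.default_rng(0)
data = [(rng.normal(size=(10, 3)), rng.integers(0, 2, 10)) for _ in range(2)]
xb, yb, lb = compose_minibatch(data, batch_per_domain=4, rng=rng)
```

In an actual implementation, `task_loss` and `alignment_loss` would be computed on (xb, yb, lb) by the network and backpropagated jointly by minibatch SGD.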

Discussion on the assumption
In domain generalization, since data from the target domain are not available during training, one must assume the existence of a certain relationship between the source and target domains, and exploit this relationship to improve the generalization performance of the prediction model [28,31,37-39,52]. For instance, in the work of Piratla et al. [39], the authors assumed that the source and target domains share some ''stable'' features whose relationship with the output variable is invariant across domains, and the goal is to learn those features.
In our work, the fundamental assumption is that the source and target domains P_1(x, y), …, P_n(x, y), P_t(x, y) can be related by a representation function that makes them similar. Importantly, this representation function can, to a certain extent, be captured and learned by aligning the multiple source domains with available training data. As such, in the learned representation space, the source (training) and target (test) data follow similar distributions, which approximates the independent and identically distributed (i.i.d.) supervised learning setting. Hence, the source-trained classifier/regressor can generalize to the target domain. In Sect. 5.1, we provide synthetic examples to further explain this point.
As noted by Zhang et al. [50], if the target domain P_t(x, y) changes arbitrarily, the available source data would be of no use for making predictions in the target domain. Under such circumstances, domain generalization methods (the domain alignment ones [28,31,37,38,52], including ours, and the others [36,39,47,48]) may not succeed in learning prediction models that generalize to the target domain.

Experiments on synthetic datasets
We evaluate JPRL on synthetic datasets to verify its effectiveness in aligning domains. For simplicity and clear visualization, we construct two source domains P_1(x, y), P_2(x, y) and a target domain P_t(x, y). We write the vector x = (x₁, x₂, x₃, x₄)ᵀ and use N(x; μ, Σ) to denote a multivariate Gaussian distribution with mean vector μ and covariance matrix Σ. Similarly, N(y; μ(x), σ²) denotes a Gaussian distribution with mean μ(x) and variance σ². We implement JPRL with a one-Hidden-Layer Neural Network (1HLNN).
Classification. We consider the case where y ∈ {−1, +1} is a discrete variable for classification. We define the source and target domains in Table 3 and their corresponding parameters in Table 4. The domains are constructed to be different but related by a projection matrix

W = [1 0 0 0; 0 1 0 0],

such that P_1(Wx, y) ≈ P_2(Wx, y) ≈ P_t(Wx, y). We first draw 200 samples from each of P_1(x, y), P_2(x, y), and P_t(x, y), and then train our JPRL network on the 400 source samples. For comparison, we also train a vanilla neural network on the same data without domain alignment and use it as a Baseline. Figure 3 plots the source and target data in the representation spaces of the Baseline network and the JPRL network. Comparing Fig. 3a against Fig. 3b, we observe that JPRL aligns the source data much better than the Baseline. Importantly, while the target data do not participate in training the network, the learned representation function of JPRL generalizes well to the target data and aligns them with the source data. Consequently, JPRL performs well in the target domain and yields a small classification error of 0.04, which is superior to the error of 0.31 from the Baseline. Clearly, this experiment verifies the effectiveness of JPRL in domain alignment and in target domain classification.
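A construction of this kind can be sketched as follows. The means below are hypothetical stand-ins (the exact values of Tables 3 and 4 are not reproduced here); the point is that the first two coordinates, selected by W, carry the shared class structure, while the last two coordinates are domain-specific:

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.]])   # the relating projection from the text

def sample_domain(mean_pos, mean_neg, n, rng):
    """Draw n labelled points from one synthetic 4-d Gaussian domain,
    half from the positive class and half from the negative class."""
    half = n // 2
    xp = rng.multivariate_normal(mean_pos, np.eye(4), half)
    xn = rng.multivariate_normal(mean_neg, np.eye(4), half)
    X = np.vstack([xp, xn])
    y = np.hstack([np.ones(half), -np.ones(half)])
    return X, y

# Two source domains and a target: shared class means in the first two
# coordinates, different offsets in the last two (illustrative values).
X1, y1 = sample_domain([1, 1, 3, 3], [-1, -1, 3, 3], 200, rng)
X2, y2 = sample_domain([1, 1, -3, 0], [-1, -1, -3, 0], 200, rng)
Xt, yt = sample_domain([1, 1, 0, -3], [-1, -1, 0, -3], 200, rng)
```

Projecting any of the three domains through W recovers (approximately) the same labelled two-dimensional distribution, which is the relatedness the representation function is expected to discover.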
Regression. We consider the case where y is a continuous variable for regression. We define the source and target domains in Table 3 and their corresponding parameters in Table 4. Figure 4 visualizes the source and target data in the representation spaces of the two networks. From Fig. 4a, we observe that JPRL well aligns the source and target domains in the representation space, and obtains a low target regression error of 0.09. These results are significantly better than the ones in Fig. 4b for the Baseline. Evidently, this experiment confirms the effectiveness of JPRL in domain alignment and in target domain regression.

Experiments on real-world datasets
In domain generalization, there exist two settings for conducting experiments on real-world datasets: one commonly practiced in [47,48,52,53], and the other one introduced by Gulrajani and Lopez-Paz [20]. We conduct the experiments following the former setting, so that we can quote the available results reported by the authors themselves.

Datasets
We evaluate JPRL on classification and regression datasets. The classification datasets include the text dataset Amazon Reviews [8] and two image datasets, PACS [27] and Office-Home [45]; among them, the number of classes varies from 2 to 65. The regression datasets include the WiFi dataset UJIIndoorLoc [43] and the image dataset UTKFace [51]. We use these datasets in the experiments, since they are utilized in prior works [10,13,48,52]. Importantly, we aim to verify that JPRL can (1) perform well on several types of data (text, image, and WiFi), (2) scale and perform well in classification tasks with both few and many classes, and (3) handle domain generalization in regression. We summarize the dataset information in Table 5. For UTKFace, see Fig. 5c for example images; the task is regression with the age variable.

Evaluation protocol
In line with the leave-one-domain-out evaluation protocol [47,52], we employ the neural network classification or regression model trained on the source datasets to predict the labels of samples from the target set. The performance of a classification model is measured by the classification accuracy (%) following [52], and the performance of a regression model is measured by the sum of Mean Absolute Error (MAE) following [14]. On every task, we follow [47,52] and repeat the experiments 5 times with different random seeds to report the average result.
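The leave-one-domain-out protocol enumerates one split per domain: each domain serves once as the unseen target, with the remaining domains as sources. A minimal sketch (the helper name is illustrative):

```python
def leave_one_domain_out(domains):
    """Enumerate leave-one-domain-out splits: for each index t, yield
    (sources, target) where the t-th domain is the unseen target and
    the remaining domains are the sources."""
    for t in range(len(domains)):
        sources = [d for i, d in enumerate(domains) if i != t]
        yield sources, domains[t]

# Example: the four PACS-style domains produce four evaluation tasks.
for sources, target in leave_one_domain_out(["A", "B", "C", "D"]):
    print(target, "<-", sources)
```

Each (sources, target) pair defines one task; the reported numbers are then averaged over the repeated runs of each task.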

Implementation details
We implement JPRL with both shallow and deep networks (backbones), and present in Table 6 an overview of the backbone configuration for the datasets. To be specific, on the text dataset Amazon Reviews with the 5000-dimensional pre-processed features [8], following [3,16], the backbone is a 1HLNN, which has 100 hidden neurons with the sigmoid activation, and 2 output neurons (i.e., the number of classes in Amazon Reviews) with the softmax transformation. On each of the 2 RGB image datasets PACS and Office-Home, we follow the practice in previous works [47,48,52] and adopt the ImageNet-pretrained ResNet backbones listed in Table 6.

We preprocess and split the training data as follows. On Amazon Reviews and UJIIndoorLoc, we utilize the common z-score standardization to preprocess the features. On PACS, Office-Home, and UTKFace, we follow the standard practice in [47] and process the RGB images via random resized cropping, horizontal flipping, and color jittering. Moreover, for the regression datasets UJIIndoorLoc and UTKFace, we follow the work of Chen et al. [14] and normalize the regression labels to [0, 1] to eliminate the effects of diverse scales in the regression variables. Regarding the data splitting protocol, we follow the general practice in prior works [47,48,52] and use 90% of the available data for training and 10% for validation.
For training the networks, JPRL with its shallow implementations on Amazon Reviews and UJIIndoorLoc is trained from scratch by minibatch SGD with a momentum of 0.9 and a learning rate of 10⁻³. The tradeoff parameter λ is selected from the range {10⁻³, 10⁻¹, …, 10³} using the validation data. Furthermore, JPRL with its deep implementations on PACS, Office-Home, and UTKFace is trained from the ImageNet-pretrained models. The optimizer is still minibatch SGD, and the learning rate is initially set to 10⁻³ and shrunk to 10⁻⁴ after 50 iterations. This time, the tradeoff parameter λ is not selected through a grid search, since the corresponding procedure would be computationally costly. Instead, following [12,16], we gradually change λ from 0 to 1 by a progressive schedule, where the training progress t changes linearly from 0 to 1.
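A common form of such a progressive schedule, used in the works cited here ([12,16]), is a shifted sigmoid in the training progress; the sketch below assumes that form and the usual steepness constant 10 (both are assumptions, since the text's formula is not reproduced here):

```python
import math

def lambda_schedule(t, gamma=10.0):
    """Progressive tradeoff schedule in the style of [12,16]: lambda grows
    smoothly from 0 to (almost) 1 as training progress t goes from 0 to 1.
    gamma controls the steepness (value assumed, not from the text)."""
    return 2.0 / (1.0 + math.exp(-gamma * t)) - 1.0
```

Early in training (t ≈ 0) the alignment loss is effectively switched off, letting the task loss shape the representation first; near the end (t ≈ 1) the alignment loss receives nearly full weight.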

Results
We report in Table 7 the classification results on Amazon Reviews, in Tables 8 and 9 the results on PACS, and in Tables 10 and 11 the results on Office-Home. In addition, we report in Table 12 the regression results on UJIIndoorLoc and in Table 13 the regression results on UTKFace. In every table, the names of the source domains are omitted under the leave-one-domain-out evaluation protocol. For every column in the table, the best result is highlighted in bold.
Classification. We quote the available results of the comparison methods in Table 7 from [3], the results in Tables 8 and 9 from [39,47,48,52], and the results in Tables 10 and 11 from [47,48], since our experimental settings coincide with these works. To compare with the relevant Alignment methods CIDDG and DGER, we use their source codes, follow their hyperparameter tuning protocols, and produce their results on datasets Amazon Reviews and Office-Home. Besides, we also produce the results of Other methods, i.e, CSD and DGGS, on Amazon Reviews. Note that, with the cited and produced results, we aim for a comprehensive comparison of JPRL with the Alignment methods and Other recent methods on different types of data: text (Amazon Reviews) and image (PACS, Office-Home), and in tasks with a few classes (Amazon Reviews, PACS) and many classes (Office-Home).
In Table 7 for the text dataset Amazon Reviews, JPRL with the 1HLNN backbone consistently outperforms its competitors on all 4 tasks, and yields the highest average classification accuracy of 82.26%. This outperformance verifies that, compared to the alignment of marginals and class-conditionals (CIDDG, DGER) or other strategies (MetaReg, CSD, DGGS), our joint-product distribution alignment under the L²-distance would be more effective in domain generalization for text classification. In Tables 8 and 9 for the image dataset PACS, JPRL with the ResNet18 and ResNet50 backbones is among the top-performing methods. It significantly outperforms the Alignment methods MMD-AAE and DGER on the first 2 tasks in Table 8, and achieves better average accuracy than the very recent methods MixStyle, FACT, and ATSRL in both tables. These results demonstrate that JPRL achieves generally preferable performance regardless of the backbone choice. In Tables 10 and 11 for another image dataset, Office-Home, the results show that JPRL would be preferable to the comparison methods considered. It is worth mentioning that CIDDG and DGER align distributions in a class-wise manner using adversarial training, which might not be easy to achieve with many classes (65 in Office-Home). By contrast, our JPRL learns a representation function to align 2 distributions, the joint distribution and the product distribution, which conveniently leads to the alignment of multiple domains simultaneously. Moreover, our distribution alignment is conducted by solving a simple minimization problem, rather than a challenging adversarial problem. In summary, the results in Tables 8, 9, 10 and 11 show that in domain generalization for image classification, the proposed JPRL solution (1) would be more advantageous than the relevant Alignment competitors, and (2) can also yield better or comparable results to the Others.
Regression. Since regression results of the domain generalization methods are not available, we use the source codes of the Alignment method MMD-AAE and the Other methods (MetaReg, CSD, and DGGS) to produce their regression results on the UJIIndoorLoc and UTKFace datasets. The results are produced by replacing the original classification loss with the regression loss (square loss) and following the hyperparameter tuning protocols of the methods. We do not include the relevant CIDDG and DGER as comparison methods, since their domain generalization losses (not the classification loss) are derived by assuming that the label y is a discrete variable for classification. Similar to classification, we aim for a comprehensive comparison of JPRL with the Alignment and Other methods on different types of data, WiFi (UJIIndoorLoc) and image (UTKFace), and in tasks with one regression variable (UTKFace) and multiple regression variables (UJIIndoorLoc). In Table 12 for the WiFi dataset UJIIndoorLoc and Table 13 for the image dataset UTKFace, JPRL with the 1HLNN and ResNet50 backbones obtains a lower sum of Mean Absolute Errors (MAE) than its comparison methods on 7 out of the 8 tasks. This verifies that in the less studied problem of domain generalization for regression, our JPRL is more effective in handling different regression tasks. Furthermore, we also evaluate our solution on the Cooling Capacity dataset [25] from the field of energy conversion and management. We split the dataset into 3 subsets (domains), D500, D600, and D900, according to the cycle time, and the task is to regress the cooling capacity variable. We follow the work of Krzywanski et al. [25] and implement our JPRL with a two-Hidden-Layer Neural Network (2HLNN). Table 14 reports the regression results: JPRL consistently outperforms its comparison methods, which again demonstrates the effectiveness of our solution.

Statistical test
To be strict in a statistical sense, we further conduct a statistical test to check whether our solution is significantly better than the others in the classification tasks. We conduct the Wilcoxon signed-ranks test [9,15] based on the classification results from Tables 7, 8, 9, 10 and 11. The test uses a statistic z to compare the performance of two methods over multiple tasks; in each task, the classification accuracy is adopted as the performance measure. We fix JPRL as the control method, and conduct 8 pairs of tests: CIDDG versus JPRL, DGER versus JPRL, MetaReg versus JPRL, DDAIG versus JPRL, MixStyle versus JPRL, ATSRL versus JPRL, RSC versus JPRL, and FACT versus JPRL. The detailed description of the test procedure is presented in Appendix C, and the resulting 8 z values are reported in Table 15. We observe from Table 15 that the z values for these 8 pairs are all below the critical value of -1.96. According to [9,15], this indicates that with a significance level of 0.05, JPRL is statistically better than its comparison methods.

Experimental analysis
The main experimental results above have demonstrated the advantage of our JPRL solution for domain generalization in classification and regression. In the following, we conduct several further experiments to analyze JPRL. As comparison methods, we include Domain Adaptation Networks (DAN) [32], Joint Adaptation Networks (JAN) [33], and Deep Subdomain Adaptation Network (DSAN) [56], which are popular MMD-based methods. Since these methods are originally designed for domain adaptation, we slightly modify them so that their inputs contain data from multiple source domains, the same as the inputs of the domain generalization methods.

Feature visualization
We visualize the domain alignment ability of JPRL on a real-world dataset, comparing against the adversarial methods CIDDG and DGER and the MMD-based method DSAN. We use Office-Home as the experimental dataset, which has many classes in each domain; this makes it challenging for domain alignment and hence a good testbed for the ability of JPRL. In particular, we take Product as the target and the remaining 3 as the source domains, and plot in Fig. 6 the t-SNE [34] embeddings of the domain data, obtained from the representation spaces of the above methods with the ResNet18 backbone. Comparing Fig. 6d against Fig. 6a, b, c, we observe that JPRL aligns the source and target domains in the representation space better than its comparison methods CIDDG, DGER, and DSAN. This suggests that by joint-product distribution alignment under the L²-distance, JPRL better reduces the discrepancy among domains, approximates the i.i.d. supervised learning setting to a certain extent, and eventually leads to superior target classification results.
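The visualization step described above can be sketched as follows, using scikit-learn's t-SNE implementation. The helper `embed_domains` and its input layout (a mapping from domain name to a matrix of penultimate-layer features) are illustrative assumptions, not the paper's code:

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_domains(features_by_domain, seed=0):
    """Project per-domain representations to 2-D with t-SNE.

    features_by_domain: dict mapping a domain name (e.g. "Product")
    to an (n_i, d) array of network representations.
    Returns the stacked (sum n_i, 2) embedding and a matching list of
    domain labels, ready for a scatter plot colored by domain.
    """
    names, blocks = zip(*features_by_domain.items())
    X = np.vstack(blocks)
    labels = [name for name, block in zip(names, blocks)
              for _ in range(len(block))]
    # Keep perplexity valid for small sample sizes (must be < n_samples).
    perp = min(30, max(2, len(X) // 4))
    emb = TSNE(n_components=2, perplexity=perp,
               random_state=seed).fit_transform(X)
    return emb, labels
```

Well-aligned domains would then appear as overlapping point clouds in the 2-D scatter plot, as in Fig. 6d.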

Regression loss
According to a recent survey by Lathuilière et al. [26], besides the square loss, the L¹ loss and the Huber loss are also possible regression losses. Here, we investigate whether the advantage of JPRL over the comparison methods depends on the choice of regression loss. To this end, we utilize all three losses, and run JPRL and its comparison methods DGGS and MMD-AAE on two tasks from UJIIndoorLoc and UTKFace. The target regression results on the two tasks are reported in Fig. 7a, b, where horizontal dashed lines are plotted for easy comparison. Figure 7a shows that with different losses, JPRL consistently achieves lower target error than DGGS and MMD-AAE. For JPRL itself, the performance varies with the loss: the best performance is attained with the L¹ loss, while the other two losses lead to similar performance. From Fig. 7b, we again observe that JPRL yields better results than its competitors regardless of the regression loss. From this evidence, we conclude that while the performance of JPRL varies with the regression loss, its advantage over the comparison methods remains constant, which shows that our joint-product representation learning component plays an important role.
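For concreteness, the three candidate losses can be written down as follows. This is a minimal NumPy sketch over residuals r = prediction − target; the Huber threshold δ = 1 is an illustrative choice, not a value taken from the paper:

```python
import numpy as np

def square_loss(r):
    """Mean squared residual; penalizes large errors quadratically."""
    return np.mean(r ** 2)

def l1_loss(r):
    """Mean absolute residual; more robust to outliers."""
    return np.mean(np.abs(r))

def huber_loss(r, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond: a compromise
    between the square loss and the L1 loss."""
    a = np.abs(r)
    return np.mean(np.where(a <= delta,
                            0.5 * r ** 2,
                            delta * (a - 0.5 * delta)))
```

Swapping one of these functions in as the downstream loss (while keeping the joint-product alignment term fixed) is exactly the experiment reported in Fig. 7.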

Comparison with MMD-based methods
We compare JPRL with DAN, JAN, and DSAN, which are also free from adversarial training. We run the comparison experiments on Office-Home and equip all methods with the ResNet18 backbone. The target classification results are reported in Table 16. We observe that JPRL clearly outperforms DAN, JAN, and DSAN. This shows that in domain generalization, joint-product distribution alignment under the L²-distance has an advantage over the alignment strategies under the MMD distance implemented by the comparison methods.
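For reference, the squared MMD underlying DAN/JAN/DSAN-style alignment can be estimated from samples as below. This is a simplified single-Gaussian-kernel, biased estimator; the cited methods use multi-kernel and (sub)class-wise variants on deep features:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of the squared MMD between samples X ~ p, Y ~ q:
    mean k(x, x') + mean k(y, y') - 2 mean k(x, y)."""
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2.0 * gaussian_kernel(X, Y, sigma).mean())
```

Whereas these methods minimize such kernel-based distances between pairs of domains, JPRL instead minimizes a single L²-distance between the joint and product distributions, which aligns all domains at once.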

Conclusion
In this work, we study the domain generalization problem and propose the JPRL solution, which generalizes a source-trained network prediction model to the target domain. Our solution works by (1) aligning two probability distributions in the network representation space, the joint distribution and the product distribution, and (2) minimizing the downstream classification or regression loss. In particular, we align the two distributions under the L²-distance, and estimate this distance by analytically solving an unconstrained quadratic maximization problem, which leads to an explicit estimate. We implement our solution with both shallow and deep network architectures, and experimentally demonstrate its effectiveness on synthetic and real-world datasets for classification and regression. The limitation of our solution is that when the domains are very different from each other, it could be difficult to align the joint distribution and the product distribution well in the network representation space; as a result, our solution may not bring much performance improvement for the network prediction model. In the future, we plan to strengthen our solution to handle such cases by incorporating other complementary domain generalization strategies (e.g., [36,53]). Furthermore, we also plan to explore the related multi-source domain adaptation problem [46] and extend the current solution to address it.
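As a brief recap of the estimation step, the variational characterization of the L²-distance can be sketched as follows. This is the standard argument under a linear variational function; the paper's exact parameterization and estimator may differ in details:

```latex
\begin{align*}
D(p,q) &= \int \bigl(p(t)-q(t)\bigr)^{2}\,dt
  = \max_{f}\Bigl[\,2\int f(t)\bigl(p(t)-q(t)\bigr)\,dt-\int f(t)^{2}\,dt\Bigr],
\end{align*}
% since pointwise maximization of 2f(p-q) - f^2 gives f^{*} = p - q
% with value (p-q)^2. Restricting f to the linear form
% f(t) = w^{\top}\psi(t) for fixed features \psi yields the
% unconstrained quadratic maximization problem
\begin{align*}
\max_{w}\; 2\,w^{\top}(\mu_{p}-\mu_{q}) - w^{\top}G\,w,
  \qquad \mu_{p}=\mathbb{E}_{p}[\psi],\quad
  \mu_{q}=\mathbb{E}_{q}[\psi],\quad
  G=\int\psi(t)\psi(t)^{\top}dt,
\end{align*}
% whose closed-form maximizer w^{*} = G^{-1}(\mu_{p}-\mu_{q}) gives the
% explicit (lower-bound) estimate
% (\mu_{p}-\mu_{q})^{\top}G^{-1}(\mu_{p}-\mu_{q}).
```

Replacing the expectations with sample averages over the representations then gives an analytic estimate of the distance that can be minimized directly with respect to the representation function.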

Appendix C
Here, we give a detailed description of the statistical test in Sect. 5.2.6, namely the Wilcoxon signed-ranks test [9,15] based on the results from Tables 7, 8, 9, 10 and 11. The test compares the performance of two methods over multiple tasks; in each task, the classification accuracy is adopted as the performance measure. We fix JPRL as the control method, and conduct 8 pairs of tests: CIDDG versus JPRL, DGER versus JPRL, MetaReg versus JPRL, DDAIG versus JPRL, MixStyle versus JPRL, ATSRL versus JPRL, RSC versus JPRL, and FACT versus JPRL. To run the test, we rank the differences in performance of the two methods over the N tasks according to their absolute values: the smallest absolute value gets rank 1, the second smallest gets rank 2, and so on, with average ranks assigned in case of ties. The statistic of the Wilcoxon signed-ranks test is

$$z(a,b) = \frac{T(a,b) - \frac{1}{4}N(N+1)}{\sqrt{\frac{1}{24}N(N+1)(2N+1)}}, \quad (C.1)$$

where $T(a,b) = \min\{R^{+}(a,b),\, R^{-}(a,b)\}$. $R^{+}(a,b)$ is the sum of ranks for the tasks on which method b outperforms method a, and $R^{-}(a,b)$ is the sum of ranks for the opposite. They are defined as follows:

$$R^{+}(a,b) = \sum_{\mathrm{diff}_{i} > 0} \mathrm{rank}(\mathrm{diff}_{i}) + \frac{1}{2}\sum_{\mathrm{diff}_{i} = 0} \mathrm{rank}(\mathrm{diff}_{i}), \quad (C.2)$$

$$R^{-}(a,b) = \sum_{\mathrm{diff}_{i} < 0} \mathrm{rank}(\mathrm{diff}_{i}) + \frac{1}{2}\sum_{\mathrm{diff}_{i} = 0} \mathrm{rank}(\mathrm{diff}_{i}), \quad (C.3)$$

where $\mathrm{diff}_{i}$ is the difference between the accuracies of methods b and a on the i-th task out of N tasks, and $\mathrm{rank}(\mathrm{diff}_{i})$ is the rank of $|\mathrm{diff}_{i}|$. We fix b as JPRL, and let a vary from CIDDG to FACT in turn. Based on formulas (C.1)-(C.3), we compute z(a, b) for the 8 pairs of tests.
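As an illustration, the test procedure described by formulas (C.1)-(C.3) can be sketched in Python. This is a hypothetical minimal implementation, not the authors' code, and the accuracy vectors in the usage note are placeholders:

```python
import numpy as np

def _avg_ranks(values):
    """Ranks starting at 1, with average ranks assigned to ties."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    sorted_vals = values[order]
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i : j + 1]] = (i + j) / 2 + 1  # average of ranks i+1..j+1
        i = j + 1
    return ranks

def wilcoxon_z(acc_a, acc_b):
    """Wilcoxon signed-ranks statistic z(a, b): rank |diff_i|, split
    zero differences evenly between R+ and R-, take T = min(R+, R-)."""
    diff = np.asarray(acc_b, dtype=float) - np.asarray(acc_a, dtype=float)
    ranks = _avg_ranks(np.abs(diff))
    r_plus = ranks[diff > 0].sum() + 0.5 * ranks[diff == 0].sum()
    r_minus = ranks[diff < 0].sum() + 0.5 * ranks[diff == 0].sum()
    t = min(r_plus, r_minus)
    n = len(diff)
    return (t - n * (n + 1) / 4) / np.sqrt(n * (n + 1) * (2 * n + 1) / 24)
```

A z value below -1.96 then rejects, at the 0.05 significance level, the hypothesis that the two methods perform equally well.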

Declarations
Conflict of Interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.