1 Introduction

Supervised learning models (e.g., classification and regression models) with appropriately learned parameters can generalize well to the test data, under the assumption that both the training and test data are governed by the same domain \(P(\varvec{x},y)\), where \(\varvec{x}\) and y represent the input and output variables [44]. While this is a reasonable assumption to make, it is often violated in practical applications. In computer vision, the training and test images can be acquired under different imaging conditions (e.g., background and illumination), and thus represent different probability distributions [31]. In indoor WiFi localization, the training and test data often follow different distributions, as they are collected at different time periods [18] or from different places [13]. Under such circumstances, supervised learning models trained by merely following the Empirical Risk Minimization (ERM) principle [44] may perform sub-optimally and fail to make accurate predictions for the test data [38].

As an important problem in machine learning and computer vision, domain generalization [5, 6] is exactly concerned with such a non-identically-distributed supervised learning scenario. In this problem, the training data consist of n (\(n \ge 2\)) datasets respectively drawn from n source domains \(P^{1}(\varvec{x}, y),\) \(\cdots , P^{n}(\varvec{x}, y)\), while the test data are sampled from an unseen target domain \(P^{t}(\varvec{x}, y)\). The source and target domains are different but related [17, 28, 31, 37], and the goal of domain generalization is to train a prediction (classification or regression) model on the collection of the n source datasets and generalize it to the target domain. In the following, we use mathematical notations to describe the domain generalization works. For clarity and readability, we first give an overview of these notations in Table 1.

Table 1 Notations and their descriptions

Prior works [1, 17, 28, 30, 38] aim to learn a representation function (i.e., a projection matrix or a neural network) to align the n source domains \(P^{1}(\varvec{x}, y), \cdots , P^{n}(\varvec{x}, y)\) as a key solution to the problem, and train a classifier/regressor in the representation space. Then, the prediction model containing the representation function and the classifier/regressor is expected to generalize well to the unseen target domain \(P^{t}(\varvec{x}, y)\) [30, 37, 52]. Specifically, since a domain \(P(\varvec{x},y)\) can be decomposed into \(P(\varvec{x},y)=P(\varvec{x})P(y\vert \varvec{x})\), early works [17, 28, 37] align the n domains via learning a representation function to align the marginal distributions (marginals) \(P^{1}(\varvec{x}),\) \(\cdots , P^{n}(\varvec{x})\), assuming that the posterior distribution \(P(y \vert \varvec{x})\) is stable. However, as noted in several works [30, 38, 52], the stability of \(P(y\vert \varvec{x})\) is often violated in practice. Therefore, later works [30, 31, 52] propose to align the n domains \(P^{1}(\varvec{x}, y), \cdots , P^{n}(\varvec{x}, y)\) in other manners. They learn a representation function to align (1) a set of n marginals and c sets of n class-conditional distributions (class-conditionals) \(P^{1}(\varvec{x}\vert y=i),\) \(\cdots , P^{n}(\varvec{x} \vert y=i)\) for \(i \in \{1, \cdots , c\}\), or (2) a set of n marginals and n sets of c class-conditionals \(P^{l}(\varvec{x} \vert y=1), \cdots , P^{l}(\varvec{x} \vert y=c)\) for \(l \in \{1, \cdots , n\}\), where c is the number of classes when y is a discrete variable for classification. However, these works bear the heavy burden of aligning multiple sets of marginals and class-conditionals, with each set containing multiple distributions. As noted in [38], such alignment may be difficult to achieve when the number of classes c or the number of domains n is large.
Moreover, in the regression tasks, which arise widely in various real-world applications [7, 13, 26], aligning the class-conditionals in these works may not even be feasible, since the output variable y is continuous in regression.

Fig. 1

Illustration of our Joint-Product Representation Learning (JPRL) solution to domain generalization. Here, the network prediction model \(h=g \circ \phi\) could be a shallow network or a deep convolutional neural network, both of which are implemented in Sect. 5. We propose to perform joint-product distribution alignment in the representation space, and derive an analytic estimate \(\widehat{L^{2}}(P(\phi (\varvec{x}),y,l),P(\phi (\varvec{x}),y)P(l))\) of the \(L^{2}\)-distance to serve as the alignment loss. We use standard minibatch SGD to optimize the parameters of the network, such that the classification/regression loss and the alignment loss are jointly minimized. With the network model trained, we apply it to the inference task in the target domain

In this work, we propose to learn a neural network representation function that aims at aligning the n domains \(P^{1}(\varvec{x}, y),\) \(\cdots , P^{n}(\varvec{x}, y)\) in a different way. To be specific, we first introduce a domain variable l, \(l \in \{1, \cdots , n\}\), and define a joint distribution \(P(\varvec{x},y,l)\) and a product distribution \(P(\varvec{x},y)P(l)\). We then respectively view domains \(P^{1}(\varvec{x},y), \cdots , P^{n}(\varvec{x},y)\) as \(P(\varvec{x},y \vert l=1), \cdots , P(\varvec{x},y \vert l=n)\). Our idea is to learn the network representation function \(\phi\), such that the joint distribution and the product distribution are aligned in the representation space, i.e., \(P(\phi (\varvec{x}),y,l) = P(\phi (\varvec{x}),y)P(l)\). We show through a proposition that such joint-product distribution alignment leads to the alignment of the n domains, i.e., \(P(\phi (\varvec{x}),y \vert l=1) = \cdots = P(\phi (\varvec{x}),y \vert l=n)\). Therefore, the problem of aligning multiple domains \(P^{1}(\varvec{x},y), \cdots , P^{n}(\varvec{x},y)\) is conveniently transformed into the problem of aligning two distributions: the joint distribution \(P(\varvec{x},y,l)\) and the product distribution \(P(\varvec{x},y)P(l)\). The benefits of our alignment proposal are twofold. (1) Our proposal only needs to align two distributions regardless of the number of classes c or the number of domains n, which is straightforward and easy to achieve. (2) Our proposal naturally applies to the regression tasks, since it does not rely on aligning the class-conditionals.

To be more specific, we align \(P(\varvec{x},y,l)\) and \(P(\varvec{x},y)P(l)\) under the \(L^{2}\)-distance. This distance, as we will show, can be analytically estimated, and is better suited to our case than the Maximum Mean Discrepancy (MMD) [19]. In the next section, we include a more detailed discussion to justify our motivation for employing this distance. To estimate the \(L^{2}\)-distance, we first exploit the Legendre-Fenchel convex duality [40] to obtain its variational characterization, i.e., the maximal value of a quadratic functional with respect to a variational function. Subsequently, we design the variational function as a linear-in-parameter model, and derive an analytic estimate for the \(L^{2}\)-distance. As a result, our joint-product distribution alignment can be readily conducted by learning the representation function that minimizes the estimated \(L^{2}\)-distance between the joint distribution and the product distribution. In the representation space, we train a downstream classifier/regressor for the inference task in the target domain. Both the representation function and the classifier/regressor are optimized via the minibatch Stochastic Gradient Descent (SGD) algorithm. See Fig. 1 for an illustration of our solution, which is denoted as JPRL for “Joint-Product Representation Learning” in the remainder of the paper. We demonstrate the effectiveness of our solution through comprehensive experiments on synthetic and real-world datasets for classification and regression.

This paper is structured as follows. Section 2 reviews the related works. Section 3 introduces the JPRL solution. Section 4 discusses the assumption behind the solution. Section 5 describes the datasets and the experimental settings and reports the experimental results. Section 6 presents the conclusion.

2 Related work

The study of domain generalization can be traced back to the early works of Blanchard et al. [5] and Khosla et al. [24]. Since then, many strategies have been proposed to tackle the problem.

Domain alignment is a popular strategy for domain generalization, which, to a certain extent, is inspired by the domain adaptation works [11, 12, 16, 29, 55]. Here, we focus on discussing the domain alignment works [1, 28, 31, 37, 38, 52], since they are most relevant to our solution. In essence, most of these works learn a representation function (i.e., a projection matrix or a neural network) to align the marginal distributions (marginals) or the class-conditional distributions (class-conditionals) of the domains under various metrics, e.g., MMD and the Jensen-Shannon (JS) divergence. For clarity, we first present in Table 2 an overview of the main differences between our work and the related works from the perspectives of distribution alignment, representation function, distribution discrepancy metric, and optimization. We then elaborate on the details in the following.

Table 2 Overview of the main differences between our work and other relevant works. To present the text clearly, we abbreviate “class-conditionals”, “representation”, and “decomposition” to “class-cond.”, “repr.”, and “decomp.”, respectively

Early works learn a representation function to align the marginals \(P^{1}(\varvec{x}), \cdots , P^{n}(\varvec{x})\) under MMD or the JS divergence. Muandet et al. [37] learned a projection matrix to align the marginals while preserving the functional relationship between the input and output variables. Similarly, Ghifary et al. [17] reduced the dimensionality of the data such that the marginals are aligned, and the separability of classes and the separability of unlabeled data are also maximized. These MMD-based works place the projection matrix outside the kernel mapping to the Reproducing Kernel Hilbert Space (RKHS) [41]. Consequently, the resulting optimization problems can be solved via eigenvalue decomposition. Moreover, Li et al. [28] learned a neural network to align the distributions of the coded source features under MMD, and matched the aligned marginals to a prior Laplacian distribution under the JS divergence, which is achieved by adversarial training. Since these works [17, 28, 37] assume that the posterior distribution \(P(y \vert \varvec{x})\) is stable, the n domains \(P^{1}(\varvec{x},y), \cdots , P^{n}(\varvec{x},y)\) are aligned by aligning the n marginals. However, as discussed in [30, 38, 52], the stability of \(P(y \vert \varvec{x})\) is often violated in practice, e.g., in speaker recognition and object recognition, resulting in the under-alignment of domains.

Aware of this issue, later works align the n domains via learning a representation function to align the marginals and the class-conditionals under MMD or the JS divergence. Li et al. [30] learned a projection matrix to align a set of n class prior-normalized marginals and c sets of n class-conditionals \(P^{1}(\varvec{x} \vert y=i), \cdots , P^{n}(\varvec{x} \vert y=i)\) for \(i \in \{1, \cdots , c\}\) under MMD, and derived an optimization problem that is solved via eigenvalue decomposition. As an extension of [30], Conditional Invariant Deep Domain Generalization (CIDDG) [31] shares a similar distribution alignment idea, but replaces the projection matrix with a deep neural network, and the MMD with the JS divergence, for better performance. Zhao et al. [52] learned a network representation function to align a set of n marginals and n sets of c class-conditionals \(P^{l}(\varvec{x} \vert y=1), \cdots ,\) \(P^{l}(\varvec{x} \vert y=c)\) for \(l \in \{1, \cdots , n\}\) under the JS divergence. These works [31, 52] characterize the JS divergence as the maximal value of a \(\text{log}\) loss functional. Consequently, minimizing the divergence leads to an adversarial training problem. However, when the number of classes c or the number of domains n is large, it may be difficult to achieve the alignment of the domains \(P^{1}(\varvec{x},y), \cdots , P^{n}(\varvec{x},y)\) via the alignments of marginals and class-conditionals [38]. Furthermore, in the regression tasks, which arise widely in real-world applications such as indoor WiFi localization [13], age estimation [7], and human pose estimation [26], it may not be feasible to align the class-conditionals, since the output variable y is continuous.

Our work learns a neural network representation function to align the joint distribution \(P(\varvec{x},y,l)\) and the product distribution \(P(\varvec{x},y)P(l)\) under the \(L^{2}\)-distance. (1) We show that aligning these two distributions conveniently leads to the alignment of the n domains \(P^{1}(\varvec{x},y), \cdots , P^{n}(\varvec{x},y)\) (see details in Sect. 3.2). Such joint-product distribution alignment is straightforward to achieve, since it only aligns two distributions. Moreover, such alignment can handle the regression tasks, since it is free from aligning the class-conditionals. (2) In the neural network context, we align distributions under the \(L^{2}\)-distance rather than the JS divergence, since the JS divergence is usually expressed via adversarial training [28, 31, 52], which is known to be unstable and time-consuming [38, 49]. While MMD and its extensions [33, 56] circumvent the drawbacks of adversarial training, to the best of our knowledge, these metrics are mainly designed and employed for the discrepancy between the marginals [17, 37], the class-conditionals [22, 30, 56], or the joint distributions of multiple input variables [33], i.e., \(P(\varvec{x}^{1}, \cdots , \varvec{x}^{k})\) and \(Q(\varvec{x}^{1}, \cdots , \varvec{x}^{k})\). According to [19], they require the kernel function to be a universal/characteristic kernel in order to be proper metrics. However, in our work, it may not be trivial to formulate a proper MMD metric between the joint distribution \(P(\varvec{x},y,l)\) and the product distribution \(P(\varvec{x},y)P(l)\). These considerations motivate us to opt for the \(L^{2}\)-distance, which quantifies the discrepancy between the joint distribution and the product distribution in a straightforward and intuitive manner. Importantly, we show that the \(L^{2}\)-distance can be analytically estimated (see details in Sect. 3.3). (3) In Sect. 5.1, we conduct experiments to reinforce our proposal that joint-product distribution alignment under the \(L^{2}\)-distance leads to effective domain alignment.

In addition to domain alignment, there are other strategies for tackling domain generalization [35, 48, 54]. Notable works include, but are not limited to, those based on meta-learning [3], parameter decomposition [39], and optimization [36]. Balaji et al. [3] encoded the notion of domain generalization using a regularization function, and learned the function in a meta-learning framework. Piratla et al. [39] decomposed the parameters of a neural network into a common component, which is expected to generalize to the unseen target domain, and a low-rank domain-specific component, which overfits the source domains. Mansilla et al. [36] conducted gradient surgery to enhance the generalization capability of deep neural network models. In Sect. 5.2, we experimentally compare our work with some of these works for completeness.

3 Methodology

3.1 Problem formulation

Let \({\mathcal {X}}\) be an input space and \({\mathcal {Y}}\) be an output space. Particularly, \({\mathcal {Y}}\) is a discrete set of c categories for classification or a continuous space for regression. According to [17, 31, 37], we define the domain generalization problem as follows. A domain is a distribution \(P(\varvec{x},y)\) defined on \({\mathcal {X}} \times {\mathcal {Y}}\). In domain generalization, we have \(n~(n \ge 2)\) source domains \(P^{1}(\varvec{x},y), \cdots , P^{n}(\varvec{x},y)\), which are reflected by the associated datasets \({\mathcal {D}}_{xy}^{1}=\{(\varvec{x}_{i}^{1},y_{i}^{1})\}_{i=1}^{m_{1}}\), \(\cdots\), \({\mathcal {D}}_{xy}^{n}=\{(\varvec{x}_{i}^{n},y_{i}^{n})\}_{i=1}^{m_{n}}\), and an unseen target domain \(P^{t}(\varvec{x},y)\), whose samples are not available during training. The source and target domains are different but related. Given the n source datasets, the goal is to learn a prediction (classification or regression) model \(h: {\mathcal {X}} \rightarrow {\mathcal {Y}}\) that performs well on the target domain.

3.2 Joint-product distribution alignment

We model h as a neural network containing a representation function \(\phi\) and a downstream classifier/regressor g, i.e., \(y=h(\varvec{x})=g\circ \phi (\varvec{x})\). Here, \(\phi\) maps from the input space to the representation space \({\mathcal {Z}}\), i.e., \(\phi : {\mathcal {X}} \rightarrow {\mathcal {Z}}\), and g maps from the representation space to the output space, i.e., \(g: {\mathcal {Z}} \rightarrow {\mathcal {Y}}\). In this work, we propose to learn a representation function that aligns the source domains \(P^{1}(\varvec{x},y), \cdots , P^{n}(\varvec{x},y)\) with available training data as a key solution to the domain generalization problem.

We show that the alignment of the n domains can be conducted by simply aligning two distributions. To be specific, we first introduce a domain variable l, \(l \in {\mathcal {L}}=\{1, \cdots , n\}\), and define a joint distribution \(P(\varvec{x},y,l)\) and a product distribution \(P(\varvec{x},y)P(l)\) on \({\mathcal {X}} \times {\mathcal {Y}} \times {\mathcal {L}}\). Then, we view domains \(P^{1}(\varvec{x},y), \cdots , P^{n}(\varvec{x},y)\) as \(P(\varvec{x},y \vert l=1), \cdots , P(\varvec{x},y \vert l=n)\), respectively, which is inspired by the probabilistic formulation of the multi-task learning problem [4]. From this viewpoint, joint distribution \(P(\varvec{x},y,l)\) is reflected by dataset \({\mathcal {D}}_{xyl}= \{(\varvec{x}_{i}^{1},y_{i}^{1}, 1)\}_{i=1}^{m_{1}}\cup \cdots \cup \{(\varvec{x}_{i}^{n},y_{i}^{n}, n)\}_{i=1}^{m_{n}}=\{(\varvec{x}_{i},y_{i}, l_{i})\}_{i=1}^{m}\), where \(m=m_{1}+ \cdots +m_{n}\). The distribution \(P(\varvec{x},y)=\int P(\varvec{x},y,l)dl\) is reflected by dataset \({\mathcal {D}}_{xy}=\{(\varvec{x}_{i},y_{i})\}_{i=1}^{m}\), and the distribution \(P(l)=\int P(\varvec{x},y,l)d\varvec{x}dy\) is reflected by dataset \({\mathcal {D}}_{l}=\{ l_{i}\}_{i=1}^{m}\). Finally, we present the following proposition, showing that joint-product distribution alignment leads to domain alignment.
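As a concrete illustration of this construction, the sketch below (NumPy, with hypothetical data sizes) pools \(n=2\) source datasets into the dataset \({\mathcal {D}}_{xyl}\) by attaching the domain variable l to each sample; dropping the domain column recovers \({\mathcal {D}}_{xy}\), and keeping only it recovers \({\mathcal {D}}_{l}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical source datasets: n = 2 domains with m_1 = 5 and m_2 = 3
# samples, 4-dimensional inputs x and binary outputs y.
sources = [
    (rng.normal(size=(5, 4)), rng.integers(0, 2, size=5)),  # D^1_xy
    (rng.normal(size=(3, 4)), rng.integers(0, 2, size=3)),  # D^2_xy
]

# D_xyl reflects P(x, y, l): pool the domains and attach l in {1, ..., n}.
X = np.concatenate([x for x, _ in sources])
y = np.concatenate([t for _, t in sources])
l = np.concatenate([np.full(len(t), s) for s, (_, t) in enumerate(sources, start=1)])

m = len(l)  # m = m_1 + ... + m_n
# D_xy = {(x_i, y_i)} reflects P(x, y); D_l = {l_i} reflects P(l).
```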

Proposition 1

Under representation function \(\phi\), the alignment of two distributions \(P(\varvec{x},y,l)\) and \(P(\varvec{x},y)P(l)\) implies the alignment of n domains \(P(\varvec{x},y \vert l=1), \cdots , P(\varvec{x},y \vert l=n)\). That is, \(P(\phi (\varvec{x}),y,l)=P(\phi (\varvec{x}),y)P(l)\) \(\Rightarrow\) \(P(\phi (\varvec{x}),y \vert l=1)= \cdots =P(\phi (\varvec{x}),y \vert l=n)\).

The proof is placed in Appendix 1. Evidently, Proposition 1 suggests that aligning joint distribution \(P(\varvec{x},y,l)\) and product distribution \(P(\varvec{x},y)P(l)\) leads to the alignment of multiple domains \(P(\varvec{x},y \vert l=1), \cdots , P(\varvec{x},y \vert l=n)\), i.e., \(P^{1}(\varvec{x},y),\) \(\cdots , P^{n}(\varvec{x},y)\). In the next subsection, we measure the discrepancy between \(P(\varvec{x},y,l)\) and \(P(\varvec{x},y)P(l)\) under the \(L^{2}\)-distance, and derive its estimate as the alignment loss.

3.3 Analytic estimation of \(L^{2}\)-Distance

Under representation function \(\phi\), we write the joint distribution and the product distribution as \(P(\phi (\varvec{x}),y,l)\) and \(P(\phi (\varvec{x}),y)P(l)\). We first introduce the \(L^{2}\)-distance between these two distributions, and then elaborate on the distance estimation. For clarity, we illustrate in Fig. 2 an overview of the estimation in this subsection. The estimated \(L^{2}\)-distance between \(P(\phi (\varvec{x}),y,l)\) and \(P(\phi (\varvec{x}),y)P(l)\) will serve as the alignment loss for learning the representation function.

Fig. 2

Overview of the analytic estimation of the \(L^{2}\)-distance. (i) We define the \(L^{2}\)-distance between the joint distribution and the product distribution in Eq. (1). (ii) Based on Eq. (1), we introduce the variational characterization of the \(L^{2}\)-distance in Eq. (4). (iii) We replace the expectations in Eq. (4) by empirical averages and obtain Eq. (5). (iv) We design the variational function in Eq. (5) as the linear-in-parameter model in Eq. (6). (v) We solve an unconstrained quadratic maximization problem and derive the analytic estimate of the \(L^{2}\)-distance in Eq. (10). In this procedure, the variational characterization and the linear variational function are crucial to the estimation.

The \(L^{2}\)-distance between \(P(\phi (\varvec{x}),y,l)\) and \(P(\phi (\varvec{x}),y)P(l)\) is defined as

$$\begin{aligned}&L^{2}\left (P(\phi (\varvec{x}),y,l),P(\phi (\varvec{x}),y)P(l)\right ) \nonumber \\&\quad = \frac{1}{2}\int \left (P(\phi (\varvec{x}),y,l) - P(\phi (\varvec{x}),y)P(l)\right )^{2}d\phi (\varvec{x})dydl. \end{aligned}$$
(1)

The distance compares the distributions \(P(\phi (\varvec{x}),y,l)\) and \(P(\phi (\varvec{x}),y)P(l)\) based on their difference \(P(\phi (\varvec{x}),y,l)-P(\phi (\varvec{x}),y)P(l)\). It is non-negative, symmetric, and equals zero if and only if \(P(\phi (\varvec{x}),y,l)=P(\phi (\varvec{x}),y)P(l)\).
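As a small sanity check on the definition, the toy example below (NumPy, made-up numbers) evaluates the discrete analogue of Eq. (1) for a \(2 \times 2\) joint distribution standing in for \(P(\phi (\varvec{x}),y,l)\); the distance is positive here because the joint differs from the product of its marginals, and it would be exactly zero for a factorizing joint:

```python
import numpy as np

# Toy discrete analogue of Eq. (1): a 2x2 joint P(y, l) stands in for the
# joint distribution, and the outer product of its marginals for P(y)P(l).
P_joint = np.array([[0.3, 0.2],
                    [0.1, 0.4]])
P_y = P_joint.sum(axis=1)       # marginal over l
P_l = P_joint.sum(axis=0)       # marginal over y
P_prod = np.outer(P_y, P_l)

# Half the summed squared difference; zero iff the joint factorizes.
l2 = 0.5 * np.sum((P_joint - P_prod) ** 2)
```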

To estimate the \(L^{2}\)-distance, we alternatively express the original definition in Eq. (1) as

$$\begin{aligned}&L^{2}\left (P(\phi (\varvec{x}),y,l),P(\phi (\varvec{x}),y)P(l)\right ) \nonumber \\&\quad =\int \max _{r}\Big [\left (P(\phi (\varvec{x}),y,l) - P(\phi (\varvec{x}),y)P(l)\right ) \nonumber \\& \quad \quad \times r(\phi (\varvec{x}),y,l) - \frac{1}{2}r(\phi (\varvec{x}),y,l)^{2}\Big ]d\phi (\varvec{x})dydl \end{aligned}$$
(2)
$$\begin{aligned}&=\max _{r}\Big (\int \Big [\Big (P(\phi (\varvec{x}),y,l) - P(\phi (\varvec{x}),y)P(l)\Big ) \nonumber \\&\quad \times r(\phi (\varvec{x}),y,l) - \frac{1}{2}r(\phi (\varvec{x}),y,l)^{2}\Big ]d\phi (\varvec{x})dydl\Big ) \end{aligned}$$
(3)
$$\begin{aligned}&=\max _{r}\Big (\int r(\phi (\varvec{x}),y,l)P(\phi (\varvec{x}),y,l)d\phi (\varvec{x})dydl \nonumber \\&\quad - \int r(\phi (\varvec{x}),y,l)P(\phi (\varvec{x}),y)P(l)d\phi (\varvec{x})dydl \nonumber \\&\quad - \frac{1}{2}\int r(\phi (\varvec{x}),y,l)^{2}d\phi (\varvec{x})dydl\Big ). \end{aligned}$$
(4)

In Eq. (2), we express the convex function \(\frac{1}{2}u^{2}\) as \(\max _{v}(uv-\frac{1}{2}v^{2})\) using the Legendre-Fenchel convex duality [40], and regard the difference function \(P(\phi (\varvec{x}),y,l) - P(\phi (\varvec{x}),y)P(l)\) as u and the function \(r(\phi (\varvec{x}),y,l)\) as v. In Eq. (4), the quadratic functional is maximized when \(r(\phi (\varvec{x}),y,l)=P(\phi (\varvec{x}),y,l) - P(\phi (\varvec{x}),y)P(l)\). The corresponding maximal value is the right-hand side of Eq. (1). Here, we call Eq. (4) the variational characterization of the \(L^{2}\)-distance, and \(r(\phi (\varvec{x}),y,l)\) the variational function.
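The duality step used in Eq. (2) can be verified numerically: for a fixed u, maximizing \(uv-\frac{1}{2}v^{2}\) over v on a fine grid recovers the maximizer \(v=u\) and the maximal value \(\frac{1}{2}u^{2}\). A minimal check (NumPy, with an arbitrary u):

```python
import numpy as np

# Legendre-Fenchel duality check: 0.5 * u**2 == max_v (u*v - 0.5*v**2),
# with the maximum attained at v = u.
u = 1.7
vs = np.linspace(-5.0, 5.0, 100001)   # fine grid over v
vals = u * vs - 0.5 * vs ** 2
v_star = vs[np.argmax(vals)]          # grid maximizer, approximately u
```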

Since distributions \(P(\phi (\varvec{x}),y,l)\), \(P(\phi (\varvec{x}),y)\), and P(l) are reflected by their samples (see Sect. 3.2), we can replace the expectations in Eq. (4) by empirical averages and approximate the \(L^{2}\)-distance as

$$\begin{aligned}&L^{2}\big (P(\phi (\varvec{x}),y,l),P(\phi (\varvec{x}),y)P(l)\big ) \nonumber \\&\quad \approx \max _{r}\Big (\frac{1}{m}\sum _{i=1}^{m} r(\phi (\varvec{x}_{i}),y_{i},l_{i}) \nonumber \\&\quad - \frac{1}{m^{2}}\sum _{i,j=1}^{m} r(\phi (\varvec{x}_{i}),y_{i},l_{j}) \nonumber \\&\quad - \frac{1}{2}\int r(\phi (\varvec{x}),y,l)^{2}d\phi (\varvec{x})dydl\Big ) . \end{aligned}$$
(5)

Note that, here, the second term of Eq. (4) contains the expectations with respect to two probability distributions \(P(\phi (\varvec{x}),y)\) and P(l). Therefore, this term is approximated in the form \(\int r(\phi (\varvec{x}),y,l)P(\phi (\varvec{x}),y)P(l)d\phi (\varvec{x})dydl \approx \frac{1}{m^{2}}\sum _{i,j=1}^{m} r(\phi (\varvec{x}_{i}),y_{i},l_{j}).\)
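The two empirical averages can be sketched as follows (NumPy, with a made-up variational function r and random stand-ins for the representations, both purely illustrative): the joint-expectation term averages over the m matched triples, while the product-expectation term averages over all \(m^{2}\) pairings of \((\phi (\varvec{x}_{i}),y_{i})\) with \(l_{j}\):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6
z = rng.normal(size=m)            # stand-ins for phi(x_i) (1-D for brevity)
y = rng.integers(0, 2, size=m)    # labels
l = rng.integers(1, 3, size=m)    # domain variables in {1, 2}

def r(zi, yi, lj):
    # Made-up variational function, only for illustrating the two averages.
    return float(np.exp(-(zi - yi) ** 2) * (lj == 1))

# Joint-expectation term: average over the m matched triples (z_i, y_i, l_i).
term1 = np.mean([r(z[i], y[i], l[i]) for i in range(m)])

# Product-expectation term: average over all m^2 pairings of (z_i, y_i) with l_j.
term2 = np.mean([r(z[i], y[i], l[j]) for i in range(m) for j in range(m)])
```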

We design the variational function as a linear-in-parameter model

$$\begin{aligned} r(\phi (\varvec{x}),y,l;\varvec{\theta })=\sum _{i=1}^{m} \theta _{i}p\left ((\phi (\varvec{x}),y,l),(\phi (\varvec{x}_{i}),y_{i},l_{i})\right ), \end{aligned}$$
(6)

where \(\varvec{\theta }=(\theta _{1}, \cdots , \theta _{m})^{\top }\) are the parameters of the model and \(p\left ((\phi (\varvec{x}),y,l),(\phi (\varvec{x}_{i}),y_{i},l_{i})\right ) =k^{1}(\phi (\varvec{x}),\phi (\varvec{x}_{i}))k^{2}(y,y_{i})k^{3}(l,l_{i})\) is a product kernel. In particular, \(k^{1}(\phi (\varvec{x}),\phi (\varvec{x}_{i}))=\exp \left (\frac{\pi \Vert \phi (\varvec{x}) -\phi (\varvec{x}_{i})\Vert ^{2}}{-2}\right )\) is a Gaussian kernel, and \(k^{3}(l,l_{i})=\delta (l,l_{i})\) is a delta kernel that evaluates to 1 if \(l=l_{i}\) and 0 otherwise. Additionally, \(k^{2}(y,y_{i})=\delta (y,y_{i})\) is a delta kernel when y is a discrete variable for classification, and \(k^{2}(y,y_{i})=\exp \left (\frac{\pi \Vert y-y_{i}\Vert ^{2}}{-2}\right )\) is a Gaussian kernel when y is a continuous variable for regression. We note that similar linear-in-parameter models have also been employed in prior works [4, 12], but those models differ from our model (6) in terms of the input variables. Importantly, model (6) leads to the analytic estimate of the \(L^{2}\)-distance, as we will show below.
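Under these definitions, the kernel matrices used in the derivation below can be assembled as in this NumPy sketch (random stand-ins for the representations; classification case, so \(k^{2}\) and \(k^{3}\) are both delta kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 3
Z = rng.normal(size=(m, d))         # stand-ins for the representations phi(x_i)
y = rng.integers(0, 2, size=m)      # discrete labels (classification case)
l = rng.integers(1, 3, size=m)      # domain variables in {1, 2}

# k^1: Gaussian kernel exp(-pi/2 * ||phi(x_i) - phi(x_j)||^2)
sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
K1 = np.exp(-np.pi * sq / 2.0)

# k^2 and k^3: delta kernels on the labels and on the domain variables
K2 = (y[:, None] == y[None, :]).astype(float)
S = (l[:, None] == l[None, :]).astype(float)

K = K1 * K2      # k_ij = k^1 k^2
P = K * S        # p_ij = k^1 k^2 k^3, the product kernel matrix
```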

Plugging the linear variational function of Eq. (6) into Eq. (5), we derive the analytic estimate of the \(L^{2}\)-distance:

$$\begin{aligned}&\widehat{L^{2}}\big (P(\phi (\varvec{x}),y,l),P(\phi (\varvec{x}),y)P(l)\big ) \nonumber \\&\quad = \max _{\varvec{\theta }}\Big (\frac{1}{m}\sum _{i=1}^{m} r(\phi (\varvec{x}_{i}),y_{i},l_{i};\varvec{\theta }) \nonumber \\&\quad - \frac{1}{m^{2}}\sum _{i,j=1}^{m} r(\phi (\varvec{x}_{i}),y_{i},l_{j};\varvec{\theta }) \nonumber \\&\quad - \frac{1}{2}\int r(\phi (\varvec{x}),y,l;\varvec{\theta })^{2}d\phi (\varvec{x})dydl\Big ) \end{aligned}$$
(7)
$$\begin{aligned}&=\max _{\varvec{\theta }}\Big (\frac{1}{m}\varvec{1}^{\top }\varvec{P}\varvec{\theta } - \frac{1}{m^{2}}[(\varvec{1}^{\top }\varvec{K}) \odot (\varvec{1}^{\top }\varvec{S})]\varvec{\theta } - \frac{1}{2}\varvec{\theta }^{\top }\varvec{H}\varvec{\theta }\Big ) \end{aligned}$$
(8)
$$\begin{aligned}&=\max _{\varvec{\theta }} \left (\varvec{b}^{\top }\varvec{\theta } - \frac{1}{2}\varvec{\theta }^{\top }\varvec{H}\varvec{\theta } \right) \end{aligned}$$
(9)
$$\begin{aligned}&=\varvec{b}^{\top }\widehat{\varvec{\theta }} - \frac{1}{2}\widehat{\varvec{\theta }}^{\top }\varvec{H}\widehat{\varvec{\theta }}. \end{aligned}$$
(10)

Equation (8) rewrites the terms in Eq. (7) using matrix and vector notations, where \(\varvec{1}\) is an m-dimensional column vector of ones, \(\varvec{P}\), \(\varvec{K}\), \(\varvec{S}\), and \(\varvec{H}\) are \(m \times m\) symmetric matrices, and \(\odot\) is the Hadamard product. The (i, j)-th elements of \(\varvec{P}\), \(\varvec{K}\), and \(\varvec{S}\) are \(p_{ij} = p\left ((\phi (\varvec{x}_{i}),y_{i},l_{i}),(\phi (\varvec{x}_{j}),y_{j},l_{j})\right )\), \(k_{ij}=k^{1}(\phi (\varvec{x}_{i}),\phi (\varvec{x}_{j}))k^{2}(y_{i},y_{j})\), and \(s_{ij}=k^{3}(l_{i},l_{j})\). Additionally, the (i, j)-th element of \(\varvec{H}\) is \(h_{i,j}\), where \(h_{i,j}=\exp \left(\frac{\pi \Vert \phi (\varvec{x}_{i}) - \phi (\varvec{x}_{j}) \Vert ^{2}}{-4}\right)\delta (y_{i},y_{j})\delta (l_{i},l_{j})\) when y is a discrete variable in the classification tasks, and \(h_{i,j}=\exp \left (\frac{\pi \Vert \phi (\varvec{x}_{i}) - \phi (\varvec{x}_{j}) \Vert ^{2}}{-4}\right )\) \(\exp \left (\frac{\pi \Vert y_{i} - y_{j} \Vert ^{2}}{-4}\right )\delta (l_{i},l_{j})\) when y is a continuous variable in the regression tasks. Please refer to Appendix 1 for the detailed mathematical derivation of \(h_{i,j}\). Eq. (9) defines the vector \(\varvec{b} = \frac{1}{m}\varvec{P}\varvec{1} - \frac{1}{m^{2}}[(\varvec{K}\varvec{1}) \odot (\varvec{S}\varvec{1})]\) to make the unconstrained quadratic maximization problem explicit. In Eq. (10), we solve the maximization problem analytically and obtain the analytic estimate of the \(L^{2}\)-distance, where \(\widehat{\varvec{\theta }} = (\varvec{H}+\epsilon {\textbf{I}})^{-1}\varvec{b}\). Note that the diagonal matrix \(\epsilon {\textbf{I}}\) is added to \(\varvec{H}\) to ensure that the matrix is always numerically invertible in practice, where \(\epsilon\) is a small positive value and \({\textbf{I}}\) is the identity matrix.
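Putting the pieces together, the analytic estimate in Eq. (10) amounts to a single regularized linear solve. A minimal NumPy sketch for the classification case (random stand-ins for \(\phi (\varvec{x}_{i})\); the resulting estimate is non-negative because \(\varvec{H}\) is positive semi-definite):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, eps = 6, 3, 1e-6
Z = rng.normal(size=(m, d))         # stand-ins for the representations phi(x_i)
y = rng.integers(0, 2, size=m)      # discrete labels (classification case)
l = rng.integers(1, 3, size=m)      # domain variables in {1, 2}

sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
dy = (y[:, None] == y[None, :]).astype(float)   # delta(y_i, y_j)
dl = (l[:, None] == l[None, :]).astype(float)   # delta(l_i, l_j)

K = np.exp(-np.pi * sq / 2.0) * dy              # k_ij
S = dl                                          # s_ij
P = K * S                                       # p_ij
H = np.exp(-np.pi * sq / 4.0) * dy * dl         # h_ij (classification case)

one = np.ones(m)
b = P @ one / m - (K @ one) * (S @ one) / m**2  # the vector b in Eq. (9)
theta = np.linalg.solve(H + eps * np.eye(m), b) # theta_hat = (H + eps I)^{-1} b
l2_hat = b @ theta - 0.5 * theta @ H @ theta    # Eq. (10)
```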

3.4 Optimization problem

We combine the alignment loss (the estimated \(L^{2}\)-distance) and the classification/regression loss, and present the optimization problem of the proposed JPRL solution as follows

$$\begin{aligned}&\min _{\phi , g} \frac{1}{n}\sum _{s=1}^{n}\frac{1}{m_{s}} \sum _{i=1}^{m_{s}}\ell (g(\phi (\varvec{x}_{i}^{s})),y_{i}^{s}) \nonumber \\&~~~~~~~+ \lambda \widehat{L^{2}} \left (P(\phi (\varvec{x}),y,l), P(\phi (\varvec{x}),y)P(l)\right ). \end{aligned}$$
(11)

Here, in line with the common practice in [14, 52], \(\ell\) is the cross-entropy loss for classification or the square loss for regression, and \(\lambda ~(>0)\) is a tradeoff parameter for balancing the alignment loss and the task loss. We optimize the network prediction model \(h=g \circ \phi\), where \(\phi\) is the representation function and g is the classifier/regressor, to jointly minimize the two losses.

We employ minibatch SGD to solve problem (11). In each iteration of the algorithm, the minibatch is formed by drawing n sub-minibatches, one from each of the n source datasets, together with the corresponding domain variables, and the objective in Eq. (11) is computed on these samples.
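A minibatch of this form can be assembled as in the following sketch (NumPy; the source datasets and the per-domain batch size are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical source datasets of different sizes (4-dimensional inputs).
sources = [(rng.normal(size=(40, 4)), rng.integers(0, 2, size=40)),
           (rng.normal(size=(60, 4)), rng.integers(0, 2, size=60))]

def sample_minibatch(sources, per_domain=8, rng=rng):
    """Draw one sub-minibatch per source dataset and attach the domain variables."""
    xs, ys, ls = [], [], []
    for s, (X, y) in enumerate(sources, start=1):
        idx = rng.choice(len(y), size=per_domain, replace=False)
        xs.append(X[idx])
        ys.append(y[idx])
        ls.append(np.full(per_domain, s))
    return np.concatenate(xs), np.concatenate(ys), np.concatenate(ls)

# One SGD iteration would compute the objective in Eq. (11) on this batch.
Xb, yb, lb = sample_minibatch(sources)
```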

4 Discussion on the assumption

In domain generalization, since data from the target domain are not available during training, one must assume a certain relationship between the source and target domains, and exploit this relationship to improve the generalization performance of the prediction model [28, 31, 37, 38, 39, 52]. For instance, in the work of Piratla et al. [39], the authors assumed that the source and target domains share some “stable” features whose relationship with the output variable is invariant across domains, and the goal is to learn those features.

In our work, the fundamental assumption is that the source and target domains \(P^{1}(\varvec{x},y), \cdots , P^{n}(\varvec{x},y), P^{t}(\varvec{x},y)\) can be related by a representation function that makes them similar. Importantly, such a representation function can, to a certain extent, be captured and learned by aligning the multiple source domains using the available training data. As such, in the learned representation space, the source (training) and target (test) data follow similar distributions, which approximates the independent and identically distributed (i.i.d.) supervised learning setting. Hence, the source-trained classifier/regressor can generalize to the target domain. In Sect. 5.1, we provide synthetic examples to further explain this point.

As noted by Zhang et al. [50], if the target domain \(P^{t}(\varvec{x},y)\) changes arbitrarily, the available source data would be of no use to make predictions in the target domain. Under such circumstances, domain generalization methods (the domain alignment ones [28, 31, 37, 38, 52] including ours, and others [36, 39, 47, 48]) may not succeed in learning prediction models that generalize to the target domain.

5 Experiments

5.1 Experiments on synthetic datasets


We evaluate JPRL on synthetic datasets to verify its effectiveness in aligning domains. For simplicity and clear visualization, we construct two source domains \(P^{1}(\varvec{x},y),\) \(P^{2}(\varvec{x},y)\) and a target domain \(P^{t}(\varvec{x},y)\). We write vector \(\varvec{x}=(x_{1}, x_{2}, x_{3}, x_{4})^{\top }\) and use \({\mathcal {N}}(\varvec{x}; \varvec{\mu },\varvec{\Sigma })\) to denote a multivariate Gaussian distribution with mean vector \(\varvec{\mu }\) and covariance matrix \(\varvec{\Sigma }\). Similarly, \({\mathcal {N}}(y; \mu (\varvec{x}),\sigma ^{2})\) denotes a Gaussian distribution with mean \(\mu (\varvec{x})\) and variance \(\sigma ^{2}\). We implement JPRL with a one-Hidden-Layer Neural Network (1HLNN).


Classification. We consider the case where \(y \in \{-1, +1\}\) is a discrete variable for classification. We define the source and target domains in Table 3 and their corresponding parameters in Table 4. The domains are constructed to be different but related by a projection matrix \(\varvec{W} = \left( \begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{array} \right)\) such that \(P^{1}(\varvec{W}\varvec{x},y) \approx P^{2}(\varvec{W}\varvec{x},y) \approx P^{t}(\varvec{W}\varvec{x},y)\). We first draw 200 samples from each of \(P^{1}(\varvec{x},y)\), \(P^{2}(\varvec{x},y)\), and \(P^{t}(\varvec{x},y)\), and then train our JPRL network on the 400 source samples. For comparison, we also train a vanilla neural network on the same data without domain alignment and use it as a Baseline. Figure 3 plots the source and target data in the representation spaces of the Baseline network and the JPRL network. Comparing Fig. 3b against Fig. 3a, we observe that JPRL aligns the source data much better than the Baseline. Importantly, although the target data do not participate in training the network, the learned representation function of JPRL generalizes well to the target data and aligns them with the source data. Consequently, JPRL performs well in the target domain and yields a small classification error of 0.04, clearly superior to the error of 0.31 from the Baseline. This experiment verifies the effectiveness of JPRL in domain alignment and in target domain classification.
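The construction of such related domains can be sketched as follows. The class means below are illustrative values of our own choosing, not the actual parameters of Tables 3 and 4; the point is only that the domains share their first two coordinates, so they approximately coincide after projection by \(\varvec{W}\):

```python
import numpy as np

# Projection matrix W from the text: it keeps the first two coordinates of x.
W = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.]])

def sample_domain(mu_pos, mu_neg, n, rng, noise=0.1):
    """Draw n labeled samples from a two-class Gaussian domain P(x, y).

    mu_pos/mu_neg are the 4-d class means (illustrative, NOT the values
    of Tables 3 and 4). Domains that share the first two coordinates of
    their class means yield approximately identical P(Wx, y).
    """
    y = rng.choice([-1, 1], size=n)
    mus = np.where((y == 1)[:, None], mu_pos, mu_neg)
    x = mus + rng.normal(0.0, np.sqrt(noise), size=(n, 4))
    return x, y
```

For instance, a source domain with means \((1,1,2,0)\)/\((-1,-1,2,0)\) and a target domain with means \((1,1,0,5)\)/\((-1,-1,0,5)\) differ in the last two coordinates but match class-wise after projection by W.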

Table 3 Definition of the source and target domains in cases where \(y \in \{-1,+1\}\) is a discrete variable for classification and \(y \in {\mathbb {R}}\) is a continuous variable for regression. For classification, \(P(y=-1 \vert \varvec{x})=1-P(y=+1 \vert \varvec{x})\)
Table 4 Parameter of the source and target domains defined in Table 3

Regression. We consider the case where y is a continuous variable for regression. We define the source and target domains in Table 3 and their corresponding parameters in Table 4. The domains are different but related by a vector \(\varvec{w} = (1, 0, 0, 0)^{\top }\) such that \(P^{1}(\varvec{w}^{\top }\varvec{x},y) \approx P^{2}(\varvec{w}^{\top }\varvec{x},y) \approx P^{t}(\varvec{w}^{\top }\varvec{x},y)\). As in the classification case, we draw samples from the source and target domains, and train the JPRL and Baseline networks on the source samples. Figure 4 visualizes the source and target data in the representation spaces of these two networks. From Fig. 4b, we observe that JPRL well aligns the source and target domains in the representation space, and obtains a low target regression error of 0.09. These results are significantly better than the ones in Fig. 4a for the Baseline. Evidently, this experiment confirms the effectiveness of JPRL in domain alignment and in target domain regression.

Fig. 3

Visualization of source and target data in the representation spaces of the Baseline network and the JPRL network for classification. Symbols s\(^{l}\)(p) and s\(^{l}\)(n) denote the positive and negative classes in source domain \(l \in \{1,2\}\), and t(p) and t(n) denote the positive and negative classes in the target domain. a Baseline (err. 0.31). b JPRL (err. 0.04)

Fig. 4

Visualization of source and target data in the representation spaces of the Baseline network and the JPRL network for regression. Symbol s\(^{l}\) denotes data from source domain \(l \in \{1,2\}\) and symbol t denotes data from the target domain. a Baseline (err. 0.35). b JPRL (err. 0.09)

5.2 Experiments on real-world datasets

In domain generalization, there exist two settings for conducting experiments on real-world datasets: one commonly practiced in [47, 48, 52, 53], and the other one introduced by Gulrajani and Lopez-Paz [20]. We conduct the experiments following the former setting, so that we can quote the available results reported by the authors themselves.

5.2.1 Datasets

We evaluate JPRL on classification and regression datasets. The classification datasets include a text dataset, Amazon Reviews [8], and two image datasets, PACS [27] and Office-Home [45]; across them, the number of classes varies from 2 to 65. The regression datasets include a WiFi dataset, UJIIndoorLoc [43], and an image dataset, UTKFace [51]. We use these datasets since they are utilized in prior works [10, 13, 48, 52]. Importantly, we aim to verify that JPRL can (1) perform well on several types of data (text, image, and WiFi), (2) scale and perform well in classification tasks with a few classes and with many classes, and (3) handle domain generalization in regression. We summarize the dataset information in Table 5, and briefly describe each dataset in the following.

Table 5 Statistics of the real-world datasets for classification and regression (*)

Amazon Reviews [8] contains online reviews from 4 domains (product categories): Books (B), DVDs (D), Electronics (E), and Kitchens (K). Each domain has 2000 reviews, encoded as 5000-dimensional feature vectors of unigrams and bigrams. The task is classification with 2 classes representing positive and negative reviews.


PACS [27] includes images from 4 domains: ArtPainting (A) with 2048 images, Cartoon (C) with 2344 images, Photo (P) with 1670 images, and Sketch (S) with 3929 images. See Fig. 5a for the example images. The task is classification with 7 classes.


Office-Home [45] contains images of everyday objects organized into 4 domains: Art (A) with 2427 images, Clipart (C) with 4365 images, Product (P) with 4439 images, and RealWorld (R) with 4357 images. See Fig. 5b for the example images. The task is classification with 65 classes.


UJIIndoorLoc [43] is a multi-floor indoor localization dataset for testing indoor positioning systems that rely on WiFi fingerprints. Each WiFi fingerprint is characterized by 520 Received Signal Strength Intensity (RSSI) values. Inspired by [13], we split the dataset into 4 domains: Floor1 (F1), Floor2 (F2), Floor3 (F3), and Floor4 (F4), with 4501, 5464, 4722, and 5220 samples, respectively. The task is regression with the latitude and longitude variables.


UTKFace [51] includes face images labeled by age, gender, and ethnicity. The images cover a large variation in pose, facial expression, illumination, etc. We use face images with ages spanning from 1 to 100, and split the dataset into 4 domains by the ethnicity label: Asian (A), Black (B), Indian (I), and White (W), with 3419, 4488, 3949, and 9970 samples, respectively. See Fig. 5c for the example images. The task is regression with the age variable.

Fig. 5

Example images from 3 datasets. a PACS [27]. b Office-Home [45]. c UTKFace [51]

5.2.2 Comparison methods

We compare JPRL against existing domain generalization methods including (1) the ones that perform domain alignment (Alignment) for domain generalization, and (2) the ones that tackle the problem via other strategies (Others). The former methods include: MMD-based Adversarial AutoEncoder (MMD-AAE) [28], CIDDG [31], Domain Generalization via Entropy Regularization (DGER) [52]. The latter methods include: Cross-Gradient (CrossGrad) [42], Meta-Regularization (MetaReg) [3], Deep Domain-Adversarial Image Generation (DDAIG) [53], Representation Self-Challenging (RSC) [23], Common Specific Decomposition (CSD) [39], Fourier Augmented Co-Teacher (FACT) [47], MixStyle [54], Domain Generalization via Gradient Surgery (DGGS) [36], and Adversarial Teacher-Student Representation Learning (ATSRL) [48].

5.2.3 Evaluation protocol

In line with the leave-one-domain-out evaluation protocol [47, 52], we train a neural network classification or regression model on the source datasets and use it to predict the labels of samples from the target dataset. The performance of a classification model is measured by the classification accuracy (%) following [52], and the performance of a regression model is measured by the sum of Mean Absolute Error (MAE) following [14]. On every task, we follow [47, 52] and repeat the experiments 5 times with different random seeds, reporting the average result.
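A minimal sketch of this protocol and metric, under our own naming; the "sum of MAE" here sums the per-variable mean absolute errors, which is our reading of the metric:

```python
import numpy as np

def leave_one_domain_out(domains):
    """Enumerate (sources, target) splits: each domain is held out once
    as the unseen target while the rest serve as source domains."""
    for t, target in enumerate(domains):
        sources = [d for i, d in enumerate(domains) if i != t]
        yield sources, target

def sum_of_mae(y_true, y_pred):
    """Sum over the regression variables of the mean absolute error."""
    err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return err.mean(axis=0).sum()
```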

5.2.4 Implementation details


We implement JPRL with both shallow and deep networks (backbones), and present in Table 6 an overview of the backbone configuration for each dataset. Specifically, on the text dataset Amazon Reviews with the 5000-dimensional pre-processed features [8], following [3, 16], the backbone is a 1HLNN with 100 sigmoid hidden neurons and 2 softmax output neurons (i.e, the number of classes in Amazon Reviews). On the 2 RGB image datasets PACS and Office-Home, we follow the practice in previous works [47, 48, 52] and use the ResNet18 and ResNet50 [21] backbones, whose final layers are reconstructed to have as many outputs as the number of classes in the dataset (7 for PACS and 65 for Office-Home). On the WiFi dataset UJIIndoorLoc with the 520-dimensional features, the backbone is a 1HLNN with 260 ReLU hidden neurons and 2 sigmoid output neurons (i.e, the number of regression variables in UJIIndoorLoc). By this design, the number of hidden neurons is half the number of input neurons. Finally, on the RGB image dataset UTKFace, the backbone is the ResNet50 model, whose final layer is reconstructed to have one sigmoid output (i.e, the number of regression variables in UTKFace).
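As an illustration, the UJIIndoorLoc backbone described above can be sketched as a forward-only NumPy network; the initialization scheme is our assumption, and the actual training code is not shown:

```python
import numpy as np

class OneHiddenLayerNet:
    """Forward-only sketch of the UJIIndoorLoc backbone:
    520 inputs -> 260 ReLU hidden neurons -> 2 sigmoid outputs."""

    def __init__(self, d_in=520, d_out=2, rng=None):
        rng = rng or np.random.default_rng(0)
        d_hid = d_in // 2  # hidden width is half the input width, as in the text
        # He-style initialization (our assumption, not from the paper).
        self.W1 = rng.normal(0, np.sqrt(2 / d_in), (d_in, d_hid))
        self.b1 = np.zeros(d_hid)
        self.W2 = rng.normal(0, np.sqrt(2 / d_hid), (d_hid, d_out))
        self.b2 = np.zeros(d_out)

    def forward(self, x):
        h = np.maximum(x @ self.W1 + self.b1, 0.0)              # ReLU
        return 1.0 / (1.0 + np.exp(-(h @ self.W2 + self.b2)))  # sigmoid
```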

Table 6 Backbone configuration for the datasets, where 1HLNN represents one-Hidden-Layer Neural Network

We preprocess and split the training data as follows. On Amazon Reviews and UJIIndoorLoc, we utilize the common z-score standardization to preprocess the features. On PACS, Office-Home, and UTKFace, we follow the standard practice in [47] and process the RGB images via random resized cropping, horizontal flipping, and color jittering. Moreover, for the regression datasets UJIIndoorLoc and UTKFace, we follow the work of Chen et al. [14] and normalize the regression labels to [0, 1] to eliminate the effects of diverse scales in the regression variables. Regarding the data splitting protocol, we follow the general practice in prior works [47, 48, 52] and use 90% of available data as training data and 10% as validation data.
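The z-score standardization and label normalization steps can be sketched as follows; computing the statistics on the training split only is our assumption:

```python
import numpy as np

def zscore_fit_transform(X_train, X_other):
    """Standardize features with statistics computed on the training split."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0) + 1e-8
    return (X_train - mu) / sd, (X_other - mu) / sd

def normalize_labels(y):
    """Min-max scale regression labels to [0, 1], as done for UJIIndoorLoc
    and UTKFace; returns (lo, hi) so predictions can be mapped back."""
    lo, hi = y.min(axis=0), y.max(axis=0)
    return (y - lo) / (hi - lo), (lo, hi)
```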

For training the networks, JPRL with its shallow implementations on Amazon Reviews and UJIIndoorLoc is trained from scratch by minibatch SGD with a momentum of 0.9 and a learning rate of \(10^{-3}\). The tradeoff parameter \(\lambda\) is selected from the range \(\{10^{-3}, 10^{-1}, \cdots , 10^{3}\}\) by using the validation data. Furthermore, JPRL with its deep implementations on PACS, Office-Home, and UTKFace is trained from the ImageNet pretrained models. The optimizer is still the minibatch SGD and the learning rate is initially set to \(10^{-3}\) and shrunk to \(10^{-4}\) after 50 iterations. This time, the tradeoff parameter \(\lambda\) is not selected through a grid search, since the corresponding procedure would be computationally costly. Instead, following [12, 16], we gradually change \(\lambda\) from 0 to 1 by a progressive schedule: \(\lambda _{t}=\frac{2}{1+\exp (-10t)}-1\), where t is the training progress linearly changing from 0 to 1.
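The progressive schedule for \(\lambda\) is simple to implement; the following sketch reproduces the formula from the text:

```python
import math

def lambda_schedule(t):
    """Progressive tradeoff schedule: lambda_t = 2 / (1 + exp(-10 t)) - 1,
    where t is the training progress linearly changing from 0 to 1."""
    return 2.0 / (1.0 + math.exp(-10.0 * t)) - 1.0
```

The schedule starts at 0 (no alignment pressure early in training) and saturates near 1, so the alignment term is phased in gradually.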

5.2.5 Results


We report in Table 7 the classification results on Amazon Reviews, in Tables 8 and 9 the results on PACS, and in Tables 10 and 11 the results on Office-Home. In addition, we report in Table 12 the regression results on UJIIndoorLoc and in Table 13 the regression results on UTKFace. In every table, the names of the source domains are omitted under the leave-one-domain-out evaluation protocol. For every column, the best result is highlighted in bold.

Table 7 Classification accuracy (%) of the Alignment and Other methods with 1HLNN backbone on dataset Amazon Reviews.
Table 8 Classification accuracy (%) of the Alignment and Other methods with ResNet18 backbone on dataset PACS
Table 9 Classification accuracy (%) of the Alignment and Other methods with ResNet50 backbone on dataset PACS
Table 10 Classification accuracy (%) of the Alignment and Other methods with ResNet18 backbone on dataset Office-Home
Table 11 Classification accuracy (%) of the Alignment and Other methods with ResNet50 backbone on dataset Office-Home
Table 12 Sum of MAE of the Alignment and Other methods with 1HLNN backbone on regression dataset UJIIndoorLoc
Table 13 Sum of MAE of the Alignment and Other methods with ResNet50 backbone on regression dataset UTKFace
Table 14 Sum of MAE of the Alignment and Other methods with 2HLNN backbone on regression dataset Cooling Capacity

Classification. We quote the available results of the comparison methods in Table 7 from [3], the results in Tables 8 and 9 from [39, 47, 48, 52], and the results in Tables 10 and  11 from [47, 48], since our experimental settings coincide with these works. To compare with the relevant Alignment methods CIDDG and DGER, we use their source codes, follow their hyperparameter tuning protocols, and produce their results on datasets Amazon Reviews and Office-Home. Besides, we also produce the results of Other methods, i.e, CSD and DGGS, on Amazon Reviews. Note that, with the cited and produced results, we aim for a comprehensive comparison of JPRL with the Alignment methods and Other recent methods on different types of data: text (Amazon Reviews) and image (PACS, Office-Home), and in tasks with a few classes (Amazon Reviews, PACS) and many classes (Office-Home).

In Table 7 for the text dataset Amazon Reviews, JPRL with the 1HLNN backbone consistently outperforms its competitors on all 4 tasks, and yields the highest average classification accuracy of 82.26%. This outperformance verifies that, compared to the alignment of marginals and class-conditionals (CIDDG, DGER) or other strategies (MetaReg, CSD, DGGS), our joint-product distribution alignment under the \(L^{2}\)-distance is more effective in domain generalization for text classification. In Tables 8 and 9 for the image dataset PACS, JPRL with the ResNet18 and ResNet50 backbones is among the top performing methods. It significantly outperforms the Alignment methods MMD-AAE and DGER on the first 2 tasks in Table 8, and achieves better average accuracy than the very recent methods MixStyle, FACT, and ATSRL in both tables. These results demonstrate that JPRL achieves generally preferable performance regardless of the backbone choice. In Tables 10 and 11 for another image dataset Office-Home, the results show that JPRL is preferable to the comparison methods considered. Here, it is worth mentioning that CIDDG and DGER align distributions in a class-wise manner using adversarial training, which might not be easy to achieve with many classes (65 in Office-Home). By contrast, our JPRL learns a representation function to align only 2 distributions, the joint distribution and the product distribution, which conveniently leads to the alignment of multiple domains simultaneously. Moreover, our distribution alignment is conducted by solving a simple minimization problem, rather than a challenging adversarial problem. In summary, the results from Tables 8, 9, 10 and 11 show that in domain generalization for image classification, the proposed JPRL solution (1) is more advantageous than the relevant Alignment competitors, and (2) also yields better or comparable results to the Others.


Regression. Since the regression results of the domain generalization methods are not available, we use the source codes of the Alignment method MMD-AAE and the Other methods (MetaReg, CSD, and DGGS) to produce their regression results on datasets UJIIndoorLoc and UTKFace. The results are produced by replacing the original classification loss with the regression loss (square loss) and following the hyperparameter tuning protocols of the methods. Here, we do not include the relevant CIDDG and DGER as comparison methods, since their domain generalization losses (not the classification loss) are derived by assuming that the label y is a discrete variable for classification. Similar to classification, we aim for a comprehensive comparison of JPRL with the Alignment and Other methods on different types of data: WiFi (UJIIndoorLoc) and image (UTKFace), and in tasks with one regression variable (UTKFace) and multiple regression variables (UJIIndoorLoc).

In Table 12 for the WiFi dataset UJIIndoorLoc and Table 13 for the image dataset UTKFace, JPRL with the 1HLNN and ResNet50 backbones obtains a lower sum of MAE than its comparison methods on 7 out of the 8 tasks. This verifies that, in the less studied problem of domain generalization for regression, our JPRL is more effective in handling different regression tasks. Furthermore, we also evaluate our solution on the Cooling Capacity dataset [25] from the field of energy conversion and management. We split the dataset into 3 subsets (domains), D500, D600, and D900, according to the cycle time. The task is regression with the cooling capacity variable. We follow the work of Krzywanski et al. [25] and implement our JPRL with a two-Hidden-Layer Neural Network (2HLNN). Table 14 reports the regression results. We observe that JPRL consistently outperforms its comparison methods, which again demonstrates the effectiveness of our solution.

5.2.6 Statistical test

To be rigorous in a statistical sense, we further conduct a statistical test to check whether our solution is significantly better than the others in the classification tasks. We conduct the Wilcoxon signed-ranks test [9, 15] based on the classification results from Tables 7, 8, 9, 10 and 11. The test uses a statistic z to compare the performance of two methods over multiple tasks; in each task, the classification accuracy is adopted as the performance measure. We fix JPRL as the control method, and conduct 8 pairs of tests: CIDDG versus JPRL, DGER versus JPRL, MetaReg versus JPRL, DDAIG versus JPRL, MixStyle versus JPRL, ATSRL versus JPRL, RSC versus JPRL, and FACT versus JPRL. The detailed test procedure is presented in Appendix 1, and the resulting 8 z values are reported in Table 15. We observe from Table 15 that the z values for these 8 pairs are all below the critical value of \(-1.96\). According to [9, 15], this indicates that, at a significance level of 0.05, JPRL is statistically better than its comparison methods.
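The z statistic of this test can be computed as below. This is a sketch of the standard normal-approximation form; ties among the absolute differences are ranked arbitrarily here, whereas the full procedure in Appendix 1 may use average ranks:

```python
import numpy as np

def wilcoxon_z(acc_a, acc_b):
    """Wilcoxon signed-ranks z statistic over per-task scores.

    Zero differences are discarded; T = min(R+, R-) is compared with its
    expectation under the null. A value below -1.96 indicates a
    significant difference at the 0.05 level (normal approximation).
    """
    d = np.asarray(acc_a, dtype=float) - np.asarray(acc_b, dtype=float)
    d = d[d != 0]                                    # drop ties between methods
    n = len(d)
    ranks = np.argsort(np.argsort(np.abs(d))) + 1.0  # ranks of |d| (ties arbitrary)
    r_plus = ranks[d > 0].sum()
    r_minus = ranks[d < 0].sum()
    T = min(r_plus, r_minus)
    mu = n * (n + 1) / 4.0
    sigma = np.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    return (T - mu) / sigma
```

For example, a method that loses to the control on every one of 10 tasks yields T = 0 and z ≈ −2.8, below the −1.96 threshold.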

Table 15 z values of different methods versus JPRL on the domain generalization tasks

5.3 Experimental analysis

The main experimental results above have demonstrated the advantage of our JPRL solution to domain generalization in classification and regression. In the following, we further conduct several experiments to analyze the contributions of JPRL. As comparison methods, we include Deep Adaptation Networks (DAN) [32], Joint Adaptation Networks (JAN) [33], and Deep Subdomain Adaptation Network (DSAN) [56], which are popular MMD-based methods. Note that, since these methods are originally designed for domain adaptation, we slightly modify them so that their inputs contain data from multiple source domains, the same as the inputs of the domain generalization methods.

5.3.1 Feature visualization


We visualize the domain alignment ability of JPRL on a real-world dataset, using the adversarial methods CIDDG and DGER and the MMD-based method DSAN for comparison. We use Office-Home as the experimental dataset: it has many classes in each domain, which makes it challenging for domain alignment and hence a good testbed for assessing the ability of JPRL. In particular, we take Product as the target and the remaining 3 as the source domains, and plot in Fig. 6 the t-SNE [34] embeddings of the domain data, obtained from the representation spaces of the above methods with the ResNet18 backbone. Comparing Fig. 6d against Fig. 6a, b, c, we observe that JPRL aligns the source and target domains in the representation space better than its comparison methods CIDDG, DGER, and DSAN. This suggests that by joint-product distribution alignment under the \(L^{2}\)-distance, JPRL better reduces the discrepancy among domains, approximates the i.i.d. supervised learning setting to a certain extent, and eventually leads to superior target classification results.
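The visualization pipeline (stacking per-domain representation features, tracking a domain index per point, and projecting to 2-D) can be sketched as below. The paper uses t-SNE [34]; here a PCA projection stands in purely to keep the sketch dependency-free:

```python
import numpy as np

def embed_domains_2d(features_by_domain):
    """Stack per-domain representation features and project them to 2-D.

    features_by_domain: list of (n_l, d) arrays, one per domain, extracted
    from the network representation space. Returns the 2-D embedding and
    a domain index per point (for coloring the scatter plot).
    """
    X = np.concatenate(list(features_by_domain))
    dom = np.concatenate([np.full(len(F), i)
                          for i, F in enumerate(features_by_domain)])
    Xc = X - X.mean(axis=0)
    # Top-2 principal directions via SVD of the centered feature matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T, dom
```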

Fig. 6

T-SNE visualization of source and target data in the representation spaces of CIDDG, DGER, DSAN, and JPRL. The source domains are Art, Clipart, and RealWorld, and the target domain is Product. a CIDDG. b DGER. c DSAN. d JPRL

5.3.2 Regression loss


According to a recent survey by Lathuilière et al. [26], besides the square loss, the \(L^{1}\) loss and the Huber loss are also common regression losses. Here, we investigate whether the advantage of JPRL over the comparison methods depends on the choice of regression loss. To this end, we utilize all three losses, and run JPRL and its comparison methods DGGS and MMD-AAE on two tasks from UJIIndoorLoc and UTKFace. The target regression results on the two tasks are reported in Fig. 7a, b, where horizontal dashed lines are plotted for easy comparison. Figure 7a shows that with different losses, JPRL consistently achieves lower target error than DGGS and MMD-AAE. For JPRL itself, the performance varies with the loss: the best performance is attained with the \(L^{1}\) loss, while the other two losses lead to similar performance. From Fig. 7b, we again observe that JPRL yields better results than its competitors regardless of the regression loss. From this evidence, we conclude that while the performance of JPRL varies with the regression loss, its advantage over the comparison methods remains consistent, which shows that our joint-product representation learning component plays an important role.
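The three regression losses, applied to a residual r = prediction − target, can be sketched as follows; the 1/2 factor on the square loss is a convention we adopt so that the Huber loss reduces to it for small residuals:

```python
import numpy as np

def square_loss(r):
    """Quadratic loss 0.5 * r^2 (the 1/2 factor is a convention)."""
    return 0.5 * r ** 2

def l1_loss(r):
    """Absolute-error loss |r|."""
    return np.abs(r)

def huber_loss(r, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond: robust to outliers."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))
```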

Fig. 7

Target regression error of DGGS, MMD-AAE, and JPRL, which are trained with the square loss, the \(L^{1}\) loss, and the Huber loss. The horizontal dashed lines are plotted for easy comparison of the error bars. a Regression error in target domain Floor1 (UJIIndoorLoc). b Regression error in target domain Asian (UTKFace)

5.3.3 Comparison with MMD-based methods


We compare JPRL with DAN, JAN, and DSAN, which are also free from adversarial training. We run the comparison experiments on Office-Home and equip all methods with the ResNet18 backbone. The target classification results are reported in Table 16. We observe that JPRL noticeably outperforms DAN, JAN, and DSAN. This shows that in domain generalization, joint-product distribution alignment under the \(L^{2}\)-distance has an advantage over the alignment strategies under the MMD distance implemented by the comparison methods.
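For reference, the squared MMD with a Gaussian kernel, the discrepancy measure underlying these comparison methods, can be estimated as below. This is the standard biased estimator, not the paper's \(L^{2}\)-distance; the kernel bandwidth choice is illustrative:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy between
    the samples X and Y under a Gaussian kernel."""
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2.0 * gaussian_kernel(X, Y, sigma).mean())
```

Two samples from the same distribution give a near-zero estimate, while a shifted sample gives a clearly larger one.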

Table 16 Classification accuracy (%) of MMD-based methods and our solution with ResNet18 backbone on dataset Office-Home. In each column, the best result is highlighted in bold

6 Conclusion

In this work, we study the domain generalization problem and propose the JPRL solution, which generalizes a source-trained network prediction model to the target domain. Our solution works by (1) aligning in the network representation space two probability distributions, the joint distribution and the product distribution, and (2) minimizing the downstream classification or regression loss. In particular, we align the two distributions under the \(L^{2}\)-distance, which we estimate explicitly by analytically solving an unconstrained quadratic maximization problem. We implement our solution with both shallow and deep network architectures, and experimentally demonstrate its effectiveness on synthetic and real-world datasets for classification and regression.

The limitation of our solution is that when the domains are very different from each other, it could be difficult to align the joint distribution and the product distribution well in the network representation space. As a result, our solution may not bring much performance improvement to the network prediction model. In the future, we plan to strengthen our solution to handle such cases by incorporating other complementary domain generalization strategies (eg, [36, 53]). Furthermore, we also plan to explore the related multi-source domain adaptation problem [46], and extend the current solution to address that problem.