1 Introduction

Supervised learning models (e.g., classification and regression models) with appropriately learned parameters can generalize well to the test data, under the assumption that both the training and test data are governed by the same domain \(P(\varvec{x},y)\), where \(\varvec{x}\) and y represent the input and output variables [44]. While this is a reasonable assumption to make, it is often violated in practical applications. In computer vision, the training and test images can be acquired under different imaging conditions (e.g., background and illumination), and thus represent different probability distributions [31]. In indoor WiFi localization, the training and test data often follow different distributions, as they are collected at different time periods [18] or from different places [13]. Under such circumstances, supervised learning models trained by merely following the Empirical Risk Minimization (ERM) principle [44] may perform sub-optimally and fail to make accurate predictions for the test data [38].

As an important problem in machine learning and computer vision, domain generalization [5, 6] is exactly concerned with such a non-identically-distributed supervised learning scenario. In this problem, the training data consist of n (\(n \ge 2\)) datasets respectively drawn from n source domains \(P^{1}(\varvec{x}, y),\) \(\cdots , P^{n}(\varvec{x}, y)\), while the test data are sampled from an unseen target domain \(P^{t}(\varvec{x}, y)\). The source and target domains are different but related [17, 28, 31, 37], and the goal of domain generalization is to train a prediction (classification or regression) model on the collection of the n source datasets and generalize it to the target domain. In the following, we use mathematical notations to describe the domain generalization works. For clarity and readability, we first give an overview of these notations in Table 1.

Table 1 Notations and their descriptions

Prior works [1, 17, 28, 30, 38] aim to learn a representation function (i.e., a projection matrix or a neural network) to align the n source domains \(P^{1}(\varvec{x}, y), \cdots , P^{n}(\varvec{x}, y)\) as a key solution to the problem, and train a classifier/regressor in the representation space. Then, the prediction model containing the representation function and the classifier/regressor is expected to generalize well to the unseen target domain \(P^{t}(\varvec{x}, y)\) [30, 37, 52]. Specifically, since a domain \(P(\varvec{x},y)\) can be decomposed into \(P(\varvec{x},y)=P(\varvec{x})P(y\vert \varvec{x})\), early works [17, 28, 37] align the n domains via learning a representation function to align the marginal distributions (marginals) \(P^{1}(\varvec{x}),\) \(\cdots , P^{n}(\varvec{x})\), assuming that the posterior distribution \(P(y \vert \varvec{x})\) is stable. However, as noted in several works [30, 38, 52], the stability of \(P(y\vert \varvec{x})\) is often violated in practice. Therefore, later works [30, 31, 52] propose to align the n domains \(P^{1}(\varvec{x}, y), \cdots , P^{n}(\varvec{x}, y)\) in other manners. They learn a representation function to align (1) a set of n marginals and c sets of n class-conditional distributions (class-conditionals) \(P^{1}(\varvec{x}\vert y=i),\) \(\cdots , P^{n}(\varvec{x} \vert y=i)\) for \(i \in \{1, \cdots , c\}\), or (2) a set of n marginals and n sets of c class-conditionals \(P^{l}(\varvec{x} \vert y=1), \cdots , P^{l}(\varvec{x} \vert y=c)\) for \(l \in \{1, \cdots , n\}\), where c is the number of classes when y is a discrete variable for classification. However, these works bear the heavy burden of aligning multiple sets of marginals and class-conditionals, with each set containing multiple distributions. As noted in [38], such alignment may be difficult to achieve when the number of classes c or the number of domains n is large.
Moreover, in the regression tasks, which arise widely in various real-world applications [7, 13, 26], aligning the class-conditionals in these works may not even be feasible, since the output variable y is continuous in regression.

Fig. 1

Illustration of our Joint-Product Representation Learning (JPRL) solution to domain generalization. Here, the network prediction model \(h=g \circ \phi\) could be a shallow network or a deep convolutional neural network, both of which are implemented in Sect. 5. We propose to perform joint-product distribution alignment in the representation space, and derive an analytic estimate \(\widehat{L^{2}}(P(\phi (\varvec{x}),y,l),P(\phi (\varvec{x}),y)P(l))\) of the \(L^{2}\)-distance to serve as the alignment loss. We use standard minibatch SGD to optimize the parameters of the network, such that the classification/regression loss and the alignment loss are jointly minimized. With the network model trained, we apply it to the inference task in the target domain

In this work, we propose to learn a neural network representation function that aims at aligning the n domains \(P^{1}(\varvec{x}, y),\) \(\cdots , P^{n}(\varvec{x}, y)\) in a different way. To be specific, we first introduce a domain variable l, \(l \in \{1, \cdots , n\}\), and define a joint distribution \(P(\varvec{x},y,l)\) and a product distribution \(P(\varvec{x},y)P(l)\). We then respectively view domains \(P^{1}(\varvec{x},y), \cdots , P^{n}(\varvec{x},y)\) as \(P(\varvec{x},y \vert l=1), \cdots , P(\varvec{x},y \vert l=n)\). Our idea is to learn the network representation function \(\phi\), such that the joint distribution and the product distribution are aligned in the representation space, i.e., \(P(\phi (\varvec{x}),y,l) = P(\phi (\varvec{x}),y)P(l)\). We show through a proposition that such joint-product distribution alignment leads to the alignment of the n domains, i.e., \(P(\phi (\varvec{x}),y \vert l=1) = \cdots = P(\phi (\varvec{x}),y \vert l=n)\). Therefore, the problem of aligning multiple domains \(P^{1}(\varvec{x},y), \cdots , P^{n}(\varvec{x},y)\) is conveniently transformed into the problem of aligning two distributions: the joint distribution \(P(\varvec{x},y,l)\) and the product distribution \(P(\varvec{x},y)P(l)\). The benefits of our alignment proposal are twofold. (1) Our proposal only needs to align two distributions regardless of the number of classes c or the number of domains n, which is straightforward and easy to achieve. (2) Our proposal naturally applies to the regression tasks, since it does not rely on aligning the class-conditionals.

To be more specific, we align \(P(\varvec{x},y,l)\) and \(P(\varvec{x},y)P(l)\) under the \(L^{2}\)-distance. This distance, as we will show, can be analytically estimated, and is better suited to our case than the Maximum Mean Discrepancy (MMD) [19]. In the next section, we include a more detailed discussion to justify our motivation for employing this distance. To estimate the \(L^{2}\)-distance, we first exploit the Legendre-Fenchel convex duality [40] to obtain its variational characterization, i.e., the maximal value of a quadratic functional with respect to a variational function. Subsequently, we design the variational function as a linear-in-parameter model, and derive an analytic estimate for the \(L^{2}\)-distance. As a result, our joint-product distribution alignment can be readily conducted by learning the representation function that minimizes the estimated \(L^{2}\)-distance between the joint distribution and the product distribution. In the representation space, we train a downstream classifier/regressor for the inference task in the target domain. Both the representation function and the classifier/regressor are optimized via the minibatch Stochastic Gradient Descent (SGD) algorithm. See Fig. 1 for an illustration of our solution, which is denoted as JPRL for “Joint-Product Representation Learning” in the remainder of the paper. We demonstrate the effectiveness of our solution through comprehensive experiments on synthetic and real-world datasets for classification and regression.

This paper is structured as follows. Section 2 reviews the related works. Section 3 introduces the JPRL solution. Section 4 discusses the assumption behind the solution. Section 5 describes the datasets and the experimental settings and reports the experimental results. Section 6 presents the conclusion.

2 Related work

The study of domain generalization can be traced back to the early works of Blanchard et al. [5] and Khosla et al. [24]. Since then, many strategies have been proposed to tackle the problem.

Domain alignment is a popular strategy for domain generalization, which, to a certain extent, is inspired by the domain adaptation works [11, 12, 16, 29, 55]. Here, we focus on discussing the domain alignment works [1, 28, 31, 37, 38, 52], since they are most relevant to our solution. In essence, most of these works learn a representation function (i.e., a projection matrix or a neural network) to align the marginal distributions (marginals) or the class-conditional distributions (class-conditionals) of the domains under various metrics, e.g., MMD and the Jensen-Shannon (JS) divergence. For clarity, we first present in Table 2 an overview of the main differences between our work and the related works from the perspectives of distribution alignment, representation function, distribution discrepancy metric, and optimization. We then elaborate on the details in the following.

Table 2 Overview of the main differences between our work and other relevant works. To present the text clearly, we abbreviate “class-conditionals”, “representation”, and “decomposition” to “class-cond.”, “repr.”, and “decomp.”, respectively

Early works learn a representation function to align the marginals \(P^{1}(\varvec{x}), \cdots , P^{n}(\varvec{x})\) under MMD or the JS divergence. Muandet et al. [37] learned a projection matrix to align the marginals while preserving the functional relationship between the input and output variables. Similarly, Ghifary et al. [17] reduced the dimensionality of the data such that the marginals are aligned, and the separability of classes and the separability of unlabeled data are also maximized. These MMD-based works place the projection matrix outside the kernel mapping to the Reproducing Kernel Hilbert Space (RKHS) [41]. Consequently, the resulting optimization problems can be solved via eigenvalue decomposition. Moreover, Li et al. [28] learned a neural network to align the distributions of the coded source features under MMD, and matched the aligned marginals to a prior Laplacian distribution under the JS divergence, which is achieved by adversarial training. Since these works [17, 28, 37] assume that the posterior distribution \(P(y \vert \varvec{x})\) is stable, the n domains \(P^{1}(\varvec{x},y), \cdots , P^{n}(\varvec{x},y)\) are aligned by aligning the n marginals. However, as discussed in [30, 38, 52], the stability of \(P(y \vert \varvec{x})\) is often violated in practice, e.g., in speaker recognition and object recognition, resulting in the under-alignment of domains.

Aware of this issue, later works align the n domains via learning a representation function to align the marginals and the class-conditionals under MMD or the JS divergence. Li et al. [30] learned a projection matrix to align a set of n class prior-normalized marginals and c sets of n class-conditionals \(P^{1}(\varvec{x} \vert y=i), \cdots , P^{n}(\varvec{x} \vert y=i)\) for \(i \in \{1, \cdots , c\}\) under MMD, and derived an optimization problem that is solved via eigenvalue decomposition. As an extension of [30], Conditional Invariant Deep Domain Generalization (CIDDG) [31] shares a similar distribution alignment idea, but replaces the projection matrix with a deep neural network, and the MMD with the JS divergence, for better performance. Zhao et al. [52] learned a network representation function to align a set of n marginals and n sets of c class-conditionals \(P^{l}(\varvec{x} \vert y=1), \cdots ,\) \(P^{l}(\varvec{x} \vert y=c)\) for \(l \in \{1, \cdots , n\}\) under the JS divergence. These works [31, 52] characterize the JS divergence as the maximal value of a \(\text{log}\) loss functional. Consequently, minimizing the divergence leads to an adversarial training problem. However, when the number of classes c or the number of domains n is large, it may be difficult to achieve the alignment of the domains \(P^{1}(\varvec{x},y), \cdots , P^{n}(\varvec{x},y)\) via the alignments of marginals and class-conditionals [38]. Furthermore, in the regression tasks, which arise widely in real-world applications such as indoor WiFi localization [13], age estimation [7], and human pose estimation [26], it may not be feasible to align the class-conditionals, since the output variable y is continuous.

Our work learns a neural network representation function to align the joint distribution \(P(\varvec{x},y,l)\) and the product distribution \(P(\varvec{x},y)P(l)\) under the \(L^{2}\)-distance. (1) We show that aligning these two distributions conveniently leads to the alignment of the n domains \(P^{1}(\varvec{x},y), \cdots , P^{n}(\varvec{x},y)\) (see details in Sect. 3.2). Such joint-product distribution alignment is straightforward to achieve, since it only aligns two distributions. Moreover, such alignment can handle the regression tasks, since it is free from aligning the class-conditionals. (2) In the neural network context, we align distributions under the \(L^{2}\)-distance rather than the JS divergence, since the JS divergence is usually expressed via adversarial training [28, 31, 52], which is known to be unstable and time-consuming [38, 49]. While MMD and its extensions [33, 56] circumvent the drawbacks of adversarial training, to the best of our knowledge, these metrics are mainly designed and employed for the discrepancy between the marginals [17, 37], the class-conditionals [22, 30, 56], or the joint distributions of multiple input variables [33], i.e., \(P(\varvec{x}^{1}, \cdots , \varvec{x}^{k})\) and \(Q(\varvec{x}^{1}, \cdots , \varvec{x}^{k})\). According to [19], they require the kernel function to be a universal/characteristic kernel in order to be proper metrics. However, in our work, it may not be trivial to formulate a proper MMD metric between the joint distribution \(P(\varvec{x},y,l)\) and the product distribution \(P(\varvec{x},y)P(l)\). These considerations motivate us to opt for the \(L^{2}\)-distance, which quantifies the discrepancy between the joint distribution and the product distribution in a straightforward and intuitive manner. Importantly, we show that the \(L^{2}\)-distance can be analytically estimated (see details in Sect. 3.3). (3) In Sect. 5.1, we conduct experiments to reinforce our proposal that joint-product distribution alignment under the \(L^{2}\)-distance leads to effective domain alignment.

In addition to domain alignment, there are other strategies for tackling domain generalization [35, 48, 54]. Notable works include, but are not limited to, those based on meta-learning [3], parameter decomposition [39], and optimization [36]. Balaji et al. [3] encoded the notion of domain generalization using a regularization function, and learned the function in a meta-learning framework. Piratla et al. [39] decomposed the parameters of a neural network into a common component, which is expected to generalize to the unseen target domain, and a low-rank domain-specific component, which overfits the source domains. Mansilla et al. [36] conducted gradient surgery to enhance the generalization capability of deep neural network models. In Sect. 5.2, we experimentally compare our work with some of these works for completeness.

3 Methodology

3.1 Problem formulation

Let \({\mathcal {X}}\) be an input space and \({\mathcal {Y}}\) be an output space. Particularly, \({\mathcal {Y}}\) is a discrete set of c categories for classification or a continuous space for regression. According to [17, 31, 37], we define the domain generalization problem as follows. A domain is a distribution \(P(\varvec{x},y)\) defined on \({\mathcal {X}} \times {\mathcal {Y}}\). In domain generalization, we have \(n~(n \ge 2)\) source domains \(P^{1}(\varvec{x},y), \cdots , P^{n}(\varvec{x},y)\), which are reflected by the associated datasets \({\mathcal {D}}_{xy}^{1}=\{(\varvec{x}_{i}^{1},y_{i}^{1})\}_{i=1}^{m_{1}}\), \(\cdots\), \({\mathcal {D}}_{xy}^{n}=\{(\varvec{x}_{i}^{n},y_{i}^{n})\}_{i=1}^{m_{n}}\), and an unseen target domain \(P^{t}(\varvec{x},y)\), whose samples are not available during training. The source and target domains are different but related. Given the n source datasets, the goal is to learn a prediction (classification or regression) model \(h: {\mathcal {X}} \rightarrow {\mathcal {Y}}\) that performs well on the target domain.

3.2 Joint-product distribution alignment

We model h as a neural network containing a representation function \(\phi\) and a downstream classifier/regressor g, i.e., \(y=h(\varvec{x})=g\circ \phi (\varvec{x})\). Here, \(\phi\) maps from the input space to the representation space \({\mathcal {Z}}\), i.e., \(\phi : {\mathcal {X}} \rightarrow {\mathcal {Z}}\), and g maps from the representation space to the output space, i.e., \(g: {\mathcal {Z}} \rightarrow {\mathcal {Y}}\). In this work, we propose to learn a representation function that aligns the source domains \(P^{1}(\varvec{x},y), \cdots , P^{n}(\varvec{x},y)\) with available training data as a key solution to the domain generalization problem.

We show that the alignment of the n domains can be conducted by simply aligning two distributions. To be specific, we first introduce a domain variable l, \(l \in {\mathcal {L}}=\{1, \cdots , n\}\), and define a joint distribution \(P(\varvec{x},y,l)\) and a product distribution \(P(\varvec{x},y)P(l)\) on \({\mathcal {X}} \times {\mathcal {Y}} \times {\mathcal {L}}\). Then, we view domains \(P^{1}(\varvec{x},y), \cdots , P^{n}(\varvec{x},y)\) as \(P(\varvec{x},y \vert l=1), \cdots , P(\varvec{x},y \vert l=n)\), respectively, which is inspired by the probabilistic formulation of the multi-task learning problem [4]. From this viewpoint, joint distribution \(P(\varvec{x},y,l)\) is reflected by dataset \({\mathcal {D}}_{xyl}= \{(\varvec{x}_{i}^{1},y_{i}^{1}, 1)\}_{i=1}^{m_{1}}\cup \cdots \cup \{(\varvec{x}_{i}^{n},y_{i}^{n}, n)\}_{i=1}^{m_{n}}=\{(\varvec{x}_{i},y_{i}, l_{i})\}_{i=1}^{m}\), where \(m=m_{1}+ \cdots +m_{n}\). The distribution \(P(\varvec{x},y)=\int P(\varvec{x},y,l)dl\) is reflected by dataset \({\mathcal {D}}_{xy}=\{(\varvec{x}_{i},y_{i})\}_{i=1}^{m}\), and the distribution \(P(l)=\int P(\varvec{x},y,l)d\varvec{x}dy\) is reflected by dataset \({\mathcal {D}}_{l}=\{ l_{i}\}_{i=1}^{m}\). Finally, we present the following proposition, showing that joint-product distribution alignment leads to domain alignment.
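As a concrete illustration of this construction, the sketch below (NumPy, with hypothetical data sizes) pools \(n=2\) source datasets into the dataset \({\mathcal {D}}_{xyl}\) by attaching the domain variable l to each sample; dropping the domain column recovers \({\mathcal {D}}_{xy}\), and keeping only it recovers \({\mathcal {D}}_{l}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical source datasets: n = 2 domains with m_1 = 5 and m_2 = 3
# samples, 4-dimensional inputs x and binary outputs y.
sources = [
    (rng.normal(size=(5, 4)), rng.integers(0, 2, size=5)),  # D^1_xy
    (rng.normal(size=(3, 4)), rng.integers(0, 2, size=3)),  # D^2_xy
]

# D_xyl reflects P(x, y, l): pool the domains and attach l in {1, ..., n}.
X = np.concatenate([x for x, _ in sources])
y = np.concatenate([t for _, t in sources])
l = np.concatenate([np.full(len(t), s) for s, (_, t) in enumerate(sources, start=1)])

m = len(l)  # m = m_1 + ... + m_n
# D_xy = {(x_i, y_i)} reflects P(x, y); D_l = {l_i} reflects P(l).
```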

Proposition 1

Under representation function \(\phi\), the alignment of two distributions \(P(\varvec{x},y,l)\) and \(P(\varvec{x},y)P(l)\) implies the alignment of n domains \(P(\varvec{x},y \vert l=1), \cdots , P(\varvec{x},y \vert l=n)\). That is, \(P(\phi (\varvec{x}),y,l)=P(\phi (\varvec{x}),y)P(l)\) \(\Rightarrow\) \(P(\phi (\varvec{x}),y \vert l=1)= \cdots =P(\phi (\varvec{x}),y \vert l=n)\).

The proof is placed in Appendix 1. Evidently, Proposition 1 suggests that aligning joint distribution \(P(\varvec{x},y,l)\) and product distribution \(P(\varvec{x},y)P(l)\) leads to the alignment of multiple domains \(P(\varvec{x},y \vert l=1), \cdots , P(\varvec{x},y \vert l=n)\), i.e., \(P^{1}(\varvec{x},y),\) \(\cdots , P^{n}(\varvec{x},y)\). In the next subsection, we measure the discrepancy between \(P(\varvec{x},y,l)\) and \(P(\varvec{x},y)P(l)\) under the \(L^{2}\)-distance, and derive its estimate as the alignment loss.

3.3 Analytic estimation of \(L^{2}\)-Distance

Under representation function \(\phi\), we write the joint distribution and the product distribution as \(P(\phi (\varvec{x}),y,l)\) and \(P(\phi (\varvec{x}),y)P(l)\). We first introduce the \(L^{2}\)-distance between these two distributions, and then elaborate on the distance estimation. For clarity, we illustrate in Fig. 2 an overview of the estimation in this subsection. The estimated \(L^{2}\)-distance between \(P(\phi (\varvec{x}),y,l)\) and \(P(\phi (\varvec{x}),y)P(l)\) will serve as the alignment loss for learning the representation function.

Fig. 2

Overview of the analytic estimation of the \(L^{2}\)-distance. (i) We define the \(L^{2}\)-distance between the joint distribution and the product distribution in Eq. (1). (ii) Based on Eq. (1), we introduce the variational characterization of the \(L^{2}\)-distance in Eq. (4). (iii) We replace the expectations in Eq. (4) by empirical averages and obtain Eq. (5). (iv) We design the variational function in Eq. (5) as the linear-in-parameter model in Eq. (6). (v) We solve an unconstrained quadratic maximization problem and derive the analytic estimate of the \(L^{2}\)-distance in Eq. (10). In this procedure, the variational characterization and the linear variational function are crucial to the estimation.

The \(L^{2}\)-distance between \(P(\phi (\varvec{x}),y,l)\) and \(P(\phi (\varvec{x}),y)P(l)\) is defined as

$$\begin{aligned}&L^{2}\left (P(\phi (\varvec{x}),y,l),P(\phi (\varvec{x}),y)P(l)\right ) \nonumber \\&\quad = \frac{1}{2}\int \left (P(\phi (\varvec{x}),y,l) - P(\phi (\varvec{x}),y)P(l)\right )^{2}d\phi (\varvec{x})dydl. \end{aligned}$$
(1)

The distance compares the distributions \(P(\phi (\varvec{x}),y,l)\) and \(P(\phi (\varvec{x}),y)P(l)\) based on their difference \(P(\phi (\varvec{x}),y,l)-P(\phi (\varvec{x}),y)P(l)\). It is non-negative, symmetric, and equals zero if and only if \(P(\phi (\varvec{x}),y,l)=P(\phi (\varvec{x}),y)P(l)\).
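As a small sanity check on the definition, the toy example below (NumPy, made-up numbers) evaluates the discrete analogue of Eq. (1) for a \(2 \times 2\) joint distribution standing in for \(P(\phi (\varvec{x}),y,l)\); the distance is positive here because the joint differs from the product of its marginals, and it would be exactly zero for a factorizing joint:

```python
import numpy as np

# Toy discrete analogue of Eq. (1): a 2x2 joint P(y, l) stands in for the
# joint distribution, and the outer product of its marginals for P(y)P(l).
P_joint = np.array([[0.3, 0.2],
                    [0.1, 0.4]])
P_y = P_joint.sum(axis=1)       # marginal over l
P_l = P_joint.sum(axis=0)       # marginal over y
P_prod = np.outer(P_y, P_l)

# Half the summed squared difference; zero iff the joint factorizes.
l2 = 0.5 * np.sum((P_joint - P_prod) ** 2)
```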

To estimate the \(L^{2}\)-distance, we alternatively express the original definition in Eq. (1) as

$$\begin{aligned}&L^{2}\left (P(\phi (\varvec{x}),y,l),P(\phi (\varvec{x}),y)P(l)\right ) \nonumber \\&\quad =\int \max _{r}\Big [\left (P(\phi (\varvec{x}),y,l) - P(\phi (\varvec{x}),y)P(l)\right ) \nonumber \\& \quad \quad \times r(\phi (\varvec{x}),y,l) - \frac{1}{2}r(\phi (\varvec{x}),y,l)^{2}\Big ]d\phi (\varvec{x})dydl \end{aligned}$$
(2)
$$\begin{aligned}&=\max _{r}\Big (\int \Big [\Big (P(\phi (\varvec{x}),y,l) - P(\phi (\varvec{x}),y)P(l)\Big ) \nonumber \\&\quad \times r(\phi (\varvec{x}),y,l) - \frac{1}{2}r(\phi (\varvec{x}),y,l)^{2}\Big ]d\phi (\varvec{x})dydl\Big ) \end{aligned}$$
(3)
$$\begin{aligned}&=\max _{r}\Big (\int r(\phi (\varvec{x}),y,l)P(\phi (\varvec{x}),y,l)d\phi (\varvec{x})dydl \nonumber \\&\quad - \int r(\phi (\varvec{x}),y,l)P(\phi (\varvec{x}),y)P(l)d\phi (\varvec{x})dydl \nonumber \\&\quad - \frac{1}{2}\int r(\phi (\varvec{x}),y,l)^{2}d\phi (\varvec{x})dydl\Big ). \end{aligned}$$
(4)

In Eq. (2), we express the convex function \(\frac{1}{2}u^{2}\) as \(\max _{v}(uv-\frac{1}{2}v^{2})\) using the Legendre-Fenchel convex duality [40], and regard the difference function \(P(\phi (\varvec{x}),y,l) - P(\phi (\varvec{x}),y)P(l)\) as u and the function \(r(\phi (\varvec{x}),y,l)\) as v. In Eq. (4), the quadratic functional is maximized when \(r(\phi (\varvec{x}),y,l)=P(\phi (\varvec{x}),y,l) - P(\phi (\varvec{x}),y)P(l)\). The corresponding maximal value is the right-hand side of Eq. (1). Here, we call Eq. (4) the variational characterization of the \(L^{2}\)-distance, and \(r(\phi (\varvec{x}),y,l)\) the variational function.
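The duality step used in Eq. (2) can be verified numerically: for a fixed u, maximizing \(uv-\frac{1}{2}v^{2}\) over v on a fine grid recovers the maximizer \(v=u\) and the maximal value \(\frac{1}{2}u^{2}\). A minimal check (NumPy, with an arbitrary u):

```python
import numpy as np

# Legendre-Fenchel duality check: 0.5 * u**2 == max_v (u*v - 0.5*v**2),
# with the maximum attained at v = u.
u = 1.7
vs = np.linspace(-5.0, 5.0, 100001)   # fine grid over v
vals = u * vs - 0.5 * vs ** 2
v_star = vs[np.argmax(vals)]          # grid maximizer, approximately u
```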

Since distributions \(P(\phi (\varvec{x}),y,l)\), \(P(\phi (\varvec{x}),y)\), and P(l) are reflected by their samples (see Sect. 3.2), we can replace the expectations in Eq. (4) by empirical averages and approximate the \(L^{2}\)-distance as

$$\begin{aligned}&L^{2}\big (P(\phi (\varvec{x}),y,l),P(\phi (\varvec{x}),y)P(l)\big ) \nonumber \\&\quad \approx \max _{r}\Big (\frac{1}{m}\sum _{i=1}^{m} r(\phi (\varvec{x}_{i}),y_{i},l_{i}) \nonumber \\&\quad - \frac{1}{m^{2}}\sum _{i,j=1}^{m} r(\phi (\varvec{x}_{i}),y_{i},l_{j}) \nonumber \\&\quad - \frac{1}{2}\int r(\phi (\varvec{x}),y,l)^{2}d\phi (\varvec{x})dydl\Big ) . \end{aligned}$$
(5)

Note that, here, the second term of Eq. (4) contains the expectations with respect to two probability distributions \(P(\phi (\varvec{x}),y)\) and P(l). Therefore, this term is approximated in the form \(\int r(\phi (\varvec{x}),y,l)P(\phi (\varvec{x}),y)P(l)d\phi (\varvec{x})dydl \approx \frac{1}{m^{2}}\sum _{i,j=1}^{m} r(\phi (\varvec{x}_{i}),y_{i},l_{j}).\)
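The two empirical averages can be sketched as follows (NumPy, with a made-up variational function r and random stand-ins for the representations, both purely illustrative): the joint-expectation term averages over the m matched triples, while the product-expectation term averages over all \(m^{2}\) pairings of \((\phi (\varvec{x}_{i}),y_{i})\) with \(l_{j}\):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6
z = rng.normal(size=m)            # stand-ins for phi(x_i) (1-D for brevity)
y = rng.integers(0, 2, size=m)    # labels
l = rng.integers(1, 3, size=m)    # domain variables in {1, 2}

def r(zi, yi, lj):
    # Made-up variational function, only for illustrating the two averages.
    return float(np.exp(-(zi - yi) ** 2) * (lj == 1))

# Joint-expectation term: average over the m matched triples (z_i, y_i, l_i).
term1 = np.mean([r(z[i], y[i], l[i]) for i in range(m)])

# Product-expectation term: average over all m^2 pairings of (z_i, y_i) with l_j.
term2 = np.mean([r(z[i], y[i], l[j]) for i in range(m) for j in range(m)])
```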

We design the variational function as a linear-in-parameter model

$$\begin{aligned} r(\phi (\varvec{x}),y,l;\varvec{\theta })=\sum _{i=1}^{m} \theta _{i}p\left ((\phi (\varvec{x}),y,l),(\phi (\varvec{x}_{i}),y_{i},l_{i})\right ), \end{aligned}$$
(6)

where \(\varvec{\theta }=(\theta _{1}, \cdots , \theta _{m})^{\top }\) are the parameters of the model and \(p\left ((\phi (\varvec{x}),y,l),(\phi (\varvec{x}_{i}),y_{i},l_{i})\right ) =k^{1}(\phi (\varvec{x}),\phi (\varvec{x}_{i}))k^{2}(y,y_{i})k^{3}(l,l_{i})\) is a product kernel. In particular, \(k^{1}(\phi (\varvec{x}),\phi (\varvec{x}_{i}))=\exp \left (\frac{\pi \Vert \phi (\varvec{x}) -\phi (\varvec{x}_{i})\Vert ^{2}}{-2}\right )\) is a Gaussian kernel, and \(k^{3}(l,l_{i})=\delta (l,l_{i})\) is a delta kernel that evaluates to 1 if \(l=l_{i}\) and 0 otherwise. Additionally, \(k^{2}(y,y_{i})=\delta (y,y_{i})\) is a delta kernel when y is a discrete variable for classification, and \(k^{2}(y,y_{i})=\exp \left (\frac{\pi \Vert y-y_{i}\Vert ^{2}}{-2}\right )\) is a Gaussian kernel when y is a continuous variable for regression. We note that similar linear-in-parameter models have also been employed in prior works [4, 12], but those models differ from our model (6) in terms of the input variables. Importantly, model (6) leads to the analytic estimate of the \(L^{2}\)-distance, as we will show below.
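Under these definitions, the kernel matrices used in the derivation below can be assembled as in this NumPy sketch (random stand-ins for the representations; classification case, so \(k^{2}\) and \(k^{3}\) are both delta kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 3
Z = rng.normal(size=(m, d))         # stand-ins for the representations phi(x_i)
y = rng.integers(0, 2, size=m)      # discrete labels (classification case)
l = rng.integers(1, 3, size=m)      # domain variables in {1, 2}

# k^1: Gaussian kernel exp(-pi/2 * ||phi(x_i) - phi(x_j)||^2)
sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
K1 = np.exp(-np.pi * sq / 2.0)

# k^2 and k^3: delta kernels on the labels and on the domain variables
K2 = (y[:, None] == y[None, :]).astype(float)
S = (l[:, None] == l[None, :]).astype(float)

K = K1 * K2      # k_ij = k^1 k^2
P = K * S        # p_ij = k^1 k^2 k^3, the product kernel matrix
```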

Plugging the linear variational function of Eq. (6) into Eq. (5), we derive the analytic estimate of the \(L^{2}\)-distance:

$$\begin{aligned}&\widehat{L^{2}}\big (P(\phi (\varvec{x}),y,l),P(\phi (\varvec{x}),y)P(l)\big ) \nonumber \\&\quad = \max _{\varvec{\theta }}\Big (\frac{1}{m}\sum _{i=1}^{m} r(\phi (\varvec{x}_{i}),y_{i},l_{i};\varvec{\theta }) \nonumber \\&\quad - \frac{1}{m^{2}}\sum _{i,j=1}^{m} r(\phi (\varvec{x}_{i}),y_{i},l_{j};\varvec{\theta }) \nonumber \\&\quad - \frac{1}{2}\int r(\phi (\varvec{x}),y,l;\varvec{\theta })^{2}d\phi (\varvec{x})dydl\Big ) \end{aligned}$$
(7)
$$\begin{aligned}&=\max _{\varvec{\theta }}\Big (\frac{1}{m}\varvec{1}^{\top }\varvec{P}\varvec{\theta } - \frac{1}{m^{2}}[(\varvec{1}^{\top }\varvec{K}) \odot (\varvec{1}^{\top }\varvec{S})]\varvec{\theta } - \frac{1}{2}\varvec{\theta }^{\top }\varvec{H}\varvec{\theta }\Big ) \end{aligned}$$
(8)
$$\begin{aligned}&=\max _{\varvec{\theta }} \left (\varvec{b}^{\top }\varvec{\theta } - \frac{1}{2}\varvec{\theta }^{\top }\varvec{H}\varvec{\theta } \right) \end{aligned}$$
(9)
$$\begin{aligned}&=\varvec{b}^{\top }\widehat{\varvec{\theta }} - \frac{1}{2}\widehat{\varvec{\theta }}^{\top }\varvec{H}\widehat{\varvec{\theta }}. \end{aligned}$$
(10)

Equation (8) rewrites the terms in Eq. (7) using matrix and vector notations, where \(\varvec{1}\) is an m-dimensional column vector of ones, \(\varvec{P}\), \(\varvec{K}\), \(\varvec{S}\), and \(\varvec{H}\) are \(m \times m\) symmetric matrices, and \(\odot\) is the Hadamard product. The (i, j)-th elements of \(\varvec{P}\), \(\varvec{K}\), and \(\varvec{S}\) are \(p_{ij} = p\left ((\phi (\varvec{x}_{i}),y_{i},l_{i}),(\phi (\varvec{x}_{j}),y_{j},l_{j})\right )\), \(k_{ij}=k^{1}(\phi (\varvec{x}_{i}),\phi (\varvec{x}_{j}))k^{2}(y_{i},y_{j})\), and \(s_{ij}=k^{3}(l_{i},l_{j})\). Additionally, the (i, j)-th element of \(\varvec{H}\) is \(h_{i,j}\), where \(h_{i,j}=\exp \left(\frac{\pi \Vert \phi (\varvec{x}_{i}) - \phi (\varvec{x}_{j}) \Vert ^{2}}{-4}\right)\delta (y_{i},y_{j})\delta (l_{i},l_{j})\) when y is a discrete variable in the classification tasks, and \(h_{i,j}=\exp \left (\frac{\pi \Vert \phi (\varvec{x}_{i}) - \phi (\varvec{x}_{j}) \Vert ^{2}}{-4}\right )\) \(\exp \left (\frac{\pi \Vert y_{i} - y_{j} \Vert ^{2}}{-4}\right )\delta (l_{i},l_{j})\) when y is a continuous variable in the regression tasks. Please refer to Appendix 1 for the detailed mathematical derivation of \(h_{i,j}\). Eq. (9) defines the vector \(\varvec{b} = \frac{1}{m}\varvec{P}\varvec{1} - \frac{1}{m^{2}}[(\varvec{K}\varvec{1}) \odot (\varvec{S}\varvec{1})]\) to make the unconstrained quadratic maximization problem explicit. In Eq. (10), we solve the maximization problem analytically and obtain the analytic estimate of the \(L^{2}\)-distance, where \(\widehat{\varvec{\theta }} = (\varvec{H}+\epsilon {\textbf{I}})^{-1}\varvec{b}\). Note that the diagonal matrix \(\epsilon {\textbf{I}}\) is added to \(\varvec{H}\) to ensure that the matrix is always numerically invertible in practice, where \(\epsilon\) is a small positive value and \({\textbf{I}}\) is the identity matrix.
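Putting the pieces together, the analytic estimate in Eq. (10) amounts to a single regularized linear solve. A minimal NumPy sketch for the classification case (random stand-ins for \(\phi (\varvec{x}_{i})\); the resulting estimate is non-negative because \(\varvec{H}\) is positive semi-definite):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, eps = 6, 3, 1e-6
Z = rng.normal(size=(m, d))         # stand-ins for the representations phi(x_i)
y = rng.integers(0, 2, size=m)      # discrete labels (classification case)
l = rng.integers(1, 3, size=m)      # domain variables in {1, 2}

sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
dy = (y[:, None] == y[None, :]).astype(float)   # delta(y_i, y_j)
dl = (l[:, None] == l[None, :]).astype(float)   # delta(l_i, l_j)

K = np.exp(-np.pi * sq / 2.0) * dy              # k_ij
S = dl                                          # s_ij
P = K * S                                       # p_ij
H = np.exp(-np.pi * sq / 4.0) * dy * dl         # h_ij (classification case)

one = np.ones(m)
b = P @ one / m - (K @ one) * (S @ one) / m**2  # the vector b in Eq. (9)
theta = np.linalg.solve(H + eps * np.eye(m), b) # theta_hat = (H + eps I)^{-1} b
l2_hat = b @ theta - 0.5 * theta @ H @ theta    # Eq. (10)
```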

3.4 Optimization problem

We combine the alignment loss (the estimated \(L^{2}\)-distance) and the classification/regression loss, and present the optimization problem of the proposed JPRL solution as follows

$$\begin{aligned}&\min _{\phi , g} \frac{1}{n}\sum _{s=1}^{n}\frac{1}{m_{s}} \sum _{i=1}^{m_{s}}\ell (g(\phi (\varvec{x}_{i}^{s})),y_{i}^{s}) \nonumber \\&~~~~~~~+ \lambda \widehat{L^{2}} \left (P(\phi (\varvec{x}),y,l), P(\phi (\varvec{x}),y)P(l)\right ). \end{aligned}$$
(11)

Here, in line with the common practice in [14, 52], \(\ell\) is the cross-entropy loss for classification or the square loss for regression, and \(\lambda ~(>0)\) is a tradeoff parameter for balancing the alignment loss and the task loss. We optimize the network prediction model \(h=g \circ \phi\), where \(\phi\) is the representation function and g is the classifier/regressor, to jointly minimize the two losses.

We employ minibatch SGD to solve problem (11). In each iteration of the algorithm, the minibatch is formed by drawing n sub-minibatches, one from each of the n source datasets, together with the corresponding domain variables, and the objective in Eq. (11) is computed on these samples.
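A minibatch of this form can be assembled as in the following sketch (NumPy; the source datasets and the per-domain batch size are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical source datasets of different sizes (4-dimensional inputs).
sources = [(rng.normal(size=(40, 4)), rng.integers(0, 2, size=40)),
           (rng.normal(size=(60, 4)), rng.integers(0, 2, size=60))]

def sample_minibatch(sources, per_domain=8, rng=rng):
    """Draw one sub-minibatch per source dataset and attach the domain variables."""
    xs, ys, ls = [], [], []
    for s, (X, y) in enumerate(sources, start=1):
        idx = rng.choice(len(y), size=per_domain, replace=False)
        xs.append(X[idx])
        ys.append(y[idx])
        ls.append(np.full(per_domain, s))
    return np.concatenate(xs), np.concatenate(ys), np.concatenate(ls)

# One SGD iteration would compute the objective in Eq. (11) on this batch.
Xb, yb, lb = sample_minibatch(sources)
```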

4 Discussion on the assumption

In domain generalization, since data from the target domain are not available during training, one must assume a certain relationship between the source and target domains, and exploit this relationship to improve the generalization performance of the prediction model [28, 31, 37, 38, 39, 52]. For instance, in the work of Piratla et al. [39], the authors assumed that the source and target domains share some “stable” features whose relationship with the output variable is invariant across domains, and the goal is to learn those features.

In our work, the fundamental assumption is that the source and target domains \(P^{1}(\varvec{x},y), \cdots , P^{n}(\varvec{x},y), P^{t}(\varvec{x},y)\) can be related by a representation function that makes them similar. Importantly, such a representation function can, to a certain extent, be captured and learned by aligning the multiple source domains using the available training data. As such, in the learned representation space, the source (training) and target (test) data follow similar distributions, which approximates the independent and identically distributed (i.i.d.) supervised learning setting. Hence, the source-trained classifier/regressor can generalize to the target domain. In Sect. 5.1, we provide synthetic examples to further explain this point.

As noted by Zhang et al. [50], if the target domain \(P^{t}(\varvec{x},y)\) changes arbitrarily, the available source data would be of no use to make predictions in the target domain. Under such circumstances, domain generalization methods (the domain alignment ones [28, 31, 37, 38, 52] including ours, and others [36, 39, 47, 48]) may not succeed in learning prediction models that generalize to the target domain.

5 Experiments

5.1 Experiments on synthetic datasets


We evaluate JPRL on synthetic datasets to verify its effectiveness in aligning domains. For simplicity and clear visualization, we construct two source domains \(P^{1}(\varvec{x},y),\) \(P^{2}(\varvec{x},y)\) and a target domain \(P^{t}(\varvec{x},y)\). We write vector \(\varvec{x}=(x_{1}, x_{2}, x_{3}, x_{4})^{\top }\) and use \({\mathcal {N}}(\varvec{x}; \varvec{\mu },\varvec{\Sigma })\) to denote a multivariate Gaussian distribution with mean vector \(\varvec{\mu }\) and covariance matrix \(\varvec{\Sigma }\). Similarly, \({\mathcal {N}}(y; \mu (\varvec{x}),\sigma ^{2})\) denotes a Gaussian distribution with mean \(\mu (\varvec{x})\) and variance \(\sigma ^{2}\). We implement JPRL with a one-Hidden-Layer Neural Network (1HLNN).


Classification. We consider the case where \(y \in \{-1, +1\}\) is a discrete variable for classification. We define the source and target domains in Table 3 and their corresponding parameters in Table 4. The domains are constructed to be different but related by a projection matrix \(\varvec{W} = \left( \begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{array} \right)\) such that \(P^{1}(\varvec{W}\varvec{x},y) \approx P^{2}(\varvec{W}\varvec{x},y) \approx P^{t}(\varvec{W}\varvec{x},y)\). We first draw 200 samples from each of \(P^{1}(\varvec{x},y)\), \(P^{2}(\varvec{x},y)\), and \(P^{t}(\varvec{x},y)\), and then train our JPRL network on the 400 source samples. For comparison, we also train a vanilla neural network on the same data without domain alignment and use it as a Baseline. Figure 3 plots the source and target data in the representation spaces of the Baseline network and the JPRL network. Comparing Fig. 3b against Fig. 3a, we observe that JPRL aligns the source data much better than the Baseline. Importantly, although the target data do not participate in training the network, the learned representation function of JPRL generalizes well to the target data and aligns them with the source data. Consequently, JPRL performs well in the target domain and yields a small classification error of 0.04, clearly superior to the error of 0.31 from the Baseline. This experiment verifies the effectiveness of JPRL in domain alignment and in target domain classification.
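The construction of such related domains can be sketched as follows. The class means below are illustrative values of our own choosing, not the actual parameters of Tables 3 and 4; the point is only that the domains share their first two coordinates, so they approximately coincide after projection by \(\varvec{W}\):

```python
import numpy as np

# Projection matrix W from the text: it keeps the first two coordinates of x.
W = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.]])

def sample_domain(mu_pos, mu_neg, n, rng, noise=0.1):
    """Draw n labeled samples from a two-class Gaussian domain P(x, y).

    mu_pos/mu_neg are the 4-d class means (illustrative, NOT the values
    of Tables 3 and 4). Domains that share the first two coordinates of
    their class means yield approximately identical P(Wx, y).
    """
    y = rng.choice([-1, 1], size=n)
    mus = np.where((y == 1)[:, None], mu_pos, mu_neg)
    x = mus + rng.normal(0.0, np.sqrt(noise), size=(n, 4))
    return x, y
```

For instance, a source domain with means \((1,1,2,0)\)/\((-1,-1,2,0)\) and a target domain with means \((1,1,0,5)\)/\((-1,-1,0,5)\) differ in the last two coordinates but match class-wise after projection by W.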

Table 3 Definition of the source and target domains in cases where \(y \in \{-1,+1\}\) is a discrete variable for classification and \(y \in {\mathbb {R}}\) is a continuous variable for regression. For classification, \(P(y=-1 \vert \varvec{x})=1-P(y=+1 \vert \varvec{x})\)
Table 4 Parameter of the source and target domains defined in Table 3

Regression. We consider the case where y is a continuous variable for regression. We define the source and target domains in Table 3 and their corresponding parameters in Table 4. The domains are different but related by a vector \(\varvec{w} = (1, 0, 0, 0)^{\top }\) such that \(P^{1}(\varvec{w}^{\top }\varvec{x},y) \approx P^{2}(\varvec{w}^{\top }\varvec{x},y) \approx P^{t}(\varvec{w}^{\top }\varvec{x},y)\). As in the classification case, we draw samples from the source and target domains, and train the JPRL and Baseline networks on the source samples. Figure 4 visualizes the source and target data in the representation spaces of these two networks. From Fig. 4b, we observe that JPRL well aligns the source and target domains in the representation space, and obtains a low target regression error of 0.09. These results are significantly better than the ones in Fig. 4a for the Baseline. Evidently, this experiment confirms the effectiveness of JPRL in domain alignment and in target domain regression.

Fig. 3

Visualization of source and target data in the representation spaces of the Baseline network and the JPRL network for classification. Symbols s\(^{l}\)(p) and s\(^{l}\)(n) denote the positive and negative classes in source domain \(l \in \{1,2\}\), and t(p) and t(n) denote the positive and negative classes in the target domain. a Baseline (err. 0.31). b JPRL (err. 0.04)

Fig. 4

Visualization of source and target data in the representation spaces of the Baseline network and the JPRL network for regression. Symbol s\(^{l}\) denotes data from source domain \(l \in \{1,2\}\) and symbol t denotes data from the target domain. a Baseline (err. 0.35). b JPRL (err. 0.09)

5.2 Experiments on real-world datasets

In domain generalization, there exist two settings for conducting experiments on real-world datasets: one commonly practiced in [47, 48, 52, 53], and the other one introduced by Gulrajani and Lopez-Paz [20]. We conduct the experiments following the former setting, so that we can quote the available results reported by the authors themselves.

5.2.1 Datasets

We evaluate JPRL on classification and regression datasets. The classification datasets include a text dataset, Amazon Reviews [8], and two image datasets, PACS [27] and Office-Home [45]; across them, the number of classes varies from 2 to 65. The regression datasets include a WiFi dataset, UJIIndoorLoc [43], and an image dataset, UTKFace [51]. We use these datasets since they are utilized in prior works [10, 13, 48, 52]. Importantly, we aim to verify that JPRL can (1) perform well on several types of data (text, image, and WiFi), (2) scale and perform well in classification tasks with a few classes and with many classes, and (3) handle domain generalization in regression. We summarize the dataset information in Table 5, and briefly describe each dataset in the following.

Table 5 Statistics of the real-world datasets for classification and regression (*)

Amazon Reviews [8] contains online reviews from 4 domains (product categories): Books (B), DVDs (D), Electronics (E), and Kitchens (K). Each domain has 2000 reviews, encoded as 5000-dimensional feature vectors of unigrams and bigrams. The task is classification with 2 classes representing positive and negative reviews.


PACS [27] includes images from 4 domains: ArtPainting (A) with 2048 images, Cartoon (C) with 2344 images, Photo (P) with 1670 images, and Sketch (S) with 3929 images. See Fig. 5a for the example images. The task is classification with 7 classes.


Office-Home [45] contains images of everyday objects organized into 4 domains: Art (A) with 2427 images, Clipart (C) with 4365 images, Product (P) with 4439 images, and RealWorld (R) with 4357 images. See Fig. 5b for the example images. The task is classification with 65 classes.


UJIIndoorLoc [43] is a multi-floor indoor localization dataset for testing indoor positioning systems that rely on WiFi fingerprints. Each WiFi fingerprint is characterized by 520 Received Signal Strength Intensity (RSSI) values. Inspired by [13], we split the dataset into 4 domains: Floor1 (F1), Floor2 (F2), Floor3 (F3), and Floor4 (F4), with 4501, 5464, 4722, and 5220 samples, respectively. The task is regression with the latitude and longitude variables.


UTKFace [51] includes face images labeled by age, gender, and ethnicity. The images cover a large variation in pose, facial expression, illumination, etc. We use face images with ages spanning from 1 to 100, and split the dataset into 4 domains by the ethnicity label: Asian (A), Black (B), Indian (I), and White (W), with 3419, 4488, 3949, and 9970 samples, respectively. See Fig. 5c for the example images. The task is regression with the age variable.

Fig. 5

Example images from 3 datasets. a PACS [27]. b Office-Home [45]. c UTKFace [51]

5.2.2 Comparison methods

We compare JPRL against existing domain generalization methods including (1) the ones that perform domain alignment (Alignment) for domain generalization, and (2) the ones that tackle the problem via other strategies (Others). The former methods include: MMD-based Adversarial AutoEncoder (MMD-AAE) [28], CIDDG [31], Domain Generalization via Entropy Regularization (DGER) [52]. The latter methods include: Cross-Gradient (CrossGrad) [42], Meta-Regularization (MetaReg) [3], Deep Domain-Adversarial Image Generation (DDAIG) [53], Representation Self-Challenging (RSC) [23], Common Specific Decomposition (CSD) [39], Fourier Augmented Co-Teacher (FACT) [47], MixStyle [54], Domain Generalization via Gradient Surgery (DGGS) [36], and Adversarial Teacher-Student Representation Learning (ATSRL) [48].

5.2.3 Evaluation protocol

In line with the leave-one-domain-out evaluation protocol [47, 52], we train a neural network classification or regression model on the source datasets and use it to predict the labels of samples from the target dataset. The performance of a classification model is measured by the classification accuracy (%) following [52], and the performance of a regression model is measured by the sum of Mean Absolute Error (MAE) following [14]. On every task, we follow [47, 52] and repeat the experiments 5 times with different random seeds, reporting the average result.
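A minimal sketch of this protocol and metric, under our own naming; the "sum of MAE" here sums the per-variable mean absolute errors, which is our reading of the metric:

```python
import numpy as np

def leave_one_domain_out(domains):
    """Enumerate (sources, target) splits: each domain is held out once
    as the unseen target while the rest serve as source domains."""
    for t, target in enumerate(domains):
        sources = [d for i, d in enumerate(domains) if i != t]
        yield sources, target

def sum_of_mae(y_true, y_pred):
    """Sum over the regression variables of the mean absolute error."""
    err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return err.mean(axis=0).sum()
```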

5.2.4 Implementation details


We implement JPRL with both shallow and deep networks (backbones), and present in Table 6 an overview of the backbone configuration for each dataset. Specifically, on the text dataset Amazon Reviews with the 5000-dimensional pre-processed features [8], following [3, 16], the backbone is a 1HLNN with 100 sigmoid hidden neurons and 2 softmax output neurons (i.e, the number of classes in Amazon Reviews). On the 2 RGB image datasets PACS and Office-Home, we follow the practice in previous works [47, 48, 52] and use the ResNet18 and ResNet50 [21] backbones, whose final layers are reconstructed to have as many outputs as the number of classes in the dataset (7 for PACS and 65 for Office-Home). On the WiFi dataset UJIIndoorLoc with the 520-dimensional features, the backbone is a 1HLNN with 260 ReLU hidden neurons and 2 sigmoid output neurons (i.e, the number of regression variables in UJIIndoorLoc). By this design, the number of hidden neurons is half the number of input neurons. Finally, on the RGB image dataset UTKFace, the backbone is the ResNet50 model, whose final layer is reconstructed to have one sigmoid output (i.e, the number of regression variables in UTKFace).
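As an illustration, the UJIIndoorLoc backbone described above can be sketched as a forward-only NumPy network; the initialization scheme is our assumption, and the actual training code is not shown:

```python
import numpy as np

class OneHiddenLayerNet:
    """Forward-only sketch of the UJIIndoorLoc backbone:
    520 inputs -> 260 ReLU hidden neurons -> 2 sigmoid outputs."""

    def __init__(self, d_in=520, d_out=2, rng=None):
        rng = rng or np.random.default_rng(0)
        d_hid = d_in // 2  # hidden width is half the input width, as in the text
        # He-style initialization (our assumption, not from the paper).
        self.W1 = rng.normal(0, np.sqrt(2 / d_in), (d_in, d_hid))
        self.b1 = np.zeros(d_hid)
        self.W2 = rng.normal(0, np.sqrt(2 / d_hid), (d_hid, d_out))
        self.b2 = np.zeros(d_out)

    def forward(self, x):
        h = np.maximum(x @ self.W1 + self.b1, 0.0)              # ReLU
        return 1.0 / (1.0 + np.exp(-(h @ self.W2 + self.b2)))  # sigmoid
```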

Table 6 Backbone configuration for the datasets, where 1HLNN represents one-Hidden-Layer Neural Network

We preprocess and split the training data as follows. On Amazon Reviews and UJIIndoorLoc, we utilize the common z-score standardization to preprocess the features. On PACS, Office-Home, and UTKFace, we follow the standard practice in [47] and process the RGB images via random resized cropping, horizontal flipping, and color jittering. Moreover, for the regression datasets UJIIndoorLoc and UTKFace, we follow the work of Chen et al. [14] and normalize the regression labels to [0, 1] to eliminate the effects of diverse scales in the regression variables. Regarding the data splitting protocol, we follow the general practice in prior works [47, 48, 52] and use 90% of available data as training data and 10% as validation data.
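The z-score standardization and label normalization steps can be sketched as follows; computing the statistics on the training split only is our assumption:

```python
import numpy as np

def zscore_fit_transform(X_train, X_other):
    """Standardize features with statistics computed on the training split."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0) + 1e-8
    return (X_train - mu) / sd, (X_other - mu) / sd

def normalize_labels(y):
    """Min-max scale regression labels to [0, 1], as done for UJIIndoorLoc
    and UTKFace; returns (lo, hi) so predictions can be mapped back."""
    lo, hi = y.min(axis=0), y.max(axis=0)
    return (y - lo) / (hi - lo), (lo, hi)
```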

For training the networks, JPRL with its shallow implementations on Amazon Reviews and UJIIndoorLoc is trained from scratch by minibatch SGD with a momentum of 0.9 and a learning rate of \(10^{-3}\). The tradeoff parameter \(\lambda\) is selected from the range \(\{10^{-3}, 10^{-1}, \cdots , 10^{3}\}\) by using the validation data. Furthermore, JPRL with its deep implementations on PACS, Office-Home, and UTKFace is trained from the ImageNet pretrained models. The optimizer is still the minibatch SGD and the learning rate is initially set to \(10^{-3}\) and shrunk to \(10^{-4}\) after 50 iterations. This time, the tradeoff parameter \(\lambda\) is not selected through a grid search, since the corresponding procedure would be computationally costly. Instead, following [12, 16], we gradually change \(\lambda\) from 0 to 1 by a progressive schedule: \(\lambda _{t}=\frac{2}{1+\exp (-10t)}-1\), where t is the training progress linearly changing from 0 to 1.
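The progressive schedule for \(\lambda\) is simple to implement; the following sketch reproduces the formula from the text:

```python
import math

def lambda_schedule(t):
    """Progressive tradeoff schedule: lambda_t = 2 / (1 + exp(-10 t)) - 1,
    where t is the training progress linearly changing from 0 to 1."""
    return 2.0 / (1.0 + math.exp(-10.0 * t)) - 1.0
```

The schedule starts at 0 (no alignment pressure early in training) and saturates near 1, so the alignment term is phased in gradually.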

5.2.5 Results


We report in Table 7 the classification results on Amazon Reviews, in Tables 8 and 9 the results on PACS, and in Tables 10 and 11 the results on Office-Home. In addition, we report in Table 12 the regression results on UJIIndoorLoc and in Table 13 the regression results on UTKFace. In every table, the names of the source domains are omitted under the leave-one-domain-out evaluation protocol. For every column, the best result is highlighted in bold.

Table 7 Classification accuracy (%) of the Alignment and Other methods with 1HLNN backbone on dataset Amazon Reviews.
Table 8 Classification accuracy (%) of the Alignment and Other methods with ResNet18 backbone on dataset PACS
Table 9 Classification accuracy (%) of the Alignment and Other methods with ResNet50 backbone on dataset PACS
Table 10 Classification accuracy (%) of the Alignment and Other methods with ResNet18 backbone on dataset Office-Home
Table 11 Classification accuracy (%) of the Alignment and Other methods with ResNet50 backbone on dataset Office-Home
Table 12 Sum of MAE of the Alignment and Other methods with 1HLNN backbone on regression dataset UJIIndoorLoc
Table 13 Sum of MAE of the Alignment and Other methods with ResNet50 backbone on regression dataset UTKFace
Table 14 Sum of MAE of the Alignment and Other methods with 2HLNN backbone on regression dataset Cooling Capacity

Classification. We quote the available results of the comparison methods in Table 7 from [3], the results in Tables 8 and 9 from [39, 47, 48, 52], and the results in Tables 10 and  11 from [47, 48], since our experimental settings coincide with these works. To compare with the relevant Alignment methods CIDDG and DGER, we use their source codes, follow their hyperparameter tuning protocols, and produce their results on datasets Amazon Reviews and Office-Home. Besides, we also produce the results of Other methods, i.e, CSD and DGGS, on Amazon Reviews. Note that, with the cited and produced results, we aim for a comprehensive comparison of JPRL with the Alignment methods and Other recent methods on different types of data: text (Amazon Reviews) and image (PACS, Office-Home), and in tasks with a few classes (Amazon Reviews, PACS) and many classes (Office-Home).

In Table 7 for the text dataset Amazon Reviews, JPRL with the 1HLNN backbone consistently outperforms its competitors on all 4 tasks, and yields the highest average classification accuracy of 82.26%. This outperformance verifies that, compared to the alignment of marginals and class-conditionals (CIDDG, DGER) or other strategies (MetaReg, CSD, DGGS), our joint-product distribution alignment under the \(L^{2}\)-distance is more effective in domain generalization for text classification. In Tables 8 and 9 for the image dataset PACS, JPRL with the ResNet18 and ResNet50 backbones is among the top performing methods. It significantly outperforms the Alignment methods MMD-AAE and DGER on the first 2 tasks in Table 8, and achieves better average accuracy than the very recent methods MixStyle, FACT, and ATSRL in both tables. These results demonstrate that JPRL achieves generally preferable performance regardless of the backbone choice. In Tables 10 and 11 for another image dataset Office-Home, the results show that JPRL is preferable to the comparison methods considered. Here, it is worth mentioning that CIDDG and DGER align distributions in a class-wise manner using adversarial training, which might not be easy to achieve with many classes (65 in Office-Home). By contrast, our JPRL learns a representation function to align only 2 distributions, the joint distribution and the product distribution, which conveniently leads to the alignment of multiple domains simultaneously. Moreover, our distribution alignment is conducted by solving a simple minimization problem, rather than a challenging adversarial problem. In summary, the results from Tables 8, 9, 10 and 11 show that in domain generalization for image classification, the proposed JPRL solution (1) is more advantageous than the relevant Alignment competitors, and (2) also yields better or comparable results to the Others.


Regression. Since the regression results of the domain generalization methods are not available, we use the source codes of the Alignment method MMD-AAE and the Other methods (MetaReg, CSD, and DGGS) to produce their regression results on datasets UJIIndoorLoc and UTKFace. The results are produced by replacing the original classification loss with the regression loss (square loss) and following the hyperparameter tuning protocols of the methods. Here, we do not include the relevant CIDDG and DGER as comparison methods, since their domain generalization losses (not the classification loss) are derived by assuming that the label y is a discrete variable for classification. Similar to classification, we aim for a comprehensive comparison of JPRL with the Alignment and Other methods on different types of data: WiFi (UJIIndoorLoc) and image (UTKFace), and in tasks with one regression variable (UTKFace) and multiple regression variables (UJIIndoorLoc).

In Table 12 for the WiFi dataset UJIIndoorLoc and Table 13 for the image dataset UTKFace, JPRL with the 1HLNN and ResNet50 backbones obtains a lower sum of MAE than its comparison methods on 7 out of the 8 tasks. This verifies that, in the less studied problem of domain generalization for regression, our JPRL is more effective in handling different regression tasks. Furthermore, we also evaluate our solution on the Cooling Capacity dataset [25] from the field of energy conversion and management. We split the dataset into 3 subsets (domains), D500, D600, and D900, according to the cycle time. The task is regression with the cooling capacity variable. We follow the work of Krzywanski et al. [25] and implement our JPRL with a two-Hidden-Layer Neural Network (2HLNN). Table 14 reports the regression results. We observe that JPRL consistently outperforms its comparison methods, which again demonstrates the effectiveness of our solution.

5.2.6 Statistical test

To be rigorous in a statistical sense, we further conduct a statistical test to check whether our solution is significantly better than the others in the classification tasks. We conduct the Wilcoxon signed-ranks test [9, 15] based on the classification results from Tables 7, 8, 9, 10 and 11. The test uses a statistic z to compare the performance of two methods over multiple tasks; in each task, the classification accuracy is adopted as the performance measure. We fix JPRL as the control method, and conduct 8 pairs of tests: CIDDG versus JPRL, DGER versus JPRL, MetaReg versus JPRL, DDAIG versus JPRL, MixStyle versus JPRL, ATSRL versus JPRL, RSC versus JPRL, and FACT versus JPRL. The detailed test procedure is presented in Appendix 1, and the resulting 8 z values are reported in Table 15. We observe from Table 15 that the z values for these 8 pairs are all below the critical value of \(-1.96\). According to [9, 15], this indicates that, at a significance level of 0.05, JPRL is statistically better than its comparison methods.
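The z statistic of this test can be computed as below. This is a sketch of the standard normal-approximation form; ties among the absolute differences are ranked arbitrarily here, whereas the full procedure in Appendix 1 may use average ranks:

```python
import numpy as np

def wilcoxon_z(acc_a, acc_b):
    """Wilcoxon signed-ranks z statistic over per-task scores.

    Zero differences are discarded; T = min(R+, R-) is compared with its
    expectation under the null. A value below -1.96 indicates a
    significant difference at the 0.05 level (normal approximation).
    """
    d = np.asarray(acc_a, dtype=float) - np.asarray(acc_b, dtype=float)
    d = d[d != 0]                                    # drop ties between methods
    n = len(d)
    ranks = np.argsort(np.argsort(np.abs(d))) + 1.0  # ranks of |d| (ties arbitrary)
    r_plus = ranks[d > 0].sum()
    r_minus = ranks[d < 0].sum()
    T = min(r_plus, r_minus)
    mu = n * (n + 1) / 4.0
    sigma = np.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    return (T - mu) / sigma
```

For example, a method that loses to the control on every one of 10 tasks yields T = 0 and z ≈ −2.8, below the −1.96 threshold.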

Table 15 z values of different methods versus JPRL on the domain generalization tasks

5.3 Experimental analysis

The main experimental results above have demonstrated the advantage of our JPRL solution to domain generalization in classification and regression. In the following, we further conduct several experiments to analyze the contributions of JPRL. As comparison methods, we include Deep Adaptation Networks (DAN) [32], Joint Adaptation Networks (JAN) [33], and Deep Subdomain Adaptation Network (DSAN) [56], which are popular MMD-based methods. Note that, since these methods are originally designed for domain adaptation, we slightly modify them so that their inputs contain data from multiple source domains, the same as the inputs of the domain generalization methods.

5.3.1 Feature visualization


We visualize the domain alignment ability of JPRL on a real-world dataset, using the adversarial methods CIDDG and DGER and the MMD-based method DSAN for comparison. We use Office-Home as the experimental dataset: it has many classes in each domain, which makes it challenging for domain alignment and hence a good testbed for assessing the ability of JPRL. In particular, we take Product as the target and the remaining 3 as the source domains, and plot in Fig. 6 the t-SNE [34] embeddings of the domain data, obtained from the representation spaces of the above methods with the ResNet18 backbone. Comparing Fig. 6d against Fig. 6a, b, c, we observe that JPRL aligns the source and target domains in the representation space better than its comparison methods CIDDG, DGER, and DSAN. This suggests that by joint-product distribution alignment under the \(L^{2}\)-distance, JPRL better reduces the discrepancy among domains, approximates the i.i.d. supervised learning setting to a certain extent, and eventually leads to superior target classification results.
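The visualization pipeline (stacking per-domain representation features, tracking a domain index per point, and projecting to 2-D) can be sketched as below. The paper uses t-SNE [34]; here a PCA projection stands in purely to keep the sketch dependency-free:

```python
import numpy as np

def embed_domains_2d(features_by_domain):
    """Stack per-domain representation features and project them to 2-D.

    features_by_domain: list of (n_l, d) arrays, one per domain, extracted
    from the network representation space. Returns the 2-D embedding and
    a domain index per point (for coloring the scatter plot).
    """
    X = np.concatenate(list(features_by_domain))
    dom = np.concatenate([np.full(len(F), i)
                          for i, F in enumerate(features_by_domain)])
    Xc = X - X.mean(axis=0)
    # Top-2 principal directions via SVD of the centered feature matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T, dom
```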

Fig. 6

T-SNE visualization of source and target data in the representation spaces of CIDDG, DGER, DSAN, and JPRL. The source domains are Art, Clipart, and RealWorld, and the target domain is Product. a CIDDG. b DGER. c DSAN. d JPRL

5.3.2 Regression loss


According to a recent survey by Lathuilière et al. [26], besides the square loss, the \(L^{1}\) loss and the Huber loss are also common regression losses. Here, we investigate whether the advantage of JPRL over the comparison methods depends on the choice of regression loss. To this end, we utilize all three losses, and run JPRL and its comparison methods DGGS and MMD-AAE on two tasks from UJIIndoorLoc and UTKFace. The target regression results on the two tasks are reported in Fig. 7a, b, where horizontal dashed lines are plotted for easy comparison. Figure 7a shows that with different losses, JPRL consistently achieves lower target error than DGGS and MMD-AAE. For JPRL itself, the performance varies with the loss: the best performance is attained with the \(L^{1}\) loss, while the other two losses lead to similar performance. From Fig. 7b, we again observe that JPRL yields better results than its competitors regardless of the regression loss. From this evidence, we conclude that while the performance of JPRL varies with the regression loss, its advantage over the comparison methods remains consistent, which shows that our joint-product representation learning component plays an important role.
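The three regression losses, applied to a residual r = prediction − target, can be sketched as follows; the 1/2 factor on the square loss is a convention we adopt so that the Huber loss reduces to it for small residuals:

```python
import numpy as np

def square_loss(r):
    """Quadratic loss 0.5 * r^2 (the 1/2 factor is a convention)."""
    return 0.5 * r ** 2

def l1_loss(r):
    """Absolute-error loss |r|."""
    return np.abs(r)

def huber_loss(r, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond: robust to outliers."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))
```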

Fig. 7

Target regression error of DGGS, MMD-AAE, and JPRL, which are trained with the square loss, the \(L^{1}\) loss, and the Huber loss. The horizontal dashed lines are plotted for easy comparison of the error bars. a Regression error in target domain Floor1 (UJIIndoorLoc). b Regression error in target domain Asian (UTKFace)

5.3.3 Comparison with MMD-based methods


We compare JPRL with DAN, JAN, and DSAN, which are also free from adversarial training. We run the comparison experiments on Office-Home and equip all methods with the ResNet18 backbone. The target classification results are reported in Table 16. We observe that JPRL noticeably outperforms DAN, JAN, and DSAN. This shows that in domain generalization, joint-product distribution alignment under the \(L^{2}\)-distance has an advantage over the alignment strategies under the MMD distance implemented by the comparison methods.
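For reference, the squared MMD with a Gaussian kernel, the discrepancy measure underlying these comparison methods, can be estimated as below. This is the standard biased estimator, not the paper's \(L^{2}\)-distance; the kernel bandwidth choice is illustrative:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy between
    the samples X and Y under a Gaussian kernel."""
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2.0 * gaussian_kernel(X, Y, sigma).mean())
```

Two samples from the same distribution give a near-zero estimate, while a shifted sample gives a clearly larger one.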

Table 16 Classification accuracy (%) of MMD-based methods and our solution with ResNet18 backbone on dataset Office-Home. In each column, the best result is highlighted in bold

6 Conclusion

In this work, we study the domain generalization problem and propose the JPRL solution, which generalizes a source-trained network prediction model to the target domain. Our solution works by (1) aligning in the network representation space two probability distributions, the joint distribution and the product distribution, and (2) minimizing the downstream classification or regression loss. In particular, we align the two distributions under the \(L^{2}\)-distance, which we estimate explicitly by analytically solving an unconstrained quadratic maximization problem. We implement our solution with both shallow and deep network architectures, and experimentally demonstrate its effectiveness on synthetic and real-world datasets for classification and regression.

The limitation of our solution is that when the domains are very different from each other, it could be difficult to align the joint distribution and the product distribution well in the network representation space. As a result, our solution may not bring much performance improvement to the network prediction model. In the future, we plan to strengthen our solution to handle such cases by incorporating other complementary domain generalization strategies (eg, [36, 53]). Furthermore, we also plan to explore the related multi-source domain adaptation problem [46], and extend the current solution to address that problem.