One of the central concepts in information geometry is the maximum entropy principle. The unique probability distribution that maximizes entropy under fixed moment constraints belongs to an exponential family ([15]). The exponential and mixture families induce a dually flat structure on the space of probability distributions (e.g. [2]).

We establish a version of the maximum entropy principle for the family of distributions generated by coordinate-wise transformations. Coordinate-wise transformations arise naturally in copula theory, where they are used to give distributions uniform marginals (e.g. [27]), and they can be regarded as a subclass of optimal transport maps.

Before going into details, we first consider a linear analogue of the problem. Marshall and Olkin [25] proved the following diagonal scaling theorem on matrices.

Theorem 1

([25]) Let \(S=(S_{ij})\in \mathbb {R}^{d\times d}\) be a positive semi-definite matrix and assume that S is strictly copositive in the sense that

$$\begin{aligned} \inf _{w_1,\ldots ,w_d>0}\frac{\sum _i\sum _j w_iS_{ij}w_j}{\sum _i w_i^2} > 0. \end{aligned}$$

Then, there exists a unique positive-definite diagonal matrix D such that

$$\begin{aligned} \sum _{j=1}^d (DSD)_{ij} = 1 \end{aligned}$$

for each \(i\in \{1,\dots ,d\}\).

Note that (1) is satisfied if S is positive definite. As pointed out in [20], equation (2) is the stationarity condition of the convex function

$$\begin{aligned} \psi (w)=\sum _i(-\log w_i)+\frac{1}{2}\sum _i\sum _j w_iS_{ij}w_j, \end{aligned}$$

where \(w=(w_1,\ldots ,w_d)\) is the diagonal component of D.
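The fixed point of Theorem 1 can be computed numerically by minimizing \(\psi \). The following sketch (the coordinate-descent scheme, the example matrix, and all names are our own, not from the paper) solves the stationarity condition \(-1/w_i+(Sw)_i=0\) one coordinate at a time:

```python
import numpy as np

def diagonal_scaling(S, sweeps=500, tol=1e-13):
    """Minimize psi(w) = -sum_i log w_i + (1/2) w^T S w by exact coordinate
    descent: d(psi)/dw_i = 0 is the scalar quadratic S_ii w_i^2 + b_i w_i - 1 = 0."""
    d = S.shape[0]
    w = np.ones(d)
    for _ in range(sweeps):
        w_old = w.copy()
        for i in range(d):
            b = S[i] @ w - S[i, i] * w[i]   # b_i = sum_{j != i} S_ij w_j
            w[i] = (-b + np.sqrt(b * b + 4.0 * S[i, i])) / (2.0 * S[i, i])
        if np.max(np.abs(w - w_old)) < tol:
            break
    return w

# an arbitrary positive-definite S (hence strictly copositive)
S = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.0, 0.2],
              [0.3, 0.2, 1.5]])
D = np.diag(diagonal_scaling(S))
print((D @ S @ D).sum(axis=1))   # each row sum of DSD is close to 1
```

Each update solves its scalar quadratic exactly, and the iteration converges because \(\psi \) is convex and smooth on the positive orthant.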

Theorem 1 can be interpreted in a probabilistic framework as follows. Let \(\mu \) be a normal distribution on \(\mathbb {R}^d\) with mean zero and covariance matrix \(S_{ij}=\int x_ix_j{\mathrm {d}\mu }\). Let \(\nu \) be the push-forward measure of \(\mu \) under the linear transformation \(x\mapsto Dx\) of \(\mathbb {R}^d\). Since the covariance matrix of \(\nu \) is DSD, equation (2) can be rewritten as

$$\begin{aligned} \sum _{j=1}^d\int x_ix_j {\mathrm {d}\nu }=1 \end{aligned}$$

for each \(i\in \{1,\ldots ,d\}\). In other words, each coordinate \(x_i\) and the sum \(\sum _j x_j\) have unit covariance under the law \(\nu \).

In the present paper, we provide a nonlinear analogue of this theorem. We allow nonlinear monotone coordinate-wise transformations in order to achieve a condition stronger than (4), referred to as the Stein-type identity (see Sect. 2 for the precise definition). Under mild conditions on \(\mu \), we show that such a transformation exists, is unique, and minimizes a free energy functional like (3) that contains an entropy term. The space used in the proof is the Wasserstein space, the metric space induced by optimal transportation. A key observation is that our functional is displacement convex in the sense of [26]. Refer to [30, 38] for comprehensive accounts of optimal transportation and its applications.

Under the Stein-type identity, the sum of variables has positive correlation with each variable. This property is applied to a rating problem of multivariate data in Sect. 3.

As is well known, Sklar’s theorem (see, e.g., [27]) states that any multi-dimensional distribution is transformed by the probability integral transformation into a distribution with uniform marginals. The transformed distribution is called a copula. A linear analogue of Sklar’s theorem is that for any covariance matrix S there exists a unique positive-definite diagonal matrix D such that every diagonal element of DSD is unity. This is nothing but the correlation matrix corresponding to S.

There are several papers relevant to our study. A relation between copulas and diagonal scaling is investigated in [4] from a different perspective; their scaling operation does not correspond to a transformation of random variables. Optimal transportation between two distributions sharing the same copula is considered in [1], where the focus is on various cost functions. Optimal transportation is used to define multi-dimensional quantiles in [7] and [13]. Although our motivation also includes defining a kind of quantile function for multivariate data, our construction differs from theirs (see Sect. 3). A particular class of optimal transport maps called moment maps has a deep connection to another Stein-type identity, as investigated in [10].

The remainder of the present paper is organized as follows. In Sect. 2, we define the Stein-type distributions and transformations. In Sect. 3, we briefly explain its application to a rating problem of multivariate data. In Sect. 4, we describe the existence and uniqueness theorem as well as a variational characterization theorem. In Sect. 5, we prove the main results using the theory of optimal transportation. In Sect. 6, a numerical method to find the transformation for piecewise uniform distributions is proposed. Finally, we discuss open problems in Sect. 7.

Definition of Stein-type distributions and transformations

We define a class of distributions that satisfy a stronger condition than (4). Let \(\mathcal {P}^2=\mathcal {P}^2(\mathbb {R}^d)\) be the set of probability distributions \(\mu \) on \(\mathbb {R}^d\) with mean zero and finite second moments such that each marginal distribution \(\mu _i\) of \(\mu \) is absolutely continuous with respect to the Lebesgue measure on \(\mathbb {R}\). Note that \(\mu \) itself is not assumed to be absolutely continuous. The mean-zero condition is imposed only for simplicity. We say that a function \(f:\mathbb {R}\rightarrow \mathbb {R}\) is absolutely continuous if there exists a locally integrable function \(f'\) such that \(f(x)=f(0)+\int _0^x f'(y)\mathrm {d}y\) in Lebesgue’s sense.

Definition 1

A distribution \(\mu \in \mathcal {P}^2\) is said to be Stein-type if it satisfies

$$\begin{aligned} \int f(x_i)\left( \sum _{j=1}^d x_j\right) {\mathrm {d}\mu }= \int f'(x_i){\mathrm {d}\mu }, \quad i=1,\ldots ,d, \end{aligned}$$

for any absolutely continuous function \(f:\mathbb {R}\rightarrow \mathbb {R}\) with bounded derivative \(f'\).

Note that the equation (4) is a special case of (5) with \(f(x_i)=x_i\).

We refer to equation (5) as the Stein-type identity. Indeed, if \(d=1\), it reduces to the Stein identity \(\int f(x_1)x_1{\mathrm {d}\mu }= \int f'(x_1){\mathrm {d}\mu }\), which implies that \(\mu \) is the standard normal distribution (see [35] and [6]). The Stein identity is used to evaluate the distance between a given distribution and the normal distribution. Although the Stein-type identity defined here generalizes the Stein identity, the author is not aware of applications to such distance evaluations; instead, we develop a different application. More specifically, if a random vector \((X_1,\ldots ,X_d)\) has a Stein-type distribution, then the sum \(\sum _j X_j\) is positively correlated with any increasing transformation of \(X_i\) due to (5). This property is applied to a rating problem in Sect. 3.
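The one-dimensional Stein identity is easy to check by simulation; a minimal sketch, with the test function \(f(x)=\tanh (x)\) and the sample size chosen by us:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

# f(x) = tanh(x): absolutely continuous with bounded derivative sech^2(x)
lhs = np.mean(np.tanh(x) * x)          # E[f(X) X]
rhs = np.mean(1.0 / np.cosh(x) ** 2)   # E[f'(X)]
print(round(lhs, 3), round(rhs, 3))    # the two estimates agree
```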

If \(\mu \) is completely independent in the sense that \(\mu \) is the direct product of its marginals \(\mu _i\), then \(\mu \) can be Stein-type only if it is the d-dimensional standard normal distribution. Hereafter, we focus on dependent cases.

For Gaussian random variables, we obtain the following lemma. We denote the expectation by \(\mathrm{E}\).

Lemma 1

(Theorem 5 of [33]) Let \(\mu \) denote the d-dimensional normal distribution with mean zero and covariance matrix \(S=(S_{ij})\). Then, \(\mu \) is Stein-type if and only if \(\sum _j S_{ij}=1\) for each i.


Proof

Let \(X=(X_i)\) be a random vector distributed according to \(\mu \). Then, the conditional expectation of \(X_j\) given \(X_i\) is \(\mathrm{E}[X_j|X_i]=S_{ij}X_i/S_{ii}\). The left-hand side of (5) is

$$\begin{aligned} \mathrm{E}\left[ f(X_i)\sum _j \mathrm{E}[X_j|X_i]\right]&= \frac{\sum _j S_{ij}}{S_{ii}} \mathrm{E}[f(X_i)X_i] \\&= \left( \sum _j S_{ij}\right) \mathrm{E}[f'(X_i)], \end{aligned}$$

where the last equality follows from the Stein identity in the one-dimensional case. Hence (5) holds if and only if \(\sum _j S_{ij}=1\). \(\square \)

The following example gives a rich class of Stein-type distributions.

Example 1

Let W be a random variable with the standard normal distribution and let U be any random variable independent of W such that \(\mathrm{E}[U]=0\) and \(\mathrm{E}[U^2]<\infty \). The condition \(\mathrm{E}[U]=0\) is assumed to make the following distribution belong to \(\mathcal {P}^2\) and not essential here. Consider two variables

$$\begin{aligned} X_1 = \frac{W+U}{\sqrt{2}} \quad \text{ and } \quad X_2 = \frac{W-U}{\sqrt{2}}. \end{aligned}$$

Then the distribution of \((X_1,X_2)\) is Stein-type. Indeed, we obtain

$$\begin{aligned} \mathrm{E}\left[ f\left( \frac{W\pm U}{\sqrt{2}}\right) \sqrt{2}W\right] = \mathrm{E}\left[ f'\left( \frac{W\pm U}{\sqrt{2}}\right) \right] \end{aligned}$$

for any f by the one-dimensional Stein identity applied to W conditionally on U. The variable \(W=(X_1+X_2)/\sqrt{2}\) can be interpreted as an “overall score” of the two variables \(X_1\) and \(X_2\). The identity implies that W is positively correlated with any increasing function \(f(X_i)\) of \(X_i\).

This example is generalized to the case \(d\ge 3\). Define a random vector \((X_1,\ldots ,X_d)\) by \(X_i = (W+U_i)/\sqrt{d}\), where W has the standard normal distribution independent of \(U_1,\ldots ,U_{d-1}\) and \(U_d=-\sum _{j=1}^{d-1} U_j\). Then, the distribution of \((X_1,\ldots ,X_d)\) is Stein-type as long as \(\mathrm{E}[U_i]=0\) and \(\mathrm{E}[U_i^2]<\infty \). As in the two-dimensional case, W has positive correlation with any increasing function \(f(X_i)\) of \(X_i\). In Sect. 3, we will show by an example that the positive-correlation property does not hold in general for copulas.
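The construction above can be verified by simulation. A minimal sketch for \(d=3\) (the law of the \(U_i\), here centered uniform, the test function \(f=\tanh \), and the sample size are our choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1_000_000, 3
W = rng.standard_normal(n)

U = rng.uniform(-1.0, 1.0, size=(n, d - 1))        # any centered, square-integrable law
U = np.hstack([U, -U.sum(axis=1, keepdims=True)])  # U_d = -(U_1 + ... + U_{d-1})
X = (W[:, None] + U) / np.sqrt(d)                  # X_i = (W + U_i) / sqrt(d)

s = X.sum(axis=1)                                  # equals sqrt(d) * W
for i in range(d):
    lhs = np.mean(np.tanh(X[:, i]) * s)            # E[f(X_i) sum_j X_j]
    rhs = np.mean(1.0 / np.cosh(X[:, i]) ** 2)     # E[f'(X_i)]
    print(i, round(lhs, 3), round(rhs, 3))         # the two estimates agree
```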

The example does not cover the entire class of Stein-type distributions. Other examples are given in Sect. 6 and Appendix B.

For each \(\mu \in \mathcal {P}^2\), let \(\mathcal {T}_{\mathrm{cw}}(\mu )\) be the set of coordinate-wise transformations

$$\begin{aligned} T(x)=(T_1(x_1),\ldots ,T_d(x_d)), \quad x = (x_1,\ldots ,x_d)\in \mathbb {R}^d, \end{aligned}$$

such that each \(T_i:\mathbb {R}\rightarrow \mathbb {R}\cup \{-\infty ,\infty \}\) is non-decreasing and \(T_\sharp \mu \) belongs to \(\mathcal {P}^2\). Here, \(T_\sharp \mu \) is the push-forward measure defined by \((T_\sharp \mu )(A)=\mu (T^{-1}(A))\) for any measurable set A. The set \(\mathcal {T}_{\mathrm{cw}}(\mu )\) depends only on the marginal distributions of \(\mu \). Two maps T and U in \(\mathcal {T}_\mathrm{cw}(\mu )\) are identified if \(\mu (T=U)=1\). Note that \(T_i\) may have discontinuous points if the support of the i-th marginal \((T_\sharp \mu )_i\) is not connected.

We consider a problem to find a map \(T\in \mathcal {T}_{\mathrm{cw}}(\mu )\) such that \(T_\sharp \mu \) is Stein-type.

Definition 2

A map \(T\in \mathcal {T}_{\mathrm{cw}}(\mu )\) is called a Stein-type transformation of \(\mu \) if \(T_\sharp \mu \) is a Stein-type distribution.

For example, if \(\mu \) is the product measure of one-dimensional continuous distributions \(\mu _i\), then the map T defined by \(T_i(x_i)=\varPhi ^{-1}(\mu _i((-\infty ,x_i]))\) is the Stein-type transformation of \(\mu \), where \(\varPhi \) is the cumulative distribution function of the standard normal distribution. The map is nothing but the Brenier map between the two product measures.
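The transformation \(T_i=\varPhi ^{-1}\circ F_i\) for product measures can be checked by simulation. A sketch with two independent shifted exponential marginals (the marginal law, sample size, and seed are our choices; \(\varPhi ^{-1}\) is taken from Python's statistics.NormalDist):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
n, d = 200_000, 2

# independent marginals: Exp(1) shifted to mean zero, with cdf F(x) = 1 - exp(-(x+1))
X = rng.exponential(1.0, size=(n, d)) - 1.0
F = np.clip(1.0 - np.exp(-(X + 1.0)), 1e-12, 1 - 1e-12)
T = np.vectorize(NormalDist().inv_cdf)(F)   # T_i = Phi^{-1} o F_i, coordinate-wise

# the push-forward is the standard normal product measure, hence Stein-type
s = T.sum(axis=1)
lhs = np.mean(np.tanh(T[:, 0]) * s)         # E[f(Y_1)(Y_1 + Y_2)], f = tanh
rhs = np.mean(1.0 / np.cosh(T[:, 0]) ** 2)  # E[f'(Y_1)]
print(round(lhs, 3), round(rhs, 3))         # the two estimates agree
```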

The following lemma is immediate.

Lemma 2

Let \(\mu \) be the normal distribution with a covariance matrix \(S=(S_{ij})\). Then, \(\mu \) has a Stein-type transformation if S is strictly copositive in the sense of (1).


Proof

Let D be the diagonal matrix with entries \(w_1,\ldots ,w_d\) satisfying (4). Set \(T(x)=Dx\). Then, \(T_\sharp \mu \) is Stein-type due to Lemma 1. \(\square \)

Application to a rating problem

In this section, we briefly describe an application of Stein-type transformations. We first explain the linear rating method for multivariate data proposed in [33]. Let S be the covariance matrix of an \(\mathbb {R}^d\)-valued random vector \(x=(x_i)\). Suppose that each variable \(x_i\) represents a “score”; for example, \(x_1\), \(x_2\) and \(x_3\) may be student scores in mathematics, physics and history. We want to determine positive weights \(w_i\) such that \(\sum _j w_jx_j\) reflects the scores \(x_1,\dots ,x_d\). A candidate for such weights is given by the diagonal elements of D in (2). Indeed, under (2), the variable \(x_i\) and the overall score \(\sum _j w_jx_j\) have positive covariance for each i. This property is not attained in general by other weighting methods. The quantity \(\sum _j w_jx_j\) is called the objective general index (OGI) in [33]; it reflects the d scores in this sense.

In a similar manner, we can define a nonlinear version of the objective general index via Stein-type transformations. Let \(\mu \) be a probability distribution on \(\mathbb {R}^d\) and x be a random vector distributed according to \(\mu \). Again each coordinate \(x_i\) is assumed to have a meaning of score. Then the nonlinear general index is defined by

$$\begin{aligned} g(x)=\sum _j T_j(x_j), \end{aligned}$$

where \(T=(T_i)\) is the Stein-type transformation of \(\mu \). The quantity g(x) satisfies \(\mathrm{E}[g(X)f(X_i)]\ge 0\) for any increasing function \(f(x_i)\) of \(x_i\) due to the Stein-type identity (5). In particular, by taking the step function \(f(x_i)=I_{[a,\infty )}(x_i)\) and noting that \(\mathrm{E}[g(X)]=0\), we obtain

$$\begin{aligned} \mathrm{E}[g(X) \mid X_i < a] \le \mathrm{E}[g(X)\mid X_i \ge a] \end{aligned}$$

for each \(i\in \{1,\ldots ,d\}\) and \(a\in \mathbb {R}\). This property means that the conditional average of the overall score among students with higher scores on \(x_i\) is larger than that among students with lower scores. In this sense, g(x) reflects the d scores.
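For completeness, the step from the Stein-type identity to (6) can be written out. Let \(q=\mathrm{P}(X_i\ge a)\in (0,1)\). Applying the identity to a smooth increasing approximation of \(f=I_{[a,\infty )}\) gives \(\mathrm{E}[g(X)I_{[a,\infty )}(X_i)]\ge 0\), and since \(\mathrm{E}[g(X)]=0\),

$$\begin{aligned} (1-q)\,\mathrm{E}[g(X)\mid X_i<a] = -q\,\mathrm{E}[g(X)\mid X_i\ge a] \le 0 \le q\,\mathrm{E}[g(X)\mid X_i\ge a], \end{aligned}$$

which yields (6).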

General indices g(x) satisfying the inequality (6) are not unique. Indeed, for two-dimensional distributions, the probability integral transformation \(T_i(x_i)=\mu _i((-\infty ,x_i])\) already provides (6); see Appendix D for a proof. For higher-dimensional distributions, however, it is not trivial to find such a transformation. The Stein-type transformation solves this problem, as we confirm by an example.

Example 2

Suppose that \(x_1\), \(x_2\) and \(x_3\) are random variables that have a probability distribution \(\mu \) on \([-1,1]^3\) with the density function

$$\begin{aligned} f(x_1,x_2,x_3) = {\left\{ \begin{array}{ll} 3/8 &{} \text{ if }\ (x_1,x_2,x_3)\in ([-1,0]\times [0,1]^2)\cup ([0,1]\times [-1,0]^2),\\ 1/24 &{} \text{ otherwise }. \end{array}\right. } \end{aligned}$$

The three marginal distributions are uniform over \([-1,1]\), i.e., \(\mu \) is a copula. We can see

$$\begin{aligned} \mathrm{E}\left[ \left. \sum _j X_j \right| X_1 < 0\right] = \frac{1}{6}> \mathrm{E}\left[ \left. \sum _j X_j \right| X_1 > 0\right] = -\frac{1}{6}. \end{aligned}$$

Hence the property in (6) does not hold if we adopt \(g(x)=\sum _i x_i\). As will be demonstrated in Sect. 6, the unique Stein-type transformation of \(\mu \) is

$$\begin{aligned} T_i(x_i) = {\left\{ \begin{array}{ll} -c_i + \varPhi ^{-1}((1+x_i)\varPhi (c_i)) &{} \text{ if }\ x_i<0,\\ c_i - \varPhi ^{-1}((1-x_i)\varPhi (c_i)) &{} \text{ if }\ x_i>0 \end{array}\right. } \end{aligned}$$

with \(c_1=1.2490\) and \(c_2=c_3=0.3445\), where \(\varPhi \) denotes the standard normal distribution function and the constants are obtained numerically. The nonlinear objective general index \(g(x)=\sum _i T_i(x_i)\) then satisfies relation (6).
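The claims of Example 2 can be checked by simulation. The sketch below (sampling scheme, sample size, and seed are our choices; \(\varPhi \) and \(\varPhi ^{-1}\) are taken from Python's statistics.NormalDist) samples from the density f, reproduces the conditional means \(\pm 1/6\), and confirms that \(g=\sum _i T_i\) restores property (6):

```python
import numpy as np
from statistics import NormalDist

nd = NormalDist()
Phi, Phi_inv = nd.cdf, nd.inv_cdf
rng = np.random.default_rng(3)
n = 400_000

# the eight sign-pattern boxes of [-1,1]^3; two carry density 3/8, the rest 1/24
signs = np.array([(s1, s2, s3) for s1 in (-1, 1) for s2 in (-1, 1) for s3 in (-1, 1)])
probs = [3 / 8 if tuple(s) in [(-1, 1, 1), (1, -1, -1)] else 1 / 24 for s in signs]

idx = rng.choice(len(signs), size=n, p=probs)
X = signs[idx] * rng.uniform(0.0, 1.0, size=(n, 3))   # uniform inside the chosen box

m_low = X.sum(axis=1)[X[:, 0] < 0].mean()    # about  1/6
m_high = X.sum(axis=1)[X[:, 0] > 0].mean()   # about -1/6

def T(x, c):
    # the transformation of Example 2, written with |x| to cover both branches
    u = np.clip((1.0 - np.abs(x)) * Phi(c), 1e-12, 1 - 1e-12)
    return np.sign(x) * (c - np.vectorize(Phi_inv)(u))

c = [1.2490, 0.3445, 0.3445]
g = sum(T(X[:, i], c[i]) for i in range(3))
print(round(m_low, 3), round(m_high, 3))
for i in range(3):
    print(i, round(g[X[:, i] < 0].mean(), 3), "<", round(g[X[:, i] > 0].mean(), 3))
```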

Main results

For given \(\mu \in \mathcal {P}^2\), denote the set of coordinate-wise transformed distributions of \(\mu \) by

$$\begin{aligned} \mathcal {F}_{\mu } = \{T_\sharp \mu \mid T\in \mathcal {T}_\mathrm{cw}(\mu )\} \subset \mathcal {P}^2. \end{aligned}$$

We refer to \(\mathcal {F}_{\mu }\) as a fiber. The following lemma is a direct consequence of the one-dimensional optimal transportation. See Appendix A.

Lemma 3

For given \(\mu \in \mathcal {P}^2\) and \(\nu \in \mathcal {F}_{\mu }\), the map \(T\in \mathcal {T}_{\mathrm{cw}}(\mu )\) satisfying \(\nu =T_\sharp \mu \) is uniquely determined \(\mu \)-almost everywhere. Furthermore, the relation \(\nu \in \mathcal {F}_{\mu }\) between two measures \(\mu \) and \(\nu \) is an equivalence relation. In particular, \(\mathcal {P}^2\) is partitioned into mutually disjoint fibers.

Now, we state our three main theorems. All proofs are presented in Sect. 5.

The first main theorem characterizes Stein-type distributions in terms of the variational principle. Define an energy functional \(\mathcal {E}(\mu )\) of \(\mu \) by

$$\begin{aligned} \mathcal {E}(\mu ) = \sum _{i=1}^d \int p_i(x_i)\log p_i(x_i){\mathrm {d}x}_i + \frac{1}{2}\int \left( \sum _{j=1}^d x_j\right) ^2 {\mathrm {d}\mu }, \end{aligned}$$

where \(p_i={\mathrm {d}\mu }_i/{\mathrm {d}x}_i\) is the marginal density function. The first term of \(\mathcal {E}(\mu )\) represents the negative entropy of the marginal distributions, and the second term is half the variance of the diagonal part \(\sum _j x_j\). The functional is displacement convex over each fiber in the sense of [26]; see Sect. 5 for details. If we replace the marginal entropy term with the joint entropy, the functional becomes displacement convex over the whole space, as proved by McCann [26].
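To see how \(\mathcal {E}\) relates to the matrix functional (3), consider a centered Gaussian \(\mu \) with covariance matrix S and a linear map \(T(x)=Dx\) with \(D=\mathrm{diag}(w_1,\ldots ,w_d)\), \(w_i>0\). The marginals of \(T_\sharp \mu \) are normal with variances \(w_i^2S_{ii}\), so a direct computation (recorded here for orientation) gives

$$\begin{aligned} \mathcal {E}(T_\sharp \mu ) = \sum _{i=1}^d(-\log w_i) + \frac{1}{2}\sum _i\sum _j w_iS_{ij}w_j - \frac{d}{2}\log (2\pi e) - \frac{1}{2}\sum _i \log S_{ii}, \end{aligned}$$

which equals \(\psi (w)\) of (3) up to an additive constant. Minimizing \(\mathcal {E}\) over linear diagonal transformations of \(\mu \) thus recovers Theorem 1.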

Theorem 2

A measure \(\mu \in \mathcal {P}^2\) is Stein-type if and only if \(\mathcal {E}(\mu )\) is finite and \(\mu \) minimizes \(\mathcal {E}\) over the fiber \(\mathcal {F}_{\mu }\).

The second main theorem is on the uniqueness of Stein-type transformations. A distribution \(\mu \) on \(\mathbb {R}^d\) is said to have a regular support if the support of \(\mu \) is equal to the direct product of the supports of the marginal distributions \(\mu _i\). This property is invariant under coordinate-wise transformations. Note that the regular support condition does not imply absolute continuity of \(\mu \) with respect to \(\prod _{i=1}^d \mu _i\).

Theorem 3

(Uniqueness) Assume that \(\mu \in \mathcal {P}^2\) has a regular support. Then, a Stein-type transformation of \(\mu \) is unique if it exists.

We conjecture that the uniqueness follows without the regular support condition. See Sect. 7 for more details.

The third main theorem is on existence. A measure \(\mu \in \mathcal {P}^2\) is said to be copositive if

$$\begin{aligned} \beta (\mu ) := \inf _{T \in \mathcal {T}_{\mathrm{cw}}(\mu )} \frac{\int (\sum _i T_i(x_i))^2{\mathrm {d}\mu }}{\sum _i\int T_i(x_i)^2 {\mathrm {d}\mu }} > 0. \end{aligned}$$

For example, if \(\mu \) is pairwise independent, then \(\int (\sum _i T_i(x_i))^2{\mathrm {d}\mu }= \sum _i \int T_i(x_i)^2{\mathrm {d}\mu }\) for any \(T\in \mathcal {T}_{\mathrm{cw}}(\mu )\), and therefore \(\beta (\mu )=1\). It is not difficult to see that \(\beta (\mu )\le 1\) for any \(\mu \). If \(\mu \) is associated in the sense of [9, 11] and [21], then \(\int T_iT_j{\mathrm {d}\mu }\ge 0\) for each pair i and j, and therefore \(\beta (\mu )=1\). On the other hand, if \(d=2\) and \(\mu (\{x\mid x_1+x_2=0\})=1\), then \(\beta (\mu )=0\) because \(\int (x_1+x_2)^2{\mathrm {d}\mu }=0\). Sufficient conditions for copositivity are presented in Appendix E.

Theorem 4

(Existence) Let \(\mu \in \mathcal {P}^2\) be copositive. Then, there exists a Stein-type transformation of \(\mu \).

We now present a few remarks before proceeding to the following section.

The uniqueness and existence results in Theorem 3 and Theorem 4 are consequences of the variational characterization in Theorem 2, as will be shown in Sect. 5. For \(d=1\), the functional \(\mathcal {E}(\mu )\) is the Kullback-Leibler divergence from \(\mu \) to the standard normal density up to a constant term. For \(d\ge 2\), however, \(\mathcal {E}\) is not even bounded from below. Indeed, for each \(t>0\), let \(\mu ^t\) be the multivariate normal distribution with mean zero and covariance matrix \(\Sigma _t = P + t(I-P)\), where I is the identity matrix and

$$\begin{aligned} P = \frac{1}{d}\begin{pmatrix} 1&{} \cdots &{} 1\\ \vdots &{}\ddots &{} \vdots \\ 1&{} \cdots &{} 1 \end{pmatrix}\in \mathbb {R}^{d\times d}. \end{aligned}$$

Then, each marginal distribution of \(\mu ^t\) is normal with variance \(\sigma _t^2=(1/d)+t(1-1/d)\). We can show that \(\mathcal {E}(\mu ^t) = -(d/2)\log (2\pi \sigma _t^2)\), which tends to \(-\infty \) as \(t\rightarrow \infty \). Therefore, it is not obvious whether a minimizer of \(\mathcal {E}\) over a given fiber exists. Nevertheless, the existence and uniqueness theorems hold.
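The claimed formula for \(\mathcal {E}(\mu ^t)\) follows from two observations, recorded here for convenience: the entries of \(\Sigma _t\) sum to \(\sum _{i,j}P_{ij}+t\sum _{i,j}(I-P)_{ij}=d+0=d\), and the negative entropy of each marginal \(N(0,\sigma _t^2)\) is \(-\frac{1}{2}\log (2\pi e\sigma _t^2)\). Hence

$$\begin{aligned} \mathcal {E}(\mu ^t) = -\frac{d}{2}\log (2\pi e\sigma _t^2) + \frac{d}{2} = -\frac{d}{2}\log (2\pi \sigma _t^2). \end{aligned}$$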

If \(\mu \) has the joint density function p(x) with respect to the Lebesgue measure, then the negative joint entropy is defined by

$$\begin{aligned} \mathcal {U}_d(\mu )=\int p(x) \log p(x) {\mathrm {d}x}. \end{aligned}$$

In most cases, the marginal entropy term in \(\mathcal {E}(\mu )\) can be replaced with the joint entropy because the difference \(\mathcal {U}_d(\mu )-\sum _{i=1}^d \mathcal {U}_1(\mu _i)\), referred to as the multi-information function or a measure of multivariate dependence, is invariant within each fiber (e.g., [16] and [36]). However, in some pathological cases the difference diverges. Therefore, it is more appropriate to adopt the marginal entropy.

According to Sklar’s theorem (e.g. [27]), any d-dimensional distribution \(\mu \) is transformed by the probability integral transformation \(T_i(x_i)=\int _{-\infty }^{x_i}{\mathrm {d}\mu }_i\) into the distribution \(T_\sharp \mu \) with uniform marginals unless some \(\mu _i\) has an atom. The resultant distribution \(T_\sharp \mu \) is a copula. The Stein-type distribution we defined is considered as an alternative representation of the copula. Copulas are also characterized by an energy minimization problem. Here, the potential term in (7) is replaced with \(\int V(x){\mathrm {d}\mu }\), where \(V(x)=\infty \) if \(x\notin [0,1]^d\) and 0 otherwise. In parallel, we have to remove the condition \(\int x_i{\mathrm {d}\mu }_i=0\) from the definition of \(\mathcal {P}^2\). Maximum entropy copulas under a given diagonal section are discussed in [5], where, in contrast to the present paper, the marginals are fixed to be uniform.

Proofs based on the theory of optimal transportation

In this section, we prove the three main theorems stated in Sect. 4. The proof is based on the theory of optimal transportation. Necessary facts about one-dimensional optimal transportation are summarized in Appendix A.

Regularity of Stein-type distributions

The Stein-type identity forces regularity of marginal density functions. We first characterize this by an integral equation.

Theorem 5

Let \(\mu \in \mathcal {P}^2\). Denote the marginal density functions of \(\mu \) with respect to the Lebesgue measure by \(p_i(x_i)={\mathrm {d}\mu }_i/{\mathrm {d}x}_i\). Then, \(\mu \) is a Stein-type distribution if and only if it satisfies a set of integral equations

$$\begin{aligned} p_i(a) = \int _a^{\infty } p_i(x_i) m_i(x_i) {\mathrm {d}x}_i, \quad a\in \mathbb {R}, \quad i=1,\ldots ,d, \end{aligned}$$

where \(m_i(x_i)\) denotes the conditional expectation of \(\sum _{j=1}^d x_j\) given \(x_i\) with respect to \(\mu \).


Proof

First, note that \(m_i(x_i)\) is finite \(\mu _i\)-almost everywhere because \(\mu \) belongs to \(\mathcal {P}^2\).

Assume \(\mu \) is Stein-type. For \(-\infty<a<b<\infty \), let \(h_{ab}(x)=(b-a)^{-1}\int _{-\infty }^x I_{(a,b)}(\xi ){\mathrm {d}\xi }\), where \(I_{(a,b)}\) is the indicator function of \((a,b)\). The Stein-type identity for \(h_{ab}\) implies

$$\begin{aligned} \int h_{ab}'(x_i){\mathrm {d}\mu }_i&= \int h_{ab}(x_i)\sum _j x_j{\mathrm {d}\mu }\\&= \int h_{ab}(x_i)m_i(x_i) p_i(x_i){\mathrm {d}x}_i. \end{aligned}$$

Letting \(b\rightarrow a\), we obtain (9).

Conversely, assume (9). The right-hand side of (9) converges to zero as \(a\rightarrow \pm \infty \) because \(\int x_j{\mathrm {d}\mu }_j=0\) for all j. Then, for any bounded and absolutely continuous function f with bounded derivative \(f'\), we obtain the Stein-type identity for f:

$$\begin{aligned} \int f'(a) p_i(a) {\mathrm {d}a}&= \int _{-\infty }^{\infty } f'(a)\left( \int _a^{\infty } p_i(x_i) m_i(x_i) {\mathrm {d}x}_i\right) {\mathrm {d}a}\\&= \int _{-\infty }^{\infty } f(x_i)p_i(x_i) m_i(x_i) {\mathrm {d}x}_i \\&= \int f(x_i) \sum _j x_j {\mathrm {d}\mu }, \end{aligned}$$

where the second equality follows from the integration-by-parts formula. If f is not bounded, then let \(f_M(x)=f(0)+\int _0^xf'(u)1_{\{|u|\le M\}}\mathrm {d}u\) and take \(M\rightarrow \infty \). \(\square \)
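As a sanity check of (9) in the simplest case \(d=1\): for the standard normal distribution, \(m_1(x)=x\), and (9) reads \(p(a)=\int _a^\infty x\,p(x){\mathrm {d}x}\), which holds because \(\int _a^\infty x\varphi (x){\mathrm {d}x}=\varphi (a)\). A minimal numerical confirmation (grid and truncation point are our choices):

```python
import numpy as np

# d = 1, mu = N(0,1): m(x) = x, and (9) becomes p(a) = int_a^inf x p(x) dx
x = np.linspace(-10.0, 10.0, 200_001)
p = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
f = x * p

seg = (f[1:] + f[:-1]) / 2 * np.diff(x)          # per-interval trapezoid areas
tail = np.append(seg[::-1].cumsum()[::-1], 0.0)  # tail[k] ~ int_{x_k}^{10} x p(x) dx
print(np.max(np.abs(tail - p)))                  # close to zero
```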

As a corollary, the regularity of the marginal density functions is established.

Corollary 1

Let \(\mu \) be Stein-type. Then, its marginal density functions \(p_i(x_i)\) are bounded, absolutely continuous, and converge to zero as \(x_i\rightarrow \pm \infty \).


Proof

From formula (9), it is obvious that \(p_i\) is absolutely continuous and bounded by \(\sum _j\int |x_j|{\mathrm {d}\mu }_j<\infty \). We also have \(p_i(x_i)\rightarrow 0\) as \(x_i\rightarrow \pm \infty \) because the right-hand side of (9) vanishes as \(a\rightarrow \pm \infty \). \(\square \)

Although the marginal density function of any Stein-type distribution is absolutely continuous, it can have non-differentiable points, as shown by an example in Sect. 6. Continuous differentiability of \(p_i(x_i)\) follows from regularity of the pairwise copula of \(\mu \) via formula (9); we do not pursue this line of investigation here. On the other hand, we conjecture that the marginal density of any Stein-type distribution is positive everywhere. See Sect. 7 for more details.

The following corollary will be used later.

Corollary 2

Let \(\mu \) be Stein-type. Then, its negative marginal entropy \(\int p_i(x_i)\log p_i(x_i){\mathrm {d}x}_i\) is finite.


Proof

Since the marginal density \(p_i(x_i)\) is bounded, we have \(\int p_i(x_i)\log p_i(x_i){\mathrm {d}x}_i<\infty \). To prove \(\int p_i(x_i)\log p_i(x_i){\mathrm {d}x}_i>-\infty \), we use the non-negativity of the Kullback-Leibler divergence from \(p_i\) to the standard normal density \(\phi (x_i)=e^{-x_i^2/2}/\sqrt{2\pi }\). Indeed,

$$\begin{aligned} \int p_i(x_i)\log p_i(x_i) {\mathrm {d}x}_i&\ge \int p_i(x_i)\log \phi (x_i) {\mathrm {d}x}_i\\&= \int \{-(1/2)\log (2\pi )-x_i^2/2\}{\mathrm {d}\mu }_i\\&>-\infty \end{aligned}$$

because \(\int x_i^2 {\mathrm {d}\mu }_i<\infty \). \(\square \)

Other properties of Stein-type distributions are given in Appendix C.

Variational problem over a fiber of Wasserstein space

Let \(\mathcal {F}\) be a fiber of \(\mathcal {P}^2\) (see Sect. 4 for the definition) and choose two measures \(\mu \) and \(\nu \) in \(\mathcal {F}\), where \(\nu \) is written as \(\nu =T_\sharp \mu \) with some \(T\in \mathcal {T}_{\mathrm{cw}}(\mu )\) by definition. Define the geodesic, which is also referred to as the displacement interpolation [26], from \(\mu \) to \(\nu \) by

$$\begin{aligned} {[}\mu ,\nu ]_t = [(1-t)\mathrm{Id}+tT]_\sharp \mu ,\quad t\in [0,1], \end{aligned}$$

where \(\mathrm{Id}\) denotes the identity map. Based on the one-dimensional optimal transportation, it follows that \([\mu ,\nu ]_t\in \mathcal {F}\) and \([\mu ,\nu ]_t=[\nu ,\mu ]_{1-t}\) for each t. A functional on \(\mathcal {F}\) is said to be displacement convex if it is convex along a geodesic. Refer to [26] and [3] for further details on displacement convexity.
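Displacement convexity along a geodesic can be observed numerically in the Gaussian case, where everything is in closed form: for \(\mu =N(0,S)\) and \(T(x)=Dx\) with \(D=\mathrm{diag}(w)\), the interpolant \([\mu ,\nu ]_t\) is Gaussian with diagonal scaling \((1-t)+tw_i\). A sketch (the matrix S, the vector w, and the grid are arbitrary choices; additive constants in \(\mathcal {E}\) are dropped):

```python
import numpy as np

S = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.0, 0.2],
              [0.3, 0.2, 1.5]])
w = np.array([0.4, 1.7, 0.9])        # diagonal of D, an arbitrary positive vector

def energy(t):
    m = (1 - t) + t * w              # diagonal of (1-t)Id + tD
    # E([mu,nu]_t) up to an additive constant: the marginal entropy term
    # contributes -sum_i log m_i, and the quadratic term is (1/2) m^T S m
    return -np.log(m).sum() + 0.5 * m @ S @ m

ts = np.linspace(0.0, 1.0, 101)
vals = np.array([energy(t) for t in ts])
second_diff = vals[:-2] - 2 * vals[1:-1] + vals[2:]
print(second_diff.min() > 0)         # convex (here strictly) along the geodesic
```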

Although a geodesic between any pair of distributions in \(\mathcal {P}^2\) is similarly defined, we need only geodesics in a common fiber. It is known that a geodesic actually attains the minimum length of a path between two measures with respect to the \(L^2\)-Wasserstein distance (see e.g. [3] and [38]). Here the \(L^2\)-Wasserstein distance is the infimum of \(\left( \int \Vert x-y\Vert ^2 \mathrm {d}\gamma (x,y)\right) ^{1/2}\) over the set of joint distributions \(\gamma \) on \(\mathbb {R}^{2d}\) with the marginals \(\mu \) and \(\nu \). Each fiber \(\mathcal {F}\) is totally geodesic in the sense of [37].

Recall that \(\mu \) is said to have a regular support if its support is the direct product of the supports of marginal distributions.

Lemma 4

Let \(\mathcal {F}\) be a fiber and choose any two distributions \(\mu \) and \(\nu \) in \(\mathcal {F}\), where \(\mu \ne \nu \). Then, \(\mathcal {E}([\mu ,\nu ]_t)\) is convex in t, that is, \(\mathcal {E}\) is displacement convex over \(\mathcal {F}\). Furthermore, \(\mathcal {E}([\mu ,\nu ]_t)\) is strictly convex if one of the following conditions is satisfied:

  1. (i)

    \(\mu \) (and therefore \(\nu \)) has a regular support, or

  2. (ii)

    the supports of \(\mu _i\) and \(\nu _i\) are connected, respectively, for each i.


Proof

Let \(\nu =T_\sharp \mu \) with \(T\in \mathcal {T}_{\mathrm{cw}}(\mu )\). Let \(p_i={\mathrm {d}\mu }_i/{\mathrm {d}x}_i\) be the marginal density of \(\mu \). By the change-of-variable formula (Lemma 7 in Appendix A), we obtain

$$\begin{aligned} \mathcal {E}([\mu ,\nu ]_t) = \sum _i \int p_i(x_i)\log \frac{p_i(x_i)}{(1-t)+tT_i'(x_i)}{\mathrm {d}x}_i + \frac{1}{2}\int \left( \sum _i ((1-t)x_i+tT_i(x_i))\right) ^2 {\mathrm {d}\mu }, \end{aligned}$$

where \(T_i'(x_i)\) is the derivative of \(T_i\) if it exists, and \(T_i'(x_i)=0\) otherwise. Both terms in (10) are convex in t.

Assume (i) and suppose that \(\mathcal {E}([\mu ,\nu ]_t)\) is not strictly convex. Then, there is an interval over which \(\mathcal {E}([\mu ,\nu ]_t)\) is linear, and it is deduced from (10) that \(\sum _{i=1}^d (T_i(x_i)-x_i)=0\), \(\mu \)-almost everywhere. Let I be the set of indices i such that \(\mu _i(T_i(x_i)\ne x_i)>0\). Then, I is not empty because \(T\ne \mathrm{Id}\). For each \(i\in I\), the probability \(\mu _i(T_i(x_i)-x_i>0)\) is positive because \(\int (T_i(x_i)-x_i){\mathrm {d}\mu }_i=0\). By the regular support condition, the probability \(\mu (\sum _{i\in I}(T_i(x_i)-x_i)>0)\) is then positive, which contradicts \(\sum _{i=1}^d (T_i(x_i)-x_i)=0\). Thus, \(\mathcal {E}([\mu ,\nu ]_t)\) must be strictly convex under (i).

Next, assume (ii). Then, \(T_i\) has no discontinuity points. Suppose that \(\mathcal {E}([\mu ,\nu ]_t)\) is not strictly convex. Then, it follows from (10) that \(T_i'(x_i)=1\) and, therefore, \(T_i(x_i)=x_i\) by the connectedness of the support together with the condition \(\int T_i{\mathrm {d}\mu }_i=0\). However, this contradicts \(\mu \ne \nu \). Thus, \(\mathcal {E}([\mu ,\nu ]_t)\) is strictly convex. \(\square \)

Example 3

The strict convexity of \(\mathcal {E}([\mu ,\nu ]_t)\) can fail if neither condition (i) nor condition (ii) in Lemma 4 is satisfied. For example, let \(d=2\) and assume that \(\mu \) is uniformly distributed over the region \(([-1,0]\times [0,1])\cup ([0,1]\times [-1,0])\). Define the map T by \(T_i(x_i)=x_i+1\) if \(x_i>0\), and \(T_i(x_i)=x_i-1\) otherwise for each i. Let \(\nu =T_\sharp \mu \). Then, \(\mathcal {E}([\mu ,\nu ]_t)\) is constant along \(t\in [0,1]\) because \(T_i'(x_i)=1\) and \(\sum _i T_i(x_i)=\sum _i x_i\), \(\mu \)-almost everywhere. In this case, \(\mu _i\) is supported on \([-1,1]\), whereas \(\nu _i\) is supported on \([-2,-1]\cup [1,2]\).

Proof of Theorem 2

Let \(\mu \) be a Stein-type distribution. Corollary 2 implies that \(\mu \) belongs to \(\mathrm{dom}\mathcal {E}\). From the convexity (Lemma 4), it is sufficient to show that

$$\begin{aligned} \left. \frac{{\mathrm {d}}}{{\mathrm {d}t}_+}\mathcal {E}([\mu ,\nu ]_t)\right| _{t=0} \ge 0 \end{aligned}$$

for any \(\nu =T_\sharp \mu \in \mathcal {F}\), where \({\mathrm {d}}/{\mathrm {d}t}_+\) denotes the right derivative. It follows from formula (10) that

$$\begin{aligned} \left. \frac{{\mathrm {d}}}{{\mathrm {d}t}_+}\mathcal {E}([\mu ,\nu ]_t)\right| _{t=0}&= -\sum _i \int p_i(x_i) (T_i'(x_i)-1) {\mathrm {d}x}_i + \sum _i \sum _j \int (T_i(x_i)-x_i) x_j {\mathrm {d}\mu }. \end{aligned}$$

If \(T_i\) is absolutely continuous, the right-hand side vanishes by the Stein-type identity, where the boundedness of the derivatives \(T_i'\) can be assumed by a standard approximation argument, as in the proof of Theorem 5. If \(T_i\) is not absolutely continuous, it can be decomposed into an absolutely continuous part and a discontinuous part as \(T_i=T_i^{\mathrm{ac}}+T_i^{\mathrm{d}}\); see Appendix A. The contribution of \(T_i^{\mathrm{ac}}\) to (11) vanishes due to the Stein-type identity. Since \((T_i^{\mathrm{d}})'=0\) by definition, it is sufficient to prove that \(\sum _j \int T_i^{\mathrm{d}}(x_i) x_j {\mathrm {d}\mu }\ge 0\) for each i. We can take a sequence \(\{f_{i,n}\}_{n=1}^{\infty }\) of non-decreasing differentiable functions with bounded derivatives such that \(f_{i,n}(x_i)\) converges to \(T_i^{\mathrm{d}}(x_i)\) \(\mu \)-almost everywhere. More specifically, the step function \(I_{[\xi ,\infty )}(x_i)\) at each jump point \(\xi \in \mathbb {R}\) is approximated by the logistic function \(1/(1+\exp (-n(x_i-\xi )))\). Then, by Lebesgue’s dominated convergence theorem and the Stein-type identity, we obtain

$$\begin{aligned} \sum _j \int T_i^{\mathrm{d}}(x_i) x_j {\mathrm {d}\mu }&= \lim _{n\rightarrow \infty }\sum _j \int f_{i,n}(x_i) x_j {\mathrm {d}\mu }\\&= \lim _{n\rightarrow \infty }\int f_{i,n}'(x_i) {\mathrm {d}\mu }\\&\ge 0. \end{aligned}$$

Conversely, assume that \(\mathcal {E}(T_\sharp \mu )\) is minimized at \(T=\mathrm{Id}\). Fix \(1\le i\le d\), and let f be an absolutely continuous function with bounded derivative \(f'\). For sufficiently small \(\varepsilon >0\), the two maps \(T(x)=x\pm \varepsilon f(x_i)e_i\), where \(e_i\) is the i-th unit vector, belong to \(\mathcal {T}_\mathrm{cw}(\mu )\). Thus, the right derivative (11) is nonnegative for both choices of sign, hence zero, and \(\mu \) satisfies the Stein-type identity. \(\square \)

Proof of Theorem 3

Assume that \(\mu \) has a regular support and admits a Stein-type transformation T. Then, Theorem 2 implies that \(T_\sharp \mu \) minimizes \(\mathcal {E}\) over the fiber \(\mathcal {F}_{\mu }\). Moreover, it is deduced from Lemma 4 that \(\mathcal {E}\) is strictly convex over \(\mathcal {F}_{\mu }\). Thus, the minimizer is unique. \(\square \)

Proof of Theorem 4

Assume that \(\mu \) is copositive. Denote the functional \(\mathcal {E}\) restricted to the fiber \(\mathcal {F}_{\mu }\) by \(\mathcal {E}_{\mu }\). From Theorem 2, it is sufficient to show that \(\mathcal {E}_{\mu }\) has a minimum point. We first show that \(\mathcal {E}_{\mu }\) is bounded from below and that the level set \(\{\nu \mid \mathcal {E}_{\mu }(\nu )\le c\}\) for each \(c\in \mathbb {R}\) is tight. For any \(\nu \in \mathcal {F}_{\mu }\), the copositivity condition implies

$$\begin{aligned} \mathcal {E}_{\mu }(\nu ) \ge \sum _{i=1}^d \left\{ \int q_i(x_i) \log q_i(x_i) {\mathrm {d}x}_i + \frac{\beta }{2} \int x_i^2 {\mathrm {d}\nu }_i \right\} , \end{aligned}$$

where \(q_i={\mathrm {d}\nu }_i/{\mathrm {d}x}_i\) and \(\beta =\beta (\nu )=\beta (\mu )>0\). We obtain

$$\begin{aligned} \int q_i(x_i)\log q_i(x_i){\mathrm {d}x}_i&= \int q_i(x_i)\log \frac{q_i(x_i)}{\sqrt{\beta /(4\pi )}e^{-\beta x_i^2/4}}{\mathrm {d}x}_i - \frac{\beta }{4} \int x_i^2{\mathrm {d}\nu }_i + \frac{1}{2}\log \frac{\beta }{4\pi } \\&\ge - \frac{\beta }{4} \int x_i^2{\mathrm {d}\nu }_i + \frac{1}{2}\log \frac{\beta }{4\pi }, \end{aligned}$$

where the last inequality follows from the nonnegativity of the Kullback-Leibler divergence. Then, \(\mathcal {E}_{\mu }\) is bounded from below as

$$\begin{aligned} \mathcal {E}_{\mu }(\nu ) \ge C + \frac{\beta }{4}\sum _{i=1}^d \int x_i^2 {\mathrm {d}\nu }_i, \end{aligned}$$

where C is a constant independent of \(\nu \). This inequality also implies that the level set \(\{\nu \mid \mathcal {E}_{\mu }(\nu )\le c\}\) is tight.

By the tightness of the level sets and Prokhorov’s theorem, there exists a weakly convergent sequence \(\nu _k\) such that \(\mathcal {E}_{\mu }(\nu _k)\) converges to \(\inf \mathcal {E}_{\mu }\). Let \(\nu _*\) be the weak limit. Then, Corollary 3.5 of [26] shows that \(\nu _*\in \mathcal {P}^2\) and \(\mathcal {E}_{\mu }(\nu _*)\le \lim _k \mathcal {E}_{\mu }(\nu _k)\). The distribution \(\nu _*\) is a minimum point of \(\mathcal {E}_{\mu }\). This completes the proof. \(\square \)

Piecewise uniform densities

In this section, it is shown that if \(\mu \) has a piecewise uniform density function, then the Stein-type transformation of \(\mu \) is obtained by finite-dimensional optimization. Here, we do not impose the zero-mean condition \(\int x_i{\mathrm{d}\mu }=0\) on \(\mu \); we can always translate \(\mu \) to have zero mean if necessary.

We say that a probability density function c(u) on \([0,1]^d\) is piecewise uniform if its two-dimensional marginal densities \(c_{ij}\) (\(1\le i<j\le d\)) are written as

$$\begin{aligned} c_{ij}(u_i,u_j) = n^2\pi _{ab}^{ij}\quad \text{ if }\quad (u_i,u_j)\in (\tfrac{a-1}{n},\tfrac{a}{n}]\times (\tfrac{b-1}{n},\tfrac{b}{n}], \quad a,b\in \{1,\ldots ,n\}, \end{aligned}$$

for some n, where \(\pi _{ab}^{ij}\) is a positive number such that

$$\begin{aligned} \sum _{a=1}^n \sum _{b=1}^n \pi _{ab}^{ij} = 1. \end{aligned}$$

Let \(\pi _a^i=\sum _{b=1}^n \pi _{ab}^{ij}\), which does not depend on j because the \(c_{ij}\) are marginals of a common density. Although c is not a copula density unless \(\pi _a^i=1/n\) for all i and a, it is transformed by a piecewise linear transform into a copula density. Then Corollary 3 in Appendix E, together with Theorem 4, guarantees the existence of a Stein-type transformation as long as the support of c(u) is \([0,1]^d\).

By solving Equation (9), we obtain an expression of the Stein-type transformation of c as follows. Denote the cumulative distribution function and density function of the standard normal distribution by \(\varPhi \) and \(\phi \), respectively.

Lemma 5

Suppose that c(u) satisfies (12) and its support is \([0,1]^d\). Let p be the unique Stein-type density transformed from c. Then, there exist real constants \(\alpha _{1i},\ldots ,\alpha _{ni}\) and \(\xi _{1i}<\cdots <\xi _{n-1,i}\) such that

$$\begin{aligned} p_i(x_i) = \pi _a^i\frac{\phi (x_i-\alpha _{ai})}{Z_{ai}} \quad \text{ for } \quad \xi _{a-1,i}<x_i\le \xi _{ai} \end{aligned}$$

where \(\xi _{0i}=-\infty \), \(\xi _{ni}=\infty \), and \(Z_{ai}= \varPhi (\xi _{ai}-\alpha _{ai}) - \varPhi (\xi _{a-1,i}-\alpha _{ai})\). The Stein-type transformation is

$$\begin{aligned} x_i = T_i(u_i) = \alpha _{ai} + \varPhi ^{-1}\left( \varPhi (\xi _{a-1,i}-\alpha _{ai})+n(u_i-\tfrac{a-1}{n})Z_{ai}\right) ,\quad u_i\in (\tfrac{a-1}{n},\tfrac{a}{n}], \end{aligned}$$

and the two-dimensional marginal density is

$$\begin{aligned} p_{ij}(x_i,x_j) = \pi _{ab}^{ij}\frac{\phi (x_i-\alpha _{ai})}{Z_{ai}}\frac{\phi (x_j-\alpha _{bj})}{Z_{bj}}, \quad (x_i,x_j)\in (\xi _{a-1,i},\xi _{ai}]\times (\xi _{b-1,j},\xi _{bj}]. \end{aligned}$$

Furthermore, the following identity is satisfied:

$$\begin{aligned} \alpha _{ai} = - \sum _{j\ne i}\sum _{b=1}^n \frac{\pi _{ab}^{ij}}{\pi _a^i} \int _{\xi _{b-1,j}}^{\xi _{bj}} \frac{x_j\phi (x_j-\alpha _{bj})}{Z_{bj}}{\mathrm {d}x}_j. \end{aligned}$$


Equation (9) implies that \(\partial _i p_i(x_i) = -(x_i + \sum _{j\ne i}E[X_j|x_i])p_i(x_i)\), where \(\partial _i=\partial /\partial x_i\). Since the conditional expectation \(E[X_j|x_i]\) has to be piecewise constant, \(p_i(x_i)\) is piecewise Gaussian up to a normalizing constant. Since the mass of each piece is preserved under a coordinate-wise transformation, we obtain the form (13). Then, the unique monotone transformation (14) is derived from \(c_i(u_i){\mathrm {d}u}_i=p_i(x_i){\mathrm {d}x}_i\). Equation (15) results from the transformation of \(c_{ij}(u_i,u_j)\). Finally, Equation (16) is obtained from \(\partial _i \log p_i(x_i) = -(x_i+\sum _{j\ne i}E[X_j|x_i])\). \(\square \)
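Formula (14) can be evaluated directly once the parameters are given. The sketch below uses illustrative (not fitted) values of \(\alpha _{ai}\) and \(\xi _{ai}\) for a single coordinate with \(n=2\), and checks two properties that hold for any parameter choice: the map sends the cell endpoint \(a/n\) to the break point \(\xi _{ai}\), and it is monotone.

```python
import math

def Phi(x):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(p, lo=-12.0, hi=12.0):
    # invert the standard normal cdf by bisection (sufficient for a sketch)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# illustrative parameters for one coordinate with n = 2 pieces
# (alpha and xi are NOT fitted values, just an example)
n = 2
alpha = [-0.8, 0.8]                 # alpha_{1i}, alpha_{2i}
xi = [-math.inf, 0.3, math.inf]     # xi_{0i}, xi_{1i}, xi_{2i}

def T(u):
    # piecewise-defined monotone map (14): u in ((a-1)/n, a/n]
    a = min(int(math.ceil(u * n)), n)     # cell index 1..n
    al, lo, hi = alpha[a - 1], xi[a - 1], xi[a]
    Z = Phi(hi - al) - Phi(lo - al)       # Z_{ai} as in (13)
    return al + Phi_inv(Phi(lo - al) + n * (u - (a - 1) / n) * Z)

# T maps the cell endpoint a/n to the break point xi_{ai} ...
assert abs(T(0.5) - xi[1]) < 1e-6
# ... and is increasing
vals = [T(0.05 * k) for k in range(1, 20)]
assert all(v1 < v2 for v1, v2 in zip(vals, vals[1:]))
```

Note that the continuity of the map T at the cell endpoints holds by construction for any \(\alpha \), \(\xi \); it is the continuity of the density (13) that singles out the Stein-type parameters.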

The parameters \(\alpha _{ai}\) and \(\xi _{ai}\) are determined by the continuity of (13) at \(x_i=\xi _{ai}\) and the identity (16). However, instead of solving the simultaneous equations directly, we adopt an optimization approach.

Assume the density of a distribution \(\mu \) obeys the parametric form given by Equation (13). Then, the energy function \(\mathcal {E}(\mu )\) defined in Sect. 5 is a function of \(\alpha \) and \(\xi \), which is denoted by \(F(\alpha ,\xi )\) and is obtained as follows:

$$\begin{aligned} F(\alpha ,\xi )&= \sum _i \int p_i(x_i)\log p_i(x_i) {\mathrm {d}x}_i + \frac{1}{2}\sum _i \int x_i^2 p_i(x_i){\mathrm {d}x}_i + \sum _{i<j} \int x_ix_jp_{ij}(x_i,x_j) {\mathrm {d}x}_i{\mathrm {d}x}_j \\&= \sum _i\sum _a \pi _a^i\int _{\xi _{a-1,i}}^{\xi _{ai}} \frac{\phi (x_i-\alpha _{ai})}{Z_{ai}}\left( \log \frac{\pi _a^i}{\sqrt{2\pi }} -\frac{(x_i-\alpha _{ai})^2}{2} - \log Z_{ai} +\frac{x_i^2}{2} \right) {\mathrm {d}x}_i \\&\quad + \sum _{i<j}\sum _a\sum _b \frac{\pi _{ab}^{ij}}{Z_{ai}Z_{bj}}\int _{\xi _{a-1,i}}^{\xi _{ai}} \int _{\xi _{b-1,j}}^{\xi _{bj}} x_ix_j\phi (x_i-\alpha _{ai})\phi (x_j-\alpha _{bj}){\mathrm {d}x}_i{\mathrm {d}x}_j \\&= \sum _i\sum _a \pi _a^i \log \frac{\pi _a^i}{\sqrt{2\pi }} + \sum _i\sum _a \pi _a^i \left( - \frac{\alpha _{ai}^2}{2} + \alpha _{ai}M_{ai} - \log Z_{ai} \right) \\&\quad + \sum _{i<j}\sum _a\sum _b \pi _{ab}^{ij} M_{ai}M_{bj}, \end{aligned}$$

where


$$\begin{aligned} M_{ai}&= \frac{1}{Z_{ai}}\int _{\xi _{a-1,i}}^{\xi _{ai}} x_i \phi (x_i-\alpha _{ai}){\mathrm {d}x}_i \\&= \alpha _{ai} + \frac{1}{Z_{ai}}\left( -\phi (\xi _{ai}-\alpha _{ai}) + \phi (\xi _{a-1,i}-\alpha _{ai}) \right) . \end{aligned}$$
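The closed form of \(M_{ai}\) is easy to verify numerically. The sketch below (with arbitrary illustrative values of \(\alpha _{ai}\), \(\xi _{a-1,i}\), \(\xi _{ai}\)) checks it against direct integration, and also that \(M_{ai}\) is increasing in \(\alpha _{ai}\), i.e., \(D_1 M_{ai}>0\), a fact used later in the proof of Theorem 6.

```python
import math

def phi(x):
    # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    # standard normal cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Z(alpha, lo, hi):
    # normalizing constant Z_{ai} of the truncated Gaussian piece
    return Phi(hi - alpha) - Phi(lo - alpha)

def M(alpha, lo, hi):
    # closed-form mean M_{ai} of the truncated Gaussian piece
    return alpha + (phi(lo - alpha) - phi(hi - alpha)) / Z(alpha, lo, hi)

# check the closed form against a midpoint Riemann sum of
# (1/Z_{ai}) int x phi(x - alpha) dx over (xi_{a-1,i}, xi_{ai}]
alpha, lo, hi = 0.4, -1.0, 2.0   # illustrative values
N = 20000
h = (hi - lo) / N
num = sum((lo + (k + 0.5) * h) * phi(lo + (k + 0.5) * h - alpha)
          for k in range(N)) * h
assert abs(M(alpha, lo, hi) - num / Z(alpha, lo, hi)) < 1e-6

# M_{ai} is increasing in alpha_{ai}, i.e., D_1 M_{ai} > 0
assert M(0.5, lo, hi) > M(0.4, lo, hi) > M(0.3, lo, hi)
```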

Since \(Z_{ai}\) and \(M_{ai}\) are functions of the three parameters \(\alpha _{ai}\), \(\xi _{ai}\), and \(\xi _{a-1,i}\), we denote the corresponding partial derivatives by \(D_1\), \(D_2\), and \(D_3\), respectively. The derivatives of F are

$$\begin{aligned} \frac{\partial F}{\partial \alpha _{ai}}&= \pi _a^i \left( -\alpha _{ai}+M_{ai}+\alpha _{ai}D_1M_{ai} - \frac{D_1 Z_{ai}}{Z_{ai}} \right) + \sum _{j\ne i} \sum _b \pi _{ab}^{ij} (D_1 M_{ai})M_{bj} \nonumber \\&= \pi _a^i\left( \alpha _{ai} + \sum _{j\ne i}\sum _b \frac{\pi _{ab}^{ij}}{\pi _a^i}M_{bj} \right) (D_1 M_{ai}), \end{aligned}$$
$$\begin{aligned} \frac{\partial F}{\partial \xi _{ai}}&= \pi _a^i\left( \alpha _{ai}D_2M_{ai} - \frac{D_2 Z_{ai}}{Z_{ai}} \right) + \pi _{a+1}^i\left( \alpha _{a+1,i}D_3M_{a+1,i} - \frac{D_3 Z_{a+1,i}}{Z_{a+1,i}} \right) \nonumber \\&\quad + \sum _{j\ne i} \sum _b \left\{ \pi _{ab}^{ij}(D_2 M_{ai}) +\pi _{a+1,b}^{ij}(D_3 M_{a+1,i}) \right\} M_{bj} \nonumber \\&= \pi _a^i\left( \alpha _{ai} + \sum _{j\ne i}\sum _b \frac{\pi _{ab}^{ij}}{\pi _a^i} M_{bj}\right) D_2 M_{ai} + \pi _{a+1}^i\left( \alpha _{a+1,i} + \sum _{j\ne i}\sum _b \frac{\pi _{a+1,b}^{ij}}{\pi _{a+1}^i} M_{bj}\right) D_3 M_{a+1,i} \nonumber \\&\quad - \pi _a^i\frac{D_2 Z_{ai}}{Z_{ai}} - \pi _{a+1}^i\frac{D_3 Z_{a+1,i}}{Z_{a+1,i}}. \end{aligned}$$

By using these formulas, we obtain the following theorem.

Theorem 6

A stationary point of F together with formula (13) provides the global minimum point of the energy functional \(\mathcal {E}(\mu )\) over the fiber. In other words, F has a unique stationary point that corresponds to the Stein-type density.


Since \(M_{ai}=\int _{\xi _{a-1,i}}^{\xi _{ai}} x_i\phi (x_i-\alpha _{ai}){\mathrm {d}x}_i/Z_{ai}\) is the expectation parameter of the exponential family \(\phi (x_i-\alpha _{ai})/Z_{ai}\) on \((\xi _{a-1,i},\xi _{ai}]\), it is an increasing function of \(\alpha _{ai}\) (e.g., [23]). Therefore, \(D_1 M_{ai}>0\). Thus, the stationary condition \(\partial F/\partial \alpha _{ai}=0\) is equivalent to

$$\begin{aligned} \alpha _{ai} + \sum _{j\ne i}\sum _b \frac{\pi _{ab}^{ij}}{\pi _a^i} M_{bj} = 0, \end{aligned}$$

which is equivalent to (16) and solves the integral equation (9) except at boundary points \(\xi _{ai}\). Furthermore, substituting this relation into (18), we obtain

$$\begin{aligned} \frac{\partial F}{\partial \xi _{ai}}&= -\pi _a^i\frac{D_2 Z_{ai}}{Z_{ai}} -\pi _{a+1}^i\frac{D_3 Z_{a+1,i}}{Z_{a+1,i}} \\&= -p_i(\xi _{ai}-) + p_i(\xi _{ai}+). \end{aligned}$$

Therefore, \(\partial F/\partial \xi _{ai}=0\) is equivalent to the continuity of \(p_i\) at \(\xi _{ai}\). Then, the density p is the Stein-type density, which is unique due to Theorem 3. \(\square \)

The minimization of \(F(\alpha ,\xi )\) over \(\alpha _{ai}\in \mathbb {R}\) and \(\xi _{1i}<\cdots <\xi _{n-1,i}\) can be performed with a standard optimization package (e.g., the function optim in R [29]) once the differences \(\tau _{ai}=\xi _{ai}-\xi _{a-1,i}\) for \(2\le a\le n-1\), rather than \(\xi _{ai}\) themselves, are used as coordinates, so that the ordering constraint reduces to the positivity of \(\tau _{ai}\).
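As an illustration, the closed form of F can be minimized directly for a toy case with \(d=2\) and \(n=2\), where each coordinate has a single unconstrained break point and the \(\tau \) reparametrization is unnecessary. The sketch below uses scipy.optimize.minimize as a Python counterpart of optim; the cell probabilities \(\pi _{ab}\) are an arbitrary illustrative choice, not taken from the paper.

```python
import math
from scipy.optimize import minimize

SQ2PI = math.sqrt(2.0 * math.pi)
def phi(x): return math.exp(-0.5 * x * x) / SQ2PI
def Phi(x): return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# illustrative 2x2 cell probabilities pi_{ab} (d = 2, n = 2);
# an arbitrary positive choice, not from the paper's examples
pi = [[0.35, 0.15], [0.15, 0.35]]
pa = [sum(row) for row in pi]   # pi_a^i (same for both coordinates here)

def pieces(al, x):
    # per-piece Z_{ai} and M_{ai} for one coordinate, breaks (-inf, x, inf)
    xs = [-math.inf, x, math.inf]
    Zs, Ms = [], []
    for a in range(2):
        Zv = Phi(xs[a + 1] - al[a]) - Phi(xs[a] - al[a])
        Mv = al[a] + (phi(xs[a] - al[a]) - phi(xs[a + 1] - al[a])) / Zv
        Zs.append(Zv); Ms.append(Mv)
    return Zs, Ms

def F(v):
    # closed-form energy F(alpha, xi) for d = 2, n = 2
    a1, a2, x1, x2 = v[0:2], v[2:4], v[4], v[5]
    (Z1, M1), (Z2, M2) = pieces(a1, x1), pieces(a2, x2)
    val = 0.0
    for al, Zc, Mc in ((a1, Z1, M1), (a2, Z2, M2)):
        for a in range(2):
            val += pa[a] * (math.log(pa[a] / SQ2PI)
                            - 0.5 * al[a] ** 2 + al[a] * Mc[a]
                            - math.log(Zc[a]))
    for a in range(2):
        for b in range(2):
            val += pi[a][b] * M1[a] * M2[b]
    return val

res = minimize(F, [0.0] * 6, method="BFGS")
a1, a2, x1, x2 = res.x[0:2], res.x[2:4], res.x[4], res.x[5]
(Z1, M1), (Z2, M2) = pieces(a1, x1), pieces(a2, x2)
# stationarity in alpha is the identity (16)
for a in range(2):
    assert abs(a1[a] + sum(pi[a][b] / pa[a] * M2[b] for b in range(2))) < 1e-3
# stationarity in xi is continuity of p_i at the break point
assert abs(pa[0] * phi(x1 - a1[0]) / Z1[0]
           - pa[1] * phi(x1 - a1[1]) / Z1[1]) < 1e-3
```

At the minimizer, the two asserted conditions are precisely the stationarity relations derived above: identity (16) in \(\alpha \) and the continuity of \(p_i\) at the break point in \(\xi \), as in the proof of Theorem 6.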

Example 4

We numerically obtain the Stein-type densities of discretized copulas. The result is shown in Fig. 1. The copula used here is the Clayton copula

$$\begin{aligned} C_{\theta }(x_1,x_2) = \left[ \max (x_1^{-\theta }+x_2^{-\theta }-1,0) \right] ^{-1/\theta }. \end{aligned}$$

The discretized copula density of \(n\times n\) cells is given by (12) with

$$\begin{aligned} \pi _{ab}^{12} = {\textstyle C_{\theta }(\frac{a}{n},\frac{b}{n})-C_{\theta }(\frac{a-1}{n},\frac{b}{n}) -C_{\theta }(\frac{a}{n},\frac{b-1}{n})+C_{\theta }(\frac{a-1}{n},\frac{b-1}{n}) }. \end{aligned}$$
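The cell probabilities can be computed directly from the copula by this inclusion-exclusion formula. A minimal sketch, checking that the resulting \(\pi _{ab}^{12}\) sum to one and have uniform margins (so that the discretization, scaled by \(n^2\), is again a piecewise uniform copula density):

```python
def clayton(u, v, theta):
    # Clayton copula C_theta; C(u, 0) = C(0, v) = 0 is handled explicitly,
    # and the max(., 0) clause only matters for theta < 0
    if u == 0.0 or v == 0.0:
        return 0.0
    return max(u ** (-theta) + v ** (-theta) - 1.0, 0.0) ** (-1.0 / theta)

def discretize(theta, n):
    # cell probabilities pi_{ab}^{12} via the inclusion-exclusion formula
    C = lambda a, b: clayton(a / n, b / n, theta)
    return [[C(a, b) - C(a - 1, b) - C(a, b - 1) + C(a - 1, b - 1)
             for b in range(1, n + 1)] for a in range(1, n + 1)]

pi = discretize(theta=2.0, n=10)
# total mass is 1 and every margin is uniform (each row sums to 1/n)
assert abs(sum(map(sum, pi)) - 1.0) < 1e-12
for a in range(10):
    assert abs(sum(pi[a]) - 0.1) < 1e-12
```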
Fig. 1

Coordinate-wise transformation for the two-dimensional discretized Clayton copula with \(\theta =2\) and \(n=10\) is shown. The joint density function (top-left) is transformed into a Stein-type density function (top-right) by the inverse function of a cumulative distribution (bottom-left), the density of which is piecewise Gaussian (bottom-right). The dashed curve in the bottom figures represents the standard normal distribution.


In the present paper, we showed that a class of multi-dimensional distributions has a unique representation via the Stein-type identity. Now, we describe areas for future study and some open problems.

In Sect. 5.1, we derived some properties of Stein-type distributions. The author could not find any counterexample to the following conjecture.

Conjecture 1

The marginal density function of any Stein-type distribution is positive everywhere.

A partial answer to Conjecture 1 is given in the following lemma.

Lemma 6

Let \(\mu \) be a Stein-type distribution. If the copula of \(\mu \) has pair-wise marginal densities \(c_{ij}\) such that

$$\begin{aligned} D = \sup _{i,j:i\ne j} \sup _{u_i\in [0,1]}\int _0^1 c_{ij}(u_i,u_j)^2 {\mathrm {d}u}_j < \infty , \end{aligned}$$

then each marginal density \(p_i\) of \(\mu \) is positive everywhere. In particular, if the copula density of \(\mu \) is bounded, then the same consequence follows.


The density \(p_i(x_i)\) satisfies \(\partial _i p_i(x_i) + p_i(x_i) m_i(x_i)=0\) with \(m_i(x_i)=E[\sum _j X_j|x_i]\) by Theorem 5. The conditional expectation satisfies

$$\begin{aligned} |\mathrm{E}[X_j|x_i]| = \left| \int x_j c_{ij}(F_i(x_i),F_j(x_j)) p_j(x_j) {\mathrm {d}x}_j \right| \le (D\mathrm{E}[X_j^2])^{1/2}, \end{aligned}$$

where \(F_i(x_i)=\int _{-\infty }^{x_i}p_i(\xi ){\mathrm {d}\xi }\) and the inequality follows from the Cauchy-Schwarz inequality. Let \(D_*=\sum _{j\ne i}(D \mathrm{E}[X_j^2])^{1/2}\). Then, we obtain the inequality

$$\begin{aligned} -(x_i + D_*) p_i(x_i)\le \partial _i p_i(x_i) \le -(x_i-D_*)p_i(x_i). \end{aligned}$$

Let \(a\in \mathbb {R}\) be a point at which \(p_i(a)>0\). Then, Gronwall’s lemma shows that \(p_i(x_i)\ge p_i(a) e^{-(x_i+D_*)^2/2+(a+D_*)^2/2}>0\) for \(x_i>a\), and similarly \(p_i(x_i)>0\) for \(x_i<a\). \(\square \)

If Conjecture 1 is positively solved, then the following conjecture, which is based on Theorem 3, also holds by Lemma 4 (ii).

Conjecture 2

A Stein-type transformation is unique if it exists.

We state a relevant conjecture that is the converse of Theorem 4.

Conjecture 3

A distribution is copositive if it has a Stein-type transformation.

In Sect. 5, we showed that a Stein-type distribution is characterized by the stationary point of an energy functional \(\mathcal {E}\) over a fiber \(\mathcal {F}\). From the perspective of optimal transportation, we can construct the gradient flow of the energy functional in the \(L^2\)-Wasserstein space ([19, 28] and [38]). The formal equation is as follows:

$$\begin{aligned} \partial _tp_i = \partial _i(\partial _ip_i + m_i p_i), \quad t\ge 0,\quad i=1,\ldots ,d, \end{aligned}$$

where \(m_i(x_i)=\mathrm{E}[\sum _j X_j | x_i]\). Although this appears to be a system of independent one-dimensional Fokker-Planck equations, the equations interact with each other through \(m_i(x_i)\), which depends on the joint law. The physical meaning of the equation is not clear. From Theorem 5, it follows that each Stein-type density is a stationary point of (19). The time evolution itself will be of theoretical interest.

In Appendix E, we presented sufficient conditions for the copositivity of distributions. In particular, a Gaussian distribution is copositive if its covariance matrix is nondegenerate. Conversely, if a Gaussian distribution is copositive, then the covariance matrix must, by definition, be strictly copositive (see Equation (1)). The following conjecture naturally arises but is not proven. It is positively solved if Conjecture 3 is correct, due to Lemma 2.
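At the matrix level, the diagonal scaling of Theorem 1 that underlies the Gaussian case can be checked numerically through the stationary condition of the convex function \(\psi (w)\) recalled in the introduction. Below is a minimal sketch, assuming an arbitrary illustrative positive-definite matrix S (not taken from the paper) and plain gradient descent:

```python
# Minimal sketch: solve the diagonal scaling of Theorem 1 for an
# illustrative positive-definite matrix S by gradient descent on the
# convex function psi(w) = sum_i(-log w_i) + (1/2) w^T S w.
S = [[1.0, 0.3, 0.2],
     [0.3, 1.0, 0.4],
     [0.2, 0.4, 1.0]]
d = len(S)
w = [1.0] * d
step = 0.05
for _ in range(20000):
    Sw = [sum(S[i][j] * w[j] for j in range(d)) for i in range(d)]
    # grad psi = -1/w_i + (Sw)_i; stationarity means w_i (Sw)_i = 1
    grad = [-1.0 / w[i] + Sw[i] for i in range(d)]
    w = [w[i] - step * grad[i] for i in range(d)]

# each row of DSD sums to one, i.e., sum_j w_i S_ij w_j = 1
row_sums = [w[i] * sum(S[i][j] * w[j] for j in range(d)) for i in range(d)]
assert all(abs(s - 1.0) < 1e-6 for s in row_sums)
```

The stationary condition \(-1/w_i+(Sw)_i=0\) is exactly \(\sum _j w_iS_{ij}w_j=1\), i.e., Equation (2) with \(D=\mathrm{diag}(w)\).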

Conjecture 4

A Gaussian distribution is copositive if the covariance matrix is strictly copositive.

As stated in Sect. E.4, tail-dependent copulas do not satisfy the sufficient condition in Theorem 7. The copositivity of tail-dependent copulas remains unclear.

In the present paper, we did not consider statistical models that explain a given data set. A statistical model involving a Stein-type distribution is essentially equivalent to a copula model because such models correspond to each other through coordinate-wise transformations, whereas the marginal distributions are not of much interest in copula modelling. The class given in Example 1 provides a flexible model because the distribution of \(U_i\)’s in the construction can be selected arbitrarily.

Finally, it is expected that there is a coordinate-wise transformation to satisfy

$$\begin{aligned} \mathrm{E}[f(X_i)g(X_1+\cdots +X_d)] > 0, \quad i=1,\ldots ,d, \end{aligned}$$

for any monotone increasing functions f and g. If g is fixed, an argument similar to that in the present paper is possible (see [34]).