Coordinate-wise transformation of probability distributions to achieve a Stein-type identity

It is shown that for any given multi-dimensional probability distribution with regularity conditions, there exists a unique coordinate-wise transformation such that the transformed distribution satisfies a Stein-type identity. A sufficient condition for the existence is referred to as copositivity of distributions. The proof is based on an energy minimization problem over a totally geodesic subset of the Wasserstein space. The result is considered as an alternative to Sklar’s theorem regarding copulas, and is also interpreted as a generalization of a diagonal scaling theorem. The Stein-type identity is applied to a rating problem of multivariate data. A numerical procedure for piece-wise uniform densities is provided. Some open problems are also discussed.


Introduction
One of the important concepts in information geometry is the maximum entropy principle. The unique probability distribution that maximizes entropy under fixed moments is characterized by an exponential family [15]. The exponential and mixture families induce a dually flat structure in the space of probability distributions (e.g. [2]).
We establish a version of the maximum entropy principle over the family of distributions generated by coordinate-wise transformations. The coordinate-wise transformations naturally arise in copula theory to make distributions have uniform marginals (e.g. [27]) and are considered as a subset of optimal transport maps.
Before going into details, we first consider a linear analogue of the problem. Marshall and Olkin [25] proved the following diagonal scaling theorem on matrices.

Theorem 1 ([25]) Let $S = (S_{ij}) \in \mathbb{R}^{d\times d}$ be a positive semi-definite matrix and assume that $S$ is strictly copositive in the sense that
$$w^\top S w > 0 \quad \text{for all } w \in [0,\infty)^d \setminus \{0\}. \qquad (1)$$
Then, there exists a unique positive-definite diagonal matrix $D$ such that
$$\sum_{j=1}^d (DSD)_{ij} = 1 \qquad (2)$$
for each $i \in \{1, \dots, d\}$.
Note that (1) is satisfied if $S$ is positive definite. The equation (2) is, as pointed out by [20], the stationarity condition of the convex function
$$\varphi(w) = \frac{1}{2} w^\top S w - \sum_{i=1}^d \log w_i, \qquad (3)$$
where $w = (w_1, \dots, w_d)$ is the diagonal component of $D$.
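The scaling can be sketched numerically. The snippet below runs plain gradient descent on the convex objective $\frac{1}{2}w^\top S w - \sum_i \log w_i$ for a hypothetical $2\times 2$ matrix and checks that the row sums of $DSD$ come out as one; this is an illustrative sketch, not the procedure of [25] or [20].

```python
# Hypothetical strictly copositive example matrix S.
S = [[1.0, 0.5],
     [0.5, 1.0]]
d = len(S)

def grad(w):
    # Gradient of the convex objective (1/2) w'Sw - sum_i log w_i:
    # component i is (Sw)_i - 1/w_i, which vanishes exactly when
    # w_i * (Sw)_i = 1, i.e. when the i-th row sum of DSD equals one.
    return [sum(S[i][j] * w[j] for j in range(d)) - 1.0 / w[i]
            for i in range(d)]

w = [1.0] * d
for _ in range(5000):               # plain gradient descent; convexity
    g = grad(w)                     # makes small constant steps converge
    w = [w[i] - 0.05 * g[i] for i in range(d)]

row_sums = [sum(w[i] * S[i][j] * w[j] for j in range(d)) for i in range(d)]
print(row_sums)  # each entry is approximately 1
```

For this symmetric example the limit is $w_1 = w_2 = \sqrt{2/3}$, so the diagonal of $D$ is determined by the off-diagonal dependence, not only by the variances.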
Theorem 1 is interpreted in a probabilistic framework as follows. Let $\mu$ be a normal distribution on $\mathbb{R}^d$ with mean zero and covariance matrix $S_{ij} = \int x_i x_j \, d\mu$. Let $\nu$ be the push-forward measure of $\mu$ by the linear transformation $x \mapsto Dx$ in $\mathbb{R}^d$. Since the covariance matrix of $\nu$ is $DSD$, the equation (2) is rewritten as
$$\sum_{j=1}^d \int x_i x_j \, d\nu = 1 \qquad (4)$$
for each $i \in \{1, \dots, d\}$. In other words, each coordinate $x_i$ and the sum $\sum_j x_j$ have unit covariance under the law $\nu$.
In the present paper, we provide a nonlinear analogue of the theorem. We admit a nonlinear monotone coordinate-wise transformation to achieve a stronger condition than (4). The condition will be referred to as the Stein-type identity (see Sect. 2 for the precise definition). Under some mild conditions on μ, it is shown that there exists such a unique transformation and it minimizes a free energy functional like (3), which has an entropy term. The space we use in the proof is the Wasserstein space, a distance space induced from optimal transportation. A key observation is that our functional is displacement convex in the sense of [26]. Refer to [30,38] for comprehensive studies of optimal transportation and its applications.
Under the Stein-type identity, the sum of variables has positive correlation with each variable. This property is applied to a rating problem of multivariate data in Sect. 3.
As is well known, Sklar's theorem (see, e.g., [27]) states that any multi-dimensional distribution is transformed by the probability integral transformation into a distribution with uniform marginals. The transformed distribution is called a copula. A linear analogue of Sklar's theorem is that for any covariance matrix S there exists a unique positive-definite diagonal matrix D such that every diagonal element of DS D is unity. This is nothing but the correlation matrix corresponding to S.
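The linear analogue is explicit: $D = \mathrm{diag}(1/\sqrt{S_{ii}})$ is the unique positive-definite diagonal matrix making every diagonal element of $DSD$ equal to one. A minimal sketch with a hypothetical covariance matrix:

```python
import math

S = [[4.0, 1.2],
     [1.2, 9.0]]        # hypothetical covariance matrix
d = len(S)

# D = diag(1/sqrt(S_ii)) is the unique diagonal scaling with unit
# diagonal; DSD is then the correlation matrix corresponding to S.
w = [1.0 / math.sqrt(S[i][i]) for i in range(d)]
R = [[w[i] * S[i][j] * w[j] for j in range(d)] for i in range(d)]
print(R)  # diagonal entries equal 1 up to rounding
```

The off-diagonal entry $1.2/(2\cdot 3) = 0.2$ is the correlation coefficient, illustrating that this diagonal scaling is exactly the covariance-to-correlation conversion.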
There are some papers relevant to our study. A relation between copula and diagonal scaling is investigated in [4] from different perspectives. Their scaling operation does not correspond to transformation of random variables. Optimal transportation between two distributions sharing the same copula is considered in [1], where the various cost functions are the center of discussion. Optimal transportation is used to determine multi-dimensional quantiles in [7] and [13]. Although our motivation is also to define a kind of quantile functions of multivariate data, the construction is different from theirs (see Sect. 3). A particular class of optimal transport maps called moment maps has a deep connection to another Stein-type identity as investigated in [10].
The remainder of the present paper is organized as follows. In Sect. 2, we define the Stein-type distributions and transformations. In Sect. 3, we briefly explain its application to a rating problem of multivariate data. In Sect. 4, we describe the existence and uniqueness theorem as well as a variational characterization theorem. In Sect. 5, we prove the main results using the theory of optimal transportation. In Sect. 6, a numerical method to find the transformation for piecewise uniform distributions is proposed. Finally, we discuss open problems in Sect. 7.

Definition of Stein-type distributions and transformations
We define a class of distributions that satisfy a stronger condition than (4). Let $\mathcal{P}_2 = \mathcal{P}_2(\mathbb{R}^d)$ be the set of probability distributions $\mu$ on $\mathbb{R}^d$ with mean zero and finite second moments such that each marginal distribution $\mu_i$ of $\mu$ is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}$. Note that $\mu$ itself is not assumed to be absolutely continuous. The mean-zero condition is imposed only for simplicity. We say that a function $f : \mathbb{R} \to \mathbb{R}$ is absolutely continuous if there exists a locally integrable function $f'$ such that $f(b) - f(a) = \int_a^b f'(x)\,dx$ for all $a < b$. A distribution $\mu \in \mathcal{P}_2$ is said to be Stein-type if
$$\int f(x_i) \sum_{j=1}^d x_j \, d\mu = \int f'(x_i) \, d\mu \qquad (5)$$
holds for each $i \in \{1, \dots, d\}$ and for any absolutely continuous function $f : \mathbb{R} \to \mathbb{R}$ with bounded derivative $f'$.
Note that the equation (4) is a special case of (5) with $f(x_i) = x_i$. We refer to the equation (5) as the Stein-type identity. Indeed, if $d = 1$, it reduces to the Stein identity $\int f(x_1) x_1 \, d\mu = \int f'(x_1) \, d\mu$, which implies that $\mu$ is the standard normal distribution (see [35] and [6]). The Stein identity is used to evaluate the distance between a given distribution and the normal distribution. Although the Stein-type identity we defined is a generalization of the Stein identity, the author is not aware of applications to such distance evaluation. Instead, we develop a different application. More specifically, if a random vector $(X_1, \dots, X_d)$ has a Stein-type distribution, then the sum $\sum_j X_j$ is positively correlated with any monotone transformation of $X_i$ due to (5). This property is applied to a rating problem in Sect. 3.
If μ is completely independent in the sense that μ is the direct product of its marginals μ i , then the Stein-type distribution has to be the d-dimensional standard normal distribution. Hereafter, we focus on dependent cases.
For Gaussian random variables, we obtain the following lemma. We denote the expectation by $\mathrm{E}$.

Lemma 1 (Theorem 5 of [33]) Let $\mu$ denote the $d$-dimensional normal distribution with mean zero and covariance matrix $S = (S_{ij})$. Then, $\mu$ is Stein-type if and only if $\sum_{j=1}^d S_{ij} = 1$ for each $i$.

Indeed, we have
$$\mathrm{E}\Big[f(X_i) \sum_{j=1}^d X_j\Big] = \sum_{j=1}^d S_{ij}\, \mathrm{E}[f'(X_i)],$$
where the equality follows from the Stein identity for the one-dimensional case. Hence (5) holds if and only if $\sum_j S_{ij} = 1$.
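The row-sum condition is easy to probe by simulation. The sketch below uses a hypothetical covariance matrix whose rows sum to one and $f = \tanh$, and compares the two sides of the identity (5) for $i = 1$ by Monte Carlo.

```python
import math, random

random.seed(0)
# Hypothetical covariance with unit row sums: N(0, S) should then be
# Stein-type by the Gaussian lemma.
s11, s12, s22 = 0.7, 0.3, 0.7             # both rows sum to 1
a11 = math.sqrt(s11)                      # Cholesky factor of S
a21 = s12 / a11
a22 = math.sqrt(s22 - a21 * a21)

f  = math.tanh                            # smooth f with bounded derivative
df = lambda x: 1.0 - math.tanh(x) ** 2

n = 200000
lhs = rhs = 0.0
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x1 = a11 * z1
    x2 = a21 * z1 + a22 * z2
    lhs += f(x1) * (x1 + x2)              # E[f(X_1) * sum_j X_j]
    rhs += df(x1)                         # E[f'(X_1)]
lhs, rhs = lhs / n, rhs / n
print(lhs, rhs)  # the two averages agree up to Monte Carlo error
```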
The following example gives a rich class of Stein-type distributions.
Example 1 Let $W$ be a random variable with the standard normal distribution and let $U$ be any random variable independent of $W$ such that $\mathrm{E}[U] = 0$ and $\mathrm{E}[U^2] < \infty$. The condition $\mathrm{E}[U] = 0$ is assumed to make the following distribution belong to $\mathcal{P}_2$ and is not essential here. Consider the two variables
$$X_1 = \frac{W}{\sqrt{2}} + U, \qquad X_2 = \frac{W}{\sqrt{2}} - U.$$
Then the distribution of $(X_1, X_2)$ is Stein-type. Indeed, we obtain
$$\int f(x_1)(x_1 + x_2)\, d\mu = \int f'(x_1)\, d\mu$$
for any $f$ by the one-dimensional Stein identity with respect to $W$ conditional on $U$, and likewise for $x_2$. The variable $W = (X_1 + X_2)/\sqrt{2}$ has a meaning of "an overall score" of the two variables $X_1$ and $X_2$. The identity implies that $W$ is positively correlated with any increasing function $f(X_i)$ of $X_i$. This example is generalized to the case $d \geq 3$. Define a random vector $(X_1, \dots, X_d)$ by $X_i = W/\sqrt{d} + U_i$, where $(U_1, \dots, U_d)$ is independent of $W$ and satisfies $\sum_i U_i = 0$, $\mathrm{E}[U_i] = 0$ and $\mathrm{E}[U_i^2] < \infty$. As in the two-dimensional case, $W$ has positive correlation with any increasing function $f(X_i)$ of $X_i$. In Sect. 3, we will show by an example that the positive-correlation property does not hold in general for copulas.
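The construction can be simulated directly. The sketch below assumes the reading $X_i = W/\sqrt{d} + U_i$ with $\sum_i U_i = 0$; the uniform variables $V_1, V_2$ are a hypothetical concrete choice of the $U_i$, and both sides of the Stein-type identity are compared for $i = 1$ by Monte Carlo.

```python
import math, random

random.seed(1)
d = 3
f  = math.tanh
df = lambda x: 1.0 - math.tanh(x) ** 2

n = 200000
lhs = rhs = 0.0
for _ in range(n):
    w = random.gauss(0, 1)                # overall score W
    v1 = random.uniform(-1, 1)            # hypothetical mean-zero choices
    v2 = random.uniform(-1, 1)
    u = (v1, v2, -v1 - v2)                # components sum to zero
    x = [w / math.sqrt(d) + ui for ui in u]
    lhs += f(x[0]) * sum(x)               # E[f(X_1) * sum_j X_j]
    rhs += df(x[0])                       # E[f'(X_1)]
lhs, rhs = lhs / n, rhs / n
print(lhs, rhs)
```

Since $\sum_j X_j = \sqrt{d}\,W$, the agreement of the two averages reflects the one-dimensional Stein identity applied conditionally on $U$.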
The example does not cover the entire class of Stein-type distributions. Other examples are given in Sect. 6 and Appendix B.
For each $\mu \in \mathcal{P}_2$, let $\mathcal{T}_{\mathrm{cw}}(\mu)$ be the set of coordinate-wise transformations $T(x) = (T_1(x_1), \dots, T_d(x_d))$ such that each $T_i : \mathbb{R} \to \mathbb{R} \cup \{-\infty, \infty\}$ is non-decreasing and $T\mu$ belongs to $\mathcal{P}_2$.
Here, T μ is the push-forward measure defined by (T μ)(A) = μ(T −1 (A)) for any measurable set A. The set T cw (μ) depends only on the marginal distributions of μ.
We consider the problem of finding a map $T \in \mathcal{T}_{\mathrm{cw}}(\mu)$ such that $T\mu$ is Stein-type. Such a map is referred to as a Stein-type transformation of $\mu$.
For example, if $\mu$ is the product measure of one-dimensional continuous distributions $\mu_i$, then the map $T$ defined by $T_i(x_i) = \Phi^{-1}(F_i(x_i))$, where $F_i$ is the cumulative distribution function of $\mu_i$ and $\Phi$ is the cumulative distribution function of the standard normal distribution, is the Stein-type transformation of $\mu$. The map is nothing but the Brenier map between the two product measures.
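For product measures the transformation is thus explicit. The sketch below (using Python's `statistics.NormalDist` and a hypothetical Exp(1) marginal with $F(x) = 1 - e^{-x}$) applies $T = \Phi^{-1} \circ F$ and checks that the transformed sample is approximately standard normal.

```python
import math, random
from statistics import NormalDist

random.seed(2)
nd = NormalDist()  # standard normal: provides cdf and inv_cdf

# Hypothetical one-dimensional marginal: Exp(1), F(x) = 1 - exp(-x).
def T(x):
    return nd.inv_cdf(1.0 - math.exp(-x))   # T = Phi^{-1} o F

ys = [T(random.expovariate(1.0)) for _ in range(100000)]
mean = sum(ys) / len(ys)
var = sum(y * y for y in ys) / len(ys) - mean ** 2
print(mean, var)  # approximately 0 and 1
```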
The following lemma is immediate.

Lemma 2
Let μ be the normal distribution with a covariance matrix S = (S i j ). Then, μ has a Stein-type transformation if S is strictly copositive in the sense of (1).
Proof Let $D$ be the positive-definite diagonal matrix with entries $w_1, \dots, w_d$ given by Theorem 1, so that (2) holds. Set $T(x) = Dx$. Then $T\mu$ is normal with covariance matrix $DSD$, whose rows sum to one, and hence $T\mu$ is Stein-type due to Lemma 1.

Application to a rating problem
In this section, we briefly describe an application of Stein-type transformations. We first explain a linear rating method of multivariate data according to [33]. Let $S$ be the covariance matrix of an $\mathbb{R}^d$-valued random vector $x = (x_i)$. Suppose that each variable $x_i$ has a meaning of "score". For example, $x_1$, $x_2$ and $x_3$ are student scores of mathematics, physics and history, and so forth. We want to determine positive weights $w_i$ such that $\sum_j w_j x_j$ reflects the scores $x_1, \dots, x_d$. A candidate for such a weight $w_i$ is the $i$-th diagonal element of $D$ in (2). Indeed, under (2), the variable $x_i$ and the overall score $\sum_j w_j x_j$ have positive covariance for each $i$. This property is not attained in general by other methods of weighting. The quantity $\sum_j w_j x_j$ is called the objective general index (OGI) in [33], which reflects the $d$ scores in this sense. In a similar manner, we can define a nonlinear version of the objective general index via Stein-type transformations. Let $\mu$ be a probability distribution on $\mathbb{R}^d$ and $x$ a random vector distributed according to $\mu$. Again each coordinate $x_i$ is assumed to have a meaning of score. Then the nonlinear general index is defined by $g(x) = \sum_j T_j(x_j)$, where $T$ is the Stein-type transformation of $\mu$; we have $\mathrm{Cov}(f(x_i), g(x)) \geq 0$ for any increasing function $f(x_i)$ of $x_i$ due to the Stein-type identity (5). In particular, by taking the step function $f(x_i) = I_{[a,\infty)}(x_i)$, we obtain
$$\mathrm{E}[\,g(x) \mid x_i \geq a\,] \geq \mathrm{E}[\,g(x) \mid x_i < a\,] \qquad (6)$$
for each $i \in \{1, \dots, d\}$ and $a \in \mathbb{R}$. The property means that the conditional average of the overall score of students taking a higher score on $x_i$ is larger than that of students taking a lower score. In this sense, $g(x)$ reflects the $d$ scores.
General indices g(x) satisfying the inequality (6) are not unique. Indeed, for two-dimensional distributions, the probability integral transformation T i (x i ) = μ i ((−∞, x i ]) provides (6); see Appendix D for a proof. However, for higher dimensional distributions, it is not trivial to find such a transformation. The Stein-type transformation solves the problem. We confirm this point by an example.

Example 2
Suppose that $x_1$, $x_2$ and $x_3$ are random variables that have a probability distribution $\mu$ on $[-1, 1]^3$ whose three marginal distributions are uniform over $[-1, 1]$, i.e., $\mu$ is a copula (rescaled to $[-1,1]^3$). One can check that the property in (6) does not hold if we adopt $g(x) = \sum_i x_i$. As will be demonstrated in Sect. 6, the unique Stein-type transformation of $\mu$ is expressed through the standard normal distribution function $\Phi$ with constants $c_1 = 1.2490$ and $c_2 = c_3 = 0.3445$, which are obtained numerically. The nonlinear objective general index $g(x) = \sum_i T_i(x_i)$ satisfies the relation (6) as a result.

Main results
For given $\mu \in \mathcal{P}_2$, denote the set of coordinate-wise transformed distributions of $\mu$ by $F_\mu = \{T\mu \mid T \in \mathcal{T}_{\mathrm{cw}}(\mu)\}$. We refer to $F_\mu$ as a fiber. The following lemma is a direct consequence of the one-dimensional optimal transportation. See Appendix A.
Lemma 3 For given μ ∈ P 2 and ν ∈ F μ , the map T ∈ T cw (μ) satisfying ν = T μ is uniquely determined μ-almost everywhere. Furthermore, the relation ν ∈ F μ between two measures μ and ν is an equivalence relation. In particular, P 2 is partitioned into mutually disjoint fibers.
Now, we state our three main theorems. All proofs are presented in Sect. 5. The first main theorem characterizes Stein-type distributions in terms of a variational principle. Define an energy functional $E(\mu)$ of $\mu$ by
$$E(\mu) = \sum_{i=1}^d \int p_i(x_i) \log p_i(x_i)\, dx_i + \frac{1}{2} \int \Big(\sum_{j=1}^d x_j\Big)^2 d\mu, \qquad (7)$$
where $p_i = d\mu_i/dx_i$ is the marginal density function. The first term of $E(\mu)$ represents the negative entropy of the marginal distributions, and the second term is half the variance of the diagonal part $\sum_j x_j$.

Theorem 2 (Variational characterization) A distribution $\mu \in \mathcal{P}_2$ with $E(\mu) < \infty$ is Stein-type if and only if $\mu$ minimizes $E$ over its fiber $F_\mu$.

The functional has displacement convexity over each fiber in the sense of [26]. See Sect. 5 for details. If we replace the entropy term with the joint entropy, the functional becomes displacement convex over the whole space, as proved by McCann [26].
The second main theorem is on the uniqueness of Stein-type transformations. A distribution $\mu$ on $\mathbb{R}^d$ is said to have a regular support if the support of $\mu$ is equal to the direct product of the supports of the marginal distributions $\mu_i$. This property is invariant under coordinate-wise transformations. Note that the regular support condition does not imply absolute continuity of $\mu$ with respect to $\prod_{i=1}^d \mu_i$.

Theorem 3 (Uniqueness)
Assume that $\mu \in \mathcal{P}_2$ has a regular support. Then, a Stein-type transformation of $\mu$ is unique if it exists.
We conjecture that the uniqueness follows without the regular support condition. See Sect. 7 for more details.
The third main theorem is on existence. A measure $\mu \in \mathcal{P}_2$ is said to be copositive if
$$\beta(\mu) = \inf\Big\{ \int \Big(\sum_{i=1}^d T_i(x_i)\Big)^2 d\mu \;\Big|\; T \in \mathcal{T}_{\mathrm{cw}}(\mu),\ \sum_{i=1}^d \int T_i(x_i)^2\, d\mu_i = 1 \Big\} > 0.$$
If $\mu$ is completely independent, then $\int (\sum_i T_i)^2 d\mu = \sum_i \int T_i^2\, d\mu_i$ for any $T \in \mathcal{T}_{\mathrm{cw}}(\mu)$, and therefore $\beta(\mu) = 1$. It is not difficult to see that $\beta(\mu) \leq 1$ for any $\mu$. If $\mu$ is associated in the sense of [9,11] and [21], then $\int T_i T_j\, d\mu \geq 0$ for each pair of $i$ and $j$, and therefore $\beta(\mu) \geq 1$. On the other hand, copositivity can fail: for instance, if $d = 2$ and $\mu$ is concentrated on the anti-diagonal $\{x_1 + x_2 = 0\}$, then $\beta(\mu) = 0$. Sufficient conditions for copositivity are presented in Appendix E.

Theorem 4 (Existence) If $\mu \in \mathcal{P}_2$ is copositive, then a Stein-type transformation of $\mu$ exists.
We now present a few remarks before proceeding to the following section.
The uniqueness and existence results in Theorem 3 and Theorem 4 are consequences of the variational characterization in Theorem 2, as will be shown in Sect. 5. For $d = 1$, the functional $E(\mu)$ is the Kullback–Leibler divergence from $\mu$ to the standard normal density up to a constant term. For $d \geq 2$, however, $E$ is not even bounded from below. Indeed, for each $t > 0$, let $\mu_t$ be the multivariate normal distribution with mean zero and covariance matrix $\Sigma_t = P + t(I - P)$, where $I$ is the identity matrix and $P = (1/d)\mathbf{1}\mathbf{1}^\top$ is the orthogonal projection onto the diagonal direction. Then, each marginal distribution of $\mu_t$ is normal with variance $1/d + t(1 - 1/d)$, while $\int (\sum_j x_j)^2 d\mu_t = d$ for all $t$. Hence
$$E(\mu_t) = -\frac{d}{2} \log\Big(2\pi e \Big(\frac{1}{d} + t\Big(1 - \frac{1}{d}\Big)\Big)\Big) + \frac{d}{2},$$
which tends to $-\infty$ as $t \to \infty$. Therefore, it is not trivial whether there is a minimizer of $E$ over the fiber. Nevertheless, the existence and uniqueness theorems are obtained.
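The divergence along this Gaussian family can be tabulated. The sketch below assumes the expression for $E(\mu_t)$ read off above (marginal variance $1/d + t(1 - 1/d)$ and potential term $d/2$) and evaluates it for growing $t$.

```python
import math

d = 2

def energy(t):
    # Energy of mu_t = N(0, P + t(I - P)) with P = (1/d) 11^T:
    # marginal variance v grows linearly in t, while Var(sum_j x_j) = d.
    v = 1.0 / d + t * (1.0 - 1.0 / d)
    neg_entropy = -(d / 2) * math.log(2 * math.pi * math.e * v)
    potential = d / 2                    # (1/2) * Var(sum_j x_j)
    return neg_entropy + potential

vals = [energy(t) for t in (1.0, 10.0, 100.0, 1000.0)]
print(vals)  # strictly decreasing, unbounded below
```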
If $\mu$ has the joint density function $p(x)$ with respect to the Lebesgue measure, then the negative joint entropy is defined by $\int p(x) \log p(x)\, dx$. In most cases, we can replace the marginal entropy term in $E(\mu)$ with the joint entropy because the difference
$$\int p \log p\, dx - \sum_{i=1}^d \int p_i \log p_i\, dx_i = \int p \log \frac{p}{\prod_i p_i}\, dx,$$
which is referred to as the multi-information function or the measure of multivariate dependence, is invariant in each fiber (e.g., [16] and [36]). However, in some pathological cases, the difference diverges. Therefore, it is more appropriate to adopt the marginal entropy.
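The invariance of the multi-information can be illustrated in the bivariate Gaussian case, where it reduces to $-\frac{1}{2}\log(1-\rho^2)$, a function of the correlation alone; any (here: linear) coordinate-wise transformation leaves $\rho$, and hence the multi-information, unchanged. A minimal sketch:

```python
import math

def multi_info(s11, s12, s22):
    # Multi-information of a bivariate Gaussian with covariance entries
    # (s11, s12, s22): depends only on the squared correlation.
    rho2 = s12 * s12 / (s11 * s22)
    return -0.5 * math.log(1.0 - rho2)

i1 = multi_info(1.0, 0.5, 1.0)
# Rescale coordinates by w = (2, 3): covariance entries become w_i S_ij w_j.
i2 = multi_info(4.0, 3.0, 9.0)
print(i1, i2)  # identical values
```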
According to Sklar's theorem (e.g. [27]), any $d$-dimensional distribution $\mu$ is transformed by the probability integral transformation $T_i(x_i) = \int_{-\infty}^{x_i} d\mu_i$ into the distribution $T\mu$ with uniform marginals unless some $\mu_i$ has an atom. The resultant distribution $T\mu$ is a copula. The Stein-type distribution we defined is considered as an alternative representation of the copula. Copulas are also characterized by an energy minimization problem: the potential term in (7) is dropped and the minimization is restricted to distributions supported on $[0, 1]^d$. In parallel, we have to remove the condition $\int x_i\, d\mu_i = 0$ from the definition of $\mathcal{P}_2$. Maximum entropy copulas under a given diagonal section are discussed in [5], where, in contrast to the present paper, the marginals are fixed to be uniform.

Proofs based on the theory of optimal transportation
In this section, we prove the three main theorems stated in Sect. 4. The proof is based on the theory of optimal transportation. Necessary facts about one-dimensional optimal transportation are summarized in Appendix A.

Regularity of Stein-type distributions
The Stein-type identity forces regularity of marginal density functions. We first characterize it by an integral equation.

Theorem 5 A distribution $\mu \in \mathcal{P}_2$ is Stein-type if and only if each marginal density $p_i$ satisfies
$$p_i(a) = \int_a^\infty m_i(x_i)\, d\mu_i(x_i) \qquad (9)$$
for every $a \in \mathbb{R}$ and each $i \in \{1, \dots, d\}$, where $m_i(x_i)$ denotes the conditional expectation of $\sum_{j=1}^d x_j$ given $x_i$ with respect to $\mu$.
Conversely, assume (9). The right-hand side of (9) converges to zero as $a \to \pm\infty$ because $\int x_j\, d\mu_j = 0$ for all $j$. Then, for any bounded and absolutely continuous function $f$ with bounded derivative $f'$, we obtain the Stein-type identity for $f$:
$$\int f(x_i) \sum_j x_j\, d\mu = \int f(x_i)\, m_i(x_i)\, d\mu_i = \int f'(x_i)\, p_i(x_i)\, dx_i,$$
where the second equality follows from (9) and the integration-by-parts formula.
As a corollary, the regularity of the marginal density functions is established.

Corollary 1 Let $\mu$ be Stein-type. Then, each marginal density $p_i$ is absolutely continuous, bounded, and satisfies $p_i(x_i) \to 0$ as $x_i \to \pm\infty$.

Proof From the formula (9), it is obvious that $p_i$ is absolutely continuous and bounded by $\sum_i \int |x_i|\, d\mu_i < \infty$. We also have $p_i(x_i) \to 0$ as $x_i \to \pm\infty$ because the right-hand side of (9) vanishes as $a \to \pm\infty$.
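In one dimension the integral equation reads $p(a) = \int_a^\infty x\, p(x)\, dx$, which the standard normal density satisfies exactly, since $\int_a^\infty x\,\phi(x)\, dx = \phi(a)$. A numerical sketch (midpoint rule, truncating the tail at 10):

```python
import math

phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def tail_integral(a, upper=10.0, n=100000):
    # Midpoint-rule approximation of int_a^upper x * phi(x) dx;
    # the tail beyond `upper` is negligible.
    h = (upper - a) / n
    return sum(h * (a + (k + 0.5) * h) * phi(a + (k + 0.5) * h)
               for k in range(n))

for a in (-1.0, 0.0, 1.5):
    print(a, phi(a), tail_integral(a))  # the last two columns agree
```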
Although the marginal density function of any Stein-type distribution is absolutely continuous, it can have non-differentiable points, as shown in an example in Sect. 6. Continuous differentiability of $p_i(x_i)$ would follow from regularity of the pair-wise copulas of $\mu$ via formula (9). We do not pursue this line of investigation here. On the other hand, we conjecture that the marginal density of any Stein-type distribution is positive everywhere. See Sect. 7 for more details.
The following corollary will be used later.

Corollary 2 Let $\mu$ be Stein-type. Then, its negative marginal entropy $\sum_i \int p_i \log p_i\, dx_i$ is finite; in particular, $\mu$ belongs to the domain of $E$.
Other properties of Stein-type distributions are given in Appendix C.

Variational problem over a fiber of Wasserstein space
Let $F$ be a fiber of $\mathcal{P}_2$ (see Sect. 4 for the definition) and choose two measures $\mu$ and $\nu$ in $F$, where $\nu$ is written as $\nu = T\mu$ with some $T \in \mathcal{T}_{\mathrm{cw}}(\mu)$ by definition. Define the geodesic, also referred to as the displacement interpolation [26], from $\mu$ to $\nu$ by
$$[\mu, \nu]_t = \big((1 - t)\,\mathrm{Id} + t\,T\big)\,\mu, \qquad t \in [0, 1],$$
where $\mathrm{Id}$ denotes the identity map. Based on the one-dimensional optimal transportation, it follows that $[\mu, \nu]_t$ remains in the fiber $F$ for every $t$. Refer to [26] and [3] for further details on displacement convexity. Although a geodesic between any pair of distributions in $\mathcal{P}_2$ is similarly defined, we need only geodesics in a common fiber. It is known that a geodesic actually attains the minimum length of a path between two measures with respect to the $L^2$-Wasserstein distance (see e.g. [3] and [38]). Here the $L^2$-Wasserstein distance is the infimum of $(\int \|x - y\|^2\, d\gamma(x, y))^{1/2}$ over the set of joint distributions $\gamma$ on $\mathbb{R}^{2d}$ with the marginals $\mu$ and $\nu$. Each fiber $F$ is totally geodesic in the sense of [37].
Recall that μ is said to have a regular support if its support is the direct product of the supports of marginal distributions.

Lemma 4 Let $F$ be a fiber and choose any two distinct distributions $\mu$ and $\nu$ in $F$. Then the function $t \mapsto E([\mu, \nu]_t)$ is convex on $[0, 1]$. It is strictly convex if one of the following conditions is satisfied: (i) $\mu$ (and therefore $\nu$) has a regular support, or (ii) the supports of $\mu_i$ and $\nu_i$ are connected, respectively, for each $i$.
Proof Let $\nu = T\mu$ with $T \in \mathcal{T}_{\mathrm{cw}}(\mu)$, and let $p_i = d\mu_i/dx_i$ be the marginal density of $\mu$. By the change-of-variable formula (Lemma 7 in Appendix A), $E([\mu, \nu]_t)$ is expressed through the derivatives $(1 - t) + t\,T_i'$; we refer to the resulting expression as (10), and its convexity in $t$ follows. Under the regular support condition (i), the expression (10) is strictly convex in $t$ unless $T = \mathrm{Id}$, so $E([\mu, \nu]_t)$ is strictly convex under (i).

Next, assume (ii). Then, $T_i$ has no discontinuity points. Assume $E([\mu, \nu]_t)$ is not strictly convex. Then, it follows from (10) that $T_i'(x_i) = 1$ and, therefore, $T_i(x_i) = x_i$ by the connectedness of the support together with the condition $\int T_i\, d\mu_i = 0$. However, this contradicts $\mu \neq \nu$. Thus, $E([\mu, \nu]_t)$ is strictly convex.

Example 3
The strict convexity of E([μ, ν] t ) can fail if neither condition (i) nor condition (ii) in Lemma 4 is satisfied. For example, let d = 2 and assume that μ is uniformly distributed over the region

Proof of Theorem 2
Let $\mu$ be a Stein-type distribution. Corollary 2 implies that $\mu$ belongs to $\mathrm{dom}\,E$. From the convexity (Lemma 4), it is sufficient to show that the right derivative $(d/dt^+)\,E([\mu, \nu]_t)\big|_{t=0} \geq 0$ for any $\nu = T\mu \in F$, where $d/dt^+$ denotes the right derivative. It follows from formula (10) that this right derivative is the expression (11), a sum over $i$ of terms in $T_i - \mathrm{Id}$ and $T_i' - 1$. If $T_i$ is absolutely continuous, the right-hand side vanishes by the Stein-type identity, where the boundedness of the derivatives $T_i'$ can be assumed by a standard approximation argument, as in the proof of Theorem 5. If $T_i$ is not absolutely continuous, $T_i$ can be decomposed into an absolutely continuous part and a discontinuous part as $T_i = T_i^{\mathrm{ac}} + T_i^{\mathrm{d}}$. The contribution of $T_i^{\mathrm{ac}}$ in (11) vanishes due to the Stein-type identity. It is then sufficient to prove that the contribution of $T_i^{\mathrm{d}}$ in (11) is nonnegative. More specifically, a step function $I_{[\xi,\infty)}(x_i)$ at each $\xi \in \mathbb{R}$ is approximated by a logistic function $1/(1 + \exp(-n(x_i - \xi)))$. Then, by Lebesgue's dominated convergence theorem and the Stein-type identity, we obtain the desired nonnegativity. Conversely, assume that $E(T\mu)$ is minimized at $T = \mathrm{Id}$. Fix $1 \leq i \leq d$, and let $f$ be an absolutely continuous function with bounded derivative $f'$. For sufficiently small $\varepsilon > 0$, the two maps $T(x) = x \pm \varepsilon f(x_i) e_i$, where $e_i$ is the $i$-th unit vector, belong to $\mathcal{T}_{\mathrm{cw}}(\mu)$. Thus, the right derivative (11) has to be zero, and $\mu$ satisfies the Stein-type identity.

Proof of Theorem 3
Assume that μ has a regular support and admits a Stein-type transformation T . Then, Theorem 2 implies that T μ minimizes E over the fiber F μ . However, it is deduced from Lemma 4 that E is strictly convex over F μ . Thus, the minimizer is unique.

Proof of Theorem 4
Assume that $\mu$ is copositive. Denote the functional $E$ restricted to the fiber $F_\mu$ by $E_\mu$. From Theorem 2, it is sufficient to show that $E_\mu$ has a minimum point. We first show that $E_\mu$ is bounded from below and that the level set $\{\nu \mid E_\mu(\nu) \leq c\}$ for each $c \in \mathbb{R}$ is tight. For any $\nu \in F_\mu$, the copositivity condition implies
$$\int \Big(\sum_j x_j\Big)^2 d\nu \geq \beta \sum_i \int x_i^2\, q_i(x_i)\, dx_i,$$
where $q_i = d\nu_i/dx_i$ and $\beta = \beta(\nu) = \beta(\mu) > 0$. We obtain
$$E_\mu(\nu) \geq \sum_i \Big[ \int q_i \log q_i\, dx_i + \frac{\beta}{2} \int x_i^2\, q_i\, dx_i \Big] = \sum_i \mathrm{KL}(q_i \,\|\, \phi_{1/\beta}) - \frac{d}{2} \log \frac{2\pi}{\beta} \geq -\frac{d}{2} \log \frac{2\pi}{\beta},$$
where $\phi_{1/\beta}$ denotes the density of $N(0, 1/\beta)$ and the last inequality follows from the nonnegativity of the Kullback–Leibler divergence. Then, $E_\mu$ is bounded from below by a constant $C$ independent of $\nu$. This inequality also implies that the level set $\{\nu \mid E_\mu(\nu) \leq c\}$ is tight. Now there exists a weakly converging sequence $\nu_k$ such that $E_\mu(\nu_k)$ converges to $\inf E_\mu(\nu)$. Let $\nu^*$ be the weak limit. Then, Corollary 3.5 of [26] shows that $\nu^* \in \mathcal{P}_2$ and $E_\mu(\nu^*) \leq \lim_k E_\mu(\nu_k)$. The distribution $\nu^*$ gives a minimum point of $E_\mu$. This completes the proof.

Piecewise uniform densities
In this section, it is shown that if $\mu$ has a piecewise uniform density function, then the Stein-type transformation of $\mu$ is obtained by finite-dimensional optimization. Here, we do not impose the zero-mean condition $\int x_i\, d\mu = 0$ on $\mu$. We can always translate it to have zero mean if necessary.
We say that a probability density function $c(u)$ on $[0, 1]^d$ is piecewise uniform if its two-dimensional marginal densities $c_{ij}$ ($1 \leq i < j \leq d$) are written as
$$c_{ij}(u_i, u_j) = n^2\, \pi^{ij}_{ab} \quad \text{if } (u_i, u_j) \in \Big[\frac{a-1}{n}, \frac{a}{n}\Big) \times \Big[\frac{b-1}{n}, \frac{b}{n}\Big) \qquad (12)$$
for some $n$, where $\pi^{ij}_{ab}$ is a positive number such that $\sum_{a=1}^n \sum_{b=1}^n \pi^{ij}_{ab} = 1$.
Let $\pi^i_a = \sum_{b=1}^n \pi^{ij}_{ab}$. Although $c$ is not a copula density unless $\pi^i_a = 1/n$ for all $i$ and $a$, it is transformed by a piecewise linear transform into a copula density. Then Corollary 3 in Appendix E, together with Theorem 4, guarantees the existence of a Stein-type transformation as long as the support of $c(u)$ is $[0, 1]^d$.
By solving Equation (9), we obtain an expression of the Stein-type transformation of $c$ as follows. Denote the cumulative distribution function and density function of the standard normal distribution by $\Phi$ and $\phi$, respectively.

Lemma 5 Suppose that $c(u)$ satisfies (12) and its support is $[0, 1]^d$. Let $p$ be the unique Stein-type density transformed from $c$. Then, there exist real constants $\alpha_{1i}, \dots, \alpha_{ni}$ and $\xi_{1i} < \cdots < \xi_{n-1,i}$ such that
$$p_i(x_i) = \frac{\pi^i_a\, \phi(x_i - \alpha_{ai})}{Z_{ai}} \quad \text{if } \xi_{a-1,i} \leq x_i < \xi_{ai}, \qquad (13)$$
where $\xi_{0i} = -\infty$, $\xi_{ni} = \infty$, and $Z_{ai} = \Phi(\xi_{ai} - \alpha_{ai}) - \Phi(\xi_{a-1,i} - \alpha_{ai})$. The Stein-type transformation is
$$T_i(u_i) = \alpha_{ai} + \Phi^{-1}\Big(\Phi(\xi_{a-1,i} - \alpha_{ai}) + n Z_{ai}\Big(u_i - \frac{a-1}{n}\Big)\Big) \quad \text{for } u_i \in \Big[\frac{a-1}{n}, \frac{a}{n}\Big), \qquad (14)$$
and the two-dimensional marginal density is
$$p_{ij}(x_i, x_j) = \frac{\pi^{ij}_{ab}\, \phi(x_i - \alpha_{ai})\, \phi(x_j - \alpha_{bj})}{Z_{ai}\, Z_{bj}} \quad \text{on } [\xi_{a-1,i}, \xi_{ai}) \times [\xi_{b-1,j}, \xi_{bj}). \qquad (15)$$
Furthermore, the following identity is satisfied:
$$\alpha_{ai} = -\sum_{j \neq i} \sum_{b=1}^n \frac{\pi^{ij}_{ab}}{\pi^i_a}\, M_{bj}, \qquad M_{bj} = \frac{1}{Z_{bj}} \int_{\xi_{b-1,j}}^{\xi_{bj}} x\, \phi(x - \alpha_{bj})\, dx. \qquad (16)$$

Proof Equation (9) implies $p_i'(x_i) = -m_i(x_i)\, p_i(x_i)$ wherever the derivative exists. Since the conditional expectation $\mathrm{E}[X_j \mid x_i]$ has to be piecewise constant, $p_i(x_i)$ is piecewise Gaussian up to a normalizing constant. Since the mass of each piece is preserved under a coordinate-wise transformation, we obtain the form (13). Then, the unique monotone transformation (14) is derived from the cumulative distribution functions, and Equation (15) results from the transformation of $c_{ij}(u_i, u_j)$. Finally, Equation (16) expresses the condition $m_i(x_i) = x_i - \alpha_{ai}$ on each piece.

The parameters $\alpha_{ai}$ and $\xi_{ai}$ are determined by the continuity of (13) at $x_i = \xi_{ai}$ and the identity (16). However, instead of solving the simultaneous equations directly, we adopt an optimization approach.
Assume the density of a distribution $\mu$ obeys the parametric form given by Equation (13). Then, the energy function $E(\mu)$ defined in Sect. 5 becomes a function of $\alpha$ and $\xi$, which is denoted by $F(\alpha, \xi)$ and is available in closed form. Since $Z_{ai}$ and $M_{ai}$ are functions of the three parameters $\alpha_{ai}$, $\xi_{ai}$, and $\xi_{a-1,i}$, we denote the corresponding partial derivatives by $D_1$, $D_2$, and $D_3$. The derivatives of $F$ with respect to $\alpha_{ai}$ and $\xi_{ai}$, referred to as (17) and (18), are expressed through these quantities. By using these formulas, we obtain the following theorem.

Theorem 6 A stationary point of F together with formula (13) provides the global minimum point of the energy functional E(μ) over the fiber. In other words, F has a unique stationary point that corresponds to the Stein-type density.
Proof Since $M_{ai} = \int_{\xi_{a-1,i}}^{\xi_{ai}} x_i\, \phi(x_i - \alpha_{ai})\, dx_i / Z_{ai}$ is the expectation parameter of the exponential family $\phi(x_i - \alpha_{ai})/Z_{ai}$, it is an increasing function of $\alpha_{ai}$ (e.g., [23]). Therefore, $D_1 M_{ai} > 0$. Thus, the stationary condition $\partial F/\partial \alpha_{ai} = 0$ is equivalent to (16) and solves the integral equation (9) except at the boundary points $\xi_{ai}$. Furthermore, substituting this relation into (18), we find that $\partial F/\partial \xi_{ai} = 0$ is equivalent to the continuity of $p_i$ at $\xi_{ai}$. Then, the density $p$ is the Stein-type density, which is unique due to Theorem 3.
The minimization of $F(\alpha, \xi)$ over $\alpha_{ai} \in \mathbb{R}$ and $\xi_{1i} < \cdots < \xi_{n-1,i}$ is performed using a standard optimization package (e.g., the function optim in R [29]) once the coordinates $\tau_{ai} = \xi_{ai} - \xi_{a-1,i}$, rather than $\xi_{ai}$, are used for $2 \leq a \leq n - 1$. We numerically obtain the Stein-type densities of discretized copulas. The result is shown in Fig. 1. The copula used here is the Clayton copula
$$C_\theta(u, v) = (u^{-\theta} + v^{-\theta} - 1)^{-1/\theta}, \qquad \theta > 0.$$
The discretized copula density of $n \times n$ cells is given by (12) with
$$\pi_{ab} = C_\theta\Big(\frac{a}{n}, \frac{b}{n}\Big) - C_\theta\Big(\frac{a-1}{n}, \frac{b}{n}\Big) - C_\theta\Big(\frac{a}{n}, \frac{b-1}{n}\Big) + C_\theta\Big(\frac{a-1}{n}, \frac{b-1}{n}\Big).$$
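The discretization step can be sketched directly: the cell probabilities are second-order differences of the copula function. The snippet below (with hypothetical values $\theta = 2$ and $n = 8$) builds the grid for the Clayton copula and checks that the cells sum to one with uniform marginal masses $1/n$.

```python
theta = 2.0
n = 8

def C(u, v):
    # Clayton copula C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta),
    # with the boundary convention C(u, 0) = C(0, v) = 0.
    if u <= 0.0 or v <= 0.0:
        return 0.0
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

# Second-order differences of C over the n x n grid of cells.
pi = [[C(a / n, b / n) - C((a - 1) / n, b / n)
       - C(a / n, (b - 1) / n) + C((a - 1) / n, (b - 1) / n)
       for b in range(1, n + 1)] for a in range(1, n + 1)]

total = sum(sum(row) for row in pi)
row_masses = [sum(row) for row in pi]     # marginal masses pi^i_a
print(total, row_masses)  # total mass 1; each row mass equals 1/n
```

Because a copula has uniform marginals, every row (and column) mass is exactly $1/n$, so the discretized density is a copula density as well.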

Discussion
In the present paper, we showed that a class of multi-dimensional distributions has a unique representation via the Stein-type identity. Now, we describe areas for future study and some open problems. In Sect. 5.1, we derived some properties of Stein-type distributions. The author could not find any counterexample to the following conjecture.

Conjecture 1
The marginal density function of any Stein-type distribution is positive everywhere.
A partial answer to Conjecture 1 is given in the following lemma.

Lemma 6 Let $\mu$ be a Stein-type distribution. If the copula of $\mu$ has pair-wise marginal densities $c_{ij}$ satisfying a suitable boundedness condition, then each marginal density $p_i$ of $\mu$ is positive everywhere. In particular, if the copula density of $\mu$ is bounded, then the same consequence follows.

From (9), the marginal density $p_i$ satisfies a differential inequality driven by the pair-wise densities $c_{ij}$. Let $a \in \mathbb{R}$ be a point at which $p_i(a) > 0$. Then, Gronwall's lemma shows that $p_i$ remains positive on the whole real line.

If Conjecture 1 is positively solved, then the following conjecture, which is based on Theorem 3, also follows according to Lemma 4 (ii).

Conjecture 2 A Stein-type transformation is unique if it exists.
We state a relevant conjecture that is the converse of Theorem 4.

Conjecture 3 A distribution is copositive if it has a Stein-type transformation.
In Sect. 5, we showed that a Stein-type distribution is characterized by the stationary point of an energy functional $E$ over a fiber $F$. From the perspective of optimal transportation, we can construct the gradient flow of the energy functional with respect to the $L^2$-Wasserstein metric ([19,28] and [38]). The formal equation is as follows:
$$\frac{\partial p_i}{\partial t} = \frac{\partial^2 p_i}{\partial x_i^2} + \frac{\partial}{\partial x_i}\big(m_i(x_i)\, p_i\big), \qquad i = 1, \dots, d, \qquad (19)$$
where $m_i(x_i)$ is the conditional expectation of $\sum_j x_j$ given $x_i$. Although this appears to be an independent system of one-dimensional Fokker–Planck equations, the equations interact with each other via $m_i(x_i)$. The physical meaning of the equation is not clear. From Theorem 5, it follows that each Stein-type density is a stationary point of (19). The time evolution will be theoretically of interest.
In Appendix E, we presented sufficient conditions for copositivity of distributions. In particular, a Gaussian distribution is copositive if its covariance matrix is not degenerate. Conversely, if a Gaussian distribution is copositive, then the covariance matrix must, by definition, be strictly copositive (see Equation (1)). The following conjecture naturally arises but is not proven. It is positively solved if Conjecture 3 is correct, due to Lemma 2.

Conjecture 4
A Gaussian distribution is copositive if the covariance matrix is strictly copositive.
As stated in Sect. E.4, tail-dependent copulas do not satisfy the sufficient condition in Theorem 7. The copositivity of tail-dependent copulas remains unclear.
In the present paper, we did not consider statistical models that explain a given data set. A statistical model involving a Stein-type distribution is essentially equivalent to a copula model because such models correspond to each other through coordinate-wise transformations, whereas the marginal distributions are not of much interest in copula modelling. The class given in Example 1 provides a flexible model because the distribution of the $U_i$'s in the construction can be selected arbitrarily.
Finally, it is expected that there is a coordinate-wise transformation that achieves a corresponding identity for any monotone increasing functions $f$ and $g$. If $g$ is fixed, we can develop a discussion similar to that of the present paper (see [34]).

A One-dimensional optimal transportation
Necessary information about one-dimensional optimal transportation is summarized. Refer to [30] and [38] for further details. Let $\mathcal{P}_2(\mathbb{R})$ be the set of absolutely continuous probability distributions $\mu$ on $\mathbb{R}$ such that $\int x\, d\mu = 0$ and $\int x^2\, d\mu < \infty$. For given $\mu \in \mathcal{P}_2(\mathbb{R})$, let $\mathcal{T}(\mu)$ be the set of non-decreasing functions $T : \mathbb{R} \to \mathbb{R} \cup \{-\infty, \infty\}$ such that $T\mu \in \mathcal{P}_2(\mathbb{R})$.
For given $\mu$ and $\nu$ in $\mathcal{P}_2(\mathbb{R})$, there exists $T \in \mathcal{T}(\mu)$ such that $\nu = T\mu$. The map is uniquely determined $\mu$-almost everywhere. More explicitly, $T$ is given by $T = G^- \circ F$, where $F$ is the cumulative distribution function of $\mu$ and $G^-(u) = \inf\{x \mid G(x) \geq u\}$ is the generalized inverse of the cumulative distribution function $G$ of $\nu$. The map $T$ is called the optimal transportation from $\mu$ to $\nu$ because this map minimizes the functional $\int (T(x) - x)^2\, d\mu$ over $\{T \mid T\mu = \nu\}$. Since $\mu$ and $\nu$ are absolutely continuous, $T$ is decomposed into an absolutely continuous part, $T^{\mathrm{ac}}$, and a discontinuous part, $T^{\mathrm{d}}$, without a singular continuous part. This is because $G^-$ constructed above has the same property. The decomposition is unique up to a $\mu$-negligible set.
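The formula $T = G^- \circ F$ is easy to exercise numerically. The sketch below uses Python's `statistics.NormalDist` for a hypothetical Gaussian pair (ignoring the zero-mean normalization of $\mathcal{P}_2(\mathbb{R})$ for illustration); between two Gaussians the monotone map must reduce to an affine function.

```python
from statistics import NormalDist

# Source mu = N(0, 1) and target nu = N(2, 3); the monotone rearrangement
# T = G^- o F composes the target quantile function with the source cdf.
mu = NormalDist(0, 1)
nu = NormalDist(2, 3)

def T(x):
    return nu.inv_cdf(mu.cdf(x))

xs = [-2.0, -0.5, 0.0, 1.0, 2.5]
print([(x, T(x), 2 + 3 * x) for x in xs])  # T(x) matches 2 + 3x
```

That the optimal map between Gaussians is affine is the one-dimensional version of the linear picture behind Theorem 1, where coordinate-wise transformations are restricted to scalings.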
The following lemmas are used in Sect. 5 and Sect. E. These lemmas were originally proven for multi-dimensional measures but here we simplify them for the one-dimensional case.

Lemma 7 (Theorem 4.4 of [26])
For given $\mu$ and $\nu$ in $\mathcal{P}_2(\mathbb{R})$, let $T$ be the unique monotone map such that $\nu = T\mu$. Let $p$ and $q$ be density functions of $\mu$ and $\nu$, respectively. Let $X \subset \mathbb{R}$ denote the set of points where the derivative $T'$ is defined and positive. Then, $\mu(X) = 1$. Furthermore, $p(x) = q(T(x))\, T'(x)$ for $\mu$-almost every $x \in X$.

Lemma 9 (Proposition 4.2 of [26])
Let $\mu \in \mathcal{P}_2$. If $T : \mathbb{R} \to \mathbb{R}$ is a non-decreasing function written as $T = T^{\mathrm{ac}} + T^{\mathrm{d}}$ and the derivative $(T^{\mathrm{ac}})'$ of the absolutely continuous part is strictly positive $\mu$-almost everywhere, then $T\mu$ is absolutely continuous.

B Explicit expression of Stein-type distributions
We formally derive an explicit expression of the Stein-type distributions. Assume that μ ∈ P_2 has a smooth density function p that decays at infinity. Then, μ is Stein-type if and only if there exists a function r(x) satisfying (20), where dx_{−i} means dx_1 ⋯ dx_{i−1} dx_{i+1} ⋯ dx_d. In fact, formula (20) is rewritten in terms of the conditional expectation of ∑_j x_j given x_i, and the resulting equation is equivalent to (9). Equation (20) is explicitly solved if r(x) is given. Let Q be a fixed orthogonal matrix such that (Qx)_1 = (1/√d) ∑_j x_j, where Q′ denotes the matrix transpose of Q. Then (20) is written as an equation in the rotated coordinates Qx. The general solution is (21), where q is any probability density function on R^{d−1}.
In particular, if r(x) = 0, we obtain a simple formula; Example 1 in Sect. 4 is this solution. The class of densities (21) is characterized by a condition stronger than the Stein-type identity.
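The r(x) = 0 solution can be checked numerically. Under our reading of Example 1, one draws the first rotated coordinate y_1 from the standard normal, draws the remaining rotated coordinates from an arbitrary distribution (uniform below, an arbitrary choice of ours), and sets X = Q′y; the Stein-type identity E[f(X_i) ∑_j X_j] = E[f′(X_i)] then holds for smooth bounded f. A Monte Carlo sketch, with our own construction of Q:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 400_000

# Orthogonal Q whose first row is (1,...,1)/sqrt(d), so (Qx)_1 = sum_j x_j / sqrt(d).
A = np.eye(d)
A[:, 0] = 1.0 / np.sqrt(d)
Q, _ = np.linalg.qr(A)      # columns orthonormal; first column parallel to ones
Q = Q.T
if Q[0, 0] < 0:             # fix the sign ambiguity of the QR factor
    Q = -Q

# y_1 standard normal, remaining rotated coordinates arbitrary (here: uniform).
y = np.empty((n, d))
y[:, 0] = rng.normal(size=n)
y[:, 1:] = rng.uniform(-1.0, 1.0, size=(n, d - 1))
x = y @ Q                   # each sample is x = Q' y

s = x.sum(axis=1)
for i in range(d):
    lhs = np.mean(np.sin(x[:, i]) * s)   # E[f(X_i) sum_j X_j] with f = sin
    rhs = np.mean(np.cos(x[:, i]))       # E[f'(X_i)]
    print(i, lhs, rhs)                   # the two sides should nearly agree
```

The agreement holds for every i simultaneously, which is the content of the identity.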

C Properties of Stein-type distributions
We provide some properties of Stein-type distributions that are not used in the main line of the paper. We first point out that Stein-type distributions have finite Fisher information. The Fisher information of a density function q on R is defined by I(q) = ∫ (q′(x)/q(x))² q(x) dx, where q is assumed to be absolutely continuous and q′(x)/q(x) is set to 0 if q is not differentiable or not positive at x. See [18] for properties implied by finite Fisher information. Note that the Fisher information defined here is that of the location family {q(x − θ) | θ ∈ R} in statistics (e.g., [23]).

Lemma 10 For any Stein-type distribution μ, the Fisher information I(p_i) of each marginal density p_i is bounded by the dimension d. In particular, p_i has bounded variation.
Proof By (9), the score function −p_i′(x_i)/p_i(x_i) is the conditional expectation of ∑_j x_j given x_i. Hence, by Jensen's inequality, I(p_i) = ∫ (E[∑_j X_j | X_i = x_i])² p_i(x_i) dx_i ≤ E[(∑_j X_j)²] = ∑_i E[X_i ∑_j X_j] = d, where the last equality follows from the Stein-type identity with f(x_i) = x_i. By the Cauchy–Schwarz inequality, we also have ∫ |p_i′(x_i)| dx_i ≤ √I(p_i). Then, p_i has bounded variation.
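For a concrete check of the bound I(p_i) ≤ d, consider a Stein-type normal distribution. For a Gaussian, the Stein-type identity reduces, via Stein's lemma, to the condition that each row of the covariance matrix sums to one; this row-sum reading of the characterization in Lemma 1 is our own inference, since Lemma 1 is not reproduced here. For the equicorrelated family, the marginal Fisher information is then available in closed form:

```python
import numpy as np

# Equicorrelated covariance rescaled so that each row sums to 1
# (our reading of the Stein-type condition for Gaussians).
d, rho = 4, 0.6
Sigma = ((1 - rho) * np.eye(d) + rho * np.ones((d, d))) / (1 + rho * (d - 1))
assert np.allclose(Sigma.sum(axis=1), 1.0)

# The marginal p_i is N(0, Sigma_ii), whose Fisher information is 1 / Sigma_ii.
fisher = 1.0 / np.diag(Sigma)
print(fisher)   # each entry is 1 + rho*(d-1) = 2.8 <= d = 4
```

As rho increases toward 1 the bound I(p_i) ≤ d is approached but never exceeded, consistent with Lemma 10.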
Let S be the set of Stein-type distributions on R^d. We prove that S is closed under mixture, normalized convolution, and weak limit.

Lemma 11 (Mixture) If μ, ν ∈ S and 0 ≤ t ≤ 1, then tμ + (1 − t)ν ∈ S. This follows because both sides of the Stein-type identity are linear in the underlying measure.

Lemma 12 (Normalized convolution) Let X and Y be independent random vectors with distributions in S, and let a, b > 0. Then the distribution of aX + bY belongs to S if and only if a² + b² = 1.

Proof Write Z = aX + bY. The Stein-type identity with respect to X implies that, for each i, E[f(Z_i) a ∑_j X_j] = a² E[f′(Z_i)], because X and Y are independent. By changing the roles of X and Y, we have E[f(Z_i) b ∑_j Y_j] = b² E[f′(Z_i)]. Thus, the Stein-type identity for aX + bY holds if and only if a² + b² = 1.
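For Stein-type normal distributions, the normalized-convolution closure is elementary: the covariance of aX + bY is a²Σ_X + b²Σ_Y, and unit row sums (our reading of the Stein-type condition for Gaussians) are preserved exactly when a² + b² = 1. A small sketch:

```python
import numpy as np

def equi(d, rho):
    # equicorrelated covariance rescaled to unit row sums
    return ((1 - rho) * np.eye(d) + rho * np.ones((d, d))) / (1 + rho * (d - 1))

d = 3
S1, S2 = equi(d, 0.2), equi(d, 0.7)
a, b = 0.6, 0.8                 # a^2 + b^2 = 1
S = a**2 * S1 + b**2 * S2       # covariance of aX + bY with X, Y independent
print(S.sum(axis=1))            # row sums stay 1, so aX + bY is again Stein-type
```

With a = b = 1 the row sums would become 2, illustrating the "only if" direction.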
The set S is also closed under weak limit in the following sense. Denote the Euclidean norm of x ∈ R^d by ‖x‖.

Lemma 13 (Weak convergence) Let μ^(n) be a sequence in S. If μ^(n) converges to μ in law and ∫ ‖x‖² dμ^(n) converges to ∫ ‖x‖² dμ < ∞, then μ belongs to S.
Proof These conditions imply that ∫ ϕ dμ^(n) → ∫ ϕ dμ for any continuous function ϕ such that |ϕ(x)| ≤ C(1 + ‖x‖²) for some C > 0 (refer to Theorem 7.12 of [38]). Letting ϕ(x) be f(x_i) ∑_j x_j and f(x_i), respectively, we obtain the Stein-type identity for μ. Absolute continuity of μ_i is shown in the same manner as in the proof of Theorem 5.
The condition regarding moment convergence in Lemma 13 is necessary. Indeed, we can construct a sequence (W, U^(n)) of Stein-type random variables in the same manner as in Example 1 of Sect. 4 such that U^(n) converges in law to a random variable U with E[U²] = ∞.
By Lemma 12 and Lemma 13 together with the central limit theorem, if we have independent and identically distributed samples X_1, …, X_n according to a Stein-type distribution μ, then the limit distribution of (X_1 + ⋯ + X_n)/√n is a Stein-type normal distribution, which is characterized by Lemma 1.
Note that the set of copulas satisfies the same conclusions as Lemma 11 and Lemma 13. If we modify the definition of copulas so that the marginal distributions are standard normal, then the analogue of Lemma 12 also holds.

D A property of two-dimensional copulas
We prove that the simple sum g(x) = x_1 + x_2 satisfies the inequality (6) for any two-dimensional continuous copula density function c(x_1, x_2).

From the lemma, inequality (6) holds for g(x) = x_1 + x_2.

E Sufficient conditions for copositivity
We present sufficient conditions for copositivity (8) of a given distribution μ. In Sect. E.1, we first extend the setting to measures with non-zero mean and to coordinate-wise transformations that are constant over an interval, and then present a lower bound on the quantity β(μ) in (8). The subsequent subsections are devoted to finding sufficient conditions for copositivity. Refer to [34] for other sufficient conditions based on positive super-modular dependence.

E.1 Extension of the definition and a lower bound
Let P_2^* be the set of measures μ on R^d such that each marginal μ_i is absolutely continuous and ∫ x_i² dμ_i < ∞, without assuming ∫ x_i dμ_i = 0. The set T_cw^*(μ) for μ ∈ P_2^* is defined as the set of coordinate-wise non-decreasing maps T = (T_1, …, T_d) such that T μ ∈ P_2^*. The following lemma is useful for studying copositivity. Denote the inner product and norm of L²(μ) by ⟨f, g⟩ = ∫ f(x)g(x) dμ and ‖f‖ = ⟨f, f⟩^{1/2}, respectively.

Lemma 15
If μ ∈ P_2, then
β(μ) = inf_{0 ≠ T ∈ T_cw^*(μ)} ‖∑_i T_i‖² / ∑_i ‖T_i‖².  (22)

Proof Denote the right-hand side of (22) by δ(μ). It is obvious that β(μ) ≥ δ(μ) since T_cw(μ) ⊂ T_cw^*(μ). To prove the converse inequality, choose 0 ≠ T ∈ T_cw^*(μ) such that ‖∑_i T_i‖²/(∑_i ‖T_i‖²) ≤ δ(μ) + ε for given ε > 0. It follows from Lemma 9 in Appendix A that the map T_η defined by T_η(x) = T(x) + ηx belongs to T_cw(μ) for each η > 0. Letting η → 0, we obtain β(μ) ≤ δ(μ) + ε, and the claim follows since ε > 0 is arbitrary.

We extend the definition of β(μ) to any μ ∈ P_2^* by (22). In this section, μ is a measure in P_2^* unless otherwise stated. Let L²_0(μ_i) be the set of functions T_i : R → R such that ∫ T_i dμ_i = 0 and ∫ T_i² dμ_i < ∞. By relaxing the set T_cw^*(μ) in (22) to all d-tuples T ∈ ⊕_i L²_0(μ_i), we obtain a lower bound β_L(μ) ≤ β(μ). Therefore, μ is copositive if β_L(μ) > 0. It can be shown that β(μ) and β_L(μ) are invariant under coordinate-wise transformations. Thus, β(μ) and β_L(μ) depend only on the copula of μ; furthermore, they depend only on the set of two-dimensional marginal copulas of μ.
We conjecture that β(μ) coincides with (1) if μ is Gaussian and S is its covariance matrix. See Sect. 7.
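One way to probe this conjecture numerically (our own sketch, not a procedure from the paper) is to restrict the infimum defining β(μ) to linear maps T_i(x_i) = a_i x_i with a_i ≥ 0, which are coordinate-wise non-decreasing. For a Gaussian μ with unit-variance marginals, the Rayleigh-type ratio then becomes a′Sa over nonnegative unit vectors a, a copositive-style minimization that upper-bounds β(μ):

```python
import numpy as np

rho = -0.5
S = np.array([[1.0, rho], [rho, 1.0]])   # bivariate Gaussian, unit diagonal

# Upper bound on beta(mu): minimize a' S a over nonnegative unit vectors a,
# parametrized here by an angle in the first quadrant.
theta = np.linspace(0.0, np.pi / 2, 10_001)
a = np.stack([np.cos(theta), np.sin(theta)])   # columns are unit vectors, a >= 0
ratio = np.einsum('it,ij,jt->t', a, S, a)      # a' S a for each column
print(ratio.min())                             # 1 + rho = 0.5 for this S
```

For this S the minimum over the nonnegative quadrant is attained at a = (1, 1)/√2 and equals 1 + rho, which is also the copositive minimum of the matrix, in line with the conjectured connection to (1).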

E.3 Rényi's condition of positive copula densities
The following theorem, proven by [31] for d = 2, provides a checkable condition for copositivity.

Theorem 7 ([31] for d = 2) Assume that μ has a regular support (see Sect. 4 for the definition) and that, for each pair i ≠ j, the two-dimensional marginal copula density function c_ij of μ is square integrable. Then, β_L(μ) > 0. In particular, μ is copositive.
Proof We first prove that if T ∈ ⊕_{i=1}^d L²_0(μ_i) satisfies ∑_i T_i = 0, then T = 0. Assume ∑_i T_i = 0. Let I ⊂ {1, …, d} be the set of indices i such that μ_i(T_i ≠ 0) > 0 and, by contradiction, assume that I is not empty. Let A_i = {x_i | T_i(x_i) > 0} for i ∈ I. Since ∫ T_i dμ_i = 0, we have μ_i(A_i) > 0. By the assumption on the support, we obtain μ(∩_{i∈I} A_i) > 0, which implies μ(∑_i T_i > 0) > 0, a contradiction. Thus, I is empty and T = 0. Now, we prove that β_L(μ) > 0 using elementary concepts of functional analysis (refer to [39]). We may assume that each μ_i is uniform over [0, 1], i.e., μ is a copula distribution. Define the operator C on H = ⊕_i L²_0(μ_i) by the conditional expectations (CT)_i(x_i) = ∑_{j ≠ i} E[T_j(X_j) | X_i = x_i]. By the assumption ∫∫ c_ij² dx_i dx_j < ∞, we deduce that C is a Hilbert–Schmidt operator. It is easy to see that C is self-adjoint. Now we can write ‖∑_i T_i‖² = ⟨T, (I + C)T⟩_H, (23) where I is the identity operator. If (I + C)T = 0, then (23) implies ∑_i T_i = 0 and therefore T = 0. Thus, I + C is injective. Since the operator I + C is an injective Fredholm operator, it is surjective. By the continuous inverse theorem, we deduce that the inverse operator (I + C)^{-1} is bounded. Therefore, we have ⟨T, (I + C)T⟩_H ≥ ‖(I + C)^{-1}‖^{-1} ⟨T, T⟩_H, which means β_L(μ) ≥ ‖(I + C)^{-1}‖^{-1} > 0.
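The proof suggests a numerical estimate of β_L: discretize the operator I + C on piecewise-constant functions. The following sketch is our own discretization for a two-dimensional piecewise-uniform copula density (a checkerboard density chosen arbitrarily for illustration); it computes the smallest eigenvalue of I + C on the mean-zero subspace:

```python
import numpy as np

m = 4
s = np.array([1.0, -1.0, 1.0, -1.0])
c = 1.0 + 0.5 * np.outer(s, s)               # positive piecewise-uniform copula density
assert np.allclose(c.sum(axis=1) / m, 1.0)   # uniform marginals

K = c / m                                    # discretized conditional-expectation kernel
H = np.block([[np.eye(m), K], [K.T, np.eye(m)]])   # I + C on step functions

# Project each block onto mean-zero step functions (the space L^2_0).
P1 = np.eye(m) - np.ones((m, m)) / m
P = np.block([[P1, np.zeros((m, m))], [np.zeros((m, m)), P1]])
eigs = np.sort(np.linalg.eigvalsh(P @ H @ P))
print(eigs[2])   # smallest eigenvalue on the mean-zero subspace (beta_L estimate)
```

The first two eigenvalues are the projected-out constant directions; the third, here 0.5, is a positive lower bound of the type ‖(I + C)^{-1}‖^{-1} appearing in the proof.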

Corollary 3
If μ has a positive and bounded copula density function, then μ is copositive.
In Sect. 6, we deal with positive and piecewise-uniform copula density functions. Note that the support of μ is not determined by the supports of the two-dimensional marginal distributions; see the following example. Refer to [32] for related topics.

E.4 Tail dependence
Many copulas useful in applications exhibit tail dependence (e.g. [14,17,27]). The following lemma shows that, unfortunately, Theorem 7 is not applicable to this class of copulas.

Lemma 18
Let d = 2 and assume that the copula density c(u_1, u_2) has a positive lower-tail dependence coefficient λ = lim_{δ↓0} C(δ, δ)/δ > 0, where C denotes the copula of c. Then c is not square-integrable.

Proof By the Cauchy–Schwarz inequality, C(δ, δ)/δ = (1/δ) ∫_{[0,δ]²} c du_1 du_2 ≤ (∫_{[0,δ]²} c² du_1 du_2)^{1/2}. If c is square-integrable, then the right-hand side, and hence the left-hand side, converges to 0 as δ → 0, which is impossible. Thus, c is not square-integrable.
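For instance, the Clayton copula C(u_1, u_2) = (u_1^{−θ} + u_2^{−θ} − 1)^{−1/θ} has lower tail dependence λ = 2^{−1/θ} > 0, so Lemma 18 applies to it. A quick numerical check of the limit:

```python
import numpy as np

theta = 2.0
# Clayton copula distribution function
C = lambda u, v: (u**-theta + v**-theta - 1.0) ** (-1.0 / theta)

deltas = np.array([1e-1, 1e-2, 1e-3, 1e-4])
print(C(deltas, deltas) / deltas)   # approaches 2**(-1/theta) = 0.7071...
```

Since the ratio tends to a positive limit, the Clayton density cannot be square-integrable, so Theorem 7 gives no information about its copositivity.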
We conjecture that many copulas with tail dependence are copositive. On the other hand, there is a non-copositive measure with positive copula density, as follows.