Abstract
It is shown that for any given multidimensional probability distribution satisfying regularity conditions, there exists a unique coordinatewise transformation such that the transformed distribution satisfies a Stein-type identity. A sufficient condition for the existence is referred to as copositivity of distributions. The proof is based on an energy minimization problem over a totally geodesic subset of the Wasserstein space. The result is considered as an alternative to Sklar’s theorem regarding copulas, and is also interpreted as a generalization of a diagonal scaling theorem. The Stein-type identity is applied to a rating problem of multivariate data. A numerical procedure for piecewise uniform densities is provided. Some open problems are also discussed.
Introduction
One of the important concepts in information geometry is the maximum entropy principle. The unique probability distribution that maximizes entropy under fixed moments is characterized by an exponential family ([15]). The exponential and mixture families induce a dually flat structure in the space of probability distributions (e.g. [2]).
We establish a version of the maximum entropy principle over the family of distributions generated by coordinatewise transformations. The coordinatewise transformations naturally arise in copula theory to make distributions have uniform marginals (e.g. [27]) and are considered as a subset of optimal transport maps.
Before going into details, we first consider a linear analogue of the problem. Marshall and Olkin [25] proved the following diagonal scaling theorem on matrices.
Theorem 1
([25]) Let \(S=(S_{ij})\in \mathbb {R}^{d\times d}\) be a positive semidefinite matrix and assume that S is strictly copositive in the sense that
\[\sum_{i,j=1}^d S_{ij}w_iw_j>0 \quad \text{for any } w=(w_1,\ldots,w_d)\ne 0 \text{ such that } w_i\ge 0 \text{ for all } i. \qquad (1)\]
Then, there exists a unique positive-definite diagonal matrix D such that
\[\sum_{j=1}^d (DSD)_{ij} = 1 \qquad (2)\]
for each \(i\in \{1,\dots ,d\}\).
Note that (1) is satisfied if S is positive definite. The equation (2) is, as pointed out by [20], the stationarity condition of the convex function
\[\varphi(w) = \frac{1}{2}\sum_{i,j=1}^d S_{ij}w_iw_j - \sum_{i=1}^d \log w_i, \qquad (3)\]
where \(w=(w_1,\ldots ,w_d)\) is the diagonal part of D.
Theorem 1 is interpreted in a probabilistic framework as follows. Let \(\mu \) be a normal distribution on \(\mathbb {R}^d\) with mean zero and covariance matrix \(S_{ij}=\int x_ix_j{\mathrm {d}\mu }\). Let \(\nu \) be the pushforward measure of \(\mu \) by the linear transformation \(x\mapsto Dx\) in \(\mathbb {R}^d\). Since the covariance matrix of \(\nu \) is DSD, the equation (2) is rewritten as
\[\int x_i\sum_{j=1}^d x_j\,{\mathrm {d}\nu} = 1 \qquad (4)\]
for each \(i\in \{1,\ldots ,d\}\). In other words, each coordinate \(x_i\) and the sum \(\sum _j x_j\) have unit covariance under the law \(\nu \).
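The diagonal matrix in the linear problem can be computed numerically. The sketch below (a hypothetical implementation, not taken from [25] or [20]) minimizes the convex function (3) by coordinate descent; each coordinate update solves the scalar stationarity condition \(S_{ii}w_i^2+(\sum_{j\ne i}S_{ij}w_j)w_i-1=0\) in closed form.

```python
import numpy as np

def diagonal_scaling(S, tol=1e-12, max_iter=10_000):
    """Find w > 0 with w_i * (S w)_i = 1 for each i (the diagonal of D in
    Theorem 1) by coordinate descent on phi(w) = w'Sw/2 - sum_i log(w_i)."""
    d = S.shape[0]
    w = np.ones(d)
    for _ in range(max_iter):
        w_old = w.copy()
        for i in range(d):
            b = S[i] @ w - S[i, i] * w[i]   # sum_{j != i} S_ij w_j
            # stationarity in w_i: S_ii w_i^2 + b w_i - 1 = 0, positive root
            w[i] = (-b + np.sqrt(b * b + 4.0 * S[i, i])) / (2.0 * S[i, i])
        if np.max(np.abs(w - w_old)) < tol:
            break
    return w

S = np.array([[1.0, 0.3, 0.2],
              [0.3, 1.0, 0.5],
              [0.2, 0.5, 1.0]])   # positive definite, hence strictly copositive
w = diagonal_scaling(S)
print(w * (S @ w))                # each entry close to 1, i.e. (DSD)1 = 1
```

The closed-form coordinate update keeps every iterate strictly positive, so no projection step is needed.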
In the present paper, we provide a nonlinear analogue of the theorem. We admit a nonlinear monotone coordinatewise transformation to achieve a condition stronger than (4). The condition will be referred to as the Stein-type identity (see Sect. 2 for the precise definition). Under some mild conditions on \(\mu \), it is shown that such a transformation exists uniquely and minimizes a free energy functional like (3), which has an entropy term. The space we use in the proof is the Wasserstein space, a metric space induced by optimal transportation. A key observation is that our functional is displacement convex in the sense of [26]. Refer to [30, 38] for comprehensive studies of optimal transportation and its applications.
Under the Stein-type identity, the sum of variables is positively correlated with each variable. This property is applied to a rating problem of multivariate data in Sect. 3.
As is well known, Sklar’s theorem (see, e.g., [27]) states that any multidimensional distribution is transformed by the probability integral transformation into a distribution with uniform marginals. The transformed distribution is called a copula. A linear analogue of Sklar’s theorem is that for any covariance matrix S there exists a unique positive-definite diagonal matrix D such that every diagonal element of DSD is unity. This is nothing but the correlation matrix corresponding to S.
There are several papers relevant to our study. A relation between copulas and diagonal scaling is investigated in [4] from different perspectives; their scaling operation does not correspond to a transformation of random variables. Optimal transportation between two distributions sharing the same copula is considered in [1], where various cost functions are the center of discussion. Optimal transportation is used to define multidimensional quantiles in [7] and [13]. Although our motivation is also to define a kind of quantile function for multivariate data, the construction is different from theirs (see Sect. 3). A particular class of optimal transport maps called moment maps has a deep connection to another Stein-type identity, as investigated in [10].
The remainder of the present paper is organized as follows. In Sect. 2, we define Stein-type distributions and transformations. In Sect. 3, we briefly explain their application to a rating problem of multivariate data. In Sect. 4, we state the existence and uniqueness theorems as well as a variational characterization theorem. In Sect. 5, we prove the main results using the theory of optimal transportation. In Sect. 6, a numerical method to find the transformation for piecewise uniform distributions is proposed. Finally, we discuss open problems in Sect. 7.
Definition of Stein-type distributions and transformations
We define a class of distributions that satisfy a condition stronger than (4). Let \(\mathcal {P}^2=\mathcal {P}^2(\mathbb {R}^d)\) be the set of probability distributions \(\mu \) on \(\mathbb {R}^d\) with mean zero and finite second moments such that each marginal distribution \(\mu _i\) of \(\mu \) is absolutely continuous with respect to the Lebesgue measure on \(\mathbb {R}\). Note that \(\mu \) itself is not assumed to be absolutely continuous. The mean-zero condition is imposed only for simplicity. We say that a function \(f:\mathbb {R}\rightarrow \mathbb {R}\) is absolutely continuous if there exists a locally integrable function \(f'\) such that \(f(x)=f(0)+\int _0^x f'(y)\mathrm {d}y\) in Lebesgue’s sense.
Definition 1
A distribution \(\mu \in \mathcal {P}^2\) is said to be Stein-type if it satisfies
\[\int f(x_i)\sum_{j=1}^d x_j\,{\mathrm {d}\mu} = \int f'(x_i)\,{\mathrm {d}\mu} \qquad (5)\]
for each \(i\in \{1,\ldots ,d\}\) and any absolutely continuous function \(f:\mathbb {R}\rightarrow \mathbb {R}\) with bounded derivative \(f'\).
Note that the equation (4) is a special case of (5) with \(f(x_i)=x_i\).
We refer to the equation (5) as the Stein-type identity. Indeed, if \(d=1\), it reduces to the Stein identity \(\int f(x_1)x_1{\mathrm {d}\mu }= \int f'(x_1){\mathrm {d}\mu }\), which implies that \(\mu \) is the standard normal distribution (see [35] and [6]). The Stein identity is used to evaluate the distance between a given distribution and the normal distribution. Although the Stein-type identity we defined is a generalization of the Stein identity, the author is not aware of its applications to such distance evaluation. Instead, we develop a different application. More specifically, if a random vector \((X_1,\ldots ,X_d)\) has a Stein-type distribution, then the sum \(\sum _j X_j\) is positively correlated with any monotone transformation of \(X_i\) due to (5). This property is applied to a rating problem in Sect. 3.
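As a quick illustration (not from the paper), the \(d=1\) identity can be checked by Monte Carlo with the bounded-derivative test function \(f=\tanh\):

```python
import numpy as np

# Monte Carlo check of the one-dimensional Stein identity
#   E[f(X) X] = E[f'(X)]   for X ~ N(0, 1),   here with f = tanh.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
lhs = np.mean(np.tanh(x) * x)           # E[f(X) X]
rhs = np.mean(1.0 / np.cosh(x) ** 2)    # E[f'(X)], since f'(t) = sech^2(t)
print(lhs, rhs)                         # the two estimates agree
```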
If \(\mu \) is completely independent in the sense that \(\mu \) is the direct product of its marginals \(\mu _i\), then the Stein-type distribution has to be the d-dimensional standard normal distribution. Hereafter, we focus on dependent cases.
For Gaussian random variables, we obtain the following lemma. We denote the expectation by \(\mathrm{E}\).
Lemma 1
(Theorem 5 of [33]) Let \(\mu \) denote the d-dimensional normal distribution with mean zero and covariance matrix \(S=(S_{ij})\). Then, \(\mu \) is Stein-type if and only if \(\sum _j S_{ij}=1\) for each i.
Proof
Let \(X=(X_i)\) be a random vector distributed according to \(\mu \). Then, the conditional expectation of \(X_j\) given \(X_i\) is \(\mathrm{E}[X_j\mid X_i]=S_{ij}X_i/S_{ii}\). The left-hand side of (5) is
\[\mathrm{E}\Big[f(X_i)\sum_{j=1}^d X_j\Big] = \sum_{j=1}^d \frac{S_{ij}}{S_{ii}}\,\mathrm{E}[f(X_i)X_i] = \sum_{j=1}^d S_{ij}\,\mathrm{E}[f'(X_i)],\]
where the last equality follows from the Stein identity for the one-dimensional case. Hence (5) holds if and only if \(\sum _j S_{ij}=1\). \(\square \)
The following example gives a rich class of Stein-type distributions.
Example 1
Let W be a random variable with the standard normal distribution and let U be any random variable independent of W such that \(\mathrm{E}[U]=0\) and \(\mathrm{E}[U^2]<\infty \). The condition \(\mathrm{E}[U]=0\) is assumed only to make the following distribution belong to \(\mathcal {P}^2\) and is not essential here. Consider the two variables
\[X_1 = \frac{W+U}{\sqrt{2}}, \qquad X_2 = \frac{W-U}{\sqrt{2}}.\]
Then the distribution of \((X_1,X_2)\) is Stein-type. Indeed, we obtain
\[\int f(x_i)(x_1+x_2)\,{\mathrm {d}\mu} = \int f'(x_i)\,{\mathrm {d}\mu}, \quad i\in\{1,2\},\]
for any f by the one-dimensional Stein identity with respect to W conditional on U. The variable \(W=(X_1+X_2)/\sqrt{2}\) has the meaning of “an overall score” of the two variables \(X_1\) and \(X_2\). The identity implies that W is positively correlated with any increasing function \(f(X_i)\) of \(X_i\).
This example is generalized to the case \(d\ge 3\). Define a random vector \((X_1,\ldots ,X_d)\) by \(X_i = (W+U_i)/\sqrt{d}\), where W has the standard normal distribution independent of \(U_1,\ldots ,U_{d-1}\), and \(U_d=-\sum _{j=1}^{d-1} U_j\). Then, the distribution of \((X_1,\ldots ,X_d)\) is Stein-type as long as \(\mathrm{E}[U_i]=0\) and \(\mathrm{E}[U_i^2]<\infty \). As in the two-dimensional case, W has positive correlation with any increasing function \(f(X_i)\) of \(X_i\). In Sect. 3, we will show by an example that the positive-correlation property does not hold in general for copulas.
The example does not cover the entire class of Stein-type distributions. Other examples are given in Sect. 6 and Appendix B.
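The construction of Example 1 can be simulated directly; the following sketch (with uniform noise variables \(U_i\), a hypothetical choice) checks the Stein-type identity \(\int f(x_i)\sum_j x_j\,{\mathrm {d}\mu}=\int f'(x_i)\,{\mathrm {d}\mu}\) for \(f=\tanh\) and \(d=3\).

```python
import numpy as np

# Sample X_i = (W + U_i)/sqrt(d) with W ~ N(0,1) independent of the U_i,
# E[U_i] = 0, and U_d = -sum_{j<d} U_j, as in Example 1; then check the
# Stein-type identity E[f(X_i) * sum_j X_j] = E[f'(X_i)] for f = tanh.
rng = np.random.default_rng(1)
d, n = 3, 1_000_000
w = rng.standard_normal(n)
u = rng.uniform(-1.0, 1.0, size=(n, d - 1))
u = np.hstack([u, -u.sum(axis=1, keepdims=True)])   # enforces sum_j U_j = 0
x = (w[:, None] + u) / np.sqrt(d)
s = x.sum(axis=1)                                   # equals sqrt(d) * W
errs = [abs(np.mean(np.tanh(x[:, i]) * s) - np.mean(1.0 / np.cosh(x[:, i]) ** 2))
        for i in range(d)]
print(errs)                                         # all close to 0
```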
For each \(\mu \in \mathcal {P}^2\), let \(\mathcal {T}_{\mathrm{cw}}(\mu )\) be the set of coordinatewise transformations
\[T(x) = (T_1(x_1),\ldots,T_d(x_d))\]
such that each \(T_i:\mathbb {R}\rightarrow \mathbb {R}\cup \{-\infty ,\infty \}\) is nondecreasing and \(T_\sharp \mu \) belongs to \(\mathcal {P}^2\). Here, \(T_\sharp \mu \) is the pushforward measure defined by \((T_\sharp \mu )(A)=\mu (T^{-1}(A))\) for any measurable set A. The set \(\mathcal {T}_{\mathrm{cw}}(\mu )\) depends only on the marginal distributions of \(\mu \). Two maps T and U in \(\mathcal {T}_\mathrm{cw}(\mu )\) are identified if \(\mu (T=U)=1\). Note that \(T_i\) may have points of discontinuity if the support of the ith marginal \((T_\sharp \mu )_i\) is not connected.
We consider the problem of finding a map \(T\in \mathcal {T}_{\mathrm{cw}}(\mu )\) such that \(T_\sharp \mu \) is Stein-type.
Definition 2
A map \(T\in \mathcal {T}_{\mathrm{cw}}(\mu )\) is called a Stein-type transformation of \(\mu \) if \(T_\sharp \mu \) is a Stein-type distribution.
For example, if \(\mu \) is the product measure of one-dimensional continuous distributions \(\mu _i\), then the map T defined by \(T_i(x_i)=\varPhi ^{-1}(\mu _i((-\infty ,x_i]))\) is the Stein-type transformation of \(\mu \), where \(\varPhi \) is the cumulative distribution function of the standard normal distribution. The map is nothing but the Brenier map between the two product measures.
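For a product measure this transformation is easy to compute. The sketch below (an illustration with independent uniform marginals on \([-1,1]\), not an example from the paper) applies \(T_i=\varPhi^{-1}\circ F_i\) and checks the Stein-type identity for the pushforward.

```python
import numpy as np
from statistics import NormalDist

# For independent uniform marginals on [-1, 1], T_i(x) = Phi^{-1}((x+1)/2)
# pushes mu forward to the standard normal product measure, which is
# Stein-type; check E[f(Y_1) (Y_1 + Y_2)] = E[f'(Y_1)] for f = tanh.
inv_cdf = np.vectorize(NormalDist().inv_cdf)
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=(100_000, 2))
u = np.clip((x + 1.0) / 2.0, 1e-12, 1.0 - 1e-12)   # F_i(x_i), kept in (0, 1)
y = inv_cdf(u)                                      # T_i(x_i): N(0, 1) marginals
s = y.sum(axis=1)
err = abs(np.mean(np.tanh(y[:, 0]) * s) - np.mean(1.0 / np.cosh(y[:, 0]) ** 2))
print(err)                                          # close to 0
```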
The following lemma is immediate.
Lemma 2
Let \(\mu \) be the normal distribution with a covariance matrix \(S=(S_{ij})\). Then, \(\mu \) has a Stein-type transformation if S is strictly copositive in the sense of (1).
Proof
Let D be the diagonal matrix with entries \(w_1,\ldots ,w_d\) satisfying (2). Set \(T(x)=Dx\). Then, \(T_\sharp \mu \) is Stein-type due to Lemma 1. \(\square \)
Application to a rating problem
In this section, we briefly describe an application of Stein-type transformations. We first explain a linear rating method for multivariate data according to [33]. Let S be the covariance matrix of an \(\mathbb {R}^d\)-valued random vector \(x=(x_i)\). Suppose that each variable \(x_i\) has the meaning of a “score”. For example, \(x_1\), \(x_2\) and \(x_3\) are student scores of mathematics, physics and history, and so forth. We want to determine positive weights \(w_i\) such that \(\sum _j w_jx_j\) reflects the scores \(x_1,\dots ,x_d\). A candidate for such a weight \(w_i\) is the ith diagonal element of D in (2). Indeed, under (2), the variable \(x_i\) and the overall score \(\sum _j w_jx_j\) have positive covariance for each i. This property is not attained in general by other methods of weighting. The quantity \(\sum _j w_jx_j\) is called the objective general index (OGI) in [33], which reflects the d scores in this sense.
In a similar manner, we can define a nonlinear version of the objective general index via Stein-type transformations. Let \(\mu \) be a probability distribution on \(\mathbb {R}^d\) and x be a random vector distributed according to \(\mu \). Again each coordinate \(x_i\) is assumed to have the meaning of a score. Then the nonlinear general index is defined by
\[g(x) = \sum_{i=1}^d T_i(x_i),\]
where \(T=(T_i)\) is the Stein-type transformation of \(\mu \). The quantity g(x) satisfies \(\mathrm{E}[g(X)f(X_i)]\ge 0\) for any increasing function \(f(x_i)\) of \(x_i\) due to the Stein-type identity (5). In particular, by taking the step function \(f(x_i)=I_{[a,\infty )}(x_i)\) and noting that \(\mathrm{E}[g(X)]=0\), we obtain
\[\mathrm{E}[g(X)\mid X_i\ge a] \ \ge\ 0 \ \ge\ \mathrm{E}[g(X)\mid X_i< a] \qquad (6)\]
for each \(i\in \{1,\ldots ,d\}\) and \(a\in \mathbb {R}\). The property means that the conditional average of the overall score of students taking a higher score on \(x_i\) is larger than that of students taking a lower score. In this sense, g(x) reflects the d scores.
General indices g(x) satisfying the inequality (6) are not unique. Indeed, for two-dimensional distributions, the probability integral transformation \(T_i(x_i)=\mu _i((-\infty ,x_i])\) provides (6); see Appendix D for a proof. However, for higher-dimensional distributions, it is not trivial to find such a transformation. The Stein-type transformation solves the problem. We confirm this point by an example.
Example 2
Suppose that \(x_1\), \(x_2\) and \(x_3\) are random variables that have a probability distribution \(\mu \) on \([-1,1]^3\) with the density function
The three marginal distributions are uniform over \([-1,1]\), i.e., \(\mu \) is a copula. We can see
Hence the property in (6) does not hold if we adopt \(g(x)=\sum _i x_i\). As will be demonstrated in Sect. 6, the unique Stein-type transformation of \(\mu \) is
with \(c_1=1.2490\) and \(c_2=c_3=0.3445\), where \(\varPhi \) denotes the standard normal distribution function and \(c_1,c_2\) are numerically obtained. The nonlinear objective general index \(g(x)=\sum _i T_i(x_i)\) satisfies the relation (6) as a result.
Main results
For given \(\mu \in \mathcal {P}^2\), denote the set of coordinatewise transformed distributions of \(\mu \) by
\[\mathcal {F}_{\mu} = \{T_\sharp \mu \mid T\in \mathcal {T}_{\mathrm{cw}}(\mu)\}.\]
We refer to \(\mathcal {F}_{\mu }\) as a fiber. The following lemma is a direct consequence of one-dimensional optimal transportation. See Appendix A.
Lemma 3
For given \(\mu \in \mathcal {P}^2\) and \(\nu \in \mathcal {F}_{\mu }\), the map \(T\in \mathcal {T}_{\mathrm{cw}}(\mu )\) satisfying \(\nu =T_\sharp \mu \) is uniquely determined \(\mu \)almost everywhere. Furthermore, the relation \(\nu \in \mathcal {F}_{\mu }\) between two measures \(\mu \) and \(\nu \) is an equivalence relation. In particular, \(\mathcal {P}^2\) is partitioned into mutually disjoint fibers.
Now, we state our three main theorems. All proofs are presented in Sect. 5.
The first main theorem characterizes Stein-type distributions in terms of a variational principle. Define an energy functional \(\mathcal {E}(\mu )\) of \(\mu \) by
\[\mathcal {E}(\mu) = \sum_{i=1}^d \int p_i(x_i)\log p_i(x_i)\,{\mathrm {d}x}_i + \frac{1}{2}\int \Big(\sum_{j=1}^d x_j\Big)^2{\mathrm {d}\mu}, \qquad (7)\]
where \(p_i={\mathrm {d}\mu }_i/{\mathrm {d}x}_i\) is the marginal density function. The first term of \(\mathcal {E}(\mu )\) represents the negative entropy of the marginal distributions, and the second term is half the variance of the diagonal part \(\sum _j x_j\). The functional has displacement convexity over each fiber in the sense of [26]. See Sect. 5 for details. If we replace the entropy term with the joint entropy, the functional becomes displacement convex over the whole space, as proved by McCann [26].
Theorem 2
A measure \(\mu \in \mathcal {P}^2\) is Stein-type if and only if \(\mathcal {E}(\mu )\) is finite and \(\mu \) minimizes \(\mathcal {E}\) over the fiber \(\mathcal {F}_{\mu }\).
The second main theorem is on the uniqueness of Stein-type transformations. A distribution \(\mu \) on \(\mathbb {R}^d\) is said to have a regular support if the support of \(\mu \) is equal to the direct product of the supports of the marginal distributions \(\mu _i\). This property is invariant under coordinatewise transformations. Note that the regular support condition does not imply absolute continuity of \(\mu \) with respect to \(\prod _{i=1}^d \mu _i\).
Theorem 3
(Uniqueness) Assume that \(\mu \in \mathcal {P}^2\) has a regular support. Then, a Stein-type transformation of \(\mu \) is unique if it exists.
We conjecture that the uniqueness follows without the regular support condition. See Sect. 7 for more details.
The third main theorem is on existence. A measure \(\mu \in \mathcal {P}^2\) is said to be copositive if
\[\beta(\mu) := \inf_{T\in \mathcal {T}_{\mathrm{cw}}(\mu)} \frac{\int \big(\sum_{i=1}^d T_i(x_i)\big)^2{\mathrm {d}\mu}}{\sum_{i=1}^d \int T_i(x_i)^2\,{\mathrm {d}\mu}} \ >\ 0. \qquad (8)\]
For example, if \(\mu \) is pairwise independent, then \(\int (\sum _i T_i(x_i))^2{\mathrm {d}\mu }= \sum _i \int T_i(x_i)^2{\mathrm {d}\mu }\) for any \(T\in \mathcal {T}_{\mathrm{cw}}(\mu )\), and therefore \(\beta (\mu )=1\). It is not difficult to see that \(\beta (\mu )\le 1\) for any \(\mu \). If \(\mu \) is associated in the sense of [9, 11], and [21], then \(\int T_i(x_i)T_j(x_j){\mathrm {d}\mu }\ge 0\) for each pair of i and j, and therefore \(\beta (\mu )\ge 1\). On the other hand, if \(d=2\) and \(\mu (\{x\mid x_1+x_2=0\})=1\), then \(\beta (\mu )=0\) because \(\int (x_1+x_2)^2{\mathrm {d}\mu }=0\). Sufficient conditions for copositivity are presented in Appendix E.
Theorem 4
(Existence) Let \(\mu \in \mathcal {P}^2\) be copositive. Then, there exists a Stein-type transformation of \(\mu \).
We now present a few remarks before proceeding to the following section.
The uniqueness and existence results in Theorem 3 and Theorem 4 are consequences of the variational characterization in Theorem 2, as will be shown in Sect. 5. For \(d=1\), the functional \(\mathcal {E}(\mu )\) is the Kullback–Leibler divergence from \(\mu \) to the standard normal density up to a constant term. For \(d\ge 2\), however, \(\mathcal {E}\) is not even bounded from below. Indeed, for each \(t>0\), let \(\mu ^t\) be the multivariate normal distribution with mean zero and covariance matrix \(\Sigma _t = P + t(I-P)\), where I is the identity matrix and
\[P = \frac{1}{d}\mathbf{1}\mathbf{1}^{\top}\]
is the matrix whose entries are all equal to 1/d. Then, each marginal distribution of \(\mu ^t\) is normal with variance \(\sigma _t^2=(1/d)+t(1-1/d)\). We can show that \(\mathcal {E}(\mu ^t) = -(d/2)\log (2\pi \sigma _t^2)\), which tends to \(-\infty \) as \(t\rightarrow \infty \). Therefore, it is not trivial whether there is a minimizer of \(\mathcal {E}\) over the fiber. Nevertheless, the existence and uniqueness theorems are obtained.
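For completeness, here is the computation behind this claim, assuming the form \(\mathcal{E}(\mu)=\sum_i\int p_i\log p_i\,{\mathrm {d}x}_i+\tfrac{1}{2}\int (\sum_j x_j)^2\,{\mathrm {d}\mu}\) of the functional and \(P=(1/d)\mathbf{1}\mathbf{1}^{\top}\):

```latex
% Each marginal of \mu^t is N(0, \sigma_t^2), whose negative entropy is
%   \int p_i \log p_i \,\mathrm{d}x_i = -\tfrac{1}{2}\log(2\pi e\,\sigma_t^2).
% The variance of the sum is constant in t:
%   \mathbf{1}^\top \Sigma_t \mathbf{1}
%     = \mathbf{1}^\top P \mathbf{1} + t\,(\mathbf{1}^\top\mathbf{1} - \mathbf{1}^\top P\mathbf{1})
%     = d + t\,(d - d) = d.
% Hence
\mathcal{E}(\mu^t)
  = -\frac{d}{2}\log\bigl(2\pi e\,\sigma_t^2\bigr) + \frac{d}{2}
  = -\frac{d}{2}\log\bigl(2\pi\sigma_t^2\bigr)
  \;\longrightarrow\; -\infty \quad (t\to\infty).
```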
If \(\mu \) has the joint density function p(x) with respect to the Lebesgue measure, then the negative joint entropy is defined by
\[\mathcal {U}_d(\mu) = \int p(x)\log p(x)\,{\mathrm {d}x}.\]
In most cases, we can replace the marginal entropy term in \(\mathcal {E}(\mu )\) with the joint entropy because the difference \(\mathcal {U}_d(\mu )-\sum _{i=1}^d \mathcal {U}_1(\mu _i)\), which is referred to as the multi-information function or the measure of multivariate dependence, is invariant in each fiber (e.g., [16] and [36]). However, in some pathological cases, the difference diverges. Therefore, it is more appropriate to adopt the marginal entropy.
According to Sklar’s theorem (e.g. [27]), any d-dimensional distribution \(\mu \) is transformed by the probability integral transformation \(T_i(x_i)=\int _{-\infty }^{x_i}{\mathrm {d}\mu }_i\) into the distribution \(T_\sharp \mu \) with uniform marginals unless some \(\mu _i\) has an atom. The resultant distribution \(T_\sharp \mu \) is a copula. The Stein-type distribution we defined is considered as an alternative representation of the copula. Copulas are also characterized by an energy minimization problem. Here, the potential term in (7) is replaced with \(\int V(x){\mathrm {d}\mu }\), where \(V(x)=\infty \) if \(x\notin [0,1]^d\) and 0 otherwise. In parallel, we have to remove the condition \(\int x_i{\mathrm {d}\mu }_i=0\) from the definition of \(\mathcal {P}^2\). Maximum entropy copulas under a given diagonal section are discussed in [5], where, in contrast to the present paper, the marginals are fixed to be uniform.
Proofs based on the theory of optimal transportation
In this section, we prove the three main theorems stated in Sect. 4. The proof is based on the theory of optimal transportation. Necessary facts about one-dimensional optimal transportation are summarized in Appendix A.
Regularity of Stein-type distributions
The Stein-type identity forces regularity of the marginal density functions. We first characterize this by an integral equation.
Theorem 5
Let \(\mu \in \mathcal {P}^2\). Denote the marginal density functions of \(\mu \) with respect to the Lebesgue measure by \(p_i(x_i)={\mathrm {d}\mu }_i/{\mathrm {d}x}_i\). Then, \(\mu \) is a Stein-type distribution if and only if it satisfies the set of integral equations
\[p_i(x_i) = \int_{x_i}^{\infty} m_i(\xi)\,p_i(\xi)\,{\mathrm {d}\xi}, \quad i\in\{1,\ldots,d\}, \qquad (9)\]
for almost every \(x_i\), where \(m_i(x_i)\) denotes the conditional expectation of \(\sum _{j=1}^d x_j\) given \(x_i\) with respect to \(\mu \).
Proof
First, note that \(m_i(x_i)\) is finite \(\mu _i\)almost everywhere because \(\mu \) belongs to \(\mathcal {P}^2\).
Assume \(\mu \) is Stein-type. For \(-\infty<a<b<\infty \), let \(h_{ab}(x)=(b-a)^{-1}\int _{-\infty }^x I_{(a,b)}(\xi ){\mathrm {d}\xi }\), where \(I_{(a,b)}\) is the indicator function of (a, b). The Stein-type identity for \(h_{ab}\) implies
\[\int h_{ab}(x_i)\sum_{j=1}^d x_j\,{\mathrm {d}\mu} = \frac{1}{b-a}\int_a^b p_i(x_i)\,{\mathrm {d}x}_i.\]
Letting \(b\rightarrow a\), we obtain (9).
Conversely, assume (9). The right-hand side of (9) converges to zero as \(x_i\rightarrow \pm \infty \) because \(\int x_j{\mathrm {d}\mu }_j=0\) for all j. Then, for any bounded and absolutely continuous function f with bounded derivative \(f'\), we obtain the Stein-type identity for f:
\[\int f(x_i)\sum_{j=1}^d x_j\,{\mathrm {d}\mu} = \int f(x_i)\,m_i(x_i)p_i(x_i)\,{\mathrm {d}x}_i = \int f'(x_i)\Big(\int_{x_i}^{\infty} m_i(\xi)p_i(\xi)\,{\mathrm {d}\xi}\Big){\mathrm {d}x}_i = \int f'(x_i)\,{\mathrm {d}\mu},\]
where the second equality follows from the integration-by-parts formula and the third from (9). If f is not bounded, then let \(f_M(x)=f(0)+\int _0^xf'(u)1_{\{|u|\le M\}}\mathrm {d}u\) and take \(M\rightarrow \infty \). \(\square \)
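For \(d=1\) (where \(m_1(\xi)=\xi\)), the integral equation reads \(p(x)=\int_x^\infty \xi\,p(\xi)\,{\mathrm {d}\xi}\), and the standard normal density solves it exactly, since \(\int_x^\infty \xi\,\phi(\xi)\,{\mathrm {d}\xi}=\phi(x)\). A small numerical confirmation (an illustration, not part of the proof):

```python
import numpy as np

# Check p(x) = \int_x^\infty xi * p(xi) dxi for the standard normal density
# on a fine grid, approximating the tail integral by a cumulative sum.
xs = np.linspace(-8.0, 8.0, 160_001)
h = xs[1] - xs[0]
phi = np.exp(-xs ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
cum = np.cumsum(xs * phi) * h          # \int_{-8}^{x} xi * phi(xi) dxi
tail = cum[-1] - cum                   # \int_{x}^{8}  xi * phi(xi) dxi
err = np.max(np.abs(tail - phi))       # discretization + truncation error
print(err)                             # small (the tails beyond 8 are negligible)
```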
As a corollary, the regularity of the marginal density functions is established.
Corollary 1
Let \(\mu \) be Stein-type. Then, its marginal density functions \(p_i(x_i)\) are bounded, absolutely continuous, and converge to zero as \(x_i\rightarrow \pm \infty \).
Proof
From the formula (9), it is obvious that \(p_i\) is absolutely continuous and bounded by \(\int |\sum _jx_j|{\mathrm {d}\mu }<\infty \). We also have \(p_i(x_i)\rightarrow 0\) as \(x_i\rightarrow \pm \infty \) because the right-hand side of (9) vanishes in these limits. \(\square \)
Although the marginal density function of any Stein-type distribution is absolutely continuous, it can have nondifferentiable points, as shown in an example in Sect. 6. The continuous differentiability of \(p_i(x_i)\) follows from the regularity of the pairwise copula of \(\mu \) via formula (9). We do not pursue this line of investigation here. On the other hand, we conjecture that the marginal density of any Stein-type distribution is positive everywhere. See Sect. 7 for more details.
The following corollary will be used later.
Corollary 2
Let \(\mu \) be Stein-type. Then, its negative marginal entropy \(\int p_i(x_i)\log p_i(x_i){\mathrm {d}x}_i\) is finite.
Proof
Since the marginal density \(p_i(x_i)\) is bounded, we have \(\int p_i(x_i)\log p_i(x_i){\mathrm {d}x}_i<\infty \). To prove \(\int p_i(x_i)\log p_i(x_i){\mathrm {d}x}_i>-\infty \), we use the nonnegativity of the Kullback–Leibler divergence from \(p_i\) to the standard normal density \(\phi (x_i)=e^{-x_i^2/2}/\sqrt{2\pi }\). Indeed,
\[\int p_i(x_i)\log p_i(x_i)\,{\mathrm {d}x}_i \ \ge\ \int p_i(x_i)\log \phi (x_i)\,{\mathrm {d}x}_i = -\frac{1}{2}\log (2\pi ) - \frac{1}{2}\int x_i^2\,{\mathrm {d}\mu }_i \ >\ -\infty \]
because \(\int x_i^2 {\mathrm {d}\mu }_i<\infty \). \(\square \)
Other properties of Stein-type distributions are given in Appendix C.
Variational problem over a fiber of the Wasserstein space
Let \(\mathcal {F}\) be a fiber of \(\mathcal {P}^2\) (see Sect. 4 for the definition) and choose two measures \(\mu \) and \(\nu \) in \(\mathcal {F}\), where \(\nu \) is written as \(\nu =T_\sharp \mu \) with some \(T\in \mathcal {T}_{\mathrm{cw}}(\mu )\) by definition. Define the geodesic, which is also referred to as the displacement interpolation [26], from \(\mu \) to \(\nu \) by
\[[\mu,\nu]_t = \big((1-t)\,\mathrm{Id} + tT\big)_\sharp \mu, \quad t\in[0,1],\]
where \(\mathrm{Id}\) denotes the identity map. Based on one-dimensional optimal transportation, it follows that \([\mu ,\nu ]_t\in \mathcal {F}\) and \([\mu ,\nu ]_t=[\nu ,\mu ]_{1-t}\) for each t. A functional on \(\mathcal {F}\) is said to be displacement convex if it is convex along each geodesic. Refer to [26] and [3] for further details on displacement convexity.
Although a geodesic between any pair of distributions in \(\mathcal {P}^2\) is defined similarly, we need only geodesics within a common fiber. It is known that a geodesic actually attains the minimum length of a path between two measures with respect to the \(L^2\)-Wasserstein distance (see e.g. [3] and [38]). Here the \(L^2\)-Wasserstein distance is the infimum of \(\left( \int \Vert x-y\Vert ^2 \mathrm {d}\gamma (x,y)\right) ^{1/2}\) over the set of joint distributions \(\gamma \) on \(\mathbb {R}^{2d}\) with the marginals \(\mu \) and \(\nu \). Each fiber \(\mathcal {F}\) is totally geodesic in the sense of [37].
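In one dimension, both objects are explicit in terms of quantile functions: \(W_2(\mu,\nu)\) is the \(L^2(0,1)\) distance between the quantile functions, and the displacement interpolation interpolates them linearly. A small numerical sketch (with two arbitrarily chosen distributions):

```python
import numpy as np
from statistics import NormalDist

# 1-D displacement interpolation: if q_mu, q_nu are quantile functions, then
# W2(mu, nu) = ||q_mu - q_nu||_{L^2(0,1)} and [mu, nu]_t has quantile
# (1 - t) q_mu + t q_nu, so the geodesic moves at constant speed.
u = (np.arange(20_000) + 0.5) / 20_000
q_mu = np.log(u / (1.0 - u))                    # logistic distribution
q_nu = np.vectorize(NormalDist().inv_cdf)(u)    # standard normal
w2 = lambda a, b: np.sqrt(np.mean((a - b) ** 2))
t = 0.3
q_t = (1.0 - t) * q_mu + t * q_nu               # quantile of [mu, nu]_t
err = abs(w2(q_mu, q_t) - t * w2(q_mu, q_nu))   # constant-speed property
print(err)                                      # zero up to rounding
```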
Recall that \(\mu \) is said to have a regular support if its support is the direct product of the supports of marginal distributions.
Lemma 4
Let \(\mathcal {F}\) be a fiber and choose any two distributions \(\mu \) and \(\nu \) in \(\mathcal {F}\), where \(\mu \ne \nu \). Then, \(\mathcal {E}([\mu ,\nu ]_t)\) is convex in t, that is, \(\mathcal {E}\) is displacement convex over \(\mathcal {F}\). Furthermore, \(\mathcal {E}([\mu ,\nu ]_t)\) is strictly convex if one of the following conditions is satisfied:

(i) \(\mu \) (and therefore \(\nu \)) has a regular support, or

(ii) the supports of \(\mu _i\) and \(\nu _i\) are connected, respectively, for each i.
Proof
Let \(\nu =T_\sharp \mu \) with \(T\in \mathcal {T}_{\mathrm{cw}}(\mu )\). Let \(p_i={\mathrm {d}\mu }_i/{\mathrm {d}x}_i\) be the marginal density of \(\mu \). By the change-of-variable formula (Lemma 7 in Appendix A), we obtain
\[\mathcal {E}([\mu,\nu]_t) = \sum_{i=1}^d \int p_i(x_i)\log \frac{p_i(x_i)}{(1-t)+tT_i'(x_i)}\,{\mathrm {d}x}_i + \frac{1}{2}\int \Big(\sum_{j=1}^d \big((1-t)x_j+tT_j(x_j)\big)\Big)^2{\mathrm {d}\mu}, \qquad (10)\]
where \(T_i'(x_i)\) is the derivative of \(T_i\) if it exists, and \(T_i'(x_i)=0\) otherwise. Both terms in (10) are convex in t.
Assume (i) and that \(\mathcal {E}([\mu ,\nu ]_t)\) is not strictly convex. Then, there is an interval over which \(\mathcal {E}([\mu ,\nu ]_t)\) is linear. It is deduced from (10) that \(\sum _{i=1}^d (T_i(x_i)-x_i)=0\), \(\mu \)-almost everywhere. Let I be the set of indices i such that \(\mu _i(T_i(x_i)\ne x_i)>0\). Then, I is not empty because \(T\ne \mathrm{Id}\). For each \(i\in I\), the probability \(\mu _i(T_i(x_i)-x_i>0)\) is positive because \(\int (T_i(x_i)-x_i){\mathrm {d}\mu }_i=0\). Then, by the regular support condition, the probability \(\mu (\sum _{i\in I}(T_i(x_i)-x_i)>0)\) is positive. However, this contradicts \(\sum _{i=1}^d (T_i(x_i)-x_i)=0\). Thus, \(\mathcal {E}([\mu ,\nu ]_t)\) must be strictly convex under (i).
Next, assume (ii). Then, \(T_i\) has no points of discontinuity. Assume \(\mathcal {E}([\mu ,\nu ]_t)\) is not strictly convex. Then, it follows from (10) that \(T_i'(x_i)=1\) and, therefore, \(T_i(x_i)=x_i\) by the connectedness of the support together with the condition \(\int T_i{\mathrm {d}\mu }_i=0\). However, this contradicts \(\mu \ne \nu \). Thus, \(\mathcal {E}([\mu ,\nu ]_t)\) is strictly convex. \(\square \)
Example 3
The strict convexity of \(\mathcal {E}([\mu ,\nu ]_t)\) can fail if neither condition (i) nor condition (ii) in Lemma 4 is satisfied. For example, let \(d=2\) and assume that \(\mu \) is uniformly distributed over the region \(([-1,0]\times [0,1])\cup ([0,1]\times [-1,0])\). Define the map T by \(T_i(x_i)=x_i+1\) if \(x_i>0\), and \(T_i(x_i)=x_i-1\) otherwise, for each i. Let \(\nu =T_\sharp \mu \). Then, \(\mathcal {E}([\mu ,\nu ]_t)\) is constant along \(t\in [0,1]\) because \(T_i'(x_i)=1\) and \(\sum _i T_i(x_i)=\sum _i x_i\), \(\mu \)-almost everywhere. In this case, \(\mu _i\) is supported on \([-1,1]\), whereas \(\nu _i\) is supported on \([-2,-1]\cup [1,2]\).
Proof of Theorem 2
Let \(\mu \) be a Stein-type distribution. Corollary 2 implies that \(\mu \) belongs to \(\mathrm{dom}\,\mathcal {E}\). From the convexity (Lemma 4), it is sufficient to show that
\[\frac{{\mathrm {d}}}{{\mathrm {d}t}_+}\Big|_{t=0}\mathcal {E}([\mu,\nu]_t) \ \ge\ 0\]
for any \(\nu =T_\sharp \mu \in \mathcal {F}\), where \({\mathrm {d}}/{\mathrm {d}t}_+\) denotes the right derivative. It follows from formula (10) that
\[\frac{{\mathrm {d}}}{{\mathrm {d}t}_+}\Big|_{t=0}\mathcal {E}([\mu,\nu]_t) = \sum_{i=1}^d \left[\int \big(1-T_i'(x_i)\big)\,{\mathrm {d}\mu}_i + \int \Big(\sum_{j=1}^d x_j\Big)\big(T_i(x_i)-x_i\big)\,{\mathrm {d}\mu}\right]. \qquad (11)\]
If \(T_i\) is absolutely continuous, the corresponding summand vanishes by the Stein-type identity, where the boundedness of the derivative \(T_i'\) can be assumed by a standard approximation argument, as in the proof of Theorem 5. If \(T_i\) is not absolutely continuous, \(T_i\) can be decomposed into an absolutely continuous part and a discontinuous part as \(T_i=T_i^{\mathrm{ac}}+T_i^{\mathrm{d}}\). See Appendix A. The contribution of \(T_i^{\mathrm{ac}}\) to (11) vanishes due to the Stein-type identity. It is sufficient to prove that \(\int T_i^{\mathrm{d}}(x_i) \sum _j x_j {\mathrm {d}\mu }\ge 0\) for each i because \((T_i^{\mathrm{d}})'=0\) by definition. We can take a sequence \(\{f_{i,n}\}_{n=1}^{\infty }\) of nondecreasing differentiable functions with bounded derivatives such that \(f_{i,n}(x_i)\) converges to \(T_i^{\mathrm{d}}(x_i)\) \(\mu \)-almost everywhere. More specifically, a step function \(I_{[\xi ,\infty )}(x_i)\) at each \(\xi \in \mathbb {R}\) is approximated by the logistic function \(1/(1+\exp (-n(x_i-\xi )))\). Then, by Lebesgue’s dominated convergence theorem and the Stein-type identity, we obtain
\[\int T_i^{\mathrm{d}}(x_i)\sum_{j=1}^d x_j\,{\mathrm {d}\mu} = \lim_{n\rightarrow \infty}\int f_{i,n}(x_i)\sum_{j=1}^d x_j\,{\mathrm {d}\mu} = \lim_{n\rightarrow \infty}\int f_{i,n}'(x_i)\,{\mathrm {d}\mu} \ \ge\ 0.\]
Conversely, assume that \(\mathcal {E}(T_\sharp \mu )\) is minimized at \(T=\mathrm{Id}\). Fix \(1\le i\le d\), and let f be an absolutely continuous function with bounded derivative \(f'\). For sufficiently small \(\varepsilon >0\), the two maps \(T(x)=x\pm \varepsilon f(x_i)e_i\), where \(e_i\) is the ith unit vector, belong to \(\mathcal {T}_{\mathrm{cw}}(\mu )\). Thus, the right derivative (11) has to be zero, and \(\mu \) satisfies the Stein-type identity.
Proof of Theorem 3
Assume that \(\mu \) has a regular support and admits a Stein-type transformation T. Then, Theorem 2 implies that \(T_\sharp \mu \) minimizes \(\mathcal {E}\) over the fiber \(\mathcal {F}_{\mu }\). Moreover, it is deduced from Lemma 4 that \(\mathcal {E}\) is strictly convex over \(\mathcal {F}_{\mu }\). Thus, the minimizer is unique.
Proof of Theorem 4
Assume that \(\mu \) is copositive. Denote the functional \(\mathcal {E}\) restricted to the fiber \(\mathcal {F}_{\mu }\) by \(\mathcal {E}_{\mu }\). From Theorem 2, it is sufficient to show that \(\mathcal {E}_{\mu }\) has a minimum point. We first show that \(\mathcal {E}_{\mu }\) is bounded from below and that the level set \(\{\nu \mid \mathcal {E}_{\mu }(\nu )\le c\}\) for each \(c\in \mathbb {R}\) is tight. For any \(\nu \in \mathcal {F}_{\mu }\), the copositivity condition implies
\[\int \Big(\sum_{j=1}^d x_j\Big)^2{\mathrm {d}\nu} \ \ge\ \beta \sum_{i=1}^d \int x_i^2\,{\mathrm {d}\nu}_i,\]
where \(q_i={\mathrm {d}\nu }_i/{\mathrm {d}x}_i\) and \(\beta =\beta (\nu )=\beta (\mu )>0\). We obtain
\[\int q_i(x_i)\log q_i(x_i)\,{\mathrm {d}x}_i + \frac{\beta }{2}\int x_i^2\,{\mathrm {d}\nu}_i \ \ge\ \frac{1}{2}\log \frac{\beta }{2\pi },\]
where the inequality follows from the nonnegativity of the Kullback–Leibler divergence from \(q_i\) to the normal density with mean zero and variance \(1/\beta \). Then, \(\mathcal {E}_{\mu }\) is bounded from below as
\[\mathcal {E}_{\mu }(\nu) \ \ge\ \sum_{i=1}^d \left[\int q_i(x_i)\log q_i(x_i)\,{\mathrm {d}x}_i + \frac{\beta }{2}\int x_i^2\,{\mathrm {d}\nu}_i\right] \ \ge\ \frac{d}{2}\log \frac{\beta }{2\pi } =: C,\]
where C is a constant independent of \(\nu \). This inequality also implies that the level set \(\{\nu \mid \mathcal {E}_{\mu }(\nu )\le c\}\) is tight.
Now there exists a weakly converging sequence \(\nu _k\) such that \(\mathcal {E}_{\mu }(\nu _k)\) converges to \(\inf \mathcal {E}_{\mu }(\nu )\). Let \(\nu _*\) be the weak limit. Then, Corollary 3.5 of [26] shows that \(\nu _*\in \mathcal {P}^2\) and \(\mathcal {E}_{\mu }(\nu _*)\le \lim _k \mathcal {E}_{\mu }(\nu _k)\). The distribution \(\nu _*\) gives a minimum point of \(\mathcal {E}_{\mu }\). This completes the proof.
Piecewise uniform densities
In this section, it is shown that if \(\mu \) has a piecewise uniform density function, then the Stein-type transformation of \(\mu \) is obtained by finite-dimensional optimization. Here, we do not impose the zero-mean condition \(\int x_i\mathrm{d}\mu =0\) on \(\mu \). We can always translate it to have zero mean if necessary.
We say that a probability density function c(u) on \([0,1]^d\) is piecewise uniform if its twodimensional marginal densities \(c_{ij}\) (\(1\le i<j\le d\)) are written as
for some n, where \(\pi _{ab}^{ij}\) is a positive number such that
Let \(\pi _a^i=\sum _{b=1}^n \pi _{ab}^{ij}\). Although c is not a copula density unless \(\pi _a^i=1/n\) for all i and a, it is transformed by a piecewise linear transform into a copula density. Then Corollary 3 in Appendix E, together with Theorem 4, guarantees the existence of a Steintype transformation as long as the support of c(u) is \([0,1]^d\).
By solving Equation (9), we obtain an expression of the Stein-type transformation of c as follows. Denote the cumulative distribution function and density function of the standard normal distribution by \(\varPhi \) and \(\phi \), respectively.
Lemma 5
Suppose that c(u) satisfies (12) and its support is \([0,1]^d\). Let p be the unique Stein-type density transformed from c. Then, there exist real constants \(\alpha _{1i},\ldots ,\alpha _{ni}\) and \(\xi _{1i}<\cdots <\xi _{n-1,i}\) such that
where \(\xi _{0i}=-\infty \), \(\xi _{ni}=\infty \), and \(Z_{ai}= \varPhi (\xi _{ai}-\alpha _{ai}) - \varPhi (\xi _{a-1,i}-\alpha _{ai})\). The Stein-type transformation is
and the two-dimensional marginal density is
Furthermore, the following identity is satisfied:
Proof
Equation (9) implies that \(\partial _i p_i(x_i) = -(x_i + \sum _{j\ne i}E[X_j\mid x_i])p_i(x_i)\), where \(\partial _i=\partial /\partial x_i\). Since the conditional expectation \(E[X_j\mid x_i]\) has to be piecewise constant, \(p_i(x_i)\) is piecewise Gaussian up to a normalizing constant. Since the mass of each piece is preserved under a coordinatewise transformation, we obtain the form (13). Then, the unique monotone transformation (14) is derived from \(c_i(u_i){\mathrm {d}u}_i=p_i(x_i){\mathrm {d}x}_i\). Equation (15) results from the transformation of \(c_{ij}(u_i,u_j)\). Finally, Equation (16) is obtained from \(\partial _i \log p_i(x_i) = -(x_i+\sum _{j\ne i}E[X_j\mid x_i])\). \(\square \)
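As a numerical illustration of the piecewise Gaussian form (13) (a sketch with illustrative parameters of our own choosing, not values from the paper), the following snippet builds a one-dimensional marginal density from piece masses, Gaussian locations \(\alpha _a\), and knots \(\xi _a\), with each piece normalized by its constant \(Z_a\), and checks that the total mass is one and that each piece carries its prescribed mass.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

# Illustrative parameters (not from the paper): 3 pieces with masses w,
# Gaussian location parameters alpha, and knots xi partitioning the line.
w = [0.2, 0.5, 0.3]
alpha = [-1.0, 0.1, 0.8]
xi = [-math.inf, -0.7, 0.6, math.inf]

def p(x):
    # locate the piece containing x, then evaluate the truncated Gaussian
    for a in range(3):
        if xi[a] < x <= xi[a + 1]:
            Z = Phi(xi[a + 1] - alpha[a]) - Phi(xi[a] - alpha[a])
            return w[a] * phi(x - alpha[a]) / Z
    return 0.0

# Riemann check: total mass is 1 and the first piece carries mass w[0]
dx = 1e-3
grid = [-8.0 + k * dx for k in range(int(16 / dx))]
total = sum(p(x) * dx for x in grid)
mass1 = sum(p(x) * dx for x in grid if x <= -0.7)
print(round(total, 2), round(mass1, 2))
```

Each piece is a truncated Gaussian, so dividing by \(Z_a\) makes its integral over \((\xi _{a-1},\xi _a]\) equal to the prescribed mass, as the check confirms.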
The parameters \(\alpha _{ai}\) and \(\xi _{ai}\) are determined by the continuity of (13) at \(x_i=\xi _{ai}\) and the identity (16). However, instead of solving the simultaneous equations directly, we adopt an optimization approach.
Assume the density of a distribution \(\mu \) obeys the parametric form given by Equation (13). Then, the energy function \(\mathcal {E}(\mu )\) defined in Sect. 5 is a function of \(\alpha \) and \(\xi \), which is denoted by \(F(\alpha ,\xi )\) and is obtained as follows:
where
Since \(Z_{ai}\) and \(M_{ai}\) are functions of the three parameters \(\alpha _{ai}\), \(\xi _{ai}\), and \(\xi _{a-1,i}\), we denote the corresponding partial derivatives by \(D_1\), \(D_2\), and \(D_3\). The derivatives of F are
By using these formulas, we obtain the following theorem.
Theorem 6
A stationary point of F together with formula (13) provides the global minimum point of the energy functional \(\mathcal {E}(\mu )\) over the fiber. In other words, F has a unique stationary point that corresponds to the Stein-type density.
Proof
Since \(M_{ai}=\int _{\xi _{a-1,i}}^{\xi _{ai}} x_i\phi (x_i-\alpha _{ai}){\mathrm {d}x}_i/Z_{ai}\) is the expectation parameter of an exponential family \(\phi (x_i-\alpha _{ai})/Z_{ai}\), it is an increasing function of \(\alpha _{ai}\) (e.g., [23]). Therefore, \(D_1 M_{ai}>0\). Thus, the stationary condition \(\partial F/\partial \alpha _{ai}=0\) is equivalent to
which is equivalent to (16) and solves the integral equation (9) except at boundary points \(\xi _{ai}\). Furthermore, substituting this relation into (18), we obtain
Therefore, \(\partial F/\partial \xi _{ai}=0\) is equivalent to the continuity of \(p_i\) at \(\xi _{ai}\). Then, the density p is the Stein-type density, which is unique due to Theorem 3. \(\square \)
The minimization problem of \(F(\alpha ,\xi )\) over \(\alpha _{ai}\in \mathbb {R}\) and \(\xi _{1i}<\cdots <\xi _{n-1,i}\) is performed using a standard optimization package (e.g., the function optim in R [29]) when the coordinates \(\tau _{ai}=\xi _{ai}-\xi _{a-1,i}\), rather than \(\xi _{ai}\), are used for \(2\le a\le n-1\).
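The ordering constraint on the knots can be removed by exactly this gap reparametrization. A minimal sketch (the helper names are ours, not the paper's) of the bijection that a generic unconstrained optimizer would work with:

```python
import math

# Reparametrize ordered knots xi_1 < ... < xi_{n-1} by the first knot and
# the log-gaps t_a = log(xi_a - xi_{a-1}) for a >= 2, so that an optimizer
# can search over unconstrained real parameters.
def knots_from_params(params):
    xi = [params[0]]
    for t in params[1:]:
        xi.append(xi[-1] + math.exp(t))  # exp(t) > 0 enforces the ordering
    return xi

def params_from_knots(xi):
    params = [xi[0]]
    for a in range(1, len(xi)):
        params.append(math.log(xi[a] - xi[a - 1]))
    return params

xi = [-0.7, 0.1, 0.6, 2.0]
roundtrip = knots_from_params(params_from_knots(xi))
print(roundtrip)
```

The map is a bijection between ordered knot vectors and unconstrained parameter vectors, so no inequality constraints need to be passed to the optimizer.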
Example 4
We numerically obtain the Stein-type densities of discretized copulas. The result is shown in Fig. 1. The copula used here is the Clayton copula
The discretized copula density of \(n\times n\) cells is given by (12) with
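For concreteness, the cell masses of such a discretization can be computed from the copula function by rectangle increments. The following sketch (our own illustration with \(\theta =2\) and \(n=4\); the paper's exact discretization formula, which is elided above, may differ) verifies that the resulting cell masses have the uniform row sums \(1/n\) required of a copula:

```python
# Cell probabilities of a discretized Clayton copula on an n x n grid.
# The Clayton copula is C(u, v) = (u^{-theta} + v^{-theta} - 1)^{-1/theta};
# the mass of cell (a, b) is the usual rectangle increment of C.
theta = 2.0
n = 4

def C(u, v):
    if u <= 0.0 or v <= 0.0:
        return 0.0
    return (u ** (-theta) + v ** (-theta) - 1.0) ** (-1.0 / theta)

pi = [[C(a / n, b / n) - C((a - 1) / n, b / n)
       - C(a / n, (b - 1) / n) + C((a - 1) / n, (b - 1) / n)
       for b in range(1, n + 1)] for a in range(1, n + 1)]

# Since C(u, 1) = u, each row sum telescopes to exactly 1/n = 0.25.
print([round(sum(row), 6) for row in pi])
```

The corresponding piecewise uniform density on a cell is the cell mass multiplied by \(n^2\).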
Discussion
In the present paper, we showed that a class of multidimensional distributions has a unique representation via the Stein-type identity. We now describe directions for future study and some open problems.
In Sect. 5.1, we derived some properties of Stein-type distributions. The author could not find any counterexample to the following conjecture.
Conjecture 1
The marginal density function of any Stein-type distribution is positive everywhere.
A partial answer to Conjecture 1 is given in the following lemma.
Lemma 6
Let \(\mu \) be a Stein-type distribution. If the copula of \(\mu \) has pairwise marginal densities \(c_{ij}\) such that
then each marginal density \(p_i\) of \(\mu \) is positive everywhere. In particular, if the copula density of \(\mu \) is bounded, then the same consequence follows.
Proof
The density \(p_i(x_i)\) satisfies \(\partial _i p_i(x_i) + p_i(x_i) m_i(x_i)=0\) with \(m_i(x_i)=E[\sum _j X_j\mid x_i]\) by Theorem 5. The conditional expectation satisfies
where \(F(x_i)=\int _{-\infty }^{x_i}p_i(\xi ){\mathrm {d}\xi }\). Let \(D_*=\sum _{j\ne i}(D \mathrm{E}[X_j^2])^{1/2}\). Then, we obtain an inequality
Let \(a\in \mathbb {R}\) be a point at which \(p_i(a)>0\). Then, Gronwall’s lemma shows that \(p_i(x_i)\ge p_i(a) e^{-(x_i+D_*)^2/2+(a+D_*)^2/2}>0\) for \(x_i>a\), and similarly \(p_i(x_i)>0\) for \(x_i<a\). \(\square \)
If Conjecture 1 is positively solved, then the following conjecture, which is based on Theorem 3, is also positively solved according to Lemma 4 (ii).
Conjecture 2
A Stein-type transformation is unique if it exists.
We state a relevant conjecture that is the converse of Theorem 4.
Conjecture 3
A distribution is copositive if it has a Stein-type transformation.
In Sect. 5, we showed that a Stein-type distribution is characterized by the stationary point of an energy functional \(\mathcal {E}\) over a fiber \(\mathcal {F}\). From the perspective of optimal transportation, we can construct the gradient flow of the energy functional with respect to the \(L^2\)-Wasserstein space ([19, 28] and [38]). The formal equation is as follows
where \(m_i(x_i)=\mathrm{E}[\sum _j X_j \mid x_i]\). Although this appears to be an independent system of one-dimensional Fokker-Planck equations, the equations interact with each other via \(m_i(x_i)\). The physical meaning of the equation is not clear. From Theorem 5, it follows that each Stein-type density is a stationary point of (19). The time evolution will be of theoretical interest.
In Appendix E, we presented sufficient conditions for copositivity of distributions. In particular, a Gaussian distribution is copositive if its covariance matrix is nonsingular. Conversely, if a Gaussian distribution is copositive, then the covariance matrix must, by definition, be strictly copositive (see Equation (1)). The following conjecture naturally arises but is not proven. It is positively solved if Conjecture 3 is correct, due to Lemma 2.
Conjecture 4
A Gaussian distribution is copositive if the covariance matrix is strictly copositive.
As stated in Sect. E.4, tail-dependent copulas do not satisfy the sufficient condition in Theorem 7. The copositivity of tail-dependent copulas remains unclear.
In the present paper, we did not consider statistical models that explain a given data set. A statistical model involving a Stein-type distribution is essentially equivalent to a copula model because such models correspond to each other through coordinatewise transformations, whereas the marginal distributions are not of much interest in copula modelling. The class given in Example 1 provides a flexible model because the distribution of the \(U_i\)’s in the construction can be selected arbitrarily.
Finally, it is expected that there is a coordinatewise transformation to satisfy
for any monotone increasing functions f and g. If g is fixed, a discussion similar to that of the present paper is possible (see [34]).
References
Alfonsi, A., Jourdain, B.: A remark on the optimal transport between two probability measures sharing the same copula. Stat. Probab. Lett. 84, 131–134 (2014)
Amari, S., Nagaoka, H.: Methods of Information Geometry, American Mathematical Society (2000)
Ambrosio, L., Gigli, N., Savaré, G.: Gradient Flows in Metric Spaces and in the Space of Probability Measures. Birkhäuser (2005)
Borwein, J.M., Lewis, A.S., Nussbaum, R.D.: Entropy minimization, DAD problems, and doubly stochastic kernels. J. Funct. Anal. 123, 264–307 (1994)
Butucea, C., Delmas, J., Dutfoy, A., Fischer, R.: Maximum entropy copula with given diagonal section. J. Multivar. Anal. 137, 61–81 (2015)
Chen, L.H.Y., Goldstein, L., Shao, Q.: Normal Approximation by Stein’s Method, Springer (2011)
Chernozhukov, V., Galichon, A., Hallin, M., Henry, M.: Monge-Kantorovich depth, quantiles, ranks and signs. Ann. Stat. 45(1), 223–256 (2017)
De Rossi, A., Rodino, L.: Strengthened Cauchy-Schwarz inequality for biorthogonal wavelets in Sobolev spaces. J. Math. Anal. Appl. 299, 49–60 (2004)
Fallat, S., Lauritzen, S., Sadeghi, K., Uhler, C., Wermuth, N., Zwiernik, P.: Total positivity in Markov structures. Ann. Stat. 45(3), 1152–1184 (2017)
Fathi, M.: Stein kernels and moment maps. Ann. Probab. 47(4), 2172–2185 (2019)
Fortuin, C.M., Kasteleyn, P.W., Ginibre, J.: Correlation inequalities on some partially ordered sets. Comm. Math. Phys. 22, 89–103 (1971)
Gebelein, H.: Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung. Z. Angew. Math. Mech. 21(6), 364–379 (1941)
Hallin, M.: On distribution and quantile functions, ranks and signs in \({\mathbb{R}}^d\): a measure transportation approach, preprint (2017)
Hua, L.: Multivariate Extremal Dependence and Risk Measures. Ph.D. Thesis, University of British Columbia (2012)
Jaynes, E.T.: Information theory and statistical mechanics. Phys. Rev. 106(4), 620–630 (1957)
Joe, H.: Relative entropy measures of multivariate dependence. J. Am. Stat. Assoc. 84, 157–164 (1989)
Joe, H.: Dependence Modeling with Copulas. CRC Press, Boca Raton (2014)
Johnson, O., Barron, A.: Fisher information inequalities and the central limit theorem. Probab. Theory Relat. Fields 129, 391–409 (2004)
Jordan, R., Kinderlehrer, D., Otto, F.: The variational formulation of the Fokker-Planck equation. SIAM J. Math. Anal. 29(1), 1–17 (1998)
Kalantari, B.: A theorem of the alternative for multihomogeneous functions and its relationship to diagonal scaling of matrices. Linear Algebra Appl. 236, 1–24 (1996)
Karlin, S., Rinott, Y.: Classes of orderings of measures and related correlation inequalities. I. multivariate totally positive distributions. J. Multivar. Anal. 10, 467–498 (1980)
Lancaster, H.O.: Properties of the bivariate normal distribution considered in the form of a contingency table. Biometrika 44, 289–292 (1957)
Lehmann, E.L., Casella, G.: Theory of Point Estimation, Springer (1998)
Lopez-Paz, D., Hennig, P., Schölkopf, B.: The randomized dependence coefficient. Adv. Neural Inf. Process. Syst. 16, 1–9 (2013)
Marshall, A.W., Olkin, I.: Scaling of matrices to achieve specified row and column sums. Numer. Math. 12, 83–90 (1968)
McCann, R.J.: A convexity principle for interacting gases. Adv. Math. 128, 153–179 (1997)
Nelsen, R.B.: An Introduction to Copulas, 2nd edn. Springer (2006)
Otto, F.: The geometry of dissipative evolution equations: the porous medium equation. Comm. Partial Diff. Eq. 26, 101–174 (2001)
R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2010). http://www.R-project.org/
Rachev, S.T., Rüschendorf, L.: Mass Transportation Problems I: Theory. SpringerVerlag, New York (1998)
Rényi, A.: On measures of dependence. Acta Math. Acad. Sci. Hungar. 10, 441–451 (1959)
Rüschendorf, L.: Mathematical Risk Analysis. Springer, New York (2013)
Sei, T.: An objective general index for multivariate ordered data. J. Multivar. Anal. 147, 247–264 (2016)
Sei, T.: Coordinatewise transformation and Stein-type densities. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information. GSI 2017. Lecture Notes in Computer Science, vol. 10589. Springer, Cham (2017)
Stein, C.: A bound for the error in the normal approximation to the distribution of a sum of dependent random variables, Proc. Sixth Berkeley Symp. on Math. Stat. Prob. 2, 583–602 (1972)
Studený, M.: Probabilistic Conditional Independence Structures. Springer, New York (2005)
Takatsu, A.: Wasserstein geometry of Gaussian measures. Osaka J. Math. 48(4), 1005–1026 (2011)
Villani, C.: Topics in Optimal Transportation. American Mathematical Society, Providence (2003)
Zeidler, E.: Applied Functional Analysis: Main Principles and Their Applications. Applied Mathematical Sciences, vol. 109. Springer, New York (1995)
Acknowledgements
The author is grateful to the co-editor and two anonymous referees for their careful reading and insightful suggestions. He also thanks Masaaki Fukasawa and Xiao Li for their helpful comments. The study was supported by JSPS KAKENHI Grant Numbers JP26108003 and JP26540013.
Appendices
A One-dimensional optimal transportation
Necessary information about one-dimensional optimal transportation is summarized. Refer to [30] and [38] for further details.
Let \(\mathcal {P}^2(\mathbb {R})\) be the set of absolutely continuous probability distributions \(\mu \) on \(\mathbb {R}\) such that \(\int x{\mathrm {d}\mu }=0\) and \(\int x^2{\mathrm {d}\mu }<\infty \). For given \(\mu \in \mathcal {P}^2(\mathbb {R})\), let \(\mathcal {T}(\mu )\) be the set of nondecreasing functions \(T:\mathbb {R}\rightarrow \mathbb {R}\cup \{-\infty ,\infty \}\) such that \(T_\sharp \mu \in \mathcal {P}^2(\mathbb {R})\).
For given \(\mu \) and \(\nu \) in \(\mathcal {P}^2(\mathbb {R})\), there exists \(T\in \mathcal {T}(\mu )\) such that \(\nu =T_\sharp \mu \). The map is uniquely determined \(\mu \)-almost everywhere. More explicitly, T is given by \(T=G^-\circ F\), where \(F(x)=\int _{-\infty }^x {\mathrm {d}\mu }\), \(G(x)=\int _{-\infty }^x {\mathrm {d}\nu }\), and \(G^-(u)=\inf \{x\in \mathbb {R}\mid G(x)>u\}\). The map T is called the optimal transportation from \(\mu \) to \(\nu \) because it minimizes the functional \(\int (T(x)-x)^2{\mathrm {d}\mu }\) over \(\{T\mid T_\sharp \mu =\nu \}\). Since \(\mu \) and \(\nu \) are absolutely continuous, T is decomposed into an absolutely continuous part, \(T^{\mathrm{ac}}\), and a discontinuous part, \(T^{\mathrm{d}}\), without a singular continuous part. This is because \(G^-\) constructed above has the same property. The decomposition is unique up to a \(\mu \)-negligible set.
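As a concrete instance of \(T=G^-\circ F\) (our illustration, using only the Python standard library), the following sketch transports the uniform distribution on \((-1/2,1/2)\) to the standard normal and checks the first two moments of the pushforward by simulation:

```python
import random
from statistics import NormalDist

# T = G^- o F: F is the CDF of mu = Uniform(-1/2, 1/2) and G^- is the
# standard normal quantile function, so T_sharp(mu) = N(0, 1).
nd = NormalDist()  # standard normal

def F(x):
    """CDF of the uniform distribution on (-1/2, 1/2)."""
    return min(max(x + 0.5, 0.0), 1.0)

def T(x):
    return nd.inv_cdf(F(x))

# Pushing uniform samples through T yields standard normal samples
# (this is exactly inverse-transform sampling).
random.seed(0)
ys = [T(random.uniform(-0.5, 0.5)) for _ in range(100000)]
mean = sum(ys) / len(ys)
var = sum(y * y for y in ys) / len(ys)
print(round(mean, 2), round(var, 2))
```

The map is nondecreasing, and the simulated mean and variance match those of \(N(0,1)\) up to Monte Carlo error.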
The following lemmas are used in Sect. 5 and Sect. E. These lemmas were originally proven for multidimensional measures but here we simplify them for the onedimensional case.
Lemma 7
(Theorem 4.4 of [26]) For given \(\mu \) and \(\nu \) in \(\mathcal {P}^2(\mathbb {R})\), let T be the unique monotone map such that \(\nu =T_\sharp \mu \). Let p and q be density functions of \(\mu \) and \(\nu \), respectively. Let \(X\subset \mathbb {R}\) denote the set of points where the derivative \(T'\) is defined and positive. Then, \(\mu (X)=1\). Furthermore,
for any measurable function A on \([0,\infty )\) with \(A(0)=0\).
Lemma 8
(Proposition 1.3 of [26]) Let \(\mu \in \mathcal {P}^2\) and \(T\in \mathcal {T}(\mu )\). Then, \((1-t)\mathrm{Id}+tT\in \mathcal {T}(\mu )\) for each \(t\in [0,1]\).
Lemma 9
(Proposition 4.2 of [26]) Let \(\mu \in \mathcal {P}^2\). If \(T:\mathbb {R}\rightarrow \mathbb {R}\) is a nondecreasing function written as \(T=T^{\mathrm{ac}}+T^{\mathrm{d}}\) and the derivative \((T^{\mathrm{ac}})'\) of the absolutely continuous part is strictly positive \(\mu \)-almost everywhere, then \(T_\sharp \mu \) is absolutely continuous.
B Explicit expression of Stein-type distributions
We formally derive an explicit expression of the Stein-type distributions.
Assume that \(\mu \in \mathcal {P}^2\) has a smooth density function p with decay at infinity. Then, \(\mu \) is Stein-type if and only if there exists a function r(x) such that
where \({\mathrm {d}x}_{-i}\) means \({\mathrm {d}x}_1\cdots {\mathrm {d}x}_{i-1}{\mathrm {d}x}_{i+1}\cdots {\mathrm {d}x}_d\). In fact, formula (20) is rewritten as \(\partial p_i/\partial x_i + p_i(x_i)m_i(x_i)=0\), where \(m_i(x_i)\) is the conditional expectation of \(\sum _j x_j\) given \(x_i\), and this equation is equivalent to (9).
Equation (20) is explicitly solved if r(x) is given. Let Q be a fixed orthogonal matrix such that \((Q^\top x)_1=\sum _j x_j/\sqrt{d}\), where \(Q^\top \) denotes the matrix transpose of Q. Then, (20) is written as
The general solution is
where q is any probability density function on \(\mathbb {R}^{d1}\).
In particular, if \(r(x)=0\), we obtain a simple formula
Example 1 in Sect. 4 is this solution. The class of densities (21) is characterized by a stronger condition than the Stein-type identity, i.e.,
for any function \(f(x)=f(x_1,\ldots ,x_d)\).
C Properties of Stein-type distributions
We provide some properties of Stein-type distributions that are not used in the main body of the paper.
We first point out that Stein-type distributions have finite Fisher information. The Fisher information of a density function q on \(\mathbb {R}\) is defined by
where q is assumed to be absolutely continuous, and \(q'(x)/q(x)\) is set to 0 if q is not differentiable or not positive at x. See [18] for properties implied by finite Fisher information. Note that the Fisher information defined here is that of the location family \(\{q(x-\theta )\mid \theta \in \mathbb {R}\}\) in statistics (e.g., [23]).
Lemma 10
For any Stein-type distribution \(\mu \), the Fisher information \(I(p_i)\) of each marginal density \(p_i\) is bounded by the dimension d. In particular, \(p_i\) has bounded variation.
Proof
From (9), the score function \(p_i'(x_i)/p_i(x_i)\) is equal to \(-m_i(x_i)\). Since \(m_i(x_i)\) is the conditional expectation of \(\sum _j x_j\) given \(x_i\), we obtain
where the last equality follows from the Stein-type identity with \(f(x_i)=x_i\). By the Cauchy-Schwarz inequality, we also have \(\int |p_i'(x_i)|{\mathrm {d}x}_i \le \sqrt{I(p_i)}\). Then, \(p_i\) has bounded variation. \(\square \)
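As a sanity check of Lemma 10 (our illustration, not from the paper): the i.i.d. standard normal distribution on \(\mathbb {R}^d\) satisfies the Stein-type identity, its marginal score is \(-x\), and hence \(I(p_i)=\mathrm{E}[X^2]=1\le d\), which a short simulation confirms.

```python
import random

# Monte Carlo estimate of the Fisher information of the standard normal
# marginal: the score is p'(x)/p(x) = -x, hence I = E[X^2] = 1 <= d.
random.seed(4)
N, d = 100000, 3
I = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(N)) / N
print(round(I, 2))
```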
Let \(\mathcal {S}\) be the set of Stein-type distributions on \(\mathbb {R}^d\). We prove that \(\mathcal {S}\) is closed under mixture, normalized convolution, and weak limit.
Lemma 11
(Mixture) If \(\mu \) and \(\nu \) are two distributions in \(\mathcal {S}\), then \((1-t)\mu +t\nu \) belongs to \(\mathcal {S}\) for any \(t\in [0,1]\).
Proof
This follows from the linearity of the Stein-type identity (5) with respect to \(\mu \). \(\square \)
Lemma 12
(Normalized convolution) Let \(X=(X_1,\ldots ,X_d)\) and \(Y=(Y_1,\ldots ,Y_d)\) be independent random vectors with Stein-type distributions. Let a and b be real numbers with \(a^2+b^2=1\). Then, \(aX+bY\) has a Stein-type distribution.
Proof
The Stein-type identity with respect to X implies that
for each i, because X and Y are independent. By changing the roles of X and Y, we have
Their average is
Thus, the Stein-type identity for \(aX+bY\) holds if and only if \(a^2+b^2=1\). \(\square \)
The set \(\mathcal {S}\) is also closed under weak limit in the following sense. Denote the Euclidean norm on \(\mathbb {R}^d\) by \(\Vert x\Vert \) for \(x\in \mathbb {R}^d\).
Lemma 13
(Weak convergence) Let \(\mu ^{(n)}\) be a sequence in \(\mathcal {S}\). If \(\mu ^{(n)}\) converges to \(\mu \) in law and \(\int \Vert x\Vert ^2{\mathrm {d}\mu }^{(n)}\) converges to \(\int \Vert x\Vert ^2 {\mathrm {d}\mu }<\infty \), then \(\mu \) belongs to \(\mathcal {S}\).
Proof
These conditions imply that \(\int \varphi {\mathrm {d}\mu }^{(n)} \rightarrow \int \varphi {\mathrm {d}\mu }\) for any continuous function \(\varphi \) such that \(|\varphi (x)|\le C(1+\Vert x\Vert ^2)\) for some \(C>0\) (refer to Theorem 7.12 of [38]). Letting \(\varphi (x)\) be \(f(x_i)\sum _jx_j\) and \(f'(x_i)\), respectively, we obtain the Stein-type identity for \(\mu \). Absolute continuity of \(\mu _i\) is shown in the same manner as in the proof of Theorem 5. \(\square \)
The condition regarding moment convergence in Lemma 13 is necessary. Indeed, we can construct a sequence \((W,U^{(n)})\) of Stein-type random variables in the same manner as in Example 1 of Sect. 4 such that \(U^{(n)}\) converges in law to a random variable U with \(\mathrm{E}[U^2]=\infty \).
By Lemma 12 and Lemma 13 together with the central limit theorem, if we have independent and identically distributed samples \(X^1,\ldots ,X^n\) from a Stein-type distribution \(\mu \), then the limit distribution of \((X^1+\cdots +X^n)/\sqrt{n}\) is a Stein-type normal distribution, which is characterized by Lemma 1.
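The identity used in the proof of Lemma 13, \(\int f(x_i)\sum _j x_j\,{\mathrm {d}\mu }=\int f'(x_i)\,{\mathrm {d}\mu }\), can be checked by simulation for the i.i.d. standard normal distribution, which is Stein-type (cross terms vanish by independence and the diagonal term is Stein's lemma). A sketch; the test function tanh is our arbitrary choice:

```python
import math
import random

# Monte Carlo check of E[f(X_1) * (X_1 + ... + X_d)] = E[f'(X_1)]
# for X ~ N(0, I_d).
random.seed(1)
d, N = 3, 200000
f = math.tanh
fprime = lambda x: 1.0 / math.cosh(x) ** 2  # derivative of tanh

lhs = rhs = 0.0
for _ in range(N):
    x = [random.gauss(0.0, 1.0) for _ in range(d)]
    lhs += f(x[0]) * sum(x)
    rhs += fprime(x[0])
print(round(lhs / N, 2), round(rhs / N, 2))
```

The two Monte Carlo averages agree up to sampling error, as the identity predicts.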
Note that the set of copulas satisfies the analogues of Lemma 11 and Lemma 13. If we modify the definition of copulas so that the marginal distributions are standard normal, then the analogue of Lemma 12 also follows.
D A property of two-dimensional copulas
We prove that the simple sum \(g(x)=x_1+x_2\) satisfies the inequality (6) for any two-dimensional continuous copula density function \(c(x_1,x_2)\).
Lemma 14
Let \(c(x_1,x_2)\) be a two-dimensional continuous copula density. Then
holds for any \(a\in [0,1]\).
Proof
Let \(h_a(x_2)=\int _a^1 c(x_1,x_2)\mathrm{d}x_1\) and \(H_a(x_2)=\int _0^{x_2} h_a(\eta )\mathrm{d}\eta \). Then we have \(h_a(x_2)\le h_0(x_2)=1\) and \(H_a(1)=1-a\) by the definition of a copula. We also obtain \(H_a(x_2)\le \min (x_2,1-a)\). The integration-by-parts formula yields
This proves the lemma. \(\square \)
From the lemma, we have
Since \(\int _0^1\int _0^1 (x_1+x_2)c(x_1,x_2)\mathrm{d}x_1\mathrm{d}x_2=1\), we obtain \(\int _0^1(\int _0^a(x_1+x_2)c(x_1,x_2)\mathrm{d}x_1)\mathrm{d}x_2\le a\). Finally, we deduce that
E Sufficient conditions for copositivity
We present sufficient conditions for copositivity (8) of a given distribution \(\mu \). In Sect. E.1, we first extend the setting to measures with a nonzero mean, as well as to coordinatewise transformations that are constant over an interval. We then present a lower bound for the quantity \(\beta (\mu )\) in (8). Subsequent subsections are devoted to finding sufficient conditions for copositivity. Refer to [34] for other sufficient conditions based on positive supermodular dependence.
E.1 Extension of the definition and a lower bound
Let \(\mathcal {P}_*^2\) be the set of measures on \(\mathbb {R}^d\) such that each marginal \(\mu _i\) is absolutely continuous and \(\int x_i^2 {\mathrm {d}\mu }_i<\infty \), without assuming \(\int x_i{\mathrm {d}\mu }_i=0\). The set \(\mathcal {T}_{\mathrm{cw*}}(\mu )\) for \(\mu \in \mathcal {P}_*^2\) is defined as the set of coordinatewise nondecreasing maps \(T:\mathbb {R}^d\rightarrow \mathbb {R}^d\) such that \(\int T_i{\mathrm {d}\mu }_i=0\) and \(\int T_i^2{\mathrm {d}\mu }_i<\infty \) for each i.
The following lemma is useful to consider copositivity. Denote the inner product and norm of \(L^2(\mu )\) by \(\langle f,g\rangle =\int f(x)g(x){\mathrm {d}\mu }\) and \(\Vert f\Vert =\langle f,f\rangle ^{1/2}\), respectively.
Lemma 15
If \(\mu \in \mathcal {P}^2\), then
Proof
Denote the right-hand side of (22) by \(\delta (\mu )\). Then, it is obvious that \(\beta (\mu )\ge \delta (\mu )\) since \(\mathcal {T}_{\mathrm{cw}}(\mu )\subset \mathcal {T}_{\mathrm{cw*}}(\mu )\). In order to prove the converse inequality, choose \(0\ne T\in \mathcal {T}_{\mathrm{cw*}}(\mu )\) such that \(\Vert \sum _i T_i\Vert ^2/(\sum _i \Vert T_i\Vert ^2)\le \delta (\mu )+\varepsilon \) for a given \(\varepsilon >0\). It follows from Lemma 9 in Appendix A that the map \(T^\eta \) defined by \(T^\eta (x)=T(x)+\eta x\) belongs to \(\mathcal {T}_{\mathrm{cw}}(\mu )\) for each \(\eta >0\). Then, we have
implying \(\beta (\mu )\le \delta (\mu )+\varepsilon \). \(\square \)
We extend the definition of \(\beta (\mu )\) for any \(\mu \in \mathcal {P}_*^2\) by (22). In this section, \(\mu \) is a measure in \(\mathcal {P}_*^2\) unless otherwise stated.
Let \(L_0^2(\mu _i)\) be the set of functions \(T_i:\mathbb {R}\rightarrow \mathbb {R}\) such that \(\int T_i{\mathrm {d}\mu }_i=0\) and \(\int T_i^2{\mathrm {d}\mu }_i<\infty \). The set \(\mathcal {T}_{\mathrm{cw*}}(\mu )\) is a subset of \(\prod _{i=1}^d L_0^2(\mu _i)\). The space \(L_0^2(\mu _i)\) is considered to be a subspace of \(L^2(\mu )\). More precisely, \(T_i\in L_0^2(\mu _i)\) is identified with the function \(x\mapsto T_i(x_i)\) in \(L^2(\mu )\).
By relaxing the set \(\mathcal {T}_{\mathrm{cw*}}(\mu )\) in (22), we obtain a lower bound of \(\beta (\mu )\) as
Therefore, \(\mu \) is copositive if \(\beta _{\mathrm{L}}(\mu )>0\).
It is shown that \(\beta (\mu )\) and \(\beta _{\mathrm{L}}(\mu )\) are invariant under coordinatewise transformations. Thus, \(\beta (\mu )\) and \(\beta _{\mathrm{L}}(\mu )\) depend only on the copula of \(\mu \). Furthermore, they depend only on the set of twodimensional marginal copulas of \(\mu \).
If \(d=2\), then the quantity \(\beta _{\mathrm{L}}(\mu )\) is related to the Hirschfeld-Gebelein-Rényi maximal correlation coefficient (refer to [12, 31] and [24])
Lemma 16
Let \(d=2\). Then, \(\beta _{\mathrm{L}}(\mu )=1-\gamma (\mu )\). In particular, \(\mu \) is copositive if \(\gamma (\mu )<1\).
Proof
Let \(\gamma =\gamma (\mu )\). For any \(T_1\in L_0^2(\mu _1)\) and \(T_2\in L_0^2(\mu _2)\), we have
Thus, we have \(\beta _{\mathrm{L}}(\mu )\ge 1-\gamma \). In order to prove the converse inequality, take sequences \(T_{1n}\) and \(T_{2n}\) satisfying \(\Vert T_{1n}\Vert =\Vert T_{2n}\Vert =1\) and \(\lim _{n\rightarrow \infty }\langle T_{1n},T_{2n}\rangle =-\gamma \). Then,
\(\square \)
In the literature, two subspaces \(H_1\) and \(H_2\) of a Hilbert space with the property
are said to satisfy the strengthened Cauchy-Schwarz inequality [8]. In our setting, \(\mu \) is copositive if \(L_0^2(\mu _1)\) and \(L_0^2(\mu _2)\) satisfy the strengthened Cauchy-Schwarz inequality.
E.2 Gaussian case
We obtain an explicit expression of \(\beta _{\mathrm{L}}(\mu )\) if \(\mu \) is a multivariate normal distribution.
Lemma 17
Let \(\mu \) be the multivariate normal distribution with mean vector 0 and covariance matrix S. Then, \(\beta _{\mathrm{L}}(\mu )\) is the minimum eigenvalue of the correlation matrix of S. In particular, \(\mu \) is copositive if S is nonsingular.
Proof
The case of \(d=2\) has been proven by [22].
Assume that all the marginal densities of \(\mu \) are the standard normal \(\phi (x)=(2\pi )^{-1/2}e^{-x^2/2}\) without loss of generality. Then, the covariance matrix coincides with the correlation matrix \(R=(\rho _{ij})\). We prove that \(\beta _{\mathrm{L}}(\mu )=\lambda _\mathrm{min}(R)\), where the minimum eigenvalue of a positive definite matrix A is denoted by \(\lambda _{\mathrm{min}}(A)\). Note that \(\lambda _\mathrm{min}(R)\le 1\) because \(\mathrm {tr}(R)=d\).
Denote the Hermite polynomial of order k by \(\eta _k(x)=(-1)^k \phi (x)^{-1}({\mathrm {d}}^k/{\mathrm {d}x}^k)\phi (x)\) for \(x\in \mathbb {R}\). Any function \(T_i\in L_0^2(\mu _i)\) is expanded as
Since \(\int \eta _k(x_i)\eta _l(x_j){\mathrm {d}\mu }= \delta _{kl}(k!\rho _{ij}^k)\), we obtain
and
For any \(k\ge 1\), we can show that
Indeed, set \(A_{ij}=\rho _{ij}\) and \(B_{ij}=c_{ik}c_{jk}\rho _{ij}^{k-1}\) in the inequality \(\mathrm {tr}(AB)\ge \lambda _{\mathrm{min}}(A)\mathrm {tr}(B)\), valid for any positive definite matrices A and B. Thus, we have
which implies \(\beta _{\mathrm{L}}(\mu )\ge \lambda _{\mathrm{min}}(R)\).
Conversely, let \((v_1,\ldots ,v_d)\) be the eigenvector corresponding to \(\lambda _{\mathrm{min}}(R)\) and \(T_i(x_i)=v_ix_i\). Then, we have \(\int (\sum _i T_i)^2 {\mathrm {d}\mu }= \lambda _{\mathrm{min}}(R)\sum _i \int T_i^2 {\mathrm {d}\mu }_i\). Thus, \(\beta _{\mathrm{L}}(\mu )\le \lambda _{\mathrm{min}}(R)\). \(\square \)
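The attaining linear map in the last step of the proof can be checked by simulation in the bivariate case, where the minimal eigenvector of \(R\) is \((1,-1)/\sqrt{2}\) for \(\rho >0\) (our illustration with \(\rho =0.6\)):

```python
import math
import random

# For a bivariate Gaussian with correlation rho, the coordinatewise linear
# map T_i(x_i) = v_i * x_i built from the minimal eigenvector v = (1,-1)/sqrt(2)
# of the correlation matrix attains beta_L = lambda_min = 1 - rho.
random.seed(2)
rho, N = 0.6, 200000
v = (1.0 / math.sqrt(2.0), -1.0 / math.sqrt(2.0))

num = den = 0.0
for _ in range(N):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x1, x2 = z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2  # corr(x1,x2)=rho
    t1, t2 = v[0] * x1, v[1] * x2
    num += (t1 + t2) ** 2        # estimates E[(sum_i T_i)^2]
    den += t1 * t1 + t2 * t2     # estimates sum_i E[T_i^2]
print(round(num / den, 2))  # close to 1 - rho = 0.4
```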
We conjecture that \(\beta (\mu )\) coincides with (1) if \(\mu \) is Gaussian and S is its covariance matrix. See Sect. 7.
E.3 Rényi’s condition of positive copula densities
The following theorem, which has been proven by [31] for \(d=2\), provides a checkable condition for copositivity.
Theorem 7
([31] for \(d=2\)) Assume that \(\mu \) has a regular support (see Sect. 4 for the definition) and that for each pair \(i\ne j\), the two-dimensional marginal copula density function \(c_{ij}\) of \(\mu \) is square integrable. Then, \(\beta _\mathrm{L}(\mu )>0\). In particular, \(\mu \) is copositive.
Proof
We first prove that if \(T\in \prod _{i=1}^d L_0^2(\mu _i)\) satisfies the equation \(\sum _i T_i=0\), then \(T=0\). Assume \(\sum _i T_i=0\). Let \(I\subset \{1,\ldots ,d\}\) be the set of indices i such that \(\mu (T_i\ne 0)>0\), and suppose, for contradiction, that I is not empty. Let \(A_i=\{x_i\mid T_i>0\}\) for \(i\in I\). Since \(\int T_i{\mathrm {d}\mu }_i=0\), we have \(\mu _i(A_i)>0\). However, based on the assumption about the support, we obtain \(\mu (\cap _{i\in I}A_i) > 0\), which implies that \(\mu (\sum _i T_i> 0)>0\), a contradiction. Thus, I is empty, and \(T=0\).
Now, we prove that \(\beta _{\mathrm{L}}(\mu )>0\) using elementary concepts of functional analysis (refer to [39]). Assume that each \(\mu _i\) is uniform over [0, 1], i.e., \(\mu \) is a copula distribution. Let \(H=\prod _{i=1}^d L_0^2(\mu _i)\) be a Hilbert space of \(\mathbb {R}^d\)-valued functions and define the inner product of H as \(\langle T,U\rangle _H = \sum _i \int T_iU_i {\mathrm {d}x}_i\). Let \(c_{ij}\) be the pairwise copula density and define an operator \(C:H\rightarrow H\) by
Based on the assumption that \(\iint c_{ij}^2 {\mathrm {d}x}_i {\mathrm {d}x}_j<\infty \), we deduce that C is a Hilbert-Schmidt operator. It is easy to see that C is self-adjoint. Now, we can write
where I is the identity operator. If \((I+C)T=0\), then (23) implies \(\sum _i T_i=0\) and, therefore, \(T=0\). Thus, \(I+C\) is injective. Since the operator \(I+C\) is an injective Fredholm operator, it is surjective. By the continuous inverse theorem, we deduce that the inverse operator \((I+C)^{-1}\) is bounded. Therefore, we have
which means \(\beta _{\mathrm{L}}(\mu )\ge \Vert (I+C)^{-1}\Vert ^{-1}>0\). \(\square \)
Corollary 3
If \(\mu \) has a positive and bounded copula density function, then \(\mu \) is copositive.
In Sect. 6, we deal with positive and piecewise uniform copula density functions.
Note that the support of \(\mu \) is not determined by the supports of the two-dimensional marginal distributions, as the following example shows. Refer to [32] for related topics.
Example 5
Let \(\mu \in \mathcal {P}^2(\mathbb {R}^4)\) be the uniform measure supported on the region
where \((+,+,-,-)\) denotes the set \([0,1]\times [0,1]\times [-1,0]\times [-1,0]\), and so on. Then \(\mu \) is not copositive although each two-dimensional marginal distribution is supported on \([-1,1]^2\). In order to demonstrate this point, let \(T_i(x_i)=\mathrm{sign}(x_i)\) for each i. Then \(\int T_i{\mathrm {d}\mu }_i=0\) and \(\int T_i^2{\mathrm {d}\mu }_i>0\) but \(\int (\sum _i T_i)^2 {\mathrm {d}\mu }= 0\). Hence, \(\beta (\mu )=0\).
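A short simulation of Example 5 (hedged: the text names only the block \((+,+,-,-)\), and the full list of blocks below is our assumption — we take a union of sign patterns, each with exactly two positive and two negative coordinates, so that every marginal is uniform on \([-1,1]\)):

```python
import random

# On such a support, T_i(x_i) = sign(x_i) sums to zero at every point,
# so the energy of sum_i T_i vanishes while each T_i has positive norm.
random.seed(3)
patterns = [(+1, +1, -1, -1), (-1, -1, +1, +1),
            (+1, -1, -1, +1), (-1, +1, +1, -1)]  # assumed block list

def sample():
    s = random.choice(patterns)
    return [si * random.random() for si in s]  # uniform in the chosen block

vals = []
for _ in range(10000):
    x = sample()
    T = [1 if xi > 0 else -1 for xi in x]      # T_i(x_i) = sign(x_i)
    vals.append(sum(T))
print(max(abs(v) for v in vals))  # sum_i T_i vanishes on the support
```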
E.4 Tail dependence
Many useful copulas in applications exhibit tail dependence (e.g. [14, 17, 27]). The following lemma shows that, unfortunately, Theorem 7 is not helpful for this class of copulas.
Lemma 18
Let \(d=2\) and assume that the copula density \(c(u_1,u_2)\) has lower-tail dependence
Then, \(\iint c(u_1,u_2)^2{\mathrm {d}u}_1{\mathrm {d}u}_2=\infty \). Similar results hold for other types of tail dependence.
Proof
The Cauchy–Schwarz inequality implies that
If c were square-integrable, then the left-hand side would converge to 0 as \(\delta \rightarrow 0\), which is impossible. Thus, c is not square-integrable. \(\square \)
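For completeness, the inequality used in the proof can be sketched as follows, assuming the standard definition of the lower-tail dependence coefficient \(\lambda =\lim _{\delta \rightarrow 0+}C(\delta ,\delta )/\delta >0\), where C denotes the copula:

```latex
\frac{C(\delta,\delta)}{\delta}
  = \frac{1}{\delta}\int_0^{\delta}\!\!\int_0^{\delta} c(u_1,u_2)\,\mathrm{d}u_1\,\mathrm{d}u_2
  \le \frac{1}{\delta}\cdot\delta
      \left(\int_0^{\delta}\!\!\int_0^{\delta} c(u_1,u_2)^2\,\mathrm{d}u_1\,\mathrm{d}u_2\right)^{1/2}.
```

Square-integrability of c would force the right-hand side to 0 as \(\delta \rightarrow 0\), contradicting \(\lambda >0\).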
We conjecture that many copulas with tail dependence are copositive. On the other hand, there is a noncopositive measure with positive copula density, as follows.
Example 6
(Tail counter-comonotonic copula) It is known that there is a positive copula density function with the property
which is equivalent to \(\lambda =1\) in Lemma 18. Such a copula is referred to as a lower tail comonotonic copula (see Sect. 2.21 of [17]). Let \(\mu \) be the measure induced by \(Y_1=X_1\) and \(Y_2=1-X_2\). Then, \(\mu \) is not copositive. Indeed, define a map \(T\in \mathcal {T}_\mathrm{cw*}(\mu )\) by
where \(I_A\) denotes the indicator function of a set A. Then, \(\Vert T_1\Vert =\Vert T_2\Vert =\sqrt{\delta (1-\delta )}\) and
Therefore,
In a similar manner to Lemma 16, we deduce that \(\beta (\mu )=0\).
Sei, T. Coordinatewise transformation of probability distributions to achieve a Stein-type identity. Info. Geo. 5, 325–354 (2022). https://doi.org/10.1007/s41884-021-00051-9
Keywords
 Copositive distribution
 Copula
 Energy minimization
 Optimal transportation
 Stein-type distribution
 Wasserstein space
Mathematics Subject Classification
 60E05
 62E10