1 Introduction

In this paper, we study properties of the Block Rearrangement Algorithm (BRA), which generalizes the standard Rearrangement Algorithm (RA) introduced by Puccetti and Rüschendorf (2012) and further developed in Embrechts et al. (2013). The RA has applications in various disciplines. Embrechts et al. (2013) use the RA in quantitative risk management to obtain sharp approximations for the maximum possible Value-at-Risk (VaR) of a portfolio sum of d dependent risks when all marginal distributions are known but no information on dependence is available; see Aas and Puccetti (2014) for a case study. The RA can also be used in situations under which partial information on dependence is available (see Bernard et al. 2017a, b; Bernard and Vanduffel 2015). The algorithm has also important applications in operations research. Consider the production of some goods requiring d different specialized operations and assume that there are n identical assembly lines. For each operation j there are n workers available; the i-th worker requires an amount of time equal to \(x_{ij}\) to execute the task. The problem of optimal Assembly Line Crew Scheduling (ALCS) is then to (re)allocate the nd workers so that all assembly lines are operating and the maximum production time is minimized (see e.g., Coffman and Yannakakis 1984). The RA attempts to achieve this situation.

We study properties of the RA in the context of the problem of inferring dependence among risks \(X_{j}\) \((j=1,2,\ldots ,d)\) under the sole knowledge of their marginal distributions as well as of the distribution of their sum \(S=\sum _{i=1}^{d}X_{i}.\) Besides some theoretical appeal, the solution to this problem is also of practical relevance in finance. Using the price information of options (written) on individual stocks as well as on the index that comprises these stocks, it is indeed possible to infer the risk-neutral distribution \(F_{j}\) of future stock returns \(X_{j}\) \((j=1,2,\ldots ,d)\) as well as the distribution \(F_{S}\) of the future index return \( S:=X_{1}+X_{2}+\cdots +X_{d}\) (Ross 1976; Breeden and Litzenberger 1978; Banz and Miller 1978; Aït-Sahalia and Lo 2000; Bondarenko 2003). The joint distribution is not observable directly. We develop an algorithm to infer dependence among variables with given marginal distributions \(F_{j}\) and for which the distribution of the sum \(F_{S}\) is known. In other words, our algorithm makes it possible to infer in a model-free and forward-looking way the dependence structure of stock returns consistent with the current market prices of options. There are various potential applications. For instance, the inferred dependence can be used to price multivariate derivatives non-parametrically. It is then straightforward to price arbitrarily weighted basket options, exchange options or spread options such that their prices are consistent with existing options written on individual assets and on a given index. Another possible application is to develop an indicator of herd behavior among stock returns using information on option prices only. Such an index can then be used as a signal of systemic risk.

The method that we propose to infer dependence is solely based on the Rearrangement Algorithm (RA) and its recent generalization, the Block Rearrangement Algorithm (BRA) (Bernard et al. 2017a; Bernard and McLeish 2016; Boudt et al. 2017). In general, there might be several compatible dependence structures, and one could reasonably expect that, by randomizing its initialization, the BRA allows one to infer all possible dependence structures. However, this is not the case. When all distributions are Gaussian, we find that among all possible candidate dependence structures the BRA yields a solution with (almost) maximum entropy. That is, the inferred dependence among the variables \(X_{j}\) \((j=1,2,\ldots ,d)\) is approximately Gaussian and its correlation matrix has maximum determinant.

This feature is interesting for the following two reasons. First, the fact that the BRA yields solutions that do not really depend on its initialization ensures that the BRA can be used as a stable algorithm for inferring dependence among stock prices. Second, the stable solution is economically meaningful, as maximum entropy (verified statistically when all distributions are Gaussian) corresponds to the intuition that in well-functioning markets maximum uncertainty must prevail. A dependence with lower entropy would mean that we use information that we do not possess. This insight is also explored in Rubinstein (1994), Jackwerth and Rubinstein (1996), and Stutzer (1996) in the context of the inference of marginal risk-neutral probabilities from observed option prices. In this regard, we note that the principle of maximum entropy appears in many other disciplines. For example, the second law of thermodynamics asserts that the entropy of irreversible processes can never decrease. Another example pertains to the field of statistics, where the principle of maximum entropy is used to obtain prior probability distributions for Bayesian inference; see Jaynes (1957).

The paper is organized as follows. In Sect. 2 we recall the standard Rearrangement Algorithm and the Block Rearrangement Algorithm and present their use to infer dependence among random variables. In Sect. 3, we discuss the principle of maximum entropy. In Sect. 5 we show that for Gaussian margins the algorithm identifies the dependence with maximum entropy, using the methodology and notation presented in Sect. 4. In Sect. 6 we perform a robustness analysis and assess the extent to which our main result remains valid in a non-Gaussian framework or when the standard RA is used. We provide evidence that when the distributions are skewed the inferred dependence is still unique and exhibits a correlation matrix with maximum determinant. Section 7 concludes.

2 Rearrangement algorithm to infer dependence

We first present the standard RA of Puccetti and Rüschendorf (2012) and Embrechts et al. (2013) and show how to modify it to infer the dependence among variables for which the marginal distributions and the distribution of the sum are known. We include the presentation of the algorithms for completeness.

2.1 Rearrangement algorithm

The RA was originally proposed in the context of assessing the impact of model uncertainty on portfolio models and is considered a breakthrough in this regard. Technically, it amounts to finding a dependence among risks \(X_{j}\) with known distributions \(F_{j}\) \((j=1,2,\ldots ,d)\) such that the variance of their sum \(S=X_{1}+X_{2}+\cdots +X_{d}\) becomes as low as possible. As the algorithm requires random variables that are discretely distributed, we further assume that each risk \(X_{j}\) \((j=1,2,\ldots ,d)\) is sampled into n equiprobable values, i.e., we consider the realizations \(x_{ij}:=F_{j}^{-1}(\frac{i-0.5}{n})\) \((i=1,\ldots ,n;\ j=1,2,\ldots ,d)\). We obtain an \(n\times d\) matrix \(\mathbf {X}\):

$$\begin{aligned} {\mathbf {X}}=[X_{1},\ldots ,X_{d}]=\left[ \begin{array}{cccc} x_{11} & x_{12} & \ldots & x_{1d} \\ x_{21} & x_{22} & \ldots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \ldots & x_{nd} \end{array} \right] . \end{aligned}$$

The matrix can readily be seen as the representation of a d-dimensional random vector; each column corresponds to a random variable and each row is a possible joint outcome that occurs with probability \(1/n\). This interpretation allows us to also apply probabilistic operators (such as the mean, the variance and the correlation) to the columns of such matrices.
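For concreteness, the following minimal Python sketch (an illustration under hypothetical Gaussian marginals, not code from the paper) builds such a matrix from given quantile functions.

```python
import numpy as np
from scipy.stats import norm

def sample_matrix(quantile_fns, n):
    """Discretize each marginal F_j into n equiprobable values
    x_ij = F_j^{-1}((i - 0.5)/n) and stack them as columns."""
    u = (np.arange(1, n + 1) - 0.5) / n                # equiprobable grid
    return np.column_stack([q(u) for q in quantile_fns])

# Hypothetical example: d = 3 Gaussian marginals with different scales
d, n = 3, 1000
sigma = [1.0, 2/3, 1/3]
X = sample_matrix([lambda u, s=s: norm.ppf(u, scale=s) for s in sigma], n)
# each column of X is one marginal; each row is a joint outcome with probability 1/n
```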

We consider the problem of rearranging the elements \(x_{ij}\) of the matrix \(\mathbf {X}\) such that after rearrangement the variance of S is as small as possible. We allow for rearrangements within columns, as doing so affects the dependence among the \(X_{j}\), \(j=1,2,\ldots ,d\), but not their marginal distributions. By contrast, swapping values among columns would affect the marginal distributions and is not allowed. Clearly, in order for S to have the smallest possible variance, it must hold that for all \(\ell =1,2,\ldots ,d\), \(X_{\ell }\) is anti-monotonic with \(\sum _{k=1,k\ne \ell }^{d}X_{k}\) (Puccetti and Rüschendorf 2012, Theorem 2.1); this observation lies at the core of their rearrangement method.

2.1.1 Standard rearrangement algorithm (RA)

  1. For \(j=1,2,\ldots ,d\), make the jth column anti-monotonic with the sum of the other columns, i.e., rearrange its values such that they are oppositely ordered to the row sums across the other columns.

  2. If there is no improvement in \(\text {var}\left( \sum _{k=1}^{d}X_{k}\right) \), output the current matrix \(\mathbf {X}\); otherwise, return to step 1.

At each step of this algorithm, we ensure that the jth column is anti-monotonic with the sum of the remaining columns, so that the columns before rearranging, say \(X_{j}\), and after rearranging, say \(\widetilde{X}_{j}\), satisfy

$$\begin{aligned} \text {var}\left( \sum _{k=1}^{d}X_{k}\right) \geqslant \text {var}\left( \sum _{k=1}^{d}\widetilde{X}_{k}\right) . \end{aligned}$$

Hence, at each step of the algorithm the variance decreases. If the lower bound of zero is attained, then we have found the situation of perfect mixability among the variables, i.e., a dependence that makes the sum constant, and a global optimum is obtained. Otherwise, the algorithm converges to a local minimum. Indeed, there is no guarantee that this local minimum yields the minimum variance of the sum, as another initialization of \({\mathbf {X}}\) (i.e., a random permutation of the elements in each column) is likely to lead to another local optimum. In fact, finding the dependence structure that minimizes the variance of the sum is an NP-complete problem (Haus 2015), which means that, most likely, there exists no deterministic algorithm with polynomial complexity. Thus, the above RA can only be a heuristic.
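A minimal sketch of the standard RA described above, assuming the matrix \(\mathbf {X}\) is given as a NumPy array (e.g., constructed as in the sketch of Sect. 2.1); the stopping rule and tie-breaking are simplifications of ours.

```python
import numpy as np

def rearrangement_algorithm(X, max_sweeps=1000):
    """Standard RA: repeatedly make each column anti-monotonic with the sum
    of the other columns until the variance of the row sums stops improving."""
    X = X.copy()
    n, d = X.shape
    best_var = np.var(X.sum(axis=1))
    for _ in range(max_sweeps):
        for j in range(d):
            rest = X.sum(axis=1) - X[:, j]               # row sums of the other columns
            # oppositely order column j against 'rest': smallest rest gets largest value
            X[np.argsort(rest), j] = np.sort(X[:, j])[::-1]
        new_var = np.var(X.sum(axis=1))
        if new_var >= best_var:                          # no improvement: stop
            break
        best_var = new_var
    return X
```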

The standard RA has recently been generalized by considering blocks (Bernard and McLeish 2016; Bernard et al. 2017a; Boudt et al. 2017). The idea is as follows. If the distribution of \(S=X_{1}+X_{2}+\cdots +X_{d}\) has minimum variance, it must actually hold that for any decomposition \(\{1,2,\ldots ,d\}=I_{1}\cup I_{2}\) into two disjoint sets \(I_{1}\) and \(I_{2},\) the sums \(S_{1}:=\sum _{k\in I_{1}}X_{k}\) and \(S_{2}:=\sum _{k\in I_{2}}X_{k}\) are anti-monotonic (and not only for sets of the form \(I_{1}=\{j\}\) and \(I_{2}=\{1,2,\ldots ,d\}\backslash I_{1}\), as in the standard RA). This observation makes it possible to generalize the standard RA by rearranging “blocks of columns” instead of one column at a time: the columns in the first set \(I_{1}\) are stacked into a matrix (block) \(\mathbf {X}_{1}\) and one rearranges its rows (i.e., one swaps entire rows) such that the row sums of \(\mathbf {X}_{1}\) (reflecting \(S_{1}\)) are in increasing order. As for the matrix \(\mathbf {X}_{2}\) that is formed by stacking the remaining columns, the rows are rearranged such that the row sums (reflecting \(S_{2}\)) are in decreasing order.

2.1.2 Block rearrangement algorithm (BRA)

  1. Select a random sample of \(n_{sim}\) possible partitions of the columns \(\{1,2,\ldots ,d\}\) into two non-empty subsets \(\{I_{1},I_{2}\}\). When \(d\leqslant 10\), we take \(n_{sim}=2^{d-1}-1\) so that all non-trivial partitions are considered.

  2. For each of the \(n_{sim}\) partitions, create the matrices (blocks) \(\mathbf {X}_{1}\) and \(\mathbf {X}_{2}\) with corresponding row sums \(S_{1}\) and \(S_{2}\), and rearrange the rows of \(\mathbf {X}_{2}\) so that \(S_{2}\) is anti-monotonic to \(S_{1}\).

  3. If there is no improvement in \(\text {var}\left( \sum _{k=1}^{d}X_{k}\right) \), output the current matrix \(\mathbf {X}\); otherwise, return to step 1.

When all partitions can be assessed, Bernard and McLeish (2016) show that the BRA outperforms the standard RA by orders of magnitude in terms of the distance between the obtained local minimum for the variance of the sum and the global minimum. When the different partitions cannot all be assessed in a reasonable amount of time (i.e., for matrices with many columns), a stopping rule needs to be used and it is no longer clear-cut that the BRA dominates the standard RA. Boudt et al. (2017) conduct a horse race and show that for typical applications in quantitative risk management the standard RA and the BRA are close competitors. By contrast, for some rearrangement problems that appear in operations research, a suitably designed BRA outperforms the standard RA.
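A minimal sketch of the BRA along these lines; for \(d\leqslant 10\) it enumerates all \(2^{d-1}-1\) non-trivial partitions, whereas a random sample of partitions could be substituted for larger d. The helper names are ours.

```python
import numpy as np
from itertools import combinations

def all_partitions(d):
    """All 2^(d-1) - 1 partitions of {0,...,d-1} into two non-empty blocks,
    canonicalized so that column 0 always lies in the first block."""
    parts = []
    for r in range(d - 1):
        for c in combinations(range(1, d), r):
            I1 = [0, *c]
            I2 = [j for j in range(d) if j not in I1]
            parts.append((I1, I2))
    return parts

def bra(X, max_sweeps=1000):
    """Block Rearrangement Algorithm: for each partition {I1, I2}, permute whole
    rows of block I2 so that its row sums are anti-monotonic to those of block I1."""
    X = X.copy()
    parts = all_partitions(X.shape[1])
    best_var = np.var(X.sum(axis=1))
    for _ in range(max_sweeps):
        for I1, I2 in parts:
            s1 = X[:, I1].sum(axis=1)
            s2 = X[:, I2].sum(axis=1)
            order1 = np.argsort(s1)              # positions of increasing S_1
            order2 = np.argsort(s2)[::-1]        # rows of block 2 by decreasing S_2
            X[np.ix_(order1, I2)] = X[np.ix_(order2, I2)]
        new_var = np.var(X.sum(axis=1))
        if new_var >= best_var:                  # no improvement: stop
            return X
        best_var = new_var
    return X
```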

2.2 Inferring dependence

To infer the dependence among variables \(X_{1},\ldots ,X_{d}\) with known distributions and for which the distribution of the sum S is also known, we apply the BRA to the following \(n\times (d+1)\) matrix \(\mathbf {M}\), defined as

$$\begin{aligned} \mathbf {M}=[X_{1},\ldots ,X_{d},-S]=\left[ \begin{array}{ccccc} x_{11} & x_{12} & \ldots & x_{1d} & -s_{1} \\ x_{21} & x_{22} & \ldots & x_{2d} & -s_{2} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_{n1} & x_{n2} & \ldots & x_{nd} & -s_{n} \end{array} \right] . \end{aligned}$$
(2.1)

Elements \(x_{ij}\) are defined as before and we sample the realizations \(s_{i}\) of the sum S using its distribution \(F_{S}\), i.e., \(s_{{i}}:=F_{S}^{-1}(\frac{i-0.5}{n}).\)
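Putting the pieces together, the matrix \(\mathbf {M}\) of (2.1) can be assembled and passed to the `bra` sketch of Sect. 2.1.2. The Gaussian marginals and the value of \(\sigma _S\) below are hypothetical choices for illustration.

```python
import numpy as np
from scipy.stats import norm

d, n = 3, 1000
sigma = [1.0, 2/3, 1/3]                          # hypothetical marginal standard deviations
sigma_S = 1.87                                   # hypothetical; must be compatible with the marginals
u = (np.arange(1, n + 1) - 0.5) / n
X = np.column_stack([norm.ppf(u, scale=s) for s in sigma])
M = np.column_stack([X, -norm.ppf(u, scale=sigma_S)])        # last column holds -s_i

rng = np.random.default_rng(0)
M0 = np.column_stack([rng.permutation(M[:, j]) for j in range(d + 1)])  # randomized start
M_out = bra(M0)                                  # bra() as sketched in Sect. 2.1.2
residual = M_out.sum(axis=1)                     # close to zero in every row if a compatible
                                                 # dependence has (approximately) been found
```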

In this regard, we note it is not guaranteed that after running the BRA the row sums of the rearranged matrix are all equal to zero and that a compatible candidate dependence has been found. Formally, “compatibility” is defined as follows.

Definition 2.1

(Compatibility) A set of d distributions \(F_1,F_2,\ldots ,F_d\) is said to be compatible with the distribution \(F_S\) if there exists a random vector \((X_1,X_2,\ldots ,X_d)\) and a random variable S such that the \(X_i\) and S are distributed as \(F_i\) (\(i=1,2,\ldots ,d\)) and \(F_S\), respectively, and \(S=\sum _{i=1}^dX_i\) almost surely. In this case, the dependence among the \(X_i\) (\(i=1,2,\ldots ,d\)) is called a compatible dependence.

“Compatibility” is closely related to joint mixability (Wang and Wang 2016). Specifically, \(F_1,F_2,\ldots ,F_d\) is compatible with the distribution \(F_S\) if and only if \(F_1,F_2,\ldots ,F_d\) and \(F_{-S}\) are jointly mixable with null center.

Clearly, a necessary requirement for joint mixability of \(F_1,F_2,\ldots ,F_d\) and \(F_{-S}\) is that \(E(S)=E(X_{1})+\cdots +E(X_{d})\) holds. Wang and Wang (2016) study joint mixability in some generality and provide several theoretical results that ensure existence of a compatible dependence in various cases of interest. Furthermore, Puccetti and Wang (2015) propose the Mixability Detection Procedure (MDP) as a suitable method to numerically assess joint mixability for an arbitrary set of bounded distributions. In what follows, we assume that there is at least one compatible dependence for the given distributions \(F_i\) (\(i=1,2,\ldots ,d\)) and \(F_S\) and we merely aim at inferring any such dependence.

In the ideal situation, we observe that after running the BRA the row sums of the rearranged matrix are all equal to zero and a compatible dependence has been found. In practice, however, due to discretization errors and the fact that the algorithm is a heuristic, the solution is only approximate. In the remainder of the paper, we study the properties of this approximate solution. To do so, we repeat the experiment K times. Each time, we randomize each of the \(d+1\) columns of \(\mathbf {M}\) given in (2.1) and next we apply the BRA. For each \(k=1,2,\ldots ,K\) we denote by \(\mathbf {M}^{(k)}=[X_{1}^{(k)},\ldots ,X_{d}^{(k)},-S^{(k)}]\) the matrix after rearrangement. Note that the vectors \([X_{1}^{(k)},\ldots ,X_{d}^{(k)}]\) only differ with respect to their interdependence. As for each k we randomize the initial matrix (i.e., we randomize the initial condition for the algorithm), it appears reasonable to assert that the procedure provides a way to describe the set of all possible dependence structures (copulas) that are consistent with the given information. Nevertheless, we find that application of the procedure leads to “a thin set” of dependence structures: the solutions \(\mathbf {X}^{(k)}=[X_{1}^{(k)},\ldots ,X_{d}^{(k)}]\) all have (nearly) the same correlation matrix and this matrix has maximum determinant. The only exceptions come from the situations in which the initial dependence structure is such that the variance of the row sums is already zero. In this case, the BRA has converged up-front and stops at the initial configuration. However, our study will show that even a tiny perturbation of this initial structure will suffice for the algorithm to converge to a dependence structure with a correlation matrix that has maximum determinant. Moreover, when distributions are Gaussian we show that this inferred dependence actually reflects maximum entropy. The following section introduces the notation and some theoretical results that are needed for this purpose.

3 Maximum entropy

In the remainder of the paper, all variables \(X_{1},\ldots ,X_{d}\) and their sum S have known distributions \(F_{1},\ldots ,F_{d}\) and \(F_{S}\). Let \(\mu _{1},\ldots ,\mu _{d}\) and \(\mu _{S}\) denote their respective means and \(\sigma _{1},\ldots ,\sigma _{d}\) and \(\sigma _{S}\) denote the standard deviations.

3.1 Correlation matrices with maximum determinant

In general, there could be many compatible dependence structures (copulas) for the given \(F_{1},\ldots ,F_{d}\) and \(F_{S}\). To compare them, we will focus on the properties of their corresponding (Pearson) correlation matrices. In this regard, it is convenient to formally define several sets of correlation matrices.

Definition 3.1

(All correlation matrices) Let \(\mathcal {B}\) denote the set of all \(d \times d\) symmetric positive semi-definite matrices whose elements are equal to one on the main diagonal and range between \(-1\) and 1 everywhere else.

Consider a correlation matrix \(R\in \mathcal {B}\),

$$\begin{aligned} R=\left[ \begin{array}{cccc} 1 & r_{12} & \ldots & r_{1d} \\ r_{21} & 1 & \ldots & r_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ r_{d1} & r_{d2} & \ldots & 1 \end{array} \right] . \end{aligned}$$
(3.2)

Since the standard deviations \(\sigma _{1},\ldots ,\sigma _{d}\) and \(\sigma _{S}\) are fixed, R (corresponding to a compatible dependence) must satisfy the following additional restriction:

$$\begin{aligned} \sigma _{S}^{2}=\sum _{i=1}^{d}\sigma _{i}^{2}+2\sum _{i=1}^{d-1}\sum _{j>i}\sigma _{i}\sigma _{j}{r}_{ij}. \end{aligned}$$

Equivalently, we can re-write the above condition as the constraint on average correlation:

$$\begin{aligned} \frac{\sum _{i=1}^{d-1}\sum _{j>i}\sigma _{i}\sigma _{j} r _{ij}}{\sum _{i=1}^{d-1}\sum _{j>i}\sigma _{i}\sigma _{j}} = \rho ^{imp}, \end{aligned}$$
(3.3)

where the constant \(\rho ^{imp}\) is defined as

$$\begin{aligned} \rho ^{imp}:=\frac{\sigma _{S}^{2}-\sum _{i=1}^{d}\sigma _{i}^{2}}{2\sum _{i=1}^{d-1}\sum _{j>i}\sigma _{i}\sigma _{j}}. \end{aligned}$$

This motivates our next definition of the constrained set \(\mathcal{C}(\rho ^{imp})\).

Definition 3.2

(Constrained correlation matrices) Let \(\mathcal{C}(\rho ^{imp})\) denote the set of correlation matrices \(R \in \mathcal{B}\) satisfying the constraint in (3.3).
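As a small numerical illustration of the constraint (3.3) (with hypothetical inputs), the implied average correlation \(\rho ^{imp}\) can be computed directly from the standard deviations:

```python
import numpy as np

sigma = np.array([1.0, 2/3, 1/3])        # marginal standard deviations (as in Fig. 1)
sigma_S = 1.87                           # hypothetical standard deviation of the sum
cross = sum(sigma[i] * sigma[j]
            for i in range(len(sigma)) for j in range(i + 1, len(sigma)))
rho_imp = (sigma_S**2 - np.sum(sigma**2)) / (2 * cross)      # roughly 0.79 here
# any correlation matrix in C(rho_imp) must have this sigma-weighted average correlation
```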

Figure 1 illustrates the two sets \(\mathcal{B}\) and \(\mathcal{C}(\rho ^{imp})\) for the special case when \(d=3\).

Fig. 1 The sets \(\mathcal {B}\) and \(\mathcal{C}(\rho ^{imp})\) in Definitions 3.1 and 3.2 for \(d=3\). The example uses \(\sigma _1=1\), \(\sigma _2=\frac{2}{3}\), \(\sigma _3=\frac{1}{3}\), and \(\rho ^{imp}=0.8\). The colored volume represents the set of all correlation matrices \(\mathcal {B}\) and the (gray) plane represents the linear constraint (3.3). Their intersection forms the constrained set \(\mathcal{C}(\rho ^{imp})\). (Color figure online)

Proposition 3.3

The set \(\mathcal{C}(\rho ^{imp})\) is convex and compact.

Proof

It is well known that the set \(\mathcal {B}\) is compact and convex. On the other hand, the hyperplane corresponding to the linear constraint (3.3) is convex and closed. Since \(\mathcal{C}(\rho ^{imp})\) is the intersection of these two sets, it is convex and compact. \(\square \)

Assume that the \(F_{1},\ldots ,F_{d}\) are compatible with \(F_{S}\). Clearly, every compatible dependence must be such that its corresponding correlation matrix satisfies the constraint in (3.3). By contrast, there might be correlation matrices R that satisfy this constraint but for which no compatible dependence exists.

Definition 3.4

(Compatible correlation matrices) Let \(\mathcal{D}\) denote the set of correlation matrices \(R \in \mathcal{B}\) that result from all compatible dependence structures (copulas) for the given marginal distributions \(F_1,F_2,\ldots ,F_d\) and \(F_S\).

It is clear that \(\mathcal{D} \subset \mathcal{C}(\rho ^{imp})\), but the reverse inclusion does not hold in general. The following proposition shows that in the Gaussian case \(\mathcal{D}\) and \(\mathcal{C}(\rho ^{imp})\) coincide.

Proposition 3.5

Assume that \(F_{1},\ldots ,F_{d}\) and \(F_{S}\) are Gaussian distributions and that \(F_{1},\ldots ,F_{d}\) are compatible with \(F_{S}\). Then \(\mathcal{D} = \mathcal{C}(\rho ^{imp})\).

Proof

We only need to show that \(\mathcal{C}(\rho ^{imp}) \subset \mathcal{D}\). Consider \(R\in \mathcal{C}(\rho ^{imp})\): it is a well-defined correlation matrix satisfying (3.3). Then there exists a d-dimensional Gaussian joint distribution with a Gaussian copula with correlation matrix R and marginal distributions \(F_1,F_2,\ldots ,F_d\). Let \((X_1, X_2,\ldots ,X_d)\) be a vector following this d-dimensional Gaussian joint distribution. Then \(S:=X_1+X_2+ \cdots +X_d\) is Gaussian with mean \(\mu _1+\cdots +\mu _d=\mu _S\) and, by (3.3), variance \(\sigma _S^2\); hence S is distributed as \(F_S\). Therefore, \(F_1,F_2,\ldots ,F_d\) are compatible with \(F_S\) and there exists a compatible dependence structure among the \(X_i\) (distributed as \(F_i\)) having correlation matrix R. Hence R belongs to \(\mathcal{D}\). \(\square \)

Remark 3.6

Note that under the framework of elliptical distributions (which includes Gaussian distributions as a special case), necessary and sufficient conditions for a set of marginals to be jointly mixable have been given in Wang and Wang (2016). In particular, Gaussian distributions \(\mathcal {N}(0,\sigma _i^2)\), \(i=1,\ldots ,d+1\) are jointly mixable if and only if \(\sum _{i=1}^{d+1}\sigma _i\geqslant 2\max _{i=1,\ldots ,d+1}\sigma _i\). In fact, the necessary part follows from subadditivity of standard deviation (triangle inequalities) and the sufficient part builds on closedness of the (multivariate) normal family under linear combinations. More details can be found in their paper.

In general, characterizing the set \(\mathcal {D}\) for an arbitrary choice of marginal distributions is an open problem, as \(\mathcal {D}\) will typically only be a strict subset of \(\mathcal {C}(\rho ^{imp})\).

Proposition 3.7

(Maximum determinant for Gaussian margins) There exists a unique matrix \(R \in \mathcal{C}(\rho ^{imp})\) that has maximum determinant among all matrices in \(\mathcal{C}(\rho ^{imp})\).

Proof

Since the determinant is continuous and \(\mathcal{C}(\rho ^{imp})\) is compact, there exists a matrix \(R \in \mathcal{C}(\rho ^{imp})\) which attains the maximum determinant. Furthermore, since \(\mathcal{C}(\rho ^{imp})\) is a convex set and the log-determinant is strictly concave on positive definite matrices (see Boyd and Vandenberghe 2004, Chapter 3, Section 1, p. 74), the matrix that maximizes the log-determinant, and thus also the determinant, is unique. \(\square \)

Definition 3.8

Let \(R_{M}(\rho ^{imp})\) denote the matrix with maximum possible determinant among all \(R \in \mathcal{C}(\rho ^{imp})\) and let \(\Delta _{M}(\rho ^{imp})\) denote its determinant.

Without the linear constraint (3.3) on the correlation matrix, the maximum determinant (over the set \(\mathcal {B}\)) is equal to 1, which is achieved when all off-diagonal elements of R are zero. With this constraint, however, \(\Delta _{M}(\rho ^{imp})\) is typically strictly less than 1. To find \(\Delta _{M}(\rho ^{imp})\), we solve the constrained optimization problem numerically in Matlab. Specifically, we search for a symmetric correlation matrix R as in (3.2), with \(\frac{1}{2}d(d-1)\) unknown parameters, that maximizes the determinant subject to the constraint that all eigenvalues are non-negative and to the additional linear constraint in (3.3).
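The paper performs this optimization in Matlab; the following is a rough Python analogue (our sketch, not the authors' code). It parametrizes R by its \(\frac{1}{2}d(d-1)\) off-diagonal entries and maximizes the log-determinant subject to the linear constraint (3.3) and non-negativity of the eigenvalues.

```python
import numpy as np
from scipy.optimize import minimize

def max_det_correlation(sigma, rho_imp):
    """Numerically approximate R_M(rho_imp): the correlation matrix with maximum
    determinant satisfying the average-correlation constraint (3.3)."""
    d = len(sigma)
    iu = np.triu_indices(d, k=1)
    w = sigma[iu[0]] * sigma[iu[1]]                   # weights sigma_i * sigma_j

    def to_matrix(r):
        R = np.eye(d)
        R[iu] = r
        return R + np.triu(R, 1).T                    # symmetrize, unit diagonal

    def neg_logdet(r):
        sign, logdet = np.linalg.slogdet(to_matrix(r))
        return 1e6 if sign <= 0 else -logdet          # penalize non-positive-definite iterates

    constraints = [
        {"type": "eq", "fun": lambda r: w @ r / w.sum() - rho_imp},                  # (3.3)
        {"type": "ineq", "fun": lambda r: np.linalg.eigvalsh(to_matrix(r)).min()},   # PSD
    ]
    r0 = np.full(len(w), rho_imp)                     # feasible start: equicorrelation
    res = minimize(neg_logdet, r0, bounds=[(-1, 1)] * len(w), constraints=constraints)
    R_M = to_matrix(res.x)
    return R_M, np.linalg.det(R_M)
```

For the inputs of Fig. 1, one would call `max_det_correlation(np.array([1.0, 2/3, 1/3]), 0.8)`.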

3.2 Maximum entropy dependence for Gaussian margins

Entropy refers to the disorder or uncertainty of a system. It plays a fundamental role in various disciplines including information theory, computing and thermodynamics (e.g., the second law of thermodynamics). In his fundamental work, Shannon (1948) defines the entropy H of a discrete random variable X with probability mass function P(X) as \(H(X)=-\text {E}(\log (P(X)))\). There are natural extensions to measure the entropy of multivariate distributions. Let f be the density of a multivariate distribution of \((X_1,\ldots ,X_d)\); then the entropy is defined as

$$\begin{aligned} H(X_1,\ldots ,X_d)=-\text {E}(\log (f(X_1,\ldots ,X_d))). \end{aligned}$$

The following result characterizes the maximum entropy distribution of a random vector \((X_1,\ldots ,X_d)\) with Gaussian margins and a given correlation matrix. Its proof follows from the fact that, among all absolutely continuous distributions with unbounded support and a given invertible covariance matrix, the multivariate Gaussian distribution has maximum entropy.

Proposition 3.9

(Maximum entropy for Gaussian margins) The entropy of the multivariate distribution of a random vector \((X_1,\ldots ,X_d)\) with Gaussian margins and invertible correlation matrix R satisfies

$$\begin{aligned} H(X_1,\ldots ,X_d)\leqslant \frac{d}{2}\left( 1+\ln (2\pi )\right) + \sum _{i=1}^{d}\ln (\sigma _{i})+ \frac{1}{2}\ln (\det (R)), \end{aligned}$$
(3.4)

where equality holds if and only if \((X_1,\ldots ,X_d)\) follows a multivariate Gaussian distribution.

This proposition (and its proof) can be found as Theorem 8.6.5 in Thomas and Cover (2006), where it is formulated in a slightly different way. Specifically, these authors prove that

$$\begin{aligned} H(X_1,\ldots ,X_d)\leqslant \frac{d}{2}\left( 1+\ln (2\pi )\right) + \frac{1}{2}\ln (\det (\Sigma )), \end{aligned}$$

in which \(\Sigma \) is the covariance matrix of the vector \((X_1,\ldots ,X_d)\). Since the determinant of the covariance matrix \(\Sigma \) can be written as \(\det [\Sigma ]=\det [R]\cdot \left( \prod _{i=1}^{d}\sigma _{i}^{2}\right) \), we obtain (3.4).

As a direct consequence of Proposition 3.9, the maximum entropy distribution that is consistent with given Gaussian margins and a given average correlation \(\rho ^{imp}\) is uniquely characterized by a Gaussian dependence with a correlation matrix \(R_{M}(\rho ^{imp})\).

Our numerical experiments show that there is a striking connection between the BRA and maximum entropy, in that every run of the BRA leads to

  (i) a dependence close to Gaussian, and

  (ii) a correlation matrix with determinant close to the maximum possible determinant.

Given that we do not formally prove this result, but rather provide evidence based on extensive simulations, we need to introduce the relevant concepts that make it possible to reach such a conclusion. This is the topic of the next section.

Remark 3.10

The conclusion regarding the maximum entropy property remains unchanged when the “sum” is not put in the last column. This can be seen by relabeling the columns. Indeed, in the situation of (joint) mixability it holds for every column that \(X_k=-\sum _{i\ne k} X_i\). We thus have \(d+1\) columns whose correlation matrix has determinant 0, but when we extract any d of these columns, the determinant must be maximum. In addition, once the dependence is shown to be Gaussian, if \((X_1,\ldots ,X_d)\) is a Gaussian vector, then \((X_1,\ldots ,X_{k-1},X_{k+1},\ldots ,X_d,-S)\) is also a Gaussian vector.

4 Maximum entropy and BRA

We run the BRA K times. For \(k=1,2,\ldots ,K\), let \(R^{(k)}\) denote the correlation matrix of the output \(\mathbf {X}^{(k)}\):

$$\begin{aligned} R^{(k)} := \left[ \begin{array}{cccc} 1 & \rho _{12}^{(k)} & \ldots & \rho _{1d}^{(k)} \\ \rho _{21}^{(k)} & 1 & \ldots & \rho _{2d}^{(k)} \\ \vdots & \vdots & \ddots & \vdots \\ \rho _{d1}^{(k)} & \rho _{d2}^{(k)} & \ldots & 1 \end{array} \right] , \end{aligned}$$

and denote its determinant by \({\Delta ^{(k)}}\). We also define the realized (or effective) average correlation \(\varrho ^{(k)}\) as

$$\begin{aligned} \varrho ^{(k)}=\frac{\sum _{i=1}^{d-1}\sum _{j>i}\sigma _{i}\sigma _{j}\rho _{ij}^{(k)}}{\sum _{i=1}^{d-1}\sum _{j>i}\sigma _{i}\sigma _{j}}, \end{aligned}$$
(4.5)

in which \(\sigma _{i}\) specifically denotes the standard deviation of \(X_{i}^{(k)}\) and \(\rho _{ij}^{(k)}\) is the pairwise correlation between \(X_{i}^{(k)}\) and \(X_{j}^{(k)}\), for all i and j. Finally, we define the deviation \(V^{(k)}\) as the \(L^{2}\)-norm of the \(n \times 1\) residual vector \(e^{(k)} := X_{1}^{(k)}+\cdots +X_{d}^{(k)} -S^{(k)}\):

$$\begin{aligned} V^{(k)}=\left| \left| e^{(k)}\right| \right| _{2}. \end{aligned}$$
(4.6)

Like all other quantities above, \(V^{(k)}\) depends on n but, for ease of presentation, we omit the explicit dependence on n.
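Given one rearranged matrix \(\mathbf {M}^{(k)}=[X_{1}^{(k)},\ldots ,X_{d}^{(k)},-S^{(k)}]\) (e.g., an output of the BRA sketch of Sect. 2), the quantities \(\Delta ^{(k)}\), \(\varrho ^{(k)}\) and \(V^{(k)}\) can be computed as in the following sketch (helper naming is ours).

```python
import numpy as np

def run_diagnostics(M_out):
    """Compute Delta^(k), varrho^(k) as in (4.5) and V^(k) as in (4.6)
    for one rearranged matrix [X_1, ..., X_d, -S]."""
    X, neg_S = M_out[:, :-1], M_out[:, -1]
    d = X.shape[1]
    R = np.corrcoef(X, rowvar=False)                  # realized correlation matrix R^(k)
    sigma = X.std(axis=0)
    iu = np.triu_indices(d, k=1)
    w = sigma[iu[0]] * sigma[iu[1]]
    varrho = (w * R[iu]).sum() / w.sum()              # realized average correlation (4.5)
    V = np.linalg.norm(X.sum(axis=1) + neg_S)         # L2 norm of the residual e^(k), (4.6)
    return np.linalg.det(R), varrho, V
```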

Proposition 4.1

The following inequality holds:

$$\begin{aligned} \left| \varrho ^{(k)} - \rho ^{imp} \right| \leqslant C_1 V^{(k)} + C_2 \left( V^{(k)} \right) ^2, \end{aligned}$$
(4.7)

where \(C_1\) and \(C_2\) are constants depending on the standard deviations \(\sigma _1, \ldots , \sigma _d\) and \(\sigma _S\).

Proof

For simplicity, we drop the superscript in \(X_i^{(k)}\), \(S^{(k)}\), \(\varrho ^{(k)}\), \(e^{(k)}\) and \(V^{(k)}\). Denote \(U =X_1+ \cdots + X_d\), so that \(e = U-S\). Using the definitions of \(\varrho \) and \(\rho ^{imp}\):

$$\begin{aligned} \varrho - \rho ^{imp} = \frac{\sigma _{U}^{2}-\sum _{i=1}^{d}\sigma _{i}^{2}}{2\sum _{i=1}^{d-1}\sum _{j>i}\sigma _{i}\sigma _{j}} - \frac{\sigma _{S}^{2}-\sum _{i=1}^{d}\sigma _{i}^{2}}{2\sum _{i=1}^{d-1}\sum _{j>i}\sigma _{i}\sigma _{j}} = \frac{\sigma _{U}^{2}-\sigma _{S}^{2}}{2\sum _{i=1}^{d-1}\sum _{j>i}\sigma _{i}\sigma _{j}}. \end{aligned}$$

Therefore,

$$\begin{aligned} \left| \varrho - \rho ^{imp} \right| = \frac{ \left| \sigma _{U}^{2}-\sigma _{S}^{2} \right| }{2\sum _{i=1}^{d-1}\sum _{j>i}\sigma _{i}\sigma _{j}}. \end{aligned}$$

Moreover, since \(U=S+e\), we have \(|\sigma _{U}-\sigma _{S}|\leqslant \sigma _{e}\) and \(\sigma _{e}\leqslant V\), so that

$$\begin{aligned} \left| \sigma _{U}^{2}-\sigma _{S}^{2} \right| \leqslant \sigma _e^2 + 2\sigma _e \sigma _S \leqslant V^2 + 2 V \sigma _S, \end{aligned}$$

which completes the proof of the proposition. \(\square \)

Intuitively, Proposition 4.1 states that, as the accuracy of a candidate solution improves (i.e., the deviation \(V^{(k)}\) decreases), the realized average correlation \(\varrho ^{(k)}\) generally gets closer to \(\rho ^{imp}\).

Each run of the algorithm leads to a realized determinant \({\Delta ^{(k)}}\), average correlation \(\varrho ^{(k)}\), and deviation \(V^{(k)}\) (\(k=1,2,\ldots ,K\)). We collect these into the vectors \(\Delta :=(\Delta ^{(1)},\Delta ^{(2)},\ldots ,\Delta ^{(K)})\), \(\varrho :=(\varrho ^{(1)},\varrho ^{(2)},\ldots ,\varrho ^{(K)})\) and \(V:=(V^{(1)},V^{(2)},\ldots ,V^{(K)})\), respectively, and introduce measures of dispersion that make it possible to assess them in a meaningful way.

4.1 Distance to the maximum determinant matrix

To assess the similarity of the outputs of BRA, we make use of the following measure of dispersion.

Definition 4.2

(Mean Percentage Error) For any vector of interest Z taking values \(z^{(k)}\) \((k=1,2,\ldots ,K)\) and some “reference point” z, we denote by MPE(Z) its average percentage error, where the percentage error (PE) of \(z^{(k)}\) is given as

$$\begin{aligned} PE(z^{(k)}):=100\left[ \frac{z^{(k)}-z}{z}\right] . \end{aligned}$$

The main discrepancy measure that we consider concerns the normalized determinants \(\root d \of {\Delta ^{(k)}}\). The determinant is a measure of volume in \(\mathbb {R}^{d}\) and it satisfies the scaling property that \(\det [cR]=c^{d}\det [R]\) for constant c. Thus, \(\root d \of {\Delta }\) can be viewed as “dimension-invariant.” Specifically, we consider

$$\begin{aligned} MPE_{1}\left( \root d \of {\Delta }\right) , \end{aligned}$$
(4.8)

which measures by how much on average normalized determinants \(\root d \of {\Delta ^{(k)}}\) differ from the reference value \(\root d \of {\Delta _{M}(\rho ^{imp})}\). That is, for \(k=1,2,\ldots ,K\) we first compute the percentage errors \(PE\left( \root d \of {\Delta ^{(k)}}\right) \) with respect to \(\root d \of { \Delta _{M}(\rho ^{imp})}\) and then take their average.

In this regard, one would expect that \(\root d \of {\Delta ^{(k)}}\leqslant \root d \of {\Delta _{M}(\rho ^{imp})}\), implying that \(MPE_{1}\left( \root d \of {\Delta }\right) \) would be a suitable measure of discrepancy (as the average of K errors that all have the same sign). However, the interpretation of \(MPE_{1}\left( \root d \of {\Delta }\right) \) is somewhat ambiguous. Due to discretization, it is usually not even possible to obtain a dependence that yields a realized correlation equal to the target correlation \(\rho ^{imp}\). Typically, \(V^{(k)}\ne 0\) and \(\varrho ^{(k)}\) is slightly smaller in absolute value than \(\rho ^{imp}\), which means that \(\Delta _{M}(\varrho ^{(k)})>\Delta _{M}(\rho ^{imp})\). Hence, it may happen for some k that \(PE\left( \root d \of {\Delta ^{(k)}}\right) >0\). To accommodate this possibility, we consider an alternative measure of discrepancy where \(PE\left( \root d \of {\Delta ^{(k)}}\right) \) is computed with respect to the reference point \(\root d \of {\Delta _{M}(\varrho ^{(k)})}\) instead of \(\root d \of {\Delta _{M}(\rho ^{imp})}\). This choice of the reference point ensures that \(PE\left( \root d \of {\Delta ^{(k)}}\right) \leqslant 0\). We denote the resulting percentage error as \(PE_{2}\left( \root d \of {\Delta ^{(k)}}\right) \) and its average as

$$\begin{aligned} MPE_{2}\left( \root d \of {\Delta }\right) . \end{aligned}$$
(4.9)

Finally, the average percentage error of \(\varrho \) to the target value \(\rho ^{imp}\) is denoted by \(MPE(\varrho ).\)

It is easy to verify that these three measures are scale invariant. In the following examples, we observe that all these measures converge to 0 when the number of discretization steps n increases.
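A small sketch of these discrepancy measures under our naming; `Delta_M_imp` stands for \(\Delta _{M}(\rho ^{imp})\) and `Delta_M_of` is a (hypothetical) callable returning \(\Delta _{M}(\varrho ^{(k)})\), which could, for instance, wrap the optimization sketch of Sect. 3.1.

```python
import numpy as np

def pe(z, ref):
    """Percentage error of z with respect to a reference point ref."""
    return 100.0 * (z - ref) / ref

def mpe_measures(Delta, varrho, d, Delta_M_imp, Delta_M_of, rho_imp):
    """MPE_1 and MPE_2 on the normalized determinants, and MPE(varrho), over K runs."""
    Delta, varrho = np.asarray(Delta), np.asarray(varrho)
    nd = Delta ** (1.0 / d)                                    # normalized determinants
    mpe1 = pe(nd, Delta_M_imp ** (1.0 / d)).mean()             # reference: Delta_M(rho_imp)
    refs = np.array([Delta_M_of(v) ** (1.0 / d) for v in varrho])
    mpe2 = pe(nd, refs).mean()                                 # reference: Delta_M(varrho^(k))
    return mpe1, mpe2, pe(varrho, rho_imp).mean()
```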

Property 4.3

(Scale invariance) The measures \(MPE_{1}\left( \root d \of {\Delta }\right) \), \(MPE_{2}\left( \root d \of {\Delta }\right) \) and \(MPE(\varrho )\) are scale-invariant.

4.2 Testing the Gaussian dependence

We also test whether for \(k=1,2,\ldots ,K\) the different \(\mathbf {M}^{(k)}=[X_{1}^{(k)},\ldots ,X_{d}^{(k)},-S^{(k)}]\) can be seen as draws from a multivariate normal distribution. Well-known tests in this regard are Mardia's tests of multinormality and variants thereof. For general multivariate data, Mardia (1970) constructed two statistics measuring multivariate skewness and kurtosis, which can be used for testing the hypothesis of normality (Mardia 1974, 1975; Mardia et al. 1980). Usually, the tests of whether kurtosis (MK) and skewness (MS) are consistent with a Gaussian model are performed separately, but there are also so-called omnibus tests that assess them simultaneously. In this paper, we use these two tests and perform them separately.
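Mardia's statistics are straightforward to compute directly; the sketch below is our own minimal implementation following Mardia (1970), using the standard asymptotic reference distributions (chi-squared for skewness, normal for kurtosis).

```python
import numpy as np
from scipy.stats import chi2, norm

def mardia_tests(X):
    """Mardia's multivariate skewness (MS) and kurtosis (MK) statistics, with
    asymptotic p-values under the null of multivariate normality.
    Note: builds an n x n matrix, so memory grows quadratically in n."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(Xc.T @ Xc / n)              # inverse of the MLE covariance
    D = Xc @ S_inv @ Xc.T                             # Mahalanobis cross-products
    b1 = (D ** 3).sum() / n**2                        # multivariate skewness
    b2 = (np.diag(D) ** 2).sum() / n                  # multivariate kurtosis
    p_skew = chi2.sf(n * b1 / 6.0, df=d * (d + 1) * (d + 2) / 6)
    z_kurt = (b2 - d * (d + 2)) / np.sqrt(8.0 * d * (d + 2) / n)
    p_kurt = 2 * norm.sf(abs(z_kurt))
    return p_skew, p_kurt
```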

In the next section, we show that for Gaussian marginal distributions, the BRA yields a Gaussian dependence with corresponding correlation matrix that has maximum determinant.

5 Normal distribution

Assume that the components \(X_{i}\) \((i=1,2,\ldots ,d)\) and their sum S are all normally distributed, and that the marginal distributions are “compatible” with the distribution of the sum (as per Definition 2.1). To simplify the exposition, all variables are centered, i.e., they have zero mean. The standard deviations \(\sigma _{1},\ldots ,\sigma _{d}\) are assumed to decrease linearly from 1 to 1/d, whereas the standard deviation of the sum \(\sigma _{S}\) is chosen in such a way that \(\rho ^{imp}=0.8\). The number of components d ranges from 3 to 10. To discretize the marginal distributions, we use either \(n=1000\) or \(n=10{,}000\). In each experiment, we run the BRA \(K=500\) times to obtain candidate approximate dependence structures, as described in Sect. 2.2.
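For reference, the implied \(\sigma _S\) in this setup can be backed out from the constraint (3.3); a short sketch (our code) is given below.

```python
import numpy as np

d, rho_imp = 5, 0.8
sigma = np.linspace(1.0, 1.0 / d, d)          # standard deviations decreasing from 1 to 1/d
iu = np.triu_indices(d, k=1)
cross = (sigma[iu[0]] * sigma[iu[1]]).sum()
sigma_S = np.sqrt(np.sum(sigma**2) + 2 * rho_imp * cross)    # sigma_S implied by (3.3)
# discretize N(0, sigma_i^2) and N(0, sigma_S^2) as in Sect. 2 and run the BRA K = 500 times
```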

Fig. 2 Normal distribution with \(d=3\) and \(n=1000\). For each case, there are \(K=500\) blue dots corresponding to different runs of the BRA. The shaded gray area represents the constrained set \(\mathcal{C}(\rho ^{imp})\) and the red star indicates the maximal matrix \(R_M\). The left panel shows \(K=500\) realizations of the correlations \(\rho _{12}\), \(\rho _{13}\), and \(\rho _{23}\). The right panel shows the relation \(\Delta \) versus \(\rho _{12}\). (Color figure online)

Our first experiment considers the case \(d=3\) and is illustrated in Fig. 2. The left panel shows the relationships between the three pairwise correlations \(\rho _{12}\), \(\rho _{13}\), and \(\rho _{23}\). On this plane, the gray “ellipse” represents the constrained set \(\mathcal{C}(\rho ^{imp})\); see also Fig. 1 for a graphical illustration of \(\mathcal{C}(\rho ^{imp})\). All the points in this gray area are candidate solutions and correspond to a global optimum of the variance minimization problem under consideration. With the help of Matlab, we solve the high-dimensional constrained optimization problem to find the correlation matrix \(R_{M}(\rho ^{imp}) \in \mathcal{C}(\rho ^{imp})\) with maximum determinant \(\Delta _{M}(\rho ^{imp})\) (see also Sect. 4) and represent it with a red star. We then run the BRA 500 times and show the corresponding solutions as blue dots. It is striking to observe that for all 500 runs the candidate solutions obtained with the BRA exhibit correlations that are very close to those of the maximum determinant matrix. This conclusion is also confirmed by the right panel, which plots the values of \(\Delta ^{(k)}\) versus \(\rho _{12}^{(k)}\). When the level of discretization n increases from 1000 to 10,000, the cluster of blue dots becomes even tighter and closer to the maximum determinant matrix.

The representation of the range of possible correlations through an “ellipse” is convenient when \(d=3\), but this representation needs to be adapted in higher dimensions. Therefore, in Table 1 we provide several measures that make it possible to interpret the properties of the output matrix when d increases up to 10 dimensions. In particular, we report the measures of discrepancy \(MPE_{1}(\root d \of {\Delta })\) and \(MPE_{2}(\root d \of {\Delta })\). The last two columns of this table report the results of the two statistical tests MS and MK that we performed to test for multivariate normality (see Sect. 4.2).

Table 1 Normal distribution
Fig. 3 Normal distribution with \(d=3\) and \(d=7\). For each case, mean percentage errors \(MPE_1(\root d \of {\Delta })\) (blue line) and \(MPE_2(\root d \of {\Delta })\) (red line) are shown for various values of n. (Color figure online)

We make the following observations related to Table 1. First, we observe that the results illustrated in Fig. 2 for \(d=3\) carry over to higher dimensions. Indeed, \(MPE_{1}(\root d \of {\Delta })\) and \(MPE_{2}(\root d \of {\Delta })\) are both very small, which confirms the convergence of the BRA to a candidate dependence whose correlation matrix has maximum determinant (the MPEs are computed by averaging the values displayed in Figs. 4, 5). In addition, the rate of convergence is illustrated in Fig. 3.

Second, we observe that the algorithm converges better when n increases, as measured by the mean of the 500 deviations \(V^{(k)}\). Somewhat surprisingly, the convergence also improves when the dimension d increases. Third, we find that the realized correlation \(\varrho ^{(k)}\) on average gets closer and closer to \(\rho ^{imp}\) when d increases and when n increases. This observation is consistent with Proposition 4.1. Fourth, consistently with intuition, we observe that \(\root d \of {\Delta }\) is much more stable when the discretization level n increases. Finally, for all runs the statistical tests lead to acceptance of the null hypothesis of multivariate normality at a confidence level of 94%. From Table 1 we can thus conclude that the BRA yields a dependence structure that is Gaussian and whose correlation matrix has maximum determinant.

Fig. 4 Normal distribution with \(d=3\) and \(d=5\). Each black dot corresponds to one of the \(K=500\) BRA runs. For each case, the first three panels show \(PE_1(\root d \of {\Delta ^{(k)}})\), \(V^{(k)}\), and \(PE_2(\root d \of {\Delta ^{(k)}})\) with respect to \(PE({\varrho }^{(k)})\). The black dashed lines indicate sample averages. The right bottom panel shows the 5th, 50th and 95th percentiles of the distributions of correlations \(\rho _{ij}^{(k)}\) generated by the K runs. The percentiles are shown along \(\frac{1}{2}d(d-1)\) vertical lines, corresponding to the correlations \(\rho _{12},\rho _{13},\ldots ,\rho _{(d-1)d}\). The red dots are the corresponding correlations for the maximal matrix \(R_{M}\). (Color figure online)

Fig. 5 Normal distribution with \(d=7\) and \(d=10\). Each black dot corresponds to one of the \(K=500\) BRA runs. For each case, the first three panels show \(PE_1(\root d \of {\Delta ^{(k)}})\), \(V^{(k)}\), and \(PE_2(\root d \of {\Delta ^{(k)}})\) with respect to \(PE({\varrho }^{(k)})\). The black dashed lines indicate sample averages. The right bottom panel shows the 5th, 50th and 95th percentiles of the distributions of correlations \(\rho _{ij}^{(k)}\) generated by the K runs. The percentiles are shown along \(\frac{1}{2}d(d-1)\) vertical lines, corresponding to the correlations \(\rho _{12},\rho _{13},\ldots ,\rho _{(d-1)d}\). The red dots are the corresponding correlations for the maximal matrix \(R_{M}\). (Color figure online)

Figures 4 and 5 provide additional support for these findings by further exploring the structure of the candidate solutions generated by the BRA when the discretization level is \(n=1000\). Specifically, for each case of dimensionality \(d=3\), 5, 7 or 10, the first three panels plot the measures \(PE_{1}(\root d \of {\Delta ^{(k)}})\), \(PE_{2}(\root d \of {\Delta ^{(k)}})\) and \(V^{(k)}\) with respect to \(PE({\varrho }^{(k)})\), for each of the 500 runs of the BRA. The last panel shows the distributions of the pairwise correlations \(\rho _{12}^{(k)},\rho _{13}^{(k)},\ldots ,\rho _{(d-1)d}^{(k)}\). The plots confirm that the different candidate solutions are all very similar and exhibit close to maximum entropy.

We make additional observations related to Figs. 4 and 5. First, we note that \(\varrho ^{(k)}\) is always less than \(\rho ^{imp}\). Intuitively, to increase the determinant, correlations should be as close to zero as possible. Second, since the constraint (3.3) must be respected, the optimizer increases those correlations \(\rho _{ij}^{(k)}\) for which the weights \(\sigma _{i}\sigma _{j}\) are relatively high, and reduces those correlations for which these weights are relatively low. This particular pattern for the correlations is evident from all figures. Note also that the correlations \(\rho _{ij}^{(k)}\) that correspond to higher products \(\sigma _{i}\sigma _{j}\) are recovered by the BRA more accurately. Third, we observe that V and \(\varrho \) are negatively correlated. Intuitively, when the BRA converges to an approximate solution that is relatively far from a global solution, \(\varrho \) has more “room” to deviate from \(\rho ^{imp}\), which translates into a higher mean percentage error relative to \(\rho ^{imp}\).

Remark 5.1

Under the probabilistic interpretation of matrices given in Sect. 2, the outputs of the RA and BRA algorithms represent discrete mutually complete dependences (see Lancaster 1963) and, in particular, so-called multivariate Shuffles-of-Min (see Durante and Sánchez 2012). Nevertheless, we show that the Gaussian assumption cannot be rejected. This illustrates once more the interplay between a discrete optimization method, which can only provide discrete results, and the fact that any dependence structure can be approximated by a sufficiently fine Shuffle-of-Min.

6 Robustness

In this section, we study the robustness of our results. First, we investigate the sensitivity to the initial condition. Second, we investigate marginal distributions that are non-Gaussian. Finally, we investigate the impact of using the standard RA instead of the BRA.

Fig. 6 Normal distribution with \(d=3\) and \(n=1000\), 3000, or 10,000. For each case, there are \(K=500\) blue dots corresponding to different runs of the BRA. Each run of the BRA starts at a particular solution (marked by the green star), but with 2/6/20 random rows swapped when \(n=1000{/}3000{/}10{,}000\). The shaded gray area represents the constrained set \(\mathcal{C}(\rho ^{imp})\) and the red star indicates the maximal matrix \(R_M\). The left panel shows realizations of the correlations \(\rho _{12}\), \(\rho _{13}\), and \(\rho _{23}\). The right panel shows the relation \(\Delta \) versus \(\rho _{12}\). (Color figure online)

6.1 Sensitivity to initial conditions

Recall that when the algorithm starts from a randomized matrix \(\mathbf {M}\) [as in (2.1)] for which the row sums are already all equal to zero, the algorithm has converged up-front and the output matrix is the same as the initial one (which is then just a global optimum). In this particular instance, the BRA does not converge to the maximum determinant correlation matrix. However, if we slightly perturb this initial (optimal) matrix, so that the row sums are no longer exactly equal to zero, then we observe that the algorithm converges to the maximum determinant correlation matrix.

This phenomenon is illustrated in Fig. 6 for the levels of discretization \(n=1000\), 3000, and 10,000. We start with a particular global solution, marked by the green star, but with 0.2% of the rows (of the matrix \(\mathbf {M}\)) swapped to create a small random perturbation. That is, 2 rows out of 1000, 6 rows out of 3000, and 20 rows out of 10,000 are randomly selected and swapped. Without the perturbation, the BRA would terminate where it starts (at the particular solution marked by the green star). With the perturbation, however, the BRA typically converges to a solution close to the maximum determinant matrix, marked by the red star. The “attraction” to the maximum determinant matrix gets progressively stronger and more striking as n increases. When \(n=1000\), one can see a “path” of blue dots between the green and red stars. When \(n=10{,}000\), all blue dots are clustered very tightly around the red star.

6.2 Sensitivity to marginal distributions

When the marginal distributions are not Gaussian, it is not clear that the Gaussian copula is the copula that maximizes entropy. Therefore, we merely assess to what extent the property of maximum determinant for the correlation matrix, which we observed for Gaussian margins, remains valid when other distributional assumptions are made. In order to do so, we resort to skewed t-distributions. To this end, we recall that a d-dimensional random vector \({\mathbf {X}}\) is said to follow a normal mean-variance mixture if

$$\begin{aligned} \mathbf {X} \sim {\varvec{\mu }} + Y {\varvec{\gamma }} + \sqrt{Y} \mathbf {Z} \end{aligned}$$
(6.10)

where \({\mathbf {Z}} \sim N_d(0,\mathbf {W})\) is a d-dimensional multivariate normal with \(E [{\mathbf {Z}}] =0\) and \(\hbox {var}({\mathbf {Z}})= \mathbf {W}\), where \(\mathbf {W} \in \mathbb {R}^{d \times d}\) is positive semi-definite, \(Y \geqslant 0\) is a non-negative scalar random variable independent of \({\mathbf {Z}}\), and \({\varvec{\gamma }} \in \mathbb {R}^{d}\) and \({\varvec{\mu }} \in \mathbb {R}^{d}\) are constants.

Specification (6.10) implies that \(\mathbf {X} |_{Y=y} \sim N_d(\varvec{\mu }+y\varvec{\gamma },y \mathbf {W}).\) Furthermore,

$$\begin{aligned} E[\mathbf {X}] = \varvec{\mu } + E[Y] \varvec{\gamma }, \qquad \text{ and } \qquad \text {cov}(\mathbf {X}) = E[Y] \mathbf {W} + \hbox {var}(Y) \varvec{\gamma }\varvec{\gamma }^t. \end{aligned}$$

The multivariate Generalized Hyperbolic (GH) distribution is obtained when the scalar factor in the mixture Y follows a generalized inverse Gaussian distribution with parameters \(\lambda \), \(\chi \) and \(\psi \) (Chapter 3.2 in McNeil et al. 2010) and is denoted as \( \mathbf {X} \sim GH_d(\lambda ,\chi ,\psi ,\varvec{\mu },\mathbf {W},\varvec{\gamma }). \)

Fig. 7 Histogram and QQ-plot for the sum S generated with the multivariate Skewed-t distribution when \(d=3\) and \(n=1000\)

Table 2 Skewed-t distribution
Fig. 8 Skewed-t distribution with \(d=3\) and \(n=1000\) or \(n=10{,}000\). For each case, there are \(K=500\) blue dots corresponding to different runs of the BRA. The shaded gray area represents the constrained set \(\mathcal{C}(\rho ^{imp})\) and the red star indicates the maximal matrix \(R_M\). The left panel shows \(K=500\) realizations of the correlations \(\rho _{12}\), \(\rho _{13}\), and \(\rho _{23}\). The right panel shows the relation \(\Delta \) versus \(\rho _{12}\). (Color figure online)

Fig. 9 Normal distribution with \(d=3\) and \(n=1000\) or \(n=10{,}000\), using the standard RA instead of the BRA. As before, there are \(K=500\) blue dots corresponding to different runs of the RA. The shaded gray area represents the constrained set \(\mathcal{C}(\rho ^{imp})\) and the red star indicates the maximal matrix \(R_M\). The left panel shows \(K=500\) realizations of the correlations \(\rho _{12}\), \(\rho _{13}\), and \(\rho _{23}\). The right panel shows the relation \(\Delta \) versus \(\rho _{12}\). (Color figure online)

Fig. 10 Normal distribution with \(d=3\) and \(d=10\), using the standard RA instead of the BRA. Each black dot corresponds to one of the \(K=500\) RA runs. For each case, the first three panels show \(PE_1(\root d \of {\Delta ^{(k)}})\), \(V^{(k)}\), and \(PE_2(\root d \of {\Delta ^{(k)}})\) with respect to \(PE({\varrho }^{(k)})\). The black dashed lines indicate sample averages. The right bottom panel shows the 5th, 50th and 95th percentiles of the distributions of correlations \(\rho _{ij}^{(k)}\) generated by the K runs. The percentiles are shown along \(\frac{1}{2}d(d-1)\) vertical lines, corresponding to the correlations \(\rho _{12},\rho _{13},\ldots ,\rho _{(d-1)d}\). The red dots are the corresponding correlations for the maximal matrix \(R_{M}\). (Color figure online)

A useful property of the GH family is that it is closed under linear operations, implying in particular that \(S=\sum _i X_i\) also has a GH distribution. In this paper, we are interested in the special case of the GH family where \(\lambda =-\nu /2\), \(\chi =\nu \) and \(\psi =0\), so that Y follows an inverse Gamma distribution, \(Y\sim IG(\nu /2,\nu /2)\). This special case corresponds to a Skewed-t distribution, for which \( \mathbf {X} \sim Skew_d(\nu ,\varvec{\mu }, \mathbf {W}, \varvec{\gamma }) \) and

$$\begin{aligned} \text {cov}(\mathbf {X}) = \frac{\nu }{\nu -2} \mathbf {W} + \frac{2\nu ^2}{(\nu -2)^2 (\nu -4)} \varvec{\gamma }\varvec{\gamma }^t, \end{aligned}$$
(6.11)

where the covariance matrix is well-defined as long as \(\nu >4\). The sum S as well as the components \(X_{i}\) \((i=1,2,\ldots ,d)\) now follow one-dimensional Skewed-t distributions. In particular,

$$\begin{aligned} S\sim Skew_1\left( \nu ,\sum _i \mu _i,{\mathbf 1}{\mathbf {W}}{\mathbf 1}^t,\sum _i \gamma _i\right) . \end{aligned}$$

We simulate from the multivariate Skewed-t distribution to obtain realizations for the components \(X_{i}\) \((i=1,2,\ldots ,d)\) and their sum S. Next, we use this information to construct the matrix \(\mathbf {M}\) on which we repeatedly apply the BRA. As before, we assume that the number of components d ranges from 3 to 10 and the level of discretization is either \(n=1000\) or \(n=10{,}000\). For the ease of comparison, we set the parameters of the Skewed-t distribution so that the covariance matrix \(\text {cov}(\mathbf {X})\) is the same as in the normal case. Specifically, as in the normal case, the standard deviations \(\sigma _{1},\ldots ,\sigma _{d}\) of the marginals decrease linearly from 1 to 1/d and all pairwise correlations are equal to \(\rho ^{imp}=0.8\). For each component \(X_i\), the location parameter is zero, \(\mu _i=0\), and the skewness parameter is proportional to \(\sigma _i\), that is, \(\gamma _i = c \sigma _i = 1.5 \sigma _i\). We set \(\nu =15\) and use the relation in (6.11) to deduce the covariance matrix \(\mathbf {W}\) of the normal factors \(\mathbf {Z}\). Our choice of the parameters guarantees that \(\mathbf {W}\) is positive semi-definite.
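A minimal sketch (our code, for \(d=3\)) of this simulation via the mixture representation (6.10), with \(Y\sim IG(\nu /2,\nu /2)\) and \(\mathbf {W}\) backed out from (6.11):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, nu, rho_imp, c = 3, 1000, 15.0, 0.8, 1.5
sigma = np.linspace(1.0, 1.0 / d, d)
gamma = c * sigma                                               # skewness parameters
target_cov = np.diag(sigma) @ (rho_imp * np.ones((d, d))
                               + (1 - rho_imp) * np.eye(d)) @ np.diag(sigma)
# back out W from (6.11): cov(X) = nu/(nu-2) W + 2 nu^2/((nu-2)^2 (nu-4)) gamma gamma'
W = (target_cov - 2 * nu**2 / ((nu - 2)**2 * (nu - 4)) * np.outer(gamma, gamma)) * (nu - 2) / nu

Y = 1.0 / rng.gamma(shape=nu / 2, scale=2.0 / nu, size=n)       # Y ~ InvGamma(nu/2, nu/2)
Z = rng.multivariate_normal(np.zeros(d), W, size=n)
X = Y[:, None] * gamma + np.sqrt(Y)[:, None] * Z                # mixture (6.10) with mu = 0
S = X.sum(axis=1)                                               # one-dimensional Skewed-t
```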

Figure 7 illustrates the shape of the Skewed-t distribution used in the simulations. It plots the histogram and QQ-plot for the sum S when \(d=3\) and \(n=1000\). Table 2 reports the results for \(K=500\) runs of the BRA when d ranges from 3 to 10 and for two levels of discretization, \(n=1000\) and \(n=10{,}000\).

In Fig. 8, we analyze the case \(d=3\). As before, the “ellipse” in the left panel represents the constrained set \(\mathcal{C}(\rho ^{imp})\). For all 500 runs, the candidate solutions exhibit correlations that closely match those of the maximum determinant matrix. This conclusion only strengthens when the level of discretization increases from 1000 to 10,000. The candidate solutions generated by the BRA exhibit more variation than in the normal case. This is to be expected, as the components \(X_i\) now have considerable skewness and heavier tails. As a result, outliers are more frequent and the overall precision of the BRA declines noticeably. Nevertheless, the solutions remain tightly clustered around the one corresponding to the maximum determinant.

Two additional figures in “Appendix”, Figs. 11 and 12, provide further insights into the structure of the output for \(n=1000\). For each case of \(d=3\), 5, 7 or 10, the first three panels plot the measures \(PE_{1}(\root d \of {\Delta ^{(k)}})\), \(PE_{2}(\root d \of {\Delta ^{(k)}})\), and \(V^{(k)}\) with respect to \(PE(\varrho ^{(k)})\), for each of the 500 runs of the BRA. The last panel shows the distributions of pairwise correlations \(\rho _{12}^{(k)},\rho _{13}^{(k)},\ldots ,\rho _{(d-1)d}^{(k)}\). The plots confirm that the candidate solutions are all similar and attain close to maximum determinant.

6.3 Sensitivity to rearrangement algorithm

Finally, we assess to which extent our main findings change when using the standard RA, which is described in Sect. 2. Specifically, we follow the same Gaussian setup as in Sect. 5, except now we use the standard RA to infer the dependence among the variables instead of the BRA. We report the results in Figs. 9, 10, and Table 3.

It is clear that the standard RA does not recover the dependence with maximum entropy. From Fig. 10, we observe that the 500 experiments lead to correlation matrices whose determinants may be very far from that of the maximum determinant matrix. Figure 10 illustrates the magnitude of the distance to the maximum determinant matrix and shows that the pairwise correlations can lie in a very wide range (fourth panel for \(d=3\) and eighth panel for \(d=10\)), compared to the results found in Figs. 4 and 5 using the BRA. Furthermore, one observes from Table 3 that both multivariate normality tests fail, so that the hypothesis that the dependence is given by a Gaussian copula is now rejected.

Table 3 Normal distribution, using the standard RA instead of the BRA

7 Conclusion

We perform an extensive study of the Block Rearrangement Algorithm (BRA) when used to infer dependence structures among variables that are consistent with information on the marginal distributions of the variables and the distribution of their sum. We find that when the variables are normally distributed the BRA always yields very similar solutions, which moreover have the maximum entropy property. This stylized fact is remarkable: given that we start from a random initial structure, one would expect a wide range of compatible dependence structures to be attained, yet we obtain essentially a single dependence structure, which exhibits maximum entropy. These findings are robust in that, for non-normal distributions, the BRA still yields an essentially unique dependence whose correlation matrix has maximum determinant. Hence, we obtain strong support for using the BRA as a stable algorithm to infer an economically meaningful dependence. Providing a formal proof for the property of maximum entropy seems very challenging (as we do not observe a systematic increase of entropy at each step of the algorithm) and is left for future research.