Rearrangement algorithm and maximum entropy

We study properties of the block rearrangement algorithm (BRA) in the context of inferring dependence among variables given their marginal distributions and the distribution of their sum. We show that when all distributions are Gaussian the BRA yields solutions that are "close to each other" and exhibit almost maximum entropy, i.e., the inferred dependence is Gaussian with a correlation matrix that has maximum possible determinant. We provide evidence that, when the distributions are no longer Gaussian, the property of maximum determinant continues to hold. A consequence of these findings is that the BRA can be used as a stable algorithm for inferring an economically meaningful dependence.


Introduction
In this paper, we study properties of the Block Rearrangement Algorithm (BRA), which generalizes the standard Rearrangement Algorithm (RA) introduced by Puccetti and Rüschendorf (2012) and further developed in Embrechts et al. (2013). The RA has applications in various disciplines. Embrechts et al. (2013) use the RA in quantitative risk management to obtain sharp approximations for the maximum possible Value-at-Risk (VaR) of a portfolio sum of d dependent risks when all marginal distributions are known but no information on dependence is available; see Aas and Puccetti (2014) for a case study. The RA can also be used in situations in which partial information on dependence is available (see Bernard et al. 2017a, b; Bernard and Vanduffel 2015). The algorithm also has important applications in operations research. Consider the production of some goods requiring d different specialized operations and assume that there are n identical assembly lines. For each operation j there are n workers available; the i-th worker requires an amount of time equal to x_ij to execute the task. The problem of optimal Assembly Line Crew Scheduling (ALCS) is then to (re)allocate the nd workers so that all assembly lines are operating and the maximum production time is minimized (see e.g., Coffman and Yannakakis 1984). The RA attempts to achieve this objective.
We study properties of the RA in the context of the problem of inferring dependence among risks X_j (j = 1, 2, ..., d) under the sole knowledge of their marginal distributions and of the distribution of their sum S = X_1 + X_2 + · · · + X_d. Besides some theoretical appeal, the solution to this problem is also of practical relevance in finance. Using the price information of options (written) on individual stocks as well as on the index that comprises these stocks, it is indeed possible to infer the risk-neutral distribution F_j of future stock returns X_j (j = 1, 2, ..., d) as well as the distribution F_S of the future index return S := X_1 + X_2 + · · · + X_d (Ross 1976; Breeden and Litzenberger 1978; Banz and Miller 1978; Aït-Sahalia and Lo 2000; Bondarenko 2003). The joint distribution, however, is not observable directly. We develop an algorithm to infer dependence among variables with given marginal distributions F_j and for which the distribution of the sum F_S is known. In other words, our algorithm makes it possible to infer, in a model-free and forward-looking way, the dependence structure of stock returns consistent with the current market prices of options. There are various potential applications. For instance, the inferred dependence can be used to price multivariate derivatives non-parametrically. It is then straightforward to price arbitrarily weighted basket options, exchange options or spread options such that their prices are consistent with existing options written on individual assets and on a given index. Another possible application is to develop an indicator of herd behavior among stock returns using information on option prices only. Such an index can then be used as a signal of systemic risk.
The method that we propose to infer dependence is solely based on the Rearrangement Algorithm (RA) and its recent generalization, the Block Rearrangement Algorithm (BRA) (Bernard et al. 2017a; Bernard and McLeish 2016; Boudt et al. 2017). In general, there might be several compatible dependence structures,1 and one could reasonably expect that, by randomizing its initialization, the BRA allows one to infer all possible dependence structures. However, this is not the case. When all distributions are Gaussian, we find that among all possible candidate dependence structures the BRA yields a solution with (almost) maximum entropy. That is, the inferred dependence among the variables X_j (j = 1, 2, ..., d) is approximately Gaussian and its correlation matrix has maximum determinant. This feature is interesting for two reasons. First, the fact that the BRA yields solutions that barely depend on its initialization ensures that it can be used as a stable algorithm for inferring dependence among stock prices. Second, the stable solution is economically meaningful, as maximum entropy (statistically established when all distributions are Gaussian) corresponds to the intuition that in well-functioning markets maximum uncertainty must prevail. A dependence with lower entropy would mean that we use information that we do not possess. This insight is also explored in Rubinstein (1994), Jackwerth and Rubinstein (1996), and Stutzer (1996) in the context of inferring marginal risk-neutral probabilities from observed option prices. In this regard, we note that the principle of maximum entropy appears in many other disciplines. For example, the second law of thermodynamics asserts that the entropy of irreversible processes can never decrease. Another example pertains to the field of statistics, where the principle of maximum entropy is used to obtain prior probability distributions for Bayesian inference; see Jaynes (1957).
The paper is organized as follows. In Sect. 2 we recall the standard Rearrangement Algorithm and the Block Rearrangement Algorithm and present their use to infer dependence among random variables. In Sect. 3, we discuss the principle of maximum entropy. In Sect. 5 we show that for Gaussian margins the algorithm identifies the dependence with maximum entropy, using the methodology and notation presented in Sect. 4. In Sect. 6 we perform a robustness analysis and assess the extent to which our main result remains valid in a non-Gaussian framework or when the standard RA is used. We provide evidence that when the distributions are skewed the inferred dependence is still unique and exhibits a correlation matrix with maximum determinant. Section 7 concludes.

Rearrangement algorithm to infer dependence
We first present the standard RA of Puccetti and Rüschendorf (2012) and Embrechts et al. (2013) and show how to modify it to infer the dependence among variables for which the marginal distributions and the distribution of the sum are known. We include the presentation of the algorithms for completeness.

Rearrangement algorithm
The RA was originally proposed in the context of assessing the impact of model uncertainty on portfolio models and is considered a breakthrough in this regard. Technically, it amounts to finding a dependence among risks X_j with known distributions F_j (j = 1, 2, ..., d) such that the variance of their sum S = X_1 + X_2 + · · · + X_d becomes as low as possible. As the algorithm requires random variables that are discretely distributed,2 we further assume that each risk X_j (j = 1, 2, ..., d) is sampled into n equiprobable values, i.e., we consider the realizations x_ij := F_j^{-1}((i − 0.5)/n) (i = 1, 2, ..., n; j = 1, 2, ..., d). We obtain an n × d matrix X = (x_ij). The matrix can readily be seen as the representation of a d-dimensional random vector: each column corresponds to a random variable and each row is a possible joint outcome that occurs with probability 1/n. This interpretation allows us to also apply probabilistic operators (such as the mean, the variance and the correlation) to columns of matrices.
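As an illustration, the discretization step can be sketched as follows (a minimal sketch assuming hypothetical Gaussian margins with illustrative standard deviations; the quantile rule x_ij = F_j^{-1}((i − 0.5)/n) is the one from the text):

```python
import numpy as np
from statistics import NormalDist

# Hypothetical margins: d = 3 Gaussian risks with illustrative standard deviations.
n, d = 1000, 3
sigmas = [1.0, 2.0, 3.0]

# x_ij := F_j^{-1}((i - 0.5)/n): n equiprobable realizations per margin.
X = np.array([[NormalDist(0.0, s).inv_cdf((i + 0.5) / n) for s in sigmas]
              for i in range(n)])

# Each column is a discretized margin; each row occurs with probability 1/n.
assert X.shape == (n, d)
```

Each column then approximates the corresponding marginal distribution; rearranging values within a column changes only the dependence, not the margin.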
We consider the problem of rearranging the elements x_ij of the matrix X such that after rearrangement the variance of S is as small as possible. We allow for rearrangements within columns, as doing so affects the dependence among the X_j, j = 1, 2, ..., d, but not their marginal distributions. By contrast, swapping values among columns would affect the marginal distributions and is not allowed. Clearly, in order for S to have the smallest possible variance it must hold that, for all j = 1, 2, ..., d, X_j is anti-monotonic with Σ_{k≠j} X_k (Puccetti and Rüschendorf 2012, Theorem 2.1); this observation lies at the core of their rearrangement method.

Standard rearrangement algorithm (RA)
1. For j = 1, 2, ..., d, make the j-th column anti-monotonic with the sum of the other columns, i.e., rearrange its values such that they are oppositely ordered to the row sums across the other columns.
2. If there is no improvement in var(Σ_{k=1}^d X_k), output the current matrix X; otherwise return to step 1.
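The two steps above can be sketched in Python as follows (an illustrative implementation, not the authors' code; the toy data are hypothetical):

```python
import numpy as np

def rearrangement_algorithm(X, max_iter=100):
    """Standard RA sketch: make each column anti-monotonic with the sum of
    the other columns until var(row sums) no longer improves."""
    X = X.copy()
    best = np.var(X.sum(axis=1))
    for _ in range(max_iter):
        for j in range(X.shape[1]):
            others = X.sum(axis=1) - X[:, j]
            # oppositely order column j against the partial row sums
            X[np.argsort(others), j] = np.sort(X[:, j])[::-1]
        v = np.var(X.sum(axis=1))
        if v >= best - 1e-12:
            break
        best = v
    return X

# Toy example: three identical discretized margins, randomly shuffled.
rng = np.random.default_rng(1)
X0 = rng.permuted(np.tile(np.linspace(-1.0, 1.0, 100), (3, 1)).T, axis=0)
Xr = rearrangement_algorithm(X0)
```

Each column pass can only decrease the variance of the row sums, so the loop stops at a local optimum, not necessarily the global one, exactly as discussed below.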
At each step of this algorithm, we ensure that the j-th column is anti-monotonic with the sum of the remaining columns, so that the columns, say X_j before rearranging and X̃_j after rearranging, satisfy var(X̃_j + Σ_{k≠j} X_k) ≤ var(X_j + Σ_{k≠j} X_k).3 Hence, at each step of the algorithm the variance decreases. If the lower bound of zero is attained, then we have found the situation of perfect mixability4 among the variables, i.e., a dependence that makes the sum constant, and a global optimum is obtained. Otherwise, the algorithm converges to a local minimum.5 Indeed, there is no guarantee that the local minimum yields the minimum variance of the sum, as another initialization of X (i.e., a random permutation of the elements in each column) is likely to lead to another local optimum. Precisely, finding the dependence structure that minimizes the variance of the sum is an NP-complete problem (Haus 2015), which means that, most likely, there exists no deterministic algorithm with polynomial complexity. Thus, the above RA can only be a heuristic.
Footnote 3: Indeed, var(Σ_{k=1}^d X_k) = var(X_j + Σ_{k≠j} X_k), and a necessary condition for var(Σ_{k=1}^d X_k) to become minimum is that each X_j is anti-monotonic with Σ_{k≠j} X_k.

Footnote 4: The concept of mixability was introduced in Wang and Wang (2011) and further extended in Wang and Wang (2016); see also Gaffke and Rüschendorf (1981) for some early results that are connected.

Footnote 5: Each time a column is effectively rearranged, the variance of the sum strictly decreases. Since there is a finite number of possible rearranged matrices, the algorithm always terminates in a finite number of steps. At this point, all columns satisfy the anti-monotonicity property and a local optimum for the variance has been reached.

The standard RA has recently been generalized by considering blocks (Bernard and McLeish 2016; Bernard et al. 2017a; Boudt et al. 2017). The idea is as follows. If the distribution of S = X_1 + X_2 + · · · + X_d has minimum variance, it must actually hold that,
for any decomposition {1, 2, ..., d} = I_1 ∪ I_2 into two disjoint non-empty sets I_1 and I_2, the sums S_1 := Σ_{k∈I_1} X_k and S_2 := Σ_{k∈I_2} X_k are anti-monotonic (and not only for sets of the form I_1 = {j} and I_2 = {1, 2, ..., d}\I_1, as in the standard RA). This observation makes it possible to generalize the standard RA by rearranging "blocks of columns" instead of one column at a time: the columns in the first set I_1 are stacked into a matrix (block) X_1 and one rearranges its rows (i.e., one swaps entire rows) such that the row sums of X_1 (reflecting S_1) are in increasing order. As for the matrix X_2 that is formed by stacking the remaining columns, the rows are rearranged such that the row sums (reflecting S_2) are in decreasing order.

Block rearrangement algorithm (BRA)
1. Select a random sample of n_sim possible partitions of the columns {1, 2, ..., d} into two non-empty subsets {I_1, I_2}. When d ≤ 10, we take n_sim = 2^{d−1} − 1, so that all non-trivial partitions are considered.6
2. For each of the n_sim partitions, create the matrices (blocks) X_1 and X_2 with corresponding row sums S_1 and S_2 and rearrange the rows of X_2 so that S_2 is anti-monotonic to S_1.
3. If there is no improvement in var(Σ_{k=1}^d X_k), output the current matrix X; otherwise, return to step 1.
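A sketch of the BRA along the same lines (illustrative Python, assuming d is small enough to enumerate all 2^(d−1) − 1 non-trivial partitions rather than sampling them):

```python
import numpy as np
from itertools import combinations

def block_rearrangement_algorithm(X, max_iter=50):
    """BRA sketch: for every partition {I1, I2} of the columns, reorder
    entire rows of block I2 so that its row sums S2 are anti-monotonic
    to the row sums S1 of block I1."""
    X = X.copy()
    n, d = X.shape
    # all non-trivial partitions, each represented by the block containing column 0
    partitions = [[0, *c] for r in range(d - 1)
                  for c in combinations(range(1, d), r)]
    best = np.var(X.sum(axis=1))
    for _ in range(max_iter):
        for I1 in partitions:
            I2 = [j for j in range(d) if j not in I1]
            s1 = X[:, I1].sum(axis=1)
            s2 = X[:, I2].sum(axis=1)
            desc = X[np.argsort(s2)[::-1]][:, I2]        # block I2, s2 descending
            X[:, I2] = desc[np.argsort(np.argsort(s1))]  # match the ranks of s1
        v = np.var(X.sum(axis=1))
        if v >= best - 1e-12:
            break
        best = v
    return X

# Toy example: three shuffled identical margins (hypothetical data).
rng = np.random.default_rng(2)
X0 = rng.permuted(np.tile(np.linspace(-1.0, 1.0, 60), (3, 1)).T, axis=0)
Xb = block_rearrangement_algorithm(X0)
```

Because whole rows of a block are swapped, the marginal distribution of every column is preserved while the variance of the row sums can only decrease.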
When all partitions can be assessed, Bernard and McLeish (2016) show that the BRA outperforms the standard RA by orders of magnitude in terms of the distance between the obtained local minimum for the variance of the sum and the global minimum. When the different partitions cannot be assessed in a reasonable amount of time (i.e., for matrices with many columns), a stopping rule needs to be used and it is no longer clear-cut that the BRA dominates the standard RA. Boudt et al. (2017) conduct a horse race and show that for typical applications in quantitative risk management the standard RA and the BRA are close competitors. By contrast, for some rearrangement problems that appear in operations research, a suitably designed BRA outperforms the standard RA.

Inferring dependence
To infer the dependence among variables X_1, ..., X_d with known distributions and for which the distribution of the sum S is also known, we apply the BRA to the following n × (d + 1) matrix M, whose first d columns contain the discretized margins and whose last column contains the values −s_i:

M = [x_ij | −s_i], i = 1, 2, ..., n, j = 1, 2, ..., d. (2.1)

The elements x_ij are defined as before and we sample the realizations s_i of the sum S using its distribution F_S, i.e., s_i := F_S^{-1}((i − 0.5)/n). In this regard, we note that it is not guaranteed that after running the BRA the row sums of the rearranged matrix are all equal to zero, i.e., that a compatible candidate dependence has been found. Formally, "compatibility" is defined as follows.
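Constructing the matrix M of (2.1) can be sketched as follows (illustrative standard deviations; compatibility of the margins with F_S is assumed here rather than verified):

```python
import numpy as np
from statistics import NormalDist

# Sketch of the matrix M in (2.1): d discretized margins plus a last
# column holding -s_i, so that zero row sums signal a compatible dependence.
n, d = 1000, 3
sigmas = [1.0, 0.8, 0.6]   # illustrative assumptions
sigma_S = 2.0              # assumed such that the margins are compatible with F_S

q = (np.arange(1, n + 1) - 0.5) / n
M = np.column_stack(
    [[NormalDist(0.0, s).inv_cdf(p) for p in q] for s in sigmas]
    + [[-NormalDist(0.0, sigma_S).inv_cdf(p) for p in q]]
)
assert M.shape == (n, d + 1)
```

Running the BRA on M then drives the variance of the row sums, Σ_j x_ij − s_i, towards zero.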
Definition 2.1 (Compatibility) A set of d distributions F_1, F_2, ..., F_d is said to be compatible with the distribution F_S if there exist a random vector (X_1, X_2, ..., X_d) and a random variable S such that the X_i and S are distributed as F_i (i = 1, 2, ..., d) and F_S, respectively, and S = Σ_{i=1}^d X_i almost surely. In this case, the dependence among the X_i (i = 1, 2, ..., d) is called a compatible dependence.

"Compatibility" is closely related to joint mixability (Wang and Wang 2016). Specifically, F_1, F_2, ..., F_d is compatible with the distribution F_S if and only if F_1, F_2, ..., F_d and F_{−S} are jointly mixable with null center.
Clearly, a necessary requirement for joint mixability of F_1, F_2, ..., F_d and F_{−S} is that the means are consistent, i.e., μ_1 + μ_2 + · · · + μ_d = μ_S. Wang and Wang (2016) study joint mixability in some generality and provide several theoretical results that ensure existence of a compatible dependence in various cases of interest. Furthermore, Puccetti and Wang (2015) propose the Mixability Detection Procedure (MDP) as a suitable method to numerically assess joint mixability for an arbitrary set of bounded distributions. In what follows, we assume that there is at least one compatible dependence for the given distributions F_i (i = 1, 2, ..., d) and F_S, and we merely aim at inferring any such dependence.
In the ideal situation, after running the BRA the row sums of the rearranged matrix are all equal to zero and a compatible dependence has been found. In practice, however, due to discretization errors and the fact that the algorithm is a heuristic, the solution is only approximate. In the remainder of the paper, we study the properties of this approximate solution. To do so, we repeat the experiment K times. Each time, we randomize each of the d + 1 columns of M given in (2.1) and next apply the BRA. For each k = 1, 2, ..., K we denote by X^(k) = [X_1^(k), X_2^(k), ..., X_d^(k)] the resulting rearranged matrix of margins; the X^(k) only differ with respect to their interdependence. As for each k we randomize the initial matrix (i.e., we randomize the initial condition for the algorithm), it appears reasonable to assert that the procedure provides a way to describe the set of all possible dependence structures (copulas) that are consistent with the given information. Nevertheless, we find that application of the procedure leads to "a thin set" of dependence structures: the solutions X^(k) all have (nearly) the same correlation matrix, and this matrix has maximum determinant. The only exceptions come from situations in which the initial dependence structure is such that the variance of the row sums is already zero. In this case, the BRA has converged up-front and stops at the initial configuration. However, our study will show that even a tiny perturbation of this initial structure suffices for the algorithm to converge to a dependence structure whose correlation matrix has maximum determinant. Moreover, when distributions are Gaussian we show that this inferred dependence actually reflects maximum entropy. The following section introduces the notation and some theoretical results that are needed for this purpose.

Maximum entropy
In the remainder of the paper, all variables X_1, ..., X_d and their sum S have known distributions F_1, ..., F_d and F_S. Let μ_1, ..., μ_d and μ_S denote their respective means and σ_1, ..., σ_d and σ_S their standard deviations.

Correlation matrices with maximum determinant
In general, there could be many compatible dependence structures (copulas) for the given F_1, ..., F_d and F_S. To compare them, we focus on the properties of their corresponding (Pearson) correlation matrices. In this regard, it is convenient to formally define several sets of correlation matrices.

Definition 3.1 (All correlation matrices) Let B denote the set of all d × d symmetric positive semidefinite matrices whose elements are equal to one on the main diagonal and range between −1 and 1 everywhere else.
Since the standard deviations σ_1, ..., σ_d and σ_S are fixed, a correlation matrix

R = (ρ_ij)_{i, j = 1, ..., d} ∈ B, (3.2)

corresponding to a compatible dependence must satisfy the additional restriction

σ_S² = Σ_{i=1}^d σ_i² + 2 Σ_{i<j} ρ_ij σ_i σ_j.

Equivalently, we can rewrite this condition as a constraint on the average correlation:

Σ_{i<j} ρ_ij σ_i σ_j / Σ_{i<j} σ_i σ_j = ρ_imp, (3.3)

where the constant ρ_imp is defined as

ρ_imp := (σ_S² − Σ_{i=1}^d σ_i²) / (2 Σ_{i<j} σ_i σ_j).

This motivates our next definition of the constrained set C(ρ_imp).

Definition 3.2 (Constrained correlation matrices) Let C(ρ_imp) denote the set of all matrices R ∈ B that satisfy the constraint (3.3).
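The computation of ρ_imp from the variance decomposition above can be sketched as follows (the standard deviations are illustrative):

```python
import numpy as np

# Implied average correlation rho_imp from the variance decomposition
# sigma_S^2 = sum(sigma_i^2) + 2 * sum_{i<j} rho_ij * sigma_i * sigma_j.
sig = np.array([1.0, 0.8, 0.6])   # illustrative margin standard deviations
sigma_S = 2.0                     # illustrative standard deviation of the sum

cross = (sig.sum() ** 2 - (sig ** 2).sum()) / 2   # sum_{i<j} sigma_i * sigma_j
rho_imp = (sigma_S ** 2 - (sig ** 2).sum()) / (2 * cross)
```

Any correlation matrix of a compatible dependence must have this σ-weighted average correlation.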
Figure 1 illustrates the two sets B and C(ρ_imp) for the special case d = 3.
Proposition 3.3 The set C(ρ_imp) is compact and convex.

Proof It is well known that the set B is compact and convex. On the other hand, the hyperplane corresponding to the linear constraint (3.3) is convex and closed. As the intersection of B with this hyperplane, C(ρ_imp) is thus compact and convex, and Proposition 3.3 holds.
Assume that F_1, ..., F_d are compatible with F_S. Clearly, every compatible dependence must be such that its corresponding correlation matrix satisfies the constraint in (3.3). By contrast, there might be correlation matrices R that satisfy this constraint but for which no compatible dependence exists.

Definition 3.4 Let D denote the set of correlation matrices that correspond to at least one compatible dependence.

By the above, D ⊂ C(ρ_imp). For Gaussian margins, the two sets coincide.

Proposition 3.5 If F_1, ..., F_d and F_S are Gaussian and compatible, then D = C(ρ_imp).

Proof We only need to show that C(ρ_imp) ⊂ D. Consider R ∈ C(ρ_imp): it is a well-defined correlation matrix satisfying (3.3). Then there exists a d-dimensional Gaussian joint distribution with a Gaussian copula G with correlation matrix R and marginal distributions F_1, ..., F_d; by (3.3), the variance of the sum of its components equals σ_S², so that the sum is distributed as F_S. Thus there exists a compatible dependence structure among the X_i (distributed as F_i) having correlation matrix R, and hence R belongs to D.
Remark 3.6 Note that under the framework of elliptical distributions (which includes Gaussian distributions as a special case), necessary and sufficient conditions for a set of marginals to be jointly mixable have been given in Wang and Wang (2016). In particular, Gaussian distributions N(0, σ_i²), i = 1, ..., d + 1, are jointly mixable if and only if Σ_{i=1}^{d+1} σ_i ≥ 2 max_{i=1,...,d+1} σ_i. The necessity of this condition follows from subadditivity of the standard deviation (triangle inequality), and sufficiency builds on closedness of the (multivariate) normal family under linear combinations. More details can be found in their paper.
In general, characterizing the set D for an arbitrary choice of marginal distributions is an open problem, as D will in general only be a strict subset of C(ρ_imp).7

Proposition 3.7 (Maximum determinant for Gaussian margins) There exists a unique matrix R ∈ C(ρ_imp) that has maximum determinant among all matrices in C(ρ_imp).
Proof Since the log determinant is continuous and C(ρ_imp) is compact, there exists a matrix R ∈ C(ρ_imp) that attains the maximum determinant. Furthermore, since C(ρ_imp) is a convex set and the log determinant is strictly concave (see Boyd and Vandenberghe 2004, Chapter 3, Section 1, p. 74), the matrix that maximizes the log determinant, and thus also the determinant, is unique.
Definition 3.8 Let R_M(ρ_imp) denote the matrix with maximum possible determinant among all R ∈ C(ρ_imp) and let d_M(ρ_imp) denote its determinant.
Without the linear constraint (3.3) on the correlation matrix, the maximum determinant (over the set B) is equal to 1, which is achieved when all off-diagonal elements of R are zero. With this constraint, however, d_M(ρ_imp) is typically strictly less than 1. To find d_M(ρ_imp), we solve the constrained optimization problem numerically in Matlab. Specifically, we search for a symmetric correlation matrix R as in (3.2) with ½ d(d − 1) unknown parameters that maximizes the determinant subject to the constraints that all eigenvalues are non-negative, together with the additional linear constraint in (3.3).
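As a numerical sanity check (in Python rather than Matlab, and in a simplified setting with equal standard deviations, where by symmetry and strict concavity of the log determinant the maximizer is the equicorrelated matrix), one can verify that feasible perturbations never increase the determinant:

```python
import numpy as np

def corr3(r12, r13, r23):
    return np.array([[1.0, r12, r13], [r12, 1.0, r23], [r13, r23, 1.0]])

# With equal sigmas, constraint (3.3) fixes the average pairwise correlation.
rho_imp = 0.3
R_max = corr3(rho_imp, rho_imp, rho_imp)   # candidate maximum-determinant matrix

rng = np.random.default_rng(0)
for _ in range(1000):
    e = rng.normal(scale=0.05, size=3)
    e -= e.mean()                          # keep the average correlation fixed
    R = corr3(rho_imp + e[0], rho_imp + e[1], rho_imp + e[2])
    if np.all(np.linalg.eigvalsh(R) >= 0):  # stay inside the feasible set B
        assert np.linalg.det(R) <= np.linalg.det(R_max) + 1e-12
```

With unequal standard deviations the maximizer is no longer equicorrelated (see Figs. 4, 5 below), and a constrained solver is needed as in the text.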

Maximum entropy dependence for Gaussian margins
Entropy refers to the disorder or uncertainty of a system. It plays a fundamental role in various disciplines, including information theory, computing and thermodynamics (e.g., the second law of thermodynamics). In his fundamental work, Shannon (1948) defines the entropy H of a discrete random variable X with probability mass function P(X) as H(X) = −E(log(P(X))). There are natural extensions to measure the entropy of multivariate distributions. Let f be the density of a multivariate distribution of (X_1, ..., X_d); then the entropy is defined as

H(X_1, ..., X_d) = −E[log f(X_1, ..., X_d)].

The following result characterizes the multivariate distribution of a random vector (X_1, ..., X_d) with Gaussian margins. Its proof follows from the fact that among all absolutely continuous distributions with unbounded support having a given invertible covariance matrix, the multivariate Gaussian distribution has maximum entropy.

Proposition 3.9 (Maximum entropy for Gaussian margins) The entropy of the multivariate distribution of a random vector (X_1, ..., X_d) with Gaussian margins and invertible correlation matrix R satisfies

H(X_1, ..., X_d) ≤ ½ log((2πe)^d σ_1² · · · σ_d² det(R)), (3.4)

where equality holds if and only if (X_1, ..., X_d) has a multivariate Gaussian distribution.
This proposition (and its proof) can be found as Theorem 8.6.5 in Cover and Thomas (2006), where it is formulated in a slightly different way. Specifically, these authors prove that

H(X_1, ..., X_d) ≤ ½ log((2πe)^d det(Σ)),

in which Σ is the covariance matrix of the vector (X_1, ..., X_d). Since the determinant of the covariance matrix writes as det(Σ) = σ_1² · · · σ_d² det(R), we obtain (3.4). As a direct consequence of Proposition 3.9, the maximum entropy distribution that is consistent with given Gaussian margins and a given average correlation ρ_imp is uniquely characterized by a Gaussian dependence with correlation matrix R_M(ρ_imp).
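The Gaussian entropy bound (with equality in the Gaussian case) can be evaluated directly from the covariance matrix; a small check, with illustrative correlation matrices, that a larger determinant gives larger entropy:

```python
import numpy as np

# Differential entropy of a multivariate Gaussian:
# H = 0.5 * log((2*pi*e)^d * det(Sigma)); with unit variances det(Sigma) = det(R),
# so among Gaussian vectors the entropy grows with det(R).
def gaussian_entropy(cov):
    d = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(cov))

R_low = np.array([[1.0, 0.9], [0.9, 1.0]])   # strong dependence, small determinant
R_hi = np.array([[1.0, 0.1], [0.1, 1.0]])    # weak dependence, larger determinant
assert gaussian_entropy(R_hi) > gaussian_entropy(R_low)
```

This is precisely why, for Gaussian margins, maximizing the determinant of the correlation matrix is equivalent to maximizing entropy.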
Our numerical experiments show that there is a striking connection between the BRA and maximum entropy in that every run of BRA leads to (i) a dependence close to Gaussian, (ii) a correlation matrix with determinant close to the maximum possible determinant.
Given that we do not formally prove the result, but rather provide evidence based on extensive simulations, we need to introduce relevant concepts that make it possible to reach such conclusion.This will be the topic of the next section.
Remark 3.10 The conclusion regarding the maximum entropy property remains unchanged when the "sum" is not put in the last column. This can be seen by relabeling the columns. Indeed, in the situation of (joint) mixability it holds for every column that X_k = −Σ_{i≠k} X_i. We have d + 1 columns for which the determinant of the correlation matrix is 0, but when we extract any d of these columns, the determinant must be maximum. In addition, once we have shown that the dependence is Gaussian, if (X_1, ..., X_d) is a Gaussian vector, then (X_1, ..., X_{k−1}, X_{k+1}, ..., X_d, −S) is also a Gaussian vector.

Maximum entropy and BRA
We run the BRA K times. For k = 1, 2, ..., K, let R^(k) denote the correlation matrix of the output X^(k) = [X_1^(k), ..., X_d^(k)] and denote its determinant by Δ^(k). We also define the realized (or effective) average correlation ρ̄^(k) as

ρ̄^(k) := Σ_{i<j} ρ_ij^(k) σ_i σ_j / Σ_{i<j} σ_i σ_j,

in which σ_i specifically denotes the standard deviation of X_i^(k) and ρ_ij^(k) the correlation between X_i^(k) and X_j^(k), for all i and j.8 Finally, we define the deviation V^(k) as the L2-norm of the n × 1 residual vector e^(k) := X_1^(k) + · · · + X_d^(k) − S^(k), where S^(k) denotes the (rearranged) column of discretized sum realizations. Similarly to all other quantities above, V^(k) depends on n, but, for ease of presentation, we omit the explicit dependence on n.
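The diagnostics Δ^(k), ρ̄^(k) and V^(k) for a single run can be computed as follows (the matrix below is a random stand-in for an actual BRA output, used only to illustrate the formulas):

```python
import numpy as np

# Diagnostics for one candidate solution: determinant of the realized
# correlation matrix, sigma-weighted average correlation, and residual norm V.
rng = np.random.default_rng(3)
n, d = 500, 3
X = rng.standard_normal((n, d))                      # stand-in for X^(k)
S = X.sum(axis=1) + 0.01 * rng.standard_normal(n)    # near-compatible "sum" column

R = np.corrcoef(X, rowvar=False)
det_k = np.linalg.det(R)                             # Delta^(k)

sig = X.std(axis=0)
iu = np.triu_indices(d, k=1)
rho_bar = (R[iu] * sig[iu[0]] * sig[iu[1]]).sum() / (sig[iu[0]] * sig[iu[1]]).sum()

e = X.sum(axis=1) - S        # residual vector e^(k)
V = np.linalg.norm(e)        # deviation V^(k)
```

For an ideal (compatible, perfectly converged) solution, e^(k) would be identically zero and ρ̄^(k) would equal ρ_imp.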
Proposition 4.1 The following inequality holds:

|ρ̄^(k) − ρ_imp| ≤ C_1 V^(k) + C_2 (V^(k))²,

where C_1 and C_2 are constants depending on the standard deviations σ_1, ..., σ_d and σ_S.

Proof For simplicity, we drop the superscript in ρ̄^(k), e^(k) and V^(k). Denote by σ_e the standard deviation of the residual e, so that σ_e ≤ V. Using the definitions of ρ̄ and ρ_imp together with the decomposition Σ_i X_i = S + e, we obtain

|ρ̄ − ρ_imp| = |2 cov(S, e) + σ_e²| / (2 Σ_{i<j} σ_i σ_j) ≤ (σ_e² + 2 σ_e σ_S) / (2 Σ_{i<j} σ_i σ_j).

However, σ_e² + 2 σ_e σ_S ≤ V² + 2 V σ_S, which completes the proof of the proposition.
Intuitively, Proposition 4.1 states that, as the accuracy of a candidate solution improves (i.e., as the deviation V^(k) decreases), the realized average correlation ρ̄^(k) gets closer to ρ_imp.

Distance to the maximum determinant matrix
To assess the similarity of the outputs of BRA, we make use of the following measure of dispersion.
Definition 4.2 (Mean Percentage Error) For any vector of interest Z taking values z^(k) (k = 1, 2, ..., K) and some "reference point" z, we denote by MPE(Z) its average percentage error,

MPE(Z) = (1/K) Σ_{k=1}^K PE(z^(k)),

where the percentage error (PE) of a z^(k) is simply given as

PE(z^(k)) = 100 · (z^(k) − z)/z.

Footnote 8 (continued): Thus, they all have the same standard deviations σ_1, ..., σ_d and σ_S, independent of k. When a continuous normal distribution is approximated with n discrete values, the resulting standard deviation might slightly differ from its continuous counterpart. In this paper, we use n = 1000 and 10,000, for which the discrepancies are extremely small. Nevertheless, we re-scale the discretized values to ensure that all standard deviations match exactly their continuous counterparts σ_1, ..., σ_d and σ_S.
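Definition 4.2 in code (a one-line sketch; the sample values are illustrative):

```python
import numpy as np

# Mean Percentage Error of the values z^(k) relative to a reference point z:
# PE(z^(k)) = 100 * (z^(k) - z) / z, and MPE is the average over the K runs.
def mpe(z_values, z_ref):
    z_values = np.asarray(z_values, dtype=float)
    return float(np.mean(100.0 * (z_values - z_ref) / z_ref))

# Illustrative values: determinants close to a hypothetical maximum of 0.8.
assert abs(mpe([0.79, 0.78, 0.80], 0.8) - (-1.25)) < 1e-9
```

Dividing by the reference point makes the measure scale invariant, as noted below.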
The main discrepancy measures that we consider concern the normalized determinants (Δ^(k))^{1/d}. A first measure, MPE_1(d√Δ), compares the (Δ^(k))^{1/d} with the reference point (d_M(ρ_imp))^{1/d} and takes their average. One would expect this to be a suitable measure of discrepancy (as the average of K errors that all have the same sign). However, the interpretation of MPE_1(d√Δ) is somewhat ambiguous. Due to discretization, it is usually not even possible to obtain a dependence that yields a realized correlation equal to the target correlation ρ_imp. Typically, V^(k) ≠ 0 and ρ̄^(k) is slightly smaller in absolute value than ρ_imp, which means that d_M(ρ̄^(k)) > d_M(ρ_imp). Hence, it may happen for some k that the percentage error of (Δ^(k))^{1/d} is positive. We therefore also consider a second measure, MPE_2(d√Δ), which uses as reference point the determinant d_M(ρ̄^(k)) implied by the realized average correlation; its percentage errors PE_2((Δ^(k))^{1/d}) are averaged as in (4.9). Finally, the average percentage error of ρ̄ with respect to the target value ρ_imp is denoted by MPE(ρ̄).
It is easy to verify that these three measures are scale invariant. In the following examples, we observe that all these measures converge to 0 when the number of discretization steps n increases.

Testing the Gaussian dependence
We also test whether, for k = 1, 2, ..., K, the different outputs X^(k) = [X_1^(k), ..., X_d^(k)] can be seen as draws from a multivariate normal distribution. Very well-known tests in this regard are Mardia's tests of multinormality, or variants thereof. For general multivariate data, Mardia (1970) constructed two statistics for measuring multivariate skewness and kurtosis, which can be used for testing the hypothesis of normality (Mardia 1974, 1975; Mardia et al. 1980). Usually the tests of whether kurtosis (MK) and skewness (MS) are consistent with a Gaussian model are performed separately, but there are also so-called omnibus tests that assess them simultaneously. In this paper, we use these two tests and perform them separately.
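Mardia's two statistics can be sketched as follows (an illustrative implementation on simulated Gaussian data; under normality the asymptotic reference values are 0 for skewness and d(d + 2) for kurtosis):

```python
import numpy as np

# Mardia's multivariate skewness (b1) and kurtosis (b2) statistics.
def mardia_statistics(X):
    n, d = X.shape
    Z = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(Z, rowvar=False, bias=True))
    G = Z @ S_inv @ Z.T            # Mahalanobis cross-products g_ij
    b1 = (G ** 3).sum() / n ** 2   # skewness: average of g_ij^3
    b2 = (np.diag(G) ** 2).mean()  # kurtosis: average of g_ii^2
    return b1, b2

rng = np.random.default_rng(4)
X = rng.standard_normal((1000, 3))
b1, b2 = mardia_statistics(X)
# For a Gaussian sample, b1 is near 0 and b2 is near d*(d+2) = 15.
assert b1 < 1.0 and abs(b2 - 15.0) < 2.0
```

In practice, p-values are obtained from the asymptotic chi-squared (for b1) and normal (for b2) reference distributions.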
In the next section, we show that for Gaussian marginal distributions, the BRA yields a Gaussian dependence with corresponding correlation matrix that has maximum determinant.

Normal distribution
Assume that the components X_i (i = 1, 2, ..., d) and their sum S are all normally distributed and compatible (as per Definition 2.1). To simplify the exposition, all variables are centered, i.e., they have zero mean. The standard deviations σ_1, ..., σ_d are assumed to decrease linearly from 1 to 1/d, whereas the standard deviation of the sum σ_S is chosen in such a way that ρ_imp = 0.8. The number of components d ranges from 3 to 10. To discretize the marginal distributions, we use either n = 1000 or n = 10,000. In each experiment, we run the BRA K = 500 times to obtain candidate approximate dependencies, as described in Sect. 2.2.
Our first experiment considers the case of d = 3 and is illustrated in Fig. 2. The left panel shows the relationships between the three pairwise correlations ρ_12, ρ_13 and ρ_23. On this plane, the gray "ellipse" represents the constrained set C(ρ_imp);9 see also Fig. 1 for a graphical illustration of C(ρ_imp). All the points in this gray area are candidate solutions and correspond to a global optimum for the variance minimization problem under consideration.

[Table 1 caption: For each run of the BRA, k = 1, ..., 500, we obtain Δ^(k), ρ̄^(k) and V^(k) and use these to compute sample averages. All MPEs are reported as percentages. V̄ denotes the average of the K values V^(k). The last two columns report the results of the two multivariate normality tests, MS and MK; for each test, we report the percentage of rejections (p-value less than 0.05). The table is reproduced for two levels of discretization, n = 1000 and n = 10,000, and for d = 3, 4, 5, 7 and 10.]
With the help of Matlab, we solve the high-dimensional constrained optimization problem to find the correlation matrix R_M(ρ_imp) ∈ C(ρ_imp) with maximum determinant d_M(ρ_imp) (see also Sect. 4) and represent it with a red star. We then run the BRA 500 times and show the corresponding solutions with blue dots. It is striking to observe that for all 500 runs the candidate solutions obtained with the BRA exhibit correlations which are very close to those of the maximum determinant matrix. This conclusion is also confirmed by the right panel, which plots the values of Δ^(k) versus ρ_12^(k). When the level of discretization n increases from 1000 to 10,000, the cluster of blue dots becomes even tighter and closer to the maximum determinant matrix.
The representation of the range of possible correlations through an "ellipse" is convenient when d = 3, but needs to be adapted in higher dimensions. Therefore, in Table 1 we provide several measures that allow us to interpret the properties of the output matrix when d increases up to 10. In particular, we report the discrepancy measures MPE_1(d√Δ), MPE_2(d√Δ) and MPE(ρ̄). The last two columns of this table report the results of the two statistical tests, MS and MK, that we performed to test for multivariate normality (see Sect. 4.2).
We make the following observations related to Table 1. First, we observe that the results illustrated in Fig. 2 for d = 3 carry over to higher dimensions. Indeed, MPE_1(d√Δ) and MPE_2(d√Δ) are both very small, which confirms the convergence of the BRA to a candidate dependence whose correlation matrix has maximum determinant (the MPEs are computed by averaging the values displayed in Figs. 4, 5). In addition, the rate of convergence is illustrated in Fig. 3.
Second, we observe that the algorithm converges better when n increases, as measured by the mean of the 500 squared errors V^(k). Somewhat surprisingly, the convergence also improves when the dimension d increases. Third, we find that the realized correlation ρ̄^(k) on average gets closer and closer to ρ_imp when d increases and when n increases. This observation is consistent with Proposition 4.1. Fourth, consistently with intuition, we observe that d√Δ is much more stable when the discretization level n increases. Finally, for all runs the statistical tests lead to an acceptance of the null hypothesis of multivariate normality at a confidence level of 94%. From Table 1 we can thus conclude that the BRA yields a dependence structure that is Gaussian and whose correlation matrix has maximum determinant.
Figures 4 and 5 hereafter provide additional support for these findings by exploring further the structure of the candidate solutions generated by the BRA when the discretization level is n = 1000. Specifically, for each case of dimensionality d = 3, 5, 7 or 10, the first three panels plot the measures PE_1(d√^(k)), PE_2(d√^(k)), and V^(k) with respect to PE(ρ̄^(k)), for each of the 500 runs of the BRA. The last panel shows the distributions of the pairwise correlations ρ^(k)_ij.

We make additional observations related to Figs. 4 and 5. First, we note that ρ̄^(k) is always less than ρ_imp. Intuitively, to increase the determinant, correlations should be as close to zero as possible. Second, since the constraint (3.3) must be respected, the optimizer increases those correlations ρ^(k)_ij for which the weights σ_i σ_j are relatively high, and reduces those correlations for which these weights are relatively low. This particular pattern for the correlations is evident from all figures. Note also that the correlations ρ^(k)_ij that correspond to higher products σ_i σ_j are recovered by the BRA more accurately. Third, we observe that V^(k) and ρ̄^(k) are negatively correlated. Intuitively, when the BRA converges to an approximate solution that is relatively far from a global solution, ρ̄^(k) has more "room" to deviate from ρ_imp, which translates into a higher mean percentage error relative to ρ_imp.

Remark 5.1 Under the probabilistic interpretation of matrices given in Sect. 2, the outputs of the RA and BRA represent discrete mutually complete dependences (see Lancaster 1963) and, in particular, so-called multivariate Shuffles-of-Min (see Durante and Sánchez 2012). Nevertheless, we show that the Gaussian assumption cannot be rejected. This shows once again the interplay between a discrete optimization method, which can only provide discrete results, and the fact that any dependence structure can be proxied by a sufficiently fine Shuffle-of-Min.
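Our reading of these discrepancy measures (the notation is partly garbled in extraction, so the helper names below are ours) can be summarized in a few lines: the scale-free size of a correlation matrix is the d-th root of its determinant, and the percentage errors compare that quantity against a reference point such as the maximum determinant matrix:

```python
import numpy as np

def d_root_det(R):
    """d-th root of the determinant of a correlation matrix R: a scale-free
    measure of its 'volume' (equals 1 for the identity, smaller otherwise)."""
    return np.linalg.det(R) ** (1.0 / R.shape[0])

def pct_error(value, reference):
    """Percentage error of an observed value relative to a reference point,
    e.g. PE_1 takes the d-th root of the maximum determinant as reference."""
    return 100.0 * (value - reference) / reference
```

With these helpers, PE_1(d√^(k)) is pct_error(d_root_det(R_k), d_root_det(R_M)), and PE_2 swaps in the alternative reference point discussed around Fig. 1.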

Robustness
In this section, we study the robustness of our results. First, we investigate the sensitivity to the initial condition. Second, we investigate marginal distributions that are non-Gaussian. Finally, we investigate the impact of using the standard RA instead of the BRA.

Sensitivity to initial conditions
Recall that when the algorithm starts from a randomized matrix M for which the row sums are all already equal to zero [as in (2.1)], the algorithm has converged up-front and the output matrix is the same as the initial one (which is then just a global optimum). In this particular instance, the BRA does not converge to the maximum determinant correlation matrix. However, if we slightly perturb this initial (optimal) matrix, so that the row sums are no longer exactly equal to zero, then we observe that the algorithm converges to the maximum determinant correlation matrix.
This phenomenon is illustrated in Fig. 6 for the levels of discretization n = 1000, 3000, and 10,000. We start with a particular global solution, marked by the green star, but with 0.2% of the rows of the matrix M swapped to create a small random perturbation. That is, 2 rows out of 1000, 6 rows out of 3000, and 20 rows out of 10,000 are randomly selected and swapped. Without the perturbation, the BRA would terminate where it starts (at the particular solution marked by the green star). With the perturbation, however, the BRA typically converges to a solution close to the maximum determinant matrix, marked by the red star. The "attraction" to the maximum determinant matrix gets progressively stronger and more striking as n increases. When n = 1000, one can see a "path" of blue dots between the green and red stars. When n = 10,000, all blue dots are clustered very tightly around the red star.
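The paper does not spell out the swap mechanics, so the sketch below is one plausible reading (function name ours): swapping entire rows of M would leave the dependence unchanged, so we instead swap the entries of a few randomly chosen row pairs within one randomly chosen column, which preserves all marginal distributions while breaking the equal-row-sums property of the exact optimum:

```python
import numpy as np

def perturb(M, frac=0.002, rng=None):
    """Perturb an n x d matrix M by picking one column at random and swapping
    the entries of a small fraction of randomly chosen row pairs within it.
    Columns keep the same values (marginals are preserved), but the row sums
    of an exact optimum are no longer all equal."""
    rng = np.random.default_rng() if rng is None else rng
    M = M.copy()
    n, d = M.shape
    k = max(2, 2 * int(round(frac * n / 2)))    # even number of rows, at least 2
    col = rng.integers(d)
    rows = rng.choice(n, size=k, replace=False)
    # swap pairwise: rows[0]<->rows[1], rows[2]<->rows[3], ...
    M[rows[0::2], col], M[rows[1::2], col] = (M[rows[1::2], col].copy(),
                                              M[rows[0::2], col].copy())
    return M
```

With frac = 0.002 this touches 2 rows when n = 1000 and 20 rows when n = 10,000, matching the experiment's 0.2% perturbation.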

Sensitivity to marginal distributions
When the marginal distributions are not Gaussian, it is not clear that the Gaussian copula is the copula that maximizes entropy. Therefore, we merely assess to which extent the property of maximum determinant for the correlation matrix, which we observed for Gaussian margins, remains valid when other distributional assumptions are made. In order to do so, we resort to skewed t-distributions. To this end, we recall that a d-dimensional random vector X is said to follow a normal mean-variance mixture if

X = μ + Y γ + √Y Z,   (6.10)

where Z ∼ N_d(0, W) is a d-dimensional multivariate normal with E[Z] = 0 and var(Z) = W, where W ∈ R^{d×d} is positive semi-definite, Y ≥ 0 is a non-negative scalar random variable independent of Z, and γ ∈ R^d and μ ∈ R^d are constants. Specification (6.10) implies that X|Y=y ∼ N_d(μ + yγ, yW). The multivariate Generalized Hyperbolic (GH) distribution is obtained when the scalar factor in the mixture Y follows a generalized inverse Gaussian distribution with parameters λ, χ and ψ (Chapter 3.2 in McNeil et al. 2010) and is denoted as X ∼ GH_d(λ, χ, ψ, μ, W, γ).
A useful property of the GH family is that it is closed under linear operations, meaning, in particular, that S = Σ_i X_i also has a GH distribution. In this paper, we are interested in a special case of the GH family where λ = −ν/2, χ = ν and ψ = 0, so that Y follows an inverse Gamma distribution, Y ∼ IG(ν/2, ν/2). This special case corresponds to a Skewed-t distribution, with

cov(X) = ν/(ν − 2) W + 2ν² / ((ν − 2)²(ν − 4)) γγ',   (6.11)

where the covariance matrix is defined as long as ν > 4. The sum S as well as the components X_i (i = 1, 2, …, d) now follow one-dimensional Skewed-t distributions. We simulate from the multivariate Skewed-t distribution to obtain realizations for the components X_i (i = 1, 2, …, d) and their sum S. Next, we use this information to construct the matrix M on which we repeatedly apply the BRA. As before, we assume that the number of components d ranges from 3 to 10 and the level of discretization is either n = 1000 or n = 10,000. For ease of comparison, we set the parameters of the Skewed-t distribution so that the covariance matrix cov(X) is the same as in the normal case. Specifically, as in the normal case, the standard deviations σ_1, …, σ_d of the marginals linearly decrease from 1 to 1/d and all pairwise correlations are equal to ρ_imp = 0.8. For each component X_i, the location parameter is zero, μ_i = 0, and the skewness parameter is proportional to σ_i, that is, γ_i = cσ_i = 1.5σ_i. We set ν = 15 and use the relation in (6.11) to deduce the covariance matrix W for the normal factors Z. Our choice of the parameters guarantees that W is positive semi-definite. Figure 7 illustrates the shape of the Skewed-t distribution used in the simulations: it plots the histogram and QQ-plot for the sum S when d = 3 and n = 1000. Table 2 reports the results for K = 500 runs of the BRA when d ranges from 3 to 10 and for two levels of discretization, n = 1000 and n = 10,000. (In Table 2, for each run of the BRA, k = 1, …, K, we use d√^(k), ρ^(k), ξ^(k), and V^(k) to compute sample averages; MPEs are in percent.)
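The simulation of the mixture (6.10) with an inverse-Gamma mixing variable is short enough to sketch. The following numpy sampler is our own (all names hypothetical): it draws Y ∼ IG(ν/2, ν/2) as the reciprocal of a Gamma variate and assembles X = μ + Yγ + √Y Z:

```python
import numpy as np

def rskewt(n, nu, mu, W, gamma, rng=None):
    """Draw n samples of X = mu + Y*gamma + sqrt(Y)*Z, the normal mean-variance
    mixture (6.10), with mixing variable Y ~ InverseGamma(nu/2, nu/2) so that
    X follows the multivariate Skewed-t distribution used in the paper."""
    rng = np.random.default_rng() if rng is None else rng
    mu, gamma = np.asarray(mu, float), np.asarray(gamma, float)
    d = len(mu)
    # Y ~ IG(nu/2, rate nu/2) is the reciprocal of Gamma(nu/2, scale = 2/nu)
    Y = 1.0 / rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)
    Z = rng.multivariate_normal(np.zeros(d), W, size=n)
    return mu + Y[:, None] * gamma + np.sqrt(Y)[:, None] * Z
```

Conditioning on Y = y gives N_d(μ + yγ, yW), as required by (6.10); the sample mean converges to μ + E[Y]γ with E[Y] = ν/(ν − 2), consistent with the covariance relation (6.11).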
In Fig. 8, we analyze the case d = 3. As before, the "ellipse" in the left panel represents the constrained set C(ρ_imp). For all 500 runs, the candidate solutions exhibit correlations that closely match those of the maximum determinant matrix. This conclusion only strengthens when the level of discretization increases from 1000 to 10,000. The candidate solutions generated by the BRA exhibit more variation than in the normal case. This is to be expected, as the components X_i now have considerable skewness and heavier tails. As a result, outliers are more frequent and the overall precision of the BRA declines noticeably. Nevertheless, the solutions remain tightly clustered around the one corresponding to the maximum determinant.
Two additional figures in the "Appendix", Figs. 11 and 12, provide further insights into the structure of the output for n = 1000. For each case of d = 3, 5, 7 or 10, the first three panels plot the measures PE_1(d√^(k)), PE_2(d√^(k)), and V^(k) with respect to PE(ρ̄^(k)), for each of the 500 runs of the BRA. The last panel shows the distributions of the pairwise correlations ρ^(k)_ij.

Sensitivity to rearrangement algorithm
Finally, we assess to which extent our main findings change when using the standard RA, which is described in Sect. 2. Specifically, we follow the same Gaussian setup as in Sect. 5, except now we use the standard RA to infer the dependence among the variables instead of the BRA. We report the results in Figs. 9, 10, and Table 3. (In Table 3, for each run of the RA, k = 1, …, 500, we obtain d√^(k), ρ̄^(k) and V^(k) and use these to compute sample averages; all MPEs are reported in percent. The last two columns report the results of the two multivariate normality tests, MS and MK; for each test, shown is the percent of rejections, i.e., p-values less than 0.05. The table is reproduced for two levels of discretization, n = 1000 and n = 10,000, and for d = 3, 4, 5, 7 and 10.) It is clear that the standard RA does not recover the dependence with maximum entropy. From Fig. 10, we observe that the 500 experiments lead to correlation matrices whose determinants may be very far from that of the maximum determinant matrix. Figure 10 illustrates the magnitude of the distance to the maximum determinant matrix and shows that the pairwise correlations can lie in a very wide range (fourth panel for d = 3 and eighth panel for d = 10) compared to the results found in Figs. 4 and 5 using the BRA. Furthermore, one observes from Table 3 that both multivariate normality tests fail, so that the hypothesis that the dependence is a Gaussian copula is now rejected.
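The core step of the standard RA referenced here is well known: each column is rearranged to be antimonotonic with the sum of the remaining columns (in the paper's setting, M is built so that convergence corresponds to row sums near zero). A minimal numpy sketch of one sweep (function name ours):

```python
import numpy as np

def ra_sweep(M):
    """One sweep of the standard RA: each column in turn is reordered so that
    it is oppositely ordered (antimonotonic) to the sum of the other columns.
    Each such step cannot increase the variance of the row sums."""
    M = M.copy()
    n, d = M.shape
    for j in range(d):
        rest = M.sum(axis=1) - M[:, j]
        # largest entries of column j are paired with the smallest sums of the rest
        M[np.argsort(rest), j] = np.sort(M[:, j])[::-1]
    return M
```

Iterating ra_sweep until the row-sum variance stops decreasing yields the RA output; the BRA generalizes this step by rearranging whole blocks of columns against their complement rather than single columns, which is what allows it to escape the poorer local optima seen in Fig. 10.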

Conclusion
We perform an extensive study of the Block Rearrangement Algorithm (BRA) when it is used to infer dependence structures among variables that are consistent with information on the marginal distributions of the variables and the distribution of their sum. We find that when the variables are normally distributed, the BRA always yields a very similar solution, which moreover has the property of maximum entropy. This stylized fact is remarkable: given that we start from a random initial structure, we would expect a wide range of compatible dependence structures to be attained, yet we obtain essentially one dependence structure, exhibiting maximum entropy. These findings are robust in that, for non-normal distributions, the BRA still yields a unique dependence that exhibits a correlation matrix with maximum determinant. Hence, we obtain strong support for using the BRA as a stable algorithm to infer an economically meaningful dependence. Providing a formal proof for the property of maximum entropy seems very challenging (as we do not observe a systematic increase of entropy at each step of the algorithm) and is left for future research.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Fig. 1
Fig. 1 The sets B and C(ρ_imp) in Definitions 3.1 and 3.2 for d = 3. The example uses σ_1 = 1, σ_2 = 2/3, σ_3 = 1/3, and ρ_imp = 0.8. The colored volume represents the set of all correlation matrices B and the (gray) plane represents the linear constraint (3.3). Their intersection forms the constrained set C(ρ_imp). (Color figure online)

It may happen that d√^(k) > 0 exceeds the reference value. To accommodate this possibility, we consider an alternative measure of discrepancy, in which PE(d√^(k)) is computed with respect to the reference point d√_M(ρ̄^(k)) instead of the previous d√_M(ρ_imp). This choice of the reference point ensures that PE is measured against the best determinant attainable given the realized correlations.

Fig. 2
Fig. 2 Normal distribution with d = 3 and n = 1000. For each case, there are K = 500 blue dots corresponding to different runs of the BRA. The shaded gray area represents the constrained set C(ρ_imp) and the red star indicates the maximal matrix R_M. The left panel shows the K = 500 realizations of the correlations ρ_12, ρ_13, and ρ_23. The right panel shows the relation of d√^(k) versus ρ^(k)_12. (Color figure online)

Fig. 3
Fig. 3 Normal distribution with d = 3 and d = 7. For each case, the mean percentage errors MPE_1(d√) (blue line) and MPE_2(d√) (red line) are shown for various values of n. (Color figure online) The plots confirm that the different candidate solutions are all very similar and exhibit close to maximum entropy.

Fig. 4 Fig. 5
Fig. 4 Normal distribution with d = 3 and d = 5. Each black dot corresponds to one of the K = 500 BRA runs. For each case, the first three panels show PE_1(d√^(k)), V^(k), and PE_2(d√^(k)) with respect to PE(ρ̄^(k)). The black dashed lines indicate sample averages. The right bottom panel shows the 5th, 50th and 95th percentiles of the distributions of the correlations ρ^(k)_ij generated by the K runs. The percentiles are shown along d(d − 1)/2 vertical lines, corresponding to the correlations ρ_12, ρ_13, …, ρ_(d−1)d. The red dots are the corresponding correlations for the maximal matrix R_M. (Color figure online)

Fig. 6 Fig. 7
Fig. 6 Normal distribution with d = 3 and n = 1000, 3000, or 10,000. For each case, there are K = 500 blue dots corresponding to different runs of the BRA. Each run of the BRA starts at a particular solution (marked by the green star), but with 2/6/20 random rows swapped when n = 1000/3000/10,000. The shaded gray area represents the constrained set C(ρ_imp) and the red star indicates the maximal matrix R_M. The left panel shows realizations of the correlations ρ_12, ρ_13, and ρ_23. The right panel shows the relation of d√^(k) versus ρ^(k)_12. (Color figure online)

Fig. 8
Fig. 8 Skewed-t distribution with d = 3 and n = 1000 or n = 10,000. For each case, there are K = 500 blue dots corresponding to different runs of the BRA. The shaded gray area represents the constrained set C(ρ_imp) and the red star indicates the maximal matrix R_M. The left panel shows the K = 500 realizations of the correlations ρ_12, ρ_13, and ρ_23. The right panel shows the relation of d√^(k) versus ρ^(k)_12. (Color figure online) The plots confirm that the candidate solutions are all similar and attain a determinant close to the maximum.

Fig. 9
Fig. 9 Normal distribution with d = 3 and n = 1000 or n = 10,000, using the standard RA instead of the BRA. As before, there are K = 500 blue dots corresponding to different runs of the RA. The shaded gray area represents the constrained set C(ρ_imp) and the red star indicates the maximal matrix R_M. The left panel shows the K = 500 realizations of the correlations ρ_12, ρ_13, and ρ_23. The right panel shows the relation of d√^(k) versus ρ^(k)_12. (Color figure online)

Fig. 11
Fig. 11 Skewed-t distribution with d = 3 and d = 5. Each black dot represents one of the K = 500 BRA runs. For each case, the first three panels show PE_1(d√^(k)), V^(k), and PE_2(d√^(k)) with respect to PE(ρ̄^(k)). The black dashed lines indicate sample averages. The right bottom panel shows the 5th, 50th and 95th percentiles of the distributions of the correlations ρ^(k)_ij generated by the K runs. The percentiles are shown along d(d − 1)/2 vertical lines, corresponding to the correlations ρ_12, ρ_13, …, ρ_(d−1)d. The red dots are the corresponding correlations for the maximal matrix R_M. (Color figure online)

Fig. 12
Fig. 12 Skewed-t distribution with d = 7 and d = 10. Each black dot represents one of the K = 500 BRA runs. For each case, the first three panels show PE_1(d√^(k)), V^(k), and PE_2(d√^(k)) with respect to PE(ρ̄^(k)). The black dashed lines indicate sample averages. The right bottom panel shows the 5th, 50th and 95th percentiles of the distributions of the correlations ρ^(k)_ij generated by the K runs. The percentiles are shown along d(d − 1)/2 vertical lines, corresponding to the correlations ρ_12, ρ_13, …, ρ_(d−1)d. The red dots are the corresponding correlations for the maximal matrix R_M. (Color figure online)

The determinant is a measure of volume in R^d and it satisfies the scaling property det[cR] = c^d det[R] for a constant c. Thus, d√det[cR] = c · d√det[R].

Table 1
Normal distribution

Table 2
Skewed-t distribution

Table 3
Normal distribution, using the standard RA instead of the BRA