Parameter estimation for a bivariate beta distribution with arbitrary beta marginals and positive correlation

We discuss a bivariate beta distribution that can model arbitrary beta-distributed marginals with a positive correlation. The distribution is constructed from six independent gamma-distributed random variates. While previous work used an approximate and sometimes inaccurate method to compute the distribution's covariance and estimate its parameters, here we derive all product moments and the exact covariance, which can be computed numerically. Based on this analysis we present an algorithm for estimating the parameters of the distribution using moment matching. We evaluate this inference method in a simulation study and demonstrate its practical use on a data set consisting of predictions from two correlated forecasters. Furthermore, we generalize the bivariate beta distribution to a correlated Dirichlet distribution, for which the proposed parameter estimation method can be used analogously.

and environmental science. For binary events these forecasts take the form of probability estimates that are provided either by humans or by machine learning algorithms. To model these probability estimates one can use an arbitrary distribution on the interval [0, 1], such as the beta distribution, beta-generated distributions, the Kumaraswamy distribution, or any distribution on the real numbers transformed through the logistic function. In many cases the probabilities will be correlated, e.g. if different experts forecast the outcome of an election or the probability of rain. Modeling such correlated probabilities therefore requires a bivariate distribution. For example, one can use bivariate generalizations of the Kumaraswamy distribution [1], bivariate beta-generated distributions [2], or a multivariate Gaussian with logistic transformations [3]. However, the most common choice for modeling such probabilities is the beta distribution, since it is the standard distribution for probabilities in Bayesian statistics and is simply more familiar to practitioners than the other distributions [4,5]. Therefore, in this paper we focus on bivariate beta distributions.
While a multitude of constructions for bivariate beta distributions have been proposed, they have different constraints and properties, which limit their applicability. We will first review previous constructions of bivariate beta distributions together with their respective properties and then examine the most promising construction, which can model arbitrary beta marginals with a positive correlation [5]. For this construction, however, no exact method for parameter inference has been available so far, and it has thus rarely been used. Here, we therefore introduce a new estimation method for this bivariate beta distribution with arbitrary beta marginals and positive correlation.
Many bivariate beta distributions have been proposed in the literature [1,2,5-21]. Among them, some approaches have been derived from general families of bivariate distributions, such as the Farlie-Gumbel-Morgenstern family of distributions [e.g. 7] and the Sarmanov family of distributions [e.g. 8]. Others derived a bivariate beta distribution from a bivariate extension of the F distribution [9,13], whereas Nadarajah and Kotz [11] proposed different bivariate beta distributions constructed from products of univariate beta-distributed random variables. In general, using different copulas one can construct different bivariate distributions with the same beta marginals [2,20]. However, as beta-distributed random variables can easily be constructed from normalized gamma-distributed random variables, it is natural to try to generalize this construction to the bivariate case. In this vein, several authors have introduced correlations through shared gamma-distributed random variables [1,5,10,14,18]. The most straightforward case of this construction has been studied by Olkin and Liu [10], building on the work of Libby and Novick [6]. They use three independent gamma-distributed random variables U_1, U_2, U_3 with respective shape parameters υ_1, υ_2, υ_3 and the same scale parameter to construct

X = U_1 / (U_1 + U_3),    Y = U_2 / (U_2 + U_3).    (1)

Using the standard construction of beta variates from gamma variates, the joint distribution of the random variables X and Y is a bivariate beta distribution with marginal distributions Beta(υ_1, υ_3) for X and Beta(υ_2, υ_3) for Y. The correlation between X and Y, which is obtained through the shared latent variable U_3 and its parameter υ_3, is in the range [0, 1].
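As a quick numerical illustration of construction (1), the following minimal sketch samples the Olkin-Liu pair from three gammas (NumPy assumed; the function name and parameter values are ours):

```python
import numpy as np

def sample_olkin_liu(v1, v2, v3, n, rng):
    """Sample (X, Y) from the Olkin-Liu construction: X = U1/(U1+U3),
    Y = U2/(U2+U3) with independent U_i ~ Gamma(v_i, 1)."""
    u1, u2, u3 = rng.gamma(v1, size=n), rng.gamma(v2, size=n), rng.gamma(v3, size=n)
    return u1 / (u1 + u3), u2 / (u2 + u3)

rng = np.random.default_rng(0)
x, y = sample_olkin_liu(2.0, 3.0, 1.0, 100_000, rng)
# Marginals: X ~ Beta(2, 1) with mean 2/3, Y ~ Beta(3, 1) with mean 3/4;
# the shared U3 induces a positive correlation.
```

Sample moments of `x` and `y` can be checked against the Beta(υ_1, υ_3) and Beta(υ_2, υ_3) marginals stated above.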
For high values of υ_3 the correlation tends to 0, whereas for low values of υ_3 it tends to 1. However, if υ_3 is high, the values of X and Y also tend to 0, and if υ_3 is low they tend to 1 accordingly. This behavior severely limits the usefulness of the distribution for most applications. A further limitation is the constraint that the marginal distributions share the same second parameter υ_3. Thus, the bivariate beta distribution proposed by Olkin and Liu does not allow for arbitrary beta marginals, which limits its flexibility in modeling probability forecasts. Arnold and Ng [14] proposed a more flexible construction for a bivariate beta distribution. They use five independent gamma-distributed random variables U_1, . . . , U_5 with shape parameters υ_1, . . . , υ_5 and scale parameter 1 to define two correlated random variables

X = (U_1 + U_3) / (U_1 + U_3 + U_4 + U_5),    Y = (U_2 + U_4) / (U_2 + U_4 + U_3 + U_5)    (2)

with marginal distributions Beta(υ_1 + υ_3, υ_4 + υ_5) for X and Beta(υ_2 + υ_4, υ_3 + υ_5) for Y. Compared to Olkin and Liu [10], this construction of a bivariate beta distribution can generate all correlations in the range [-1, 1] and marginal distributions with differing second parameters. Nevertheless, because of how the two marginals share parameters, not all combinations of parameters of the marginal beta distributions are possible. For example, the marginals Beta(10, 4) for X and Beta(1, 1) for Y cannot be obtained. Olkin and Trikalinos [18] base their construction of a bivariate beta distribution on the Dirichlet distribution. A vector U = (U_00, U_01, U_10, U_11) is drawn from a 4-dimensional Dirichlet distribution with parameters υ_1, . . . , υ_4. Using just three of its components, U_00, U_01, U_10, new random variables

X = U_00 + U_10    and    Y = U_01 + U_10    (3)

are constructed, with marginal distributions Beta(υ_1 + υ_3, υ_2 + υ_4) for X and Beta(υ_2 + υ_3, υ_1 + υ_4) for Y.
As Dirichlet-distributed random variables can also be constructed from gamma random variables, we can equivalently construct X and Y in Eq. (3) from four independent gamma-distributed random variables U_1, . . . , U_4 with shape parameters υ_1, . . . , υ_4 and equal scale parameter 1, with

X = (U_1 + U_3) / (U_1 + U_2 + U_3 + U_4),    Y = (U_2 + U_3) / (U_1 + U_2 + U_3 + U_4).    (4)

As can easily be seen from this construction, all correlations in the range [-1, 1] can be generated. In particular, the correlation tends to -1 if υ_3 and υ_4 tend to 0, in which case U_3 and U_4 become negligible compared to U_1 and U_2 and X ≈ U_1/(U_1 + U_2) ≈ 1 − Y. Similarly, the higher the values of υ_3 and υ_4 relative to υ_1 and υ_2, the more negligible U_1 and U_2 become, and the correlation increases towards 1 until X ≈ U_3/(U_3 + U_4) ≈ Y. Less obviously, a correlation of 0 is obtained in case υ_1 · υ_2 = υ_3 · υ_4 [18]. Still, this construction of a bivariate beta distribution does not allow arbitrary beta marginal distributions. Since all υ_i are constrained to be positive, for some combinations of marginal distributions the resulting system of linear equations for the parameters υ_i has no solution. For example, the two marginals Beta(2, 2) for X and Beta(1, 1) for Y cannot be generated, regardless of their correlation.
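The zero-correlation condition υ_1 · υ_2 = υ_3 · υ_4 can be checked by simulation. The sketch below (NumPy assumed; names ours) samples the gamma representation of Eq. (3):

```python
import numpy as np

def sample_olkin_trikalinos(v, n, rng):
    """Gamma representation of the Olkin-Trikalinos construction:
    X = (U1+U3)/S, Y = (U2+U3)/S with S = U1+U2+U3+U4, U_i ~ Gamma(v_i, 1)."""
    u = rng.gamma(np.asarray(v), size=(n, 4))
    s = u.sum(axis=1)
    return (u[:, 0] + u[:, 2]) / s, (u[:, 1] + u[:, 2]) / s

rng = np.random.default_rng(1)
# v1*v2 = 2*2 = 4 = 1*4 = v3*v4, so the correlation should be close to zero,
# while X ~ Beta(v1+v3, v2+v4) = Beta(3, 6) has mean 1/3.
x, y = sample_olkin_trikalinos([2.0, 2.0, 1.0, 4.0], 200_000, rng)
```

Setting υ_3, υ_4 near zero instead would drive the empirical correlation towards -1, as described above.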
Magnussen [5] introduced yet another construction based on six gamma variates. While all the constructions thus far constrain the parameters of the marginal beta distributions, this construction does allow for arbitrary beta marginals with positive correlation, thus providing the necessary flexibility to model probability forecasts. Magnussen's distribution is a special case of a more general 8-parameter bivariate beta distribution introduced by Arnold and Ng [14] and reviewed in Arnold and Ghosh [1], which even allows for positive and negative correlations. However, in many applications, it is enough to model positive correlations, for which the less complex 6-parameter distribution is sufficient. For example, if X and Y are probability estimates elicited from two skilled forecasters, we do not expect negative correlations. But we do want to allow for the possibility that their marginal forecasts have different distributions that should not be tied together by parameter constraints on the marginals. Hence, the bivariate beta distribution proposed by Magnussen [5], which can model arbitrary beta-distributed marginals with a positive correlation, is an appropriate distribution for modeling correlated probability forecasts.
However, just like any other distribution, the bivariate beta distribution can only be used if its parameters can be estimated correctly. While Magnussen [5] proposes a moment matching approach for fitting the distribution's parameters, this approach relies on a rough and sometimes inaccurate approximation for the covariance. Also, Magnussen [5] did not discuss the fact that very similar data can be generated with different parameter values, which makes it hard to statistically infer the parameters of the bivariate beta distribution from data without constraining the distribution.
Therefore, in this work we introduce an alternative approach for estimating this bivariate beta distribution's parameters. First, we will derive the full joint distribution, which is missing in the work of Magnussen [5], probably because it is intractable. We will then clarify the relationship between Magnussen's distribution and the Olkin-Liu distribution [10]. Using this relationship with the Olkin-Liu distribution we derive all product moments and in particular the exact covariance function (and in passing we correct a small mistake in the product moments from Olkin and Liu [10]). For parameter inference, we propose to match moments numerically using the exact covariance we derived. While other estimation methods such as Bayesian inference could be used [22], here we focus on moment matching due to its simplicity and efficiency. In order to make parameter inference unambiguous, we additionally show how to reasonably constrain the distribution's parameters. We evaluate the proposed parameter estimation method in a simulation study and demonstrate its practical use on a real data set consisting of predictions from two correlated forecasters. We discuss the relationship between the distribution's parameters and the correlation and then extend the bivariate beta distribution to a correlated Dirichlet distribution, for which the proposed parameter estimation method can be used analogously.
The remainder of the paper is structured as follows. In Sect. 2 we discuss the bivariate beta distribution with arbitrary beta marginals, including its joint distribution in Sect. 2.1, its moments in Sect. 2.2, correlation and covariance in Sect. 2.3, and parameter inference in Sect. 2.4. Section 3 shows how to generalize the bivariate beta distribution to a correlated Dirichlet distribution.

A bivariate beta distribution with arbitrary beta marginals
Magnussen [5] uses six independent gamma-distributed random variables

A_1 ~ Gamma(α_1, 1), A_2 ~ Gamma(α_2, 1), B_1 ~ Gamma(β_1, 1), B_2 ~ Gamma(β_2, 1), D_1 ~ Gamma(δ_1, 1), D_2 ~ Gamma(δ_2, 1)    (5)

to construct two bivariate-beta-distributed random variables

X = (A_1 + D_1) / (A_1 + A_2 + D_1 + D_2),    Y = (B_1 + D_1) / (B_1 + B_2 + D_1 + D_2).    (6)

The resulting marginal distributions of X and Y are Beta(a_1, a_2) and Beta(b_1, b_2) with

a_1 = α_1 + δ_1,  a_2 = α_2 + δ_2,  b_1 = β_1 + δ_1,  b_2 = β_2 + δ_2.    (7)

The marginals follow immediately from the definition because the sum of gamma random variables with the same scale is gamma-distributed with the same scale and the original shape parameters summed. In contrast to the other constructions discussed above [10,14,18], this construction allows for arbitrary marginal distributions. In particular, when δ_1 and δ_2 tend to zero, we can model arbitrary independent marginal distributions Beta(α_1, α_2) and Beta(β_1, β_2). Since all parameters α_1, α_2, β_1, β_2, δ_1, δ_2 need to be positive by definition, for fixed marginal distributions Beta(a_1, a_2) for X and Beta(b_1, b_2) for Y it must hold that δ_1 < δ_1^max = min(a_1, b_1) and δ_2 < δ_2^max = min(a_2, b_2). Therefore, for most marginal distributions the maximum correlation that can be generated is below 1. The greater the difference between the two marginal distributions, the lower the possible maximum correlation. A perfect correlation approaching 1 can, of course, only be generated for equal marginal distributions, i.e. if a_1 = b_1 and a_2 = b_2 and α_1, α_2, β_1, and β_2 tend to 0, as also noted by Magnussen [5]. Note that this limitation also applies to other bivariate distributions that do not allow for arbitrary marginal beta distributions [e.g. 18].
The construction of this bivariate beta distribution can also be seen as a pairwise combination of three beta distributions. First transform the six independent gamma-distributed random variables (5) into three independent gamma- and three independent beta-distributed random variables, with

U_1 = A_1 + A_2,  U_2 = B_1 + B_2,  U_3 = D_1 + D_2,  W_1 = A_1 / (A_1 + A_2),  W_2 = B_1 / (B_1 + B_2),  W_3 = D_1 / (D_1 + D_2),    (8)

so that W_1 ~ Beta(α_1, α_2), W_2 ~ Beta(β_1, β_2), W_3 ~ Beta(δ_1, δ_2) and U_i ~ Gamma(υ_i, 1) with

υ_1 = α_1 + α_2,  υ_2 = β_1 + β_2,  υ_3 = δ_1 + δ_2.    (9)

With these definitions we can then rewrite construction (6) as

X = X′ W_1 + (1 − X′) W_3,    Y = Y′ W_2 + (1 − Y′) W_3,    (10)

where X′ and Y′ are defined as in (1) but with υ_1, υ_2, and υ_3 as in (9). Furthermore, X′ and Y′ are independent of W_1, W_2, W_3. If the parameters δ_1 and δ_2, and with them U_3, tend to 0, then X ≈ W_1 and Y ≈ W_2 are independent with marginal distributions Beta(α_1, α_2) for X and Beta(β_1, β_2) for Y. Mixing in the shared component W_3 by increasing the values of the parameters δ_1 and δ_2 increases the correlation between X and Y. If U_1 and U_2 are negligible compared to U_3 because δ_1 and δ_2 dominate the parameters, the correlation will be close to 1 with X ≈ W_3 ≈ Y, and hence X and Y have the same marginal distribution Beta(δ_1, δ_2), as mentioned before.
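The equivalence of the direct six-gamma construction and the pairwise-combination form is pure algebra and can be verified by simulation. The sketch below (NumPy assumed; variable names ours) builds X and Y both ways from the same gamma draws:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
alpha1, alpha2, beta1, beta2, delta1, delta2 = 1.5, 2.0, 0.8, 1.2, 1.0, 0.5

A1, A2 = rng.gamma(alpha1, size=n), rng.gamma(alpha2, size=n)
B1, B2 = rng.gamma(beta1, size=n), rng.gamma(beta2, size=n)
D1, D2 = rng.gamma(delta1, size=n), rng.gamma(delta2, size=n)

# Direct six-gamma construction: marginals Beta(alpha1+delta1, alpha2+delta2)
# and Beta(beta1+delta1, beta2+delta2)
X = (A1 + D1) / (A1 + A2 + D1 + D2)
Y = (B1 + D1) / (B1 + B2 + D1 + D2)

# Pairwise combination of three beta variates W1, W2, W3, weighted by the
# Olkin-Liu pair (X', Y')
U1, U2, U3 = A1 + A2, B1 + B2, D1 + D2
W1, W2, W3 = A1 / U1, B1 / U2, D1 / U3
Xp, Yp = U1 / (U1 + U3), U2 / (U2 + U3)
X_alt = Xp * W1 + (1 - Xp) * W3
Y_alt = Yp * W2 + (1 - Yp) * W3
```

Here `X` and `X_alt` agree sample by sample, and the shared component D_1, D_2 makes the empirical correlation positive.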

Joint distribution
X′ and Y′ in (10) are linear transformations of X and Y. Given W_1, W_2, W_3, it is easy to recover X′ and Y′ from observed X and Y:

X′ = (X − W_3) / (W_1 − W_3),    Y′ = (Y − W_3) / (W_2 − W_3).    (11)

Note that we can ignore the sign because, according to (10), X always lies between W_1 and W_3 and Y between W_2 and W_3, so that the numerator and denominator always have the same sign.
X′ and Y′ jointly follow the Olkin-Liu distribution [10] with density

f(x′, y′) = [Γ(υ_1 + υ_2 + υ_3) / (Γ(υ_1) Γ(υ_2) Γ(υ_3))] · x′^(υ_1 − 1) y′^(υ_2 − 1) (1 − x′)^(υ_2 + υ_3 − 1) (1 − y′)^(υ_1 + υ_3 − 1) / (1 − x′ y′)^(υ_1 + υ_2 + υ_3),    (12)

so that, given W_1 = w_1, W_2 = w_2, W_3 = w_3, the joint density of X and Y follows from (12) through the change of variables (11), with x between w_1 and w_3 and y between w_2 and w_3 according to (10). We have not been able to integrate out w_1, w_2, w_3 from their joint distribution with x and y. However, we suspect that even if the joint density of X and Y could be expressed in terms of special functions, computing those might not be efficient enough for parameter inference, for which we will resort to moment matching. Example plots with smoothed samples of the joint density are shown in Fig. 1 for several parameter settings, showing different marginal distributions for X and Y and different correlations between X and Y. Sampling from the bivariate beta distribution is realized with JAGS [23].

Moments
As the marginal distributions of X and Y are beta distributions, their moments are readily available in closed form. Computation of the product moments E(X^k Y^l) is more challenging but can be realized with the help of the work of Olkin and Liu [10]. Looking at construction (10), X and Y are linear combinations of the independent beta-distributed random variables W_1, W_2, W_3 with weights given by X′ and Y′. Thus, we can express the product moments as

E(X^k Y^l) = E[(X′ W_1 + (1 − X′) W_3)^k (Y′ W_2 + (1 − Y′) W_3)^l].    (14)

Since X′ and Y′ are independent of W_1, W_2, W_3, expanding (14) shows that the expectation can be computed if the moments of W_1, W_2, W_3, X′, Y′ and the product moments of X′ and Y′ are known. W_1, W_2, W_3 as well as the marginals of X′ and Y′ are beta-distributed, so their moments can be computed straightforwardly in closed form. Furthermore, X′ and Y′ are jointly Olkin-Liu distributed according to (12), and Olkin and Liu [10] have shown how to compute their product moments. However, note that the derivation of E(X′^k Y′^l) in equation (2.2) in the work of Olkin and Liu [10] is incorrect and should read

E(X′^k Y′^l) = [Γ(ϒ) Γ(υ_1 + k) Γ(υ_2 + l) Γ(υ_1 + υ_3) Γ(υ_2 + υ_3)] / [Γ(υ_1) Γ(υ_2) Γ(υ_3) Γ(ϒ + k) Γ(ϒ + l)] · 3F2(ϒ, υ_1 + k, υ_2 + l; ϒ + k, ϒ + l; 1),    (17)

where ϒ = υ_1 + υ_2 + υ_3 and pFq denotes the generalized hypergeometric function. The remaining quantities are taken directly from Olkin and Liu [10] with a = υ_1, b = υ_2, c = υ_3.
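The corrected product moment formula can be evaluated with standard libraries. The sketch below (mpmath assumed; the function name is ours) implements it directly:

```python
from mpmath import gamma as G, hyper

def olkin_liu_moment(k, l, v1, v2, v3):
    """Product moment E(X'^k Y'^l) of the Olkin-Liu distribution
    (the corrected version of Olkin and Liu's Eq. (2.2))."""
    U = v1 + v2 + v3  # capital upsilon
    pref = (G(U) * G(v1 + k) * G(v2 + l) * G(v1 + v3) * G(v2 + v3)
            / (G(v1) * G(v2) * G(v3) * G(U + k) * G(U + l)))
    return pref * hyper([U, v1 + k, v2 + l], [U + k, U + l], 1)
```

A convenient sanity check: for k = 1, l = 0 the 3F2 collapses and the formula must reduce to the Beta(υ_1, υ_3) mean υ_1/(υ_1 + υ_3), and the mixed moment E(X′Y′) must exceed E(X′)E(Y′) because X′ and Y′ are positively correlated.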

Correlation and covariance
The correlation r between X and Y is

r = Cov(X, Y) / √(Var(X) Var(Y)),    (20)

with the known variances of the beta marginals

Var(X) = a_1 a_2 / ((a_1 + a_2)^2 (a_1 + a_2 + 1)),    Var(Y) = b_1 b_2 / ((b_1 + b_2)^2 (b_1 + b_2 + 1)).    (21, 22)

For the covariance of X and Y, Magnussen [5] gives an approximate solution in terms of a_1, a_2, b_1, and b_2, which are defined based on α_1, α_2, β_1, β_2, δ_1, δ_2 as in (7). This approximation is inaccurate for small values of these parameters: e.g. for a_1 = a_2 = b_1 = b_2 = 4, δ_1 = δ_2 = 3, the approximated covariance is Cov(X, Y) = 0.024, while the true covariance computed from 10^6 samples of the bivariate beta distribution is Cov(X, Y) = 0.020. This might seem like a small difference, but it results in an overestimated correlation of r = 0.854 as opposed to the true correlation of r = 0.730. Even more worryingly, for a_1 = a_2 = b_1 = b_2 = 1, δ_1 = δ_2 = 4/5, the approximated covariance is Cov(X, Y) = 1/9, which results in an estimated correlation of r = 4/3, which is greater than 1 and therefore wrong by definition.
Given the connection to the Olkin-Liu distribution [10] derived in Sect. 2.2, we therefore proceed to compute the exact covariance between X and Y:

Cov(X, Y) = E(XY) − E(X) E(Y),    (24)

where

E(X) = a_1 / (a_1 + a_2)  and  E(Y) = b_1 / (b_1 + b_2)    (25, 26)

are readily available as the means of the beta marginals. We can compute E(XY) from (14) with k = l = 1, which results in

E(XY) = E(X′Y′) E(W_1) E(W_2) + E(X′(1 − Y′)) E(W_1) E(W_3) + E((1 − X′)Y′) E(W_2) E(W_3) + E((1 − X′)(1 − Y′)) E(W_3^2),    (27)

with the moments of the beta marginals from (8) and (10) and with υ_1, υ_2, υ_3 as defined in (9). E(X′Y′) can be specialized from the moments (17) above with k = l = 1:

E(X′Y′) = [υ_1 υ_2 Γ(υ_1 + υ_3) Γ(υ_2 + υ_3)] / [ϒ^2 Γ(ϒ) Γ(υ_3)] · 3F2(ϒ, υ_1 + 1, υ_2 + 1; ϒ + 1, ϒ + 1; 1),    (29)

with ϒ = υ_1 + υ_2 + υ_3 as before. According to Olkin and Liu [10] there is no closed-form solution for (17) and (29). However, using the generalized hypergeometric function, the product moments and thus the covariance between X and Y can be computed numerically (e.g. by using the hyper function in sympy or the HypergeometricPFQ function in Mathematica).
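Putting these pieces together, the exact correlation can be computed numerically. The sketch below (mpmath and NumPy assumed; function and variable names ours) parameterizes the distribution by the marginal parameters and δ_1, δ_2, and reproduces the exact correlation r ≈ 0.730 for the example a_1 = a_2 = b_1 = b_2 = 4, δ_1 = δ_2 = 3 discussed above:

```python
import numpy as np
from mpmath import gamma as G, hyper

def exact_corr(a1, a2, b1, b2, d1, d2):
    """Exact correlation of the bivariate beta distribution via the
    Olkin-Liu product moment E(X'Y') and Eqs. (24)-(29)."""
    al1, al2 = a1 - d1, a2 - d2          # alpha_1, alpha_2
    be1, be2 = b1 - d1, b2 - d2          # beta_1,  beta_2
    v1, v2, v3 = al1 + al2, be1 + be2, d1 + d2
    U = v1 + v2 + v3                     # capital upsilon
    EXp, EYp = v1 / (v1 + v3), v2 / (v2 + v3)     # E(X'), E(Y')
    EXpYp = float(v1 * v2 * G(v1 + v3) * G(v2 + v3)
                  / (U**2 * G(U) * G(v3))
                  * hyper([U, v1 + 1, v2 + 1], [U + 1, U + 1], 1))
    # beta moments of the three mixture components W1, W2, W3
    EW1, EW2, EW3 = al1 / (al1 + al2), be1 / (be1 + be2), d1 / (d1 + d2)
    EW3sq = d1 * (d1 + 1) / ((d1 + d2) * (d1 + d2 + 1))
    # E(XY) as in (27), using E(X'(1-Y')) = E(X') - E(X'Y') etc.
    EXY = (EXpYp * EW1 * EW2
           + (EXp - EXpYp) * EW1 * EW3
           + (EYp - EXpYp) * EW2 * EW3
           + (1 - EXp - EYp + EXpYp) * EW3sq)
    cov = EXY - (a1 / (a1 + a2)) * (b1 / (b1 + b2))
    var_x = a1 * a2 / ((a1 + a2) ** 2 * (a1 + a2 + 1))
    var_y = b1 * b2 / ((b1 + b2) ** 2 * (b1 + b2 + 1))
    return cov / np.sqrt(var_x * var_y)
```

For the first example from Sect. 2.3, this yields a correlation of about 0.730, in contrast to Magnussen's approximation of 0.854.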
Note that with Raabe's test one can show that the generalized hypergeometric function 3F2 in (29) converges. According to Raabe's test, the series converges if

lim_{j→∞} ρ_j > 1  with  ρ_j = j (c_j / c_{j+1} − 1),    (30)

where c_j is the j-th element of the series described by the generalized hypergeometric function 3F2 in (29), i.e.

c_j = [(ϒ)_j (υ_1 + 1)_j (υ_2 + 1)_j] / [(ϒ + 1)_j (ϒ + 1)_j j!],    (31)

with (·)_j denoting the rising factorial. Since lim_{j→∞} ρ_j = 1 + υ_3 > 1, the generalized hypergeometric function 3F2 in (29) converges. However, this analysis also suggests that convergence of the series might be very slow for small υ_3 = δ_1 + δ_2.
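The limit of Raabe's ratio is easy to verify numerically (a small sketch, assuming the series coefficients of the 3F2 in (29); the helper name is ours):

```python
def raabe_rho(j, v1, v2, v3):
    """Raabe ratio rho_j = j * (c_j / c_{j+1} - 1) for the 3F2 series in (29)."""
    U = v1 + v2 + v3  # capital upsilon
    ratio = ((U + 1 + j) ** 2 * (j + 1)) / ((U + j) * (v1 + 1 + j) * (v2 + 1 + j))
    return j * (ratio - 1)

# rho_j approaches 1 + v3, so convergence of the series is expected to be
# slow when v3 = delta_1 + delta_2 is small (the limit barely exceeds 1).
```

Evaluating `raabe_rho` for large j confirms the limit 1 + υ_3 for both large and small υ_3.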

Parameter inference
Magnussen [5] used the method of moments to infer the parameters a_1, a_2, b_1, b_2 of the marginal distributions. Given these parameters, he then matched the empirical correlation to the theoretical correlation as a function of δ_1, δ_2 (given marginal parameters a_1, a_2, b_1, b_2), using his approximate solution for the covariance. However, as shown in Sect. 2.3, this approximation can lead to very inaccurate correlation estimates. An additional problem with the distribution is that very similar data can be generated with different parameter values, as one can see in Fig. 2(a) and (b). Increasing δ_1 and simultaneously decreasing δ_2 (or vice versa) while keeping the marginal parameters a_1, a_2 and b_1, b_2 fixed can result in very similar correlations, as shown in Fig. 2(c). Since two distributions with very different parameters can lead to extremely similar data, it is hard to statistically infer the parameters δ_1 and δ_2 from data: the empirical correlation alone does not provide enough constraints, and differences in higher moments can be subtle. Therefore, in order to make parameter inference unambiguous, we decided to constrain the 6-parameter bivariate beta distribution to five parameters: two for each marginal and one parameter to control the correlation. A reasonable way to constrain δ_1 and δ_2 is to set

δ_1 / δ_1^max = δ_2 / δ_2^max    (32)

with δ_1^max = min(a_1, b_1), δ_2^max = min(a_2, b_2), because this enables the maximum possible correlation between X and Y when the maximum values for δ_1 and δ_2 are attained and the shared component between X and Y is as big as it can be without violating the marginal constraints.

Moment matching
Using constraint (32), the model-inherent constraints δ_1 < δ_1^max and δ_2 < δ_2^max, and the formula for the correlation derived in Sect. 2.3, we can now optimize the parameters numerically to match the empirical moments. First, the marginal parameters a_1, a_2, b_1, b_2 are obtained using the standard procedure of moment matching for the beta distribution. Given the estimated marginal parameters, an estimate of δ_1 (and with it δ_2) can be obtained numerically by minimizing the quadratic deviation between the theoretical correlation and the empirical correlation. To avoid the undefined cases δ_1 ≤ 0 and δ_1 ≥ δ_1^max, we bound the optimization between ε and δ_1^max − ε with ε = 0.001. Unless the empirical correlation is bigger than the maximum correlation that can be attained with the matched marginals, or smaller than 0, it can be matched exactly for some δ_1. Otherwise δ_1 will take on its maximum value δ_1^max − ε or its minimum value ε. We implement inference in Python using the package mpmath [24] for the numerical computation of the generalized hypergeometric function and the package scipy [25] for optimization.
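The procedure can be sketched as follows (a minimal sketch assuming NumPy and SciPy; function names are ours, and `corr_fn` stands for an implementation of the exact correlation from Sect. 2.3):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_beta_mm(x):
    """Method-of-moments estimates for a Beta(a, b) marginal."""
    m, v = x.mean(), x.var()
    c = m * (1 - m) / v - 1          # c estimates a + b
    return m * c, (1 - m) * c

def match_delta1(corr_fn, r_hat, a1, a2, b1, b2, eps=1e-3):
    """Estimate delta_1 (and delta_2 via constraint (32)) by matching the
    empirical correlation r_hat. corr_fn(a1, a2, b1, b2, d1, d2) must return
    the theoretical correlation, e.g. computed with mpmath (Sect. 2.3)."""
    d1max, d2max = min(a1, b1), min(a2, b2)

    def loss(d1):
        d2 = d1 / d1max * d2max      # constraint (32)
        return (corr_fn(a1, a2, b1, b2, d1, d2) - r_hat) ** 2

    res = minimize_scalar(loss, bounds=(eps, d1max - eps), method="bounded")
    return res.x, res.x / d1max * d2max
```

`fit_beta_mm` is applied to each marginal sample separately; the resulting estimates then parameterize `corr_fn` for the one-dimensional bounded search over δ_1.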
As an example, we used this numerical moment matching approach on 5000 data points generated with parameters a_1 = 8, a_2 = 6, b_1 = 6, b_2 = 5, δ_1 = 3.28, δ_2 = 2.73, equivalent to the data shown in Fig. 2(a). We inferred the parameter values â_1 = 8.143, â_2 = 6.193, b̂_1 = 5.882, b̂_2 = 4.931, δ̂_1 = 3.286, δ̂_2 = 2.754. Figure 3(a) shows the correlations implied by different values of parameter δ_1 compared to the desired correlation of r̂ = 0.47. As one can see, the difference between r and r̂ is zero for the inferred δ̂_1 = 3.286. Due to our constraint (32), δ̂_2 = 2.754 can be computed from δ̂_1. In this case there is an almost linear relationship between δ_1 and r, but this is not true in general, especially for smaller parameter values.
This can be seen in a second example, where we applied our moment matching approach to 5000 data points generated with parameters a_1 = 0.2, a_2 = 0.9, b_1 = 0.4, b_2 = 1, δ_1 = 0.1, δ_2 = 0.45 and obtained the inferred parameter values â_1 = 0.197, â_2 = 0.903, b̂_1 = 0.403, b̂_2 = 1.017, δ̂_1 = 0.101, δ̂_2 = 0.462. As seen in Fig. 3(b), for this second example the relationship between δ_1 and the correlation is not well approximated by a linear function, in contrast to the first example in Fig. 3(a). Still, inference works as for the first example shown above.

Fig. 3 Correlation r implied by different values of δ_1 for fixed marginal parameters a_1, a_2, b_1, b_2. δ_2 is not displayed since it can be computed from δ_1 using constraint (32). r_max is the maximum correlation that can be reached for the given marginal parameters. (a) shows the first inference example with a_1 = 8.143, a_2 = 6.193, b_1 = 5.882, b_2 = 4.931, r̂ = 0.47, and r_max = 0.864. The difference between r and r̂ is zero for δ_1 = 3.286. (b) shows the second inference example with a_1 = 0.197, a_2 = 0.903, b_1 = 0.403, b_2 = 1.017, r̂ = 0.314, and r_max = 0.707. The difference between r and r̂ is zero for δ_1 = 0.101. The dotted linear reference line additionally shows that the relationship between the correlation and δ_1 is non-linear

Simulation study
The performance of the proposed approach for parameter inference was evaluated in a simulation study. We generated data from a bivariate beta distribution using different marginal parameters, correlation parameters, and numbers of generated samples N, and inferred δ_1 from these data using the proposed moment matching approach. For half of all considered simulations, X and Y were chosen to have the same marginal parameters, i.e. a_1 = b_1 and a_2 = b_2, to be able to generate the full range of correlations from 0 to 1. All combinations of the values in [0.5, 1, 2, 3, 4, 5] for a_1 and a_2 with a_1 ≤ a_2 were tested, omitting the symmetric duplicates, which results in 21 different marginal distributions. For the remaining simulations we considered differing marginal distributions. We chose a subset of 7 marginal distributions from the set of marginals above, namely [[0.5, 0.5], [1, 1], [1, 4], [2, 2], [2, 4], [3, 3], [4, 5]], and tested inference for all (7 choose 2) = 21 combinations of different marginals in this set. Thus, in total we inferred parameters for 42 combinations of marginals. The correlation parameters δ_1 were chosen as p · δ_1^max with p = 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, respectively, in order to show how inference works for data with different correlations between 0 and the maximum possible correlation for the respective marginals. δ_2 was obtained using constraint (32). To test the effect of the number of samples N on the accuracy of parameter inference, we evaluated with N = 100, 500, 1000. For each of the 42 · 7 · 3 = 882 parameter settings, we repeated the data simulation and inference process 50 times, resulting in 44100 inference results. These results are displayed in Fig. 4, for N = 100, 500, 1000 in (a), (b), (c), respectively. We can see that the inferred δ̂_1 match the true value of δ_1 the better, the higher the number of available data samples N.
The average standard deviations of all inferred δ̂_1 are 0.171 for N = 100, 0.078 for N = 500, and 0.055 for N = 1000. Note that although the generalized hypergeometric function 3F2 in (29) is guaranteed to converge, as we showed in Sect. 2.3, it can happen that its numerical computation fails due to very slow convergence. In our simulation study, this error occurred for 2.5% of all inference computations, and only for low values of δ_1.

Fig. 4 Results of a simulation study for evaluating the proposed moment matching approach for parameter inference. We generated data from a bivariate beta distribution using different marginal distributions with different correlations and different numbers of generated samples N and inferred δ_1 from these data. We show the inferred δ̂_1 against the true δ_1 for N = 100 samples (a), N = 500 samples (b), and N = 1000 samples (c). δ̂_1 matches δ_1 the better, the higher the number of available data samples N

Application on a real data set of correlated forecasts
The bivariate beta distribution has broad applicability in fields as diverse as Bayesian analysis, where it can model the correlation among priors for binomial distributions [14], the modeling of proportions of hardwood forests over time, where it serves to estimate decadal changes in the relative land use of a region [5], the modeling of proportions of the electorate voting in a two-candidate election, proportions of substances in mixtures, or brand shares [16], and utility assessment [6]. Furthermore, the bivariate beta distribution can be used for modeling probabilities produced by two correlated forecasters. Correlations between forecasters are quite common, e.g. two bookmakers who base their odds on common information will produce correlated odds. For the same reason, experts in risk assessment will often produce correlated forecasts. Similarly, different machine classifiers produce correlated predictions when trained on the same data [26]. These correlations should be taken into account when their predictions are combined, e.g. in different techniques for classifier fusion [27], since combining correlated classifiers can otherwise lead to overconfidence and high generalization error [28]. Here, we use such a data set consisting of the predictions of two classifiers as an illustrative example for the application of the proposed inference method. Two classifiers, a Bayes Net and a Random Forest, were trained on Alen Shapiro's chess (King-Rook vs. King-Pawn) data set [29,30]. The task is to predict whether King+Pawn will win a chess match against King+Rook based on 36 categorical features of a chess position. For training both classifiers, we used 10-fold cross-validation. The two classifiers' predictions on the respective 10 test sets form the data set we evaluate on. We only considered the predicted probabilities of a King+Pawn win for all 1527 match instances actually won by King+Pawn.
The predicted probabilities of a King+Rook win for the matches actually won by King+Rook might follow a different distribution and are therefore excluded in our example. With the bivariate beta distribution we now model the winning probabilities the two classifiers predicted for King+Pawn. Figure 5 shows the data and the inferred distribution. X is the predicted probability of King+Pawn winning according to classifier 1, a Bayes Net, and Y the predicted probability according to classifier 2, a Random Forest. In Fig. 5(a) and (b) we show histograms of the predictions X and Y together with the marginal beta densities that were inferred with moment matching: the parameters are â_1 = 2.094, â_2 = 0.64 for (a) and b̂_1 = 4.44, b̂_2 = 0.288 for (b). Figure 5(c) jointly shows X and Y with an empirical correlation of approximately r̂ = 0.483. Matching this correlation, too, as described in Sect. 2.4.1, we obtain δ̂_1 = 1.723 and δ̂_2 = 0.237. The corresponding correlation is r = 0.483, which matches the data's empirical correlation up to numerical precision. Figure 5(d) shows a simulated data set consisting of 1527 samples drawn from a bivariate beta distribution with the inferred parameters â_1 = 2.094, â_2 = 0.64, b̂_1 = 4.44, b̂_2 = 0.288, δ̂_1 = 1.723, and δ̂_2 = 0.237. As can be seen, the generated data set is similar to the real data set in Fig. 5(c), and the classifiers' predictions can thus be modeled reasonably well with this bivariate beta distribution.

Relationship between δ_1 and the correlation
We found empirically that over a large range of parameters the following approximate relationship holds:

r ≈ (δ_1 / δ_1^max) · r_max,    (33)

where δ_1^max = min(a_1, b_1) and r_max is the maximum possible correlation for the given marginals. While this relationship could be used for approximate inference of δ_1, we still recommend the moment matching approach proposed in Sect. 2.4.1, which gives exact results. However, the relationship in (33) allows interpreting the correlation parameter δ_1. Particularly for equal marginal distributions with a_1 = b_1 and a_2 = b_2, for which r_max = 1, this interpretation of δ_1 is very simple: the fraction δ_1 / δ_1^max approximately matches the generated correlation. For example, if a_1 = b_1 = 2 and a_2 = b_2 = 4, for δ_1 = 1 we generate a correlation of r = 0.468 ≈ (1/2) · r_max = 0.5 with r_max = 1. For differing marginal distributions, interpreting δ_1 is more difficult because r_max must be computed numerically using the formulas given above. If, e.g., a_1 = a_2 = 2 and b_1 = 1, b_2 = 4, with δ_1 = 0.5 we generate a correlation of 0.3 ≈ (0.5/1) · r_max = 0.313 with r_max = 0.627. In Fig. 6, we plot the exact correlation and the approximate correlation computed with (33) for all parameter configurations used in the simulation study in Sect. 2.4.2. As can be seen, the relationship in (33) holds well across these configurations.
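Rule (33) can be wrapped in a one-liner for quick interpretation of δ_1 (the function name is ours; note that r_max must still be computed numerically for unequal marginals):

```python
def approx_corr(d1, a1, b1, r_max):
    """Empirical rule (33): r is approximately d1 / delta1_max * r_max,
    with delta1_max = min(a1, b1)."""
    return d1 / min(a1, b1) * r_max

# Examples from the text: equal marginals a1 = b1 = 2 (r_max = 1), and the
# unequal case a1 = 2, b1 = 1 with numerically computed r_max = 0.627.
```

For the two worked examples above, this yields 0.5 (vs. the exact 0.468) and 0.3135 (vs. the exact 0.3), matching the values given in the text.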

Generalization to the correlated Dirichlet distribution
The bivariate beta distribution can be generalized to a correlated Dirichlet distribution [27] in order to model two positively correlated random vectors X = (X_1, . . . , X_k) and Y = (Y_1, . . . , Y_k) whose marginal vectors are Dirichlet-distributed. A k-dimensional correlated Dirichlet distribution can be constructed from 3k gamma-distributed random variables A_1, . . . , A_k, B_1, . . . , B_k, D_1, . . . , D_k with 3k parameters α_1, . . . , α_k, β_1, . . . , β_k, δ_1, . . . , δ_k, distributed according to

A_i ~ Gamma(α_i, 1),  B_i ~ Gamma(β_i, 1),  D_i ~ Gamma(δ_i, 1),  i = 1, . . . , k.    (34)

These random variables are used to construct the correlated Dirichlet-distributed random vectors X = (X_1, . . . , X_k) and Y = (Y_1, . . . , Y_k) with

X_i = (A_i + D_i) / Σ_{j=1}^{k} (A_j + D_j),    Y_i = (B_i + D_i) / Σ_{j=1}^{k} (B_j + D_j).    (35)

The two resulting marginal distributions are Dirichlet(X; α_1 + δ_1, . . . , α_k + δ_k) and Dirichlet(Y; β_1 + δ_1, . . . , β_k + δ_k). Analogous to the example for the bivariate beta distribution in Sect. 2.4.3, this correlated Dirichlet distribution can be used for modeling non-binary probabilistic predictions of experts, sensors, or classifiers. This is particularly useful for Bayesian approaches to classifier or expert fusion, which are the reason why we started working on this distribution in the first place. For example, in a companion paper we apply it to classifier fusion, but instead of using moment matching, as developed here, we use rather inefficient Markov chain methods to sample from the posterior distribution over the parameters [27]. Being able to explicitly model the correlation between probabilistic classifiers or probability estimates given by human experts with the correlated Dirichlet distribution allows Bayes-optimal fusion of classifiers or experts, avoids overconfidence of the ensemble, and thereby improves its performance. Applications of classifier fusion are widespread; popular examples are intrusion detection, fake news detection, detection of diseases in medicine, or recognition of human states such as emotions [31].
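A sampler for the construction just described is straightforward (a sketch assuming NumPy; the function name is ours):

```python
import numpy as np

def sample_correlated_dirichlet(alpha, beta, delta, n, rng):
    """Draw n pairs of correlated Dirichlet vectors (X, Y) from 3k gammas:
    X_i = (A_i + D_i) / sum_j (A_j + D_j), analogously for Y with B_i."""
    alpha, beta, delta = map(np.asarray, (alpha, beta, delta))
    A = rng.gamma(alpha, size=(n, alpha.size))
    B = rng.gamma(beta, size=(n, beta.size))
    D = rng.gamma(delta, size=(n, delta.size))
    X = (A + D) / (A + D).sum(axis=1, keepdims=True)
    Y = (B + D) / (B + D).sum(axis=1, keepdims=True)
    return X, Y

rng = np.random.default_rng(5)
# Marginals: X ~ Dirichlet(alpha + delta), Y ~ Dirichlet(beta + delta);
# the shared D_i induce positive correlation between matching components.
X, Y = sample_correlated_dirichlet([1.0, 2.0], [2.0, 1.0], [1.0, 1.0], 100_000, rng)
```

Each row of `X` and `Y` sums to one, and for the parameters above E(X_1) = (α_1 + δ_1)/Σ_j(α_j + δ_j) = 2/5.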
Another application of the correlated Dirichlet distribution is the generation of stochastic matrices with individual rows or columns being Dirichlet-distributed and correlated, which can be beneficial for Markov processes, optimal control, or reinforcement learning.
The derivations of the product moments and the exact covariance of the correlated Dirichlet distribution are analogous to the derivations for the bivariate beta distribution shown in this work. Thus, the parameters of the correlated Dirichlet distribution can also be estimated using the proposed moment matching approach, extended to the higher dimensionality of the Dirichlet distribution.